What are the best ways to visualize raw reads and RPKMs? Not just reporting the stats... thanks!

We use IGV to view wiggle files. IGV can visualize SAM/BAM files as well, but for RNA-Seq there is often too much read depth for that to be useful. We even see some crashes when visualizing BAM files because of insufficient memory (even when giving IGV all 24 GB of RAM). IGV is a free download and runs on all major platforms.
Visualizing RPKMs is a bit trickier. You can plot them in a bar graph, where each bar is a different gene. It depends somewhat on what you want to say about them.
In my experience RPKM is a poor way of quantifying transcripts. It may work better for genetic experiments, but I doubt it. Library preparation introduces some positional bias, which can bias RPKM for or against certain genes; this confounds comparisons within a sample. When comparing between samples, the length normalization doesn't actually help (it cancels out in most between-sample comparisons). Normalizing by the total read count is a global normalization that can be replaced with quantile normalization (as in Bullard et al., http://www.biomedcentral.com/1471-2105/11/94). This has the added advantage of stabilizing the variance to an exactly hypergeometric distribution, so a Fisher exact test or chi-square test results in very good estimation of significant differential expression.
That's the long way of saying: I wouldn't (and don't) use RPKMs.
Beyond that, how you visualize depends some on what you're looking for.
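The quantile-normalization-plus-Fisher-test idea above can be sketched as follows. This is a minimal illustration, not Bullard et al.'s code: the count matrix is a made-up toy example and `quantile_normalize` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical toy count matrix: rows = genes, columns = two samples.
counts = np.array([[100.0, 80.0],
                   [ 10.0, 40.0],
                   [ 55.0, 60.0],
                   [  5.0,  4.0]])

def quantile_normalize(mat):
    """Give every column the same empirical distribution
    (the mean of the per-column sorted values)."""
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)
    mean_sorted = np.sort(mat, axis=0).mean(axis=1)
    return mean_sorted[ranks]

norm = quantile_normalize(counts)

# Per-gene 2x2 Fisher exact test: reads on this gene vs. all other reads,
# sample 1 vs. sample 2 (values rounded back to integers for the test).
totals = norm.sum(axis=0)
for g in range(norm.shape[0]):
    table = np.round([[norm[g, 0], totals[0] - norm[g, 0]],
                      [norm[g, 1], totals[1] - norm[g, 1]]]).astype(int)
    _, p = fisher_exact(table)
    print(f"gene {g}: p = {p:.3g}")
```

After normalization, both columns share exactly the same set of values, so any remaining per-gene difference is a difference in rank, which is what the 2x2 test then assesses.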

Our software, SeqMonk, is designed to view very large sequence datasets efficiently on modest hardware. It takes in raw mapped sequence data rather than wig files, but allows you to quantitate your data within the program and view both the quantitated and raw data.
SeqMonk is free software and works on pretty much all platforms.
@ raw reads:
GBrowse 2.03 and SAMtools work fine on my computer (only a desktop machine). It depends a bit on the settings you choose... You may only run into problems if you want to view the individual read alignments over a larger stretch of the genome.
@ expression values (what RPKM is in the end):
There are several ways to display those, always depending on what you would like to show. You might get some ideas from microarray studies (as some of the statistics are not too different).
To get a first overview I normally plot histograms of all samples, scatterplots with loess (sample X vs. sample Y plus a smoother line), and sorted scatterplots (sorted(sampleX) vs. sorted(sampleY); systematic differences are normally quite visible there). These plots normally help to find a proper transformation and normalisation.
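Those three overview plots can be sketched as follows; the two samples here are hypothetical simulated log-expression values, and the lowess smoother stands in for the loess line.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.normal(8, 2, 500)            # sample X, hypothetical log-expression
y = x + rng.normal(0, 0.3, 500)      # sample Y, mostly similar to X

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(x, bins=30)             # first overview: per-sample histogram
axes[0].set_title("histogram, sample X")

sm = lowess(y, x, frac=0.3)          # smoother line through the point cloud
axes[1].scatter(x, y, s=5)
axes[1].plot(sm[:, 0], sm[:, 1], color="red")
axes[1].set_title("X vs Y with loess")

axes[2].plot(np.sort(x), np.sort(y))  # systematic differences show up here
axes[2].set_title("sorted(X) vs sorted(Y)")

fig.savefig("overview.png")
```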
@ RPKM:
Fully agree with mrawlins... I don't think it is a very good way to quantify expression levels (from a statistical point of view, as described in 'Transcript length bias in RNA-seq data confounds systems biology' or 'Uncovering the Complexity of Transcriptomes with RNA-Seq').
Experimentally: if one does not sequence full-length RNA, RPKMs are not a good choice at all.
In some poly(A)-tail amplified samples I saw that, in general, all transcripts were only covered over roughly the 600 bases at the 3' end. In addition, the coverage was not uniform (it drops linearly from the 3' end towards the 5' end): the well-known amplification bias. As a result, the number of reads per transcript is no longer correlated with transcript length (if one excludes the outliers). Dividing by transcript length would therefore rather be a 'mistake' than a 'correction'.
Originally posted by mrawlins:
This has the added advantage of stabilizing the variance to an exactly hypergeometric distribution, so a Fisher exact test or chi-square test results in very good estimation of significant differential expression.
(*) And my opinion: no variance, no test.
Last edited by schmima; 07-30-2010, 04:18 AM.
Originally posted by schmima:
@ expression values (what RPKM is in the end):
There are several ways to display those, always depending on what you would like to show. You might get some ideas from microarray studies (as some of the statistics are not too different).
To get a first overview I normally plot histograms of all samples, scatterplots with loess (sample X vs. sample Y plus a smoother line), and sorted scatterplots (sorted(sampleX) vs. sorted(sampleY); systematic differences are normally quite visible there). These plots normally help to find a proper transformation and normalisation.
Thanks for any advice.
Originally posted by schmima:
(*) and my opinion: no variance, no test.
I did read that DESeq can deal with non-replicated data.
I can't insert a picture, as I don't have online storage at hand. Is there a way to upload pictures within the forum?
Anyway, most of the problems regarding basic transformation (which is not the same as normalisation/standardisation between samples) are discussed in most basic statistics courses/books/lectures.
In the following I write what I personally learned, experienced and think; if there are any bad errors/other opinions/comments, I would be happy to read about them.
First, some thoughts about transformations:
The important thing is that one first checks the assumptions made by the software/tests. As an example: if you do a t-test between two groups you assume that the data is...
1. normally
2. identically
3. independently
distributed.
Now: is this the case?
Here is what I would do in this example:
For every single sample, plot the observed quantiles versus the theoretical quantiles of the normal distribution (sometimes called a Q-Q plot; you should find it on the web or in books) and also a histogram (some things are more intuitive to see that way). Additionally, one should also plot residuals/SDs if replicates are available.
Some things to check in the plots:
Does the Q-Q plot show a straight line?
Does the histogram look like a normal distribution?
Are the residuals/SDs similar over all data values?
Yes -> fine, go on.
No -> how is it skewed? What is wrong in the residual plot?... From here on you may check in books/online.
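A quick numerical companion to the Q-Q plot check: `scipy.stats.probplot` returns the quantile pairs together with the correlation of the fitted line, and values near 1 mean the plot is close to straight. The Poisson counts here are hypothetical toy data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(20, 1000).astype(float)   # hypothetical raw count data

def qq_r(sample):
    """Correlation of observed vs. theoretical normal quantiles;
    values near 1 mean the Q-Q plot shows a straight line."""
    (osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    return r

r_raw = qq_r(counts)
r_sqrt = qq_r(np.sqrt(counts))       # root(x), Tukey's choice for count data
print(f"raw counts:   r = {r_raw:.4f}")
print(f"root(counts): r = {r_sqrt:.4f}")
```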
Some general hints regarding transformations if you assume a normal distribution (!) -> Tukey's first-aid transformations:
- log(x) for concentrations and absolute values (non-negative values)
- root(x) for count data
- arcsin(root(x)) for proportions (%-values/100)
Note: log(x) also often fits count data well.
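The three first-aid rules above can be wrapped in a small helper; `tukey_first_aid` is a hypothetical name used only for illustration.

```python
import numpy as np

def tukey_first_aid(x, kind):
    """Tukey's 'first aid' transformations (hypothetical helper name)."""
    x = np.asarray(x, dtype=float)
    if kind == "concentration":          # non-negative absolute values
        return np.log(x)
    if kind == "count":                  # counting data
        return np.sqrt(x)
    if kind == "proportion":             # parts, i.e. %-values / 100
        return np.arcsin(np.sqrt(x))
    raise ValueError(f"unknown kind: {kind}")

print(tukey_first_aid([1, 10, 100], "concentration"))
print(tukey_first_aid([0, 1, 4, 9], "count"))
print(tukey_first_aid([0.0, 0.25, 1.0], "proportion"))
```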
However, to be more general again, for transformations you should always:
1. Know what assumptions are made in later steps (this is crucial anyway).
2. Check whether your data fits the assumptions (this is not always just plotting work; you should also think about whether it fits the assumptions in theory, e.g.: are the values independent of each other? Can your data be distributed the way you assume at all?).
3. Transform if necessary.
4. Check whether the transformation really did what you wanted (does the data now fit your assumptions? If yes, fine; otherwise start again at 1.).
Therefore, as I don't know your data, your assumptions and so on, I can't just tell you what you should do or not.
About 'normalisation' (or: making samples comparable):
I guess you will find a lot of this in the microarray literature.
However, some personal ideas/thoughts about your example (a DE study with two samples):
Again (yes, this is indeed one of the most important questions):
What do you assume? Let's say you assume that the samples should be relatively similar, with few genes being DE and most genes being more or less equally expressed.
What is the consequence of your assumption, and how can you check it?
In a scatterplot you should see most points around the first diagonal (which means that the expression level is very similar in both samples). Plotting a loess curve can help to see things better (a cloud of dots is not always easy to interpret...). It's important to play a bit with the parameters in the program so that you get an informative loess curve (meaning it properly shows the general trends in your data).
Have a look at the scatterplot; some things you could see:
1. Something fitting your assumption very well (loess on the first diagonal, some but not too many outliers, similar numbers of total reads in both samples)
-> leave the data as it is.
2. Something that partly fits your assumption (the loess is a relatively straight line BUT not on the first diagonal, some but not too many outliers, different numbers of total reads (this is the reason why the loess is not on the first diagonal)). Choose something that 'moves' the whole cloud of points towards the first diagonal. In other words, the loess curve is straight (which is what you want) but shifted, so you need to move the cloud/loess towards the first diagonal without changing its shape.
-> in this case you could use some kind of (trimmed) mean/median scaling.
3. Something that does not fit your assumptions very well (the loess is not straight, maybe also shifted away from the first diagonal, still not too many outliers, maybe different numbers of total reads). Now you need something that makes the line straight and shifts it to the right place, which is a bit more difficult.
-> you could try a loess normalisation or maybe also a quantile normalisation (implementations can be found in microarray packages).
4. Something that is absolutely outside your assumptions and real crap. Well... either the experiment failed or your assumptions are not correct.
-> think about the assumptions and the experiment.
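Cases 2 and 3 above can be sketched as follows, assuming the values are already on a log scale (so a global scaling factor becomes an additive shift); both samples are hypothetical simulated data, and `quantile_normalize` is a minimal illustration, not a microarray-package implementation.

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(2)
x = rng.normal(8, 2, 1000)                # sample X, hypothetical log values
y = x + 1.5 + rng.normal(0, 0.2, 1000)    # sample Y: globally shifted

# Case 2: loess straight but shifted -> trimmed-mean scaling moves the
# whole cloud back onto the first diagonal without changing its shape.
shift = trim_mean(y, 0.05) - trim_mean(x, 0.05)
y_scaled = y - shift

# Case 3: loess bent -> quantile normalisation forces both samples onto
# one common distribution (here: the mean of the per-sample sorted values).
def quantile_normalize(a, b):
    target = (np.sort(a) + np.sort(b)) / 2
    rank = lambda v: np.argsort(np.argsort(v))
    return target[rank(a)], target[rank(b)]

xq, yq = quantile_normalize(x, y)
print(f"remaining global shift after scaling: {np.mean(y_scaled - x):.3f}")
```

Trimming before taking the mean keeps the handful of genuinely DE genes (the outliers) from dragging the scaling factor around, which is the point of the "some but not too many outliers" caveat above.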
Now, after the normalisation:
Does your data now fit your assumptions?
Yes -> go on.
No -> start again / check whether you made a mistake in the algorithm.
Again, more generally:
The basic principle of 'normalisation' is to make the samples comparable (get rid of systematic differences if they should not be there, which is again an assumption you make). I can't tell you what the best thing to do is. I think it also depends on the number of samples you have and the assumptions you make (e.g. invariant-set normalisation I would maybe try for a bigger set of samples where the samples are not too similar; quantile normalisation I would maybe use for several samples that should in theory be very similar (and that in practice do not show very strange biases); loess I may use if I have strange scatterplots (few or many samples); some kind of scaling I often use if the data fits the assumptions quite well and there are only some global differences).
And note that some algorithms (especially if you adapt them from microarray packages) may require different kinds of values / assume different things; you may check this in the respective publications.
Summary  whatever you do:
1. Know the assumptions
2. Check if your data fits
3. Do something with the data if it doesn't (just keep it legal ^^)
4. Check if your data fits now
5. Proceed
By the way:
With some microarray packages/algorithms I encountered a problem I could not solve properly (not just during calculations but also theoretically):
the 0 in RNA-seq...
However, I'll write about that another time. Maybe I'll find a solution using the full dataset.
edit:
I haven't read the Bullard paper yet (just returned from vacation ^^). I guess there might be something interesting in it.