  • Best way to visualize raw reads, and RPKMs?

    What are the best ways to visualize raw reads and RPKMs? Not just reporting the stats... thanks!

  • #2
    We use IGV to view wiggle files. IGV can visualize SAM/BAM files as well, but for RNA-Seq there is often too much read depth for that to be useful. We even see some crashes when visualizing BAM files because of insufficient memory (even giving IGV all 24 GB of RAM). IGV is a free download, and runs on all major platforms.
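
    For what it's worth, here is a minimal pysam sketch of building a per-base fixedStep wiggle coverage track from an indexed BAM for viewing in IGV (the file names are hypothetical, and the pure-Python loop is slow on whole genomes; it just shows the idea):

    ```python
    import pysam

    # Assumes "sample.bam" is coordinate-sorted and indexed (hypothetical name)
    with pysam.AlignmentFile("sample.bam", "rb") as bam, open("coverage.wig", "w") as out:
        for contig, length in zip(bam.references, bam.lengths):
            # count_coverage returns four per-base arrays (reads with A, C, G, T)
            a, c, g, t = bam.count_coverage(contig, 0, length)
            out.write(f"fixedStep chrom={contig} start=1 step=1\n")
            for i in range(length):
                out.write(f"{a[i] + c[i] + g[i] + t[i]}\n")
    ```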

    Visualizing RPKMs is a bit trickier. You can plot them in a bar graph, where each bar is a different gene. It depends some on what you want to say about them.
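
    For the bar-graph option, a minimal matplotlib sketch (gene names and RPKM values are made up):

    ```python
    import matplotlib.pyplot as plt

    # Hypothetical per-gene RPKM values
    genes = ["GAPDH", "ACTB", "TP53", "MYC"]
    rpkms = [152.3, 341.8, 12.6, 48.1]

    plt.bar(genes, rpkms)
    plt.ylabel("RPKM")
    plt.title("Per-gene expression")
    plt.show()
    ```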

    In my experience the RPKM is a terrible way of quantifying transcripts. It may work better for genetic experiments, but I doubt it. The library preparation introduces some positional bias, which can bias RPKM for or against certain genes. This confounds comparing within a sample. When comparing between samples the length normalization doesn't actually help (because it cancels out in most between-sample comparisons). Normalizing by the total read count is a global normalization that can be replaced with quantile normalization (as in Bullard et al., http://www.biomedcentral.com/1471-2105/11/94). This has the added advantage of stabilizing the variance to an exactly hypergeometric distribution, so a Fisher Exact Test or chi-square test results in very good estimation of significant Differential Expression.
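
    Not the exact Bullard et al. pipeline, but a minimal pandas sketch of the quantile-normalization idea (toy counts; each sample's distribution is forced onto the average distribution):

    ```python
    import numpy as np
    import pandas as pd

    def quantile_normalize(counts: pd.DataFrame) -> pd.DataFrame:
        """Replace each sample's values by the cross-sample mean at each rank."""
        ranks = counts.rank(method="first").astype(int)
        rank_means = pd.DataFrame(np.sort(counts.values, axis=0)).mean(axis=1).values
        return ranks.apply(lambda col: rank_means[col.values - 1])

    # Hypothetical genes-by-samples count table
    counts = pd.DataFrame({"s1": [10, 200, 30, 5], "s2": [20, 150, 60, 1]})
    print(quantile_normalize(counts))
    ```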

    That's the long way of saying: I wouldn't (and don't) use RPKMs.

    Beyond that, how you visualize depends some on what you're looking for.



    • #3
      You may wish to try a trial of NextGENe; it provides good visualization and RPKM values.



      • #4
        Our software, SeqMonk, is designed to view very large sequence datasets efficiently on modest hardware. It takes in raw mapped sequence data rather than wig files, but allows you to quantitate your data within the program and view both the quantitated and raw data.

        SeqMonk is free software and works on pretty much all platforms.



        • #5
          @ raw reads:
          GBrowse 2.03 and SAMtools work fine on my computer (just a desktop machine). It depends a bit on the settings you choose... though if you want to view the individual read alignments over a bigger stretch of the genome you may run into problems.

          @ expression values (which is what RPKM is in the end):
          There are several ways to display those, always depending on what you would like to show. You can get some ideas from microarray studies (as some of the statistics are not too different).
          To get a first overview I normally plot histograms of all samples, scatterplots with loess (sample X vs sample Y plus a smoother line), and sorted scatterplots (sorted(sampleX) vs sorted(sampleY) - systematic differences are normally quite visible there). These plots normally help to find a proper transformation and normalisation; a sketch follows below.
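
          A sketch of those first-overview plots in Python, using the lowess smoother from statsmodels (synthetic data standing in for two samples' log expression values):

          ```python
          import numpy as np
          import matplotlib.pyplot as plt
          from statsmodels.nonparametric.smoothers_lowess import lowess

          rng = np.random.default_rng(0)
          x = rng.normal(5, 2, 1000)        # hypothetical log expression, sample X
          y = x + rng.normal(0, 0.5, 1000)  # hypothetical sample Y

          # Scatterplot with a loess smoother line
          fit = lowess(y, x, frac=0.3)      # (x, smoothed y) pairs, sorted by x
          plt.scatter(x, y, s=3, alpha=0.3)
          plt.plot(fit[:, 0], fit[:, 1], "r")
          plt.xlabel("sample X"); plt.ylabel("sample Y")
          plt.show()

          # Sorted scatterplot: systematic differences appear as departures from the diagonal
          plt.scatter(np.sort(x), np.sort(y), s=3)
          plt.plot([x.min(), x.max()], [x.min(), x.max()], "k--")
          plt.show()
          ```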

          @ RPKM:
          I fully agree with mrawlins... I don't think it is a very good way to quantify expression levels (from a statistical point of view, as described in 'Transcript length bias in RNA-seq data confounds systems biology' or 'Uncovering the complexity of transcriptomes with RNA-Seq').
          Experimentally: if one does not sequence full-length RNA, RPKMs are not a good choice at all.
          In some poly(A)-amplified samples I saw that, in general, transcripts were only covered over roughly the 600 bases at the 3' end. In addition, the coverage was not uniform (it drops linearly from the 3' end towards the 5' end). So - the well-known amplification bias... As a consequence, the number of reads per transcript is no longer correlated with transcript size (if one excludes the outliers). Division by transcript size would therefore be a 'mistake' rather than a 'correction'.
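
          For reference, the quantity under discussion - a minimal sketch of the usual RPKM computation (toy numbers):

          ```python
          def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
              # Reads Per Kilobase of transcript per Million mapped reads
              return gene_reads / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

          # Hypothetical example: 500 reads on a 2 kb transcript in a 10M-read library
          print(rpkm(500, 2_000, 10_000_000))  # -> 25.0
          ```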

          Originally posted by mrawlins View Post
          This has the added advantage of stabilizing the variance to an exactly hypergeometric distribution, so a Fisher Exact Test or chi-square test results in very good estimation of significant Differential Expression.
          Hmm - so far I'm not very convinced by the approaches using Fisher's exact or chi-square tests (mainly because of the frequently provided 'no replicates' mode)... I once compared two biological replicates - and as expected it gave around 12,000 differentially expressed genes (out of 19,000 in total). So a measure of variance is - at least for my samples - still very important (*). Right now I'm still waiting for the rest of the replicates... Curious whether these approaches will work better once one has a variance estimate.

          (*) and - my opinion: no variance - no test.
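
          For concreteness, the kind of per-gene 2x2 test being questioned here, via scipy (the counts are invented; this is the no-replicates setup):

          ```python
          from scipy.stats import fisher_exact

          # Hypothetical 2x2 table for one gene:
          # rows = (reads on this gene, reads on all other genes)
          # columns = (sample A, sample B)
          table = [[120, 45],
                   [9_999_880, 9_999_955]]

          odds_ratio, p_value = fisher_exact(table)
          print(p_value)
          ```
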
          Last edited by schmima; 07-30-2010, 04:18 AM.



          • #6
            Originally posted by schmima View Post
            @ expression values (which is what RPKM is in the end):
            There are several ways to display those, always depending on what you would like to show. You can get some ideas from microarray studies (as some of the statistics are not too different).
            To get a first overview I normally plot histograms of all samples, scatterplots with loess (sample X vs sample Y plus a smoother line), and sorted scatterplots (sorted(sampleX) vs sorted(sampleY) - systematic differences are normally quite visible there). These plots normally help to find a proper transformation and normalisation.
            Thanks for the useful posts; I would like to have a go at producing such plots with my data. Do you have some example plots, and what would you look for before deciding how to transform? How would you transform the data after this? Do you apply the Bullard et al. method of normalisation?

            Thanks for any advice.

            Originally posted by schmima View Post
            (*) and - my opinion: no variance - no test.
            Alas, our lab is not rich enough to afford replicate RNA-seq experiments.
            I did read that DESeq can deal with non-replicated data.



            • #7
              I can't insert a picture, as I don't have any online storage at hand. Is there a way to upload pictures within the forum?

              Anyway - most of the problems regarding basic transformation (which is not the same as normalisation/standardisation between samples) are discussed in most basic statistics courses/books/lectures.

              In the following I write what I have personally learned, experienced and think - if there are some bad errors/other opinions/comments, I would be happy to read about them.

              First some thoughts about transformations:
              The important thing is to first check the assumptions made by the software/tests. As an example - if you do a t-test between two groups you assume that the data is...
              1. normally
              2. identically
              3. independently
              distributed.
              Now - is this the case?
              In this example, I would do the following:
              For every single sample, plot the observed quantiles versus the theoretical quantiles of the normal distribution (the so-called QQ-plot - you will find it on the web or in books) and also a histogram (some things are more intuitive to see that way); a sketch follows after the checklist below. Additionally, one should plot residuals/sd if replicates are available.
              Some things to check in the plots:
              Does the QQ-Plot show a straight line?
              Does the histogram look like a normal distribution?
              Are the residuals/sds similar over all data values?
              Yes -> fine, go on.
              No -> how is it skewed? What is wrong in the residual plot?... From here on you can check books/online.
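
              A minimal sketch of those two checks with scipy/matplotlib (synthetic data; a skewed lognormal sample like this one visibly fails both):

              ```python
              import matplotlib.pyplot as plt
              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(1)
              x = rng.lognormal(mean=3, sigma=1, size=500)  # hypothetical expression values

              fig, axes = plt.subplots(1, 2)
              stats.probplot(x, dist="norm", plot=axes[0])  # QQ-plot against the normal
              axes[1].hist(x, bins=40)                      # histogram for the visual check
              plt.show()
              ```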

              Some general hints regarding transformations if you assume a normal distribution (!) -> Tukey's first aid transformations.
              - log(x) for concentrations and absolute values (non-neg values)
              - root(x) for counting data
              - arcsin(root(x)) for parts (%-values/100)
              Note - log(x) also often fits counting data well.
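
              The same first-aid transformations in numpy (all values are made up):

              ```python
              import numpy as np

              concentrations = np.array([0.5, 2.0, 9.0, 100.0])  # non-negative values
              read_counts    = np.array([0, 3, 12, 40])          # counting data
              proportions    = np.array([0.05, 0.50, 0.95])      # %-values / 100

              log_t    = np.log(concentrations)
              sqrt_t   = np.sqrt(read_counts)
              arcsin_t = np.arcsin(np.sqrt(proportions))
              ```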

              However - to be more general again:
              For transformations you should always:
              1. Know what assumptions are made in later steps (this is crucial anyway)
              2. Check whether your data fits the assumptions (this is not always just plotting work - you should also think about whether it fits the assumptions in theory - e.g.: are the values independent of each other? Can your data even be distributed in the way you assume?)
              3. Transform if necessary
              4. Check whether the transformation really did what you wanted (does the data now fit your assumptions? If yes, fine; otherwise start again at 1.)

              Therefore - as I don't know your data, your assumptions and so on - I can't just tell you what to do or not.


              About 'normalisation' (or: making samples comparable):
              I guess you will find a lot of this in the microarray literature.
              However - some personal ideas/thoughts about your example (DE study with two samples):

              Again (yes - this is indeed one of the most important questions):
              What do you assume? Let's say you assume that the samples should be relatively similar with few genes being DE and with most genes being more or less equally expressed.

              What is the consequence of your assumption - how can you check it?
              In a scatterplot you should see most points around the first diagonal (which means that the expression level is very similar in both samples). Plotting a loess curve can help to see things better (a cloud of dots is not always easy to interpret...). It's important to play a bit with the parameters in the program so that you get an informative loess curve (i.e., one that properly shows the general trends in your data).

              Have a look at the scatterplot - some things you could see:

              1. something fitting your assumption very well (loess on first diagonal, some, but not too many outliers, similar number of total reads in both samples)
              -> leave the data as it is.

              2. something that partly fits your assumption (the loess is a relatively straight line BUT not on the first diagonal, some but not too many outliers, different numbers of total reads (this being the reason why the loess is not on the first diagonal)). In other words, the loess curve is straight (which is what you want), but it's shifted. So you need to move the cloud/loess towards the first diagonal without changing its shape.
              -> in this case you could use some kind of (trimmed) mean/median scaling (see the sketch after this list).

              3. something that does not fit your assumptions very well (the loess is not straight, maybe also shifted away from the first diagonal, still not too many outliers, maybe different numbers of total reads). Now you need something that makes the line straight and shifts it to the right place... which is a bit more difficult.
              -> you could try a loess normalisation or maybe a quantile normalisation (implementations can be found in microarray packages; see the sketch after this list).

              4. something that is completely outside your assumptions and a real mess. Well... either the experiment failed or your assumptions are not correct.
              -> think about the assumptions and the experiment
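
              Sketches of the fixes for cases 2 and 3 above: median scaling to shift the whole cloud, and an MA-style loess normalisation to straighten a bent trend (synthetic log-scale data; the loess fit comes from statsmodels):

              ```python
              import numpy as np
              from statsmodels.nonparametric.smoothers_lowess import lowess

              rng = np.random.default_rng(2)
              x = rng.normal(5, 2, 1000)  # hypothetical log expression, sample X

              # Case 2: straight loess, but shifted off the diagonal -> median scaling
              y2 = x + 0.7 + rng.normal(0, 0.3, 1000)
              y2_scaled = y2 - (np.median(y2) - np.median(x))

              # Case 3: bent loess -> fit M = Y - X against A = (X + Y) / 2, subtract the trend
              y3 = x + 0.05 * (x - 5) ** 2 + rng.normal(0, 0.3, 1000)
              a, m = (x + y3) / 2, y3 - x
              trend = lowess(m, a, frac=0.3, return_sorted=False)  # fitted M at each point
              y3_norm = y3 - trend
              ```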

              Now - after the normalisation:
              Does your data fit now to your assumptions?
              Yes -> go on
              No -> start again/check if you made a mistake in the algorithm

              Again, more generally:
              The basic principle of 'normalisation' is to make the samples comparable (get rid of systematic differences if they should not be there - which is again an assumption you make). I can't tell you what the best thing to do is. I think it also depends on the number of samples you have and the assumptions you make (e.g. an invariant-set approach I would maybe try for a bigger set of samples where the samples are not too similar; quantile normalisation I would maybe use for several samples that should in theory be very similar (and that in practice do not show very strange biases); loess I may use if I have strange scatterplots (few or more samples); and some kind of scaling I often use if the data fits the assumptions quite well and there are only some global differences).

              And note that some algorithms (especially if you adapt them from microarray packages) may require different kinds of values / assume different things - you should check this in the respective publications.

              Summary - whatever you do:
              1. Know the assumptions
              2. Check if your data fits
              3. Do something with the data if not (just - keep it legal ^^)
              4. Check if your data fits now
              5. Proceed


              By the way:
              With some microarray packages/algorithms I encountered a problem I could not solve properly (not just during the calculations, but also theoretically):
              the zeros in RNA-seq...
              However - I'll write about that another time. Maybe I'll find a solution using the full dataset.


              edit:
              I haven't read the Bullard paper yet (just returned from vacation ^^). I guess there might be something interesting in it.
              Last edited by schmima; 08-23-2010, 12:25 AM. Reason: adding a paragraph
