  • ZoeG
    Member
    • Jun 2013
    • 31

    How to deal with the variance of reads in biological duplicates?

    Hi all, since I am fairly new to RNA-Seq analysis, I have a lot of questions... Thanks in advance for your help.
    My question is: if the total number of reads varies among my biological duplicates, will that affect the results, for example differential gene expression?
    I have several biological duplicates for each group, and the total number of reads varies between 80M and 150M. I will normalize them by RPKM; does this cancel the effect of the difference in read totals? If not, how can I minimize the effect of this variation among duplicates?
    Last edited by ZoeG; 06-14-2013, 10:57 AM.
  • john_nl
    Member
    • Feb 2012
    • 13

    #2
    Most tools that test for differential expression will account for this problem. For example, see the median count ratio method of the DESeq packages.
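As a rough illustration of the median count ratio idea mentioned above, here is a minimal pure-Python sketch. It is not the DESeq code itself; the function name `median_ratio_size_factors` and the toy counts are made up for this example. Each sample gets one scaling factor, taken as the median, over genes, of that sample's count divided by the gene's geometric mean across samples.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Estimate one scaling factor per sample.

    counts: list of rows, one row per gene; each row holds the raw read
    counts for that gene, one entry per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes seen in every sample, so the logarithm is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    # The per-gene geometric mean across samples acts as a reference sample.
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): sample 2 is sample 1 sequenced twice as deeply.
raw = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # ignored when estimating factors: zero in one sample
]
factors = median_ratio_size_factors(raw)
normalized = [[c / f for c, f in zip(row, factors)] for row in raw]
print(factors)
print(normalized[:3])
```

In this toy case the second factor comes out twice the first, so dividing by the factors puts both samples on a common scale regardless of how many reads each one received.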


    • mbblack
      Senior Member
      • Aug 2009
      • 245

      #3
      The whole reason one includes biological replicates is that there will inherently be variation among any random samples drawn from each population. Having the replicates allows you to estimate not only the population expression level for a given gene, but also the natural variation around that estimate. Having that estimate and its variance is what allows you to make a statistical inference of significance in the first place.

      So, the statistical approaches to differential gene expression do NOT cancel out the differences between replicates, but use them to compute statistical significance (which really is all about adequately accounting for the natural variation in population estimates of gene expression).

      The natural variation about your estimate of expression within a population is what it is. You now need to test whether, given the observed expression and its associated observed variation, you can say anything about the significance of differences in expression between populations. Normalization techniques and appropriate statistical tests will deal with that.

      The purpose of normalization is to bring the sampling probability distributions of all samples into alignment, so that means and variances are appropriate for statistical comparison. So it does not cancel out variation; it adjusts the means and the distribution of variance into alignment for a common comparison across all samples.
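To make the point above concrete, here is a small sketch with invented numbers. It scales one gene's counts by library size (counts per million, used here only as a simple stand-in for a normalization step) and shows that the replicate-to-replicate variance is still there afterwards and enters the test statistic. The Welch-style t statistic is an illustrative choice, not something prescribed in this thread.

```python
import math
from statistics import mean, variance


def counts_per_million(gene_count, library_size):
    """Scale a raw count by the total number of reads in its sample."""
    return gene_count * 1e6 / library_size


# Hypothetical data for ONE gene: (raw count, total reads in that sample).
group_a = [(800, 80e6), (1500, 150e6), (1320, 110e6)]
group_b = [(1800, 90e6), (2600, 130e6), (2400, 100e6)]

cpm_a = [counts_per_million(c, n) for c, n in group_a]
cpm_b = [counts_per_million(c, n) for c, n in group_b]

# Normalization removes the depth effect, not the biological spread.
var_a = variance(cpm_a)  # sample variance within group A
var_b = variance(cpm_b)  # sample variance within group B

# The remaining within-group variance is what the test statistic relies on.
standard_error = math.sqrt(var_a / len(cpm_a) + var_b / len(cpm_b))
t_stat = (mean(cpm_b) - mean(cpm_a)) / standard_error

print(cpm_a, cpm_b)
print(var_a, var_b, t_stat)
```

The raw counts in group A range from 800 to 1500 mostly because of sequencing depth; after scaling they are close together, and what is left is the variation between replicates that the test needs.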
      Last edited by mbblack; 06-17-2013, 05:30 AM.
      Michael Black, Ph.D.
      ScitoVation LLC. RTP, N.C.


      • ZoeG
        Member
        • Jun 2013
        • 31

        #4
        Thanks, mbblack and john. The explanation of the statistical approaches makes sense. We apply normalization first to adjust the probability distribution, and then use parameters such as the means, variances, and median count ratio to evaluate the consistency of, or differences between, samples.
        In this process, normalization is the starting point and hence quite important. As far as I know, RPKM is quite a popular way to normalize RNA-Seq data. I also found a paper that proposes what it describes as a better method for normalizing RNA-Seq data; here is the link: http://genomebiology.com/2010/11/3/r25

        Any comments on this method?
        Or would it be okay to apply this method to normalize the data and then run RPKM as a second normalization? Would a combined normalization make sense?

        @mbblack @john_nl


        • mbblack
          Senior Member
          • Aug 2009
          • 245

          #5
          ZoeG - the current state of the art in RNA-Seq and differential gene expression is that there is no clear consensus on which normalization technique is optimal, or under what circumstances one might choose one over another. I can tell you from personal experience that the choice of normalization technique can have a very large effect on differential expression results.

          My suggestion to you would be to not settle on one single analysis approach up front, but to explore several, read some literature, and try to make the best informed decision you can about how to proceed.

          There are several good, well-documented tools freely available in R/Bioconductor, and there are numerous commercial apps you may have access to (Partek, JMP Genomics, CLC Bio's software, to name a few).

          In many ways, I think RNA-Seq DGE analysis is in the state that microarray analysis was in about 10 years ago. Many people are exploring many potential ways of treating the raw data, but just what will finally settle out as best practices and optimal analyses will take more time, as more and varied data sets are explored and published.

          For what it is worth, I'd still say that RPKM remains the most published normalization scheme to date, and it actually seems to perform reasonably well in many circumstances. Once you have RPKM values, you can analyze the data with a simple ANOVA. Then investigate some other approaches, like edgeR or DESeq. There are also some non-parametric tools available to look into.
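For anyone who wants to see the RPKM-then-ANOVA route spelled out, here is a minimal pure-Python sketch for a single gene. The gene length, counts, and library sizes are invented, and the helper names `rpkm` and `one_way_anova_f` are hypothetical. It only computes the F statistic; getting a p-value would additionally need the F distribution, for example via `scipy.stats.f_oneway`.

```python
from statistics import mean


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; groups is a list of lists of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical gene of 2000 bp; each tuple is (raw count, total mapped reads).
GENE_LENGTH_BP = 2000
control = [(1600, 80e6), (3000, 150e6), (2640, 110e6)]
treated = [(3600, 90e6), (5200, 130e6), (4800, 100e6)]

rpkm_control = [rpkm(c, GENE_LENGTH_BP, n) for c, n in control]
rpkm_treated = [rpkm(c, GENE_LENGTH_BP, n) for c, n in treated]

f_stat = one_way_anova_f([rpkm_control, rpkm_treated])
print(rpkm_control, rpkm_treated, f_stat)
```

With a real data set this would be repeated for every gene, followed by a multiple-testing correction, which is not shown here.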

          As far as your question - no, you should not be combining RPKM with that scaling approach.
          Michael Black, Ph.D.
          ScitoVation LLC. RTP, N.C.


          • ZoeG
            Member
            • Jun 2013
            • 31

            #6
            Dear mbblack, thanks a lot for all the information.
            A wonderful starting guide for a new researcher in the field.


