  • mihuzx
    Member
    • Apr 2013
    • 20

    RNA-seq bio-replication with low correlation

    Hi!
    I sequenced two biological replicates for one condition on the HiSeq 2000 platform,
    but the Pearson correlation (R) of gene expression (raw counts) between the two biological replicates is only 0.93, i.e. R² is only about 0.87.
    Can I use these 2 samples for differential expression analysis?
    Any suggestions for how to use them to call DE genes?
    Recommended reading would also be very helpful.

    Thanks all.
  • mikep
    Member
    • Feb 2011
    • 45

    #2
    Raw counts don't follow a linear distribution. Use Spearman, not Pearson. And discard any genes with 0 counts. Actually, I'd probably discard genes with < 10.

    Secondly, is this human data, or in other words are your biological replicates sampled from different individuals with a heterogeneous genetic background?
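
A minimal sketch of the filtering and the two correlations suggested above, in plain Python so it runs without dependencies. The counts are made up, the helper names are hypothetical, and the cutoff is applied to both replicates, which is one possible reading of "discard genes with < 10". With real data, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` do the same job.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def filtered_correlations(rep1, rep2, min_count=10):
    """Drop genes below min_count in either replicate, then correlate."""
    kept = [(a, b) for a, b in zip(rep1, rep2)
            if a >= min_count and b >= min_count]
    x = [a for a, _ in kept]
    y = [b for _, b in kept]
    return pearson(x, y), spearman(x, y), len(kept)

# Made-up raw counts for six genes in two replicates.
rep1 = [0, 4, 25, 110, 980, 15000]
rep2 = [1, 9, 40, 95, 1200, 11000]
r, rho, n_genes = filtered_correlations(rep1, rep2)
print(f"genes kept: {n_genes}, Pearson: {r:.4f}, Spearman: {rho:.4f}")
```

Note that on raw counts Pearson's r is driven mostly by the few highest-count genes, which is one reason to compare it with a rank-based measure.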


    • mihuzx
      Member
      • Apr 2013
      • 20

      #3
      Thank you for your advice.
      I removed all genes with < 10 counts and calculated the Spearman correlation, but it is still only about 0.93.
      I also calculated the correlation again using a < 1 RPKM gene cutoff, and it didn't change.
      Now I wonder whether I can use these samples to call DE genes, and how much this will affect the result.
      If I do use the data, how can I minimize the effect?


      • velt
        Member
        • Jun 2013
        • 10

        #4
        The Pearson and Spearman correlation coefficients are not well suited to RNA-seq count data. Indeed, we want to know if expression values are the same between two samples (linearity => Pearson coefficient), not just whether they have an increasing or decreasing trend (Spearman coefficient). But, Pearson’s r is generally ambiguous and highly dependent on sequencing depth and the range of expression levels inherent to the sample (difference between lowest and highest bin count).

        I think it is difficult, from these coefficients, to determine if the samples are good replicates or not.

        I advise you to read this publication and to use the SERE coefficient, which is well suited to the comparison of RNA-seq samples:

        SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter.


        A score of 1 indicates faithful replication, and the higher the score, the more the samples differ. I use this coefficient to explore my data.
        Last edited by velt; 08-06-2014, 12:08 AM.
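
For readers who want to try this, here is a rough sketch of how a SERE-style score can be computed, based on my reading of the method: the square root of a Pearson chi-square statistic, comparing observed counts with the counts expected from each sample's library size, divided by its degrees of freedom. The function name, the `min_total` cutoff and the toy data are assumptions; check the publication and its reference code before relying on this.

```python
from math import sqrt

def sere(counts, min_total=1):
    """counts: list of per-gene rows, each holding one raw count per sample.

    Assumes every sample has at least one read in total.
    """
    n_samples = len(counts[0])
    # Library sizes are taken over all genes, before any filtering.
    lane_totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    grand_total = sum(lane_totals)
    # Keep genes whose total count across samples exceeds min_total.
    kept = [row for row in counts if sum(row) > min_total]
    chi_sq = 0.0
    for row in kept:
        gene_total = sum(row)
        for observed, lane_total in zip(row, lane_totals):
            # Count expected if the gene's reads were split in
            # proportion to library size.
            expected = gene_total * lane_total / grand_total
            chi_sq += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(kept) * (n_samples - 1)
    return sqrt(chi_sq / degrees_of_freedom)

# Identical columns (the same data counted twice) score 0.
duplicate = [[10, 10], [200, 200], [35, 35]]
print(sere(duplicate))  # 0.0
```

Under this definition, replicates that differ only by Poisson sampling noise should score close to 1, and extra biological or technical variation pushes the score above 1.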


        • mikep
          Member
          • Feb 2011
          • 45

          #5
          You didn't mention your sample source. If it is different people then 0.93 might be as good as it gets. I get around 0.95 on my data.

          Another option (for future use) is to use a spike-in like ERCC; you can then correlate counts independently of biological variability.

          As for DE, my advice is suck it and see.

          Finally Velt, nice call. Assimilating SERE into our pipeline in 3...2...1...


          • mihuzx
            Member
            • Apr 2013
            • 20

            #6
            Hi velt,
            thank you very much.
            I have tried it with my data. The SERE score is 5.8,
            and for another pair of replicates it is about 3.3.
            Is this too high? Any suggestions?
            By the way, I think this standard is really strict.


            • mbblack
              Senior Member
              • Aug 2009
              • 245

              #7
              Well, your single greatest source of variation when it comes to differential expression is biological variation amongst individuals in your population. So if these were two different individuals, then your observed correlations might not be far off, at least when looking only at raw read counts.

              Also, did you have equal or near equal read depth for each sample? If you had large differences in read depth across the two samples, then raw counts will also vary a great deal because of that.

              Honestly, I would not worry about such differences in raw counts between biological replicates. That sort of variability is the very reason you use biological replication, so you can compute a robust mean population response. Individuals will inherently vary, often a great deal, in raw expression estimates.

              How do your normalized read counts compare for these two samples? That is by far a more meaningful comparison than raw counts. Also, basing a comparison on an N of just 2 can be very misleading, as you have no idea how those two biological samples fall out in terms of the range of variation in expression for your population.
              Last edited by mbblack; 08-06-2014, 04:09 AM.
              Michael Black, Ph.D.
              ScitoVation LLC. RTP, N.C.
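
As a small illustration of the read-depth point, the sketch below scales raw counts to counts per million (CPM), which removes a pure library-size difference between two samples. CPM is only one simple choice made here for illustration; DE packages such as DESeq2 and edgeR apply their own normalization. The counts and function name are made up.

```python
def counts_per_million(counts):
    """Scale raw counts so that the sample sums to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]

# Made-up example: the second sample is the first one sequenced twice as deep.
rep1 = [5, 120, 900, 30]
rep2 = [2 * c for c in rep1]

cpm1 = counts_per_million(rep1)
cpm2 = counts_per_million(rep2)

# Raw counts differ by a factor of two, while the CPM values agree.
for raw1, raw2, n1, n2 in zip(rep1, rep2, cpm1, cpm2):
    print(f"raw {raw1:>4} vs {raw2:>4}   CPM {n1:>10.1f} vs {n2:>10.1f}")
```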


              • bkellman16
                Junior Member
                • Apr 2015
                • 1

                #8
                SERE over log transform

                My understanding is that log(Poisson) [log(counts) in this case] will approximate a normal distribution, thereby achieving linearity. Is there a benefit to using SERE over using the Pearson correlation of log-transformed counts?

