Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • edgeR logFC calculation rounding error?

    Hello all,

    I've been using edgeR to analyze a rna sequence dataset generated by our collaborators. The experimental design is simple: 3 wt (control) and 3 mutant (treatment) individuals.

    Unfortunately, I have noticed a problem with the output. The magnitude of the logFC value that is reported using topTags(et)$table (after executing the exact test) does not match the value I get when I use excel to independently calculate the logged (base 2) ratio of the average cpm's for each category (i.e., log2(avg.cpm.mut)/(avg.cpm.wt)). For example:

    AVG.CPM.WT AVG.CPM.MUT CALC.CPM.OUT.LOGFC EDGER.OUT.LOGFC PAIRWISE.DIFF
    Gene1 112.104446 87.70609579 -0.354094474 -0.348296103 0.005798
    Gene2 112.5008464 123.8749047 0.138948091 0.178525207 0.039577
    and so on...

    Note that I used the cpm(y) function to extract the cpm values for each individual entry (as calculated by edgeR) before calculating their average values.

    The pairwise difference between the two methods for calculating the logFC ranges from very small (<0.000001) to large (>0.2). Regardless of how the logFC is calculated, however, the sign of the logFC value is the same (that is, the directionality of differential gene expression doesn't change). I thought this might just be rounding error, because the two logFC values are highly correlated and the scatter is very consistent throughout the range of values... or am I missing something obvious here?

    Has anyone else run into this problem? If so, how did you resolve this issue?

    Hopefully my explanation is not too confusing - please let me know if I am not being clear enough about what is going on.

    Thanks
    Last edited by mike123; 01-17-2013, 12:28 PM.

  • #2
    Anybody got an answer?

    Comment


    • #3
      There isn't any rounding error. Rather, you are making a number of assumptions here about how things are done, and the calculations in edgeR are actually more subtle than you are assuming.

      The average cpm for a group is not computed in edgeR simply by taking averages of individual cpm values. Doing so would treat all the cpm as equally reliable, whereas reliability actually depends on the library size, the size of the counts and on the negative binomial dispersion.

      The log-fold-change between two groups is not simply the log-ratio of the two average cpm values. That would give wildly variable logFC values for small counts, whereas edgeR returns values that are more stable.

      To get some feeling for what is done, start by reading the help page for exactTest() and looking at the prior.count argument. Also see the predFC() function.

      Comment


      • #4
        That makes sense. Thanks!

        Comment


        • #5
          edgeR logFC calculation

          The best way to understand logFC and logCPM is to take a look at the souce code exactTest as Dr. Smyth suggested.

          Below is how logFC caculated.
          ...
          abundance1 <- mglmOneGroup(y1 + matrix(prior.count[j1], ntags,
          n1, byrow = TRUE),
          offset = offset[j1])
          ...
          abundance2 <- mglmOneGroup(y2 + matrix(prior.count[j2], ntags,
          n2, byrow = TRUE)
          , offset = offset[j2])
          ...
          logFC <- (abundance2 - abundance1)/log(2)


          My question is: I understand raw counts y1, y2 were slightly adjusted by prior.count, but wondering why adjusted this way?

          Thanks,
          Shanrong

          Comment


          • #6
            Originally posted by Shanrong View Post
            My question is: I understand raw counts y1, y2 were slightly adjusted by prior.count, but wondering why adjusted this way?
            To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.

            Comment


            • #7
              plotSmear and maPlot

              I have a deep RNA-Seq dataset (100M reads/sample). After I run the needed steps, I call plotSmear

              ...
              et <- exactTest(er)
              topTags(et)
              de <- decideTestsDGE(et)
              detags <- rownames(er)[as.logical(de)]
              plotSmear(et, de.tags=detags)
              ...

              In my call, plotSmean simply takes et$table$logFC and et$table$logCPM and then calls maPlot. If I drill down the souce cod of maPlot, below is what's happening

              if (!is.null(logAbundance) && !is.null(logFC)) {
              A <- logAbundance
              M <- logFC
              w <- v <- rep(FALSE, length(A))
              w <- A < -25 + log2(1e+06)
              ......
              }

              I plot the histogram of A, and I seems to me "-25" above is not a good cutoff, and "-28" might work better for my dataset.

              Below is is the info for histogram.
              > hist(A)$counts
              [1] 2 1 0 4 7 10 1183 1177 1520 1569 1721 2846 4197 2361 486 113 9 1
              > hist(A)$breaks
              [1] -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16

              Note: -25 + log2(1e+06) = -5.07

              My request: it would be nice if I can pass on the cutoff for A when I call maPlot.

              Comment


              • #8
                transcripts with ZERO reads and calcNormFactors

                Originally posted by Gordon Smyth View Post
                To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.
                Dr. Smyth,

                I am puzzled by those transcripts having ZERO reads at all conditions. I did normalizations: (1) with all transcripts; (2) exclude transcripts with ZERO read at all conditions

                nrow(counts)
                [1] 36004
                sum( rowSums(counts)>0 )
                [1] 28867

                edger <- DGEList(counts=counts,group=conditions)
                edger <- calcNormFactors(edger)

                edger1 <- DGEList(counts=counts[rowSum(counts)>0,],group=conditions)
                edger1 <- calcNormFactors(edger1)

                edger$samples$norm.factors
                [1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643
                edger1$samples$norm.factors
                [1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643

                However, the normalize factors are the SAME. why? Does this indicate such transcripts with ZERO reads are not used at all during calcNormFactors?

                Comment


                • #9
                  Originally posted by Shanrong View Post
                  Does this indicate such transcripts with ZERO reads are not used at all during calcNormFactors?
                  Yes.

                  If a transcript has no reads in any sample, then it is effectively not in the data set at all. We naturally do not want it to affect the results.

                  Comment


                  • #10
                    Originally posted by Shanrong View Post
                    My request: it would be nice if I can pass on the cutoff for A when I call maPlot.
                    When plotting shrunk fold changes in edgeR, you don't actually need a smear cutoff, because none of the fold changes will be infinite.

                    Anyway, this isn't the right forum to ask for changes to the edgeR package. It would be better to email the Bioconductor mailing list, because that is the official support list for edgeR and other Bioconductor packages.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    68 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X