Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • edgeR logFC calculation rounding error?

    Hello all,

    I've been using edgeR to analyze a rna sequence dataset generated by our collaborators. The experimental design is simple: 3 wt (control) and 3 mutant (treatment) individuals.

    Unfortunately, I have noticed a problem with the output. The magnitude of the logFC value that is reported using topTags(et)$table (after executing the exact test) does not match the value I get when I use excel to independently calculate the logged (base 2) ratio of the average cpm's for each category (i.e., log2(avg.cpm.mut)/(avg.cpm.wt)). For example:

    AVG.CPM.WT AVG.CPM.MUT CALC.CPM.OUT.LOGFC EDGER.OUT.LOGFC PAIRWISE.DIFF
    Gene1 112.104446 87.70609579 -0.354094474 -0.348296103 0.005798
    Gene2 112.5008464 123.8749047 0.138948091 0.178525207 0.039577
    and so on...

    Note that I used the cpm(y) function to extract the cpm values for each individual entry (as calculated by edgeR) before calculating their average values.

    The pairwise difference between the two methods for calculating the logFC ranges from very small (<0.000001) to large (>0.2). Regardless of how the logFC is calculated, however, the sign of the logFC value is the same (that is, the directionality of differential gene expression doesn't change). I thought this might just be rounding error, because the two logFC values are highly correlated and the scatter is very consistent throughout the range of values... or am I missing something obvious here?

    Has anyone else run into this problem? If so, how did you resolve this issue?

    Hopefully my explanation is not too confusing - please let me know if I am not being clear enough about what is going on.

    Thanks
    Last edited by mike123; 01-17-2013, 12:28 PM.

  • #2
    Anybody got an answer?

    Comment


    • #3
      There isn't any rounding error. Rather, you are making a number of assumptions here about how things are done, and the calculations in edgeR are actually more subtle than you are assuming.

      The average cpm for a group is not computed in edgeR simply by taking averages of individual cpm values. Doing so would treat all the cpm as equally reliable, whereas reliability actually depends on the library size, the size of the counts and on the negative binomial dispersion.

      The log-fold-change between two groups is not simply the log-ratio of the two average cpm values. That would give wildly variable logFC values for small counts, whereas edgeR returns values that are more stable.

      To get some feeling for what is done, start by reading the help page for exactTest() and looking at the prior.count argument. Also see the predFC() function.

      Comment


      • #4
        That makes sense. Thanks!

        Comment


        • #5
          edgeR logFC calculation

          The best way to understand logFC and logCPM is to take a look at the souce code exactTest as Dr. Smyth suggested.

          Below is how logFC caculated.
          ...
          abundance1 <- mglmOneGroup(y1 + matrix(prior.count[j1], ntags,
          n1, byrow = TRUE),
          offset = offset[j1])
          ...
          abundance2 <- mglmOneGroup(y2 + matrix(prior.count[j2], ntags,
          n2, byrow = TRUE)
          , offset = offset[j2])
          ...
          logFC <- (abundance2 - abundance1)/log(2)


          My question is: I understand raw counts y1, y2 were slightly adjusted by prior.count, but wondering why adjusted this way?

          Thanks,
          Shanrong

          Comment


          • #6
            Originally posted by Shanrong View Post
            My question is: I understand raw counts y1, y2 were slightly adjusted by prior.count, but wondering why adjusted this way?
            To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.

            Comment


            • #7
              plotSmear and maPlot

              I have a deep RNA-Seq dataset (100M reads/sample). After I run the needed steps, I call plotSmear

              ...
              et <- exactTest(er)
              topTags(et)
              de <- decideTestsDGE(et)
              detags <- rownames(er)[as.logical(de)]
              plotSmear(et, de.tags=detags)
              ...

              In my call, plotSmean simply takes et$table$logFC and et$table$logCPM and then calls maPlot. If I drill down the souce cod of maPlot, below is what's happening

              if (!is.null(logAbundance) && !is.null(logFC)) {
              A <- logAbundance
              M <- logFC
              w <- v <- rep(FALSE, length(A))
              w <- A < -25 + log2(1e+06)
              ......
              }

              I plot the histogram of A, and I seems to me "-25" above is not a good cutoff, and "-28" might work better for my dataset.

              Below is is the info for histogram.
              > hist(A)$counts
              [1] 2 1 0 4 7 10 1183 1177 1520 1569 1721 2846 4197 2361 486 113 9 1
              > hist(A)$breaks
              [1] -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16

              Note: -25 + log2(1e+06) = -5.07

              My request: it would be nice if I can pass on the cutoff for A when I call maPlot.

              Comment


              • #8
                transcripts with ZERO reads and calcNormFactors

                Originally posted by Gordon Smyth View Post
                To avoid wildly variable log fold changes for very small counts. See help("predFC") for a longer explanation.
                Dr. Smyth,

                I am puzzled by those transcripts having ZERO reads at all conditions. I did normalizations: (1) with all transcripts; (2) exclude transcripts with ZERO read at all conditions

                nrow(counts)
                [1] 36004
                sum( rowSums(counts)>0 )
                [1] 28867

                edger <- DGEList(counts=counts,group=conditions)
                edger <- calcNormFactors(edger)

                edger1 <- DGEList(counts=counts[rowSum(counts)>0,],group=conditions)
                edger1 <- calcNormFactors(edger1)

                edger$samples$norm.factors
                [1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643
                edger1$samples$norm.factors
                [1] 1.1499722 1.1440693 0.9016199 0.9087874 0.8415538 0.8710643

                However, the normalize factors are the SAME. why? Does this indicate such transcripts with ZERO reads are not used at all during calcNormFactors?

                Comment


                • #9
                  Originally posted by Shanrong View Post
                  Does this indicate such transcripts with ZERO reads are not used at all during calcNormFactors?
                  Yes.

                  If a transcript has no reads in any sample, then it is effectively not in the data set at all. We naturally do not want it to affect the results.

                  Comment


                  • #10
                    Originally posted by Shanrong View Post
                    My request: it would be nice if I can pass on the cutoff for A when I call maPlot.
                    When plotting shrunk fold changes in edgeR, you don't actually need a smear cutoff, because none of the fold changes will be infinite.

                    Anyway, this isn't the right forum to ask for changes to the edgeR package. It would be better to email the Bioconductor mailing list, because that is the official support list for edgeR and other Bioconductor packages.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 08:47 AM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    59 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    54 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X