RNA seq data normalization question

  • RNA seq data normalization question

    Hi,

    Currently I'm working on mRNA Seq and have a question about data normalization.

    If the data is already normalized with RPKM, should I further normalize the data, for example TMM?

    Thanks,

    slny

  • #2
    I am not sure what you mean by TMM.
    --
    bioinfosm


    • #3
      Hello,

      I think if you did RPKM first, that would incorporate any RNA library compositional bias that TMM aims to compensate for, so if you would want to take the compositional bias into account, perhaps use the scaling factor produced by TMM first to adjust the library read counts and then proceed to do RPKM? Or just use the edgeR package in its entirety.

      Ken


      • #4
        Better tell us what you want to do afterwards with your normalized data. This may influence how you want to normalize.


        • #5
          Thanks a lot for all the responses.

          I have mRNA-seq data for two groups and would like to find differentially expressed genes. Currently I use the countOverlaps function to count the reads for each gene and then use edgeR or DESeq for data normalization and differential analysis.

          Because the expression level should be the count of reads for each gene divided by the gene length, I wonder whether I should normalize the data with RPKM first and then further normalize the data with TMM in edgeR.
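
For reference, the RPKM calculation being asked about can be sketched as below. This is only an illustration of the arithmetic (reads per kilobase of gene length per million mapped reads); the function name and the toy numbers are made up for the example and are not taken from any package.

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads.

    counts             -- raw read counts, one per gene (toy values here)
    lengths_bp         -- gene lengths in base pairs, same order as counts
    total_mapped_reads -- library size of the sample
    """
    return [
        c * 1e9 / (length * total_mapped_reads)
        for c, length in zip(counts, lengths_bp)
    ]


# Two hypothetical genes with the same read count but different lengths.
counts = [500, 500]
lengths_bp = [1000, 4000]
values = rpkm(counts, lengths_bp, total_mapped_reads=10_000_000)
print(values)
```

Note that the result is no longer an integer count, which matters for count-based tools such as edgeR and DESeq.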

          For bioinfosm's question, TMM is a normalization method used by the edgeR package. I think TMM is a kind of global normalization (not very sure).


          • #6
            TMM is the trimmed mean of M-values and is performed on the counts, not on the RPKM. It's a way to control for samples with different populations of RNA by sort of computing a "global fold change" between samples, using a trimmed mean as a scaling factor. If your samples are fairly similar to each other, you might not need it, but if you're worried about different populations of RNAs, TMM normalization might help. Then you would use the TMM-normalized read counts to compute differential expression.
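
A much simplified sketch of the idea, in plain Python: compute per-gene log2 ratios of library-proportions between a sample and a reference (the M-values), trim the extremes, and use the mean of what is left as a scaling factor. This is not edgeR's implementation; as far as I know, `calcNormFactors` in edgeR also trims on absolute expression and weights genes by precision, which is left out here. The function names and the toy counts are made up for the example.

```python
import math


def trimmed_mean(values, trim_fraction):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def simple_tmm_factor(sample, reference, trim_fraction=0.3):
    """Simplified TMM-like scaling factor of `sample` relative to `reference`.

    Unweighted and trimmed on M-values only (an assumption made to keep
    the sketch short).
    """
    n_sample = sum(sample)
    n_ref = sum(reference)
    m_values = [
        math.log2((s / n_sample) / (r / n_ref))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log ratio is undefined for zero counts
    ]
    return 2 ** trimmed_mean(m_values, trim_fraction)


# Toy data: five genes expressed at the same level in both samples, plus one
# gene that takes up half of the reads in `sample` only.
reference = [100, 200, 300, 400, 500, 10]
sample = [50, 100, 150, 200, 250, 760]

factor = simple_tmm_factor(sample, reference)
effective_library_size = factor * sum(sample)
print(factor, effective_library_size)
```

Both toy libraries have the same total, so plain library-size scaling would make the five shared genes look two-fold down in `sample`. Scaling the library size by the factor removes that apparent change.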


            • #7
              The normalization methods in DESeq and edgeR are meant to be fed with raw, integer counts. Please do not divide by transcript length before the DE analysis; it will screw up the whole method. For visualization purposes, you may want to divide the normalized counts by transcript length afterwards. (In DESeq, you get normalized counts by dividing the raw counts by the appropriate size factor.) However, think carefully about what to use as transcript length. The original idea of using the sum of all exon lengths was not that good (see, e.g., the cufflinks paper).
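
To make the "divide the raw counts by the size factor" step concrete, here is a minimal sketch of a median-of-ratios size factor calculation in plain Python. It follows the general idea behind DESeq's size factors but is not the package code; the function names and the toy count matrix are assumptions for the example.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix -- list of genes, each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(count_matrix[0])
    usable = [row for row in count_matrix if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalized_counts(count_matrix, factors):
    """Divide each raw count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in count_matrix]


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [50, 100], [0, 7]]
sf = size_factors(raw)
norm = normalized_counts(raw, sf)
print(sf)
print(norm[0])
```

The raw integer counts stay untouched as input for the test itself; the normalized values are only for inspection or plotting.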


              • #8
                Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?


                • #9
                  Originally posted by slny
                  Does TMM consider gene length? If not, how could I adjust the gene expression from the read count for each gene?
                  No, it doesn't, because it doesn't need to.

                  This is why I asked what you want to do with your data.

                  If you want to test for differential expression, you want to compare the expression of the same gene in different samples. As the gene has the same length in all your samples, there is no point in dividing by the gene length. You only mask the information on how precise your measurement is.

                  If you want to compare a gene with another gene, then you may want to divide by gene length, but you should be aware that such a comparison opens a whole new can of worms.
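
A small numeric illustration of both points, with made-up numbers and assuming the counts are already comparable between samples (same library size) and roughly Poisson: the gene length cancels out of a between-sample fold change, and two genes with the same per-kilobase value can be measured with very different precision.

```python
import math


def fold_change(a, b):
    return a / b


# Same gene in two samples: dividing by length changes nothing.
gene_length_kb = 2.0
count_sample_a = 100
count_sample_b = 50
raw_fc = fold_change(count_sample_a, count_sample_b)
per_kb_fc = fold_change(
    count_sample_a / gene_length_kb, count_sample_b / gene_length_kb
)
print(raw_fc, per_kb_fc)


def relative_poisson_sd(count):
    """Relative standard deviation of a Poisson count: sqrt(n) / n."""
    return 1 / math.sqrt(count)


# Two hypothetical genes with the same reads-per-kb value (10 per kb),
# but very different numbers of reads behind that value.
short_gene_count = 10    # 1 kb gene
long_gene_count = 1000   # 100 kb gene
short_rel_sd = relative_poisson_sd(short_gene_count)
long_rel_sd = relative_poisson_sd(long_gene_count)
print(short_rel_sd, long_rel_sd)
```

Once both genes are reported as "10 per kb", the fact that one of them rests on only 10 reads is no longer visible.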


                  • #10
                    Perfect explanation. Thanks a lot!

                    One more question. Should I log transform the count of reads before I normalize the data?


                    • #11
                      No.

                      By "normalize", do you mean using DESeq's and edgeR's normalisation methods? They expect raw, integer counts, see above.

                      Or do you mean dividing by transcript length? This does not make sense on the log scale, for obvious reasons.


                      • #12
                        If we use a Poisson or negative binomial distribution for differential analysis, then we should not log transform the counts, because these are discrete probability distributions.

                        Why do we use these discrete probability distributions in sequencing analysis, but the normal distribution in microarray data analysis? Could we log transform the mRNA-seq data and normalize the data with quantile normalization? If so, we could still use a t-test to select differentially expressed genes.


                        • #13
                          +1: Do not log-transform count data.


                          • #14
                            Originally posted by slny
                            Why do we use these discrete probability distributions in sequencing analysis, but normal distribution in microarray data analysis? Could we log transform the mRNA seq data and normalize the data with quantile normalization? If so, we can still use t test to select differentially expressed genes.
                            Cloonan et al, Nature Methods did exactly what you suggest. However, microarray data is fundamentally different, as expression is measured indirectly by fluorescence of probes and seems to behave normally on the log scale. For sequencing data this is not the case, i.e. when you log-transform Poisson-distributed counts, the result is not normally distributed. We actually tested the Cloonan method in our simulation for the TMM paper and it performed significantly worse than count-based methods, but I don't think that result made it into the paper.
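
One aspect of this can be checked numerically. The sketch below computes the exact variance of log2(X + 1) for Poisson counts at a few means. The pseudocount of 1 and the chosen means are assumptions for the example, not something from the papers mentioned. On the log scale the spread still depends strongly on the mean, which is not what a test assuming a common normal variance expects, and at low means the transformed values only take a handful of distinct levels.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k, evaluated in log space to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def log_scale_variance(lam):
    """Exact variance of log2(X + 1) for X ~ Poisson(lam)."""
    # Probability mass beyond this cutoff is negligible for the means used here.
    upper = int(lam + 20 * math.sqrt(lam) + 50)
    ks = range(upper + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    mean = sum(p * math.log2(k + 1) for k, p in zip(ks, probs))
    return sum(p * (math.log2(k + 1) - mean) ** 2 for k, p in zip(ks, probs))


for lam in (2, 20, 200):
    print(lam, round(log_scale_variance(lam), 4))
```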

                            One comment on RPKM. In my opinion, one would want to divide by gene length when looking at the absolute expression of a gene, i.e. comparing between genes rather than comparing between samples. However, to do a proper comparison between genes you really need to take into account other biases, such as sequence composition.


                            • #15
                              Maybe slightly off topic, but this is related to RNA-seq counting:
                              I always hear about RPKM, but to me, measuring gene expression by covered bases (and not by number of reads) looks more precise. Base counting instead of read counting is very easy (e.g. with the SeqMonk software), but it is so rarely mentioned that I'm wondering if it's OK for downstream applications.

                              BTW, for differential expression purposes, I use SeqMonk for harvesting raw data as follows: I select probes of interest (e.g. genes, mRNA or intergenic regions), I count data by bases (I do not correct for the total number of reads or for gene length, and I don't log transform), and then feed the raw data to DESeq or edgeR. Up to now it looks fine to me (at least with my limited experience). Any warnings?
                              Thanks for any comments!
