Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • xhuister
    Member
    • Apr 2010
    • 41

    How to get normalized count from DESeq?

    Hi all,

    I have different libraries of NGS and want to use DESeq to normalize them. I want to make the count comparable among different libraries, but how can I get the normalized count (in integer number but not float value)?

    In the DESeq mannual, there is code like this:
    cds <- newCountDataSet( countsTable, conds )
    cds <- estimateSizeFactors( cds )
    cds <- estimateVarianceFunctions( cds )
    res <- nbinomTest( cds, "T", "N")

    And about the normalization, I have sample1,2,3, each sample only has one replicate, if I want to compare sample1 and 2, should I normalize between 1 and 2, while if to compare 2 and 3, then again normalized between 2 and 3? Can I normalize sample 1,2,3 together then compare 1,2 or 2,3?


    Thanks!
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    You can divide your counts by the size factors:

    normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

    Now, the samples are on a common scale.

    Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

    Simon

    Comment

    • xhuister
      Member
      • Apr 2010
      • 41

      #3
      Originally posted by Simon Anders View Post
      You can divide your counts by the size factors:

      normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

      Now, the samples are on a common scale.

      Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

      Simon
      Thanks Simon, actually float is OK too, or I can round to interger.

      Comment

      • yifangt
        Member
        • Feb 2011
        • 61

        #4
        Normalized counts of common scale!

        Hi, Simon:

        I have a question related to your last post about the normalized counts of common scale. My RNA-seq data composes of seven conditions each with 2 replicates. That is a time series experiment for seed embryo development. I was wondering if I could get the normalized counts of the all the development stages with the common scale to compare the expression across all the time courses.
        I tried your code, and the result showed similar scale only within the two replicates of each stage, but between stages they are still quite different. Here is the normalized counts sum for each library. Note same letter indicates same stage, and 1, 2 indicate replicates.
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
        25581381.8, 26292834.7, 14023619.9, 12161472.3, 18581478.1, 18749724.2,	10861736.7, 11585558.2, 9996853.0, 10318368.5, 14130401.4, 16239989.6, 30725888.0, 27787233.7
        And the original reads sums for each library are:
        Code:
        Z1, Z2, O1, O2, G1, G2, H1, H2, T1, T2, B1, B2, M1, M2
        21012147, 19924212, 26002900, 9660245, 17139388, 7649319, 16430105, 20101956, 12920266, 6306742, 44241095, 20094409, 15166090, 23203758
        I assume the normalized sum for each stage should be close to each other too as for replicates within stage. Not sure what is the function DESeq uses to get the point. edgeR expects provides the common.lib.size for this purpose, and I assume both edgeR and DESeq share the idea. Am I right?
        Actually I tried edgeR for the common.lib.size, but I never got similar numbers across the different stages, which is another reason I post this problem here.

        Did I miss anything e.g. only two conditions are allowed?
        As most of the experiments are for two conditions I tried only sub-section of my data with only two conditions, the result looked the same. Is there anything wrong with my data? But the raw counts number looks fine and other analysis indicated the data is fine.

        I tried different normalization methods (in edgeR TMM, quantile and RLE), or use the sum of each condition without replicate, but still could not get to the common scale with the comparable counts number for each stage. I have posted this question in bioconductor group, not yet get any reply from edgeR group. Appreciate if you could possibly address this point.
        Thank you!

        Yifang

        Originally posted by Simon Anders View Post
        You can divide your counts by the size factors:

        normalizedCounts <- t( t(counts(cds)) / sizeFactors(cds) )

        Now, the samples are on a common scale.

        Of course, these are not integer values. However, I cannot quite see why you would want to have integer values.

        Simon
        Last edited by yifangt; 06-20-2011, 06:02 PM.

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          If you are suspicious that normalization might be off, the best is to compare two samples with an MA plot:

          Code:
          plot( 
             ( counts(cds)[,1] + counts(cds)[,2] )/2, 
             counts(cds)[,2] / counts(cds)[,1], 
             log="xy", pch="." )
          abline( h = sizeFactors(cds)[2] / sizeFactors(cds)[1] )
          If the size factors are fine, the horizontal line should disect the bulk of the points in the middle, especially at the high-expression end (right-hand side) of the plot.

          Comment

          • yifangt
            Member
            • Feb 2011
            • 61

            #6
            Re: Normalized counts between treatments.

            Thanks Simon!

            My suspicion is the normalization between the treatments rather than within treatment. I did the plotting for the cds from two replicates of the same treatment (i.e. within treatment, left image of the attachment) and one replicate each of two different treatments (between treatments, right image) as you suggested. The plot of the within treatment looks very fine, but the one between treatment does not. Is there anything wrong? This bugs me for a couple of days!

            Thanks again!

            Yifang
            Attached Files

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              It's really hard to see how many points are plotted there on top of each other in your plots. Maybe try adding color="#00000040" to the plot command to add some transparancy, or use the smoothScatter function. Apart from that, it looks find to me, the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.

              Comment

              • yifangt
                Member
                • Feb 2011
                • 61

                #8
                Re: Normalized counts between treatments.

                Originally posted by Simon Anders View Post
                It's really hard to see how many points are plotted there on top of each other in your plots. Maybe try adding color="#00000040" to the plot command to add some transparancy, or use the smoothScatter function. Apart from that, it looks find to me, the horizontal line cuts through the middle of the big blob. Of course, you have more genes at the far bottom (outside the blob) than at the far top, but probably your treatment simply causes more downregulation than upregulation.
                Thanks Simon! This is a big relief to me.
                The dots are concentrated across the line after I add the transparency to the plot. However, my colleague argues about the fact that the number of counts for each library should come to close numbers, about which edgeR states several times that "the total counts in each libray of the pseudocounts agrees well with the common library size" (page 27 & 44 of the user's guide). But I can't see if DESeq has similar parameters behind.
                After I check the sums of baseMeanA and baseMeanB as normalized counts for each library, the numbers are still quite different, which is the same as the result from edgeR. So,
                1) Does my data violate the assumption of both packages?
                2) Does the normalized library size of the conditions not matter if they are different?
                3) Is the result still meaningful even the normalized library sizes are different?
                4) What could probably be the reason(s) to cause the normalized library size so different?
                5) Should I remove the smaller number reads as some other people do? After I removed the smaller numbers of counts (<=40 in >=6 out of 14 samples), the normalized library sizes become very similar.
                I can feel my lack of mathematics for the packages. Your explanation would be greatly appreciated. Thank you again!
                Yifang
                Attached Files

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  Originally posted by yifangt View Post
                  However, my colleague argues about the fact that the number of counts for each library should come to close numbers, about which edgeR states several times that "the total counts in each libray of the pseudocounts agrees well with the common library size" (page 27 & 44 of the user's guide).
                  If we expected that the proper normalization factors were close to the ratios of the total numbers of reads, we could use the latter directly. However, the whole point of not using them and instead estimating them from the per-gene counts is that, sometimes, they do not agree, and then, the latter is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask you sceptical colleagues to read it. ;-)

                  Comment

                  • xguo
                    Member
                    • Jul 2008
                    • 48

                    #10
                    Can we get the normalized count for each exon using the same approach in the newly released DEXseq package? In the visualization graph generated by DEXSeq, there are normalized counts drawn for each exon. Are those value derived from dividing raw counts by size factors?

                    thanks for your reply.

                    Originally posted by Simon Anders View Post
                    If we expected that the proper normalization factors were close to the ratios of the total numbers of reads, we could use the latter directly. However, the whole point of not using them and instead estimating them from the per-gene counts is that, sometimes, they do not agree, and then, the latter is more reliable. Robinson and Oshlack have written a whole paper arguing this point, so maybe you want to ask you sceptical colleagues to read it. ;-)

                    Comment

                    • areyes
                      Senior Member
                      • Aug 2010
                      • 165

                      #11
                      Hi xguo,

                      Yes, they are derived in the same way. In DEXSeq, you can get them using the parameter normalized=TRUE in the function 'counts':

                      counts(ecs, normalized=TRUE)

                      Cheers!

                      Comment

                      • billstevens
                        Senior Member
                        • Mar 2012
                        • 120

                        #12
                        I have a question about normalization and downscaling:

                        I have 7 samples with the following number of aligned reads, based on samtools flagstat
                        Sample 1a: 50M, Sample 1b: 45M, Sample 2a: 95 M, Sample 2b: 45M, Sample 2c: 55M, Sample 3a: 115 M, Sample 3b: 30 M

                        When I use cds <- estimateSizeFactors, I get the following data:
                        Sample 1a: 0.8, Sample 1b: 0.7, Sample 2a: 1.55, Sample 2b: 0.75, Sample 2c: 0.90, Sample 3a: 1.92, Sample 3b: 0.9

                        The last sample of condition 3 has a high library size than I think it should. I went to a workshop and they suggested downscaling some data so they all match the number of reads. Is that what I should do? Does this data look peculiar?

                        Comment

                        • Simon Anders
                          Senior Member
                          • Feb 2010
                          • 995

                          #13
                          And what did they mean by "downscaling"?

                          Comment

                          • billstevens
                            Senior Member
                            • Mar 2012
                            • 120

                            #14
                            Hi Simon, I wasn't sure how they meant to do this, but the point was to essentially disregard some reads to get all of conditions to have roughly the same amount of reads before running DESeq. Have you not encountered this technique? Do you think it is a problem that one of my samples (Sample 3a) has 4X the depth of another (Sample 3b), but the size factor that DESeq outputs indicates only a twofold difference?

                            Comment

                            • bpb9
                              Member
                              • Aug 2012
                              • 24

                              #15
                              When I run DESeq and look at my dispersion estimates verus normalized expression, it seems that most genes have either very low or very high read counts even after normalization. I am normalizing data that consists of three time points (50 samples per time point). This looks very different from the dispersion plot in the manual. Is this an example where filtering out genes with very low read count would be a good idea? If not, is it worth it to normalize each time point separately? When I compare two samples with an MA plot , it seems fairly normal (though it is omitting about 6000 data points)
                              dispersion:
                              sample compare: http://tinypic.com/r/aadgxz/8

                              Edit: I ran it using cdsFilt and it looks much more like the example in the manual:
                              Last edited by bpb9; 11-20-2014, 09:23 PM. Reason: additional image, revised analysis

                              Comment

                              Latest Articles

                              Collapse

                              • GATTACAT
                                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by GATTACAT
                                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                                07-01-2026, 11:43 AM
                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 11:08 AM
                              0 responses
                              7 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-30-2026, 05:37 AM
                              0 responses
                              11 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              53 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...