  • denominator for normalization

    I'm hoping for some input from my statistically gifted brethren on this one:

    I have sixteen RNA-Seq libraries which were aligned with TopHat. Counts of reads mapping to RefSeq genes were generated with htseq-count. My statistician collaborators need to normalize these counts for differences in sequencing depth. Here are my choices for the denominator:
    1. Total number of reads in the raw data (wc -l on the file from the sequencer)
    2. Total number of lines in the TopHat SAM file (wc -l on accepted_hits.sam)
    3. Number of unique reads for which TopHat found at least one location to assign (sort | uniq | wc -l on sequence field from SAM file)
    4. Sum of counts across all genes within each library

    Does anyone have some feedback on this? The range of numbers for choice 1 above is 96396160-131352500.
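For anyone comparing choices 2 and 3, here is a minimal Python sketch of how those numbers could be pulled from a SAM file. It is an illustration only, not part of TopHat or htseq-count, and the function name `sam_denominators` is made up. It assumes an uncompressed, tab-separated SAM file. Two caveats: a plain `wc -l` also counts any `@` header lines, which this sketch skips, and `sort | uniq` on the sequence field counts distinct sequences rather than distinct reads, so the sketch reports distinct read names as well.

```python
def sam_denominators(sam_lines):
    """Tally candidate depth denominators from an iterable of SAM text lines.

    Hypothetical helper for illustration; not part of any aligner or counter.
    """
    alignment_lines = 0
    sequences = set()
    read_names = set()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # a SAM alignment record has 11 mandatory fields
        alignment_lines += 1
        read_names.add(fields[0])  # QNAME; mates of a pair share one name
        sequences.add(fields[9])   # SEQ; identical reads collapse to one entry
    return {
        "alignment_lines": alignment_lines,      # roughly choice 2, minus headers
        "unique_sequences": len(sequences),      # what sort | uniq on SEQ counts
        "unique_read_names": len(read_names),    # distinct aligned reads or pairs
    }
```

Called as `sam_denominators(open("accepted_hits.sam"))`, it returns the three tallies in one pass. A read that aligns to several locations adds several alignment lines but only one read name.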

    Any help will be much appreciated,

    Thanks,

    Shurjo

  • #2
    All of these suffer from the issue that a few strongly and differentially expressed genes can skew them. See the discussion in our paper and especially in Oshlack and Robinson's paper.

    Our DESeq package offers (via its function 'estimateSizeFactors') a simple way to get a robust number for the denominator, which is explained, e.g., here.
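The idea behind such a robust denominator can be sketched in a few lines of Python. This is a rough illustration of a median-of-ratios calculation, not the DESeq source code, and the function name and input layout (one row of raw counts per gene, one column per library) are assumptions. Each count is divided by that gene's geometric mean across libraries, and the size factor of a library is the median of its ratios, so a handful of extreme genes cannot move it much.

```python
import math
import statistics


def median_of_ratios_size_factors(counts):
    """Return one size factor per library.

    counts: list of rows, one per gene; each row holds raw counts,
    one per library. Hypothetical layout chosen for this sketch.
    """
    n_libraries = len(counts[0])
    ratios = [[] for _ in range(n_libraries)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # geometric mean would be zero, so skip this gene
        log_geo_mean = sum(math.log(c) for c in row) / n_libraries
        for j, c in enumerate(row):
            # ratio of this count to the gene's geometric mean
            ratios[j].append(math.exp(math.log(c) - log_geo_mean))
    if not ratios[0]:
        raise ValueError("no gene has nonzero counts in every library")
    return [statistics.median(r) for r in ratios]
```

As a made-up toy case, take `[[10, 20], [100, 200], [5, 10], [1000, 10]]`: the second library is sequenced twice as deep, and the last gene is strongly differential. The column totals are 1115 and 240, which point the wrong way, while the two size factors still differ by a factor of two.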

    Simon


    • #3
      Hi Simon,

      Many thanks for your reply. I read both your paper and Oshlack and Robinson's, and I agree with all of the points made therein. However, in the context of my data, the following points suggest to me that a simpler normalization strategy may be adequate:
      1. The sixteen libraries I referred to all come from the same tissue source (lymphoblastoid cell lines)
      2. This is a clinical study where the cells were not "induced" or "perturbed" with an external agent, so there is no expectation that a large number of genes will be differentially expressed between the two groups of 8.
      3. A priori, the chance that an appreciable number of transcripts are present in one or a few of these libraries but absent from the others is low.

      I understand that using TMM will be better in the vast majority of data sets. However, my objective here is simply to answer a question from my collaborating statisticians (who will not be using either edgeR or DESeq, but their own tests) as to what makes the best denominator for normalizing libraries for differences in coverage. Given this scenario, do you have any suggestions?
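For reference, the simple approach being asked about amounts to dividing each gene count by one per-library number and rescaling. A minimal Python sketch follows; the function name `counts_per_million` and the dictionary layout are assumptions for illustration, and it presumes that any non-gene summary rows in the count table have already been dropped. By default the denominator is the sum of gene counts (choice 4 in the first post), but any of the other choices can be passed in instead.

```python
def counts_per_million(gene_counts, denominator=None):
    """Scale raw gene counts to counts per million of a chosen denominator.

    gene_counts: dict mapping gene ID to raw count for one library.
    denominator: library-size estimate; defaults to the sum of gene counts.
    Hypothetical helper for illustration only.
    """
    if denominator is None:
        denominator = sum(gene_counts.values())
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return {
        gene: count * 1_000_000 / denominator
        for gene, count in gene_counts.items()
    }
```

Whichever denominator is chosen, it should be computed the same way for all sixteen libraries, so that the scaled values are comparable between them.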

      Once again, thanks for your help and congratulations on your paper.

      Shurjo


      • #4
        Hi Shurjo,

        I have been having a tough time thinking this one through as well, and I would appreciate any insight you gained in solving this problem. I too am torn between using the htseq-count total, the uniquely mapped reads from TopHat, or all the alignments generated by TopHat.

        Thanks for your help,

        Carmen
