  • DESeq2 finding differential expression changes with libraries of different sizes

    Hello;

    I am working with an RNA-Seq dataset with samples that have varying numbers of reads and am wondering how that will affect differential expression, and what is a generally acceptable difference between samples.

    Four groups of samples were barcoded and run on a single flow cell. Across the entire dataset, the difference between the largest and smallest sample read counts is about 5-fold (with size factors ranging from 0.34 to 3.2). Within each group the number of reads is similar (for the most part), but differences exist between the groups that we intend to compare. I'll use our first comparison as an example: in Group1 the samples have ~4 million reads each, whereas in Group2 they have >7 million reads each. The total number of genes detected also differs between the two groups.

    To assess expression changes I used DESeq2, but I am wondering whether normalizing with size factors is enough to account for this. Suppose GeneA was not detected in Group1 as a consequence of the small number of reads, but is lowly expressed in Group2. This gene would be identified as DE, although we don't know whether that is really the case.
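For readers unfamiliar with what a size factor does, here is a minimal sketch of the median-of-ratios idea behind this kind of normalization. It is an illustration in plain Python, not DESeq2's own code; the function name `size_factors` and the toy counts are made up for the example. Note that genes with a zero in any sample do not contribute to the estimate, and that dividing by a size factor cannot turn a zero count into a non-zero one.

```python
import math
import statistics

def size_factors(counts):
    """counts: dict mapping gene -> list of raw counts, one per sample.
    Returns one size factor per sample (median-of-ratios sketch)."""
    n_samples = len(next(iter(counts.values())))
    # Per-gene log geometric mean across samples. Genes with a zero in any
    # sample are skipped: their geometric mean is zero, so no ratio exists.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        # Log ratio of each usable gene's count to its geometric mean.
        log_ratios = [math.log(counts[g][j]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors

# Toy data (made up): sample 2 is sequenced at twice the depth of sample 1.
toy = {
    "geneA": [0, 3],      # undetected in the shallower sample
    "geneB": [10, 20],
    "geneC": [100, 200],
    "geneD": [50, 100],
}
sf = size_factors(toy)
normalized = {g: [c / f for c, f in zip(row, sf)] for g, row in toy.items()}
print(sf)                   # approximately [0.707, 1.414]
print(normalized["geneA"])  # the zero in sample 1 is still zero
```

In this toy case geneB, geneC and geneD come out equal after normalization, while geneA still reads as zero versus non-zero, which is exactly the situation the question describes.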

  • #2
    In a case such as that, the first thing I would do is include only the subset of genes that you actually detected in all samples in the comparison. As you indicated, you cannot argue that a failure to detect equates to the absence of expression, so you really should not even be considering such genes in your comparison.

    For my own analyses, the first thing I do after mapping reads is derive the subset of features that actually have a raw count greater than zero in all my samples. I only analyze that feature set for differential expression.
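A minimal sketch of that filtering step, assuming the raw counts are held in a plain dictionary of gene to per-sample counts (the function name `detected_in_all` and the data are hypothetical):

```python
def detected_in_all(counts):
    """Keep only genes whose raw count is greater than zero in every sample."""
    return {gene: row for gene, row in counts.items()
            if all(c > 0 for c in row)}

# Made-up raw counts: the first two columns are one group, the last two the other.
raw = {
    "geneA": [0, 0, 3, 5],     # never detected in the first group
    "geneB": [12, 9, 30, 25],  # detected everywhere
    "geneC": [1, 0, 2, 4],     # missing in a single sample
}
kept = detected_in_all(raw)
print(sorted(kept))  # ['geneB']
```

Only the genes that survive this filter would then be passed on to the differential expression test.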
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.


    • #3
      Removing features that have a zero count in one or more samples will leave me with a very small subset. DESeq2 already applies filters that remove genes with all-zero counts AND rows that have extreme count outlier samples, which reduces the feature set to less than half. Although I suppose that is one way to ensure that a lack of reads is not the reason for the observed change.


      • #4
        Have a look at a PCA plot and/or a hierarchical clustering plot and see whether the difference in library size is causing one or more samples to be obvious outliers. I've not seen that happen for ~5x size differences, but I certainly have for >=10x, and I wouldn't rule it out in any case.
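As a rough stand-in for that check, the sketch below computes sample-to-sample Euclidean distances on log2(normalized count + 1), which is the kind of distance matrix a hierarchical clustering would start from. It is a simplification: the log2(x + 1) transform, the helper names and the toy numbers are all assumptions for illustration, and a real analysis would normally use the package's own transformations and plotting.

```python
import math
from itertools import combinations

def sample_distances(counts, size_factors):
    """Euclidean distances between samples on log2(normalized count + 1)."""
    n = len(size_factors)
    # One transformed expression profile per sample.
    logged = [
        [math.log2(row[j] / size_factors[j] + 1) for row in counts.values()]
        for j in range(n)
    ]
    dist = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        dist[a][b] = dist[b][a] = math.dist(logged[a], logged[b])
    return dist

def mean_distance_to_others(dist):
    """Average distance from each sample to all the other samples."""
    n = len(dist)
    return [sum(row) / (n - 1) for row in dist]

# Made-up counts for four samples; the fourth has a different profile.
toy_counts = {
    "g1": [10, 12, 11, 40],
    "g2": [100, 95, 110, 20],
    "g3": [50, 55, 48, 52],
}
toy_sf = [1.0, 1.0, 1.0, 1.0]  # assumed size factors, all equal for simplicity

dist = sample_distances(toy_counts, toy_sf)
means = mean_distance_to_others(dist)
print(means.index(max(means)))  # index of the sample farthest from the rest
```

A sample whose mean distance to the others is much larger than the rest is a candidate outlier worth inspecting before trusting any DE calls that involve it.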


        • #5
          Originally posted by mistrm
          Removing features that have a zero count in one or more samples will leave me with a very small subset. DESeq2 already applies filters that remove genes with all-zero counts AND rows that have extreme count outlier samples, which reduces the feature set to less than half. Although I suppose that is one way to ensure that a lack of reads is not the reason for the observed change.
          Not to sound harsh, but to my mind, it is immaterial how much it reduces your feature set. The reality is that including DE calls for genes where one of the references is to a sample for which you actually have no data (failure to detect) is simply not valid. If you ran two qPCR reactions, and one worked giving valid data and the other did not and thus gave no data, would you include that gene in your results? Any genes you want to talk about as differentially expressed, you need to have an actual measure of expression for each sample in the comparison. There is a certain stochasticity in detection of low expressors, as those are inherently the rarer transcripts in your sample, so not having even detected anything in one sample makes any statement about differential expression relative to another highly suspect.
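To put a number on that stochasticity, here is a back-of-the-envelope sketch under a deliberately simple assumption: reads for a transcript follow a Poisson distribution with mean equal to sequencing depth times the transcript's share of the library. The share of 1 in 10 million is hypothetical, the two depths are the ones quoted in the original question, and real data carry extra biological variability that this model ignores.

```python
import math

def prob_zero_count(depth, fraction):
    """P(count == 0) if counts ~ Poisson(depth * fraction)."""
    return math.exp(-depth * fraction)

fraction = 1e-7  # hypothetical: transcript accounts for 1 in 10 million reads
for depth in (4_000_000, 7_000_000):
    print(depth, round(prob_zero_count(depth, fraction), 2))
# 4000000 0.67
# 7000000 0.5
```

Even at the higher depth, a transcript this rare would go undetected in about half of such samples under this model, so a zero in either group says little about whether the gene is actually expressed.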

          If dpryan's suggestion doesn't reveal any obviously aberrant samples, and you need a larger feature set, then you should add either more replicates or more reads per sample. Do you still have any material left that you could sequence further to increase read depth?
          Last edited by mbblack; 08-20-2014, 04:35 AM.
          Michael Black, Ph.D.
          ScitoVation LLC. RTP, N.C.


          • #6
            Originally posted by mbblack
            Not to sound harsh, but to my mind, it is immaterial how much it reduces your feature set.
            I couldn't agree more. The name of the game is not creating undue extra work and headaches for yourself.


            • #7
              I agree with you both. Though (just for the sake of discussion), instead of removing features that have a zero count in any sample across both groups, wouldn't it make sense to remove only features that have a zero count in Group1 (the group with the lower-depth samples)? For Group2, if there is a zero count, there are enough reads to conclude more reliably that the feature is a low expressor rather than a failure to detect. In particular, if there is increased expression of these low expressors in Group1, we would want to capture those changes.

              There is still material left, and we will likely sequence further, as that seems the best solution. Thanks for all the help!


              • #8
                Not to my mind. You cannot say anything about differential expression based on the absence of data, regardless of what you see in the other sample. Nor can you, to my mind, say that an absence of data, at any read depth, is equal to an absence of expression. There is simply far too much variability in low expressor detection to say that, regardless of read depth. Again, an absence of count data cannot be taken as an absence of a transcript nor absence of expression of that transcript.

                Typically, as you increase read depth, you see an ever-increasing accumulation of counts for transcripts already detected. Your probability of detecting very low expressors does not change all that much, and there will always be a low but persistent probability of detecting novel transcripts, relative to higher-count features, even at read depths of hundreds of millions of reads per sample.

                You say "if there is increased expression of these low expressors in Group1" but how can you say anything about relative expression (increased or decreased) if you do not have any actual data for that transcript in Group 2? All you know is you saw it in Group 1 and did not see it in Group 2, but you have no conclusive information about just why you did not see it in Group 2 (was it truly not expressed, or was it expressed and just missed due to the inherent vagaries of detection in every RNA seq experiment?).

                The only valid contrasts you can make are between samples/groups for which you actually have data in both. For those where you have no data in one group, all you can say is you detected gene "x" in one, and did not detect it in the other - that's it. To infer anything else about the relative relationship of the two groups is pure speculation, and one for which you do not have supportive data since you have no data at all for one group.

                If your goal is to truly demonstrate the absence of expression in one group, then RNA-seq was never the appropriate experiment to use in the first place.
                Last edited by mbblack; 08-20-2014, 08:07 AM.
                Michael Black, Ph.D.
                ScitoVation LLC. RTP, N.C.
