  • aoifemc
    Junior Member
    • Aug 2012
    • 4

    DESeq filtering

    I have a question about filtering prior to DE analysis using DESeq. Is it common practice to just use all raw count values for the conditions (i.e. disease/control), or would it be better to ensure that every gene being tested for differential expression has at least, say, 5 or 10 reads in each of the replicates within the condition groups? I understand that DESeq can account for low read counts, and I've found in my data that I get a larger number of significant genes when I filter, I guess because fewer tests are being conducted. However, I'm unsure which approach I should go with.
    I know this kind of question depends on the data itself and there may be no right or wrong answer, but I know very few people doing this kind of analysis with whom I can discuss it. In my data I've also found that not all of the genes that are significant with the unfiltered data are still significant when I filter (despite the increase in the number of genes reaching significance), so I'm just wondering what people consider best practice.
    Thanks for any advice.
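
For concreteness, the kind of count filter described in the question could be sketched roughly as below. This is plain Python for illustration, not DESeq code. The gene names, the counts, and the helper `keep_gene` are all made up, and the threshold of 5 is simply the figure mentioned in the question.

```python
# Toy count table (hypothetical values): gene -> raw read counts per sample.
# Samples 0-2 are control replicates, samples 3-5 are disease replicates.
counts = {
    "geneA": [0, 1, 0, 2, 0, 1],
    "geneB": [25, 31, 28, 60, 72, 65],
    "geneC": [12, 9, 15, 11, 4, 13],
    "geneD": [7, 8, 6, 5, 9, 10],
}


def keep_gene(sample_counts, min_count=5):
    """Return True if every replicate has at least min_count reads."""
    return all(c >= min_count for c in sample_counts)


# Keep only the genes that clear the threshold in every replicate.
kept = [gene for gene, row in counts.items() if keep_gene(row)]
print(kept)  # ['geneB', 'geneD']
```

Raising `min_count` to 10 would keep only geneB in this toy table, which shows how sensitive the result is to the chosen cutoff.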
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Have a look at this paper.

    • aoifemc
      Junior Member
      • Aug 2012
      • 4

      #3
      Great, thanks!

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Note also that we have recently expanded the DESeq vignette with a section discussing such filtering.

        • pinki999
          Member
          • Oct 2010
          • 37

          #5
          Can we carry out VST (variance stabilizing transformation) after the filtering step?

          • rfilbert
            Member
            • Dec 2012
            • 43

            #6
            pre-filtering is a bad idea. Why do people do it? Because the software can't handle large data?

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              Originally posted by rfilbert:
              "pre-filtering is a bad idea. Why do people do it? Because the software can't handle large data?"
              There's no benefit to performing tests on genes that have no chance of showing a difference because their counts are too low. Had you bothered to read the paper I referenced, you would have known that.

              • chadn737
                Senior Member
                • Jan 2009
                • 392

                #8
                Originally posted by rfilbert:
                "pre-filtering is a bad idea. Why do people do it? Because the software can't handle large data?"

                1) Specifically why is prefiltering "a bad idea"?

                2) It has nothing to do with not being able to handle "large data". Detection power drops as the number of genes tested grows, and this is true for ALL software. Prefiltering addresses the issue by removing genes that are unlikely to be called differentially expressed, which reduces the overall number of tests performed.
                Last edited by chadn737; 01-16-2013, 06:15 PM.
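
To make point 2 concrete, here is a rough sketch of the Benjamini-Hochberg adjustment, one common multiple testing correction. The p-values are invented for illustration and `bh_adjust` is a hypothetical helper written for this post, not a function from DESeq or any other package. The 95 "low-count" genes are assumed to have p-values of 1.0, an extreme but simple stand-in for genes that carry no usable signal.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values: 5 well-expressed genes with real signal,
# plus 95 low-count genes that cannot reach a small p-value.
expressed = [0.001, 0.004, 0.008, 0.012, 0.02]
low_count = [1.0] * 95

alpha = 0.05
n_unfiltered = sum(p < alpha for p in bh_adjust(expressed + low_count)[:5])
n_filtered = sum(p < alpha for p in bh_adjust(expressed))
print(n_unfiltered, n_filtered)  # 0 5
```

With all 100 tests the smallest adjusted p-value is about 0.1, so nothing passes at 0.05. With only the 5 expressed genes tested, all of them pass. Real data will be far less clear-cut than this toy case.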

                • pinki999
                  Member
                  • Oct 2010
                  • 37

                  #9
                  But, is it a good idea to filter before variance stabilizing?

                  • rfilbert
                    Member
                    • Dec 2012
                    • 43

                    #10
                    why pre-filter at all? It is only an opportunity for false negatives. I think the only reason for filtering is software that can't handle the whole genome.

                    • rfilbert
                      Member
                      • Dec 2012
                      • 43

                      #11
                      Originally posted by dpryan:
                      "There's no benefit to performing tests on genes that have no chance at showing a difference due to counts being too low. Had you bothered to read the paper I referenced, you would have known that."
                      Clearly you have little background or access to a real statistician. Hello World! You must filter out low abundance transcripts - they are clearly not important!

                      • chadn737
                        Senior Member
                        • Jan 2009
                        • 392

                        #12
                        Originally posted by rfilbert:
                        "why pre-filter at all? It is only an opportunity for false negatives. I think the only reason for filtering is software that can't handle the whole genome."
                        "Clearly you have little background or access to a real statistician. Hello World! You must filter out low abundance transcripts - they are clearly not important!"
                        Clearly you don't either, other than the salesman at Partek.

                        Prefiltering reduces False Negatives. When one prefilters, one is usually removing genes with zero or very few counts. At very low counts the shot noise can dominate, and all but the largest changes will not be called differentially expressed. As there will typically be hundreds, if not thousands, of genes with zero to only a few reads, this has a huge effect on multiple testing correction and can lead to a large number of False Negatives. Removing these genes relaxes the multiple testing correction so that more of the remaining genes pass the test.

                        Prefiltering can increase False Positives, but you said False Negatives, not False Positives. However, this can largely be mitigated if you filter using methods like those described in the PNAS paper linked to previously.

                        None of the programs for differential expression has a problem handling entire genomes. I have used DESeq and edgeR on genomes twice the size of the human genome with ease. Try using them before you blindly criticize.
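
The shot noise point can be put in numbers with a small sketch. It assumes counts follow a negative binomial model with variance `mu + dispersion * mu**2`, which reduces to pure Poisson sampling noise when the dispersion is 0. The function name `count_cv` and the dispersion value of 0.05 are illustrative choices, not values taken from any data set.

```python
import math


def count_cv(mean_count, dispersion=0.0):
    """Coefficient of variation (sd / mean) for a count with the given mean.

    Assumes variance = mean + dispersion * mean**2, so the squared CV is
    1/mean (shot noise) plus dispersion (biological variability).
    """
    return math.sqrt(1.0 / mean_count + dispersion)


for mu in (1, 4, 25, 100, 10000):
    poisson_only = count_cv(mu)
    with_biology = count_cv(mu, dispersion=0.05)
    print(f"mean={mu:>5}  shot-noise CV={poisson_only:.3f}  total CV={with_biology:.3f}")
```

At a mean of 4 reads the shot noise alone gives a CV of 0.5, so only very large fold changes stand out from sampling noise. At high counts the shot noise term becomes negligible and the dispersion term sets the floor.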

                        • chadn737
                          Senior Member
                          • Jan 2009
                          • 392

                          #13
                          Originally posted by pinki999:
                          "But, is it a good idea to filter before variance stabilizing?"
                          I'm not sure. What do you want to do with the variance-stabilized data? Since filtering is usually done to increase detection of differential expression, there may not be any advantage in doing it for other purposes.

                          • rfilbert
                            Member
                            • Dec 2012
                            • 43

                            #14
                            Originally posted by chadn737:
                            "Clearly you don't either, other than the salesman at Partek.

                            "Prefiltering reduces False Negatives. When one prefilters, one is usually removing genes with zero or very few counts. At very low counts the shot noise can dominate, and all but the largest changes will not be called differentially expressed. As there will typically be hundreds, if not thousands, of genes with zero to only a few reads, this has a huge effect on multiple testing correction and can lead to a large number of False Negatives. Removing these genes relaxes the multiple testing correction so that more of the remaining genes pass the test.

                            "Prefiltering can increase False Positives, but you said False Negatives, not False Positives. However, this can largely be mitigated if you filter using methods like those described in the PNAS paper linked to previously.

                            "None of the programs for differential expression has a problem handling entire genomes. I have used DESeq and edgeR on genomes twice the size of the human genome with ease. Try using them before you blindly criticize."
                            Um, if you filter out a gene that is truly differentially expressed, that is a false negative. Are you a statistician? Seems you are not.

                            • chadn737
                              Senior Member
                              • Jan 2009
                              • 392

                              #15
                              Originally posted by rfilbert:
                              "Um, if you filter out a gene that is truly differentially expressed, that is a false negative. Are you a statistician? Seems you are not."
                              I see you aren't getting it.

                              Genes with very few reads, of which there can be hundreds or thousands, are unlikely to be called as differentially expressed, unless the differences are very large.

                              On the other hand, because of multiple testing correction, many genes of higher expression will not pass the threshold of being considered differentially expressed.

                              Filtering will result in a handful of genes with low expression that are differentially expressed being discarded, but will allow for many more genes of higher expression to pass the threshold of being differentially expressed.

                              So overall MORE genes are called differentially expressed and the overall number of false negatives is decreased.

                              You should take a day or two to actually read some papers on the matter, like the linked PNAS paper and the DESeq, edgeR, etc. papers and vignettes. These all explain exactly what they do so that nothing is hidden, unlike the black boxes that you are putting your data into.
                              Last edited by chadn737; 01-16-2013, 06:15 PM.
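
A minimal sketch of a filter that never looks at the condition labels is below: genes are ranked by their total count across all samples and the lowest fraction is dropped before testing. The helper `filter_by_total_count`, the toy counts, and the `drop_fraction` of 0.4 are all assumptions made for illustration. This is not DESeq code, and a sensible cutoff depends on the data.

```python
def filter_by_total_count(counts, drop_fraction=0.4):
    """Drop the lowest-expressed fraction of genes, ranked by total count.

    The ranking uses only the summed counts over all samples, so it does
    not depend on which sample belongs to which condition.
    """
    ranked = sorted(counts, key=lambda gene: sum(counts[gene]))
    n_drop = int(len(ranked) * drop_fraction)
    dropped = set(ranked[:n_drop])
    # Preserve the original gene order among the genes that are kept.
    return [gene for gene in counts if gene not in dropped]


# Hypothetical counts: two control samples followed by two disease samples.
counts = {
    "g1": [0, 0, 1, 0],
    "g2": [3, 2, 0, 1],
    "g3": [50, 40, 90, 110],
    "g4": [200, 210, 190, 205],
    "g5": [10, 12, 30, 28],
}

kept = filter_by_total_count(counts)
print(kept)  # ['g3', 'g4', 'g5']
```

Only the genes in `kept` would then go on to testing and multiple testing correction, so the two lowest-count genes no longer add to the number of tests.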
