Announcement

Collapse
No announcement yet.

DESeq and independent filtering

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq and independent filtering

    Hi everyone

    I know this topic has been up a few times, but yet there is a question. So the basic idea about filtering is that it is done unsupervised to remove genes that are too lowly expressed to become significant. This will in turn reduce the number of tests made and therefore improve the multiple testing correction of Benjamin-Hochberg. This is basically what Bourgon 2010 finds.

    In the DESeq vignette, it is suggested to filter on the sum of reads for each gene and remove the ones in the 40% bottom quantile. It is here the question comes. Those 40% seem rather arbritary to me and must depend on the data set.

    So the question is if it statistically sound to just iterate through say 20-60% cut-off and determine, which yields the best statistics and just use that?

    Thanks

  • #2
    And another, but not really related question.

    I have three different cases where I need to compare data and the question for all three cases is really, if I should be using variance stabilized data or just sizeFactor normalized and therefore scaled data.

    Perhaps the first and most important question is whether the scaled data or the variance stabilized data best reflect the expression level of any given gene?
    1. The first comparison is just Control versus Treatment A in a single experiment - the data is replicated. This could for example be a scatterplot. I think the correct way of doing way of doing this is:
    A) Get the mean of the variance stabilized data (reverse the in-built log2 in DESeqs function for VST) for Control replicates and Treatment A replicates.
    B) Normalize the means pr length (reads pr kilobase) and log2 transform the data

    2. The second experiments has Control and Treatment A and Treatment B with replicated data. In this study I would like to compare the effects of Treatment A to Treatment B. Im interested in see what the effect of treatment B is on the genes that are significantly regulated by treatment A.
    To do this im making separate table of counts, Control vs A and Control vs B. Then I'm getting the variance stabilized data and comparing the moderated fold change for genes that a significant in Control vs Treatment A.

    3. The third setup has Control A vs Treatment A and Control B vs Treatment B, again I have replicated data. In this study I would like to compare the two experiments - can treatment B vs Control B exert similar changes as treatment A over control A?.
    To do this im using an analogous procedure as 2). using the variance stabilized data and comparing the moderated fold change between the conditions.
    Last edited by JesperGrud; 01-16-2013, 07:19 AM.

    Comment


    • #3
      Originally posted by JesperGrud View Post
      In the DESeq vignette, it is suggested to filter on the sum of reads for each gene and remove the ones in the 40% bottom quantile. It is here the question comes. Those 40% seem rather arbritary to me and must depend on the data set. So the question is if it statistically sound to just iterate through say 20-60% cut-off and determine, which yields the best statistics and just use that?
      Dear Jesper

      Have a look at the vignette Diagnostic plots for independent filtering of the genefilter package of Bioconductor, in particular the right hand panel of Fig.2, where this idea is explored.

      In principle, such parameter scanning can introduce bias (i.e. overly optimistic reported p-values or FDR). My belief is that for typical setups, this bias is negligibly small, so the suggested procedure is safe. It would be an interesting project, and perhaps make a little follow-up paper to Richard Bourgon's, to show this e.g. with simulations (and/or supported with some analytical computations).

      Regarding your second post - I was not sure I could parse it, best maybe would be to sit together with a local statistician who is familiar with linear regression.

      Best wishes
      Wolfgang
      Wolfgang Huber
      EMBL

      Comment


      • #4
        Originally posted by JesperGrud View Post
        And another, but not really related question.
        I have three different cases where I need to compare data and the question for all three cases is really, if I should be using variance stabilized data or just sizeFactor normalized and therefore scaled data.
        If the aim is finding differentially expressed genes, i.e. genes whose level is affected by experimental factors, then please use the NB-GLM approach, and there is no need for variance-stabilising transformation (VST).

        If the aim is estimating actual gene abundance, then the size factor adjusted count values are believed to be proportional to the target molecule abundance (in expectation, i.e. up to stochastic sampling effects), however the proportionality factor is gene-dependent, and values for different genes are not easily compared, though many people think FPKM are good for that.

        For plotting and visualisation (e.g. heatmaps) I tend to find VST very useful.
        Wolfgang Huber
        EMBL

        Comment


        • #5
          Originally posted by JesperGrud View Post
          And another, but not really related question.

          I have three different cases where I need to compare data and the question for all three cases is really, if I should be using variance stabilized data or just sizeFactor normalized and therefore scaled data.

          Perhaps the first and most important question is whether the scaled data or the variance stabilized data best reflect the expression level of any given gene?
          1. The first comparison is just Control versus Treatment A in a single experiment - the data is replicated. This could for example be a scatterplot. I think the correct way of doing way of doing this is:
          A) Get the mean of the variance stabilized data (reverse the in-built log2 in DESeqs function for VST) for Control replicates and Treatment A replicates.
          B) Normalize the means pr length (reads pr kilobase) and log2 transform the data

          2. The second experiments has Control and Treatment A and Treatment B with replicated data. In this study I would like to compare the effects of Treatment A to Treatment B. Im interested in see what the effect of treatment B is on the genes that are significantly regulated by treatment A.
          To do this im making separate table of counts, Control vs A and Control vs B. Then I'm getting the variance stabilized data and comparing the moderated fold change for genes that a significant in Control vs Treatment A.

          3. The third setup has Control A vs Treatment A and Control B vs Treatment B, again I have replicated data. In this study I would like to compare the two experiments - can treatment B vs Control B exert similar changes as treatment A over control A?.
          To do this im using an analogous procedure as 2). using the variance stabilized data and comparing the moderated fold change between the conditions.
          My big problem with these overly simplistic filtering approaches can be illustrated with your 3rd experiment design, where you have 3 categories: Control, Treatment A, and Treatment B. Assuming a balanced design, 33% of your samples are in each group. If both Treatments A & B suppress a gene completely that is expressed in Control, only 33% of the samples will have reads for that gene and it will likely be filtered out by a 40% rule as recommended in the DESeq vignette. Of course the problem is much more extreme when you have more treatment groups, or an unbalanced design where one treatment group has only a small percentage of the samples.

          Thus, I don't use the total # of reads for a gene (or equivalently the average across samples), I use the maximum # of reads in any sample. If any sample has a significant # of reads, I don't want to exclude that gene. This approach is still unsupervised (which I agree is needed), but avoids the potential false negatives of the other approaches you cited.

          Comment


          • #6
            Originally posted by rfilbert View Post
            .... and it will likely be filtered out by a 40% rule as recommended in the DESeq vignette. Of course the problem is much more extreme when you have more treatment groups, or an unbalanced design where one treatment group has only a small percentage of the samples.

            Thus, I don't use the total # of reads for a gene (or equivalently the average across samples), I use the maximum # of reads in any sample. If any sample has a significant # of reads, I don't want to exclude that gene. This approach is still unsupervised (which I agree is needed), but avoids the potential false negatives of the other approaches you cited.
            Dear Rfilbert

            yes, it is good practice to evaluate different filter criteria - including the one you mention - and see what works best for your experimental design and assay.

            Btw, the situation that you mention would be picked up by the diagnostic plot suggested in the DESeq vignette (Fig.9, and attached here to this post). When there are such genes as those that you worry about, they show up at the top left of the plot. One can use this plot to assess the performance of the different filter criteria.

            The value of 40% used in the DESeq vignette was chosen based on this plot. The value is very data set dependent.
            Attached Files
            Last edited by Wolfgang Huber; 01-19-2013, 08:58 AM. Reason: Clarification
            Wolfgang Huber
            EMBL

            Comment


            • #7
              Thank you for the answers, Wolfgang. It was exactly the genefilter package and the plot you are referring to, which was my inspiration for iterating through cut-offs.

              Concerning rfilbert's idea. I'm not fussed about p-values in such setup and therefore in these cases I would not do prefiltering, instead I would compare differences in expression over control. Alternatively, if the p-values are important I would go for a treatment A versus treatment B setup rather than comparing treat A vs control to treat B vs control.

              I have a few follow up question on the VST versus normalized read debate, exactly to the point what I just said concerning rfilberts post.

              Imagine an experimental setup such as the second example i proposed originally. Say you have an untreated sample (Control), a sample treated for 2 hours (Treatment A) and a sample treated for 12 hours (Treatment B). You are interested in comparing the effects of treatment A with treatment B.
              The way I'm imagining this could be done is to use DESeq to do independent pair-wise analyses: Create a single table of counts containing the three conditions and then do DESeq on Treatment A vs Control and Treatment B vs Control. Extract the variance stabilized data from each comparison and compare (cluster on) fold changes. From there you can detect significant different clusters. In this case I think I need to use the variance stabilized data (homoscedastic)? as I might be comparing a low changed gene with a highly changed gene. You could considering having some filter (however not sure what you should filter on) and such your might get a massive cluster of unchanged genes

              Another setup concerns integrative analysis of multiple types of NGS data. Consider an example where you have RNA-Seq for control and treatment and you have ChIP-Seq for say 4 different factors. You have identified peaks and assigned them to genes (the troubles of doing this is not important right now). The aim of your analysis is to examine if any factor (or any subpopulation of peak for any given factor) correlates with changes in expression.
              The way I'm imagining this could be done is very similar to the above analysis. I would again use the variance stabilized data and compare the mean of fold changes of genes with peaks to the mean of fold changes without peaks. The variance stabilized data is necessary in this case as the t.test is parametric (alternatively unstabilized data could be used and I could make use of the wilcoxon test?). It could also potentially be interesting to see if the factor itself correlates better with induction or repression. This could be done by comparing the mean of the moderated fold change of the genes with a peak to 0

              I guess what I'm getting at is that it seems to me that any statistical testing downstream of the differential expression benefits from using the variance stabilized data. Of course I may be completely wrong about this

              Best
              Jesper
              Last edited by JesperGrud; 01-19-2013, 09:12 AM.

              Comment


              • #8
                I can't see any harm in effective variance stabilization, and big benefits to it.

                Comment


                • #9
                  confusing about the Fig.1 in the filter paper

                  Hello, everyone,

                  I am learning to use DESeq to do RNAseq dataanalysis, and I would like to do independent filtering first, and I would like to decide the filter criteira first based on my own data. I read the paper "Diagnostics for independent filtering", but I am confused about the Fig.1 in this paper. Actually, I do not understand why this results imply that "by dropping 40% genes with lowest filterstat, we do not loose anything substantial from our subsequent results."

                  The code is:
                  plot(rank(filterstat)/length(filterstat),-log10(pvalue),pch=16,cex=0.45)

                  I attached my independent filtering results, is anybody can help me how to interpret these results?

                  Thank you very much!

                  Best,

                  Sadiexiaoyu
                  Attached Files

                  Comment


                  • #10
                    confusing about the Fig.1 in the filter paper

                    Hi,
                    I think I am a little understand of the Fig.1, but I am still confused that why the author choose (unadjusted) p-value 0.003 as a threshold? Could anyone help me? Thanks!

                    Comment


                    • #11
                      about this fig

                      Originally posted by Wolfgang Huber View Post
                      Dear Rfilbert

                      yes, it is good practice to evaluate different filter criteria - including the one you mention - and see what works best for your experimental design and assay.

                      Btw, the situation that you mention would be picked up by the diagnostic plot suggested in the DESeq vignette (Fig.9, and attached here to this post). When there are such genes as those that you worry about, they show up at the top left of the plot. One can use this plot to assess the performance of the different filter criteria.

                      The value of 40% used in the DESeq vignette was chosen based on this plot. The value is very data set dependent.
                      Hi, Wolfgang,
                      I am confused why choose "(unadjusted) p-value less than 0.003"? could you tell me why choose 0.003?
                      Thank you!

                      Comment


                      • #12
                        Dear Sadie Xiaoyu

                        there is nothing special about "0.003". The sentence just describes what can be seen in the plot in that vignette. Your filter_plot.pdf looks good, and it indicates that if you choose, for example, a cutoff of 30% (on the ranks), no genes with a p-value less than 10^-4 will be lost.

                        As with most cutoffs, there are no hard or objective rules on how to choose this cutoff, it depends on your aims and choices.

                        Best wishes
                        Wolfgang
                        Wolfgang Huber
                        EMBL

                        Comment


                        • #13
                          Originally posted by Wolfgang Huber View Post
                          Dear Sadie Xiaoyu

                          there is nothing special about "0.003". The sentence just describes what can be seen in the plot in that vignette. Your filter_plot.pdf looks good, and it indicates that if you choose, for example, a cutoff of 30% (on the ranks), no genes with a p-value less than 10^-4 will be lost.

                          As with most cutoffs, there are no hard or objective rules on how to choose this cutoff, it depends on your aims and choices.

                          Best wishes
                          Wolfgang
                          Dear Wolfgang,

                          Thank you for your reply! I would like to keep genes with p-value less than 0.05, so according to my result, if I cut around 10%, then no genes with a p-value less than 0.05 (10^-1.3) will be lost. Thank you very much!

                          Best,

                          Sadiexiaoyu

                          Comment


                          • #14
                            An interesting aspect of filtering your count data sets prior to DE testing stems from the aligners themselves. At best I think benchmarking tools report aligners to have somewhere between 98% and 99% true positive rates where a true positive is a simulated read with a known alignment location aligning to that location (within some acceptable window usually). I've run many benchmarking tests myself between many of the most popular aligners and that's what I see as well. In fact the Tophat2 paper suggets they see a similar rate for their own aligner...even with error free reads.

                            So yeah, 98-99% sounds pretty good. But does anybody look at what that 1-2% of your aligned reads are aligning to? I've been interested in this question myself. Even if I simulate reads from a transcriptome, with illumina like error rates for base calls, I at best see maybe 98.5% precision (ratio of correctly aligned reads to incorrectly aligned). that other 1.5% is actually spread all over the place. When I extract only the misaligned reads and generate gene counts I see a lot of low-count features. One pair of values I can remember is for about 2400 mis-aligned reads I got counts for about 2100 different features. So they are spread out..but quite unevenly and considering I only simulated 100,000 reads it is surprising to find that many of these misaligned reads contribute more than 10 hits to some features while most only add counts between 1 and 5 or so. But that's still 1-5 counts per million mapped reads! So this makes me think about downstream DE analysis a little differently. While there's understood sequencing "noise" and biological variation there's also some variance between which aligner and counting method you use that can actually sway gene counts by a good amount...enough to make genes that should not have counts appear to be present. The best way I can think of resolving the issue is to see what happens with your count data for a sample if you use different alignment and counting approaches. Obviously that's kind of time consuming - but it would give you an interesting view of which low-count features may actually just be false positives from that nearly 2% of your aligned data that's aligned incorrectly.

                            Just something to think about...if you've got 100,000,000 reads in your sample there could be as many as 2,000,000 incorrectly aligned reads. That's 20% the size of maybe the minimum amount of data you need to be able to do a differential expression test (given proper sample replication) and its well more than enough additional counts into incorrect features to generate a lot of noise in the system.
                            Last edited by sdriscoll; 06-01-2013, 12:43 PM.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #15
                              sdriscoll, thanks for this comment and the work you describe there. I have been thinking a bit about this "quantification noise" as I call it and am pretty convinced that mapping and counting can introduce a fair amount of bias. (this comes from comparing reported counts and FPKM from public data sets with results after my own mapping and counting)

                              Comment

                              Working...
                              X