Announcement

Collapse
No announcement yet.

DESeq and independant filtering

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq and independant filtering

    Dear all,
    I have applied DESeq with independant filtering on a 454 dataset with 2 conditions, and no replicate. As the first aim was transcriptome annotation (sp without reference genome), we used 2 contrasted conditions in order to increase the range of expression. However, as additional objective, we would like to explore differential expression as a preliminary survey (which will be followed by a RNAseq on Illumina Hiseq2000 data, WITH replication).

    I have 3 questions:

    1. I think I do not understand clearly how to decide on the proportion of genes to be filtered (0.4 in the example). What is it based on exactly ?

    2. I used two ways to do what I think is equivalent in terms of analysis, yet obtained (slightly) different results. First, I applied model comparison nbinomGLMTest (counts~factor condition vs counts~1) to get adjusted p-values corresponding to the difference between conditions. I did that for the whole dataset and for the filtered one, and compared results (in terms of padj<.1).

    Second, as I only have this factor with two levels and no replicate, I did 2 separate nbinomTest analyses, i.e., on the whole dataset, and on the filtered one, respectively, and compared the corresponding adjusted p values. I was surprised to get different results. Is it simply because I did not use exactly the same size factors for filtered data ? (in the first case, I used filtered data, after pvalues were calculated on normalized data, whereas in the second case, size factors were estimated on the filtered dataset, before pvalues were calculated).

    3. As I got only 3 extra DE genes thanks to filtering, I am wondering if dispersion was estimated correctly with:
    cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
    Pvalues before BH adjustment were already much higher than with a Fisher exact test, and this surprised me a little.


    Thanks in advance for any help on these topics,
    Agnès

    (see R listing below)
    > fin<-read.table("lsdenext.txt", header=T, row.names=1)
    > attach(fin)
    > #26 avril 2012#
    > conds<-factor(c("nb_readC", "nb_readD"))
    > conds
    [1] nb_readC nb_readD
    Levels: nb_readC nb_readD

    > cdstot<-newCountDataSet(fin, conds)
    > cdstot<-estimateSizeFactors(cdstot, locfunc=shorth)
    > sizeFactors(cdstot)
    nb_readC nb_readD
    1.1238156 0.8903734

    > cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")

    > restot<-nbinomTest(cdstot, "nb_readC", "nb_readD")
    > head(restot)
    id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
    1 AB070347.p.ls.1 0.4449128 0.8898257 0.000000 0.0000000 -Inf 1.0000000 1
    2 AB083656.p.ls.1 8.5916778 11.5677341 5.615622 0.4854556 -1.0425887 0.5329552 1
    3 AB159144.p.ls.1 31.3472863 43.6014592 19.093113 0.4379008 -1.1913240 0.2058511 1
    4 AB159145.p.ls.1 2.6911615 0.8898257 4.492497 5.0487385 2.3359230 0.4856913 1
    5 AB159147.p.ls.1 2.1295993 0.8898257 3.369373 3.7865539 1.9208855 0.6508927 1
    6 AB159148.p.ls.1 22.0258009 20.4659910 23.585611 1.1524294 0.2046784 0.8848194 1

    > rs<-rowSums(counts(cdstot))
    > use<-(rs>quantile(rs, 0.4))
    > table(use)
    use
    FALSE TRUE
    6799 10194

    > cdsfil<-cdstot[use,]

    > fitfilcond<-fitNbinomGLMs(cdsfil,count~conds)
    ..........
    > fitfil0<-fitNbinomGLMs(cdsfil,count~1)
    ..........
    > pvalfilt<-nbinomGLMTest(fitfilcond, fitfil0)
    > padjfilt<-p.adjust(pvalfilt, method="BH")


    > fittotcond<-fitNbinomGLMs(cdstot,count~conds)
    ................
    > fittot0<-fitNbinomGLMs(cdstot,count~1)
    ................
    > pvaltot<-nbinomGLMTest(fittotcond, fittot0)
    > padjtot<-p.adjust(pvaltot, method="BH")
    >


    > padjfiltForCompar=rep(+Inf, length(padjtot))
    > padjfiltForCompar[use]=padjfilt
    > tab=table(`nofi`=padjtot<.1, `fi`=padjfiltForCompar<.1)
    > addmargins(tab)
    fi
    nofi FALSE TRUE Sum
    FALSE 16987 3 16990
    TRUE 0 3 3
    Sum 16987 6 16993


    > cdsfil<-estimateSizeFactors(cdsfil, locfunc=shorth)
    > sizeFactors(cdsfil)
    nb_readC nb_readD
    1.1203877 0.8924126

    > cdsfil<-estimateDispersions(cdsfil, method="blind", sharingMode="fit-only", fitType="local")

    > resfil<-nbinomTest(cdsfil,"nb_readC", "nb_readD")
    > head(resfil)
    id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
    1 AB083656.p.ls.1 8.602958 11.6031267 5.602790 0.4828690 -1.0502964 0.5273125 1
    2 AB159144.p.ls.1 31.392174 43.7348621 19.049485 0.4355675 -1.1990317 0.1938215 1
    3 AB159145.p.ls.1 2.687390 0.8925482 4.482232 5.0218372 2.3282153 0.4790481 1
    4 AB159147.p.ls.1 2.127111 0.8925482 3.361674 3.7663779 1.9131778 0.6544218 1
    5 AB159148.p.ls.1 22.030163 20.5286087 23.531717 1.1462889 0.1969707 0.8889670 1
    6 AB159149.p.ls.1 9.723516 11.6031267 7.843906 0.6760165 -0.5648695 0.7486689 1

    > hist(resfil$padj)
    > min(resfil$padj)
    [1] 0.03376218

    > comp3=rep(+Inf, length(restot$padj))
    > comp3[use]=resfil$padj
    > tab<-table(`nofi`=restot$padj<.1,`fi`=comp3<.1)

    > addmargins(tab)
    fi
    nofi FALSE TRUE Sum
    FALSE 16990 2 16992
    TRUE 0 1 1
    Sum 16990 3 16993

  • #2
    Originally posted by coutellec View Post
    Second, as I only have this factor with two levels and no replicate, I did 2 separate nbinomTest analyses, i.e., on the whole dataset, and on the filtered one, respectively, and compared the corresponding adjusted p values. I was surprised to get different results.
    the answer to this one is pretty straightforward. p-value correction for multiple testing is almost directly proportional to the number of tests (in this case the number of genes in the DE test). so if you used two lists with different numbers of genes you should expect to see different adjusted p-values.

    in regard to p-values for a 1 sample verses 1 sample test these are a bit of a gift. it's not a simple task (and possibly not even a logical request) to produce a p-value for a N of 1 test. when I run 1 vs 1 DE tests I take those p-values pretty lightly. What's best is to look at the log2 fold changes and see what sort of genes come up to the top of the list. it's a lot of work but to find those with biological significance you're going to have to spend time going through that list anyways.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      Thank you very much for your answers. I will look at the highest foldchange values and try to get something out of them.

      Actually, the difference that worried me was between the two procedures :
      1. model comparison using "nbinomGLMTest" (the two models compared are based on full and filtered data, respectively)

      vs

      2. two independant tests using "nbinomTest", on full and filtered data, respectively.

      Any, it seems related to the size factors, and this could be easily checked.
      Thanks again,
      Agnès

      Comment


      • #4
        Originally posted by coutellec View Post
        1. I think I do not understand clearly how to decide on the proportion of genes to be filtered (0.4 in the example). What is it based on exactly ?
        Dear Agnes

        Independent filtering is intended to remove those genes whose counts overall (throughout the dataset, irrespective of class label) are so small that they would have a negligible chance of being detected as differentially expressed. What this fraction is depends on your data. Have a look at the article by Bourgon et al. for explanation: http://www.pnas.org/content/107/21/9546.long
        You might also find the diagnostic plots shown in Section Independent filtering in the DESeq vignette useful.

        Originally posted by coutellec View Post
        2. I used two ways to do what I think is equivalent in terms of analysis, yet obtained (slightly) different results. First, I applied model comparison nbinomGLMTest (counts~factor condition vs counts~1) to get adjusted p-values corresponding to the difference between conditions. I did that for the whole dataset and for the filtered one, and compared results (in terms of padj<.1).
        It is not equivalent. See sdriscoll's reply. The whole point of independent filtering is to alleviate the multiple testing problem and to increase experiment-wide power.

        OTOH, as sdriscoll also points out, with no replicates, the p-values are necessarily based on model assumptions which need not have much to do with reality. You really need replication so that DESeq has a chance to estimate the real biological variability in your data.

        Originally posted by coutellec View Post
        3. As I got only 3 extra DE genes thanks to filtering, I am wondering if dispersion was estimated correctly with:
        cdstot<-estimateDispersions(cdstot, method="blind", sharingMode="fit-only", fitType="local")
        Pvalues before BH adjustment were already much higher than with a Fisher exact test, and this surprised me a little.
        Please explain what you mean by 'correctly'. There is no way any method can estimate the true dispersion from data with no replicates. You really need replication before going into any serious discussion of whether dispersion estimates are good enough for the task at hand.
        Wolfgang Huber
        EMBL

        Comment


        • #5
          Thanks for explanations. The dataset is indeed very rich in low count genes (454 pyrosequencing), and I might have to filter a lot of them.
          Regarding dispersion estimation without replicate, I understand we have to treat samples as if they were replicates. By "correctly", I meant that I simply followed the procedure of DESeq for data with no replicates, and I thought there might be alternatives to blind and local fit options (smthg like a hope it could improve the outcome). Anyway, I fully agree on the critical need for replication (and don't believe in miracles...).
          Thanks again,
          Agnès

          Comment

          Working...
          X