Hi everyone
I know this topic has come up a few times, but I still have a question. The basic idea of independent filtering is that it is done in an unsupervised way to remove genes that are too lowly expressed to ever become significant. This reduces the number of tests performed and therefore improves the Benjamini-Hochberg multiple testing correction. This is essentially what Bourgon et al. (2010) show.
The DESeq vignette suggests filtering on the sum of counts for each gene and removing the genes in the bottom 40% quantile. This is where my question comes in: that 40% seems rather arbitrary to me and must surely depend on the data set.
So the question is: is it statistically sound to simply iterate through cut-offs of, say, 20-60%, determine which one yields the best result (e.g. the most genes passing the FDR threshold), and just use that?
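For concreteness, this is roughly what I have in mind (just a sketch along the lines of the vignette's independent filtering section; `cds` is assumed to be a CountDataSet with size factors and dispersions already estimated, and the condition names "untreated"/"treated" and the padj < 0.1 threshold are placeholders for my own data):

library(DESeq)

## Assumed: cds is a CountDataSet with size factors and dispersions estimated
res <- nbinomTest(cds, "untreated", "treated")   # nominal p-values for all genes
rs  <- rowSums(counts(cds))                      # per-gene total count used as the filter statistic

## Sweep quantile cut-offs and count how many genes pass BH-adjusted p < 0.1
for (q in seq(0.2, 0.6, by = 0.1)) {
    use      <- rs > quantile(rs, probs = q)
    padjFilt <- p.adjust(res$pval[use], method = "BH")
    cat(sprintf("cut-off %.0f%%: %d genes at padj < 0.1\n",
                100 * q, sum(padjFilt < 0.1, na.rm = TRUE)))
}

In other words, the test is run once, and only the BH adjustment is recomputed on the subset of genes surviving each cut-off; the cut-off that gives the most significant genes would then be the one I keep.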
Thanks
