
  • #16
    I'm having some fun with it. What it seems like to me is that I need very detailed knowledge of the transcriptome within the context of a sequencing run. For example, we know there are families of genes at different loci that are 50 or 60% similar, which to a biologist makes them sound fairly separable. To an aligner with 50bp reads, however, those features can share a lot of data when one or the other is expressed. Since most mappers assign equally good hits randomly, that's going to be messy.

    So you need to know how much data can be shared between which genes for a given sequencing type and read length, as sketched below.
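
    A minimal sketch of what that sharing could look like, in base R, assuming two hypothetical cDNA sequences geneA and geneB:

    ## Fraction of geneA's 50bp read windows that also occur in geneB, i.e.
    ## reads an aligner could place equally well in either gene.
    shared_read_windows <- function(a, b, k = 50) {
      kmers <- function(s) {
        n <- nchar(s) - k + 1
        unique(vapply(seq_len(n), function(i) substr(s, i, i + k - 1), ""))
      }
      ka <- kmers(a)
      length(intersect(ka, kmers(b))) / length(ka)
    }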
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #17
      sdriscoll: I've been arguing for some time that there is no such thing as "alignment noise", but your post seems to contradict my claim. So I would like to hear more about your simulations.

      These 2% wrongly aligned reads, do they really look like correctly mapped ones? Do they have good mapping quality (MAPQ value in the SAM file)? Are they mapped uniquely? Are they unspliced? If you run the aligner a second time, do they end up at the same wrong position?
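
      One way to tabulate these properties, sketched in base R, assuming a Unix shell and a SAM file named aligned.sam:

      ## Pull the mandatory FLAG, MAPQ and CIGAR columns, skipping header lines.
      sam <- read.table(pipe("grep -v '^@' aligned.sam | cut -f2,5,6"),
                        sep = "\t", quote = "", comment.char = "",
                        col.names = c("flag", "mapq", "cigar"))
      table(cut(sam$mapq, breaks = c(-1, 0, 10, 30, 255)))  # MAPQ bands
      mean(grepl("N", sam$cigar))          # fraction spliced (CIGAR 'N' operation)
      mean(bitwAnd(sam$flag, 256) == 0)    # fraction of primary alignments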

      Comment


      • #18
        Originally posted by sadiexiaoyu View Post
        I would like to keep genes with a p value below 0.05, so according to my result, if I cut around 10%, then no genes with a p value below 0.05 (10^-1.3) will be lost
        Of course, you cannot call genes with a raw p value below 0.05 as significant, due to the multiple-testing problem. Rather, you want the adjusted p value to be below some threshold (with 0.05 or 0.1 being commonly chosen values), and an adjusted p value of 0.05 typically (though not always) corresponds to a raw p value which is a good deal smaller. This is why Wolfgang suggested something like 0.003.

        As the relation between raw and adjusted p values depends on your data set, some experimenting with the threshold is often helpful to get optimal power.
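
        For instance, a quick way to see which raw p value corresponds to your chosen adjusted cutoff (a sketch in R, where pvals stands for your vector of raw per-gene p values):

        padj <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjustment
        max(pvals[padj < 0.05])   # largest raw p value that still passes FDR 0.05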

        Comment


        • #19
          Originally posted by Simon Anders View Post
          sdriscoll: I've been arguing for some time that there is no such thing as "alignment noise", but your post seems to contradict my claim. So I would like to hear more about your simulations.

          These 2% wrongly aligned reads, do they really look like correctly mapped ones? Do they have good mapping quality (MAPQ value in the SAM file)? Are they mapped uniquely? Are they unspliced? If you run the aligner a second time, do they end up at the same wrong position?
          I'll see what I can put together for you. Naturally some of this will depend on which aligner I use and how I'm mapping the reads... but I can come up with some answers for you. I'm working on a transcriptome analysis that I suspect will explain a lot of it. I'm positive there are many connections between genes at the 100bp window of resolution that go beyond the names, IDs, and even genomic locations of the features. Once I have this map I expect to see some of the false gene counts from the misaligned reads fall away.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #20
            Originally posted by Simon Anders View Post
            Of course, you cannot call genes with a raw p value below 0.05 as significant, due to the multiple-testing problem. Rather, you want the adjusted p value to be below some threshold (with 0.05 or 0.1 being commonly chosen values), and an adjusted p value of 0.05 typically (though not always) corresponds to a raw p value which is a good deal smaller. This is why Wolfgang suggested something like 0.003.

            As the relation between raw and adjusted p values depends on your data set, some experimenting with the threshold is often helpful to get optimal power.
            Dear Simon,

            Thank you very much for your suggestion. I was also confused about whether I should use the raw p value or the adjusted p value (in edgeR, this is the FDR). You suggested that "some experimenting with the threshold is often helpful to get optimal power", so I plan to run the data without filtering to see which raw p value corresponds to an FDR of 0.05, and then use this to decide what percentage of the data should be filtered out, following Fig. 1 in the paper. Do you think this will be helpful?

            Best,

            Sadiexiaoyu

            Comment


            • #21
              Yes.

              But you are confused about terminology: an "adjusted p value" is a p value that has been "adjusted" for multiple testing. If the adjustment method is one that is designed to control the false discovery rate (FDR), such as the methods by Benjamini and Hochberg or by Storey and Tibshirani, and if the original p values were sound, then the following holds: if one considers all genes with an adjusted p value below some threshold ϑ as "hits", then the proportion of false positives in this list of hits, the so-called false discovery rate, is expected to be at most ϑ.
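
              A minimal simulation illustrating this property (a sketch in R; the effect sizes and gene counts are invented):

              set.seed(1)
              m0 <- 9000; m1 <- 1000                     # true nulls / true effects
              p <- c(runif(m0),                          # null p values are uniform
                     pnorm(rnorm(m1, mean = 3), lower.tail = FALSE))
              is_null <- rep(c(TRUE, FALSE), c(m0, m1))
              hits <- p.adjust(p, method = "BH") < 0.1
              sum(is_null & hits) / max(sum(hits), 1)    # realized FDP; ~0.1 or less on average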

              Comment


              • #22
                Right... so it's not necessary to do any kind of analysis comparing raw p values to adjusted p values. By the rules of statistics, when multiple-testing correction is necessary you're supposed to ignore the raw p values and take the adjusted ones as "truth". Then you do as Simon suggested: pick a threshold and understand that your results may contain that proportion of false positives.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #23
                  I think you have to use a Chinese restaurant process or a biological-diversity estimate to justify your thresholds, since read counts aren't independent. I.e., you are fiddling with the number of *types* of things that were observed, which depends on the number of observations and on the relative expression of each of the genes. It could be that your arbitrary threshold throws away much more of the reads in some cases than in others, depending on how the reads were distributed throughout the sample (see the toy example below).

                  Also, you need to take into account that many of the most interesting and relevant genes will be expressed at much lower levels.
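
                  A toy example of that dependence (a sketch in R; the dispersions are invented): the same count cutoff discards very different fractions of genes depending on how the reads are distributed.

                  set.seed(1)
                  even   <- rnbinom(10000, mu = 50, size = 5)    # reads spread evenly
                  skewed <- rnbinom(10000, mu = 50, size = 0.1)  # a few genes dominate
                  mean(even <= 10)     # fraction of genes lost at a count cutoff of 10
                  mean(skewed <= 10)   # far larger under the skewed distribution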

                  Comment


                  • #24
                    No need to make things overly complicated.

                    The point of the Bourgon et al. paper is that the following is perfectly fine: try different thresholds on the count sums (by simply scanning through a grid of values), always adjust the p values of the genes above the count-sum threshold with BH, and then use the threshold that gives the largest absolute number of genes with an adjusted p value below your chosen FDR. (It may sound as if such post-hoc choosing of the threshold by peeking at the test outcome is "cheating" and breaks FDR control, but this is, somewhat surprisingly, not the case, as Bourgon et al. showed.)

                    Of course, if you are specifically interested in lowly expressed genes, then such a way of choosing the filter may be permissible but disadvantageous, because your goal is not to optimise power to get many hits but to learn about the small genes. Then it might be better to choose a lower threshold, just low enough that you do not lose any hits at all compared to the no-filtering case.
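
                    A sketch of this scan in R (the Bioconductor genefilter package offers similar functionality; here counts is assumed to be a gene-by-sample matrix and pvals the vector of raw per-gene p values):

                    cutoffs <- quantile(rowSums(counts), probs = seq(0, 0.9, by = 0.05))
                    n_hits <- sapply(cutoffs, function(cut) {
                      keep <- rowSums(counts) > cut      # filter on the count sum only
                      sum(p.adjust(pvals[keep], method = "BH") < 0.1, na.rm = TRUE)
                    })
                    cutoffs[which.max(n_hits)]           # cutoff giving the most hits at FDR 0.1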

                    Comment


                    • #25
                      Originally posted by Simon Anders View Post
                      No need to make things overly complicated.

                      The point of the Bourgon et al. paper is that the following is perfectly fine: try different thresholds on the count sums (by simply scanning through a grid of values), always adjust the p values of the genes above the count-sum threshold with BH, and then use the threshold that gives the largest absolute number of genes with an adjusted p value below your chosen FDR. (It may sound as if such post-hoc choosing of the threshold by peeking at the test outcome is "cheating" and breaks FDR control, but this is, somewhat surprisingly, not the case, as Bourgon et al. showed.)

                      Of course, if you are specifically interested in lowly expressed genes, then such a way of choosing the filter may be permissible but disadvantageous, because your goal is not to optimise power to get many hits but to learn about the small genes. Then it might be better to choose a lower threshold, just low enough that you do not lose any hits at all compared to the no-filtering case.
                      I just don't think that is the correct framework to begin with since they aren't doing multiple testing in the first place.

                      Comment


                      • #26
                        Sorry, I've lost the thread of the discussion now. Whom do you mean by "they"?

                        Comment


                        • #27
                          Bourgon et al.

                          Comment


                          • #28
                            Originally posted by Simon Anders View Post
                            Yes.

                            But you are confused about terminology: an "adjusted p value" is a p value that has been "adjusted" for multiple testing. If the adjustment method is one that is designed to control the false discovery rate (FDR), such as the methods by Benjamini and Hochberg or by Storey and Tibshirani, and if the original p values were sound, then the following holds: if one considers all genes with an adjusted p value below some threshold ϑ as "hits", then the proportion of false positives in this list of hits, the so-called false discovery rate, is expected to be at most ϑ.
                            Hi, Simon,

                            Thank you so much for the correction! I also noticed your nice explanation in this thread: http://seqanswers.com/forums/showthread.php?t=17011

                            Best,

                            Sadiexiaoyu

                            Comment


                            • #29
                              I just don't think that is the correct framework to begin with since they aren't doing multiple testing in the first place.
                              Maybe we are talking about different papers. I'm referring to this one:

                              R. Bourgon, R. Gentleman, W. Huber: Independent filtering increases detection power for high-throughput experiments. PNAS 2010, 107(21):9546-51. doi:10.1073/pnas.0914005107.

                              This paper discusses which kind of filtering is permissible in the sense that it does not invalidate subsequent adjustment for multiple testing.

                              So, yes, of course, they do multiple testing. It's the whole point of the paper.

                              Comment


                              • #30
                                What they were doing was trying to adapt an existing analysis framework to their problem.

                                Anyway, it just strikes me as wrong to adjust the significance of the sample by selecting the number of genes to test, when it seems the p values could be derived from first principles.

                                Comment
