**A Oshlack** · 02-25-2012, 02:22 AM

http://www.ncbi.nlm.nih.gov/pubmed/19371405

**Jane M** · 02-27-2012, 12:49 AM

Thanks for the paper.
I am particularly interested in the effect of sequencing depth on SNP and indel detection. Are there papers on this topic?

If there is a SNP at a given position, is the probability of detecting it the same at low coverage as at high coverage, assuming the same sequencing error rate?

Thanks,
Jane

**gringer** · 02-27-2012, 04:19 PM

> Originally posted by Jane M
>
> If there is a SNP at a given position, is the probability of detecting it the same at low coverage as at high coverage, assuming the same sequencing error rate?

All other things being equal, the probability of detecting a SNP at high coverage will be higher than at low coverage. Random sequencing errors at a given position average out with repeated sampling, so the call becomes more reliable.

On its own, though, this is not a particularly meaningful statement. There will be some point beyond which extra quality no longer noticeably increases the reliability of the measurement (e.g. a phred score of 40 or so, considering repeated sampling). In almost all cases the actual SNP frequency plays a greater role in detection, and the difference in detection probability will be negligible both for high-frequency SNPs and for very low-frequency SNPs.

If the frequency of the variant is near 50%, then you'd need a coverage of less than 6 or so over a region (my ball-park guess) to miss repeated observations of both variants of a dimorphic SNP. Conversely, for a SNP with a frequency of less than 1% (depending on your definition of SNP), you'd have to be quite lucky to get any sample that has the variant of interest.
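
To put rough numbers on that ball-park, here is a minimal sketch under a simple binomial model. It assumes a heterozygous site where each read shows either allele with probability 0.5, no sequencing error, and "repeated observation" meaning at least 2 reads per allele. The function name and the threshold are illustrative choices, not anything from a particular variant caller.

```python
from math import comb


def prob_both_alleles_seen(depth, allele_fraction=0.5, min_reads=2):
    """Probability of seeing at least `min_reads` reads of EACH allele.

    Assumes reads are independent draws and ignores sequencing error.
    """
    total = 0.0
    # k = number of reads carrying the variant allele
    for k in range(min_reads, depth - min_reads + 1):
        total += (
            comb(depth, k)
            * allele_fraction**k
            * (1 - allele_fraction) ** (depth - k)
        )
    return total


if __name__ == "__main__":
    for depth in (4, 6, 10, 20):
        print(depth, round(prob_both_alleles_seen(depth), 4))
```

Under these assumptions, a depth of 6 gives about a 78% chance of seeing both alleles at least twice, and a depth of 10 gives about 98%, which is in line with the guess that the problem mostly lies below a coverage of 6 or so.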

**gringer** · 02-27-2012, 04:44 PM

> Or, if a differentially expressed gene has low coverage on average, is the probability of detecting it as differentially expressed lower than if the coverage were high?

This is quite a different question from the SNP question. Even when just considering the read counts at a single base-pair location, two dimensions of measurement influence whether differential expression comes out as significant: the number of raw reads and the fold-change difference. A low number of raw reads increases the measurement error, which increases the fold-change difference that would need to be observed for differential expression to be considered significant (note: raw read counts, not normalised read counts).

Again, with all other things equal, a high coverage will increase the reliability of the result, but this time it has a much greater role to play in determining whether the expression difference is significant.
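
A toy illustration of why raw counts matter: the sketch below runs an exact two-sided binomial test on the read counts from two conditions. It assumes equal library sizes (so under the null each read is equally likely to come from either condition) and ignores biological replicates and overdispersion, so it is not how a real differential expression tool works. The function name is made up for this example.

```python
from math import comb


def binomial_two_sided_p(count_a, count_b):
    """Exact two-sided binomial p-value for count_a vs count_b raw reads.

    Null hypothesis: each read falls in condition A with probability 0.5
    (i.e. equal library sizes, no real expression difference).
    """
    n = count_a + count_b
    k = min(count_a, count_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Same 2-fold difference, very different raw counts.
    print("5 vs 10 reads:    p =", binomial_two_sided_p(5, 10))
    print("500 vs 1000 reads: p =", binomial_two_sided_p(500, 1000))
```

The same 2-fold change is nowhere near significant with 5 vs 10 reads (p is about 0.30), but overwhelmingly significant with 500 vs 1000 reads.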

Unfortunately, there are plenty of other confounding factors, such that differential expression analysis by NGS can really only be used for fishing / hypothesis generation. Off the top of my head, there are multiply-mapped reads, multiple isoforms / splice variants, incomplete coverage of the gene / transcript, PCR duplicates, and incorrect gene annotation. Some of these situations can be identified by looking at coverage plots at the transcript level, but that requires too much effort and human intervention to work at a genome-wide scale.

If you really want to doubt the reliability of your results, look at the coefficient of variation of coverage within each transcript (SD of coverage divided by mean coverage), across all transcripts. The last time I looked at that, I think about 70-125% described a "good" coverage, and most transcripts were over something like 300%. I'd be interested to know other people's experience regarding this matter.
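
For anyone wanting to try this, a minimal sketch of the calculation for one transcript is below. It assumes you already have the per-base depths as a list of numbers (for example, parsed from a depth table), and it uses the population SD; whether to use population or sample SD is a choice made here, not something stated above. The example depth lists are invented.

```python
from statistics import mean, pstdev


def coverage_cv_percent(depths):
    """Coefficient of variation of per-base coverage, as a percentage.

    CV = SD of coverage / mean coverage. Returns NaN for zero mean coverage.
    """
    mu = mean(depths)
    if mu == 0:
        return float("nan")
    return 100.0 * pstdev(depths) / mu


if __name__ == "__main__":
    # Hypothetical per-base depths for two transcripts with the same mean.
    even = [10, 12, 9, 11, 10, 8]
    spiky = [0, 0, 0, 0, 0, 60]
    print("even coverage CV:  %.1f%%" % coverage_cv_percent(even))
    print("spiky coverage CV: %.1f%%" % coverage_cv_percent(spiky))
```

Both toy transcripts have a mean coverage of 10, but the one where all reads pile up at a single position has a CV above 200%.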

**Jane M** · 02-28-2012, 03:00 AM

Thanks a lot for your answer gringer!

I must admit that currently I'm particularly interested in the detection of SNPs, so I would like to have an idea of how reliable my results are when coverage is low.
For example, I detect variants in these 2 extreme cases:
- 3 reads for the reference and 3 reads for the variant
- 100 reads for the reference and 100 reads for the variant

Has anyone estimated the reliability of results as a function of sequencing depth? Gringer, can you suggest publications about it?

Jane

**gringer** · 02-28-2012, 03:22 AM

> For example, I detect variants in these 2 extreme cases:
> - 3 reads for the reference and 3 reads for the variant
> - 100 reads for the reference and 100 reads for the variant

That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course, for a heterozygous sample this is expected. Are these reads from a single sample (i.e. you're looking at a heterozygous sample), or from multiple samples? You should be doing your SNP detection using pooled reads from all samples, and then genotype each sample accordingly. A more interesting case (for a single sample) would be something like the following:

SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]

With Sanger sequencing, two observations of a variant (in a population) are typically enough to consider the variant as being present, bearing in mind that a typical definition of a SNP requires a frequency greater than 1% (or possibly 5%). I expect it would be similar for NGS. I think the SNP microarrays use a few replicate sequences per variant (e.g. see here), just to be safe.

Edit:

> can you suggest publications about it?

I'm not aware of any NGS publications relating to SNP discovery (because I haven't looked), but for "classical" SNP detection I guess you could look at the Wikipedia references:

http://en.wikipedia.org/wiki/SNP_genotyping#References

**Jane M** · 02-28-2012, 06:20 AM

> Originally posted by gringer
>
> That's not a particularly extreme case. It suggests SNP frequencies of 50%, which means coverage is not going to matter. Of course, for a heterozygous sample this is expected. Are these reads from a single sample (i.e. you're looking at a heterozygous sample), or from multiple samples? You should be doing your SNP detection using pooled reads from all samples, and then genotype each sample accordingly. A more interesting case (for a single sample) would be something like the following:
>
> SNP 1: 1 read for the reference and 5 reads for the variant [probably homozygous variant, but small possibility of heterozygote]
> SNP 2: 20 reads for the reference and 80 reads for the variant [small possibility of heterozygote, but the imbalance of counts suggests there might be multiple read hits in the genome]

The examples that I gave are not necessarily ones that I actually have. Maybe I do, since I have hundreds of variants...

My questions relate to both the examples that I gave and the ones that you gave. It's easier to start with my cases.
From what you said, I understand that I can trust my two cases equally.
That was my question: I thought I could be more confident with (100 reads for the reference and 100 reads for the variant) than with (3 reads for the reference and 3 reads for the variant), all other things equal, because 3 errors are more likely than 100 errors.
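
The "3 errors versus 100 errors" intuition can be sketched with a binomial tail. This assumes a true homozygous-reference site, independent reads, and a per-base error rate of 1% (an arbitrary illustrative value). As a simplification it counts any error as supporting the variant, although only some errors would produce that specific base. The function name is hypothetical.

```python
from math import comb


def prob_errors_at_least(k, depth, error_rate=0.01):
    """P(at least k of `depth` reads are sequencing errors), binomial model.

    Simplification: every error is counted as a read supporting the variant.
    """
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


if __name__ == "__main__":
    print("3 of 6 reads wrong:     ", prob_errors_at_least(3, 6))
    print("100 of 200 reads wrong: ", prob_errors_at_least(100, 200))
```

Under these assumptions, both probabilities are small, but the low-coverage one (around 2 in 100,000) is not negligible when multiplied over millions of positions, whereas the high-coverage one is effectively zero.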

Then, for the cases you mentioned, it's more complicated, but the idea is the same: we estimate a proportion of variant reads, and this proportion is probably more reliable if it has been estimated from a large sample, all other things equal.
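
One way to make "more reliable" concrete is a confidence interval on the variant read proportion. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95%); it treats reads as independent draws and ignores base and mapping quality, so it is only a rough guide. The function name is made up for this example.

```python
from math import sqrt


def wilson_interval(variant_reads, depth, z=1.96):
    """Wilson score interval for the variant read proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = variant_reads / depth
    denom = 1 + z * z / depth
    centre = (p + z * z / (2 * depth)) / denom
    half = z * sqrt(p * (1 - p) / depth + z * z / (4 * depth * depth)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    for variant, depth in ((3, 6), (100, 200)):
        low, high = wilson_interval(variant, depth)
        print("%d/%d variant reads: %.3f to %.3f" % (variant, depth, low, high))
```

Both cases estimate a variant fraction of 50%, but with 3 of 6 reads the interval spans roughly 19% to 81%, while with 100 of 200 reads it narrows to roughly 43% to 57%.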

I'm studying the mutations occurring in cells of patients suffering from leukaemia. As a first study, I am looking for somatic mutations that occur at homozygous positions.
I'm using tools like VarScan 2 and JointSNVMix for detection.
I know that my samples have a purity of 1 (or very close to 1), but I shouldn't expect exactly 0, 50 or 100% of variant reads, because not all of my cells will be mutated...

So to filter my (big) list of variants, I use quality criteria, and that is why I'm looking for publications about them.

**rlopez** · 07-30-2012, 05:52 AM

> I'm using tools like VarScan 2 and JointSNVMix for detection.

Hello Jane M,

This might not be the right thread, but I was wondering if you would like to share your experience with VarScan 2, JointSNVMix, Strelka and any others you might have tried, e.g. SomaticSniper, MuTect, etc.

Many thanks,

Rene L

Correlation between sequencing depth and false positives
