There's no 'XT:A:U' tag for bwa-mem outputs. How can I get the uniquely mapped reads?
Announcement
Collapse
No announcement yet.
X
-
Keep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.
Comment
-
Originally posted by dariober View PostKeep in mind that "uniqeness" is not a clear cut variable. Rather is a probability that the position of the read is wrong. Specifically, p_wrong= 10^(-mapq/10).
So for mapq 10 you have p_wrong= 0.1, for mapq 20 p_wrong= 0.01. In practice I see that most of the reads are either mapq 0 or mapq pretty high (>30) with little in between, so the exact threshold makes little difference. But this depends of course on your study system.
I have checked many resources and I know it's hard to definite the 'unique' reads. So is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?
Comment
-
Originally posted by Yamol View PostSo is it secure to take the down-stream analysis on the reads with mapq > 30? Or can I make a more filter to choose reads from the mapq>30 reads?
If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.
Comment
-
Originally posted by Brian Bushnell View PostNo, that will eliminate huge numbers of reads that have variations, or errors, or are mapped to somewhat repetitive areas, and generally eliminate coverage from a lot of places and give a severe ref-bias everywhere else, so the results will not be trustworthy.
If you filter to remove anything with mapq<=3 you will get only reads that the aligner considers as having one mapping location that is better than any other.
Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:
Code:samtools view -c -q0 grm056_i1_KO_carcass.bam 18 8184447 samtools view -c -q1 grm056_i1_KO_carcass.bam 18 8039114 samtools view -c -q3 grm056_i1_KO_carcass.bam 18 7838063 samtools view -c -q5 grm056_i1_KO_carcass.bam 18 7816288 samtools view -c -q10 grm056_i1_KO_carcass.bam 18 7707885 samtools view -c -q20 grm056_i1_KO_carcass.bam 18 7635921 samtools view -c -q30 grm056_i1_KO_carcass.bam 18 7313124 samtools view -c -q40 grm056_i1_KO_carcass.bam 18 7106571
Comment
-
Originally posted by dariober View PostI agree that going above q30 is probably unnecessarily harsh. However there might be cases where you prefer to loose coverage in some regions rather than accepting uncertain read positions.
Isn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.
Anyway, as I said above I find it makes little difference the exact threshold between say 5 and 20. These are the number of reads mapped to mouse chr18 (bowtie2), read length 75, it's a pull-down experiment:
Code:samtools view -c -q0 grm056_i1_KO_carcass.bam 18 8184447 samtools view -c -q1 grm056_i1_KO_carcass.bam 18 8039114 samtools view -c -q3 grm056_i1_KO_carcass.bam 18 7838063 samtools view -c -q5 grm056_i1_KO_carcass.bam 18 7816288 samtools view -c -q10 grm056_i1_KO_carcass.bam 18 7707885 samtools view -c -q20 grm056_i1_KO_carcass.bam 18 7635921 samtools view -c -q30 grm056_i1_KO_carcass.bam 18 7313124 samtools view -c -q40 grm056_i1_KO_carcass.bam 18 7106571
Comment
-
Originally posted by dariober View PostIsn't <=3 a little too relaxed though? It means that a read with q4 has 10^(-4/10) ~ 0.4 chance to be wrong.
Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!
Comment
-
Originally posted by Brian Bushnell View PostThe question was specifically about multimapping, not mismapping.
Thanks for sharing the empirical results of filtering with various thresholds! But, I'd like to point out that the scalar result of reads remaining does not necessarily reflect the damage potentially done to the analysis results. As an example, a typical human has ~3 million mutations compared to the reference, or around a 0.1% rate. With 100bp error-free reads, then, about 90% of them would be expected to map to the reference with 0 mismatches. If you set your quality threshold high enough, you can will eliminate all of the reads that indicate a variation... but still retain 90% of your reads, which looks pretty good!
Comment
-
Originally posted by dariober View PostAs a thought experiment, imagine a region of 100bp with several SNP relative to the reference, say 20. In order to map a read there you need to lower the mapq score. This could mean that a read truly coming from that region could map even better somewhere else and reads truly mapping somewhere else could end up in this critical region.
It's always possible to take a read from some region, apply a bunch of mutations to it, and have the resulting read map perfectly to somewhere else, while having too many differences to map to the correct location at all. These events are very unlikely and tend to be ignored; increasing the mapq threshold will not protect you from them. Of course, there's always some gray area in between. But generally, the higher the mapq threshold, the more reference-bias you will get, which greatly interferes with variant calling.
Comment
Latest Articles
Collapse
-
by seqadmin
The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
...-
Channel: Articles
11-27-2023, 01:15 PM -
-
by seqadmin
Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...-
Channel: Articles
11-09-2023, 07:02 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 10:48 AM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
Today, 10:48 AM
|
||
Started by seqadmin, Yesterday, 08:26 AM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
Yesterday, 08:26 AM
|
||
Started by seqadmin, Yesterday, 08:12 AM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
Yesterday, 08:12 AM
|
||
Started by seqadmin, 11-27-2023, 08:12 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
11-27-2023, 08:12 AM
|
Comment