I have often heard that "alignment of sequence reads to a reference can be inaccurate towards the ends of a read". Does anybody know the exact reason for this?
-
As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. Imagine the last base of a read is "A" and the reference has a "C" at that position. The "A" could be a substitution. Or, it could be a deletion of however many bases until the next "A". Or, it could be an insertion of a single "A". All events would appear identical. Whereas in the middle of a long read, it would be quite obvious which was the case. Furthermore, if the last read base was "A" and the ref base was "A", but the subsequent ref base was also "A", you would not be able to distinguish between a match and a 1bp deletion.
I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
Last edited by Brian Bushnell; 12-11-2014, 09:05 PM.
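To make the ambiguity concrete, here is a toy sketch (plain Python, unit edit costs; the sequences and scoring are made up for illustration and are not taken from any real aligner). It counts how many equally cheap alignments exist when the mismatch sits at the very end of the read versus in the middle:

```python
# Toy dynamic program: edit cost plus a count of co-optimal alignments.
# The read is anchored at the left of the reference window and trailing
# reference bases are free, roughly mimicking a read mapped inside a genome.
def co_optimal(read, ref):
    """Return (min edit cost, number of equally cheap alignments)."""
    n, m = len(read), len(ref)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                       # unmatched read prefix
    for j in range(1, m + 1):
        cost[0][j] = j                       # unmatched reference prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (read[i - 1] != ref[j - 1])
            ins = cost[i - 1][j] + 1         # extra base in the read
            dele = cost[i][j - 1] + 1        # base missing from the read
            best = min(sub, ins, dele)
            cost[i][j] = best
            ways[i][j] = ((ways[i - 1][j - 1] if sub == best else 0)
                          + (ways[i - 1][j] if ins == best else 0)
                          + (ways[i][j - 1] if dele == best else 0))
    best = min(cost[n][j] for j in range(m + 1))      # free reference tail
    return best, sum(ways[n][j] for j in range(m + 1) if cost[n][j] == best)

# Read ends in "A" where the reference has "C", with an "A" right after:
# substitution, 1bp deletion, and 1bp insertion all tie at cost 1.
print(co_optimal("GGTTA", "GGTTCA"))      # several co-optimal alignments
# The same A-vs-C mismatch in the middle of the read is unambiguous,
# because the downstream bases pin the alignment down.
print(co_optimal("GGTACGG", "GGTCCGG"))   # a single optimal alignment
```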
-
I take it you chop the first and last 10 bp of the aligned read segment rather than trimming before alignment, since otherwise you just shift where in the read the uninformative variants are, right? What do you use to do that?
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
-
Yes, after alignment. I wrote a couple of tools for calling variants and found they were more accurate if I ignored the outer ~10bp of reads. So, coverage was calculated from only the middle (length-20bp), and variants were called using the rate of variations in the middle (length-20bp) versus the coverage of that region. I did not use any 3rd-party software and do not know of any that takes this approach; unfortunately, the software I wrote for that is owned by UTSW, which doesn't use it.
If you look at this image (a histogram, not reproduced here, of match, substitution, insertion, deletion, and other rates by position in a read): "Other" means "clipped", as in, it went off the end of a contig, so it's obvious why that would increase toward the ends (this was a draft assembly). But also notice how the insertion rate increases toward the ends. Illumina reads generally do not suffer from artificial indels, so those are probably mis-called substitutions - specifically, if there are a lot of mismatches near the end of a read, it is cheaper for the aligner to classify them as a single long insertion event than as a bunch of isolated substitutions interspersed with matches.
To compensate for that, I banned indels at the very ends of reads, which is why the insertion and deletion rates suddenly drop to zero there. This greatly improved overall accuracy. But still, the closer you get to the ends, the less information you have to classify a base as match/substitution/insertion/deletion. So when you have multiple reads mapped over an area, you will always increase accuracy by calling variants from the middles of the reads and ignoring the outermost parts.
Since I don't currently have a piece of software that does this and I don't know of any, I can at least suggest that if you want to do so, you map the reads and then post-process the SAM file, trimming the outermost ~10 bases on each end (adjusting the bases, qualities, CIGAR string, and POS, and tossing the MD tag) to pretend that each read was 10bp shorter on each end, before feeding it to a variant caller.
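As a rough illustration of that post-processing (my own sketch, assuming the pysam library; "mapped.bam" and "clipped.bam" are placeholder names), here is one way to mask the outer N aligned bases of each read. Rather than physically deleting bases, it converts them to soft clips, which variant callers ignore - the POS/CIGAR bookkeeping is simpler and the effect on calling is the same:

```python
import pysam

N = 10                                     # bases to mask at each end
M, I, D, NSKIP, S, H, P, EQ, X = range(9)  # pysam CIGAR op codes

def clip_left(cigar, n):
    """Convert the first n aligned query bases to soft clip.
    Returns (new_cigar, ref_shift); ref_shift gets added to POS."""
    out, ref_shift, need = [], 0, n
    for op, length in cigar:
        if need == 0 or op in (H, S):      # done, or already clipped
            out.append([op, length])
        elif op in (D, NSKIP):             # ref-only ops in the clipped zone vanish
            ref_shift += length
        elif op == I:                      # inserted bases: clip, POS unaffected
            take = min(length, need); need -= take
            out.append([S, take])
            if length > take:
                out.append([op, length - take])
        elif op in (M, EQ, X):             # aligned bases: clip and advance POS
            take = min(length, need); need -= take
            ref_shift += take
            out.append([S, take])
            if length > take:
                out.append([op, length - take])
        else:                              # e.g. padding: pass through
            out.append([op, length])
    merged = []                            # fuse adjacent soft clips
    for op, length in out:
        if merged and merged[-1][0] == S and op == S:
            merged[-1][1] += length
        else:
            merged.append([op, length])
    return [tuple(t) for t in merged], ref_shift

with pysam.AlignmentFile("mapped.bam", "rb") as bam_in, \
     pysam.AlignmentFile("clipped.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        aligned = sum(l for op, l in (read.cigartuples or [])
                      if op in (M, EQ, X))
        if read.is_unmapped or aligned <= 2 * N:
            bam_out.write(read)            # too short to clip; left as-is
            continue
        cigar, shift = clip_left(read.cigartuples, N)
        cigar, _ = clip_left(cigar[::-1], N)   # same trick on the right end
        read.cigartuples = cigar[::-1]
        read.reference_start += shift
        for tag in ("MD", "NM"):           # stale after editing the CIGAR
            if read.has_tag(tag):
                read.set_tag(tag, None)
        bam_out.write(read)
```

Reads with 2N or fewer aligned bases are passed through untouched here; a production version would also need to think about supplementary alignments and mate-pair fields.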
-
Thanks, I was curious because I pretty much deal with RAD-like (or RAD-ish, as I like to say) libraries, so the reads stack up and a variant will always have the same nucleotide position in a read... and it is therefore pretty easy to deal with "end effects".
Unfortunate about the other-owned software. I think your SAM-file solution would certainly work.
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
-
Hi Brian,
Thank you very much for your prompt, detailed, and clear answer; now I understand why accuracy drops at both ends of a read.
However, I don't quite understand one point in your answer: why is it easier to reach an accurate conclusion toward the middle of a long read? Could you please explain more about that?
-
Originally posted by Brian Bushnell: As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. [...] I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
In this paper, their points are:
1. Alignments of sequencing reads to a reference can be inaccurate towards the ends of a read.
2. This (point 1) is especially true when there is a variant near an edge of the read. The mismatch due to the variant may cause all the bases from the variant to the edge of the read to be excluded from the alignment, i.e., soft-clipped.
For point 1, you have already given an excellent answer. For point 2, do you have any idea what the difference is between variants near an edge of the read and variants at other positions in the read - why are the variants near the edge more problematic?
bless~
-
Originally posted by Brian Bushnell: As you approach the end of a read, it becomes increasingly difficult to distinguish between a substitution, deletion, insertion, and even match. [...] I normally chop the last ~10bp off both ends of reads as being uninformative for this reason, and do not include them in any consensus.
If I have already done local indel realignment on my data using GATK RealignerTargetCreator and IndelRealigner, should I still chop my reads? And by "read" in your answer, do you mean one that has already been trimmed to remove any primers and adapters?
-
Local aligners will soft-clip the ends of reads if there are too many variants there. Global aligners won't do that. So, point 2 regarding soft-clipping is irrelevant to global aligners, which are preferred for variant-calling.
But the most important point is that, still, at the very ends of reads you cannot distinguish between a match, substitution, insertion, or deletion - they can (mostly) look the same. In the middle of a read, a 1bp substitution is obviously a 1bp substitution. A 1bp deletion in the middle of a 150bp read will shift roughly 75 bases inward, and each shifted base mismatches the reference with probability ~3/4; so the aligner can call a 1bp deletion, or (0.75*75) ≈ 56 substitutions on average, which is (with most scoring functions) far more expensive. The nearer the event is to the end, the fewer bases are shifted, and thus the fewer substitutions are generated by a miscalled indel; therefore the harder it is to distinguish between an indel and a substitution, or any other event.
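A back-of-envelope sketch of that trade-off (plain Python; SUB_PENALTY and GAP_OPEN are toy numbers of my own, not BBMap's actual scoring function):

```python
SUB_PENALTY = 4   # assumed cost of one substitution
GAP_OPEN = 12     # assumed cost of opening a 1bp gap

def expected_costs(dist_to_end):
    """Cost of the two explanations for a true 1bp deletion located
    dist_to_end bases from the read end: call the indel, or absorb the
    shifted bases as substitutions (each shifted base mismatches a
    random reference with probability 3/4)."""
    as_indel = GAP_OPEN
    as_subs = SUB_PENALTY * 0.75 * dist_to_end   # expected mismatch cost
    return as_indel, as_subs

for d in (75, 20, 10, 4, 1):
    indel, subs = expected_costs(d)
    winner = "indel (correct)" if indel < subs else "substitutions (wrong)"
    print(f"{d:3d} bases from end: indel={indel}, subs={subs:5.1f} -> {winner}")
```

With these numbers the two explanations tie once the event is within GAP_OPEN / (0.75 × SUB_PENALTY) = 4 bases of the end, and closer than that, the wrong explanation is actually cheaper.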
-
Originally posted by blaboon: Sorry for the annoying questions. As a newbie, an expert is just like a god. If I have already done local indel realignment on my data using GATK RealignerTargetCreator and IndelRealigner, should I still chop my reads? [...]
Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. Realignment around indels can to some extent mitigate the problem, because it uses the consensus of multiple reads rather than aligning based on individual reads. Still, if you are at all interested in accurately mapping indels, I suggest you use BBMap rather than relying on indel-realignment.
That said, I have not rigorously benchmarked BBMap with or without "chopping" versus GATK with indel realignment, so to some extent this is my opinion rather than something I can state as fact.
I will state these two things:
1) Calling variations directly from the consensus of a pileup from BBMap is more accurate if you chop the ends off the aligned reads.
2) BBMap is more accurate with indels than any other aligner I have tested, regardless of whether the read ends are chopped off.
However, once you include additional variables like GATK with or without indel realignment, plus an arbitrary choice of aligner, there are too many variables to be certain anymore, and I have not tested all of those combinations. I am also not sure whether "chopping" is better or not when you do indel realignment or local reassembly around suspected indels. Probably not, if the indel realigner / local assembler is good. However, with a simple consensus and no realignment, chopping should generally be better.
Last edited by Brian Bushnell; 12-12-2014, 12:03 AM.
-
The end problem was first mentioned in a 2008 paper and has been rediscovered multiple times since. The problem subsequently motivated the development of GATK realignment, SRMA, samtools BAQ, and the stampy per-base alignment score. In 2009, some already required variants to have sufficient support from the middle third of a read (though in 2012, others were still publishing CNS (Cell/Nature/Science) papers whose results were mainly caused by this artifact).
In theory, perfect GATK-like realignment should satisfactorily solve the end problem for WGS. None of the pairwise NGS mappers I know of can replace such realignment, because they lack the powerful consensus information across reads during the mapping phase. However, the implementations of SRMA and GATK realignment are not optimal: they are a bit slow and do not always do the best job.
The continuing indel-calling problem led to the development of a new generation of variant callers: GATK HaplotypeCaller, Platypus, FreeBayes, and SOAPindel, among many others. These modern callers call variants simultaneously with realignment or reassembly; some of them even completely ignore the alignment. They are usually very robust to the end-indel issue. These new tools are on the right track.
The end problem is also alleviated by longer reads, because the fraction of bases near read ends is reduced. For deep WGS, the misalignment of a few ends will not lead to a wrong call. GATK realignment is not much needed nowadays, especially with a modern caller. End chopping is not recommended either, because it hurts sensitivity. Nonetheless, for RAD and amplicon sequencing, where it is much harder to fix the end problem with consensus, end chopping still has its uses. In addition, GATK and samtools have long computed summary statistics about where on the reads a variant is supported; these serve a similar goal to end chopping.
-
Originally posted by Brian Bushnell: Trimming adapters/primers should be done before alignment. "Chopping" the reads should be done post-alignment. [...]
Thank you for your detailed response.
Now my question is: why should we trim primers before alignment? Since the Qiagen paper says alignment quality is low toward the ends of reads, if we keep the primers during alignment, couldn't we push the "low-quality part" (at one end, at least) into the primer-binding region, so that our "real" read bases would achieve higher alignment quality?
-
Nice idea, but unfortunately, you can't resolve the problem by sticking (or keeping) random bases on the end of the reads and then taking them back off after mapping. In fact, that will only make the problem worse as the random bases will have some coincidental matches with the genomic sequence, potentially altering the alignment. It's the ends of the genomic portions of the reads that are aligned inaccurately, regardless of whether there are nongenomic bases extending beyond them.
However, trimming low-quality (but genomic) bases from the read ends could indeed be done after alignment, which would help reduce the problem somewhat.
-
If you are working with haploid data, another approach to getting better-aligned ends is to run iterative mapping, where you take the consensus from the previous mapping iteration and map the reads to that consensus on each successive iteration (see the sketch below). This also has the advantage of being able to map reads that would normally not get mapped, and to make variant calls in regions that would otherwise have no coverage because no reads map there directly.
I implemented that approach in Geneious (commercial software) and described it in a white paper at http://assets.geneious.com/documenta...ReadMapper.pdf
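For anyone who wants to try the idea without Geneious, here is a rough sketch of the same iterative-consensus loop using standard command-line tools (bwa, samtools, and bcftools are assumed to be on the PATH; file names are arbitrary, and this is not how Geneious implements it):

```python
import shutil
import subprocess

def run(cmd):
    """Run a shell pipeline, echoing it and failing loudly on error."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

reads = "reads.fq"
ref = "iter0.fa"
shutil.copy("reference.fa", ref)           # iteration 0: the original reference

for i in range(1, 4):                      # a fixed three rounds, for simplicity
    run(f"bwa index {ref}")
    run(f"bwa mem {ref} {reads} | samtools sort -o iter{i}.bam -")
    run(f"samtools index iter{i}.bam")
    # call haploid variants against the current reference...
    run(f"bcftools mpileup -f {ref} iter{i}.bam"
        f" | bcftools call -mv --ploidy 1 -Oz -o iter{i}.vcf.gz")
    run(f"bcftools index iter{i}.vcf.gz")
    # ...and bake them into a new consensus to map against next round
    new_ref = f"iter{i}.fa"
    run(f"bcftools consensus -f {ref} iter{i}.vcf.gz > {new_ref}")
    ref = new_ref

# After the loop, a final mapping against `ref` gives alignments whose
# ends are no longer pulled toward the original reference's alleles.
```

Stopping when an iteration yields no new variant calls would be a natural refinement over the fixed round count.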