  • How is mapping quality estimated?

    Hi everyone,

    I apologize if this is a repost; please refer me to an older discussion if it is (I couldn't find one myself).

    I am trying to understand how to handle the mapping quality scores in the output of mappers like BWA, Bowtie, BFAST, etc. To do this, I first want to understand exactly how they are estimated.

    From what I've found so far, this score is an estimate of the likelihood that a mapping is wrong. I saw the formula here, but I still can't figure out why it represents that likelihood.

    Can someone please explain?

    Thanks,
    Amit

  • #2
    It depends completely on the aligner and, in any case, is always just an estimate. How Bowtie2 and BWA compute it is described by me and another user on Biostars.
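
    As a rough guide, independent of any particular aligner: the SAM spec defines MAPQ as -10*log10 of the estimated probability that the reported mapping position is wrong; the aligner-specific part is how that probability is estimated. A minimal sketch of the conversion, assuming that definition:

    def mapq_to_error_prob(mapq):
        # MAPQ = -10 * log10(P(mapping position is wrong)), per the SAM spec.
        return 10 ** (-mapq / 10.0)

    print(mapq_to_error_prob(30))  # ~0.001, i.e. roughly a 1-in-1000 chance the position is wrong

    So MAPQ 30 corresponds to roughly a 1-in-1000 chance of a wrong position; how trustworthy that number is depends on how the aligner estimated it.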

    • #3
      Originally posted by dpryan
      It depends completely on the aligner and, in any case, is always just an estimate. How Bowtie2 and BWA compute it is described by me and another user on Biostars.
      Thank you very much! That was helpful.

      Given what you said, how do you interpret these quality scores? I mean, if they don't necessarily represent the actual error rate, how can you tell which mapping is accurate enough for you?
      With the people I've worked with, it was usually based on hunches. Is there an empirical way to tell?

      Does the estimation take the reference data into account? I mean, in terms of empirically measuring error rates on the reference genome in question.

      Thanks

      • #4
        Originally posted by AmitL
        Thank you very much! That was helpful.

        Given what you said, how do you interpret these quality scores? I mean, if they don't necessarily represent the actual error rate, how can you tell which mapping is accurate enough for you?
        With the people I've worked with, it was usually based on hunches. Is there an empirical way to tell?

        Does the estimation take the reference data into account? I mean, in terms of empirically measuring error rates on the reference genome in question.

        Thanks
        The mapping score alone is never going to tell you the true probability that a mapping is correct. The best way to determine that is by looking at an ROC curve generated from synthetic data, which will tell you the empirically determined true positive and false positive rates for a given quality cutoff. BBTools has some useful tools for this. For example, using the E. coli reference:

        First index the reference:
        bbmap.sh ref=ecoli.fa -Xmx1g

        Then generate tagged synthetic reads:
        randomreads.sh build=1 reads=100000 length=150 minq=5 midq=20 maxq=30 out=reads.fq -Xmx1g

        Then map them and produce a sam file:
        bbmap.sh in=reads.fq out=mapped.sam -Xmx1g
        (or a different command using your aligner of choice)

        Then generate an ROC curve:
        samtoroc.sh in=mapped.sam reads=100000 > roc.txt

        Then plot the ROC curve using Excel, or whatever. The columns are labeled; to determine whether the aligner was exactly right, you would plot "truePositiveStrict" versus "falsePositiveStrict". Bear in mind that the resulting relationship between mapping score and accuracy will only be precisely valid for that genome, with that particular error model, and that particular read length, but it will be roughly maintained across different organisms, error models, and read lengths.
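
        If you'd rather script the plot than use Excel, here's a minimal sketch in Python, assuming roc.txt is a tab-delimited table whose header row includes the "truePositiveStrict" and "falsePositiveStrict" columns described above (check the actual header samtoroc.sh writes and adjust the column names/delimiter if needed):

        import csv
        import matplotlib.pyplot as plt

        tp, fp = [], []
        with open("roc.txt") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                # Column names assumed from the description above.
                tp.append(float(row["truePositiveStrict"]))
                fp.append(float(row["falsePositiveStrict"]))

        plt.plot(fp, tp)
        plt.xlabel("falsePositiveStrict")
        plt.ylabel("truePositiveStrict")
        plt.savefig("roc.png")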

        randomreads.sh is also capable of adding other things, like indels and no-calls, but that is not yet documented.

        • #5
          I'd just like to reiterate Brian's point. Looking at synthetic data is absolutely the most reliable way to arrive at a meaningful threshold. You can find ROC curves in most aligner papers, though it's always best to make your own with random reads matching your Phred score and mismatch profile.

          • #6
            Thank you Brian!

            So if I understand you correctly, you suggest doing a "recalibration" of the error estimates using simulated data. Is that correct?

            How well do simulated reads represent real sequencing data - in terms of quality, positional bias (if any?), naturally occurring variants, etc?

            Cheers

            • #7
              Originally posted by AmitL
              Thank you Brian!

              So if I understand you correctly, you suggest doing a "recalibration" of the error estimates using simulated data. Is that correct?
              Essentially, yes. Either recalibrate if you will be using the mapping score for something sensitive, or just decide on an acceptable threshold and throw away all reads with a lower score.
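
              If you go the threshold route, here is a minimal sketch of applying a cutoff, using pysam as an example (pysam isn't mentioned above, and the cutoff of 20 is arbitrary; pick whatever your ROC curve justifies):

              import pysam

              # Keep only alignments whose mapping quality meets the chosen cutoff.
              with pysam.AlignmentFile("mapped.sam") as infile, \
                   pysam.AlignmentFile("filtered.bam", "wb", template=infile) as outfile:
                  for read in infile:
                      if read.mapping_quality >= 20:
                          outfile.write(read)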

              How well do simulated reads represent real sequencing data - in terms of quality, positional bias (if any?), naturally occurring variants, etc?

              Cheers
              Depends on the simulator. The command I gave you just generates reads with errors based on a quality profile that reflects that of Illumina reads. You could additionally add SNPs and indels using flags like this:
              snprate=0.5 maxsnps=4 delrate=0.5 maxdellen=15 maxdels=2 insrate=0.5 maxinslen=15 maxinss=2
              ...which would add up to 4 SNPs, 2 deletions, and 2 insertions per read, independent of the quality values of the nearby bases. However, they would be totally random with respect to the underlying genome. Some sequencing platforms have a positional bias - e.g., 454 is more likely to have indels in homopolymers. My generator will not reflect such biases, so it's not quite the same. As for a real genome's bias against SNPs that change amino acids, I doubt that would make a noticeable difference in the results, but that's also not modeled. There are other important events, such as adapter sequence contamination, that are also nonrandom; I have another tool, "addadapters.sh", that can add specified adapter sequences to reads at random positions, but randomreads.sh does not do that.
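
              For example, adding those flags to the earlier read-generation command would look something like this (the same flags as above, just combined; adjust the values to taste):
              randomreads.sh build=1 reads=100000 length=150 minq=5 midq=20 maxq=30 snprate=0.5 maxsnps=4 delrate=0.5 maxdellen=15 maxdels=2 insrate=0.5 maxinslen=15 maxinss=2 out=reads.fq -Xmx1g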

              It's possible that there are read generators available that model all of these biases, but I have not studied other generators, so I'm not aware of any. FYI, wgsim is probably the most widely used.
              Last edited by Brian Bushnell; 04-22-2014, 12:28 PM.

              • #8
                Thanks for the detailed explanations; I'll have a look at the tools you mentioned.

                Have you published any comparison results for BBMap versus other mappers, or a paper about the tool? I'd be happy to read more about it.

                • #9
                  BBMap is not yet published - I'm still working on the publication - though it will be mentioned in a few upcoming papers. I have a poster, though, using data from last year.
                  Attached Files

                  • #10
                    Looks interesting; I hope it makes it into a journal soon.
                    Thanks for the help and good luck!
