  • How is mapping quality estimated?

    Hi everyone,

    I apologize if this is a repost; please refer me to an older discussion if it is (I couldn't find one myself).

    I am trying to understand how to handle the mapping quality scores in the output of mappers like BWA, Bowtie, BFAST, etc. To do this, I first want to understand exactly how they are estimated.

    From what I've found so far, this score is an estimate of the likelihood that a mapping is wrong. I saw the formula here, but I still can't figure out why it represents that likelihood.

    Can someone please explain?

    Thanks,
    Amit

  • #2
    It depends completely on the aligner and, in any case, is always just an estimate. How Bowtie2 and BWA compute it is described by me and another user on Biostars.
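
    As a rough guide, independent of any particular aligner: the SAM spec defines MAPQ as -10*log10 of the estimated probability that the reported mapping position is wrong; the aligner-specific part is how that probability is estimated. A minimal sketch of the conversion, assuming that definition:

    def mapq_to_error_prob(mapq):
        # MAPQ = -10 * log10(P(mapping position is wrong)), per the SAM spec.
        return 10 ** (-mapq / 10.0)

    print(mapq_to_error_prob(30))  # ~0.001, i.e. roughly a 1-in-1000 chance the position is wrong

    So MAPQ 30 corresponds to roughly a 1-in-1000 chance of a wrong position; how trustworthy that number is depends on how the aligner estimated it.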

    • #3
      Originally posted by dpryan
      It depends completely on the aligner and, in any case, is always just an estimate. How Bowtie2 and BWA compute it is described by me and another user on Biostars.
      Thank you very much! That was helpful.

      Given what you said, how do you interpret these quality scores? I mean, if they don't necessarily represent the actual error rate, how can you tell which mapping is accurate enough for you?
      With the people I've worked with, it was usually based on hunches. Is there an empirical way to tell?

      Does the estimation take the reference data into account? I mean, in terms of empirically measuring error rates on the reference genome in question.

      Thanks

      • #4
        Originally posted by AmitL
        Thank you very much! That was helpful.

        Given what you said, how do you interpret these quality scores? I mean, if they don't necessarily represent the actual error rate, how can you tell which mapping is accurate enough for you?
        With the people I've worked with, it was usually based on hunches. Is there an empirical way to tell?

        Does the estimation take the reference data into account? I mean, in terms of empirically measuring error rates on the reference genome in question.

        Thanks
        The mapping score alone is never going to tell you the true probability that a mapping is correct. The best way to determine that is by looking at an ROC curve generated from synthetic data, which will tell you the empirically determined true positive and false positive rates for a given quality cutoff. BBTools has some useful tools for this. For example, using the E. coli reference:

        First index the reference:
        bbmap.sh ref=ecoli.fa -Xmx1g

        Then generate tagged synthetic reads:
        randomreads.sh build=1 reads=100000 length=150 minq=5 midq=20 maxq=30 out=reads.fq -Xmx1g

        Then map them and produce a sam file:
        bbmap.sh in=reads.fq out=mapped.sam -Xmx1g
        (or a different command using your aligner of choice)

        Then generate an ROC curve:
        samtoroc.sh in=mapped.sam reads=100000 > roc.txt

        Then plot the ROC curve using Excel, or whatever. The columns are labeled; to determine whether the aligner was exactly right, you would plot "truePositiveStrict" versus "falsePositiveStrict". Bear in mind that the resulting relationship between mapping score and accuracy will only be precisely valid for that genome, with that particular error model, and that particular read length, but it will be roughly maintained across different organisms, error models, and read lengths.
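
        If you'd rather script the plot than use Excel, here's a minimal sketch in Python, assuming roc.txt is a tab-delimited table whose header row includes the "truePositiveStrict" and "falsePositiveStrict" columns described above (check the actual header samtoroc.sh writes and adjust the column names/delimiter if needed):

        import csv
        import matplotlib.pyplot as plt

        tp, fp = [], []
        with open("roc.txt") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                # Column names assumed from the description above.
                tp.append(float(row["truePositiveStrict"]))
                fp.append(float(row["falsePositiveStrict"]))

        plt.plot(fp, tp)
        plt.xlabel("falsePositiveStrict")
        plt.ylabel("truePositiveStrict")
        plt.savefig("roc.png")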

        randomreads.sh is also capable of adding other things, like indels and no-calls, but that is not yet documented.

        • #5
          I'd just like to reiterate Brian's point. Looking at synthetic data is absolutely the most reliable way to arrive at a meaningful threshold. You can find ROC curves in most aligner papers, though it's always best to make your own with random reads matching your Phred score and mismatch profile.

          • #6
            Thank you Brian!

            So if I understand you correctly, you suggest doing a "recalibration" of the error estimates using simulated data. Is that correct?

            How well do simulated reads represent real sequencing data - in terms of quality, positional bias (if any?), naturally occurring variants, etc?

            Cheers

            • #7
              Originally posted by AmitL
              Thank you Brian!

              So if I understand you correctly, you suggest doing a "recalibration" of the error estimates using simulated data. Is that correct?
              Essentially, yes. Either recalibrate if you will be using the mapping score for something sensitive, or just decide on an acceptable threshold and throw away all reads with a lower score.
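
              If you go the threshold route, here is a minimal sketch of applying a cutoff, using pysam as an example (pysam isn't mentioned above, and the cutoff of 20 is arbitrary; pick whatever your ROC curve justifies):

              import pysam

              # Keep only alignments whose mapping quality meets the chosen cutoff.
              with pysam.AlignmentFile("mapped.sam") as infile, \
                   pysam.AlignmentFile("filtered.bam", "wb", template=infile) as outfile:
                  for read in infile:
                      if read.mapping_quality >= 20:
                          outfile.write(read)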

              How well do simulated reads represent real sequencing data - in terms of quality, positional bias (if any?), naturally occurring variants, etc?

              Cheers
              Depends on the simulator. The command I gave you just generates reads with errors based on a quality profile that reflects that of Illumina reads. You could additionally add SNPs and indels using flags like this:
              snprate=0.5 maxsnps=4 delrate=0.5 maxdellen=15 maxdels=2 insrate=0.5 maxinslen=15 maxinss=2
              ...which would add up to 4 SNPs, 2 deletions, and 2 insertions per read, independent of the quality values of the nearby bases. However, they would be totally random with respect to the underlying genome. Some sequencing platforms have a positional bias - e.g., 454 is more likely to have indels in homopolymers. My generator will not reflect such biases, so it's not quite the same. As for a real genome's bias against SNPs that change amino acids, I doubt that would make a noticeable difference in the results, but that's also not modeled. There are other important events, such as adapter sequence contamination, that are also nonrandom; I have another tool, "addadapters.sh", that can add specified adapter sequences to reads at random positions, but randomreads.sh does not do that.
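
              For example, adding those flags to the earlier read-generation command would look something like this (the same flags as above, just combined; adjust the values to taste):
              randomreads.sh build=1 reads=100000 length=150 minq=5 midq=20 maxq=30 snprate=0.5 maxsnps=4 delrate=0.5 maxdellen=15 maxdels=2 insrate=0.5 maxinslen=15 maxinss=2 out=reads.fq -Xmx1g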

              It's possible that there are read generators available that model all of these biases, but I have not studied other generators, so I'm not aware of any. FYI, wgsim is probably the most widely used.
              Last edited by Brian Bushnell; 04-22-2014, 12:28 PM.

              • #8
                Thanks for the detailed explanations; I'll have a look at the tools you mentioned.

                Have you published any comparison results for BBMap versus other mappers, or a paper about the tool? I'd be happy to read more about it.

                • #9
                  BBMap is not yet published - I'm still working on the publication - though it will be mentioned in a few upcoming papers. I have a poster, though, using data from last year.
                  Attached Files

                  • #10
                    Looks interesting; I hope it makes it into a journal soon.
                    Thanks for the help and good luck!
