Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unique mapping with Tophat2

    Hello everyone,

    I have seen posts about filtering unique hits after default run with Tophat2 but some say I should get the reads with MAPQ=255, some say I should get the reads with NH:i:1. Can you briefly describe how to collect the unique hits after tophat2 run, I am a little confused :/

  • #2
    I think the most recent versions use a MAPQ of 50 for unique alignments, but in either case a threshold of 5 will filter out the multimappers (the current possible MAPQs are 0, 1, 2, 3, and 50, if I remember correctly). At least htseq-count will automatically filter by NH tag if it's there, though I always specified a MAPQ threshold anyway.

    Comment


    • #3
      Thanks to devon, I assume I could get the unique hits from a default run (-g 20).

      But to confirm if I get the unique hits, I performed several runs. So, if I substract
      "-g 1" number of reads from "-g 2" number of reads, I should find the number of multi-
      hitting reads, right? In my case it is: 20,064,410 - 19,499,641 = 564,749 reads seems to
      be multi mapping. If I substract 564,749 from "-g 1" number of reads, I should find the
      uniquely mapping reads: 19,499,461 - 564,749 = 18,934,712? But when I typed
      "samtools view -q 50 -o output_file accepted_hits.bam" code to get the number of unique
      hits from a default run (-g 20), it is: 19,170,015 :/ so it's not equal, am I thinking
      wrong?

      Also what I have realized is that if I try to filter unique hits from a "-g 2", the number is: 19,136,250 but if I try to filter unique hits from a "-g 20" (default), it is: 19,170,015 , I expected that they would be also equal in number. What am I missing?

      Comment


      • #4
        I'm not sure what's going on in your case, but there are certainly problems with the mapq encoding in tophat. Specifically it seems to calculate the mapq value after doing the -g filtering instead of before, so if you specify -g 1 then all reads (even ones with lots of possible mapping positions) will get a mapq of 50.

        If you're using -g > 1 then you could also compare the mapq values to the NH:i:x values since these should correctly report whether the hit is unique or not.

        Comment


        • #5
          yes, you are right. If you do a "-g 1" run, all the reads have MAPQ of 50, so it defines uniqueness depending on the output file.

          Comment


          • #6
            Originally posted by sazz View Post
            yes, you are right. If you do a "-g 1" run, all the reads have MAPQ of 50, so it defines uniqueness depending on the output file.
            Indeed, but it really shouldn't do that as it's completely misleading. The NH:i;x tags are specifically designed to show the number of redundant hits within the current file, but mapq should be a measure of mapping quality which is independent of how many hits you reported.

            Comment


            • #7
              I tried to find the number of unique mappings with "NH:i:1" tag but the result is exactly same as the number I find with MAPQ:50.

              "-g 2": unique hits both with NH:i:1 and MAPQ 50 is 19,136,250
              "-g 20": unique hits both with NH:i:1 and MAPQ 50 is 19,170,015
              "-g 1000": unique hits both with NH:i:1 and MAPQ 50 is 19,170,015

              Still I don't understand why there is difference between "-g 2" and "-g 20"?

              Comment


              • #8
                Originally posted by sazz View Post
                I tried to find the number of unique mappings with "NH:i:1" tag but the result is exactly same as the number I find with MAPQ:50.

                "-g 2": unique hits both with NH:i:1 and MAPQ 50 is 19,136,250
                "-g 20": unique hits both with NH:i:1 and MAPQ 50 is 19,170,015
                "-g 1000": unique hits both with NH:i:1 and MAPQ 50 is 19,170,015

                Still I don't understand why there is difference between "-g 2" and "-g 20"?
                I'm not sure. In the controlled tests I did to get to the bottom of this we got back what we expected, with the -g filter being applied correctly. We also tested the --report-secondary-alignments options which also seemed to work OK. The only problem we hit was with the mapq values.

                As another check you could look at the 0x100 position in the flag value. This should be false for only one read in each set (the primary alignment) so finding the number of reads for which this is false should also give you a measure of different hits.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Non-Coding RNA Research and Technologies
                  by seqadmin




                  Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                  Nobel Prize for MicroRNA Discovery
                  This week,...
                  Yesterday, 08:07 AM
                • seqadmin
                  Recent Developments in Metagenomics
                  by seqadmin





                  Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                  09-23-2024, 06:35 AM
                • seqadmin
                  Understanding Genetic Influence on Infectious Disease
                  by seqadmin




                  During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                  Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                  09-09-2024, 10:59 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 10-02-2024, 04:51 AM
                0 responses
                95 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 10-01-2024, 07:10 AM
                0 responses
                106 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 09-30-2024, 08:33 AM
                1 response
                106 views
                0 likes
                Last Post EmiTom
                by EmiTom
                 
                Started by seqadmin, 09-26-2024, 12:57 PM
                0 responses
                20 views
                0 likes
                Last Post seqadmin  
                Working...
                X