  • rajesh1989
    Junior Member
    • Feb 2015
    • 7

    What is wrong with samtools flagstat or read mapping with TopHat?

    I have 6,673,385 (around 6.7 million) reads in each paired-end file after quality filtering, but when I map them with TopHat and run samtools flagstat on the BAM file, I get the following output:
    1343686 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1343686 + 0 mapped (100.00%:-nan%)
    1343686 + 0 paired in sequencing
    670808 + 0 read1
    672878 + 0 read2
    1203600 + 0 properly paired (89.57%:-nan%)
    1311198 + 0 with itself and mate mapped
    32488 + 0 singletons (2.42%:-nan%)
    15874 + 0 with mate mapped to a different chr
    452 + 0 with mate mapped to a different chr (mapQ>=5)
    I am not sure how to interpret the samtools flagstat output, but I assume that only 670808 (around 0.67 million) reads from pair 1 and 672878 (around 0.67 million) reads from pair 2 are mapped. Is that correct? That is 1/10th of the total input reads. Where are the rest of my reads?

    The report produced by TopHat shows different statistics:

    Left reads:
    Input : 468668
    Mapped : 443344 (94.6% of input)
    of these: 216780 (48.9%) have multiple alignments (1 have >20)
    Right reads:
    Input : 468668
    Mapped : 444468 (94.8% of input)
    of these: 217699 (49.0%) have multiple alignments (1 have >20)
    94.7% overall read mapping rate.
    Aligned pairs: 433356
    of these: 211726 (48.9%) have multiple alignments
    5512 ( 1.3%) are discordant alignments
    91.3% concordant pair alignment rate.

    Why does TopHat say it mapped around 94% of reads when there were around 6.7 million reads at the beginning?
    How should I interpret all these numbers? Thank you.
  • gringer
    David Eccles (gringer)
    • May 2011
    • 845

    #2
    samtools flagstat can only report what's in the file, so if there are no unmapped reads in the BAM file then the calculated mapping rate will be 100% (with some reduction in that due to unpaired and low-quality mappings, if included).
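
To illustrate the point above, here is a minimal Python sketch that parses the flagstat lines quoted in the first post and compares two ratios: mapped records over total records in the BAM file (what flagstat prints), and read1 records over the number of input reads stated by the poster. The function name `parse_flagstat` and the `WANTED` table are hypothetical helpers, not part of samtools. Note also that flagstat counts alignment records, so a read with several reported alignments may be counted more than once, depending on the samtools version and on how the aligner flags secondary alignments.

```python
import re

# Prefixes of the flagstat lines this sketch cares about (hypothetical helper table).
WANTED = {"total": "in total", "mapped": "mapped (", "read1": "read1", "read2": "read2"}


def parse_flagstat(text):
    """Return the QC-passed count for a few selected flagstat lines."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+) \+ (\d+) (.*)$", line)
        if not m:
            continue
        for key, prefix in WANTED.items():
            if m.group(3).startswith(prefix):
                counts[key] = int(m.group(1))
    return counts


# Lines copied from the first post.
flagstat_text = """\
1343686 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1343686 + 0 mapped (100.00%:-nan%)
1343686 + 0 paired in sequencing
670808 + 0 read1
672878 + 0 read2
"""

counts = parse_flagstat(flagstat_text)
input_reads_per_file = 6673385  # reads per FASTQ file, as stated in the first post

# Ratio computed only from what is in the BAM file.
rate_in_file = counts["mapped"] / counts["total"]
# Ratio against the number of reads that went into the aligner.
read1_vs_input = counts["read1"] / input_reads_per_file

print(f"mapped / total records in BAM: {rate_in_file:.2%}")
print(f"read1 records / input reads:   {read1_vs_input:.2%}")
```

The first ratio is 100% simply because the file contains no unmapped records; only the second ratio says anything about how many of the original reads ended up in the BAM file.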


    • rajesh1989
      Junior Member
      • Feb 2015
      • 7

      #3
      Thank you for the reply.
      This is the content of TopHat's prep_reads.info:

      left_min_read_len=25
      left_max_read_len=101
      left_reads_in =6673385
      left_reads_out=6667431
      right_min_read_len=25
      right_max_read_len=101
      right_reads_in =6673385
      right_reads_out=6673220

      Where are the rest of the reads if TopHat didn't map them? I also checked unmapped.bam; its size is very small.


      • fanli
        Senior Member
        • Jul 2014
        • 197

        #4
        Originally posted by rajesh1989:
        The report produced by TopHat shows different statistics:

        Left reads:
        Input : 468668
        Mapped : 443344 (94.6% of input)
        of these: 216780 (48.9%) have multiple alignments (1 have >20)
        Right reads:
        Input : 468668
        Mapped : 444468 (94.8% of input)
        of these: 217699 (49.0%) have multiple alignments (1 have >20)
        94.7% overall read mapping rate.
        Aligned pairs: 433356
        of these: 211726 (48.9%) have multiple alignments
        5512 ( 1.3%) are discordant alignments
        91.3% concordant pair alignment rate.
        This says your input to TopHat is only ~469k read pairs. This directly contradicts what you posted from prep_reads.info. Are you sure you don't have mismatched files?


        • rajesh1989
          Junior Member
          • Feb 2015
          • 7

          #5
          Hello,

          Everything I have written here is correct; I just copied the details and pasted them here.

          What do you mean by mismatched files?

          That is my actual question: why is TopHat taking only ~469k read pairs?


          • fanli
            Senior Member
            • Jul 2014
            • 197

            #6
            I mean, could the prep_reads.info be from one sample and the TopHat align_summary from another?


            • rajesh1989
              Junior Member
              • Feb 2015
              • 7

              #7
              No, they are not; I have those two files in the very same folder.


              • fanli
                Senior Member
                • Jul 2014
                • 197

                #8
                Perhaps you have mixed up files in your script. You may want to check the logs in your TopHat output directory.

                As an example, here's what my align_summary.txt looks like:
                Code:
                Left reads:
                          Input     :   6551998
                           Mapped   :   5980941 (91.3% of input)
                            of these:    199516 ( 3.3%) have multiple alignments (10560 have >10)
                Right reads:
                          Input     :   6551998
                           Mapped   :   5574400 (85.1% of input)
                            of these:    184354 ( 3.3%) have multiple alignments (10346 have >10)
                88.2% overall read mapping rate.
                
                Aligned pairs:   5394272
                     of these:    177939 ( 3.3%) have multiple alignments
                                  148603 ( 2.8%) are discordant alignments
                80.1% concordant pair alignment rate.
                and the corresponding prep_reads.info:
                Code:
                left_min_read_len=75
                left_max_read_len=75
                left_reads_in =6551998
                left_reads_out=6544622
                right_min_read_len=75
                right_max_read_len=75
                right_reads_in =6551998
                right_reads_out=6495499
                Note that both files refer to 6551998 as the number of read pairs input.
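
The consistency check described above can be sketched in Python as follows. This is only an illustration under the assumption that both files look like the excerpts pasted in this thread; the function names (`prep_reads_inputs`, `align_summary_inputs`, `inputs_agree`) are hypothetical and not part of TopHat. The text is embedded as strings here; in practice it would be read from the two files in the TopHat output directory.

```python
import re


def prep_reads_inputs(text):
    """Return (left_reads_in, right_reads_in) from prep_reads.info text."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in ("left_reads_in", "right_reads_in"):
            values[key] = int(value.strip())
    return values["left_reads_in"], values["right_reads_in"]


def align_summary_inputs(text):
    """Return the (left, right) 'Input' counts from align_summary.txt text."""
    found = [int(n) for n in re.findall(r"Input\s*:\s*(\d+)", text)]
    if len(found) < 2:
        raise ValueError("expected two 'Input' lines (left and right reads)")
    return found[0], found[1]


def inputs_agree(prep_text, summary_text):
    """True if both files report the same number of input reads."""
    return prep_reads_inputs(prep_text) == align_summary_inputs(summary_text)


# Values copied from the posts in this thread.
rajesh_prep = """\
left_reads_in =6673385
left_reads_out=6667431
right_reads_in =6673385
right_reads_out=6673220
"""
rajesh_summary = """\
Left reads:
Input : 468668
Mapped : 443344 (94.6% of input)
Right reads:
Input : 468668
Mapped : 444468 (94.8% of input)
"""
fanli_prep = """\
left_reads_in =6551998
left_reads_out=6544622
right_reads_in =6551998
right_reads_out=6495499
"""
fanli_summary = """\
Left reads:
          Input     :   6551998
           Mapped   :   5980941 (91.3% of input)
Right reads:
          Input     :   6551998
           Mapped   :   5574400 (85.1% of input)
"""

print("rajesh1989 files agree:", inputs_agree(rajesh_prep, rajesh_summary))
print("fanli files agree:     ", inputs_agree(fanli_prep, fanli_summary))
```

A mismatch like the one in the original poster's files means the percentages in align_summary.txt are relative to a much smaller input than the one recorded in prep_reads.info, which is why a 94% mapping rate can coexist with most reads being absent from the BAM file.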


                • rajesh1989
                  Junior Member
                  • Feb 2015
                  • 7

                  #9
                  I found the answer. I think this is an issue with multi-threading: when I run TopHat on a single core, I get correct results. Other people have also reported this issue.
