Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tophat bam shows more reads than the library

    Hi there,

    I've run tophat (version 2.0.4 and encode gtf) to a paired-end 50bp RNA-Seq data . The fastq files shows the total number of reads is 36,497,412, after tophat run, I used samtools to check the bam file, I got:

    samtools flagstat accepted_hits.bam
    38,502,265 in total
    0 QC failure
    0 duplicates
    38502265 mapped (100.00%)
    38502265 paired in sequencing
    19645891 read1
    18856374 read2
    17808682 properly paired (46.25%)
    32180420 with itself and mate mapped
    6321845 singletons (16.42%)
    4384082 with mate mapped to a different chr
    301920 with mate mapped to a different chr (mapQ>=5)

    My questions are:

    1. Why is the total number of reads (38,502,265) from bam file is larger than my original library size (36,497,41)?

    2. The flagstat shows that there is no unmapped reads. In this case, what are the proper ways to calculate the percentage of unmapped reads?

    Thanks very much. Any suggestions are welcome.

  • #2
    But the number 36,497,412 is relative to R1 and R2 fastq file together?

    Comment


    • #3
      yes, the 36,497,412 is the total number of my raw reads.

      Comment


      • #4
        TopHat allows reads to map to more than one place in the genome (multihits). This can occur for reads that overlap with elements that repeat in the transcriptome. This means that those reads get counted twice since each multiread alignment is a separate line in the bam file.

        Tophat also does not keep reads that don't align in the bam file, so you'd have to go back and count the number of reads in the original fastq file (basically the number of lines in the fastq file divided by 4) to get the original read count including the reads that do not align and look at the difference between that and the bam file. You can use samtools view -F flag to output a bam file without the multireads. In this case the you'd use -F 256 to return a bam file without aligned reads that are not the primary alignment (see http://picard.sourceforge.net/explain-flags.html for more explanation of how to set the -f/-F flag to get the types of reads you want).

        Edit: Also, by default, tophat will give back up to 20 alignments for the same read. You can change this option with -g/--max-multihits <int> on the command line as noted in the manual.
        Last edited by jb2; 02-07-2013, 01:29 PM.

        Comment


        • #5
          Thanks a lot, jb2.

          Comment


          • #6
            I am currently stuck with a similar problem, maybe someone here can help.
            Ultimately I am trying to determine the number of reads that were mapped by TopHat.
            STEP 1
            First I used "samtools flagstat" on accepted_hits.bam:
            76865299 + 0 in total (QC-passed reads + QC-failed reads)
            0 + 0 duplicates
            76865299 + 0 mapped (100.00%:-nan%)
            0 + 0 paired in sequencing
            0 + 0 read1
            0 + 0 read2
            0 + 0 properly paired (-nan%:-nan%)
            0 + 0 with itself and mate mapped
            0 + 0 singletons (-nan%:-nan%)
            0 + 0 with mate mapped to a different chr
            0 + 0 with mate mapped to a different chr (mapQ>=5)
            As you can see, I have unpaired reads.
            If I understand jb2 correctly, samtools spuriously tells me 100% were mapped because it can't see any of the reads that TopHat threw out. While it looks kind of suspicious that someone would write code to tell you (some%) when it can't ever count unmapped reads, I can believe that's what's happening here. The problem is that when I run "wc -l" on the original fastq file and divide by four, I get 70,247,418. This is significantly less than 76,865,299. As far as I can tell, this may be because TopHat aligns some reads multiple times, so samtools is giving me a number that includes duplicates and increasing my percentage of mapped reads beyond 100%.

            STEP 2
            To get around this hurdle and determine the number of unique reads aligned, I used a script from this thread discussing techniques for counting reads:
            samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
            Which output: 63,702,278
            Great! So it looks like I have 70,247,418 reads; 63,702,278 of which were mapped with TopHat, some more than once; giving me 76,865,299 total mapped reads (including duplicates).

            STEP 3 - TopHat log files
            Here's what "bowtie.left_kept_reads" tells me:
            70221973 reads; of these:

            70221973 (100.00%) were unpaired; of these:

            11079060 (15.78%) aligned 0 times

            44347197 (63.15%) aligned exactly 1 time

            14795716 (21.07%) aligned >1 times

            84.22% overall alignment rate
            Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
            Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
            Anyone know a better way to count reads?

            I appreciate anyone who can help, since this was a longer post than I first intended.
            Thanks!

            Comment


            • #7
              Do you guys think the tophat log outputs correspond well enough with the numbers from STEP 2?
              Why does samtools flagstat pretend to give a mapping percentage if it can't ever count the unmapped reads?
              Anyone know a better way to count reads?

              I appreciate anyone who can help, since this was a longer post than I first intended.
              Thanks!
              Of course samtools can count unmapped reads. It does so for me everyday. It counts every .bam entry with the 4 flagged.

              But it can't count what isn't there. It's Bowtie that is throwing them out. (in contrast, when bwa makes a .sam file, it leaves unmapped reads in) You can set bowtie to write the unmapped reads to another file.

              Comment


              • #8
                Small correction, it is TopHat that is throwing them out. I am pretty sure that Bowtie (when run by itself) leaves the unmapped reads in the output.

                Comment


                • #9
                  Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)

                  Comment


                  • #10
                    Originally posted by Siva View Post
                    Actually unmapped hits are written separately under unmapped.bam. You can run samtools -flagstat on that and get the number of unmapped reads (QC pass + QC fail)
                    Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.

                    Comment


                    • #11
                      Originally posted by swbarnes2 View Post
                      Keep in mind that the QC pass/fail numbers only mean something if there is software in your pipeline somewhere that checks for QC and sets/changes the flag in the .bam appropriately. SAMTools flagstat is just reading the flags, it's not setting them. I don't know that bowtie/tophat will attempt to set or change that flag at all.
                      Thanks! I do understand your point about QC.

                      I have some tophat2 generated BAM files (50 base paired end RNA seq). I ran tophat with all default options. I want to get the number of uniquely mapped reads. I get a feeling that it isn't all that straight forward to get them. I am walking you through my steps and would like to get comments.

                      On the original accepted_hits.bam file I ran samtools -flagstat and got the following
                      Code:
                      145949292 + 0 in total (QC-passed reads + QC-failed reads)
                      0 + 0 duplicates
                      145949292 + 0 mapped (100.00%:-nan%)
                      145949292 + 0 paired in sequencing
                      73274066 + 0 read1
                      72675226 + 0 read2
                      114124162 + 0 properly paired (78.19%:-nan%)
                      138690450 + 0 with itself and mate mapped
                      7258842 + 0 singletons (4.97%:-nan%)
                      13016262 + 0 with mate mapped to a different chr
                      417680 + 0 with mate mapped to a different chr (mapQ>=5)
                      In order to get uniquely mapped reads, I think in in tophat (bowtie2)....the tag that signifies (highest probability) of being unique is "NH:i:1". So I then grepped for only lines with "NH:i:1" tag and got a sam file. I then made a bam file (accepted_hits_NH1.bam) and performed samtools flagstat and got the following:

                      Code:
                      108602832 + 0 in total (QC-passed reads + QC-failed reads)
                      0 + 0 duplicates
                      108602832 + 0 mapped (100.00%:-nan%)
                      108602832 + 0 paired in sequencing
                      54503639 + 0 read1
                      54099193 + 0 read2
                      95543890 + 0 properly paired (87.98%:-nan%)
                      104454694 + 0 with itself and mate mapped
                      4148138 + 0 singletons (3.82%:-nan%)
                      1058650 + 0 with mate mapped to a different chr
                      417680 + 0 with mate mapped to a different chr (mapQ>=5)
                      Is it then safe to assume that the number of uniquely mapped reads are the sum of properly paired and singletons from my NH:i:1 only bam file. Thus
                      No. of properly paired reads (95543890) + singletons (4148138) = 99692028.

                      By the way...is it better if I re -run tophat with
                      --report secondary alignments
                      ON so that I can see the difference between the AS and XS (secondary alignment) and use that as a measure of uniqueness?
                      Last edited by Siva; 03-12-2013, 11:05 AM. Reason: Adding info

                      Comment


                      • #12
                        Try using:

                        Code:
                        cut -f 1 <samfile> | sort | uniq -u | wc -l
                        Using the --report secondary alignments will force tophat2 to report a lower quality secondary alignment if one exists. The choice to use it depends I guess on how stringent you want to be in defining "uniquely mapped". It will lower a great deal the number of uniquely mapped reads and I would not typically use it.

                        Comment


                        • #13
                          Originally posted by chadn737 View Post
                          Try using:

                          Code:
                          cut -f 1 <samfile> | sort | uniq -u | wc -l
                          Hi Chad...
                          This will give a set of unique read IDs but not always uniquely mapped (aligned reads). I think it depends (as you point out) on how we ask tophat to align.

                          In default mode tophat reports best aligned reads with varying AS scores.The higher the AS score, the more uniquely mapped it is. We can ask tophat to report up to 20 extra alignments using --report-secondary-alignments. The optional tags AS (primary alignment score) and XS (Extra alignment score) can be used to report the read as uniquely mapped or not.

                          cheers
                          Siva
                          Last edited by Siva; 03-13-2013, 09:04 AM. Reason: typos

                          Comment


                          • #14
                            You're right Siva. That code goes off of the read IDs, so it will tell you the number of unique IDs in that file.

                            With the older versions of Tophat, where all unique reads simply had a score of 255, and multimapped reads were either 0,1,2,3 then there was no way to really distinguish this on the basis of mapping scores in the sam file.

                            So ultimately it depends on how stringently you want to define "unique" and what your purposes for that is.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Non-Coding RNA Research and Technologies
                              by seqadmin




                              Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                              Nobel Prize for MicroRNA Discovery
                              This week,...
                              10-07-2024, 08:07 AM
                            • seqadmin
                              Recent Developments in Metagenomics
                              by seqadmin





                              Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                              09-23-2024, 06:35 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Today, 06:35 AM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 02:44 PM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 10-11-2024, 06:55 AM
                            0 responses
                            15 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 10-02-2024, 04:51 AM
                            0 responses
                            111 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X