Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

How many reads mapped to the genome?(about TopHat output)

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How many reads mapped to the genome?(about TopHat output)

    Hello,all
    I use tophat to analyze a RNA-seq data,while I find the log folder contain many log files,such as "bowtie.left_kept_reads.fixmap.log","bowtie.left_kept_reads_seg1.fixmap.log"and"bowtie.left_kept_reads_seg2.fixmpa.log"
    The content of the 3 files are like this:

    "bowtie.left_kept_reads.fixmap.log":
    # reads processed: 12384696
    # reads with at least one reported alignment: 10847735 (87.59%)
    # reads that failed to align: 1448633 (11.70%)
    # reads with alignments suppressed due to -m: 88328 (0.71%)
    Reported 12152934 alignments to 1 output stream(s)

    "bowtie.left_kept_reads_seg1.fixmap.log":
    # reads processed: 1448633
    # reads with at least one reported alignment: 559397 (38.62%)
    # reads that failed to align: 885817 (61.15%)
    # reads with alignments suppressed due to -m: 3419 (0.24%)
    Reported 895982 alignments to 1 output stream(s)

    "bowtie.left_kept_reads_seg2.fixmpa.log":
    # reads processed: 1448633
    # reads with at least one reported alignment: 545362 (37.65%)
    # reads that failed to align: 897019 (61.92%)
    # reads with alignments suppressed due to -m: 6252 (0.43%)
    Reported 850850 alignments to 1 output stream(s)

    My quetion is :What's the difference of the 3 files? And how many reads have been mapped to the genome in my analysis?

    Thanks.
    Last edited by wisense; 05-07-2012, 04:28 PM.

  • #2
    Hi wisense, have you removed adapter sequences from your reads? To me, it seems that you have removed the adapter. Those reads that contains adapter will be break into 2 part, which is like "AGTAGTCGCTGCATC[ADAPTER]GCTAGCTGATCGTCA", for instance. Then, for your left reads file, it contains
    1. full read (which does not contain adapters)
    2. left segment (AGTAGTCGCTGCATC)
    3. right segment (GCTAGCTGATCGTCA)
    When you mapping those reads to the reference genome or transcriptome, whatever, the logs will be saved individually.
    That is just my guess. I am new to RNA-seq.

    Best
    You Li

    Comment


    • #3
      Hi Li,thanks for your point.
      As we can see,the number of the reads processed in the second and third file is 1448633,which is the reads failed to align in the first file. So I think tophat shear the unmapped reads into segments and then map them to the junctions?
      The ratios of the mapped segments seem too low(only 38.62% and 37.65%),can you tell me which parameter is associated with it?

      Best,
      WSZ

      Comment


      • #4
        log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

        how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

        Code:
        samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
        that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Originally posted by sdriscoll View Post
          log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

          how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

          Code:
          samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
          that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.
          Thank you very much.

          Comment


          • #6
            log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

            how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

            Code:

            samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

            that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.
            Thanks for the tips, but how about for paired reads dataset? The sam file don't keep the paired read information (/1 and /2). I think calculating using 'sort and uniq' may not be correct if broken paired reads happens for the mapping.

            Can we just calculate number of unmapped reads from 'unmapped_left.fq.z' ?

            Comment


            • #7
              Originally posted by masterpiece View Post
              Thanks for the tips, but how about for paired reads dataset? The sam file don't keep the paired read information (/1 and /2). I think calculating using 'sort and uniq' may not be correct if broken paired reads happens for the mapping.

              Can we just calculate number of unmapped reads from 'unmapped_left.fq.z' ?
              Did you have an answer now? I have the same question.

              Comment


              • #8
                Did you have an answer now? I have the same question.
                I'm afraid only the Tophat develop can answer this

                Comment


                • #9
                  Have you tried samtools flagstat? It will give you the number of reads that mapped, and how many that mapped in a proper pair.

                  Code:
                  samtools flagstat file.bam

                  Comment


                  • #10
                    Would you say that this:

                    samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

                    is the correct number to use to normalize the reads mapping onto a gene for library size, i.e, the M in FPKM? Which value do you use for normalization?

                    Many thanks for the input,

                    Carmen

                    Comment


                    • #11
                      bumping

                      Comment


                      • #12
                        I count up the unmapped reads and compare to the number of reads I started with. There's some directory structure stuff assumed here, but you'll get the idea. E.g. for tophat 1.4.1:

                        Code:
                        #!/bin/sh
                        
                        # Directory structure:
                        # In current directory must have reads (or symbolic links to reads), and tophat output directory.
                        # Read filename pattern: read_1.fq, read_2.fq
                        # Tophat output directory pattern: read-tophat
                        
                        echo "^sample^total_reads_each_end^mapped_left^pct_mapped_left^mapped_right^pct_mapped_right^pct_mapped_total^"
                        
                        for sample in [email protected]
                        do
                        	
                        	# TOTal reads for one end - divide by four to get # reads
                        	TOT=$(wc -l ${sample}_1.fq | cut -d ' ' -f1)
                        	TOT=$(echo "$TOT/4" | bc) # don't need -l, want integer.
                        	
                        	# Unmapped Left, Unmapped Right - divide by 4 to get total # reads
                        	UL=$(zcat ${sample}-tophat/unmapped_left.fq.z  | wc -l | cut -d ' ' -f1)
                        	UR=$(zcat ${sample}-tophat/unmapped_right.fq.z | wc -l | cut -d ' ' -f1)
                        	UL=$(echo "$UL/4" | bc)
                        	UR=$(echo "$UR/4" | bc)
                        	
                        	# Mapped Left, Mapped Right
                        	ML=$(echo "$TOT-$UL" | bc)
                        	MR=$(echo "$TOT-$UR" | bc)
                        	
                        	# Percentage Mapped Left, Percentage Mapped Right
                        	PML=$(echo "$ML/$TOT" | bc -l)
                        	PMR=$(echo "$MR/$TOT" | bc -l)
                        	
                        	# Percentage mapped total
                        	PMT=$(echo "($ML+$MR)/(2*$TOT)" | bc -l)
                        	
                        	# write output
                        	echo "|$sample|$TOT|$ML|$PML|$MR|$PMR|$PMT|"
                        
                        done
                        For tophat2 (I've had trouble with tophat2 and don't use it much):

                        Code:
                        #!/bin/sh
                        
                        # Directory structure:
                        # In current directory must have reads (or symbolic links to reads), and tophat output directory.
                        # Read filename pattern: read_1.fq, read_2.fq
                        # Tophat output directory pattern: read-tophat
                        
                        echo "^sample^total_reads_both_ends^mapped^pct_mapped^"
                        
                        for sample in [email protected]
                        do
                        	
                        	# TOTal reads for one end - divide by four to get # reads for one end. Multiple x2 for both ends.
                        	TOT=$(wc -l ${sample}_1.fq | cut -d ' ' -f1)
                        	TOT=$(echo "$TOT/4" | bc) # divide by 4 to get number of reads in one end
                        	TOT=$(echo "$TOT*2" | bc) # multiple by 2 to get number of reads from both ends
                        	
                        	# Unmapped reads from both ends, one per line
                        	UNMAPPED=$(samtools view ${sample}-tophat/unmapped.bam | wc -l | cut -d ' ' -f1)
                        	
                        	# Total mapped
                        	MAPPED=$(echo "$TOT-$UNMAPPED" | bc -l)
                        	
                        	# Percentage mapped 
                        	PCT_MAPPED=$(echo "$MAPPED/$TOT" | bc -l)
                        	
                        	# write output
                        	echo "|$sample|$TOT|$MAPPED|$PCT_MAPPED|"
                        
                        done

                        Comment


                        • #13
                          Hi, sdriscoll
                          Your way of counting mapped reads in sam file is smart. I guess it is no problem for single-end. How about the pair-end read, because sam doesnot contain the pari information in sam file? is there a way to count how many pair-end reads retained in sam file?

                          Comment


                          • #14
                            Actually I think you might be OK. It is to your advantage that the /1 and /2 bits have been cut off of the read names. the 'uniq' part of my goofy one-liner will take care of collapsing the duplicate names (from paired alignments) down to single lines. That shouldn't interfere with the read counts.

                            For the record the SAM format does retain the paired information it's just set in the FLAG field, column 2 of the SAM file. That field is hard to read because it's a bit flag field. So when you see 67 (decimal) that's 43 (hex) and 0100 0011 (binary). bit flags means they are using some 16-bit integer value to store the 11 different flags they describe in the SAM format spec PDF (download it from the samtools site). If 0x40 or 0x80 are set that means the read is either the first or second in a pair, respectively. If you know perl or python you can make a script that checks those fields. samtools view also provides a way to filter what it returns based on those flags with the -f and -F options.

                            For example you can exclude unmapped reads and secondary alignments from the output of samtools view like this:

                            Code:
                            samtools view -F 0x100 -F 0x4 aln.bam
                            Or you can view only the first read of each properly aligned pair like this:

                            Code:
                            samtools view -f 0x2 -f 0x40 -F 0x100 aln.bam
                            In other words you could count more specific things in this way. just as glados mentioned, samtools flagstat provides some basic count information based on the FLAG field. It provides a breakdown of several different paired-read count stats. That command completely relies on the FLAG field so as long as the aligner is setting that flag correctly it will produce real counts.

                            I guess when you take the FLAG field into consideration for counting reads it gets a little confusing depeneding on what the aligner did. For example if tophat can't properly align paired reads it might report one or the other (or maybe both) as aligned reads but not properly aligned. So maybe you want to count that as two alignments or maybe you don't want to count that one at all. The nice thing about 'flagstat' is it breaks out that information for you. So you can see how many of those type of alignments happened as well as how many proper alignments happened.
                            Last edited by sdriscoll; 12-02-2012, 11:42 PM.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #15
                              Originally posted by carmeyeii View Post
                              Would you say that this:

                              samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

                              is the correct number to use to normalize the reads mapping onto a gene for library size, i.e, the M in FPKM? Which value do you use for normalization?

                              Many thanks for the input,

                              Carmen
                              Hi Carmen,
                              I haven't ever used that count for FPKM normalization but I guess you could and it wouldn't hurt anything. It IS an accurate count of the number of aligned reads. If you have counted reads per feature using something like htseq-count it's also acceptable to just use the sum total hits at features for the M value as well. Chances are the difference between the amount of reads mapped to the genome and those mapping to genes isn't too large and the two methods might result in similar numbers. The argument for using only counts to features is to avoid bias in the FPKM's due to large numbers of alignments in intronic regions. The argument against that might be that the genes are getting X reads per million INCLUDING that intronic stuff and that means something...so the intronic data shouldn't be ignored.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment

                              Working...
                              X