Seqanswers Leaderboard Ad

**lynnco2008** · 05-07-2012, 09:50 AM

Hi wisense, have you removed adapter sequences from your reads? To me, it seems that you have removed the adapter. Those reads that contains adapter will be break into 2 part, which is like "AGTAGTCGCTGCATC[ADAPTER]GCTAGCTGATCGTCA", for instance. Then, for your left reads file, it contains
1. full read (which does not contain adapters)
2. left segment (AGTAGTCGCTGCATC)
3. right segment (GCTAGCTGATCGTCA)
When you mapping those reads to the reference genome or transcriptome, whatever, the logs will be saved individually.
That is just my guess. I am new to RNA-seq.

Best
You Li

**wisense** · 05-07-2012, 04:50 PM

Hi Li,thanks for your point.
As we can see,the number of the reads processed in the second and third file is 1448633,which is the reads failed to align in the first file. So I think tophat shear the unmapped reads into segments and then map them to the junctions?
The ratios of the mapped segments seem too low(only 38.62% and 37.65%),can you tell me which parameter is associated with it?

Best,
WSZ

**sdriscoll** · 05-07-2012, 05:29 PM

log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

Code:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.

**wisense** · 05-07-2012, 05:39 PM

Originally posted by sdriscoll View Post

log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

Code:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.

Thank you very much.

**masterpiece** · 05-23-2012, 09:28 PM

log files 2 and 3 are from Tophat's "segment mapping" stages. that's when it chops previously unaligned reads into segments (25bp min per segment) and aligns them to the genome to find both exons and junctions. for example if it didn't do this stage then 100bp reads would never yield any alignments to exons shorter than 100bp. Thanks to segment mapping I see coverage at short exons even with 100bp reads. this is also how Tophat can assemble a novel junction database without any reference annotation.

how many reads actually aligned? that's hard to say from those bowtie files. it would certainly be nice if Tophat reported that information. so far the only way I know how to do that is to actually count the read names in the BAM alignment file. there will be some multi-aligned reads in your BAM file unless you told Tophat to only report unique alignments so counting lines doesn't work. i only know to do something like this:

Code:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

that should return a row count of unique read names...it might take a while though since it's sorting 10's of millions of rows.

Thanks for the tips, but how about for paired reads dataset? The sam file don't keep the paired read information (/1 and /2). I think calculating using 'sort and uniq' may not be correct if broken paired reads happens for the mapping.

Can we just calculate number of unmapped reads from 'unmapped_left.fq.z' ?

**happydsw** · 05-29-2012, 02:43 AM

Originally posted by masterpiece View Post

Thanks for the tips, but how about for paired reads dataset? The sam file don't keep the paired read information (/1 and /2). I think calculating using 'sort and uniq' may not be correct if broken paired reads happens for the mapping.

Can we just calculate number of unmapped reads from 'unmapped_left.fq.z' ?

Did you have an answer now? I have the same question.

**masterpiece** · 06-08-2012, 02:18 AM

Did you have an answer now? I have the same question.

I'm afraid only the Tophat develop can answer this

**glados** · 06-21-2012, 03:47 AM

Have you tried samtools flagstat? It will give you the number of reads that mapped, and how many that mapped in a proper pair.

Code:

samtools flagstat file.bam

**carmeyeii** · 09-03-2012, 03:39 PM

Would you say that this:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

is the correct number to use to normalize the reads mapping onto a gene for library size, i.e, the M in FPKM? Which value do you use for normalization?

Many thanks for the input,

Carmen

**carmeyeii** · 09-26-2012, 01:10 PM

bumping

**turnersd** · 09-26-2012, 01:31 PM

I count up the unmapped reads and compare to the number of reads I started with. There's some directory structure stuff assumed here, but you'll get the idea. E.g. for tophat 1.4.1:

Code:

#!/bin/sh

# Directory structure:
# In current directory must have reads (or symbolic links to reads), and tophat output directory.
# Read filename pattern: read_1.fq, read_2.fq
# Tophat output directory pattern: read-tophat

echo "^sample^total_reads_each_end^mapped_left^pct_mapped_left^mapped_right^pct_mapped_right^pct_mapped_total^"

for sample in $@
do
	
	# TOTal reads for one end - divide by four to get # reads
	TOT=$(wc -l ${sample}_1.fq | cut -d ' ' -f1)
	TOT=$(echo "$TOT/4" | bc) # don't need -l, want integer.
	
	# Unmapped Left, Unmapped Right - divide by 4 to get total # reads
	UL=$(zcat ${sample}-tophat/unmapped_left.fq.z  | wc -l | cut -d ' ' -f1)
	UR=$(zcat ${sample}-tophat/unmapped_right.fq.z | wc -l | cut -d ' ' -f1)
	UL=$(echo "$UL/4" | bc)
	UR=$(echo "$UR/4" | bc)
	
	# Mapped Left, Mapped Right
	ML=$(echo "$TOT-$UL" | bc)
	MR=$(echo "$TOT-$UR" | bc)
	
	# Percentage Mapped Left, Percentage Mapped Right
	PML=$(echo "$ML/$TOT" | bc -l)
	PMR=$(echo "$MR/$TOT" | bc -l)
	
	# Percentage mapped total
	PMT=$(echo "($ML+$MR)/(2*$TOT)" | bc -l)
	
	# write output
	echo "|$sample|$TOT|$ML|$PML|$MR|$PMR|$PMT|"

done

For tophat2 (I've had trouble with tophat2 and don't use it much):

Code:

#!/bin/sh

# Directory structure:
# In current directory must have reads (or symbolic links to reads), and tophat output directory.
# Read filename pattern: read_1.fq, read_2.fq
# Tophat output directory pattern: read-tophat

echo "^sample^total_reads_both_ends^mapped^pct_mapped^"

for sample in $@
do
	
	# TOTal reads for one end - divide by four to get # reads for one end. Multiple x2 for both ends.
	TOT=$(wc -l ${sample}_1.fq | cut -d ' ' -f1)
	TOT=$(echo "$TOT/4" | bc) # divide by 4 to get number of reads in one end
	TOT=$(echo "$TOT*2" | bc) # multiple by 2 to get number of reads from both ends
	
	# Unmapped reads from both ends, one per line
	UNMAPPED=$(samtools view ${sample}-tophat/unmapped.bam | wc -l | cut -d ' ' -f1)
	
	# Total mapped
	MAPPED=$(echo "$TOT-$UNMAPPED" | bc -l)
	
	# Percentage mapped 
	PCT_MAPPED=$(echo "$MAPPED/$TOT" | bc -l)
	
	# write output
	echo "|$sample|$TOT|$MAPPED|$PCT_MAPPED|"

done

**maize** · 12-02-2012, 07:03 PM

Hi, sdriscoll
Your way of counting mapped reads in sam file is smart. I guess it is no problem for single-end. How about the pair-end read, because sam doesnot contain the pari information in sam file? is there a way to count how many pair-end reads retained in sam file?

**sdriscoll** · 12-02-2012, 11:30 PM

Actually I think you might be OK. It is to your advantage that the /1 and /2 bits have been cut off of the read names. the 'uniq' part of my goofy one-liner will take care of collapsing the duplicate names (from paired alignments) down to single lines. That shouldn't interfere with the read counts.

For the record the SAM format does retain the paired information it's just set in the FLAG field, column 2 of the SAM file. That field is hard to read because it's a bit flag field. So when you see 67 (decimal) that's 43 (hex) and 0100 0011 (binary). bit flags means they are using some 16-bit integer value to store the 11 different flags they describe in the SAM format spec PDF (download it from the samtools site). If 0x40 or 0x80 are set that means the read is either the first or second in a pair, respectively. If you know perl or python you can make a script that checks those fields. samtools view also provides a way to filter what it returns based on those flags with the -f and -F options.

For example you can exclude unmapped reads and secondary alignments from the output of samtools view like this:

Code:

samtools view -F 0x100 -F 0x4 aln.bam

Or you can view only the first read of each properly aligned pair like this:

Code:

samtools view -f 0x2 -f 0x40 -F 0x100 aln.bam

In other words you could count more specific things in this way. just as glados mentioned, samtools flagstat provides some basic count information based on the FLAG field. It provides a breakdown of several different paired-read count stats. That command completely relies on the FLAG field so as long as the aligner is setting that flag correctly it will produce real counts.

I guess when you take the FLAG field into consideration for counting reads it gets a little confusing depeneding on what the aligner did. For example if tophat can't properly align paired reads it might report one or the other (or maybe both) as aligned reads but not properly aligned. So maybe you want to count that as two alignments or maybe you don't want to count that one at all. The nice thing about 'flagstat' is it breaks out that information for you. So you can see how many of those type of alignments happened as well as how many proper alignments happened.

**sdriscoll** · 12-02-2012, 11:42 PM

Originally posted by carmeyeii View Post

Would you say that this:

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

is the correct number to use to normalize the reads mapping onto a gene for library size, i.e, the M in FPKM? Which value do you use for normalization?

Many thanks for the input,

Carmen

Hi Carmen,
I haven't ever used that count for FPKM normalization but I guess you could and it wouldn't hurt anything. It IS an accurate count of the number of aligned reads. If you have counted reads per feature using something like htseq-count it's also acceptable to just use the sum total hits at features for the M value as well. Chances are the difference between the amount of reads mapped to the genome and those mapping to genes isn't too large and the two methods might result in similar numbers. The argument for using only counts to features is to avoid bias in the FPKM's due to large numbers of alignments in intronic regions. The argument against that might be that the genes are getting X reads per million INCLUDING that intronic stuff and that means something...so the intronic data shouldn't be ignored.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 58 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How many reads mapped to the genome?(about TopHat output)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News