Seqanswers Leaderboard Ad

**westerman** · 08-08-2013, 09:44 AM

I do use Tophat.

As it should be obvious, alignment percentages varies a lot depending on the sequenced organism and what reference you can get hold of. A recent project with Arabidopsis is getting 98% to 99% of the reads mapping using Tophat. Of course Arabidopsis is "simple" and highly characterized. A slightly older project with whale samples mapped against human reference (I couldn't find a better reference) had 8-9% mapping. Apple came out around 64%-66%. So choose your target number ... I've got all of them covered. :-)

My percentages *are* from the accepted_hits.bam file compared to the original fastq files. Of course you need to count up only the unique reads but this is easy enough to do via a script (mine is a one-liner perl script).

**apredeus** · 08-08-2013, 11:01 AM

Yes that's a very good elaboration. I meant more specific case of well known genomes (I normally work with mouse and some human samples), so I use percentage as a primitive but effective identifier of library quality.

A problem with taking unique reads only is in the fact that in RNA-seq, not all reads are actually unique - some strongly expressed genes are expected to generate quite a bit of exactly same reads, especially at substantial sequencing depths.

One can, of course, calculate (unique aligned)/(unique all) reads, but that's still not quite what I'm looking for. I guess I want to know the number of those reads that just did not align anywhere at all. What would be a good way to estimate that number?

**GenoMax** · 08-08-2013, 11:08 AM

You can write the unmapped reads to a file and count them if you are using bowtie.

http://seqanswers.com/forums/showthread.php?t=31242 (post #3).
http://seqanswers.com/forums/showthread.php?t=32639 (post #6)

**apredeus** · 08-08-2013, 11:14 AM

Yes, that's really not a problem with bowtie. First of all the percent of aligned reads is reported in the log (correctly, as far as I can tell), and second, bowtie by default writes all non-aligned reads into the BAM file with an appropriate binary flag (4) in the second column. So simple "samtools flagstat file.bam" command should also produce correct alignment statistics.

Tophat, however, skips the un-aligned reads when writing accepted_hits.bam. I'm not sure if there's an option to write them to the BAM file as well, I've checked the manual and I don't see it so far.

**biznatch** · 08-08-2013, 11:15 AM

I did a little work on one set of RNA-seq data here's what I found, and there's some related comments.

Tophat low percent alignment (maybe) and a few other questions - SEQanswers

http://seqanswers.com/forums/showthread.php?t=15870

Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

**apredeus** · 08-08-2013, 11:30 AM

Ok, so I was not attentive enough :/

There's a file unmapped_left.fq.z that has all of the un-mapped reads in a compressed form.

something like

Code:

gunzip -c unmapped_left.fq.z | wc -l | awk '{print $1/4}'

should give you the number of all un-aligned reads.

Thanks everybody!

**westerman** · 08-08-2013, 11:50 AM

Originally posted by apredeus View Post

A problem with taking unique reads only is in the fact that in RNA-seq, not all reads are actually unique - some strongly expressed genes are expected to generate quite a bit of exactly same reads, especially at substantial sequencing depths.

Actually by "unique" I meant counting the reads by their name not by their sequence. Any given read is only counted once no matter how many times it is in the accepted_hits.bam file. I agree that counting by sequence would give poor information especially at high sequencing depths.

Anyway, we don't do human samples here nor do we have many mouse samples coming through thus I can't say what a good percentage mapping would be for RNA-seq experiments. Certainly I would expect above 90%. Maybe close to the 98% I get for Arabidopsis.

**apredeus** · 08-08-2013, 12:20 PM

Originally posted by westerman View Post

Actually by "unique" I meant counting the reads by their name not by their sequence. Any given read is only counted once no matter how many times it is in the accepted_hits.bam file. I agree that counting by sequence would give poor information especially at high sequencing depths.

I see, I misunderstood what you meant. Thank you for the clarification.

Meanwhile, I am still somewhat confused. As an example, I have a small RNA-Seq experiment, FASTQ file for which contained 357706 reads.

After the alignment, I've obtained accepted_hits.bam file, that has 135627 entries, of them 103735 unique (by their name, as mentioned above).

The unmapped_left.fq.z file has 29358 entries in it (all unique by name).

Thus I have about 224613 reads unaccounted for. I assume this is due to "prefilter_multihits" option, that prevents reads that map in too many places from being considered (e.g. repeats etc). To confirm this, my prep_reads.log file states:

224603 out of 357706 reads have been filtered out
(219406 filtered out due to <skipped>/left_multimapped.fq)

The number is indeed very close to the difference that I've just calculated.

So, overall, it seems that, while I do have only ~8-9% of reads completely un-alignable to mouse genome, only about a third (~100k) of them are actually useful for differential gene expression, etc.

Correct me if I am wrong here.

Thanks!

**westerman** · 08-08-2013, 12:34 PM

So, overall, it seems that, while I do have only ~8-9% of reads completely un-alignable to mouse genome, only about a third (~100k) of them are actually useful for differential gene expression, etc.

Correct me if I am wrong here.

Thanks!

I do believe that you are correct. And such statistics should raise a red flag. In particular I wonder how many of your reads map to rRNA. Do you check for this?

**apredeus** · 08-08-2013, 12:37 PM

Yes I realize that's not a good library

That's just one of the experiments while testing new RNA-seq methods. We do check how many of the reads align to rRNA - we use Picard tools for this if I remember correctly.

**chadn737** · 08-08-2013, 12:38 PM

Easiest way...install the latest version of Tophat. They finally got around to having Tophat spit out stats similar to the way bowtie2 does: number of mapped reads and number mapping multiple times.

**apredeus** · 08-08-2013, 12:41 PM

Originally posted by chadn737 View Post

Easiest way...install the latest version of Tophat. They finally got around to having Tophat spit out stats similar to the way bowtie2 does: number of mapped reads and number mapping multiple times.

I'm running them on a cluster and Tophat 2.0.* just was not cooperating when I tried it... producing all kinds of random errors etc. Plus collaborators use 1.4.1 (or 1.4.1.1 to be more exact) so I'll stick with it for now...

But thank you for the advice. It's good to know that this issue was addressed.

**chadn737** · 08-08-2013, 12:56 PM

Originally posted by apredeus View Post

I'm running them on a cluster and Tophat 2.0.* just was not cooperating when I tried it... producing all kinds of random errors etc. Plus collaborators use 1.4.1 (or 1.4.1.1 to be more exact) so I'll stick with it for now...

But thank you for the advice. It's good to know that this issue was addressed.

If you are using 1.4.1 then the mapping scores should all follow this formula:

255 = unique mapping
3 = maps to 2 locations in the target
2 = maps to 3 locations
1 = maps to 4-9 locations
0 = maps to 10 or more locations.

You can use this info then to quickly count reads for each category.

**apredeus** · 08-08-2013, 02:43 PM

Originally posted by chadn737 View Post

If you are using 1.4.1 then the mapping scores should all follow this formula:

255 = unique mapping
3 = maps to 2 locations in the target
2 = maps to 3 locations
1 = maps to 4-9 locations
0 = maps to 10 or more locations.

You can use this info then to quickly count reads for each category.

That's what I've read but upon looking on the actual BAM outputs, this is not the case (anymore, probably).

The flags for single-end experiment are limited to 0,16,256 and 272 (which I'd imagine would be + and - reads aligned "as is" and + and - reads aligned with a junction).

Topics	Statistics	Last Post
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 12 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 22 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM

Seqanswers Leaderboard Ad

Announcement

Percent aligned reads for RNA-seq

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News