**Jon_Keats** · 11-30-2011, 09:20 AM

With 100x100-mers that alignment rate is not uncommon in my experience. I personally wonder whether 100x100 is a benefit or a detriment for RNA-seq data. If you align the same data trimmed to 75x75 and 50x50 and compare the metrics with the 100x100 run, you will find that the percent aligned increases significantly when you reduce the length to 75x75 (~10-15% higher), while the number of unique mapping events drops by only 1%. Dropping to 50x50 increases mapping by another ~5% and drops the unique mapping by 1%. I often wonder whether we are overloading TopHat as the read length increases. Remember that the average exon size in human (assuming mouse to be similar) is around 110 bp, with a mode of 96 bp. Aligner performance will always be tied to those genome features, so different read lengths may affect the outcome.
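A minimal Python sketch of the hard-trimming comparison described above, for anyone who wants to try it. The function names `trim_fastq_record` and `trim_fastq` are made up for illustration, and the toy read stands in for real 100 bp data; in practice a tool such as the FASTX-Toolkit would do this on the full files.

```python
def trim_fastq_record(record, keep):
    """Keep only the first `keep` bases (5' end) of a 4-line FASTQ record."""
    header, seq, plus, qual = record
    return (header, seq[:keep], plus, qual[:keep])


def trim_fastq(lines, keep):
    """Yield trimmed 4-line records from an iterable of FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield trim_fastq_record(tuple(lines[i:i + 4]), keep)


# Toy 10 bp read; the lengths 10, 7, and 5 stand in for 100, 75, and 50 bp.
example = ["@read1", "ACGTACGTAC", "+", "IIIIIHHHHH"]
for keep in (10, 7, 5):
    print(keep, list(trim_fastq(example, keep))[0][1])
```

Each trimmed set would then be aligned with identical settings so that percent aligned and unique mappings can be compared like for like.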

**biznatch** · 11-30-2011, 10:21 AM

From my understanding, TopHat takes all the 100 bp sequences it can't align, splits them into 25 bp fragments, and then tries aligning those again. Wouldn't this help with the problem of trying to align 100 bp sequences to exons that are not much bigger than this?

I will try trimming the reads in a few different ways and see how the results compare.

Do you know what all the "malformed closure" and "multiple closures" warnings mean?

**chadn737** · 11-30-2011, 10:42 AM

How were your libraries prepared? Do you suspect that there may be any adapter sequence in your reads? If there is, this can greatly reduce the number of reads that align. Trimming for quality can also improve alignments.

**biznatch** · 11-30-2011, 11:38 AM

There are adapter sequences in the reads. I'm going to trim adapters and low-quality sequence, align again, and see how that compares.

**chadn737** · 11-30-2011, 11:46 AM

It's unsurprising then that only half your reads aligned if there is adapter sequence present. Is the adapter sequence at the 5' or 3' end of your reads?

**cjp** · 11-30-2011, 12:17 PM

Originally posted by Jon_Keats:

With 100x100-mers that alignment rate is not uncommon in my experience. I personally wonder whether 100x100 is a benefit or a detriment for RNA-seq data. If you align the same data trimmed to 75x75 and 50x50 and compare the metrics with the 100x100 run, you will find that the percent aligned increases significantly when you reduce the length to 75x75 (~10-15% higher), while the number of unique mapping events drops by only 1%. Dropping to 50x50 increases mapping by another ~5% and drops the unique mapping by 1%. I often wonder whether we are overloading TopHat as the read length increases. Remember that the average exon size in human (assuming mouse to be similar) is around 110 bp, with a mode of 96 bp. Aligner performance will always be tied to those genome features, so different read lengths may affect the outcome.

Good idea - how well does TopHat cope if it needs to put two+ introns into an alignment - and thus align over 3 (or more) exons?

Chris

**biznatch** · 11-30-2011, 02:09 PM

I'm not sure exactly how the libraries were prepared...someone else in my lab dissected mouse tissue and sent it frozen to the sequencing facility and the facility did the rest.

Originally posted by chadn737:

It's unsurprising then that only half your reads aligned if there is adapter sequence present. Is the adapter sequence at the 5' or 3' end of your reads?

They are in the middle or towards the 3' ends... Is this what it looks like when you have short fragments? These are the most common adapters in the data. The sequences are from the FastQC contaminant_list.txt file, and the numbers are how many times each appeared in the first 100,000 sequences. There are some other "contaminant" sequences that appear less often. Here's what some of them look like, adapter sequences in red:

Reads1, Illumina Multiplexing Adapter 1 11640
Reads1, Illumina Gex Adapter 2.01 689
Reads2, Illumina Paired End Adapter 1/Multiplexing Adapter 2 5563

Should I use these adapter sequences with Cutadapt (or another trimming program) and trim everything from the adapter to the 3' end of each read? (And also trim for quality.)

**chadn737** · 11-30-2011, 02:19 PM

Originally posted by biznatch:

They are in the middle or towards the 3' ends...is this what it looks like when you have short fragments?

Yes.

I have encountered the exact same problem before. 40-50% of my reads failed to align due to varying lengths of adapter sequence at the 3' end. However, after quality trimming and adapter trimming I was able to get closer to 89-90% of my reads to align.

I tried a variety of approaches, but had the best results when I first used the fastx quality trimmer to trim reads from the 3' end based on quality and then used cutadapt to trim adapter sequence from the 3' end. Cutadapt can also trim based on quality, but for some reason this resulted in fewer aligned reads than when I first went over the reads with the fastx quality trimmer.
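To make that order of operations concrete, here is a minimal Python sketch that trims the 3' end by quality first and removes adapter sequence second. It is a simplified stand-in, not the actual algorithm of either the FASTX-Toolkit or cutadapt: the function names are hypothetical, matching is exact only, the Phred+64 offset assumes older Illumina quality encoding, and the adapter string is a placeholder for whatever FastQC reports for your data.

```python
def quality_trim_3prime(seq, qual, threshold=10, offset=64):
    """Remove trailing bases whose quality score is below `threshold`."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


def adapter_trim_3prime(seq, qual, adapter, min_match=5):
    """Cut at a full adapter match, or at a partial adapter ending the read."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for the longest adapter prefix that ends the read.
        for k in range(min(len(adapter), len(seq)), min_match - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


ADAPTER = "AGATCGGAAGAGC"      # placeholder adapter sequence
seq = "ACGTACGTACGT" + ADAPTER[:8]
qual = "h" * 18 + "B" * 2      # two low-quality "B" bases at the 3' end

seq, qual = quality_trim_3prime(seq, qual)
seq, qual = adapter_trim_3prime(seq, qual, ADAPTER)
print(seq, qual)
```

One thing the sketch makes visible: if quality trimming leaves fewer than `min_match` adapter bases at the end of a read, the adapter step no longer recognizes them, so the two steps are not independent of each other.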

I assume this arises when you have a size selection problem during the library prep. This is going to create even greater headaches for you since your reads are paired-end. If you look at the read pairs that have adapter sequence at the 3' end, the two mates probably overlap significantly, which, if I understand it correctly, means that there is no insert between the two paired ends. That will probably screw things up in TopHat by throwing off the insert size.
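A rough Python sketch of how one could check whether two mates overlap and, if so, what fragment length that implies. It assumes exact matches with no sequencing errors, the function names are invented for this example, and the toy fragment and adapter are placeholders; dedicated read-merging tools handle mismatches and quality scores properly.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def fragment_length(read1, read2, min_overlap=10):
    """Return the implied fragment length if the mates overlap exactly, else None."""
    mate = reverse_complement(read2)
    shortest = min(len(read1), len(mate))
    # Case 1: fragment no longer than a read. Read 1 starts with the whole
    # fragment and the reverse-complemented mate ends with it.
    for frag_len in range(min_overlap, shortest + 1):
        if read1[:frag_len] == mate[-frag_len:]:
            return frag_len
    # Case 2: fragment longer than one read but shorter than two. The end of
    # read 1 matches the start of the reverse-complemented mate.
    for overlap in range(shortest, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return len(read1) + len(mate) - overlap
    return None


# Toy example: a 20 bp fragment sequenced with 25 bp reads, so both mates
# run 5 bp into a placeholder adapter.
fragment = "ACGTTGCAAGGCTTAACCGG"
adapter = "AGATCGGAAG"
read1 = (fragment + adapter)[:25]
read2 = (reverse_complement(fragment) + adapter)[:25]
print(fragment_length(read1, read2))
```

A fragment length below the read length is exactly the situation where adapter sequence shows up at the 3' end of both mates.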

**biznatch** · 11-30-2011, 02:27 PM

Thank you so much for your help! I will try the trimming as you suggest.

I will also take a look at the paired reads and see if they overlap...I wonder if I could combine the paired end reads that overlap each other into non-paired end reads then align those ones separately...I think I saw a post asking something similar on here recently and someone mentioned a program that could do that, I will have to go look.

[Edit]: Looking at the aligned data from my first post, here are the insert size metrics from Picard. Is this normal, or should it be more bell-curve shaped, and/or should the peak be higher? I aligned using TopHat with -r 150 --mate-std-dev 40 (insert size 150, standard deviation 40). I got these numbers by aligning 1 million of the paired reads with Bowtie and then using Picard InsertSizeMetrics.

**cjp** · 12-01-2011, 01:50 AM

Also try adding --closure-search to tophat for reads with a small insert size:

--closure-search Enables the mate pair closure-based search for junctions. Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp)

Chris

**biznatch** · 12-01-2011, 02:28 PM

Trying different alignment and trimming settings

I've been trying different alignment and trimming options. These were all tested with only 100,000 reads; otherwise it would take forever.

Different alignment options in TopHat using non-trimmed, non-filtered reads, all of which gave almost identical percent alignment (~63%):

My data is Illumina 1.5 but it didn't make a difference whether I used --solexa1.3-quals or not (why??).
-r (inner distance between mate pairs) values ranging from 50 to 300 all gave the same result (based on Picard CollectInsertSizeMetrics, ~150 is the correct value).
--coverage-search, --microexon-search, --butterfly-search, --closure-search all about the same.
--library-type fr-unstranded (which I think is the correct one) was 63% while both fr-firststrand and fr-secondstrand were about 1.5% lower.
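One point worth checking here, offered as a hedged note rather than a diagnosis: TopHat's -r is the inner distance between mates, that is, fragment length minus both read lengths. If the ~150 from Picard is the full fragment length (which is how Picard's insert size is usually understood), the arithmetic for 2x100 bp reads looks like this. The function name is hypothetical.

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance between mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


# 300 bp fragments with 50 bp reads, the kind of case the -r option is
# documented for.
print(mate_inner_distance(300, 50))   # 200

# ~150 bp fragments with 100 bp reads, assuming the Picard value is the
# full fragment length.
print(mate_inner_distance(150, 100))  # -50
```

A negative value would mean the mates overlap, which is consistent with the adapter read-through discussed earlier in the thread.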

Trimming:

Trimming off the last 25bp of each read using Fastx = 77.5% alignment. (This is the highest percent alignment I got)

The following all aligned ~50%:

Trimming the "Illumina Multiplexing Adapter 1" from reads1 and "Illumina Paired End Adapter 1/Multiplexing Adapter 2" from reads2 using Cutadapt. (These adapters appear in about 11.5% and 5.5% of the left and right sequences, respectively.)
Trimming the above adapters AND trimming for quality (10) using Cutadapt = 50%.
Trimming the 3' end based on quality alone using Fastx; thresholds of 10, 20, and 30 all gave the same result.

I have no idea why trimming based on quality or trimming off adapters would lower percent alignment, while just removing the last 25bp would increase it...unless I did something wrong in the trimming step. When I opened the files of quality trimmed reads I saw that basically all the 3' "B" quality positions were removed so it seems to have worked correctly.

I still get lots of "malformed closure" warnings.

So I'm not sure where to go from here...just trim off 25bp and use that data? We're getting another set of RNA-seq data this week so I'll see how it compares.

**Jon_Keats** · 12-01-2011, 02:38 PM

Trimming might not work, as TopHat may expect uniform read lengths, so any read pair with a deviation might be tossed out of the analysis. Percent alignment will increase a bit if you increase to 1,000,000 reads, as more junction reads will be defined, unless you are using a GTF reference.

**biznatch** · 12-01-2011, 02:46 PM

Tophat does not require uniform read lengths (at least, not anymore).

Mixed read lengths in TopHat input file - SEQanswers

http://seqanswers.com/forums/showthread.php?t=7430

I got ~63% alignment when I aligned the entire set (~33 million reads), and with my test sets of 100,000 I get 63% when using the same TopHat settings, so I don't know if using more reads will necessarily improve percent alignment. I'm not using a GTF reference.

**biznatch** · 12-02-2011, 12:45 PM

Percent alignment vs. number of junctions

I tried some more conditions and this time looked at how many junctions were found, not just percent alignment, as I think this might be a better metric. Any sequences less than 25 bp after trimming were discarded. Alignment command was:

Code:

tophat -o ./1 -p 2 -r 150 --mate-std-dev 40 --solexa1.3-quals ~/chip/bowtie/indexes/mm9all+ ./reads1.fq ./reads2.fq

Unmodified.
Trimmed off adapters.
Trimmed 3' end for quality=10.
Trimmed off adapters then 3' end for quality.
Trimmed 3' end for quality then trimmed off adapters.
Removed 10 bp from 3' end.
Removed 20 bp from 3' end.
Removed 30 bp from 3' end.
Removed 40 bp from 3' end.
Removed 50 bp from 3' end.
Removed 10 bp from 5' end.
Removed 20 bp from 5' end.

Removing a fixed number of bp was done using FASTX; adapter and quality trimming was done using cutadapt. The following were removed sequentially because, when given more than one adapter sequence, cutadapt only uses the one with the best match, and some of the sequences had multiple adapters:

MINT adapters (from making cDNA) from 5' end of left and right reads.
Illumina Multiplexing Adapter 1 from 3' end of left reads.
Illumina Paired End Adapter 1/Multiplexing Adapter 2 from 3' end of right reads.
I also had some long strings of A's and T's; I'm not really sure what they're from. I trimmed any reads which were more than half (50 bp) A or T.
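The A/T filter in the last item can be sketched in a few lines of Python. This is one interpretation of "more than half A or T" (more than half A, or more than half T), and the function name `is_poly_a_or_t` is made up for the example.

```python
def is_poly_a_or_t(seq, max_fraction=0.5):
    """True if more than `max_fraction` of the read is A, or is T."""
    if not seq:
        return False
    seq = seq.upper()
    return max(seq.count("A"), seq.count("T")) / len(seq) > max_fraction


reads = [
    "A" * 60 + "ACGT" * 10,  # 70 of 100 bases are A: flagged
    "ACGT" * 25,             # balanced composition: kept
]
kept = [r for r in reads if not is_poly_a_or_t(r)]
print(len(kept))
```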

Overall, trimming based on quality or adapters decreased the number of sequences that aligned. Even though a higher percentage of reads aligned as I trimmed a fixed number of bases off the 3' end (removing 50 bp from the 3' end gave 87% alignment), the most junctions were found using the unmodified reads. In fact, the 50 bp removal, despite its 87% alignment, identified the lowest number of junctions. Not exactly what I expected. Maybe more of the trimmed reads are aligning but they're aligning in the same spot, so it's not helping to identify more junctions?
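For anyone repeating this comparison, a small Python sketch of the two metrics side by side. It assumes junctions are counted from TopHat's junctions.bed output by skipping the track header line; the function names are hypothetical, and the toy lines and numbers are illustrative only, not results from this thread.

```python
def count_junctions(bed_lines):
    """Count junction records, skipping blank lines and header lines."""
    return sum(
        1
        for line in bed_lines
        if line.strip() and not line.startswith(("track", "#"))
    )


def percent_aligned(aligned_reads, total_reads):
    return 100.0 * aligned_reads / total_reads


# Toy BED content, truncated to six columns for readability.
toy_bed = [
    'track name=junctions description="TopHat junctions"',
    "chr1\t100\t500\tJUNC00000001\t12\t+",
    "chr1\t900\t1500\tJUNC00000002\t3\t-",
]
print(count_junctions(toy_bed), percent_aligned(63000, 100000))
```

Reporting both numbers per trimming condition makes it easier to spot cases where percent alignment rises while junction discovery falls.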

So I guess from here I'll just use unmodified reads? Oh well, it'll save some time I suppose. But I'm a bit confused because trimming off adapters and/or for quality seems to be a common practice. I'm not sure what is different with my data that causes this trimming to actually give poorer results.

This is an example at two genes, the percent alignment is shown and then the number of junctions identified.

Tophat low percent alignment (maybe) and a few other questions