**Wallysb01** · 11-19-2012, 09:37 AM

Are you having specific problems with the quality scores dropping off sharply towards the end of the 50 bp reads? Usually at 50 bp the Illumina predicted quality scores are still very, very good. And what Q score are you trimming to? I would only suggest a very low cutoff, i.e. no higher than 10 (1% predicted error). Generally, for reads that are just mapped to the genome, I wouldn't suggest doing any trimming unless there are very specific problems. Alternatively, you could do some quality filtering as opposed to trimming, and just drop pairs with too many Q scores under some threshold (e.g. require 80% of the bases in both reads of a pair to be > Q20?).
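The pair-level filter suggested above could be sketched roughly as follows. This is a minimal illustration, not the behavior of any particular tool: the function names are made up for this example, and it assumes the quality strings use the Phred+33 (Sanger / Illumina 1.8+) encoding.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def fraction_above(quality_string, min_q=20):
    """Fraction of bases whose quality is strictly greater than min_q."""
    scores = phred33_scores(quality_string)
    if not scores:
        return 0.0
    return sum(q > min_q for q in scores) / len(scores)


def keep_pair(qual1, qual2, min_q=20, min_fraction=0.8):
    """Keep a read pair only if BOTH mates pass the quality fraction test."""
    return (fraction_above(qual1, min_q) >= min_fraction
            and fraction_above(qual2, min_q) >= min_fraction)


# 'I' decodes to Q40 and '#' to Q2 under Phred+33.
print(keep_pair("IIIIIIIIII", "IIIIIIII##"))  # second mate has 8/10 bases > Q20
print(keep_pair("IIIIIIIIII", "IIIIIII###"))  # second mate has 7/10 bases > Q20
```

Unlike trimming, a filter like this leaves every surviving read at its full length, so the --segment-length question does not come up.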

But if you are going to go down this road of trimming, I would suggest increasing the minimum length to something closer to 40, then dropping the --segment-length to (1/2)*(min read length). At 30 bp the problem is that this number needs to be 15. That's just too short for accurate mapping. Remember TopHat is doing this step to attempt to find spliced reads, where part of the read maps to a different exon, so each segment needs to be mappable independently. Without this capability, TopHat loses a lot of its functionality and your resulting data for analysis will probably be quite bad.
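As a quick illustration of that rule of thumb, here is a tiny sketch. The helper name is hypothetical, and rounding down for odd lengths is an assumption (it matches using 19 for 39 bp reads).

```python
def suggested_segment_length(min_read_length):
    """Half of the shortest read length kept after trimming, rounded down."""
    return min_read_length // 2


for min_len in (30, 36, 39, 40, 50):
    print(min_len, "->", suggested_segment_length(min_len))
```

The resulting value is what would be passed to TopHat's --segment-length option.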

I have run into something similar, where we had some data that displayed a huge nucleotide bias (we now assume it was because of poor random priming during cDNA synthesis) for a large part of the 5' end of the read. So we trimmed that section off and ended up with 39 bp reads, and I set the segment-length to 19. What we saw was a generally lower percentage of reads mapping (usually we were getting ~70% and now we had ~60%), and very few spliced reads. It's hard to know what the correct data should be, of course, but the results from downstream analysis with Cufflinks and the like with these samples were also quite a bit different from the longer reads without trimming.

I think many tools currently assume you have something like 75-104 bp reads, and have been optimized given that assumption. You're already starting with something less than that at 50 bp, so I wouldn't willingly go lower unless you have a very good reason to do so.

**ashokrags** · 11-19-2012, 10:28 AM

Thanks for the prompt response:

I have some problems with the quality at the ends of the reads, especially with the second read of each pair. I used sickle (https://github.com/najoshi/sickle) to trim the reads and ended up with about 70% of the pairs being kept, with a quality score threshold of 20 (which is the 1% predicted error). I have now increased my minimum read length to 36 bp and this solves a lot of my alignment problems. I will try quality filtering and appreciate the tip. I will see how many read pairs are kept in this manner.

Any hints on why I am missing quality scores in TopHat?

I haven't made any comparisons with the mapping yet, but will report back soon on the % mapping with and without trimming. Tentatively I get slightly better alignment rates with trimming, but of course overall it's lower than untrimmed, as seen below for one sample:
1. UnTrimmed:
  16805252 reads; of these:
  16805252 (100.00%) were unpaired; of these:
  1330440 (7.92%) aligned 0 times
  9669720 (57.54%) aligned exactly 1 time
  5805092 (34.54%) aligned >1 times
  92.08% overall alignment rate
2. Trimmed:
  11771625 reads; of these:
  11771625 (100.00%) were unpaired; of these:
  805833 (6.85%) aligned 0 times
  6834128 (58.06%) aligned exactly 1 time
  4131664 (35.10%) aligned >1 times
  93.15% overall alignment rate
What are the chances of mis-alignment due to not trimming? Will this be worse in the long run?
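To make the comparison between the two summaries concrete, the arithmetic can be checked with a short sketch. The counts are copied from the output above; the dictionary keys and helper names are just labels chosen for this example.

```python
# Counts copied from the two alignment summaries above.
untrimmed = {"total": 16805252, "unaligned": 1330440,
             "unique": 9669720, "multi": 5805092}
trimmed = {"total": 11771625, "unaligned": 805833,
           "unique": 6834128, "multi": 4131664}


def aligned(stats):
    """Reads that aligned at least once (unique + multi-mapped)."""
    return stats["unique"] + stats["multi"]


def alignment_rate(stats):
    """Overall alignment rate as a percentage of the input reads."""
    return 100.0 * aligned(stats) / stats["total"]


for name, stats in (("untrimmed", untrimmed), ("trimmed", trimmed)):
    print(f"{name}: {aligned(stats)} aligned, {alignment_rate(stats):.2f}%")

# Share of the original reads that survived trimming, and how many
# aligned reads were lost in absolute terms.
print(f"kept after trimming: {100.0 * trimmed['total'] / untrimmed['total']:.1f}%")
print(f"aligned reads lost: {aligned(untrimmed) - aligned(trimmed)}")
```

The rate rises by about one percentage point with trimming, while the absolute number of aligned reads falls by roughly 4.5 million, because only about 70% of the reads survive trimming.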

As always any insight is appreciated

**Wallysb01** · 11-19-2012, 08:30 PM

I do not have any idea what's going on with the missing quality scores. That does seem odd.

I like sickle as well (and yes 1% is Q20, whoops). 36bp could work well, that's similar to GAIIx read lengths that these tools were first created for. I don't think the mis-alignment is in general going to be as much of a problem with low quality scores. I'd expect that high error rates would rather just lead to lower alignment percentage, due to too many mismatches, which you don't seem to have a problem with.
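For reference, the Q score to error rate relationship is the standard Phred definition, Q = -10 * log10(P). A minimal sketch, with function names chosen only for this example:

```python
import math


def phred_to_error_probability(q):
    """Predicted per-base error probability for Phred quality q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_error_probability(q):.3%} predicted error")
```

So Q10 corresponds to a 10% predicted error, Q20 to 1%, and Q30 to 0.1%.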

**ParthavJailwala** · 11-23-2012, 07:57 AM

Missing quality scores: TopHat 2.4.0.1

I just wanted to add to the observation made by ashokrags on missing quality scores for primary alignments generated by TopHat 2.4.0.1.

- We have observed the same behavior: about 20 to 25% of the total reads have "*" in the quality scores, even though these reads have valid qualities in the corresponding input fastq files.
- When the BAM files generated by TopHat (containing the "*" reads) are used with any Picard tools, Picard throws errors, as the framework does not deal with "*" in the quality strings.

We have not figured out why TopHat does not pick up the quality strings correctly from the fastq files. Any solutions/thoughts from others?
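One way to measure how many records are affected is to scan the alignments as SAM text and count those whose QUAL column (the 11th field) is "*". The sketch below assumes the BAM has already been converted to SAM text (for example with `samtools view -h`); the function name and the choice to skip secondary alignments via FLAG bit 0x100 are illustrative assumptions, not part of TopHat or Picard.

```python
def count_missing_qualities(sam_lines, primary_only=True):
    """Return (missing, total) over SAM alignment records.

    A record counts as missing when its QUAL field is exactly "*".
    Header lines (starting with "@") are skipped.
    """
    total = 0
    missing = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if primary_only and int(fields[1]) & 0x100:
            continue  # FLAG bit 0x100 marks a secondary alignment
        total += 1
        if fields[10] == "*":
            missing += 1
    return missing, total


# Tiny made-up example records, for illustration only.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t50\t4M\t*\t0\t0\tACGT\t*",
    "r3\t256\tchr1\t300\t3\t4M\t*\t0\t0\tACGT\t*",
]
missing, total = count_missing_qualities(example)
print(f"{missing} of {total} primary records have '*' qualities")
```

In practice the lines could be read from a file object or from standard input instead of a list.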

Thanks


TopHat Alignments issues:Trimming and quality scores
