  • TopHat alignment issues: trimming and quality scores

    Hi,
    I am using tophat 2.4.0.1 and have come across two main issues.
    • One is that even for uniquely mapped read pairs, some of the alignments are missing their quality scores, i.e. the quality field is set to "*", as it would be for secondary alignments. I have checked, and these reads are primary alignments. Any insight in this regard would be appreciated.
    • While aligning trimmed reads, TopHat fails, and the failure looks random. I finally figured out that this happens because I have trimmed my reads (PE 50, minimum length 30 bp) and Bowtie2 throws an error when TopHat attempts its segment mapping. The default segment length is 25; from the manual: "--segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25." For a read trimmed to 30 bp, it ends up trying to map the remaining 5 bases. Bowtie2 specifically fails here because the number of mismatches in the seed is specified to be 2. One fix is to manually run Bowtie2 with the no-mismatches-in-seed option; the other is to change the TopHat --segment-length option to at least half the size of your smallest read length.
    • Has anyone encountered this behavior, and are there any clues as to how it affects the analysis in general? I would much appreciate any comments or insights.

    cheers
    Ashok

  • #2
    Are you having specific problems with the quality scores dropping off sharply towards the end of the 50 bp read? Usually at 50 bp, Illumina predicted quality scores are still very good. And what Q score are you trimming to? I would only suggest a very low cutoff, i.e. no higher than 10 (1% predicted error). Generally, for reads that are just being mapped to the genome, I wouldn't suggest doing any trimming unless there are very specific problems. Alternatively, you could do some quality filtering as opposed to trimming, and just drop pairs with too many Q scores under some threshold (e.g. require 80% of the bases in both reads of a pair to be > Q20?).
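
The filtering idea above (drop a pair unless enough bases in both reads clear a quality threshold) could be sketched roughly as follows. This is only an illustration: it assumes Phred+33 encoded quality strings, and the function names and the 80% / Q20 defaults are placeholders taken from the suggestion, not settings of any particular tool.

```python
def fraction_above(qual, min_q=20, offset=33):
    """Fraction of bases in a FASTQ quality string with Phred score > min_q.

    Assumes Phred+33 encoding (offset=33); older Illumina data may use 64.
    """
    if not qual:
        return 0.0
    return sum((ord(c) - offset) > min_q for c in qual) / len(qual)


def keep_pair(qual1, qual2, min_q=20, min_fraction=0.8):
    """Keep a read pair only if BOTH mates pass the per-read threshold."""
    return (fraction_above(qual1, min_q) >= min_fraction
            and fraction_above(qual2, min_q) >= min_fraction)


# Synthetic quality strings, for illustration only.
good = "I" * 50             # 'I' encodes Q40 in Phred+33
bad = "I" * 30 + "#" * 20   # '#' encodes Q2, so only 60% of bases exceed Q20

print(keep_pair(good, good))  # both mates pass
print(keep_pair(good, bad))   # second mate fails, so the pair is dropped
```

Dropping the whole pair, rather than one mate, keeps the two FASTQ files in sync for paired-end alignment.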

    But if you are going to go down this road of trimming, I would suggest increasing the minimum length to something closer to 40, then dropping the --segment-length to (1/2)*(min read length). At 30 bp the problem is going to be that this number needs to be 15. That's just too short for accurate mapping. Remember TopHat is doing this step to attempt to find spliced reads, where part of the read maps to a different exon, so each segment needs to be mappable independently. And without this capability, TopHat loses a lot of its functionality and your resulting data for analysis will probably be quite bad.
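
As a rough sketch of the (1/2)*(min read length) rule, the helper below picks a segment length from the shortest read length after trimming. The helper name, the cap at the documented default of 25, and the index and file names in the command are assumptions made for illustration; only the --segment-length option itself comes from the thread.

```python
def segment_length_for(min_read_length, default=25):
    """Half the shortest read length, never above TopHat's default of 25.

    The cap at the default is an assumption of this sketch.
    """
    if min_read_length < 2:
        raise ValueError("read length must be at least 2")
    return min(default, min_read_length // 2)


for min_len in (30, 36, 39, 40, 50):
    print(min_len, "->", segment_length_for(min_len))

# Hypothetical invocation: index prefix and file names are placeholders.
cmd = ["tophat", "--segment-length", str(segment_length_for(40)),
       "genome_index", "reads_1.fastq", "reads_2.fastq"]
print(" ".join(cmd))
```

With a 30 bp minimum this gives 15, which is the value described above as too short for accurate mapping; a 40 bp minimum gives 20.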

    I have run into something similar where we had some data that displayed a huge nucleotide bias (we assume now that it was because of poor random priming during cDNA synthesis) for a large part of the 5' end of the read. So we trimmed that section off and ended up with 39 bp reads, and I set the segment-length to 19. What we saw was a generally lower percentage of reads mapping (usually we were getting ~70% and now we had ~60%), and very few spliced reads. It's hard to know what the correct data should be, of course, but the results from downstream analysis with cufflinks and the like with these samples were also quite a bit different from the longer reads without trimming.

    I think many tools are currently assuming you have something like 75-104 bp reads now, and have been optimized given that assumption. You're already starting with something less than that at 50 bp, so I wouldn't willingly go lower unless you have a very good reason to do so.


    • #3
      Thanks for the prompt response:
      • I have some problems with the reads at the end, especially with the second reads in the pair. I used sickle (https://github.com/najoshi/sickle) to trim the reads and ended up with about 70% of the pairs being kept, with a quality score threshold of 20 (which is the 1% predicted error). I have now increased my minimum read length to 36 bp and this solves a lot of my alignment problems. I will try quality filtering and appreciate the tip. I will see how many read pairs are kept in this manner.
      • Any hints on why I am missing quality scores in TopHat?
      • I haven't made any comparisons with the mapping yet, but will report back soon on the % mapping with and without trimming. Tentatively, I get slightly better alignment rates with trimming, but of course the overall number of aligned reads is lower than for untrimmed, as seen below for one sample:
        1. UnTrimmed:
          16805252 reads; of these:
          16805252 (100.00%) were unpaired; of these:
          1330440 (7.92%) aligned 0 times
          9669720 (57.54%) aligned exactly 1 time
          5805092 (34.54%) aligned >1 times
          92.08% overall alignment rate
        2. Trimmed:
          11771625 reads; of these:
          11771625 (100.00%) were unpaired; of these:
          805833 (6.85%) aligned 0 times
          6834128 (58.06%) aligned exactly 1 time
          4131664 (35.10%) aligned >1 times
          93.15% overall alignment rate

        What are the chances of mis-alignment due to not trimming? Will this be worse in the long run?
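
The two summaries above can be compared both as rates and as absolute counts with a little arithmetic. The sketch below uses only the numbers reported in this post; the function name is just for illustration.

```python
def aligned_summary(total, unaligned):
    """Return (aligned read count, overall alignment rate in percent)."""
    aligned = total - unaligned
    return aligned, round(100.0 * aligned / total, 2)


# Totals and "aligned 0 times" counts as reported in the post.
untrimmed = aligned_summary(16805252, 1330440)
trimmed = aligned_summary(11771625, 805833)

print("untrimmed:", untrimmed)
print("trimmed:  ", trimmed)
print("aligned reads lost by trimming:", untrimmed[0] - trimmed[0])
```

The rate goes up by about one percentage point with trimming, while the absolute number of aligned reads drops by roughly 4.5 million for this sample.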

      As always any insight is appreciated


      • #4
        I do not have any idea what's going on with the missing quality scores. That does seem odd.

        I like sickle as well (and yes 1% is Q20, whoops). 36bp could work well, that's similar to GAIIx read lengths that these tools were first created for. I don't think the mis-alignment is in general going to be as much of a problem with low quality scores. I'd expect that high error rates would rather just lead to lower alignment percentage, due to too many mismatches, which you don't seem to have a problem with.
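
For reference, the Q score to error rate conversion mentioned above follows the standard Phred definition, Q = -10 * log10(p). A minimal sketch (function names are illustrative):

```python
import math


def phred_to_error(q):
    """Predicted error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for a predicted error probability."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(f"Q{q}: {phred_to_error(q):.3%} predicted error")
```

So Q10 corresponds to a 10% predicted error and Q20 to 1%.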


        • #5
          Missing quality scores: Tophat 2.4.0.1

          I just wanted to add to the observation made by ashokrags on missing quality scores for primary alignments generated by TopHat 2.4.0.1.

          - We have observed the same behavior: about 20 to 25% of the total reads have "*" in the quality field, even though these reads have valid qualities in the corresponding input FASTQ files.
          - When the BAM files generated by TopHat (containing the "*" reads) are used with any PICARD tools, PICARD throws errors, as the framework does not deal with "*" in the quality strings.

          We have not figured out why TopHat does not pick up the quality strings correctly from the FASTQ files. Any solutions or thoughts from others?
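
For anyone wanting to quantify the problem in their own data, one way is to scan SAM text records and list primary alignments whose QUAL field is "*". The sketch below relies only on the SAM column layout (FLAG is column 2, QUAL is column 11) and the standard flag bits; the function name and the example records are made up for illustration, and it does not explain or fix the underlying cause.

```python
def primary_missing_qual(sam_lines):
    """Yield read names of mapped primary alignments whose QUAL is '*'."""
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        unmapped = bool(flag & 0x4)
        secondary = bool(flag & 0x100)
        supplementary = bool(flag & 0x800)
        if not (unmapped or secondary or supplementary) and fields[10] == "*":
            yield fields[0]


# Synthetic records, for illustration only.
records = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t50\t5M\t=\t200\t105\tACGTA\tIIIII",
    "r2\t99\tchr1\t300\t50\t5M\t=\t400\t105\tACGTA\t*",   # primary, no QUAL
    "r3\t355\tchr1\t500\t1\t5M\t=\t600\t105\tACGTA\t*",   # secondary (0x100)
]
print(list(primary_missing_qual(records)))
```

On real data the same function could be fed the text output of a BAM-to-SAM conversion, read line by line from a file or pipe.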

          Thanks
