  • deuterium
    Junior Member
    • Mar 2012
    • 8

    problem of mapping SOLiD data using tophat2

    When I use tophat2 to map SOLiD data (dataset: DRR013130) with the following command (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual), I find that in unmapped.fastq (converted from unmapped.bam using bam2fastx) the quality string of every read is one character shorter than its sequence. However, this problem does not occur in accepted_hits.fastq (converted from accepted_hits.bam). Why does this happen, and how can I solve it?
    Thank you very much for your help!

    -----------------------------------------------------------------------------------
    >head unmapped.fastq
    @DRR013130_Oo_3.3872899
    TTGACCTTACGCTCGTGTCAATTGAACTCTTATGCTACCCCTACCGTCAGA 51 length
    +
    @==;@==?>@?@@@@?@@@@@?;@@2@:/66226?6/8/6?8;/=;;>8/ 50 length
    @DRR013130_Oo_3.12296928
    TGACCGTCTTAGACATATCTCCGTCGTAGGGATCCCCGGCTAACGGATCCG 51 length
    +
    @@@@@@@@@@@@@@@@@@@@@@@?@@@;=@@@?6?@@6//2//66//26/ 50 length
    -----------------------------------------------------------------------------------
    >head accepted_hits.fastq
    @DRR013130_Oo_3.9803742
    ACTTTTACAAGGCCTAATGGTGACTCCTACAGTGGTTGACACCGACTACC 50 length
    +
    @__]L@Q__UU]]___[[____]]^^___WW____UU__22___[QS]W8 50 length
    @DRR013130_Oo_3.15175672
    CAACCTAAAATAAAAACAACTAAAAAAGCTGACTCGTGAGGCAAAAAGAC 50 length
    +
    @________^^___________]]__________________]]__[[_@ 50 length
    Last edited by deuterium; 09-24-2014, 10:41 AM.
  • deuterium
    Junior Member
    • Mar 2012
    • 8

    #2
    I think I should remove the "T" at the head of each sequence in unmapped.fastq. Could anyone help me write a script to do that?
    Thanks very much!


    • WhatsOEver
      Senior Member
      • Apr 2012
      • 215

      #3
      If a sequence and its quality string differ in length, there is an error in your input data! (I'm actually a little surprised that tophat does not complain about these entries.) How do you know that it is the quality score for the leading "T" that is missing, and not the score for some other base? Maybe the quality score conversion failed entirely for these reads, for whatever reason; you would have to look into the raw data to check this.
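One way to start answering that question is to tally the length difference for every read and see which base the mismatched reads begin with. The following is only a rough sketch: the function names are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines). It cannot prove which quality value is missing, only show whether the pattern is consistent.

```python
from collections import Counter


def read_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records.

    Assumes every record is exactly four lines; an incomplete
    trailing record is silently ignored.
    """
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header[1:], seq, qual


def summarize_mismatches(lines):
    """Count reads by (sequence length - quality length).

    Also counts the first base of every read whose lengths differ,
    to check whether the extra character is always the same base.
    """
    diffs = Counter()
    first_bases = Counter()
    for _name, seq, qual in read_fastq(lines):
        diff = len(seq) - len(qual)
        diffs[diff] += 1
        if diff != 0:
            first_bases[seq[:1]] += 1
    return diffs, first_bases
```

Called as `summarize_mismatches(open("unmapped.fastq"))`, it returns two counters. If every read has a difference of 1 and every mismatched read starts with the same base, that points to a systematic extra base and not to random corruption.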


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        SOLiD data has one more "base" than qualities, because each read starts with one fixed base (in this case T) followed by numbers (0-3), one per color call. But what you are showing is not SOLiD data. You need to go back to the original colorspace data and map it in colorspace; SOLiD data cannot be accurately converted to bases without mapping first.
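To make the layout concrete, here is a small sketch of naive colorspace decoding. It relies on the standard SOLiD two-base encoding, in which each color equals the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function name is hypothetical, missing calls ('.') are not handled, and this is purely an illustration, not what tophat or bowtie does internally.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Naively translate a colorspace read such as 'T0123' into bases.

    The first character is the fixed primer base; each following digit
    is a color describing the transition to the next base. The primer
    base itself is not part of the returned sequence.
    """
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        prev ^= int(color)  # next base code = previous code XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


print(decode_colorspace("T0123"))  # TGAT
print(decode_colorspace("T1123"))  # GTCG: one wrong color changes every later base
```

The read string has one more character than it has colors, and there is one quality value per color, which is where the off-by-one comes from. The second call shows why decoding before mapping is unreliable: a single miscalled color at the start corrupts the whole remainder of the read.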


        • WhatsOEver
          Senior Member
          • Apr 2012
          • 215

          #5
          But the data was mapped in colorspace.
          Originally posted by deuterium
          (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual
          I didn't know that you cannot convert at all without mapping. I would be interested to know how tophat/bowtie converts the unmappable reads in this case. If it is really just an additional T at the start, you could simply use sed to remove it:

          Code:
           sed 's/^T//' ./yourFastqFile.fastq
          This deletes one leading "T" from every line that starts with one and prints the result to standard output. It assumes that none of your quality lines begins with a "T", so they are not handled separately.
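If you would rather not rely on that assumption, a slightly more defensive variant is sketched below. It only touches the sequence line of each record, and only when that line starts with the given base and is exactly one character longer than its quality line. The function name is made up for illustration, and it assumes 4-line FASTQ records. Whether the trimmed reads are meaningful in basespace is a separate question.

```python
def strip_leading_base(lines, base="T"):
    """Return FASTQ lines with the extra leading base removed.

    Only the sequence line (2nd line of each 4-line record) is changed,
    and only if it starts with `base` and is one character longer than
    the quality line. Header, '+' and quality lines are left untouched.
    """
    stripped = [ln.rstrip("\n") for ln in lines]
    out = []
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if seq.startswith(base) and len(seq) == len(qual) + 1:
            seq = seq[1:]
        out.extend([header, seq, plus, qual])
    return out
```

To use it on a file, read the lines with `open("unmapped.fastq")`, pass them in, and write the returned lines back out joined with newlines. Reads whose sequence and quality already have equal length pass through unchanged.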


          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            @WhatsOEver: It's a really bad idea to handle colorspace data in basespace.


            • WhatsOEver
              Senior Member
              • Apr 2012
              • 215

              #7
              It wasn't my idea, just the answer to the author's question. And to be clear: I didn't mean to remove the leading "T" in the ".csfasta" file; I meant to delete it in the "unmapped.fastq" file.

              Anyhow, the question then is: of what value is the data in unmapped.bam? If it cannot be used, why and how is it converted?


              • deuterium
                Junior Member
                • Mar 2012
                • 8

                #8
                @WhatsOEver
                Thank you very much! This script is really helpful!


                • deuterium
                  Junior Member
                  • Mar 2012
                  • 8

                  #9
                  Originally posted by WhatsOEver
                  It wasn't my idea, just the answer to the author's question. And to be clear: I didn't mean to remove the leading "T" in the ".csfasta" file; I meant to delete it in the "unmapped.fastq" file.

                  Anyhow, the question then is: of what value is the data in unmapped.bam? If it cannot be used, why and how is it converted?
                  Yes, I think it is a bug in tophat2!

