  • problem of mapping SOLiD data using tophat2

    When I use tophat2 to map SOLiD data (dataset: DRR013130) with the following parameters (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual), I find that in unmapped.fastq (converted from unmapped.bam using bam2fastx) each quality string is one character shorter than its sequence. However, this problem does not happen in accepted_hits.fastq (converted from accepted_hits.bam). How can this happen, and how can I solve it?
    Thank you very much for your help!

    -----------------------------------------------------------------------------------
    >head unmapped.fastq
    @DRR013130_Oo_3.3872899
    TTGACCTTACGCTCGTGTCAATTGAACTCTTATGCTACCCCTACCGTCAGA 51 length
    +
    @==;@==?>@?@@@@?@@@@@?;@@2@:/66226?6/8/6?8;/=;;>8/ 50 length
    @DRR013130_Oo_3.12296928
    TGACCGTCTTAGACATATCTCCGTCGTAGGGATCCCCGGCTAACGGATCCG 51 length
    +
    @@@@@@@@@@@@@@@@@@@@@@@?@@@;=@@@?6?@@6//2//66//26/ 50 length
    -----------------------------------------------------------------------------------
    >head accepted_hits.fastq
    @DRR013130_Oo_3.9803742
    ACTTTTACAAGGCCTAATGGTGACTCCTACAGTGGTTGACACCGACTACC 50 length
    +
    @__]L@Q__UU]]___[[____]]^^___WW____UU__22___[QS]W8 50 length
    @DRR013130_Oo_3.15175672
    CAACCTAAAATAAAAACAACTAAAAAAGCTGACTCGTGAGGCAAAAAGAC 50 length
    +
    @________^^___________]]__________________]]__[[_@ 50 length
    Last edited by deuterium; 09-24-2014, 10:41 AM.

  • #2
    I think I should remove the "T" at the head of each sequence in unmapped.fastq. Could anyone help me write a script to do that?
    Thanks very much!


    • #3
      If a sequence and its quality string differ in length, you have an error in your input data! (I'm actually a little surprised that tophat is not complaining about these entries.) How do you know that it is the quality score of the leading "T" that is missing and not that of some other base? Maybe the quality score conversion failed completely for these reads, for whatever reason; you would have to look into the raw data to check this.
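One way to check how consistent the mismatch is: the sketch below tallies, for every record, the difference between sequence length and quality length together with the first base of the sequence. It assumes plain four-line FASTQ records (no wrapped sequences); the function names are made up for this illustration and are not part of tophat or bam2fastx.

```python
from collections import Counter
from itertools import islice


def fastq_records(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:  # end of input (or a truncated last record)
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def length_offsets(lines):
    """Count records by (len(sequence) - len(quality), first base)."""
    counts = Counter()
    for _header, seq, qual in fastq_records(lines):
        counts[(len(seq) - len(qual), seq[:1])] += 1
    return counts


# Hypothetical usage on the file from this thread:
# with open("unmapped.fastq") as handle:
#     print(length_offsets(handle))
```

If every record lands in the (1, "T") bucket, the extra character is at least consistently a leading "T"; any other bucket would point to a different problem.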


      • #4
        SOLiD data has one more "base" than qualities, because each read starts with one fixed base (in this case T) followed by numbers (0-3). But what you are showing is not SOLiD data. You need to go back to the original colorspace data and map it in colorspace; SOLiD data cannot be accurately converted to bases without mapping first.
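To illustrate why naive conversion is fragile, here is a minimal sketch of colorspace decoding. It assumes the standard two-base encoding, in which, with A, C, G, T numbered 0 to 3, each color is the XOR of two adjacent bases. The function name and the toy reads are invented for this example, the input is assumed to contain only the colors 0-3, and this is not a description of what tophat or bowtie do internally.

```python
BASES = "ACGT"  # index of each base = its 2-bit code (A=0, C=1, G=2, T=3)


def decode_colorspace(read):
    """Translate a colorspace read (primer base + colors 0-3) into bases.

    The leading primer base only seeds the decoding and is not emitted.
    """
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        prev ^= int(color)  # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


clean = decode_colorspace("T0123")  # "TGAT"
noisy = decode_colorspace("T1123")  # "GTCG": only the first color differs
```

A single miscalled color changes every base after it, which is why the reads are normally aligned in colorspace and only translated afterwards.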


        • #5
          But the data was mapped in colorspace.
          Originally posted by deuterium
          (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual)
          I didn't know that you cannot convert at all without mapping. I would be interested to know how tophat/bowtie is converting the unmappable reads in this case?! If it is really just an additional T at the start, you could simply use sed to remove it:

          Code:
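           sed '2~4s/^T//' ./yourFastqFile.fastq
          This deletes one leading "T" from every sequence line (line 2 of each four-line FASTQ record). The 2~4 address is a GNU sed extension; restricting the substitution to sequence lines matters because a quality line can itself begin with a "T". The result is written to standard output, so redirect it to a new file.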
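A more defensive variant of the one-liner above, as a small Python sketch: it touches only the sequence line of each record, and only when that sequence starts with the given base and is exactly one character longer than its quality string. It assumes four-line FASTQ records; the function name and the output file name are hypothetical. Whether the trimmed reads are meaningful in basespace is a separate question.

```python
from itertools import islice


def strip_primer_base(lines, base="T"):
    """Yield FASTQ lines, dropping a leading `base` from over-long sequences.

    Lines are yielded without their trailing newline. A truncated final
    record (fewer than four lines) is silently dropped.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, plus, qual = record
        # Trim only when the sequence is exactly one longer than the qualities.
        if seq.startswith(base) and len(seq) == len(qual) + 1:
            seq = seq[1:]
        yield from (header, seq, plus, qual)


# Hypothetical usage:
# with open("unmapped.fastq") as src, open("unmapped.trimmed.fastq", "w") as dst:
#     for line in strip_primer_base(src):
#         dst.write(line + "\n")
```

Records whose sequence and quality lengths already agree pass through unchanged, so the function is safe to run on a file such as accepted_hits.fastq as well.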


          • #6
            @WhatsOEver: It's a really bad idea to handle colorspace data in basespace.


            • #7
              It wasn't my idea, just the answer to the author's question. And to point that out: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

              Anyhow, the question would then be of what value is the data in unmapped.bam? If it cannot be used, why/how is it converted?


              • #8
                @WhatsOEver
                Thank you very much! This script is really helpful!


                • #9
                  Originally posted by WhatsOEver
                  It wasn't my idea, just the answer to the author's question. And to point that out: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

                  Anyhow, the question would then be of what value is the data in unmapped.bam? If it cannot be used, why/how is it converted?
                  Yes, I think it is a bug in tophat2!
