  • Problem mapping SOLiD data with tophat2

    When I use tophat2 to map SOLiD data (dataset: DRR013130) with the following parameters (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual), I find that each quality string in unmapped.fastq (converted from unmapped.bam using bam2fastx) is one character shorter than its sequence. However, this problem does not occur in accepted_hits.fastq (converted from accepted_hits.bam). Why does this happen, and how can I solve it?
    Thank you very much for your help!

    -----------------------------------------------------------------------------------
    >head unmapped.fastq
    @DRR013130_Oo_3.3872899
    TTGACCTTACGCTCGTGTCAATTGAACTCTTATGCTACCCCTACCGTCAGA 51 length
    +
    @==;@==?>@?@@@@?@@@@@?;@@2@:/66226?6/8/6?8;/=;;>8/ 50 length
    @DRR013130_Oo_3.12296928
    TGACCGTCTTAGACATATCTCCGTCGTAGGGATCCCCGGCTAACGGATCCG 51 length
    +
    @@@@@@@@@@@@@@@@@@@@@@@?@@@;=@@@?6?@@6//2//66//26/ 50 length
    -----------------------------------------------------------------------------------
    >head accepted_hits.fastq
    @DRR013130_Oo_3.9803742
    ACTTTTACAAGGCCTAATGGTGACTCCTACAGTGGTTGACACCGACTACC 50 length
    +
    @__]L@Q__UU]]___[[____]]^^___WW____UU__22___[QS]W8 50 length
    @DRR013130_Oo_3.15175672
    CAACCTAAAATAAAAACAACTAAAAAAGCTGACTCGTGAGGCAAAAAGAC 50 length
    +
    @________^^___________]]__________________]]__[[_@ 50 length
    Last edited by deuterium; 09-24-2014, 10:41 AM.

  • #2
    I think I should remove the "T" at the head of each sequence in unmapped.fastq. Could anyone help me write a script to do that?
    Thanks very much!

    • #3
      If a sequence and its quality string differ in length, you have an error in your input data! (I'm actually a little surprised that tophat is not complaining about these entries.) How do you know that the missing quality score belongs to the leading "T" and not to some other base? Maybe quality score conversion failed entirely for these reads (for whatever reason; you would have to look into the raw data to check this).
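One way to look into the question raised here is to scan the FASTQ file and list every record whose sequence and quality lengths differ, together with the first base of that sequence. The following is a minimal Python sketch, not something provided by tophat or bam2fastx. The function names `read_fastq` and `length_mismatches` are hypothetical, and the sketch assumes plain four-line FASTQ records without line wrapping.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input or truncated last record
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def length_mismatches(lines):
    """Return (header, first_base, len(seq) - len(qual)) for mismatching records."""
    return [
        (header, seq[:1], len(seq) - len(qual))
        for header, seq, qual in read_fastq(lines)
        if len(seq) != len(qual)
    ]


if __name__ == "__main__":
    # First record quoted in the original post.
    example = [
        "@DRR013130_Oo_3.3872899",
        "TTGACCTTACGCTCGTGTCAATTGAACTCTTATGCTACCCCTACCGTCAGA",
        "+",
        "@==;@==?>@?@@@@?@@@@@?;@@2@:/66226?6/8/6?8;/=;;>8/",
    ]
    for header, first_base, diff in length_mismatches(example):
        print(header, first_base, diff)
```

If every reported record has a difference of exactly 1 and starts with the same base, that is consistent with an extra leading primer base; it does not prove which base lacks a quality value.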

      • #4
        SOLiD data has one more "base" than qualities, because each read starts with one fixed base (in this case T) followed by numbers (0-3). But what you are showing is not SOLiD data. You need to go back to the original colorspace data and map it in colorspace; SOLiD data cannot be accurately converted to bases without mapping first.
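To illustrate why naive conversion is unreliable, here is a small Python sketch of direct colorspace decoding. It assumes the standard SOLiD two-base encoding, in which each color (0-3) describes the transition between adjacent bases; with A=0, C=1, G=2, T=3 the color equals the XOR of the two base codes. The function name `decode_colorspace` is hypothetical, and missing color calls (written as "." in csfasta files) are not handled.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = code(prev) XOR code(next)


def decode_colorspace(read):
    """Naively translate a colorspace read (primer base + colors 0-3) to bases.

    The primer base itself is not part of the decoded sequence.
    """
    previous = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        previous ^= int(color)  # each base depends on the base before it
        decoded.append(BASES[previous])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))  # correct colors
    print(decode_colorspace("T1123"))  # same read with an error in the first color
```

Because every decoded base depends on the previous one, a single wrong color changes all bases after it. A colorspace-aware mapper can recognize and correct such errors against the reference, which naive decoding cannot.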

        • #5
          But the data was mapped in colorspace.
          Originally posted by deuterium
          (tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual)
          I didn't know that you cannot convert at all without mapping. I would be interested to know how tophat/bowtie converts the unmappable reads in this case. If it is really just an additional T at the start, you could simply use sed to remove it:

          Code:
           sed 's/^T//g' ./yourFastqFile.fastq
          This deletes one leading "T" at the beginning of each line. It assumes that no quality line begins with "T", so those lines are not handled separately.
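The sed command above acts on every line, so a quality line that happens to begin with "T" would also be shortened. A more conservative variant, sketched here in Python, touches only the sequence line of each record and only when it is exactly one character longer than its quality line. The function name `strip_primer_base` and the `primer` parameter are hypothetical, and four-line FASTQ records are assumed.

```python
from itertools import islice


def strip_primer_base(lines, primer="T"):
    """Yield FASTQ lines, dropping a leading primer base from sequence lines
    that are exactly one character longer than their quality line."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input or truncated last record
            return
        header, seq, plus, qual = record
        if seq.startswith(primer) and len(seq) == len(qual) + 1:
            seq = seq[1:]
        yield from (header, seq, plus, qual)


if __name__ == "__main__":
    # Synthetic records: the first has an extra leading T, the second does not.
    demo = ["@r1", "TACGT", "+", "IIII", "@r2", "TACG", "+", "TIII"]
    for line in strip_primer_base(demo):
        print(line)
```

To process a real file, pass an open file handle instead of the list and write the yielded lines to a new file. This only makes the lengths consistent; whether the resulting base-space reads are meaningful is the separate question discussed in this thread.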

          • #6
            @WhatsOEver: It's a really bad idea to handle colorspace data in basespace.

            • #7
              It wasn't my idea, just the answer to the author's question. And to be clear: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

              Anyhow, the question then is: of what value is the data in unmapped.bam? If it cannot be used, why and how is it converted?

              • #8
                @WhatsOEver
                Thank you very much! This script is really helpful!

                • #9
                  Originally posted by WhatsOEver
                  It wasn't my idea, just the answer to the author's question. And to be clear: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

                  Anyhow, the question then is: of what value is the data in unmapped.bam? If it cannot be used, why and how is it converted?
                  Yes, I think it is a bug in tophat2!
