**deuterium** · 09-24-2014, 10:56 AM

I think I should remove the "T" at the head of each sequence in unmapping.fastq. Could anyone help me write a script to do that?
Thanks very much!

**WhatsOEver** · 09-25-2014, 12:13 AM

If the lengths of the sequence and the corresponding quality string are not equal, you have an error in your input data! (I'm actually a little surprised that tophat is not complaining about these entries.) How do you know that the quality score of the leading "T" is the one that is missing, and not that of some other base? Maybe quality score conversion failed completely for these reads (for whatever reason; you have to look into the raw data to check this).
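To see how many records are affected, a quick check along these lines could help. This is only a sketch: it assumes strict 4-line FASTQ records (header, sequence, "+", qualities, no line wrapping), and `check_fastq_lengths` is a made-up helper name, not part of tophat or any other tool.

```shell
# Hypothetical helper: report FASTQ records whose sequence and quality
# lengths differ. Assumes every record is exactly 4 lines.
check_fastq_lengths() {
  awk 'NR % 4 == 1 { name = $0 }             # header line
       NR % 4 == 2 { seqlen = length($0) }   # sequence line
       NR % 4 == 0 && length($0) != seqlen {
         # print header, sequence length, quality length (tab-separated)
         print name "\t" seqlen "\t" length($0)
       }' "$@"
}

# Example: check_fastq_lengths unmapped.fastq | head
```

If every reported record is exactly one base longer than its quality string, that at least fits the "extra leading base" explanation, but it does not prove which position lacks a quality value.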

**Brian Bushnell** · 09-25-2014, 08:53 AM

SOLiD data has one more "base" than quality values, because each read starts with one fixed base (in this case T) followed by numbers (0-3). But what you are showing is not SOLiD data. You need to go back to the original colorspace data and map it in colorspace; SOLiD data cannot be accurately converted to bases without mapping first.
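To illustrate why conversion before mapping is unreliable, here is a rough sketch of naive colorspace decoding. It relies on the standard SOLiD encoding, where each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). `decode_colorspace` is a hypothetical name, and the sketch assumes each input line is one read made of a primer base followed only by the digits 0-3 (no "." calls).

```shell
# Hypothetical helper: naively decode colorspace reads (e.g. T0123) to bases.
# Each color is the XOR of the 2-bit codes of adjacent bases (A=0,C=1,G=2,T=3).
decode_colorspace() {
  awk '{
    bases = "ACGT"
    tbl = "0123103223013210"              # tbl[b*4 + c] = b XOR c
    b = index(bases, substr($0, 1, 1)) - 1  # code of the primer base
    out = ""
    for (i = 2; i <= length($0); i++) {
      c = substr($0, i, 1)                # color call, 0-3
      b = substr(tbl, b * 4 + c + 1, 1)   # code of the next base
      out = out substr(bases, b + 1, 1)
    }
    print out
  }' "$@"
}

# Example: printf 'T0123\nT0023\n' | decode_colorspace
```

The two example reads differ in a single color call, yet every decoded base from that position onward differs. Each base depends on all the colors before it, so one miscalled color corrupts the rest of the read, which is why the reads should be mapped in colorspace first.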

**WhatsOEver** · 09-26-2014, 12:18 AM

But the data was mapped in colorspace.

Originally posted by deuterium

tophat2 -m 2 -p 8 --color --quals --bowtie1 -G /home/mm10/mm10_GTF/genes.gtf -o /media/DRR013130_Oo_3_F3 /home/mm10/BowtieCIndex/genome /media/DRR013130_Oo_3_F3.csfasta /media/DRR013130_Oo_3_F3_QV.qual

I didn't know that you cannot convert at all without mapping. I would be interested to know how tophat/bowtie is converting the unmappable reads in this case?! If it is really just an additional T at the start you could simply use sed to remove it:

Code:

 sed 's/^T//g' ./yourFastqFile.fastq

This deletes one leading "T" from every line that begins with one and writes the result to standard output, so redirect it to a new file. (Your quality lines should not have a "T" in them, so there is no need to handle that.)
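If you would rather not depend on the quality lines never starting with "T", a slightly more defensive variant only touches the sequence line of each record. Again just a sketch, assuming strict 4-line FASTQ records; `strip_leading_t` is a made-up helper name.

```shell
# Hypothetical helper: strip one leading "T" from sequence lines only.
# Assumes strict 4-line FASTQ records (header, sequence, "+", qualities).
strip_leading_t() {
  awk 'NR % 4 == 2 { sub(/^T/, "") }   # second line of each record
       { print }' "$@"
}

# Example: strip_leading_t unmapped.fastq > unmapped.trimmed.fastq
```

Header, "+" and quality lines pass through unchanged, even if a quality string happens to start with "T".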

**dpryan** · 09-26-2014, 12:23 AM

@WhatsOEver: It's a really bad idea to handle colorspace data in basespace.

**WhatsOEver** · 09-26-2014, 12:30 AM

It wasn't my idea, just the answer to the author's question.

And to point that out: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

Anyhow, the question would then be of what value is the data in unmapped.bam? If it cannot be used, why/how is it converted?

**deuterium** · 09-26-2014, 12:35 AM

@WhatsOEver
Thank you very much! This script is really helpful!

**deuterium** · 09-26-2014, 01:15 AM

Originally posted by WhatsOEver

It wasn't my idea, just the answer to the author's question.

And to point that out: I didn't mean to remove the leading "T" in the ".csfasta" file, I meant to delete it in the "unmapped.fastq" file.

Anyhow, the question would then be of what value is the data in unmapped.bam? If it cannot be used, why/how is it converted?

Yes, I think it is a bug in tophat2!

problem of mapping SOLiD data using tophat2