**lvaruzza** · 12-01-2010, 10:11 AM

Question 1: The B's and F's should not be there.

Question 2: You need a method to track the position of the T's at the beginning of the reads within the contigs, and do the conversion to base space with that in mind.

**maubp** · 12-01-2010, 10:16 AM

Regarding Q2: I recall seeing some slides from SOLiD showing how they use each read's nucleotide prefix to work out the starting letter of a color-space contig. Basically, you could ignore the prefix for the de novo step; afterwards, the prefix of each read gives you that read's answer for which base the contig should start with, and you apply some sensible consensus to pick one. However, I don't work with color-space data.
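A minimal Python sketch of that consensus idea, not SOLiD's actual method. It assumes the usual 2-bit encoding (A=0, C=1, G=2, T=3) in which a color is the XOR of the two adjacent base codes. The function names and the `(read, offset)` placement format are hypothetical, made up for illustration.

```python
from collections import Counter

BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3; color = code(b1) XOR code(b2)

def first_base(read):
    """First real base of a read such as 'T3101130' (primer base + colors)."""
    return BASES[BASES.index(read[0]) ^ int(read[1])]

def vote_contig_start(contig_colors, placements):
    """Let each placed read vote for the first base of the contig.

    contig_colors: colors between successive contig bases (no primer color).
    placements: hypothetical (read, offset) pairs, where offset is the index
    of the contig base that the read's first real base aligns to.
    """
    votes = Counter()
    for read, offset in placements:
        code = BASES.index(first_base(read))
        # Walk back to contig base 0 by XOR-ing the contig colors before the read.
        for color in contig_colors[:offset]:
            code ^= int(color)
        votes[BASES[code]] += 1
    return votes.most_common(1)[0][0], votes

contig = "101130"  # colors between the contig bases A-C-C-A-C-G-G
placements = [("T3101130", 0), ("T21130", 2), ("T3130", 3), ("T1130", 3)]
start, votes = vote_contig_start(contig, placements)
print(start, dict(votes))
```

Here three reads vote for A, and the last read, whose first color is wrong, votes for G, so a simple majority picks A.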

**westerman** · 12-01-2010, 01:44 PM

Well, I have never written an assembler, so I may be full of beans. But I do work with SOLiD data a lot. Nils would be a good person to chime in here.

I see zero reason for a program which is trying to become "color-space aware" to convert color-space (cs) to double-encoded space (de-space) for internal use. de-space should only be used as a last attempt by a human when that human is trying to use a non-color-space aware program because he/she has no other alternative. It seems to me that any program which actually uses cs properly would be able to handle the 0,1,2,3 of cs as easily as it handles the artificial A,C,G,T of de-space. On the other hand a program that insists on using the A,C,G,T of de-space would make me wonder if the program's author actually understood cs.

Converting cs into base-space (bs) throws away all of the power of cs while also dragging all of the weaknesses of cs along.

The major power (or advantage) of cs is that, at enough sequencing depth, it is self-correcting. A single cs mismatch *must* be a sequencing error. Two successive cs mismatches can either be sequencing errors (3/4th of the time) or a true SNP. In other words, if I have 5 reads:

T3101130
T3101130
T3101130
T3100130
T3100030

Then I know that the 4th read (a single mismatch) has a sequencing error while the 5th read (two mismatches) could be error or a SNP. On the other hand if I convert into bs:

ACCACGG
ACCACGG
ACCACGG
ACCCATT
ACCCCGG

I would probably assume that read #4 was not related to the other four, and I would be wrong. Note that read #5 was a SNP after all.
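The example can be checked mechanically. The following Python sketch assumes the usual 2-bit encoding (A=0, C=1, G=2, T=3) in which a color is the XOR of the two adjacent base codes; the helper names `decode` and `classify` are hypothetical. It relies on the fact that a SNP changes two adjacent colors while leaving their XOR unchanged. Mismatches at the very last color, where the neighboring color is not observed, are not treated specially here.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3; color = code(b1) XOR code(b2)

def decode(read):
    """Convert a read such as 'T3101130' to base space (primer base dropped)."""
    code = BASES.index(read[0])
    out = []
    for color in read[1:]:
        code ^= int(color)  # each color gives the next base from the previous one
        out.append(BASES[code])
    return "".join(out)

def classify(ref, read):
    """Label the color differences between two equal-length reads."""
    diffs = [i for i in range(1, len(ref)) if ref[i] != read[i]]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "isolated mismatch: sequencing error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A SNP keeps the XOR of the two flanking colors unchanged.
        if int(ref[i]) ^ int(ref[j]) == int(read[i]) ^ int(read[j]):
            return "adjacent pair: SNP-compatible"
        return "adjacent pair: not a SNP"
    return "other"

reads = ["T3101130", "T3101130", "T3101130", "T3100130", "T3100030"]
for r in reads:
    print(r, decode(r), classify(reads[0], r))
```

Read #4 comes out as an isolated mismatch even though its base-space decoding (ACCCATT) differs from the others at three positions, while read #5 is flagged as SNP-compatible.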

As I said I've never written an assembler. But having manually done cs-to-de-space conversions and then using cs-naive assemblers with consequent poor results, I suspect that making a proper color-space-aware assembler is a bit more tricky than just converting from cs to de-space.

As for your actual question:

> So, how is a color-space contig converted to base-space?
>
> As I see it, there are 4 possible base-space versions for any color-space sequence -- one for each possible starting letter. Am I right?

I suppose in some sense you are correct. But for any given color-space read it will start off with only one letter and thus will decode to only one base-space read. As illustrated above the base-space read can be horribly incorrect even though the color-space read is almost perfectly correct. But there will be only one bs read.
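To make the "four versions versus one" point concrete, here is a small Python sketch under the same assumed 2-bit XOR encoding (A=0, C=1, G=2, T=3). The function names are hypothetical. A bare color sequence decodes differently for each starting letter, but a read that carries its primer base decodes only one way.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3; color = code(b1) XOR code(b2)

def decode_from(start, colors):
    """Decode a color string given its starting base; the start base is kept."""
    code = BASES.index(start)
    out = [start]
    for color in colors:
        code ^= int(color)
        out.append(BASES[code])
    return "".join(out)

def all_decodings(colors):
    """One base-space candidate per possible starting letter."""
    return {start: decode_from(start, colors) for start in BASES}

# A color-only sequence (no primer base): four distinct candidates.
candidates = all_decodings("101130")
for start, seq in candidates.items():
    print(start, seq)

# A read with its primer base: exactly one decoding (primer base dropped).
read = "T3101130"
print(decode_from(read[0], read[1:])[1:])
```

For the read `T3101130` the known primer base T selects a single candidate, which is why the starting letter has to be tracked when converting a contig.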

**drio** · 12-01-2010, 03:54 PM

@seb567 Why don't you take a look at how velvet works in color space? It has to do corrections prior to performing the assembly.

@westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to run when working with big genomes.

I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads.

**westerman** · 12-02-2010, 08:20 AM

> Originally posted by drio:
>
> @westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to run when working with big genomes.

I think you are talking about the 'SAET' tool. While we use it even for 'medium-sized' (300 Mbase) genomes, I agree that it works much better for bacterial-size projects. SAET processing can take a day or two to run when the overall coverage is low and the genome size is large. Sometimes I wonder if it is worth the effort.

> I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads.

I am not sure. I don't think that you would be able to do correction in real time because you would lack enough knowledge (or depth of coverage) near the beginning of the run. But I could be wrong. ABI/Lifetech has held a "future of sequencing" conference in San Diego during the last couple of days. I was unable to attend but hope to find out the details soon. Exciting times keep rolling our way!

2 Questions on color-space format