2 Questions on color-space format


  • westerman
    replied
    Originally posted by drio:
    > @westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to apply to big genomes.
    I think you are talking about the 'SAET' tool. While we use it even for 'medium-sized' (300 Mbase) genomes, I agree that it works much better for bacterial-size projects. SAET processing can take a day or two to run when the overall coverage is low and the genome is large. Sometimes I wonder if it is worth the effort.

    > I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads.
    I am not sure. I don't think that you would be able to do correction in real time because you would lack enough knowledge (or depth of coverage) near the beginning of the run. But I could be wrong. ABI/Lifetech has held a "future of sequencing" conference in San Diego during the last couple of days. I was unable to attend but hope to find out the details soon. Exciting times keep rolling our way!



  • drio
    replied
    @seb567 Why don't you take a look at how Velvet works in color space? It has to do corrections prior to performing the assembly.

    @westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to apply to big genomes.

    I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads.



  • westerman
    replied
    Well, I have never written an assembler, so I may be full of beans. But I do work with SOLiD data a lot. Nils would be a good person to chime in here.

    I see zero reason for a program that is trying to become "color-space aware" to convert color-space (cs) to double-encoded space (de-space) for internal use. de-space should only be used as a last resort by a human who is trying to use a non-color-space-aware program because he/she has no other alternative. It seems to me that any program which actually uses cs properly would be able to handle the 0,1,2,3 of cs as easily as it handles the artificial A,C,G,T of de-space. On the other hand, a program that insists on using the A,C,G,T of de-space would make me wonder if the program's author actually understood cs.

    Converting cs into base-space (bs) throws away all of the power of cs while also dragging all of the weaknesses of cs along.

    The major power (or advantage) of cs is that, at enough sequencing depth, it is self-correcting. A single cs mismatch *must* be a sequencing error. Two successive cs mismatches can be either sequencing errors (3/4 of the time) or a true SNP. In other words, if I have 5 reads:

    T3101130
    T3101130
    T3101130
    T3100130
    T3100030


    Then I know that the 4th read (a single mismatch) has a sequencing error, while the 5th read (two adjacent mismatches) could be an error or a SNP. On the other hand, if I convert into bs:

    ACCACGG
    ACCACGG
    ACCACGG
    ACCCATT
    ACCCCGG


    I would probably assume that read #4 was unrelated to the other four, and I would be wrong. Note that read #5 was a SNP after all.
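The decoding above can be reproduced with a short sketch. It assumes the standard SOLiD two-base color code, where a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The `classify` helper is hypothetical and deliberately naive: it only looks at interior positions and compares each read against the first read, which stands in for a consensus.

```python
BASES = "ACGT"  # 2-bit codes A=0, C=1, G=2, T=3; a color is the XOR of two adjacent base codes

def decode(cs_read):
    """Translate 'primer base + colors' into base space (the primer base is not emitted)."""
    code = BASES.index(cs_read[0])
    out = []
    for color in cs_read[1:]:
        code ^= int(color)          # each color gives the next base from the previous one
        out.append(BASES[code])
    return "".join(out)

def classify(consensus, read):
    """Hypothetical helper: label a read by its color mismatches against a consensus read.

    Assumes both reads have the same length and start at the same position.
    """
    diffs = [i for i in range(1, len(read)) if read[i] != consensus[i]]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "sequencing error"
    if len(diffs) == 2 and diffs[1] - diffs[0] == 1:
        i, j = diffs
        # A single-base change leaves the XOR of the two flanking colors unchanged.
        same_xor = int(read[i]) ^ int(read[j]) == int(consensus[i]) ^ int(consensus[j])
        return "possible SNP" if same_xor else "sequencing error"
    return "unclassified"

reads = ["T3101130", "T3101130", "T3101130", "T3100130", "T3100030"]
for r in reads:
    print(r, decode(r), classify(reads[0], r))
```

Read 4 comes out as a sequencing error and read 5 as a possible SNP, even though in base space read 4 looks far more divergent than read 5.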


    As I said I've never written an assembler. But having manually done cs-to-de-space conversions and then using cs-naive assemblers with consequent poor results, I suspect that making a proper color-space-aware assembler is a bit more tricky than just converting from cs to de-space.


    As for your actual question:

    > So, how is a color-space contig converted to base-space?
    >
    > As I see it, there are 4 possible base-space versions for any color-space sequence -- one for each possible starting letter. Am I right?
    I suppose in some sense you are correct. But any given color-space read starts with only one letter and thus decodes to only one base-space read. As illustrated above, the base-space read can be horribly incorrect even though the color-space read is almost perfectly correct. But there will be only one bs read.
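To make the "four versions" point concrete, here is a minimal sketch, again assuming the standard XOR color code (A=0, C=1, G=2, T=3). The function name `decode_from` and the example color string are illustrative only. A bare color string with no anchoring base decodes to four different sequences, and fixing the first base selects exactly one of them.

```python
BASES = "ACGT"  # 2-bit codes A=0, C=1, G=2, T=3

def decode_from(first_base, colors):
    """Decode a bare color string, given the base that precedes the first color."""
    code = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        code ^= int(color)
        out.append(BASES[code])
    return "".join(out)

colors = "101130"  # colors only, no primer base
candidates = {b: decode_from(b, colors) for b in BASES}
for base, seq in candidates.items():
    print(base, seq)
# A ACCACGG
# C CAACATT
# G GTTGTAA
# T TGGTGCC
```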



  • maubp
    replied
    Regarding Q2: I recall seeing some slides from SOLiD showing how they use each read's nucleotide prefix to work out the starting letter of a color-space contig. Basically, you could ignore the prefix for the de novo step; afterwards, the prefix of each read gives its own answer for which base the contig should start with, and you apply some sensible consensus to pick one. However, I don't work with color-space data.
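One way to picture that consensus step is the sketch below. It is not the SOLiD tools' actual method; it assumes the standard XOR color code, forward-strand reads only, and that the assembler has recorded the contig position (`offset`, a hypothetical field) at which each read's first base lands. Each read decodes its own first base from its primer and first color, walks back along the contig colors to position 0, and votes.

```python
from collections import Counter

BASES = "ACGT"  # 2-bit codes A=0, C=1, G=2, T=3

def vote_contig_start(contig_colors, placements):
    """Majority vote for the first base of a contig.

    contig_colors: colors between consecutive contig bases.
    placements: (read, offset) pairs; offset is the contig base index of the
    read's first base (assumed known from the assembly, forward strand only).
    """
    votes = Counter()
    for read, offset in placements:
        base = BASES.index(read[0]) ^ int(read[1])   # primer + first color -> first base
        for color in contig_colors[:offset]:          # XOR is its own inverse, so walking
            base ^= int(color)                        # backwards uses the same operation
        votes[BASES[base]] += 1
    return votes.most_common(1)[0][0], votes

def decode_contig(contig_colors, start):
    """Decode the contig colors once the starting base has been chosen."""
    code = BASES.index(start)
    out = [start]
    for color in contig_colors:
        code ^= int(color)
        out.append(BASES[code])
    return "".join(out)

contig = "101130"
placements = [("T3101130", 0), ("T2113", 2), ("T3113", 2)]  # last read has a bad first color
start, votes = vote_contig_start(contig, placements)
print(start, dict(votes), decode_contig(contig, start))
```

In this toy case two reads vote for A and the read with the erroneous first color votes for C, so the contig decodes as ACCACGG.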



  • lvaruzza
    replied
    Question 1: The B's and F's should not be there.

    Question 2: You need a method to track, within the contigs, the positions of the T's at the beginning of the reads, and do the conversion to base space taking that into account.
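For Question 1, a small check along these lines could flag reads that carry symbols that do not belong in a color-space read before they reach the assembler. It is only a sketch: the function name is hypothetical, and it assumes a read is one primer base followed by colors 0-3, with "." allowed for a missing color call as in csfasta files.

```python
def unexpected_symbols(read):
    """Return (index, character) pairs that do not fit 'primer base + colors'."""
    bad = []
    for i, ch in enumerate(read):
        allowed = "ACGT" if i == 0 else "0123."
        if ch not in allowed:
            bad.append((i, ch))
    return bad

for read in ("T2333132333313233232313333333233323F",
             "T2133131333313111331113331131231133B",
             "T3101130"):
    print(read, unexpected_symbols(read))
```

What to do with a flagged read (drop it, or trim the stray symbol together with its quality value) is a separate decision that depends on how the file was produced.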



  • seb567
    started a topic 2 Questions on color-space format


    Hello,

    I am currently implementing color-space assembly in Ray, a de novo assembler that runs entirely on the Message Passing Interface (MPI).

    I have read several documents on color space.


    Document 1

    SOLiD™ Data Format and File Definitions Guide
    http://www3.appliedbiosystems.com/cm...cms_058717.pdf


    Document 2

    SOLiD™ de novo accessory tools 2.0
    http://solidsoftwaretools.com/gf/dow...peline_2.0.pdf


    Document 3

    Applied Biosystems SOLiD™ 3 Plus System, De Novo Assembly Protocol
    http://solidsoftwaretools.com/gf/dow...col0060810.pdf


    Document 4

    SRA Handbook
    http://www.ncbi.nlm.nih.gov/books/NB...verview_BK.pdf


    I have two questions:


    Question 1

    From SRR001354.fastq (SRA001031, converted sra file to fastq):

    ...
    @SRR001354.1 S0013_20071128_2_DH10BFC_461_28_1048_F3 length=35
    T2333132333313233232313333333233323F
    +SRR001354.1 S0013_20071128_2_DH10BFC_461_28_1048_F3 length=35
    !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    ...
    @SRR001354.12 S0013_20071128_2_DH10BFC_461_59_1483_F3 length=35
    T2133131333313111331113331131231133B
    +SRR001354.12 S0013_20071128_2_DH10BFC_461_59_1483_F3 length=35
    !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%&
    ...



    What is the meaning of the trailing F and B in the sequences above ?

    Nothing is said about that in Document 1.

    As I understand it (I might be wrong), a color-space sequence has a starting nucleotide for bootstrapping. Also, the first color (after the starting nucleotide) depends on the starting nucleotide.

    Other colors are independent.

    Am I right?





    Question 2

    For de novo assembly, one must skip the starting nucleotide and the first color, and convert the remaining colors to double-encoding.
    Also, the reverse-complement of a vertex is simply its reverse, and the same holds for any sequence of SOLiD colors. Right?
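The reverse-complement property can be checked directly. The sketch below assumes the standard XOR color code (A=0, C=1, G=2, T=3), under which complementing both bases of a pair leaves their color unchanged, and the usual double-encoding map 0→A, 1→C, 2→G, 3→T. All function names are illustrative.

```python
BASES = "ACGT"  # 2-bit codes A=0, C=1, G=2, T=3
COMPLEMENT = str.maketrans("ACGT", "TGCA")
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")

def encode(seq):
    """Colors between consecutive bases of a base-space sequence (no primer)."""
    return "".join(str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:]))

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def double_encode(colors):
    """Write colors with the letters A, C, G, T so base-space tools accept them."""
    return colors.translate(DOUBLE_ENCODE)

seq = "ACCACGG"
colors = encode(seq)                         # "101130"
rc_colors = encode(reverse_complement(seq))  # colors of CCGTGGT
print(colors, rc_colors, rc_colors == colors[::-1])
print(double_encode(colors))                 # "CACCTA"
```

One consequence worth noting: once colors are double-encoded, the other strand is the plain reverse of the letters, not their base-space reverse-complement, so a color-unaware tool that complements the letters will pair the wrong sequences.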


    So, how is a color-space contig converted to base-space?

    As I see it, there are 4 possible base-space versions for any color-space sequence -- one for each possible starting letter. Am I right?

    Since an assembly has more than one color-space contig, I see a great deal of combinatorics there.


    Thank you in advance for your anticipated collective wisdom.


    Sébastien
    PhD student