
  • 2 Questions on color-space format

    Hello,

    I am currently implementing color-space assembly in Ray, a de novo assembler that runs entirely on the Message Passing Interface (MPI).

    I have read several documents on color space.


    Document 1

    SOLiD™ Data Format and File Definitions Guide



    Document 2

    SOLiD™ de novo accessory tools 2.0



    Document 3

    Applied Biosystems SOLiD™ 3 Plus System, De Novo Assembly Protocol



    Document 4

    SRA Handbook



    I have two questions:


    Question 1

    From SRR001354.fastq (SRA001031, converted sra file to fastq):

    ...
    @SRR001354.1 S0013_20071128_2_DH10BFC_461_28_1048_F3 length=35
    T2333132333313233232313333333233323F
    +SRR001354.1 S0013_20071128_2_DH10BFC_461_28_1048_F3 length=35
    !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    ...
    @SRR001354.12 S0013_20071128_2_DH10BFC_461_59_1483_F3 length=35
    T2133131333313111331113331131231133B
    +SRR001354.12 S0013_20071128_2_DH10BFC_461_59_1483_F3 length=35
    !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%&
    ...



    What is the meaning of the trailing F and B in the sequences above?

    Nothing is said about that in Document 1.

    As I understand it (I might be wrong), a color-space sequence has a starting nucleotide for bootstrapping. Also, the first color (after the starting nucleotide) depends on the starting nucleotide.

    Other colors are independent.

    Am I right?





    Question 2

    For de novo assembly, one must skip the starting nucleotide and skip the first color, and convert the remaining colors to double-encoding.
    Also, the reverse-complement of a vertex is simply its reverse, and the same holds for any sequence of SOLiD colors. Right?
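The two claims in Question 2 can be checked with a small sketch. It assumes the usual SOLiD color code, written here as a 2-bit XOR (A=0, C=1, G=2, T=3, color = left base XOR right base). The function names are illustrative and not taken from Ray or any SOLiD tool, and missing colors (`.`) are not handled.

```python
# Assumption: SOLiD colors expressed as a 2-bit XOR of adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
DOUBLE_ENCODING = {"0": "A", "1": "C", "2": "G", "3": "T"}


def encode_colors(bases):
    """Return the color string for each pair of adjacent bases."""
    return "".join(
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(bases, bases[1:])
    )


def reverse_complement(bases):
    """Reverse-complement a base-space sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(bases))


def to_double_encoding(cs_read):
    """Drop the primer base and the first color, then map 0123 to ACGT."""
    return "".join(DOUBLE_ENCODING[c] for c in cs_read[2:])


bases = "ACCACGG"
colors = encode_colors(bases)                          # colors of the forward strand
rc_colors = encode_colors(reverse_complement(bases))   # colors of the reverse strand
double_encoded = to_double_encoding("T3101130")

print(colors, rc_colors, double_encoded)
```

Complementing a base flips both bits, so the XOR of two complemented neighbours equals the XOR of the originals; only the order of the colors changes, which is why the reverse strand reads as the plain reverse in color space.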


    So, how is a color-space contig converted to base-space?

    As I see it, there are 4 possible base-space versions for any color-space sequence -- one for each possible starting letter. Am I right?

    Since an assembly has more than 1 color-space contig, I see a great deal of combinatorics there.
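To make the "4 possible versions" concrete, here is a minimal sketch that decodes one color string from each possible starting base. It assumes the XOR form of the color code (A=0, C=1, G=2, T=3); `decode` and `all_decodings` are hypothetical helper names.

```python
BASES = "ACGT"  # index doubles as the 2-bit code: A=0, C=1, G=2, T=3


def decode(start_base, colors):
    """Decode a color string into bases, given the first base."""
    bits = BASES.index(start_base)
    out = [start_base]
    for c in colors:
        bits ^= int(c)          # each color is the XOR of two adjacent bases
        out.append(BASES[bits])
    return "".join(out)


def all_decodings(colors):
    """Return the base-space candidate for each possible starting base."""
    return {b: decode(b, colors) for b in BASES}


candidates = all_decodings("101130")
for start, seq in candidates.items():
    print(start, seq)
```

Each contig has exactly four candidates, and the choice for one contig does not constrain the others, so the ambiguity is resolved per contig rather than across the whole assembly.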


    Thank you in advance for your anticipated collective wisdom.


    Sébastien
    PhD student

  • #2
    Question 1: The B's and F's should not be there

    Question 2: You need a method to track where the T's at the beginning of the reads fall in the contigs, and do the conversion to base space taking that into account.



    • #3
      Regarding Q2: I recall seeing some slides from SOLiD showing how they use the read's nucleotide prefix to work out the starting letter of a color-space contig. Basically, you could ignore the prefix for the de novo step; then the prefix of each read gives its own answer for which base the contig should start with, and you apply some sensible consensus to pick one. However, I don't work with color-space data.
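The consensus idea could look roughly like the sketch below. This is not the method from the SOLiD slides, which are not available here; it is a guess under simplifying assumptions: reads are forward-strand only, each read's offset in the contig is already known, and the contig colors are taken as correct. `vote_contig_start` and the placement tuples are hypothetical.

```python
from collections import Counter

BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent bases


def first_base_of_read(cs_read):
    """Primer base plus first color gives the first real base of the read."""
    return BASES[BASES.index(cs_read[0]) ^ int(cs_read[1])]


def vote_contig_start(contig_colors, placements):
    """Tally votes for the contig's first base.

    placements: iterable of (offset, cs_read), where offset is the contig
    base index that the read's first base aligns to (forward strand only).
    """
    votes = Counter()
    for offset, cs_read in placements:
        bits = BASES.index(first_base_of_read(cs_read))
        # Walk back to contig position 0 through the contig's own colors.
        for c in contig_colors[:offset]:
            bits ^= int(c)
        votes[BASES[bits]] += 1
    return votes


contig_colors = "101130"           # colors of a contig whose bases are ACCACGG
placements = [
    (0, "T3101130"),
    (2, "T21130"),
    (3, "T3130"),
    (1, "T301130"),                # first color is wrong, so this vote disagrees
]
votes = vote_contig_start(contig_colors, placements)
print(votes.most_common())
```

A single wrong first color only costs one vote, which is why a majority over many reads is attractive here.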



      • #4
        Well, I have never written an assembler, so I may be full of beans. But I do work with SOLiD data a lot. Nils would be a good person to chime in here.

        I see zero reason for a program that is trying to become "color-space aware" to convert color-space (cs) to double-encoded space (de-space) for internal use. de-space should only be used as a last resort by a human who is trying to use a non-color-space-aware program because he/she has no other alternative. It seems to me that any program which actually uses cs properly would be able to handle the 0,1,2,3 of cs as easily as it handles the artificial A,C,G,T of de-space. On the other hand, a program that insists on using the A,C,G,T of de-space would make me wonder if the program's author actually understood cs.

        Converting cs into base-space (bs) throws away all of the power of cs while also dragging all of the weaknesses of cs along.

        The major power (or advantage) of cs is that, at enough sequencing depth, it is self-correcting. A single cs mismatch *must* be a sequencing error. Two successive cs mismatches can be either a sequencing error (3/4th of the time) or a true SNP. In other words, if I have 5 reads:

        T3101130
        T3101130
        T3101130
        T3100130
        T3100030


        Then I know that the 4th read (a single mismatch) has a sequencing error while the 5th read (two mismatches) could be error or a SNP. On the other hand if I convert into bs:

        ACCACGG
        ACCACGG
        ACCACGG
        ACCCATT
        ACCCCGG


        I would probably assume that read #4 was not related to the other four, and I would be incorrect in that assumption. Note that read #5 was a SNP after all.
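The rule above can be sketched as a small classifier. It assumes the XOR form of the color code, under which a single-base change between two unchanged neighbours alters two adjacent colors while keeping their XOR the same. The function and label names are made up for this example, and read ends are ignored (a mismatch in the very last color could still reflect a change in the last base).

```python
def classify_mismatches(ref_colors, read_colors):
    """Label runs of color mismatches between a consensus and a read.

    Returns (start, run_length, label) tuples, where label is
    'isolated'        one mismatch with matching neighbours: not a SNP,
    'snp_compatible'  two adjacent mismatches whose XOR equals the reference's,
    'invalid_pair'    two adjacent mismatches no single-base change explains,
    'complex'         longer runs, not analysed here.
    """
    if len(ref_colors) != len(read_colors):
        raise ValueError("color strings must have the same length")
    calls = []
    i, n = 0, len(ref_colors)
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        j = i
        while j < n and ref_colors[j] != read_colors[j]:
            j += 1
        run = j - i
        if run == 1:
            label = "isolated"
        elif run == 2:
            ref_xor = int(ref_colors[i]) ^ int(ref_colors[i + 1])
            read_xor = int(read_colors[i]) ^ int(read_colors[i + 1])
            label = "snp_compatible" if ref_xor == read_xor else "invalid_pair"
        else:
            label = "complex"
        calls.append((i, run, label))
        i = j
    return calls


consensus = "3101130"
read4_calls = classify_mismatches(consensus, "3100130")
read5_calls = classify_mismatches(consensus, "3100030")
print(read4_calls, read5_calls)
```

On the five reads above, read #4 comes out as an isolated mismatch and read #5 as a SNP-compatible pair, matching the reasoning in the post.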


        As I said I've never written an assembler. But having manually done cs-to-de-space conversions and then using cs-naive assemblers with consequent poor results, I suspect that making a proper color-space-aware assembler is a bit more tricky than just converting from cs to de-space.


        As for your actual question:

        "So, how is a color-space contig converted to base-space?

        As I see it, there are 4 possible base-space versions for any color-space sequence -- one for each possible starting letter. Am I right?"

        I suppose in some sense you are correct. But any given color-space read starts with only one letter and thus decodes to only one base-space read. As illustrated above, the base-space read can be horribly incorrect even though the color-space read is almost perfectly correct. But there will be only one bs read.
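As a quick check of that point, the sketch below decodes the five example reads from their primer base and counts base-space differences against the first read. It assumes the XOR form of the color code (A=0, C=1, G=2, T=3); `decode_read` and `hamming` are illustrative names.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent bases


def decode_read(cs_read):
    """Decode primer base + colors; the primer base is not part of the output."""
    bits = BASES.index(cs_read[0])
    out = []
    for c in cs_read[1:]:
        bits ^= int(c)
        out.append(BASES[bits])
    return "".join(out)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


reads = ["T3101130", "T3101130", "T3101130", "T3100130", "T3100030"]
decoded = [decode_read(r) for r in reads]
differences = [hamming(decoded[0], d) for d in decoded]

for read, bases, diff in zip(reads, decoded, differences):
    print(read, bases, diff)
```

The single color error in read #4 corrupts every base after it, while the SNP in read #5 changes exactly one base.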



        • #5
          @seb567 Why don't you take a look at how Velvet works in color space? It has to do corrections prior to performing the assembly.

          @westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to apply to big genomes.

          I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads.
          -drd



          • #6
            Originally posted by drio
            "@westerman ABI had some code to perform spectral corrections (corrections in CS without a reference genome). It was computationally difficult to apply to big genomes."
            I think you are talking about the 'SAET' tool. While we use it even for 'medium-sized' (300 MBase) genomes I agree that it works much better for bacterial size projects. SAET processing can take a day or two to run when the overall coverage is low and the genome size large. Sometimes I wonder if it is worth the effort.

            "I guess this spectral correction ("in real time") is what is coming in the new ABI sequencers? So they can then drop sequence space reads."
            I am not sure. I don't think that you would be able to do correction in real time because you would lack enough knowledge (or depth of coverage) near the beginning of the run. But I could be wrong. ABI/Lifetech has held a "future of sequencing" conference in San Diego during the last couple of days. I was unable to attend but hope to find out the details soon. Exciting times keep rolling our way!
