  • Accuracy of SOLiD platform

    Hey guys,

    I've been getting mixed messages about the accuracy of the reads coming off the SOLiD platform. A 454 representative swore blind to me last week that only 25% of SOLiD reads are error-free (!), but this doesn't fit in with the per-base error rates I hear from other people. Can I get some insight into this from people who've been actually running these machines?

  • #2
    Since he was promoting 454 I guess that number is for reads without color-space errors, and with the old chemistry. Did he also comment on the number of sequences / bases generated per run compared to 454?...

    • #3
      Hi Chipper,

      Thanks - can you comment on the error rate using the new chemistry?

      The 454 rep acknowledged that their bases/run is vastly lower than their competitors (even he couldn't deny that) - but argued that their higher accuracy and longer read length made assembly so much easier that lower coverage was sufficient. On that basis he claimed that their new Titanium system (500 Mb/run) will be roughly cost-competitive with Solexa and SOLiD for at least some projects. Does that seem remotely plausible?

      • #4
        It has something to do with how the ligation primers are placed, and will give a more even (lower) error rate for all five primers. Yes, the 454 may be better for some purposes due to read length, but the read length is likely going to increase on the SOLiD as well.

        • #5
          It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets.

          • #6
            "It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets."

            As SOLiD requires a colour-space aligner, I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data, which are just straight strings with Q-scores.
            My experience of SOLiD is that the raw read error rate is ~Q15. The two-colour changes are ~Q30. Exploiting the latter on human requires a longer read length than is currently standard on the platform, but that will come soon.
            The amount of alignable Solexa data depends on cluster density, but at the optimum it's ~95% of the PF data (PF = non-overlapping clusters). This can be ~5 gigabases for a paired-end run taking 4-5 days, so about 1 Gb of alignable data (on GAII) per day, provided you have optimised your cluster density for the sample.

            SOLiD tries to align all of the data and can be very variable, but typically about 20-30% aligns to human. I'm not sure if the standard SOLiD aligner can do whole-human alignments in one go; if it can't, you may want to look at MAQ from Sanger.

            You may find more on www.genographia.org
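
For readers comparing the Q-scores in post #6 with the percentage error rates quoted elsewhere in the thread, here is a minimal sketch of the standard Phred conversion, Q = -10 * log10(p). The function names are made up for illustration; only the formula itself is standard.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The two scores mentioned in post #6.
    for q in (15, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

On this scale Q15 is roughly a 3% error probability per call and Q30 is 0.1%.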

            • #7
              "as solid requires a colour-space aligner I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data"

              As you say, to take full advantage of color-space sequence you would need an aligner that understood that it was using dual-base encoded data. But by "double encoding" CS sequence you can use it with any sequence alignment program.

              That is the equivalent of tying one hand behind your back. It is also possible to convert the raw SOLiD CS reads into (real) sequence space. This would be the equivalent of tying both your hands behind your back. Any error in color space would then propagate downstream, ensuring that subsequent correct color-space calls would be converted to incorrect sequence-space calls.

              --
              Phillip
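
The error propagation described in post #7 can be illustrated with a small sketch. It assumes the usual two-base encoding in which A, C, G, T are numbered 0-3 and each color is the XOR of two adjacent base codes; the primer base "T" and the example sequence are arbitrary choices, not taken from the thread.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (assumed two-base encoding)


def encode(primer_base: str, seq: str) -> str:
    """Encode a base sequence as colors, starting from a known primer base."""
    prev = BASES.index(primer_base)
    colors = []
    for base in seq:
        cur = BASES.index(base)
        colors.append(str(prev ^ cur))  # color = XOR of adjacent base codes
        prev = cur
    return "".join(colors)


def decode(primer_base: str, colors: str) -> str:
    """Naively decode colors to bases, walking forward from the primer base."""
    code = BASES.index(primer_base)
    out = []
    for color in colors:
        code ^= int(color)  # each base depends on every color before it
        out.append(BASES[code])
    return "".join(out)


if __name__ == "__main__":
    truth = "ACGGTCA"
    colors = encode("T", truth)
    # Introduce one color error at index 2 and decode naively.
    bad = colors[:2] + str((int(colors[2]) + 2) % 4) + colors[3:]
    print(truth)
    print(decode("T", bad))  # every base from index 2 onward is wrong
```

Because decoding is a running XOR, a single wrong color shifts every later base call, which is why naive conversion throws away the error-correcting property of the encoding.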

              • #8
                Hey Phillip, This is a good discussion that I haven't had outside of my close colleagues...and I want to make sure I'm understanding it correctly. I agree it's a waste of time to convert colorspace directly to basespace using the decoding rules...any read becomes useless after a normally-correctable single color error.

                By "double encoding" you are referring to the practice of converting colorspace reads 0,1,2,3 to A,C,G,T directly (what ABI calls pseudo-base space, or what I call "fakebase"), converting the reference to "fakebase" in the same way, and using them with existing tools. (Of course you have to ditch the adapter anchor base and the first colorspace call, as neither is in the genome.)

                At first glance this appears to make reads/reference that can be read by any tool...but there is a problem with this approach, and it comes up when the program tries to work with the reverse complement of the reads. The reverse complement of colorspace is just the reverse of the sequence....NOT the reverse complement as in base space. Thus you cannot align to both strands if you just do a simple csfasta->fakebase conversion.

                The above can be made to work by also putting in the reverse of the colorspace reads to your fakebase input file...unfortunately this doubles the number of reads you will deal with (potentially causing memory issues), and makes the output parsing a bit confusing, but the upside is that you can use any tool you want.

                I have found MAQ to have the best support for SOLiD yet, as it's able to do the appropriate conversions with builtin functions, and properly deals with the reverses. Last but not least it can now output a nucleotide-corrected alignment...so you can immediately get back to basespace, but the color information has been used to generate the alignment.
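
The reverse versus reverse-complement point in post #8 can be checked with a short sketch. It assumes the XOR two-base encoding (A=0, C=1, G=2, T=3); the helper names and the "_rev" suffix are hypothetical, and the example sequence is arbitrary.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (assumed two-base encoding)
COMPLEMENT = str.maketrans("ACGT", "TGCA")
FAKEBASE = str.maketrans("0123", "ACGT")


def to_colors(seq: str) -> str:
    """Colors between adjacent bases (no primer base, no first color)."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:])
    )


def revcomp(seq: str) -> str:
    """Ordinary base-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]


def to_fakebase(colors: str) -> str:
    """'Double encode' colors 0,1,2,3 as the letters A,C,G,T."""
    return colors.translate(FAKEBASE)


def with_reversed(reads: dict) -> dict:
    """Add a plain-reversed copy of each fakebase read (hypothetical '_rev' suffix)."""
    out = {}
    for name, fake in reads.items():
        out[name] = fake
        out[name + "_rev"] = fake[::-1]
    return out


if __name__ == "__main__":
    seq = "ACGGTCA"
    fake = to_fakebase(to_colors(seq))
    other_strand = to_fakebase(to_colors(revcomp(seq)))
    print(other_strand == fake[::-1])    # the other strand is the plain reverse
    print(other_strand == revcomp(fake)) # not what a base-space tool would compute
    print(with_reversed({"read1": fake}))
```

A base-space tool that reverse-complements the fakebase string looks for the wrong pattern on the opposite strand, so the doubled input (forward plus plain reverse) is the workaround post #8 describes, at the cost of twice as many reads.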

                • #9
                  Hi Eco,
                  Yes, you understood me perfectly. And you are right, I completely missed the reverse/reverse-complement issue.

                  Bummer! I thought I would be able to use VMATCH trivially on color space data (without even converting to fakebase). That's not in the cards, is it?

                  --
                  Phillip

                  • #10
                    Originally posted by dgmacarthur View Post
                    - can you comment on the error rate using the new chemistry?
                    From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                    Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                    Cheers

                    • #11
                      Originally posted by cgb View Post
                      SOLiD tries to align all of the data and can be very variable, but typically about 20-30% aligns to human
                      Ok, 20-30% is really low, even for version 1 chemistry. If you are using a high quality, high molecular weight genomic DNA input, I would expect to see at least 35% mapped with version 1.
                      Our complex genome runs on the new chemistry show >50% mapped.

                      Originally posted by cgb View Post
                      I'm not sure if the standard SOLiD aligner can do whole-human alignments in one go; if it can't, you may want to look at MAQ from Sanger.
                      You align to each chromosome separately then aggregate the data into a single file. At this point, reads that are uniquely mapped to the genome are pulled out as input to the SNP and consensus calling pipeline. If you run this on a cluster, each mapping uses a single core. So you need at least 24 cores to map to all human chromosomes concurrently.

                      MAQ is very slow but can be used for mate-pair rescue. SHRiMP is another option if you want to use a different aligner.

                      • #12
                        Originally posted by jungle View Post
                        From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                        Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                        Cheers
                        And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?
                        --
                        bioinfosm

                        • #13
                          Originally posted by bioinfosm View Post
                          And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?
                          I should point out that when I say mis-called SNP, I mean erroneous valid adjacents since I am talking about errors at the read level, not in the consensus.
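
For readers unfamiliar with the term, here is a rough sketch of a "valid adjacent" check. It assumes the XOR two-base encoding (A=0, C=1, G=2, T=3), under which a single-base change alters two neighbouring colors while leaving their XOR unchanged. The thread does not spell out this rule, so treat the function as an illustration of the idea rather than the AB pipeline's actual test.

```python
def is_valid_adjacent(ref_pair: str, read_pair: str) -> bool:
    """True if two adjacent color mismatches are consistent with one base change.

    Assumption: XOR two-base encoding, so a SNP changes both colors
    but preserves the XOR of the pair.
    """
    r0, r1 = int(ref_pair[0]), int(ref_pair[1])
    q0, q1 = int(read_pair[0]), int(read_pair[1])
    both_differ = r0 != q0 and r1 != q1
    return both_differ and (r0 ^ r1) == (q0 ^ q1)


if __name__ == "__main__":
    # Reference ACGT has colors 1,3,1; the variant ACTT has colors 1,2,0.
    print(is_valid_adjacent("31", "20"))  # consistent with a real SNP
    print(is_valid_adjacent("31", "22"))  # two mismatches, not SNP-like
    print(is_valid_adjacent("31", "21"))  # single color mismatch
```

Under this reading, an erroneous valid adjacent is a pair of measurement errors that happens to pass the check and so looks like a SNP at the read level.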

                          Mapping to a large genome (e.g. human) is done on each chromosome separately. The data are then aggregated into a single file so reads that map uniquely to the genome can be identified. The unique hit file is then separated into individual chromosomes again (i.e. 24 separate unique match files). Consensus and SNP calling is done on these individually.

                          What you end up with is a folder full of files for each chromosome. These include consensus sequences in base space, a list of all variants, a list of "confirmed" SNPs, coverage depth at each position in the reference, genomic coordinates of regions that are covered at least once, and gff files for the alignment.

                          I use the AB pipelines (albeit adapted to my needs) as they work and are easy to manipulate (mostly perl).


                          Hope this helps!
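
The map-per-chromosome, aggregate, keep-unique, split-again steps in post #13 can be sketched as below. This is not the AB pipeline: the data layout (a dict of chromosome name to a list of (read_id, position) hits) and all names are assumptions made for illustration.

```python
from collections import defaultdict


def unique_hits_by_chromosome(per_chrom_hits: dict) -> dict:
    """Keep reads that hit exactly one place genome-wide, grouped by chromosome.

    per_chrom_hits maps chromosome name -> list of (read_id, position),
    i.e. the result of mapping all reads against each chromosome separately.
    """
    # Aggregate: collect every hit for every read across all chromosomes.
    all_hits = defaultdict(list)
    for chrom, hits in per_chrom_hits.items():
        for read_id, pos in hits:
            all_hits[read_id].append((chrom, pos))

    # Filter to uniquely mapped reads and split back out per chromosome.
    unique = defaultdict(list)
    for read_id, places in all_hits.items():
        if len(places) == 1:
            chrom, pos = places[0]
            unique[chrom].append((read_id, pos))
    return dict(unique)


if __name__ == "__main__":
    # Made-up hits: r2 maps to two chromosomes, r3 maps twice within chr1.
    hits = {
        "chr1": [("r1", 100), ("r2", 250), ("r3", 10), ("r3", 900)],
        "chr2": [("r2", 40), ("r4", 77)],
    }
    print(unique_hits_by_chromosome(hits))
```

The aggregation step is what makes "unique" mean unique across the whole genome rather than unique within one chromosome.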

                          • #14
                            Originally posted by jungle View Post
                            From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                            Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                            Cheers
                            Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP-error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples, the raw-read error rate on SOLiD being ~8-15% (?) and ~1% on Solexa. I don't know the Solexa metric equivalent to the SOLiD 'error' (two-colour miscall) rate... anybody?

                            • #15
                              Originally posted by cgb View Post
                              Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP-error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples, the raw-read error rate on SOLiD being ~8-15% (?) and ~1% on Solexa. I don't know the Solexa metric equivalent to the SOLiD 'error' (two-colour miscall) rate... anybody?
                              No worries cgb.

                              The SOLiD systematic error rate was 4-5% on old chemistry and is now ~3%.

                              I think the closest comparison to Solid would be to the erroneous valid adjacent error rate (~0.075%). I have never worked with Solexa data, so I have no first hand impression of the error rate there. However, I found a recent publication on the Solexa 1G that says "We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads". That seems a bit higher than I would have expected...

                              Anyone have more up-to-date information?
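
As a back-of-the-envelope way to relate the per-call error rates in this thread to the "fraction of error-free reads" figure from the first post, the sketch below computes (1 - e)^L. It assumes errors are independent and uniform along the read, which real data will not satisfy exactly, and the read length of 35 is a placeholder, since the thread does not state one.

```python
def fraction_error_free(per_call_error: float, read_length: int) -> float:
    """Expected fraction of reads with no errors, assuming independent,
    uniformly distributed errors at every call (a simplification)."""
    return (1 - per_call_error) ** read_length


if __name__ == "__main__":
    assumed_read_length = 35  # placeholder, not stated in the thread
    # Raw (systematic) rates quoted in post #15, as fractions.
    for rate in (0.03, 0.04, 0.05):
        frac = fraction_error_free(rate, assumed_read_length)
        print(f"raw error {rate:.0%}: {frac:.1%} of reads error-free")
```

With these assumed inputs, a raw error rate of a few percent per color call leaves well under half of the reads free of raw errors, so a low error-free fraction and a much lower post-correction error rate are not necessarily in conflict.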
