
  • Accuracy of SOLiD platform

    Hey guys,

    I've been getting mixed messages about the accuracy of the reads coming off the SOLiD platform. A 454 representative swore blind to me last week that only 25% of SOLiD reads are error-free (!), but this doesn't fit in with the per-base error rates I hear from other people. Can I get some insight into this from people who've been actually running these machines?
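One way to relate a "fraction of error-free reads" figure to a per-base error rate is the back-of-envelope calculation below. It is only a sketch: it assumes errors are independent and uniform along the read, and the read length of 35 is an assumption for illustration, not a figure from this thread.

```python
def error_free_fraction(per_base_error: float, read_length: int) -> float:
    """Fraction of reads with zero errors, assuming independent, uniform errors."""
    return (1.0 - per_base_error) ** read_length


def per_base_error_for(fraction_error_free: float, read_length: int) -> float:
    """Per-base error rate implied by a given error-free read fraction."""
    return 1.0 - fraction_error_free ** (1.0 / read_length)


# Assumed read length (hypothetical, for illustration only).
READ_LENGTH = 35

implied = per_base_error_for(0.25, READ_LENGTH)
print(f"25% error-free reads at {READ_LENGTH} bp implies ~{implied:.1%} per-base error")
```

Under those assumptions, 25% error-free reads corresponds to a per-base error of roughly 4%, so the two kinds of number are not necessarily in conflict; they measure different things.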

  • #2
    Since he was promoting 454 I guess that number is for reads without color-space errors, and with the old chemistry. Did he also comment on the number of sequences / bases generated per run compared to 454?...

    • #3
      Hi Chipper,

      Thanks - can you comment on the error rate using the new chemistry?

      The 454 rep acknowledged that their bases/run is vastly lower than their competitors (even he couldn't deny that) - but argued that their higher accuracy and longer read length made assembly so much easier that lower coverage was sufficient. On that basis he claimed that their new Titanium system (500 Mb/run) will be roughly cost-competitive with Solexa and SOLiD for at least some projects. Does that seem remotely plausible?

      • #4
        It has something to do with how the ligation primers are placed, and will give a more even (lower) error rate for all five primers. Yes, the 454 may be better for some purposes due to read length, but the read length is likely going to increase on the SOLiD as well.

        • #5
          It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets.

          • #6
            "It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets."

            As SOLiD requires a colour-space aligner, I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data, which are just straight strings with Q-scores.
            My experience of SOLiDs is that the raw read error rate is ~Q15. The two-colour changes are ~Q30. Exploiting the latter on human requires a longer read length than is currently standard on the platform, but that will come soon.
            The amount of alignable Solexa data depends on cluster density, but at the optimum it's ~95% of the PF data (PF = non-overlapping clusters). This can be ~5 gigabases for a paired-end run taking 4-5 days, so about 1 Gb of alignable data (on GAII) per day, provided you have optimised your cluster density for the sample.
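For readers less used to Q-scores, the Q15 and Q30 figures above can be turned into error probabilities with the standard Phred relation Q = -10 * log10(p). A minimal sketch (the function names are just illustrative):

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


for q in (15, 30):
    print(f"Q{q} -> error probability {phred_to_error(q):.4f}")
```

So Q15 is an error rate of about 3.2% and Q30 is 0.1%.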

            SOLiD tries to align all of the data and can be very variable, but typically about 20-30% align to human. I'm not sure if the standard SOLiD aligner can do whole-human alignments in one go; if not, you may want to look at MAQ from Sanger.

            you may find more on www.genographia.org

            • #7
              "as solid requires a colour-space aligner I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data"

              As you say, to take full advantage of color-space sequence you would need an aligner that understood that it was using dual-base encoded data. But by "double encoding" CS sequence you can use it with any sequence alignment program.

              That is the equivalent of tying one hand behind your back. It is also possible to convert the raw SOLiD CS reads into (real) sequence space. This would be the equivalent of tying both your hands behind your back. Any error in color space would then propagate downstream ensuring subsequent correct color space calls would be converted to incorrect sequence space calls.
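The propagation problem can be seen in a small sketch. It assumes the usual two-bit view of the SOLiD colour code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases); the example sequence and primer base are made up.

```python
BASES = "ACGT"  # two-bit codes: A=0, C=1, G=2, T=3


def encode(primer_base: str, seq: str) -> list[int]:
    """Colour-space encoding: each colour is the XOR of two adjacent bases."""
    colors = []
    prev = BASES.index(primer_base)
    for base in seq:
        cur = BASES.index(base)
        colors.append(prev ^ cur)
        prev = cur
    return colors


def decode(primer_base: str, colors: list[int]) -> str:
    """Naive decoding: walk from the known primer base, one colour at a time."""
    prev = BASES.index(primer_base)
    out = []
    for c in colors:
        prev ^= c
        out.append(BASES[prev])
    return "".join(out)


true_seq = "ACGTACGT"            # made-up example read
colors = encode("T", true_seq)   # "T" as the known primer base (assumption)

bad = list(colors)
bad[2] ^= 3                      # a single colour miscall at position 2

decoded = decode("T", bad)
wrong = [i for i, (a, b) in enumerate(zip(true_seq, decoded)) if a != b]
print(decoded, wrong)
```

With one colour miscalled at position 2, every decoded base from position 2 to the end of the read is wrong, even though all the later colour calls were correct.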

              --
              Phillip

              • #8
                Hey Phillip, This is a good discussion that I haven't had outside of my close colleagues...and I want to make sure I'm understanding it correctly. I agree it's a waste of time to convert colorspace directly to basespace using the decoding rules...any read becomes useless after a normally-correctable single color error.

                By "double encoding" you are referring to the practice of converting colorspace reads 0,1,2,3 to A,C,G,T directly (what ABI calls pseudo-base space, or what I call "fakebase"), converting the reference to "fakebase" in the same way, and using them with existing tools. (Of course you have to ditch the adapter anchor base and the first colorspace call, as neither is in the genome.)

                At first glance this appears to make reads/reference that can be read by any tool...but there is a problem with this approach, and it comes up when the program tries to work with the reverse complement of the reads. The reverse complement of colorspace is just the reverse of the sequence...NOT the reverse complement as in base space. Thus you cannot align to both strands if you just do a simple csfasta->fakebase conversion.

                The above can be made to work by also putting in the reverse of the colorspace reads to your fakebase input file...unfortunately this doubles the number of reads you will deal with (potentially causing memory issues), and makes the output parsing a bit confusing, but the upside is that you can use any tool you want.
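A small sketch of the strand issue described above, assuming the usual XOR view of the colour code (A=0, C=1, G=2, T=3) and the 0,1,2,3 -> A,C,G,T "fakebase" mapping from this post. The example sequence is made up and the helper names are illustrative only.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Ordinary base-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]


def colors(seq: str) -> list[int]:
    """Colours for the transitions between adjacent bases (no primer base)."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]


def to_fakebase(cs: list[int]) -> str:
    """'Double encoding': write colours 0,1,2,3 as the letters A,C,G,T."""
    return "".join(BASES[c] for c in cs)


seq = "ACGGTCA"                                  # made-up example
forward = to_fakebase(colors(seq))               # fakebase of the forward strand
reverse = to_fakebase(colors(revcomp(seq)))      # fakebase of the reverse strand

print(forward, reverse, revcomp(forward))
```

The fakebase string for the opposite strand is the plain reverse of the forward one, whereas a base-space aligner would look for `revcomp(forward)`, which is a different string. Adding the reversed reads to the input, as suggested above, works around that.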

                I have found MAQ to have the best support for SOLiD yet, as it's able to do the appropriate conversions with builtin functions, and properly deals with the reverses. Last but not least it can now output a nucleotide-corrected alignment...so you can immediately get back to basespace, but the color information has been used to generate the alignment.

                • #9
                  Hi Eco,
                  Yes, you understood me perfectly. And you are right, I completely missed the reverse/reverse-complement issue.

                  Bummer! I thought I would be able to use VMATCH trivially on color space data (without even converting to fakebase). That's not in the cards, is it?

                  --
                  Phillip

                  • #10
                    Originally posted by dgmacarthur
                    - can you comment on the error rate using the new chemistry?
                    From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                    Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                    Cheers

                    • #11
                      Originally posted by cgb
                      SOLiD tries to align all of the data and can be very variable, but typically about 20-30% align to human
                      Ok, 20-30% is really low, even for version 1 chemistry. If you are using a high quality, high molecular weight genomic DNA input, I would expect to see at least 35% mapped with version 1.
                      Our complex genome runs on the new chemistry show >50% mapped.

                      Originally posted by cgb
                      I'm not sure if the standard SOLiD aligner can do whole-human alignments in one go; if not, you may want to look at MAQ from Sanger.
                      You align to each chromosome separately then aggregate the data into a single file. At this point, reads that are uniquely mapped to the genome are pulled out as input to the SNP and consensus calling pipeline. If you run this on a cluster, each mapping uses a single core. So you need at least 24 cores to map to all human chromosomes concurrently.

                      MAQ is very slow but can be used for mate-pair rescue. SHRiMP is another option if you want to use a different aligner.

                      • #12
                        Originally posted by jungle
                        From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                        Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                        Cheers
                        And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?
                        --
                        bioinfosm

                        • #13
                          Originally posted by bioinfosm
                          And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?
                          I should point out that when I say mis-called SNP, I mean erroneous valid adjacents since I am talking about errors at the read level, not in the consensus.
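For anyone unfamiliar with the term: a real SNP changes two adjacent colours, and it changes both by the same amount under the XOR view of the colour code (A=0, C=1, G=2, T=3). A two-colour change of that kind is a "valid adjacent"; an erroneous one is a pair of miscalls that happens to look like one. A minimal sketch of the check, with made-up names and data (not the AB pipeline's code):

```python
def classify_adjacent(ref_colors: list[int], read_colors: list[int], i: int) -> str:
    """Classify the colour pair at positions i and i+1 of a read vs. the reference."""
    d1 = ref_colors[i] ^ read_colors[i]
    d2 = ref_colors[i + 1] ^ read_colors[i + 1]
    if d1 == 0 and d2 == 0:
        return "match"
    if d1 != 0 and d2 != 0:
        # Both colours changed; consistent with one base change only if d1 == d2.
        return "valid adjacent" if d1 == d2 else "invalid adjacent"
    return "single colour mismatch"


# Reference bases ACGT give colours [1, 3, 1]; the SNP G->A (ACAT) gives [1, 1, 3].
ref = [1, 3, 1]
print(classify_adjacent(ref, [1, 1, 3], 1))  # consistent with a SNP
print(classify_adjacent(ref, [1, 1, 2], 1))  # two changes, not consistent with a SNP
print(classify_adjacent(ref, [1, 3, 0], 1))  # lone colour change, likely a miscall
```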

                          Mapping to a large genome (e.g. human) is done on each chromosome separately. The data are then aggregated into a single file so that reads that map uniquely to the genome can be identified. The unique hit file is then separated into individual chromosomes again (i.e. 24 separate unique match files). Consensus and SNP calling are done on these individually.
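The aggregate-then-split step can be sketched roughly as follows. This is an illustration of the logic only, not the AB pipeline itself; the data layout (a dict of per-chromosome hit lists) and the function name are assumptions.

```python
from collections import defaultdict


def unique_hits_by_chromosome(hits_per_chrom):
    """Keep reads with exactly one hit genome-wide, regrouped by chromosome.

    hits_per_chrom maps a chromosome name to a list of (read_id, position) hits.
    """
    # Aggregate: collect every hit for each read across all chromosomes.
    all_hits = defaultdict(list)
    for chrom, hits in hits_per_chrom.items():
        for read_id, pos in hits:
            all_hits[read_id].append((chrom, pos))

    # Split: reads with a single hit go back into per-chromosome lists.
    unique = defaultdict(list)
    for read_id, locations in all_hits.items():
        if len(locations) == 1:
            chrom, pos = locations[0]
            unique[chrom].append((read_id, pos))
    return dict(unique)


hits = {
    "chr1": [("r1", 100), ("r2", 200)],
    "chr2": [("r2", 50), ("r3", 75)],
}
print(unique_hits_by_chromosome(hits))
```

Here `r2` hits both chromosomes and is dropped, so only `r1` and `r3` would go forward to consensus and SNP calling.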

                          What you end up with is a folder full of files for each chromosome. These include consensus sequences in base space, a list of all variants, a list of "confirmed" SNPs, coverage depth at each position in the reference, genomic coordinates of regions that are covered at least once, and GFF files for the alignment.

                          I use the AB pipelines (albeit adapted to my needs) as they work and are easy to manipulate (mostly Perl).


                          Hope this helps!

                          • #14
                            Originally posted by jungle
                            From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

                            Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

                            Cheers
                            Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples: the raw read error rate on SOLiD being ~8-15%(?) and ~1% on Solexa. I don't know the equivalent Solexa metric to the SOLiD 'error' (two-colour miscall) rate... anybody?

                            • #15
                              Originally posted by cgb
                              Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples: the raw read error rate on SOLiD being ~8-15%(?) and ~1% on Solexa. I don't know the equivalent Solexa metric to the SOLiD 'error' (two-colour miscall) rate... anybody?
                              No worries cgb.

                              The SOLiD systematic error rate was 4-5% on the old chemistry and is now ~3%.

                              I think the closest comparison to Solid would be to the erroneous valid adjacent error rate (~0.075%). I have never worked with Solexa data, so I have no first hand impression of the error rate there. However, I found a recent publication on the Solexa 1G that says "We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads". That seems a bit higher than I would have expected...

                              Anyone have more up-to-date information?
