
**Chipper** · 07-01-2008, 10:48 AM

Since he was promoting 454 I guess that number is for reads without color-space errors, and with the old chemistry. Did he also comment on the number of sequences / bases generated per run compared to 454?...

**dgmacarthur** · 07-01-2008, 03:08 PM

Hi Chipper,

Thanks - can you comment on the error rate using the new chemistry?

The 454 rep acknowledged that their bases/run is vastly lower than their competitors (even he couldn't deny that) - but argued that their higher accuracy and longer read length made assembly so much easier that lower coverage was sufficient. On that basis he claimed that their new Titanium system (500 Mb/run) will be roughly cost-competitive with Solexa and SOLiD for at least some projects. Does that seem remotely plausible?

**Chipper** · 07-01-2008, 10:04 PM

It has something to do with how the ligation primers are placed, and will give a more even (lower) error rate for all five primers. Yes, the 454 may be better for some purposes due to read length, but the read length is likely to increase on the SOLiD as well.

**solidifier** · 07-11-2008, 12:31 PM

It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets.

**cgb** · 07-12-2008, 10:25 AM

"It's a very hard question to answer based on the uniqueness of the ligation concept, but in the simplest example, we've seen a much higher percentage of SOLiD sequences aligned to genome using any algorithm than we did with Solexa datasets."

As SOLiD requires a colour-space aligner, I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data, which are just straight strings with Q-scores.
My experience of SOLiD is that the raw read error rate is ~Q15. The two-colour changes are ~Q30. Exploiting the latter on human requires a longer read length than is currently standard on the platform, but that will come soon.
The amount of alignable Solexa data depends on cluster density, but at the optimum it's ~95% of the PF data (PF = non-overlapping clusters). This can be ~5 gigabases for a paired-end run taking 4-5 days, so about 1 Gb of alignable data (on GAII) per day, provided you have optimised your cluster density for the sample.
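
For readers less used to Q-scores, the figures above can be turned into error probabilities with the standard Phred relation Q = -10 * log10(p). The following is only a minimal sketch; the function names are made up for illustration and are not part of any SOLiD or Solexa tool.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)

# The two scores quoted in the post above.
for q in (15, 30):
    print(f"Q{q} -> {phred_to_error(q):.2%} error per call")
```

On this scale Q15 is roughly a 3.2% error rate per call and Q30 is 0.1%.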

SOLiD tries to align all of the data and can be very variable, but typically about 20-30% aligns to human. I'm not sure if the standard SOLiD aligner can do whole human alignments in one go; if it can't, you may want to look at MAQ from Sanger.

You may find more at www.genographia.org

**pmiguel** · 08-26-2008, 08:04 AM

"as solid requires a colour-space aligner I find it hard to understand how 'any aligner' can be used with it to give a comparison with Solexa data"

As you say, to take full advantage of color-space sequence you would need an aligner that understood that it was using dual-base encoded data. But by "double encoding" CS sequence you can use it with any sequence alignment program.

That is the equivalent of tying one hand behind your back. It is also possible to convert the raw SOLiD CS reads into (real) sequence space. This would be the equivalent of tying both your hands behind your back. Any error in color space would then propagate downstream, so that subsequent correct color-space calls would be converted to incorrect sequence-space calls.

--
Phillip
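
The propagation problem described above can be seen in a few lines. This is a minimal sketch, assuming the usual SOLiD two-base code (A, C, G, T numbered 0 to 3, with each color being the XOR of two adjacent base codes) and a read string that starts with the primer base followed by the colors. The function names and the example sequence are invented for illustration.

```python
BASES = "ACGT"

def encode(seq, primer="T"):
    """Base space -> color space: primer base followed by one color per base."""
    codes = [BASES.index(b) for b in primer + seq]
    return primer + "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

def decode(cs):
    """Color space -> base space by chaining from the primer base."""
    current = BASES.index(cs[0])
    out = []
    for color in cs[1:]:
        current ^= int(color)  # each call depends on every earlier call
        out.append(BASES[current])
    return "".join(out)

def corrupt(cs, i, delta=1):
    """Return cs with the color at string index i (>= 1) miscalled."""
    return cs[:i] + str(int(cs[i]) ^ delta) + cs[i + 1:]

truth = "ATGGCATTCA"
cs = encode(truth)
decoded = decode(corrupt(cs, 5))  # one wrong color, the fifth call
mismatches = [i for i, (a, b) in enumerate(zip(truth, decoded)) if a != b]
print(truth)
print(decoded)
print(mismatches)
```

A single miscalled color leaves the bases before it intact and turns every base from that point to the end of the read into a mismatch, which is why direct decoding of raw reads throws away the error-correcting property of the encoding.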

**ECO** · 08-26-2008, 10:59 AM

Hey Phillip, This is a good discussion that I haven't had outside of my close colleagues...and I want to make sure I'm understanding it correctly. I agree it's a waste of time to convert colorspace directly to basespace using the decoding rules...any read becomes useless after a normally-correctable single color error.

By "double encoding" you are referring to the practice of converting colorspace reads 0,1,2,3 to A,C,G,T directly (what ABI calls pseudo-base space, or what I call "fakebase"), converting the reference to "fakebase" in the same way, and using them with existing tools. (Of course you have to ditch the adapter anchor base and the first colorspace call, as neither is in the genome.)

At first glance this appears to make reads/reference that can be read by any tool...but there is a problem with this approach, and it comes up when the program tries to work with the reverse complement of the reads. The reverse complement of colorspace is just the reverse of the sequence....NOT the reverse complement as in base space. Thus you cannot align to both strands if you just do a simple csfasta->fakebase conversion.

The above can be made to work by also putting in the reverse of the colorspace reads to your fakebase input file...unfortunately this doubles the number of reads you will deal with (potentially causing memory issues), and makes the output parsing a bit confusing, but the upside is that you can use any tool you want.

I have found MAQ to have the best support for SOLiD yet, as it's able to do the appropriate conversions with builtin functions, and properly deals with the reverses. Last but not least it can now output a nucleotide-corrected alignment...so you can immediately get back to basespace, but the color information has been used to generate the alignment.
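
The reverse versus reverse-complement point can be checked with a small sketch. It assumes the usual two-base code (A, C, G, T numbered 0 to 3, color = XOR of adjacent base codes) and works on plain genomic sequence, so there is no adapter base or first color to strip. The helper names and the example sequence are made up for illustration.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
FAKEBASE = str.maketrans("0123", "ACGT")

def to_colors(seq):
    """Color string for a base-space sequence (one color per adjacent pair)."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

def revcomp(seq):
    """Ordinary base-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]

def fakebase(colors):
    """Double encoding: write colors 0,1,2,3 as the letters A,C,G,T."""
    return colors.translate(FAKEBASE)

ref = "ACGGTCATTG"
fwd = to_colors(ref)
rev = to_colors(revcomp(ref))

# Colors of the opposite strand are the forward colors reversed.
print(fwd, rev, rev == fwd[::-1])

# A base-space tool would reverse-complement the fakebase read instead,
# which does not match the true opposite-strand encoding.
print(revcomp(fakebase(fwd)), fakebase(rev))
```

Complementing both bases of a pair leaves their color unchanged, so only the order flips. A tool that applies a base-space reverse complement to fakebase input therefore searches for the wrong string on the opposite strand, which is the reason for adding the reversed reads by hand.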

**pmiguel** · 08-26-2008, 11:49 AM

Hi Eco,
Yes, you understood me perfectly. And you are right, I completely missed the reverse/reverse-complement issue.

Bummer! I thought I would be able to use VMATCH trivially on color space data (without even converting to fakebase). That's not in the cards, is it?

--
Phillip

**jungle** · 09-09-2008, 01:46 AM

Originally posted by dgmacarthur

- can you comment on the error rate using the new chemistry?

From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

Cheers

**jungle** · 09-09-2008, 02:01 AM

Originally posted by cgb

Solid tries to align all of the data and can be very variable but typically about 20-30% align to human

Ok, 20-30% is really low, even for version 1 chemistry. If you are using a high quality, high molecular weight genomic DNA input, I would expect to see at least 35% mapped with version 1.
Our complex genome runs on the new chemistry show >50% mapped.

Originally posted by cgb

I'm not sure if the standard SOLiD aligner can do whole human alignments in one go - in which case you may want to look at MAQ from Sanger.

You align to each chromosome separately then aggregate the data into a single file. At this point, reads that are uniquely mapped to the genome are pulled out as input to the SNP and consensus calling pipeline. If you run this on a cluster, each mapping uses a single core. So you need at least 24 cores to map to all human chromosomes concurrently.

MAQ is very slow but can be used for mate-pair rescue. SHRiMP is another option if you want to use a different aligner.

**bioinfosm** · 09-10-2008, 06:56 AM

Originally posted by jungle

From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

Cheers

And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?

**jungle** · 09-10-2008, 07:19 AM

Originally posted by bioinfosm

And what do you use for SNP calling? Could you discuss the workflow after getting SOLiD's color-space reads to mapping and SNP calling, etc.?

I should point out that when I say mis-called SNP, I mean erroneous valid adjacents since I am talking about errors at the read level, not in the consensus.

Mapping to a large genome (e.g. human) is done on each chromosome separately. The data are then aggregated into a single file so reads that map uniquely to the genome can be identified. The unique hit file is then separated into individual chromosomes again (i.e. 24 separate unique match files). Consensus and SNP calling is done on these individually.
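
The aggregate-then-split step can be sketched as follows. This is not the AB pipeline: the in-memory dictionary of `(read_id, position)` tuples and the function name are assumptions made only to illustrate the logic of keeping reads that hit the genome exactly once.

```python
from collections import defaultdict

def unique_hits(per_chrom_hits):
    """Keep reads with exactly one hit genome-wide, grouped by chromosome.

    per_chrom_hits maps a chromosome name to a list of (read_id, position)
    tuples, one list per independent per-chromosome mapping job.
    """
    # Aggregate: collect every hit of every read across all chromosomes.
    by_read = defaultdict(list)
    for chrom, hits in per_chrom_hits.items():
        for read_id, pos in hits:
            by_read[read_id].append((chrom, pos))

    # Split again: only uniquely placed reads go back to their chromosome.
    unique = defaultdict(list)
    for read_id, locations in by_read.items():
        if len(locations) == 1:
            chrom, pos = locations[0]
            unique[chrom].append((read_id, pos))
    return dict(unique)

hits = {
    "chr1": [("r1", 100), ("r2", 500)],
    "chr2": [("r2", 42), ("r3", 7)],     # r2 also hits chr1, so it is dropped
    "chrX": [("r4", 9), ("r4", 900)],    # r4 hits chrX twice, so it is dropped
}
print(unique_hits(hits))
```

Uniqueness can only be decided after aggregation, because a read that looks unique within one chromosome may also hit another. For real data the aggregation would have to be done on disk or by sorting on read ID rather than in memory.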

What you end up with is a folder full of files for each chromosome. These include a consensus sequence in base space, a list of all variants, a list of "confirmed" SNPs, coverage depth at each position in the reference, genomic coordinates of regions that are covered at least once, and gff files for the alignment.

I use the AB pipelines (albeit adapted to my needs) as they work and are easy to manipulate (mostly Perl).

Hope this helps!

**cgb** · 09-11-2008, 12:07 AM

Originally posted by jungle

From my analysis, the error rate with version 1 chemistry was 0.1%. With the version 2 upgrade it has dropped to about 0.075%.

Note that this is the miscalled SNP rate per base, not the raw (systematic) error rate.

Cheers

Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples, the raw read error rate on SOLiD being ~8-15% (?) and ~1% on Solexa. I don't know the Solexa metric equivalent to the SOLiD 'error' (two-colour miscall) rate... anybody?

**jungle** · 09-11-2008, 02:07 AM

Originally posted by cgb

Thanks for that clarification, jungle. I am currently at a conference in Cambridge where this came up. Comparing the SNP error rate on SOLiD to the raw read error rate on Solexa clearly isn't apples to apples, the raw read error rate on SOLiD being ~8-15% (?) and ~1% on Solexa. I don't know the Solexa metric equivalent to the SOLiD 'error' (two-colour miscall) rate... anybody?

No worries cgb.

The SOLiD systematic error rate was 4-5% on the old chemistry and is now ~3%.

I think the closest SOLiD figure for comparison would be the erroneous valid adjacent error rate (~0.075%). I have never worked with Solexa data, so I have no first-hand impression of the error rate there. However, I found a recent publication on the Solexa 1G that says "We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads". That seems a bit higher than I would have expected...

Anyone have more up-to-date information?


Accuracy of SOLiD platform
