
**Chipper** · 02-19-2009, 01:31 PM

Originally posted by bioinfosm View Post

I am still curious as to how SOLiD and Solexa compare apples-to-apples. Both produce short reads, but still not much about how similar or complementary they are!

Met a few at AGBT and still could not find the answers..

It's not easy to compare since throughput changes so fast on both instruments - for example the latest Genome Biology RNA-seq paper used 38 lanes to get 138 M aligned reads, which is a number you can get from one SOLiD slide (1/2 run) today. What the current numbers are for the GA-II I do not know. What sort of apples are you interested in comparing?

**bioinfosm** · 02-19-2009, 02:09 PM

I am interested in the quality of data. Using, say, 6 million 35bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?

**westerman** · 02-20-2009, 08:36 AM

Originally posted by new300 View Post

How many raw and aligned reads per run do you get out of your Solid?

These numbers are from a project that I have been working on this week, since the data came off the sequencer Monday evening. This is one run. Mate-paired 25-base reads mapped to a non-human eukaryotic organism. One region/plate.

Raw reads: ~142M

Mapped R3 reads: ~114M for unique & random at 3 mismatches
Mapped F3 reads: ~118M (ditto)

Mapped R3 reads: ~77M for uniquely placed reads at 3 mismatches
Mapped F3 reads: ~75M (ditto)

Paired F3-R3 reads: ~78M

So Approximately 3900 Mbases. (78M times 50 bases).
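The yield arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the post; the variable and function names are made up for illustration, and the 50 bases per pair assumes both 25-base tags (F3 and R3) of each pair are counted.

```python
# Figures quoted in the post (approximate).
PAIRED_READS = 78e6        # paired F3-R3 reads
TAG_LENGTH = 25            # bases per tag (mate-paired 25-base run)
BASES_PER_PAIR = 2 * TAG_LENGTH  # assumption: F3 + R3 both counted


def paired_yield_mbases(pairs: float, bases_per_pair: int) -> float:
    """Return the mapped, paired yield in megabases."""
    return pairs * bases_per_pair / 1e6


yield_mb = paired_yield_mbases(PAIRED_READS, BASES_PER_PAIR)
print(f"Paired yield: {yield_mb:.0f} Mbases")
```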

SNP analysis is currently in progress on the paired reads. From my work with the mapped but not-paired reads we should obtain quite a few SNPs.

**westerman** · 02-20-2009, 09:43 AM

Originally posted by bioinfosm View Post

I am interested in the quality of data. Using, say, 6 million 35bp reads on the same sample, which instrument should one prefer, say for SNP calling? From a C. elegans comparison paper, it looks like SOLiD has a slight advantage in calling rare SNPs. Does its 2-base encoding really give more accurate results?

In theory color-space should give more accurate results for SNP calling. The concept is that it takes two adjacent color-space mismatches to indicate a SNP. If you see a single color-space mismatch then you can flag that read as a sequencer error. Compare this to traditional base-space where, when you see a single mismatch, you have no idea whether it arises from a sequencer error or a SNP. Depth of coverage can help resolve the problem, but there are limits to that, especially for rare SNPs.
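To make the "two adjacent mismatches" idea concrete, here is a minimal Python sketch of 2-base encoding. It uses the usual scheme where each color is the XOR of the 2-bit codes of two neighbouring bases. The sequences and the `classify` rule are invented for illustration; a real caller would also check that the two changed colors are consistent with a valid base change, which this sketch skips.

```python
# 2-bit base codes; a color is the XOR of the codes of two adjacent bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq: str) -> list[int]:
    """Encode a base-space sequence as its color-space transitions."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def mismatch_positions(ref_colors: list[int], read_colors: list[int]) -> list[int]:
    """Return the color positions where the read disagrees with the reference."""
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]


def classify(positions: list[int]) -> str:
    """Simplified rule: exactly two adjacent color mismatches look like a SNP."""
    if not positions:
        return "match"
    if len(positions) == 2 and positions[1] - positions[0] == 1:
        return "candidate SNP"
    return "likely sequencing error"


ref = "ATGGCATTC"
snp_read = "ATGGTATTC"  # hypothetical read with a C -> T change at base index 4

ref_colors = to_colors(ref)
snp_colors = to_colors(snp_read)

# Simulate a single miscalled color (a sequencer error in color space).
error_colors = list(ref_colors)
error_colors[3] = (error_colors[3] + 1) % 4

for name, colors in [("SNP read", snp_colors), ("error read", error_colors)]:
    positions = mismatch_positions(ref_colors, colors)
    print(name, positions, classify(positions))
```

A single base change flips the two colors that overlap it, while a single miscalled color stands alone, which is the distinction the post relies on.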

In practice the rate of sequencer error could play a major role. Obviously if there is too much sequencer error then too much data will be thrown away and nothing will be found. The SOLiD's error rate may be higher than the Solexa's. I do not have firm numbers on this, however.

Let's do a couple of thought experiments. Say that there is a common SNP that occurs in 50% of the population. Furthermore, say that the SOLiD has a 0.5% error rate per base while the Solexa has 1/5 of that, 0.1% per base [note that I am just making up those numbers -- the actual rates are probably much different]. If we pool 100 individuals together in a run of 25-mers and look at 100 reads covering the SNP position, then -- very roughly, since I am doing simple probability here --

The SOLiD run will -- for sequencer errors -- generate 12 - 13 reads with a single mismatch and 0 - 1 reads with adjacent mismatches.

Co-mingled with the above will be 50 reads with 2 adjacent mismatches that represent the SNP.

So overall there will be about:

44 reads without mismatches -- the non-SNPs
44 reads with adjacent mismatches -- the SNPs plus *maybe* 1 error read
12 reads with non-adjacent mismatch(es) -- errors for both non-SNPs and SNPs

When we look at the data we would toss out the non-adjacent mismatch reads as errors. We would then pick up 44 adjacent-mismatch reads representing the same SNP and maybe 1 read representing a different (and erroneous) SNP.

For the Solexa there would be:
52 reads with mismatch(es) -- 50 real SNPs and 2 or maybe 3 reads with errors.
48 reads without mismatches.

Once again it is easy to pick up the true SNP, since 50 of the reads all have a mismatch in the same location; the 2 or 3 reads with mismatches elsewhere are simply errors and can be tossed.

Now ... for the rare variant that occurs in 2% of the population.

The SOLiD has
84 reads with no mismatches
12 reads with non-adjacent mismatch(es)
2 reads with adjacent mismatches, plus *maybe* 1 error read with adjacent mismatches

Those two adjacent-mismatch reads are the real SNP. The errors are simply tossed.

The Solexa has
96 reads with no mismatches
4 (maybe 5) reads with mismatches.

2 of the mismatch reads are the real SNP while 2 or 3 are errors.

In neither case does the platform pick up the real SNP unambiguously -- it is hard to do when sequencers generate errors -- but the SOLiD (and color space) does, in theory, work better with rare variants. It works even better if we assume that the SOLiD's error rate is the same as the Solexa's.
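The back-of-the-envelope counts above can be sketched in Python. This is a rough model under the same made-up error rates as the post, not a real SNP caller: the function names are hypothetical, the fraction of reads with an error is approximated as read length times per-base error rate, color-space error reads are assumed to be discarded entirely, and the rare case of two adjacent color errors is ignored. The output lands in the same ballpark as the hand-estimated figures rather than matching them exactly.

```python
def read_error_fraction(read_len: int, per_base_error: float) -> float:
    """Approximate fraction of reads carrying at least one sequencer error."""
    return read_len * per_base_error


def color_space_counts(n_reads: float, snp_fraction: float, err_frac: float) -> dict:
    """Expected read counts when single color mismatches are discarded as errors."""
    snp_reads = n_reads * snp_fraction
    return {
        "clean": (n_reads - snp_reads) * (1 - err_frac),
        "snp_signal": snp_reads * (1 - err_frac),
        "discarded_as_error": n_reads * err_frac,
    }


def base_space_counts(n_reads: float, snp_fraction: float, err_frac: float) -> dict:
    """Expected read counts when any mismatch might be a SNP or an error."""
    snp_reads = n_reads * snp_fraction
    return {
        "clean": (n_reads - snp_reads) * (1 - err_frac),
        "snp_signal": snp_reads,
        "false_mismatch": (n_reads - snp_reads) * err_frac,
    }


N_READS, READ_LEN = 100, 25
solid_err = read_error_fraction(READ_LEN, 0.005)   # made-up SOLiD rate from the post
solexa_err = read_error_fraction(READ_LEN, 0.001)  # made-up Solexa rate from the post

for snp_fraction in (0.50, 0.02):
    print(f"SNP frequency {snp_fraction:.0%}")
    print("  SOLiD :", color_space_counts(N_READS, snp_fraction, solid_err))
    print("  Solexa:", base_space_counts(N_READS, snp_fraction, solexa_err))
```

For the 2% variant the base-space model ends up with about as many false mismatch reads as real SNP reads, while the color-space model keeps most of its SNP reads with the error reads already filtered out, which is the point of the thought experiment.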

Next up: color space and indels. Once my head stops hurting.

**new300** · 03-02-2009, 02:58 PM

Originally posted by westerman View Post

So Approximately 3900 Mbases. (78M times 50 bases).

So, I can't really see the throughput advantage of the Solid there. GA1 runs I've seen are around 4Gb. If you look at the short read archive GA2 runs are around 7Gb+ with 35bp reads. For PhiX around 95% of Illumina reads align within 2 errors. For human I think you tend to see about 80%. Those are 35bp reads I believe. There are 50bp reads in the SRA which appear to go up to 14Gb.

**bioinfosm** · 03-03-2009, 08:35 AM

Thanks Westerman, those are useful thoughts. I believe the same: the SOLiD may perform better for rare variants even with the same instrument error rate as Illumina.
