**Chipper** · 03-05-2009, 02:07 PM

Looks good, is this dataset (100M CS) available for download for comparison with other aligners?

**lh3** · 03-05-2009, 03:00 PM

Alternatively, you may align some publicly available data set and give your results, such as CPU time, memory, #aligned reads, #proper pairs and so on. I think the data here might be good (human male; 1000genomes data done by Illumina):

ftp://ftp.era.ebi.ac.uk/vol1/fastq/ERR000/ERR000589

Most aligners have to make a tradeoff between speed, memory and accuracy, especially for paired-end alignment. It would be good to show accuracy as well. This is particularly important for people who are interested in structural variations.

**BioWizard** · 03-05-2009, 03:02 PM

How to get SOLiD data for alignment

As for the ABI data, it's all downloadable from their web site, although it takes forever, and it's a pain to find the link. If you can't find the link, I'll search for it. As for the finite bandwidth of their web site... nothing I can do about that; ours is probably worse.

As for the Illumina data, we got it from one of our customers, and although we didn't sign an NDA, I would consider it unethical to share it without their permission.

**kmay** · 03-06-2009, 05:55 AM

Some questions...

Hi BioWizard!

Impressive number, indeed!

To better understand your ISAS, I have some questions about the background of the alignment.

1) 2 point mutations: do you search exhaustively for all combinations of pms in the 25mer? Are all found alignments true positives?

2) Do you mask the genome? How do you treat multiple matches? Do you keep them? What are you doing to repeats?

3) Any estimation of false negatives?

4) How do you treat InDels ? What effect has it on timings?

5) Any restrictions on read-length? If so, min/max?

6) How does it perform in sequence space? Do you consider quality files?

Cheers

Klaus

**BioWizard** · 03-06-2009, 03:12 PM

Thanks for the link, lh3,

I was worried that it would take forever to download, but actually those files are quite small, only 12M 50mers, and they downloaded rather quickly. I ran each file separately, as well as both as pairs (which, indeed, they turn out to be). I used the settings: 2 substitutions, max. 10 repeats. After running, I can see that the data is rather good quality, too. I will paste the "histograms" below. On the obsolete 2GHz server that R&D gets to use (while our customers get systems twice as fast as ours...) it took about 8 minutes for single files, and about 13 minutes for both together as pairs. Because I didn't know the min. or max. length between pairs, I used 1 base as the min. and a ridiculously large max. of 1 Mbase. I'll look at the output file to see the realistic lengths.

file=ERR000589_1.fastq
Aligned 12139786 sequences (415.8 sec.)
Wrote 12139786 aligned sequences (82.0 sec.)

Total of 12139786 sequences done in a total of 8 minutes and 18 seconds.
*** NOTE: 19490 sequences were skipped (no. of matches set to 0) because they contained invalid characters.

Hits Histogram
==== =========
0 990159
1 9122035
2 346049
3 154639
4 106323
5 86753
6 75089
7 57069
8 42622
9 33468
10+ 1125580

file=ERR000589_2.fastq
Aligned 12139786 sequences (428.8 sec.)
Wrote 12139786 aligned sequences (83.1 sec.)

Total of 12139786 sequences done in a total of 8 minutes and 32 seconds.
*** NOTE: 15844 sequences were skipped (no. of matches set to 0) because they contained invalid characters.

Hits Histogram
==== =========
0 1296041
1 8883412
2 335689
3 150253
4 104689
5 84127
6 72865
7 55341
8 40498
9 32519
10+ 1084352

files=/home/Hadar/ISAS/IlluminaData/ERR000589_1.fastq,/home/Hadar/ISAS/IlluminaData/ERR000589_2.fastq,1,1000000

Aligned 12139786 sequence pairs (623.7 sec.)
Wrote 12139786 aligned sequence pairs (155.0 sec.)

Total of 12139786 sequence pairs done in a total of 12 minutes and 58 seconds.
*** NOTE: 35334 sequences were skipped (no. of matches set to 0) because they contained invalid characters.

Hits Histogram
==== =========
0 2043749
1 9350603
2 275721
3 131604
4 88321
5 65691
6 48955
7 28766
8 20653
9 16101
10+ 69622
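As a reading aid for the histograms above, here is a minimal sketch that turns one of them into fractions of unaligned, uniquely placed, and multiply placed reads. The numbers are copied from the ERR000589_1.fastq run; the function names `parse_histogram` and `summarize` are illustrative assumptions and are not part of ISAS.

```python
# Hits histogram copied from the ERR000589_1.fastq run above.
HISTOGRAM_TEXT = """\
Hits Histogram
==== =========
0 990159
1 9122035
2 346049
3 154639
4 106323
5 86753
6 75089
7 57069
8 42622
9 33468
10+ 1125580
"""


def parse_histogram(text):
    """Return {hits_label: read_count} from 'label count' lines."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the title, the ruler line, and anything else that is
        # not a 'label count' pair.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        counts[parts[0]] = int(parts[1])
    return counts


def summarize(counts):
    """Split the reads into unaligned (0 hits), unique (1 hit), and the rest."""
    total = sum(counts.values())
    unaligned = counts.get("0", 0)
    unique = counts.get("1", 0)
    repeats = total - unaligned - unique
    return {
        "total": total,
        "unaligned_frac": unaligned / total,
        "unique_frac": unique / total,
        "repeat_frac": repeats / total,
    }


if __name__ == "__main__":
    summary = summarize(parse_histogram(HISTOGRAM_TEXT))
    print(f"total reads: {summary['total']}")
    for key in ("unaligned_frac", "unique_frac", "repeat_frac"):
        print(f"{key}: {summary[key]:.1%}")
```

The histogram rows sum to the 12139786 sequences reported for that file, so the three fractions add up to one.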

I will try to get the paired run result file posted at:

http://www.imagenix.com/publicdata

But I will have to remove it by Monday... so please, someone who has the bandwidth for this: copy it and post it somewhere for everyone. On Monday I will delete it before I get complaints.

Great weekend to all !

**BioWizard** · 03-06-2009, 03:41 PM

Hi Klaus,

We search for any "mutations" which have up to the maximum specified number of mismatches. In the case of the public data which I just ran, the spec was "maximum 2 substitutions". It doesn't matter in how many places they fall, so a max mismatch of 3 can be ....x....x....x... or ...xx....x... or ...xxx...., and, of course, all lesser mismatches: two (...x...x... or ....xx....), one (....x......), or zero (......) when the sample was identical to the reference AND the sequencer did not make any errors.

The search is lossless, in the sense that there are no compromises or shortcuts - if anywhere in the reference there are N (in this example 50) bases with either 0, 1, or 2 substitutions from the searched sequence, then it will be found. The only exception: if too many hits were already found, the search is abandoned. In this example, we set the limit to 10. So if a sequence is terribly repetitive, after 10 independent locations it will not be searched for anymore.
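To make the search semantics concrete, here is a deliberately naive sketch of what "lossless with up to m substitutions and a hit limit" means. It is a brute-force scan written only to show the contract, not how ISAS works internally (which the post does not describe), and it looks at the forward strand only. The names `mismatches_within` and `find_hits` are hypothetical.

```python
def mismatches_within(read, window, max_subs):
    """Return the substitution count if it is <= max_subs, else None."""
    mismatches = 0
    for a, b in zip(read, window):
        if a != b:
            mismatches += 1
            if mismatches > max_subs:
                return None  # too many substitutions, reject this position
    return mismatches


def find_hits(reference, read, max_subs=2, max_hits=10):
    """Report every position where `read` matches with <= max_subs substitutions.

    Every window of the reference is checked, so nothing within the
    substitution budget is missed. The only early exit is the hit limit:
    once `max_hits` locations have been found, the search is abandoned.
    """
    hits = []
    n = len(read)
    for pos in range(len(reference) - n + 1):
        subs = mismatches_within(read, reference[pos:pos + n], max_subs)
        if subs is not None:
            hits.append((pos, subs))
            if len(hits) >= max_hits:
                break  # terribly repetitive sequence: stop searching
    return hits


if __name__ == "__main__":
    reference = "AAAACGTAAAACCTAAAA"
    print(find_hits(reference, "ACGT", max_subs=1))
```

A scan like this costs time proportional to reference length times read length for every read, which is why practical aligners use an index; the sketch only pins down which hits a lossless search must return.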

We do NOT mask the reference, as we consider this a kind of "cheating". If the user WANTS to see 100 repeats, he has the ability to do so. We report all the repeats, up to the specified limit (this is why the output file is sooooo big). This brings an idea to my mind... if I find that I am unable to upload the results file, I'll re-run with a smaller limit (2 or 3?) and get a much smaller file and upload that one. So far, while I'm typing this... about 60MB (out of 1300MB) have been uploaded.

As for "false negatives": from the mathematical point of view, if you accept the assumption of "no more than m mismatches", then there are no false negatives. From the practical point of view (whatever nature can do to the sample's DNA, plus whatever disasters the sequencer can add due to its thermal/mechanical/electrical problems), no one can ever know the worst-case "false negatives". One can easily run simulations based on one's envelope of expectations. ABI has done such simulations (maybe they know the weaknesses of their machine better than others?) and were very happy - although in their case, we added the VA (valid adjacent) function to save the color code from missing real SNPs. If you're an Illumina customer - be happy that you don't have to worry about this problem. If you're a SOLiD customer - once you understand this problem, you'll always run ISAS with VA mode turned on. There's a 5-page technical explanation of what I am talking about, so for Illumina customers - forget this.

Indel is currently not enabled. We had it enabled originally, but ABI wanted it off, which surprised me at the time, but since then we've seen really good results w/o indel so we left it off. We can add it if customers demand it. I think it could slow things down by about two to three times.

Current version (3.2) read-length range:

| | min | max |
|---|---|---|
| colorspace | 25 | 60 |
| basespace | 20 | 93 |

(We have one customer who is demanding 110 for basespace, so this will go up in the next version.)

We don't use the quality values provided by Illumina. This can be done in the future, but first we have to see concrete evidence that it REALLY helps. I've looked at a lot of claims of how great it is, but I didn't see that it really helped. We are relying on our partner for synthetic "gold standard" tests, as this is the only evidence I will trust. Some people do all kinds of "fancy" things and then say "I got more unique mapped" or "I got fewer repetitions", but in reality they incorrectly mapped a repeat as a unique because of disqualifying a match which was below their quality threshold. Arbitrarily deciding what is the "magic" threshold for cutting off reads is a tricky business and, I fear, not scientifically done.

Performance is faster (especially for longer reads) in basespace or "sequence space" (let's just call it "Illumina" !). In general, alignment is easier for Illumina data. ABI argues (I'm not taking sides here - I really don't know) that you save money by needing less consumables, and more computation when you do colorspace (less consumables - they say) and alignment with VA (more computation - I agree).

OK - I hope I've answered all your questions

I'm too exhausted to continue.... 179MBytes have been uploaded (out of 1300), I'll come back in an hour to check....

**BioWizard** · 03-06-2009, 10:14 PM

OK, the file (1300MB) has been uploaded.
Anyone with big space/bandwidth that can copy it from

https://www.imagenix.com/publicdata

and put on your site, tell me so I can remove it.

**BioWizard** · 03-09-2009, 04:08 PM

There have been many downloads of that 1.3GB file in the last 3 days, but so far as I know... no one has volunteered to host this file for the community - where's big government when you need them

I think by this time tomorrow I will have to delete the file.

Meanwhile I want to clarify something that several people have been asking recently:

The native color space version of ISAS also has a "Valid Adjacent" mode. Maybe it's the only alignment system that even implements Valid Adjacent rules, so you can catch 1 SNP PLUS 1 or 2 machine errors in the same SOLiD sequence. Does anyone know of any other alignment system that implements the VA rules - and allows 4 substitutions instead of 2 for 25mers, so that VA can catch 1 SNP plus 2 machine errors? We'd like to know so we can acknowledge that there is another system. We allow 4 subs so you can even catch 2 SNPs in the same sequence (and color code VA rules make sure they really are SNPs).
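For readers unfamiliar with why a SNP costs two color mismatches, here is a small sketch of the general di-base property behind "valid adjacent" checks. It assumes the usual SOLiD encoding, in which a color is the XOR of two 2-bit base codes (A=0, C=1, G=2, T=3). Changing one base flips the two colors that overlap it by the same amount, so their XOR is unchanged. This is not ISAS code; the function names are hypothetical, the primer base of real reads is ignored, and only isolated SNPs are handled.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Encode a base sequence as di-base colors (XOR of neighbouring codes)."""
    codes = [BASE_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Split color mismatches into valid-adjacent pairs and isolated ones.

    Returns (valid_pairs, isolated): `valid_pairs` holds the index of the
    first color of each adjacent pair consistent with a single-base change,
    `isolated` holds indices that look like machine errors.
    """
    mismatched = {
        i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q
    }
    valid_pairs, isolated = [], []
    i = 0
    while i < len(ref_colors):
        if i not in mismatched:
            i += 1
            continue
        nxt = i + 1
        # A single-base change preserves the XOR of the two colors it touches.
        if nxt in mismatched and (
            ref_colors[i] ^ ref_colors[nxt] == read_colors[i] ^ read_colors[nxt]
        ):
            valid_pairs.append(i)
            i += 2
        else:
            isolated.append(i)
            i += 1
    return valid_pairs, isolated


if __name__ == "__main__":
    ref = to_colors("ACGTAC")
    snp = to_colors("ACTTAC")  # G -> T at the third base
    print(ref, snp, classify_mismatches(ref, snp))
```

Under this view a budget of 4 color substitutions in a 25mer leaves room for one SNP (2 colors) plus 2 machine errors, or for two SNPs, which matches the counts given in the post.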

**ECO** · 03-09-2009, 05:37 PM

Subject edited for neutrality.

**lh3** · 03-12-2009, 11:57 AM

Thanks for posting the data, BioWizard. ISAS is really impressive, especially for its high error tolerance. Few algorithms remain fast while guaranteeing to find 3 or more mismatches.

Here are some stats I get from the file you uploaded:

# reads: 24279572
# mapped reads: 21947836
# reads mapped in proper pairs (external dist.<=300bp): 18995200
# unique mappings: 19326957
# unique mappings that exist in proper pairs: 18116368
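The counts above are easier to compare as fractions of all reads. A minimal sketch, using only the numbers from this post (the dictionary keys and the helper `as_fractions` are made up for illustration):

```python
# Counts as reported in the post above.
STATS = {
    "reads": 24279572,
    "mapped": 21947836,
    "proper_pairs": 18995200,
    "unique": 19326957,
    "unique_proper": 18116368,
}


def as_fractions(stats):
    """Express every count as a fraction of the total number of reads."""
    total = stats["reads"]
    return {
        name: count / total for name, count in stats.items() if name != "reads"
    }


if __name__ == "__main__":
    for name, frac in as_fractions(STATS).items():
        print(f"{name:>14}: {frac:6.1%}")
```

The read total is twice the 12139786 pairs from the paired run, so each fraction is per read, not per pair.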

BTW, is the time you were quoting the CPU time on a single core or across the 8 cores?

**BioWizard** · 03-12-2009, 05:36 PM

The time was "real time" (some people call it "wall clock time"), and it was on our old 2.0GHz dual socket quad core machine, in other words 8 cores.

It's about 80 to 85 percent of that time for a 2.8GHz dual quad Penryn, and it is MUCH faster on the new Imagenix Genome Cruncher machine (16 threads in one small box... I am drooling all over myself) that we're constructing right now for the NextGen Sequencing show.

It sounds like it is hard to believe for many people, so we encourage everyone to bring fastq or cfasta files to see for themselves. Please gzip before putting on a DVD or CD. The DVD/CD reader is so slow that it takes more time to copy the file to hard disk than to do alignment.

Anyway - it is I who should thank you, lh3: first you were kind enough to post a public data source for us all, then you analyzed the file, which I know is time consuming, and finally, your encouraging words.

If you have more data you would like us to run, as a courtesy, it would be my pleasure to run for you. Just in the next few days I am overloaded, so let's say after the S.D. show is over (end of next week). You can mail us CDs/DVDs and it would be my pleasure to run. Especially when they let me get my hands on the new machine.

**BioWizard** · 03-23-2009, 10:22 AM

Thanks to all the people that visited our booth in the San Diego Next Gen Sequencing Conference.

I also want to thank Hadar and Ryan who performed alignments in real time for the customers, day after day, with little chance to rest.

We were able to get the new "Genome Cruncher" computer shipped to the Hilton in San Diego, and demonstrated 100 million 25mers with 2 substitutions on full human reference in 15 minutes. I wish I could have been there, but someone had to stay behind.

For all those who had to wait in line, or couldn't make it at all, we invite you to come in for personal demos. We will soon be opening a demo center that will be open to the public - kind of like a "perpetual show". We hope those of you that couldn't make it to San Diego can make it to the next show in San Francisco. We are approx. 40 minutes from S.F., about 15 minutes from Applied Biosystems (Foster City), and 30 minutes from Illumina (Hayward).

**And37** · 03-25-2009, 02:05 AM

3 subs?

Hi BioWizard,

Your results are extreme, respect.

For most programs handling more substitutions seems to be more problematic, even when the matching sequences are limited to 10.

Can you give an estimate for the ISAS running for the 100M ABI data against the 3G human genome, but enabling 3 substitutions?

Thanks,
Andris

**snetmcom** · 03-27-2009, 11:35 AM

I'd be more interested if BioWizard wasn't so pretentious and condescending. People in this field work hard.

ISAS Alignment Software
