  • ISAS Alignment Software

    Hi,

    We're from a company called Imagenix Technologies.

    We just want to give everyone a "heads up". We'll be at the NextGen Sequencing conference in San Diego in 2 weeks, doing real-time live demos of the world's fastest alignment system: ISAS.

    It was not easy to state this confidently, because most alignment software vendors don't like to disclose how fast (or how slow) their alignment is. We are proud of our numbers and display them prominently:

    "100 million 25mers in colorspace with 2 substitutions on full human genome (3GB) reference in 30 minutes on ONE computer".

    And to be more specific: when Applied Biosystems ran our software (which they licensed for their own use) on their Dell, it took 36 minutes for 100 million 25mers straight from one of their SOLiD machines, with 2 substitutions, against the 3G human reference. When we run it on our computer (dual-socket quad-core Penryn), it takes 29 minutes. Next week we expect to build a new computer (the current one is over a year old, thus obsolete) and hope to get under 20 minutes for 100 million 25mers in colorspace. We cannot "brag" about 20 minutes yet... not until the machine is built and we can start running. Also, longer sequences take longer to align. For example, 88-base reads with 4 substitutions, basespace data from an Illumina machine: 56 million reads plus another 56 million "paired end" reads took 2 hours on our old computer.
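    To compare with other aligners, the timings quoted above can be converted into wall-clock reads per second. This is only a back-of-the-envelope sketch using the numbers in this post; the function name and run labels are made up for illustration.

```python
def reads_per_second(n_reads, minutes):
    """Wall-clock throughput in reads per second."""
    return n_reads / (minutes * 60.0)

# (number of reads, minutes) exactly as quoted in the post above
runs = {
    "Applied Biosystems Dell, 100M 25mers, colorspace": (100_000_000, 36),
    "dual quad-core Penryn, 100M 25mers, colorspace": (100_000_000, 29),
    "old machine, 2 x 56M 88mers, basespace": (112_000_000, 120),
}

for label, (n_reads, minutes) in runs.items():
    print(f"{label}: {reads_per_second(n_reads, minutes):,.0f} reads/sec")
```

    Note that these are reads per second of real time on a whole machine, not per core.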

    So... if you're coming to San Diego, we'll be happy to see you at booth #40, next to the food stand.

    We encourage you to bring CDs or DVDs with cfasta or fastq files so we can show you on the spot what fast alignment is. Please bring human data, or if you want to do some other species, you'll have to bring the fa files so we can build a reference database for your species.

    Meanwhile... we encourage your feedback, for example telling everyone how slow your current alignment is.

    But seriously, folks... maybe we can have a productive discussion on this forum? For example, one university was so frustrated with how slow their alignment was that they wanted to spend lots of money to build some fancy custom hardware (FPGA, and all that painfully expensive and instantly obsolete stuff), because they had never heard of ISAS. So what some might think is selfish promotion on our behalf, others see as helpful info for the entire community.

    Cheers !

  • #2
    Looks good, is this dataset (100M CS) available for download for comparison with other aligners?



    • #3
      Alternatively, you may align some publicly available data set and give your results, such as CPU time, memory, #aligned reads, #proper pairs and so on. I think the data here might be good (human male; 1000genomes data done by Illumina):

      ftp://ftp.era.ebi.ac.uk/vol1/fastq/ERR000/ERR000589

      Most aligners have to make a tradeoff between speed, memory and accuracy, especially for paired-end alignment. It would be good to show accuracy as well. This is particularly important for people who are interested in structural variations.



      • #4
        How to get SOLiD data for alignment

        As for the ABI data, it's all downloadable from their web site, although it takes forever, and it's a pain to find the link. If you can't find the link, I'll search for it. As for the finite bandwidth of their web site... nothing I can do about that, ours is probably worse.

        As for the Illumina data, we got it from one of our customers, and although we didn't sign an NDA, I would consider it unethical to share this w/o their permission.



        • #5
          Some questions...

          Hi BioWizard!

          Impressive number, indeed!

          To better understand your ISAS I have some questions about the background of aligning.
          1) 2 point mutations: do you search exhaustively for all combinations of pms in the 25mer? Are all found alignments true positives?

          2) Do you mask the genome? How do you treat multiple matches? Do you keep them? What are you doing to repeats?

          3) Any estimation of false negatives?

          4) How do you treat InDels ? What effect has it on timings?

          5) Any restrictions on read-length? If so, min/max?

          6) How does it perform in sequence space? Do you consider quality files?

          Cheers

          Klaus
          Last edited by kmay; 03-06-2009, 06:26 AM.



          • #6
            Thanks for the link, lh3.

            I was worried that it would take forever to download, but actually those files are quite small, only 12M 50mers, and they downloaded rather quickly. I ran each file separately, as well as both together as pairs (which, indeed, they turn out to be). I used the settings: 2 substitutions, max. 10 repeats. After running, I can see that the data is rather good quality, too. I will paste the "histograms" below. On the obsolete 2GHz server that R&D gets to use (while our customers get systems twice as fast as ours...) it took about 8 minutes for each single file, and about 13 minutes for both together as pairs. Because I didn't know the min. or max. distance between pairs, I used 1 base as the min. and a ridiculously large max. of 1 Mbase. I'll look at the output file to see the realistic distances.

            file=ERR000589_1.fastq
            Aligned 12139786 sequences (415.8 sec.)
            Wrote 12139786 aligned sequences (82.0 sec.)

            Total of 12139786 sequences done in a total of 8 minutes and 18 seconds.
            *** NOTE: 19490 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


            Hits Histogram
            ==== =========
            0 990159
            1 9122035
            2 346049
            3 154639
            4 106323
            5 86753
            6 75089
            7 57069
            8 42622
            9 33468
            10+ 1125580


            file=ERR000589_2.fastq
            Aligned 12139786 sequences (428.8 sec.)
            Wrote 12139786 aligned sequences (83.1 sec.)

            Total of 12139786 sequences done in a total of 8 minutes and 32 seconds.
            *** NOTE: 15844 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


            Hits Histogram
            ==== =========
            0 1296041
            1 8883412
            2 335689
            3 150253
            4 104689
            5 84127
            6 72865
            7 55341
            8 40498
            9 32519
            10+ 1084352


            files=/home/Hadar/ISAS/IlluminaData/ERR000589_1.fastq,/home/Hadar/ISAS/IlluminaData/ERR000589_2.fastq,1,1000000

            Aligned 12139786 sequence pairs (623.7 sec.)
            Wrote 12139786 aligned sequence pairs (155.0 sec.)

            Total of 12139786 sequence pairs done in a total of 12 minutes and 58 seconds.
            *** NOTE: 35334 sequences were skipped (no. of matches set to 0) because they contained invalid characters.


            Hits Histogram
            ==== =========
            0 2043749
            1 9350603
            2 275721
            3 131604
            4 88321
            5 65691
            6 48955
            7 28766
            8 20653
            9 16101
            10+ 69622
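            The histograms above can be boiled down to unmapped / unique / repeat fractions. A minimal sketch, assuming the histogram is available as the two-column text shown above; the parser and its names are illustrative and not part of ISAS. The data is the first histogram (ERR000589_1.fastq).

```python
def parse_hits_histogram(text):
    """Parse lines such as '0 990159' or '10+ 1125580' into {label: count}.
    Header and separator lines are ignored."""
    hist = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            hist[parts[0]] = int(parts[1])
    return hist

def summarise(hist):
    """Return total count plus unmapped/unique/repeat percentages."""
    total = sum(hist.values())
    unmapped = hist.get("0", 0)
    unique = hist.get("1", 0)
    repeats = total - unmapped - unique  # 2 or more hits
    return {
        "total": total,
        "unmapped_pct": 100.0 * unmapped / total,
        "unique_pct": 100.0 * unique / total,
        "repeat_pct": 100.0 * repeats / total,
    }

ERR000589_1_HIST = """
Hits Histogram
==== =========
0 990159
1 9122035
2 346049
3 154639
4 106323
5 86753
6 75089
7 57069
8 42622
9 33468
10+ 1125580
"""

summary = summarise(parse_hits_histogram(ERR000589_1_HIST))
for key, value in summary.items():
    print(key, round(value, 2))
```

            The bins add up to the 12139786 sequences reported for that file, which is a handy sanity check when pasting histograms by hand.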



            I will try to get the paired run result file posted at:



            But I will have to remove it by Monday... so please, someone who has the bandwidth for this: copy it and post it somewhere for everyone. On Monday I will delete it before I get complaints.

            Great weekend to all !



            • #7
              Hi Klaus,

              We search for any "mutations" with up to the maximum specified number of mismatches. For the public data I just ran, the spec was "maximum 2 substitutions". It doesn't matter where in the read they fall,
              so a max mismatch of 3 can be ....x....x....x... or ...xx....x... or ...xxx....
              and, of course, all lesser mismatch counts: two (...x...x... or ....xx....), one (....x......) or zero (......), the last being the case when the sample was identical to the reference AND the sequencer did not make any errors.
              The search is lossless, in the sense that there are no compromises or shortcuts: if anywhere in the reference there are N bases (50 in this example) with 0, 1 or 2 substitutions relative to the searched sequence, they will be found. The only exception: if too many hits have already been found, the search is abandoned. In this example we set the limit to 10, so if a sequence is terribly repetitive, the search for it stops after 10 independent locations.
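              To make these search semantics concrete, here is a deliberately naive sketch of "report every position with at most m substitutions, and stop once a hit limit is reached". It is a brute-force scan for illustration only and says nothing about how ISAS actually does it; the function name, parameters and toy reference are all made up.

```python
def find_hits(read, reference, max_subs=2, max_hits=10):
    """Return start positions where `read` matches `reference` with at most
    `max_subs` substitutions (no indels). The scan stops as soon as
    `max_hits` positions have been collected (the repeat limit)."""
    hits = []
    n = len(read)
    for start in range(len(reference) - n + 1):
        mismatches = 0
        for a, b in zip(read, reference[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_subs:
                    break  # too many substitutions at this position
        if mismatches <= max_subs:
            hits.append(start)
            if len(hits) >= max_hits:
                break  # read is too repetitive, abandon the search
    return hits

toy_reference = "ACGTACGTTTGACCAGTACGTACGA"
print(find_hits("ACGTACGA", toy_reference, max_subs=2))
print(find_hits("ACGTACGA", toy_reference, max_subs=0))
```

              A scan like this is exhaustive under the "no more than m substitutions" assumption, which is the sense of "lossless" used above, but it is far too slow for a real genome.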

              We do NOT mask the reference, as we consider this a kind of "cheating". If the user WANTS to see 100 repeats, he has the ability to do so. We report all the repeats, up to the specified limit (this is why the output file is sooooo big). This brings an idea to my mind... if I find that I am unable to upload the results file, I'll re-run with a smaller limit (2 or 3?), get a much smaller file and upload that one. So far, while I'm typing this... about 60MB (out of 1300MB) have been uploaded.

              As for "false negatives": from the mathematical point of view, if you accept the assumption of "no more than m mismatches", then there are no false negatives. From the practical point of view (whatever nature can do to the sample's DNA, plus whatever disasters the sequencer can add due to its thermal/mechanical/electrical problems), no one can ever know the worst case "false negatives". One can easily run simulations based on one's envelope of expectations. ABI has done such simulations (maybe they know the weaknesses of their machine better than others?) and were very happy, although in their case we added the VA (valid adjacent) function to save the color code from missing real SNPs. If you're an Illumina customer, be happy that you don't have to worry about this problem. If you're a SOLiD customer, once you understand this problem you'll always run ISAS with VA mode turned on. There's a 5 page technical explanation of what I am talking about, so for Illumina customers: forget this.


              Indel is currently not enabled. We had it enabled originally, but ABI wanted it off, which surprised me at the time. Since then we've seen really good results w/o indel, so we left it off. We can add it if customers demand it. I think it could slow alignment down by about two to three times.

              Current version (3.2) read length range:

                            min   max
              colorspace    25    60
              basespace     20    93

              (We have one customer who is demanding 110 in basespace, so this will go up in the next version.)

              We don't use the quality values provided by Illumina. This can be done in the future, but first we have to see concrete evidence that it REALLY helps. I've looked at a lot of claims of how great it is, but I didn't see that it really helped. We are relying on our partner for synthetic "gold standard" tests, as this is the only evidence I will trust. Some people do all kinds of "fancy" things and then say "I got more unique mapped" or "I got fewer repetitions", but in reality they incorrectly mapped a repeat as a unique because they disqualified a match which was below their quality threshold. Arbitrarily deciding what is the "magic" threshold for cutting off reads is a tricky business and, I fear, not scientifically done.

              Performance is faster (especially for longer reads) in basespace or "sequence space" (let's just call it "Illumina"!). In general, alignment is easier for Illumina data. ABI argues (I'm not taking sides here, I really don't know) that with colorspace you save money by needing fewer consumables, at the cost of more computation for alignment with VA (fewer consumables, they say; more computation, I agree).

              OK - I hope I've answered all your questions
              I'm too exhausted to continue.... 179MBytes have been uploaded (out of 1300), I'll come back in an hour to check....




              • #8
                OK, the file (1300MB) has been uploaded.
                Anyone with big space/bandwidth that can copy it from

                and put on your site, tell me so I can remove it.



                • #9
                  There have been many downloads of that 1.3GB file in the last 3 days, but as far as I know... no one has volunteered to host this file for the community. Where's big government when you need them?

                  I think by this time tomorrow I will have to delete the file.

                  Meanwhile I want to clarify something that several people have been asking recently:

                  The native colorspace version of ISAS also has a "Valid Adjacent" mode. Maybe it's the only alignment system that even implements Valid Adjacent rules, so you can catch 1 SNP PLUS 1 or 2 machine errors in the same SOLiD sequence. Does anyone know of any other alignment system that implements the VA rules and allows 4 substitutions instead of 2 for 25mers, so that VA can catch 1 SNP plus 2 machine errors? We'd like to know so we can acknowledge that there is another system. We allow 4 subs so you can even catch 2 SNPs in the same sequence (and the color code VA rules make sure they really are SNPs).
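                  For readers unfamiliar with why a SNP costs two color mismatches, here is a small sketch of the idea. It assumes the standard SOLiD two-base encoding, where a color is the XOR of the 2-bit codes of two neighbouring bases (A=0, C=1, G=2, T=3). Under that encoding a single-base change flips two adjacent colors but leaves their XOR unchanged, whereas an isolated color mismatch is more likely a machine error. This is a simplified illustration, not the ISAS VA implementation, and all names are made up.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(bases):
    """Encode a base string as SOLiD-style colors (one per adjacent base pair)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]

def classify_mismatches(ref_colors, read_colors):
    """Split color mismatches into valid-adjacent pairs (SNP candidates)
    and isolated mismatches (likely machine errors).
    Returns (snp_pair_starts, isolated_positions)."""
    snp_pairs, isolated = [], []
    n = len(ref_colors)
    i = 0
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        next_differs = i + 1 < n and ref_colors[i + 1] != read_colors[i + 1]
        if next_differs and (ref_colors[i] ^ ref_colors[i + 1]) == (
            read_colors[i] ^ read_colors[i + 1]
        ):
            snp_pairs.append(i)  # two adjacent changes consistent with one SNP
            i += 2
        else:
            isolated.append(i)   # single color change, not explainable by a SNP
            i += 1
    return snp_pairs, isolated

ref_colors = to_colors("ACGTACGTAC")
read_colors = to_colors("ACGTGCGTAC")   # one SNP: A -> G at base index 4
read_colors[7] = 0                      # plus one simulated machine error
print(classify_mismatches(ref_colors, read_colors))
```

                  In this toy case the read has 3 color mismatches in total: 2 from the SNP and 1 from the error. That is why a limit of only 2 substitutions would reject a read carrying a real SNP plus a single machine error.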



                  • #10
                    Subject edited for neutrality.



                    • #11
                      Thanks for posting the data, BioWizard. ISAS is really impressive, especially for its high error tolerance. Few algorithms remain fast while guaranteeing to find 3 or more mismatches.

                      Here are some stats I get from the file you uploaded:

                      # reads: 24279572
                      # mapped reads: 21947836
                      # reads mapped in proper pairs (external dist.<=300bp): 18995200
                      # unique mappings: 19326957
                      # unique mappings that exist in proper pairs: 18116368
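                      As fractions of all reads, these counts work out as in the sketch below. It uses only the numbers listed above; the dictionary keys are just labels.

```python
stats = {
    "reads": 24279572,
    "mapped reads": 21947836,
    "reads in proper pairs": 18995200,
    "unique mappings": 19326957,
    "unique mappings in proper pairs": 18116368,
}

def pct_of_reads(stats):
    """Express every count as a percentage of the total number of reads."""
    total = stats["reads"]
    return {name: 100.0 * count / total
            for name, count in stats.items() if name != "reads"}

for name, pct in pct_of_reads(stats).items():
    print(f"{name}: {pct:.1f}%")
```

                      The total of 24279572 reads is twice the 12139786 pairs reported in the paired run.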

                      BTW, is the time you were quoting the CPU time on a single core or across the 8 cores?



                      • #12
                        The time was "real time" (some people call it "wall clock time"), and it was on our old 2.0GHz dual-socket quad-core machine, in other words 8 cores.

                        It's about 80 to 85 percent of that time on a 2.8GHz dual quad-core Penryn, and it is MUCH faster on the new Imagenix Genome Cruncher machine (16 threads in one small box... I am drooling all over myself) that we're constructing right now for the NextGen Sequencing show.

                        It seems this is hard for many people to believe, so we encourage everyone to bring fastq or cfasta files and see for themselves. Please gzip the files before putting them on a DVD or CD. The DVD/CD reader is so slow that it takes more time to copy the file to hard disk than to do the alignment.

                        Anyway, it is I who should thank you, lh3: first you were kind enough to post a public data source for us all, then you analyzed the file, which I know is time consuming, and finally there are your encouraging words.

                        If you have more data you would like us to run, it would be my pleasure to run it for you as a courtesy. I am overloaded for the next few days, so let's say after the S.D. show is over (end of next week). You can mail us CDs/DVDs. Especially once they let me get my hands on the new machine.



                        • #13
                          Thanks to all the people who visited our booth at the San Diego Next Gen Sequencing Conference.

                          I also want to thank Hadar and Ryan, who performed alignments in real time for the customers, day after day, with little chance to rest.

                          We were able to get the new "Genome Cruncher" computer shipped to the Hilton in San Diego, and demonstrated 100 million 25mers with 2 substitutions on the full human reference in 15 minutes. I wish I could have been there, but someone had to stay behind.

                          For all those who had to wait in line, or couldn't make it at all, we invite you to come in for personal demos. We will soon be opening a demo center that will be open to the public, kind of like a "perpetual show". We hope those of you who couldn't make it to San Diego can make it to the next show in San Francisco. We are approx. 40 minutes from S.F., about 15 minutes from Applied Biosystems (Foster City) and 30 minutes from Illumina (Hayward).



                          • #14
                            3 subs?

                            Hi BioWizard,

                            Your results are extreme, respect. For most programs, handling more substitutions seems to be more problematic, even when the matching sequences are limited to 10.

                            Can you give an estimate of the ISAS running time for the 100M ABI data against the 3G human genome, but with 3 substitutions enabled?

                            Thanks,
                            Andris



                            • #15
                              I'd be more interested if BioWizard wasn't so pretentious and condescending. People in this field work hard.
