Software packages for next gen sequence analysis


  • #61
    Thank you for your interest. I will answer these questions as best I can.

    1. What are the longest and shortest reads it can handle effectively?

    Currently, ZOOM can handle reads from 15bp to 64bp. In fact, the core idea of ZOOM is quite easy to extend to longer reads; it is only the implementation that limits the length to 64bp. We will get to 454 data later, once the version for Illumina/Solexa and ABI SOLiD is stable.

    2. how does it compare to Eland or MAQ in reads aligned per minute?

    Since ELAND is, as far as we know, the fastest software for Illumina/Solexa data, we compared speed against ELAND in our benchmark. Mapping reads of 15bp to 32bp at the same sensitivity, ZOOM took half the time of ELAND, and as little as one third for the shorter reads. Furthermore, ELAND can only deal with about 16 million reads at most, whereas ZOOM has no limit on the number of reads as long as they fit in your RAM. Both ELAND and ZOOM hash the reads and scan the reference sequence, so the more reads you process in one scan pass, the more time you save. Since the speed of ZOOM depends closely on the length of the reference sequence and the read length, it's hard to give a single number of reads aligned per minute. To give you an impression, here is some data from our benchmark. At full sensitivity for two mismatches:
    It aligns 3.4 million 36bp BAC reads to the 162k region (where the BAC comes from) in 37 seconds with 1.1G RAM.
    It aligns 24 million 36bp reads (5X coverage of human chromosome 6) to chromosome 6 in 17 minutes 17 seconds with 6.5G RAM.
    It aligns 22 million 17bp ChIP-seq reads to the whole human genome in 4 hours and 22 minutes with 4.2G RAM.
    For ABI/SOLiD data, the speed is slower than for Illumina/Solexa data. ZOOM aligns 28 million 25bp reads to the E. coli genome (4M) with automatic sequencing error correction in 5 minutes.
    We tried to compare speed and sensitivity with MAQ as well, since it's well known. However, I was totally puzzled by its input and output formats, so lazy me gave up, since its website says it's slower than ELAND.
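
For readers who still want a reads-per-minute figure, the benchmark numbers quoted above can be converted directly. This is only arithmetic on the figures in the post; the labels and the helper name `reads_per_minute` are made up for illustration, and the results are not comparable across rows because the reference sizes and read lengths differ.

```python
# Benchmark figures as quoted in the post: (label, number of reads, elapsed seconds).
BENCHMARKS = [
    ("36bp BAC reads vs 162k region", 3_400_000, 37),
    ("36bp reads vs human chromosome 6", 24_000_000, 17 * 60 + 17),
    ("17bp ChIP-seq reads vs whole human genome", 22_000_000, 4 * 3600 + 22 * 60),
    ("25bp SOLiD reads vs E. coli", 28_000_000, 5 * 60),
]


def reads_per_minute(reads: int, seconds: int) -> float:
    """Convert a (reads, elapsed seconds) pair into reads aligned per minute."""
    return reads * 60.0 / seconds


for label, reads, seconds in BENCHMARKS:
    rate = reads_per_minute(reads, seconds) / 1e6
    print(f"{label}: {rate:.2f} million reads/min")
```

The spread between the rows illustrates the point made in the post: throughput is dominated by the size of the reference being scanned, not by the number of reads alone.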

    3. How many mismatches does it handle?

    In principle, you can choose any number of mismatches you like, as long as it is less than the read length. ZOOM guarantees 100% sensitivity for a large range of <read length, mismatch number> cases.
    When the number of mismatches required is larger than the one in the <read length, mismatch number> case ZOOM uses, sensitivity will decrease slightly. For example, 50bp reads can be mapped with 100% sensitivity at 4 mismatches; if you require 5 mismatches, the sensitivity will decrease slightly. However, if you do need 100% sensitivity in these cases, feel free to contact us and we will accommodate you.
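
To make "100% sensitivity for a given <read length, mismatch number>" concrete, here is a minimal sketch of one textbook way to get that guarantee with a hash-the-reads, scan-the-reference design: the pigeonhole principle. If a read is cut into k+1 pieces and at most k mismatches are allowed, at least one piece must match exactly. The post does not describe how ZOOM builds its seeds, so this is not ZOOM's method; the function names are hypothetical, all reads are assumed to have the same length, and only the forward strand is searched.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def piece_bounds(read_len: int, pieces: int):
    """Split [0, read_len) into `pieces` contiguous chunks of near-equal size."""
    return [(i * read_len // pieces, (i + 1) * read_len // pieces)
            for i in range(pieces)]


def map_reads(reads: dict, reference: str, max_mismatches: int):
    """Return sorted (read_name, position, mismatches) hits on the forward strand."""
    read_len = len(next(iter(reads.values())))
    bounds = piece_bounds(read_len, max_mismatches + 1)

    # Hash every read once per piece, keyed by (piece offset, piece sequence).
    index = defaultdict(list)
    for name, seq in reads.items():
        for start, end in bounds:
            index[(start, seq[start:end])].append(name)

    # Single scan of the reference: look up each piece of each window,
    # then verify candidate reads with a full mismatch count.
    hits = set()
    for pos in range(len(reference) - read_len + 1):
        window = reference[pos:pos + read_len]
        for start, end in bounds:
            for name in index.get((start, window[start:end]), ()):
                mismatches = hamming(reads[name], window)
                if mismatches <= max_mismatches:
                    hits.add((name, pos, mismatches))
    return sorted(hits)
```

With this scheme, asking for more mismatches than the index was built for means some true alignments no longer share an exact piece with the reference, which is one way the slight loss of sensitivity described above can arise.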

    4. Does it have a gapped mode?
    Yes, ZOOM can handle insertions/deletions between reads and the reference sequence. For Illumina/Solexa data, one gap of any length you wish is allowed in addition to the required mismatches. However, ZOOM can't guarantee 100% sensitivity when finding alignments with a gap. I don't think anyone using a filtering strategy could.

    5. What format is required for the reference genome?

    The reference genome should be a fasta file or multiple fasta files.
    The Illumina/Solexa reads file can be in fasta, *_seq.txt or *_prb.txt format. The ABI SOLiD *.csfasta format is supported too.

    6. What format are the alignments reported in?

    For Illumina/Solexa data, this release of ZOOM reports alignments in the format “read_name reference_seq_name: position_of_mapped +/- mismatch_number”. If assembly is requested, ZOOM will output the assembly consensus, the coverage and the frequency of {A,C,T,G} at each position of the consensus.
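
As an illustration of what "consensus, coverage and frequency of {A,C,T,G} at each position" means, here is a minimal pileup sketch. It is not ZOOM's assembler: the input layout (0-based position plus read sequence, forward strand, ungapped), the function names and the simple majority-vote consensus are all assumptions made for the example.

```python
from collections import Counter


def pileup(reference_len: int, alignments):
    """Count bases per reference position.

    alignments: iterable of (position, read_sequence), 0-based, ungapped.
    Returns one Counter of base frequencies per reference position.
    """
    counts = [Counter() for _ in range(reference_len)]
    for pos, seq in alignments:
        for offset, base in enumerate(seq):
            if 0 <= pos + offset < reference_len:
                counts[pos + offset][base] += 1
    return counts


def coverage(counts):
    """Number of reads covering each position."""
    return [sum(c.values()) for c in counts]


def consensus(counts, min_coverage: int = 1) -> str:
    """Majority base per position; 'N' where coverage is below min_coverage."""
    out = []
    for c in counts:
        depth = sum(c.values())
        if depth > 0 and depth >= min_coverage:
            out.append(c.most_common(1)[0][0])
        else:
            out.append("N")
    return "".join(out)
```

A position where two bases each have a substantial share of the counts is a candidate heterozygous site, which is what a per-position frequency report makes visible.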
    For ABI/SOLiD data, besides the alignment information, ZOOM can output the reads decoded into base space, with polymorphisms in base space and sequencing errors in color space highlighted.
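
For readers unfamiliar with color space, here is a minimal sketch of decoding a SOLiD read into base space. It assumes the usual csfasta layout, a known primer base followed by color digits 0 to 3, and the standard two-base encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function name is hypothetical, and this naive decoder does no error correction, unlike what the post describes for ZOOM.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def decode_colorspace(csread: str) -> str:
    """Decode a color-space read such as 'T0123' into bases.

    The first character is the primer base and is not part of the output.
    Raises ValueError on characters other than 0-3 (e.g. '.' for a missing call).
    """
    current = CODE[csread[0]]
    decoded = []
    for ch in csread[1:]:
        color = int(ch)
        if not 0 <= color <= 3:
            raise ValueError(f"invalid color {ch!r}")
        current ^= color  # next base code = previous base code XOR color
        decoded.append(BASES[current])
    return "".join(decoded)


print(decode_colorspace("T0123"))
```

Because each base depends on all the colors before it, a single miscalled color changes every base after it in a naive decoding, whereas a true single-base polymorphism shows up as two adjacent color changes. That difference is why it makes sense to flag sequencing errors in color space and polymorphisms in base space.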
    In our next release, we will show the alignments in a GUI view that displays the multiple alignment of mapped reads on the reference sequence and the heterozygous sites.

    7. Can you comment on the cost/licenses it will be provided under?

    As for the cost of the full version of ZOOM, it's probably best to ask the sales person when the website is ready next week. I think a limited-function version for Illumina/Solexa data, free for academic use, will be provided too.

    8. Can you give us the link to the download when it's ready?

    Sure. I will post the latest news when it's ready.


    Originally posted by apfejes
    Thanks for the update, spirit.

    Maybe you could give us a little bit of information on Zoom, as well, since things may have changed since last time I heard anything about it.

    What are the longest and shortest reads it can handle effectively?
    how does it compare to Eland or MAQ in reads aligned per minute?
    How many mismatches does it handle?
    Does it have a gapped mode?
    What format is required for the reference genome?
    What format are the alignments reported in?
    Can you comment on the cost/licenses it will be provided under?
    Can you give us the link to the download when it's ready?

    I'm sure I'm missing other important information, but those are the first questions that occur to me.

    Thanks!



    • #62
      (ECO, if you're reading this, spirit's reply might be deserving of its own thread.)

      Hi Spirit,

      First of all, thanks for the long and complete answer. It sounds like a fantastic program - I'll definitely give it a try when it's available. In the meantime, I hope I'm not bothering you with too many questions. I'm very interested in giving Zoom a try, but don't want to evaluate software that won't meet my requirements.

      Just to touch on a few points (comments and questions):

      We're exclusively an Illumina/Solexa shop, at the moment, and we're starting to produce reads longer than 64bp. If it goes well, I don't think we'll be doing short read runs on the Illumina machines for much longer (Maybe just for chip-seq?) Anyhow, this will be important to Illumina users VERY soon.

      I've never come across Eland having a 16M sequence limit - and I'm not sure why it's important, since it's trivial to run it once for each lane, anyhow. I've heard this claim from someone else when discussing zoom, so I thought I'd mention it.

      As for the benchmarks, they sound pretty good, but aren't quite describing the "normal" situation we seem to come across most of the time. Would you be able to give the time for, say, 6 or 8 million 36 or 42bp reads aligned to the complete human genome? What hardware was that benchmark run on? Is the code threaded for multi-CPU computers and/or does it use MPI?

      Does Zoom take advantage of illumina probability/quality scores when doing the alignments?

      Can Zoom handle multiple alignments?

      and

      Also related to benchmarks, how much slower is the application in gapped mode vs. un-gapped mode?

      Finally, just out of curiosity, why are you putting a GUI on an aligner?

      Thanks!
      Anthony
      The more you know, the more you know you don't know. —Aristotle



      • #63
        Anthony,

        some Illumina real life benchmarks are posted here



        • #64
          Hi Kmay,

          I don't see anything at that link about the Zoom aligner, which is what I was asking for benchmarks against.

          Did I miss something?

          Anthony



          • #65
            Hi, Anthony,

            Sorry for the late reply. I was away for the weekend. You are THE Anthony Fejes! Your blog enlightened us a lot!!! Thank you! We like it.

            The benchmark results above were obtained on a single core of an AMD Opteron 275 CPU (2.2GHz) with 8G memory. The code is not multi-threaded yet. If you want to parallelize, you currently need to divide the data set and run ZOOM multiple times.
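
Dividing the data set could look something like the sketch below, which splits FASTA reads round-robin into a fixed number of chunks that can then be handed to separate runs. The helper names are hypothetical and nothing here is part of ZOOM; how each chunk is passed to the aligner is left out because the command line is not described in this thread.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks: int):
    """Distribute records over n_chunks lists so the chunks are of near-equal size."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def write_fasta(records, path: str) -> None:
    """Write (header, sequence) pairs to a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f"{header}\n{seq}\n")
```

Each chunk triggers its own scan of the reference, so splitting trades more total CPU time for less wall-clock time; it is worth keeping the number of chunks no larger than the number of cores available.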

            I came across the 16M sequence limit using ELAND 0.2.2.5. For ZOOM, speed depends heavily on the number of times the reference genome is scanned. For example, the time used for 20 million reads input in one go is much less than the total time for two runs of 10 million reads.

            We have only one data set of 36bp reads, which contains 3.4 million reads. So I simulated data sets by randomly picking 6 million and 8 million segments of the human genome with two mismatches. When mapping back to the human genome using a single core (2.8GHz) of an AMD Opteron(tm) Processor 2220, the results at 100% sensitivity are as follows (times are in hours:minutes:seconds):

            Reads        36bp                     42bp
            6 million    01:34:18 (2.08G RAM)     01:04:11 (1.90G RAM)
            8 million    01:53:21 (2.42G RAM)     01:18:11 (2.19G RAM)
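
The timings above are consistent with a large fixed cost per reference scan plus a smaller cost per read. As a rough illustration only, the sketch below fits `time = fixed + per_read * reads` through the two 36bp data points. The linear model is an assumption, two points cannot validate it, and the extrapolation to 20 million reads is a guess, not a measurement.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fit_fixed_plus_linear(point_a, point_b):
    """Fit time = fixed + per_read * reads through two (reads, seconds) points."""
    (reads_a, time_a), (reads_b, time_b) = point_a, point_b
    per_read = (time_b - time_a) / (reads_b - reads_a)
    fixed = time_a - per_read * reads_a
    return fixed, per_read


def predict(fixed: float, per_read: float, reads: int) -> float:
    """Predicted run time in seconds under the assumed linear model."""
    return fixed + per_read * reads


# 36bp timings from the table above.
fixed, per_read = fit_fixed_plus_linear(
    (6_000_000, hms_to_seconds("01:34:18")),
    (8_000_000, hms_to_seconds("01:53:21")),
)
one_run = predict(fixed, per_read, 20_000_000)
two_runs = 2 * predict(fixed, per_read, 10_000_000)
print(f"fixed cost ~{fixed:.0f}s, ~{per_read * 1e6:.0f}s per million reads")
print(f"20M in one run ~{one_run:.0f}s vs two runs of 10M ~{two_runs:.0f}s")
```

Under this model, two runs pay the fixed scan cost twice, which matches the qualitative claim that one run of 20 million reads takes less time than two runs of 10 million.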

            Would you like to see the timings when 3 or 4 mismatches are allowed? If so, I will post them later.

            In ZOOM, the user can choose to take advantage of Illumina quality scores. Currently, ZOOM uses a user-specified threshold to distinguish high quality bases from low quality bases. ZOOM will ignore mismatches at low quality bases (without sacrificing much program efficiency), since mismatches at low quality bases are likely due to sequencing errors. However, unlike in MAQ, quality scores are not considered when doing assembly. Maybe MAQ's way is better.
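
The thresholding idea can be sketched in a few lines: count a mismatch only when the read base at that position has a quality at or above the cutoff. This shows the scoring rule only, not how ZOOM integrates it into its search; the function name and the use of plain integer quality values are assumptions.

```python
def count_high_quality_mismatches(read: str, quals, ref_window: str,
                                  min_quality: int) -> int:
    """Count mismatches, ignoring positions whose base quality is below min_quality."""
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have the same length")
    return sum(
        1
        for base, quality, ref_base in zip(read, quals, ref_window)
        if base != ref_base and quality >= min_quality
    )


read = "ACGTAC"
ref_window = "ACCTAA"
quals = [30, 30, 5, 30, 30, 30]
# Position 2 mismatches but has quality 5, so it is ignored at a cutoff of 20.
print(count_high_quality_mismatches(read, quals, ref_window, min_quality=20))
```

A read that would exceed the mismatch limit on raw counts can therefore still be reported if enough of its mismatches fall on low quality bases.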

            Yes, ZOOM can report multiple alignments for each read. It can report either the unique or the top-N best mapping results for each read.
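
The difference between the two reporting modes can be sketched as follows. The hit representation, (position, mismatches), and the definition of "unique" used here, exactly one hit at the lowest mismatch count, are assumptions for illustration; the post does not say how ZOOM defines either.

```python
import heapq


def top_n_hits(hits, n: int):
    """Return the n hits with the fewest mismatches (ties broken by position).

    hits: list of (position, mismatches) tuples for a single read.
    """
    return heapq.nsmallest(n, hits, key=lambda hit: (hit[1], hit[0]))


def unique_hit(hits):
    """Return the best hit if it is the only one at the lowest mismatch count, else None."""
    if not hits:
        return None
    best = min(mismatches for _, mismatches in hits)
    winners = [hit for hit in hits if hit[1] == best]
    return winners[0] if len(winners) == 1 else None


hits = [(100, 2), (250, 0), (900, 1), (40, 1)]
print(top_n_hits(hits, 2))
print(unique_hit(hits))
```

Top-N reporting keeps repeats visible for downstream analysis, while unique reporting discards any read whose best placement is ambiguous.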

            In general, gapped mode is five times slower than un-gapped mode. We'll speed up the DP later, since we expect more gaps as reads get much longer.

            Well, as for the reasons for adding a GUI, hu~~, the first one that comes to mind is that all software from BioinformaticsSolutions has a GUI: PatternHunter, PEAKs... Not a good reason, right? Here are three more:
            1. It will make things easier for those who are not familiar with Linux or the command line.
            2. With a GUI, you can run ZOOM from your desktop computer to monitor progress or automatically submit/control multiple jobs on your multi-server cluster.
            3. Later, ZOOM will go beyond being an aligner. Post-processing of the mapping results will be integrated, such as SNP finding, small RNA finding, ChIP-seq, etc. A GUI may be more intuitive for that.

            I have talked with Zefeng Zhang, the main developer of ZOOM. It's not so difficult to extend the read length to 128bp~256bp in ZOOM. Since you will need support for longer reads very soon, we'll make it a top priority after this version is released next week.

            Hao Lin



            Last edited by spirit; 08-11-2008, 08:29 AM.



            • #66
              Hi spirit.

              I didn't know that my blog was THAT famous.... certainly I didn't think I had that many people reading it. I'm glad to know it's been useful to you, however. (=

              As for the benchmarks, thank you VERY much for posting them. I'm very excited to try your program when I return from my vacation at the end of August. In the meantime, I'll let other people at the GSC know to check for the demo. Unfortunately, not everyone here is Academic, so I'm not sure how that will work for your license.

              Just to try to keep this thread short:

              * 3-4 mismatch benchmarks: I don't think I'll need them yet, but I will give it a try on my own data when I have the application.

              * Probabilities: Sounds good. I'll think about this for a while. Your approach sounds relatively simple, but may be "good enough" for now. The only way to know is to try it out.

              * Multiple Alignments: EXCELLENT!

              * Gapped mode: I think 5x slower is not a bad price to pay, particularly if the unique matches are filtered out in a first pass. That's very encouraging!

              * GUI: Thanks for answering my question. The first couple of reasons seem pretty weak, but I would certainly believe the last one has some merit. I don't really see anyone at the GSC using a GUI for any of those reasons, because of the high throughput volume we use, but I'm sure there are plenty of other people who would appreciate it.

              Thanks again for your answers and for responding to my comments and suggestions so quickly! I'm looking forward to giving ZOOM a try.

              Anthony



              • #67
                Any package available for (NGS) SAGE tag mapping to RefSeq/genome etc? Thanks.



                • #68
                  There has been some discussion in this thread about ZOOM. I wanted to let everyone know that the demo is now available. Please send me an email at [email protected] to request a 30 day free demo.



                  • #69
                    -> xxqtony:

                    The Genomatix Mapping Station could do it for you. If you can get us your data, we'd be happy to map them for you. You can see this as a test case and share your experiences here in the forum...

                    Cheers

                    Klaus



                    • #70
                      Attention those with commercial interests posting in this thread.

                      Please check out this thread: Towards Forming a Policy on Commerical Posts (OPEN FOR DISCUSSION)

                      Also, I welcome comments from anyone else on that topic!



                      • #71
                        Hi all,

                        another good tool for ChIPSeq analysis is:
                        http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/



                        • #72
                          I tried the SISSRS method now, but as far as I can tell it does nothing more than produce a list of peak maxima from aligned positions.



                          • #73
                            Something to add to your list.

                            Few months ago my lab purchased a site license of <edited by ECO>.

                            It is a very innovative assembler (automatic batch assembly, automatic mismatch correction, automatic low quality ends trimming and other stuff).

                            I think the web address was <edited by ECO>.
                            Last edited by ECO; 09-21-2008, 10:07 PM. Reason: Clearly shill posting off-topic commercial content.



                            • #74
                              Originally posted by motan
                              Something to add to your list.

                              Few months ago my lab purchased a site license of <edited by ECO>.
                              It is a very innovative assembler (automatic batch assembly, automatic mismatch correction, automatic low quality ends trimming and other stuff).
                              I think the web address was <edited by ECO>
                              This seems pretty off-topic. From the information on the DNAbaser website, it only handles capillary sequence data. Not at all what this forum is about.



                              • #75
                                Originally posted by myrna
                                This seems pretty off-topic. From the information on the DNAbaser website, it only handles capillary sequence data. Not at all what this forum is about.
                                Good catch myrna.

                                Looks like he's acting (poorly) as a shill for software he developed:

                                http://www.vadino.com/education/misc/dna-baser.html

                                ...the email on the right of that page corresponds to the one he used to register on this site.
