
  • Benchmark (or experience) between SOAPdenovo, Velvet, Abyss, and ALLPATHS2

    Hello,

    I'm aware that accurate benchmarking is very tough to do. But, with all the recent hype regarding the panda genome, has anyone done, or have a link to, some benchmarking between SOAPdenovo, the upgraded Velvet, ALLPATHS2 and ABySS? Or, if you have any experience with them, please share your impressions. Anyhow, from the SOAPdenovo paper it would seem that it needs more RAM than ABySS but performs better on the African genome data set -- they didn't do a comparison using the "stats" developed by the ALLPATHS team.

    Also, can you use Velvet (or Abyss or ALLPATHS2) for an initial assembly and then use the GapCloser module (click on bullet #4) from SOAPdenovo? I'm guessing that you'll need to modify the format of the "scaffold file". Or maybe GapCloser's input "scaffold file" needs some specific information that only SOAPdenovo records.

    I guess that at least you can use the "Correction" module from SOAPdenovo prior to using any of the other assemblers.

    Thank you!!
    Leonardo
    L. Collado Torres, Ph.D. student in Biostatistics.

  • #2
    Benchmarking between different de novo assemblers

    I am reviving this thread after almost a year in the hope that someone has tried to benchmark different de novo assemblers in the past year, especially for large genomes. There is an excellent review of the different algorithms:

    J.R. Miller, et al., Assembly algorithms for next-generation sequencing data, Genomics (2010)

    But I haven't seen any benchmarking/comparisons except for general views such as "X takes more memory" and "Y is faster".

    I was wondering if anyone knows (or has experience of) which software works best and under what conditions, with special reference to large genomes (>300 Mb)?

    Also, I was curious... what benchmarking criteria would one use, apart from computational requirements, if one was working with a completely unknown genome? What is the general practice now... do people use all the tools (e.g. ABySS, ALLPATHS, SOAPdenovo, etc.) and find which one works best?
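    For what it's worth, the contig N50 quoted throughout this thread is the usual reference-free criterion: the length L such that contigs of length >= L together cover at least half of the assembled bases. A minimal sketch of the computation (the function name is illustrative):

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths: the largest length L
    such that contigs of length >= L cover at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # 400: the 500 and 400 bp contigs cover >= half of 1500 bp
```

    Note that N50 rewards contiguity, not correctness -- a misassembled genome can have an excellent N50 -- which is why mapping known genes or reads back to the assembly also matters.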

    Any thoughts welcome

    Thanks
    Flobpf
    Last edited by flobpf; 05-03-2010, 07:16 PM.



    • #3
      What they say.

      ALLPATHS 2

      "The five genomes were assembled on a 16-processor Dell server having 128 GB of memory. Some of the code is parallelized. The wall-clock times for the assemblies were: S. aureus, 1.7 hours; E. coli, 8.2 hours; R. sphaeroides, 10.2 hours; S. pombe, 80.5 hours; N. crassa, 86.6 hours."

      Genome Biology 2009, http://dx.doi.org/10.1186/gb-2009-10-10-r103




      SOAPdenovo

      "For the computational intensive steps, we used threaded parallelization. The preassembly error correction of the raw reads was the most time consuming step, which cost 24 and 22 h, respectively, on the Asian and African data set."

      "To manage the huge number of short reads effectively and handle them in a standard supercomputer with 512 Gb memory installed, we modularized the assembly method and organized it as a pipeline by loading only the necessary data at each step."

      Genome Research 2010, http://dx.doi.org/10.1101/gr.097261.109




      ABySS

      "With the novel distributed de Bruijn graph approach in ABySS, we are able to parallelize the assembly of billions of short reads over a cluster of commodity hardware. This method allows us to cost effectively increase the amount of memory available to the assembly process, which can scale up to handle genomes of virtually any size."

      Genome Research 2009, http://dx.doi.org/10.1101/gr.089532.108



      • #4
        I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

        We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

        -ABySS: with our limited 5kb reads, we never managed to get ABySS to use them properly for scaffolding. The contig N50 was a bit poor, whatever we tried. It also took a fair while, as we never got it to parallelize.

        -SOAPdenovo: very fast, because using multiple threads is as simple as passing -p <number of processors>, and VERY good at scaffolding. The contig N50 was not great, but better than ABySS (around 600bp).

        -CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

        In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.

        Finally, we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!
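        That ~25% figure is easy to reproduce for any assembly: count the Ns in the scaffolds before and after gap closing. A minimal sketch (the filenames in the comments are hypothetical):

```python
def count_ns(fasta_path):
    """Count ambiguous (N) bases in a FASTA file."""
    n = 0
    with open(fasta_path) as fh:
        for line in fh:
            if not line.startswith(">"):
                n += line.upper().count("N")
    return n

# Hypothetical before/after comparison:
# before = count_ns("scaffolds.fa")
# after = count_ns("scaffolds.gapclosed.fa")
# print(f"{100 * (before - after) / before:.1f}% of Ns closed")
```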

        All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.

        Now we are going to throw more data at it, hoping for a much better assembly.

        best regards

        Elia
        --------------------------------------
        Elia Stupka
        Co-Director and Head of Unit
        Center for Translational Genomics and Bioinformatics
        San Raffaele Scientific Institute
        Via Olgettina 58
        20132 Milano
        Italy
        ---------------------------------------



        • #5
          Like Elia, I have been looking at a few different assemblers to see what's best really. We have tended to use Velvet mostly for our Illumina data.

          I should add a massive disclaimer: whilst I am not entirely ignorant when it comes to running these assemblers on a UNIX system, I am not an expert. I would consider myself fairly representative of an average user who wants to assemble data using assemblers that work more or less straight "out of the box." As some of the results below vary so dramatically for one reason or another, it may be an artefact of my attempt to use these programs rather than of the assemblers themselves... Any help or advice would be gratefully accepted.

          In order to get a reasonable comparison I have used the same data set:

          ~4.6 Mb bacterial genome

          The samples were sequenced on a GAII machine as part of a 76bp multiplex library. The shuffled paired end fastq file contains ~5.2 million reads.

          Velvet optimisation script

          6 k-mer values were attempted, with a k-mer of 53 being used for the optimal assembly. Analysis took 22 mins and gave a contig N50 of 29438, with the largest contig being 107403.
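          A k-mer sweep like this is easy to script if you don't want to use VelvetOptimiser itself; a rough sketch that just builds the velveth/velvetg command lines for a range of odd k values (paths and parameters are illustrative), which you would then run and compare by N50:

```python
def velvet_commands(reads, kmers, prefix="velvet_k"):
    """Build velveth/velvetg command lines for a k-mer sweep.
    Run them with subprocess or a job scheduler, then compare N50s."""
    cmds = []
    for k in kmers:
        outdir = f"{prefix}{k}"
        # hash the reads at this k, then assemble with auto coverage settings
        cmds.append(f"velveth {outdir} {k} -fastq -shortPaired {reads}")
        cmds.append(f"velvetg {outdir} -exp_cov auto -cov_cutoff auto")
    return cmds

# Six odd k values, as in the run described above
for cmd in velvet_commands("reads_shuffled.fastq", range(43, 64, 4)):
    print(cmd)
```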

          Abyss

          Using the same k-mer value that Velvet determined, ABySS ran in ~20 mins (1 Gb of memory used) and gave a contig N50 of 28875, with the largest contig being 107380. This was a standard ABySS run; I guess running the job in parallel may well speed up the process.

          RAY

          Using the maximum allowed k-mer of 32 and utilising Open MPI to run the job across 32 nodes, this job took 146 mins (1.2 Gb of memory). This gave a contig N50 of 33906 and a largest contig of 116140. So far this is the "best assembly", though not the fastest.

          SOAPdenovo

          Using a k-mer of 45 and the default of 8 threads, this job only took 14 mins, but it did use a pretty impressive 61 Gb of memory, which seems extreme (if anyone knows how to reduce this for SOAP, please let me know!). Luckily I have access to a large-memory machine... The results weren't great, as the contig N50 was only 10100 and the largest contig came out at 57645.

          MIRA

          MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts? Memory usage was also high, at 51 Gb. Again, I am happy to be corrected if there is a better way of running MIRA. The results were fairly similar to Velvet, ABySS and Ray, and gave me an N50 of 31814 and a largest contig of 117960.

          Other assemblers

          Allpaths isn't an option due to its requirement for a second "jumping" Illumina library. I might have a play with SSAKE as well. I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...

          As I have 454 data for this genome, I have used Newbler to combine some of the Illumina assemblies with the 454 data. The results are all pretty similar, with the Velvet, Ray and SOAP combined assemblies coming out in 6 scaffolds, with one giant scaffold of 4.3 Mb which is the majority of the genome.



          • #6
            Originally posted by avtsanger View Post
            I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...
            This claim got my attention as we've had good results using the current VelvetOptimiser.

            By the way, there is a new version of ALLPATHS available: ALLPATHS-LG (Large Genomes). It requires several "jumping" libraries, which makes it hard to use in a small/medium project. Something interesting is the way they treat small bubbles in the de Bruijn graphs.

            Leo
            L. Collado Torres, Ph.D. student in Biostatistics.



            • #7
              Originally posted by avtsanger View Post
              Like Elia, I have been looking at a few different assemblers to see what's best really. We have tended to use Velvet mostly for our Illumina data.

              I should add a massive disclaimer: whilst I am not entirely ignorant when it comes to running these assemblers on a UNIX system, I am not an expert. I would consider myself fairly representative of an average user who wants to assemble data using assemblers that work more or less straight "out of the box." As some of the results below vary so dramatically for one reason or another, it may be an artefact of my attempt to use these programs rather than of the assemblers themselves... Any help or advice would be gratefully accepted.

              In order to get a reasonable comparison I have used the same data set:

              ~4.6mb bacterial genome



              The samples were sequenced on a GAII machine as part of a 76bp multiplex library. The shuffled paired end fastq file contains ~5.2 million reads.

              Just out of curiosity, did you include paired-end reads ?

              If yes, what was the distance separating the paired reads ?

              Also, do you have the reference sequence ?

              '~4.6mb' reminds me of E. coli K-12 MG1655, though it is meaningless to presume the identity based solely on genome length.

              Originally posted by avtsanger View Post

              Velvet optimisation script

              6 kmer values attempted with a kmer of 53 being used for the optimal assembly. Analysis took 22 mins and gave a contig N50 of 29438 with the largest contig being 107403


              What was the peak memory usage ?

              Originally posted by avtsanger View Post


              Abyss

              Using the same kmer value that Velvet determined, Abyss ran in ~20mins (1gb of memory used) and gave a contig N50 of 28875 with the largest contig being 107380. This was a standard abyss run, I guess running the job in parallel may well speed up the process.


              Why did you choose to reutilize the k-mer length optimized for Velvet ?


              Originally posted by avtsanger View Post

              RAY

              Using the maximum allowed kmer of 32 and utilising openmpi to run the job across 32 nodes this job took 146 mins (1.2gb of memory). This gave a contig N50 of 33906 and a largest contig of 116140. So far this is the "best assembly" though not the fastest.

              I think 146 minutes using 32 nodes for 5.2 million reads is pretty long.

              What was the interconnection between the nodes ?

              I would say it is definitely not Infiniband although I might be wrong.

              My guess: gigabit ethernet or 100BASE-TX.


              Why did you choose 32 ?


              Using a smaller k-mer size, like 21, will accommodate higher error rates.

              What was the peak coverage ? (grep peak RayLog)



              (I am the author of Ray -- feel free to send me an email !)

              Originally posted by avtsanger View Post



              SOAPdenovo

              Using a kmer of 45 and the default of 8 threads this job only took 14 mins but it did use a pretty impressive 61Gb of memory which seems extreme (if any one knows how to reduce this for SOAP please let me know!). Luckily I have access to a large memory machine... The results weren't great as the contig N50 was only 10100 and the largest contig came out at 57645.

              In their Genome Research paper, the authors used k=17.

              "De novo assembly of human genomes with massively parallel short read sequencing"
              Genome Research


              Originally posted by avtsanger View Post


              MIRA

              MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts? Memory usage was also high at 51Gb. Again I am happy to be corrected if there is a better way of running MIRA. The results were fairly similar to Velvet Abyss and RAY and gave me an N50 of 31814 and largest contig of 117960.

              I think that using a network-mounted file system will slow down any assembler that writes on-disk files. ABySS uses on-disk indexes, using google-sparsehash if available.

              I don't know for MIRA though.

              The Genome Research paper -- which might be totally outdated (I don't know) -- indicates that MIRA utilizes overlap-layout-consensus, not de Bruijn graphs.

              "Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs"
              Genome Research



              Therefore, that might explain the running time and the memory usage.

              Originally posted by avtsanger View Post

              Other assemblers

              Allpaths isn't an option due to its requirement for a second "jumping" Illumina library. I might have a play with SSAKE as well. I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...

              You can't tell whether any assembler outperformed the others here, because you need to do some short-range and long-range validation.

              If you happen to have the fasta file of the reference, you can utilize the print-latex.sh script, which is in the scripts directory of Ray.

              To use it, you need MUMmer and ruby.

              print-latex.sh reference.fasta contigs.fasta AssemblerName

              Originally posted by avtsanger View Post


              As I have 454 data for this genome I have used Newbler to combine some of the Illumina assemblies with the 454 data. The results are all pretty similar with the Velvet, RAY and SOAP combined assemblies coming out in 6 scaffolds with one giant scaffold of 4.3mb which is the majority of the genome.
              Newbler is pretty good with homopolymers.

              Did you give a try to EULER ?

              In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.



              • #8
                Originally posted by lcollado View Post
                Something interesting is the way they treat small bubbles in the de Bruijn graphs.
                Yes, I think it is very clever to store genome variations as they are encountered.



                • #9
                  Originally posted by seb567 View Post
                  Just out of curiosity, did you include paired-end reads ?

                  If yes, what was the distance separating the paired reads ?
                  This was a paired end run. The insert size is 200bp

                  Originally posted by seb567
                  Also, do you have the reference sequence ?

                  '~4.6mb' reminds me of E. coli K-12 MG1655 though it is meaningless to presume the identity based solely on genome length.
                  This is a de novo assembly with the aim of creating a reference, so unfortunately there is no reference to compare it to. It isn't an E. coli, but I can't say any more than that...


                  Originally posted by seb567
                  What was the peak memory usage ?
                  I couldn't find the output file for this, as it was done some time ago by a colleague. However, I wouldn't have thought it would have exceeded 8 Gb per k-mer.



                  Originally posted by seb567
                  Why did you choose to reutilize the k-mer length optimized for Velvet ?
                  This was done to save time. I am looking at several different organisms, so getting Velvet to choose an optimal k-mer for ABySS to use seemed simpler.




                  Originally posted by seb567
                  I think 146 minutes using 32 nodes for 5.2 million reads is pretty long.

                  What was the interconnection between the nodes ?

                  I would say it is definitely not Infiniband although I might be wrong.

                  My guess: gigabit ethernet or 100BASE-TX.


                  Why did you choose 32 ?


                  Using a k-mer size smaller like 21 will accommodate higher error rates.

                  What was the peak coverage ? (grep peak RayLog)



                  (I am the author of Ray -- feel free to send me an email !)
                  As far as I am aware we have a gigabit ethernet connection, but I might be wrong about that. I believe 32 is the highest available k-mer value in Ray. As Velvet and ABySS used higher values, I figured I would use the highest k-mer I could. Peak coverage was 37.


                  Originally posted by seb567
                  The Genome Research paper -- which might be totally outdated (I don't know), indicates that MIRA utilizes overlap-layout-consensus, not de Bruijn graphs.
                  MIRA is an overlap-layout-consensus assembler, so the time taken may be normal.

                  Originally posted by seb567
                  Did you give a try to EULER ?
                  Not yet...

                  Originally posted by seb567
                  In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.
                  I totally agree!



                  • #10
                    Originally posted by seb567 View Post
                    In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.

                    While I would like to agree with you because I would be able to use this tool in our de novo assembly projects, I understand that part of their success is based on the different kinds of input data. After all, better data should produce better results. Moreover, in their latest paper they argue that there are different kinds of de novo assembly projects and while some don't mind fragmented assemblies (after all, they contain 95% or more of the genes in most cases) others want to produce the least fragmented (and accurate) assembly and are willing to spend more $$.

                    Leo
                    L. Collado Torres, Ph.D. student in Biostatistics.



                    • #11
                      Originally posted by avtsanger View Post
                      Using a kmer of 45 and the default of 8 threads this job only took 14 mins but it did use a pretty impressive 61Gb of memory which seems extreme (if any one knows how to reduce this for SOAP please let me know!). Luckily I have access to a large memory machine... The results weren't great as the contig N50 was only 10100 and the largest contig came out at 57645.
                      You can try to pre-filter the dataset to reduce SOAP's appetite - we use a sliding-window approach, and trim off any runs of Bs. That reduces our memory requirements from 250 GB+ to around 50-100 GB (on a rather large data set), depending on the quality cut-off.
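                      The idea can be sketched roughly as follows: strip the trailing run of 'B' qualities (Q2 in the Illumina 1.5 phred+64 encoding) and then cut the read at the first low-quality sliding window. The window size and threshold below are illustrative, not the values we actually use:

```python
def trim_read(seq, quals, window=4, min_q=20, offset=64):
    """Trim a trailing Illumina 'B' run, then truncate the read at the
    first sliding window whose mean quality drops below min_q."""
    # 'B' encodes Q2 in the Illumina 1.5 phred+64 scheme
    end = len(quals)
    while end > 0 and quals[end - 1] == "B":
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # sliding-window mean quality scan
    scores = [ord(q) - offset for q in quals]
    for i in range(max(len(scores) - window + 1, 0)):
        if sum(scores[i:i + window]) / window < min_q:
            return seq[:i], quals[:i]
    return seq, quals

print(trim_read("ACGTACGTAC", "hhhhhhhhBB"))  # ('ACGTACGT', 'hhhhhhhh')
```

                      Shorter, cleaner reads mean fewer distinct (mostly erroneous) k-mers, which is where the memory saving comes from.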

                      Also, does SOAP actually use k-mers above 31?

                      Originally posted by avtsanger View Post
                      MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts? Memory usage was also high at 51Gb. Again I am happy to be corrected if there is a better way of running MIRA.
                      MIRA doesn't play well with NFS. It also would like a lot of spare disk space, if you have it.

                      And it could use some threading...



                      • #12
                        Originally posted by eslondon View Post
                        In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.
                        Interesting - did you have to hack much to get SOAP to work on CLC contigs?

                        Originally posted by eslondon View Post
                        All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.
                        Also interesting. Did you check for broken pairs as part of QC?

                        We've noticed a tendency for CLC de novo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into single contigs. These don't tend to be in coding regions, so you won't find them with genes or RNA.
                        Last edited by tonybolger; 02-11-2011, 05:56 AM.



                        • #13
                          Originally posted by tonybolger View Post

                          Also, does soap actually use K-mer above 31?
                          There is a version of SOAP that does (up to a k-mer of 63, I think). I e-mailed the authors, who kindly provided it. Not sure if the current downloadable version is the most recent.



                          • #14
                            Originally posted by avtsanger View Post
                            There is a version of SOAP that does (up to a k-mer of 63 I think). I e-mailed the authors who kindly provided it. Not sure if the current downloadable version is the most recent
                            Right you are, sir - it's been out since yesterday. The limits are now k-mer 31/63/127, using the various versions.

                            But strangely, some (not all) of the versions require the Intel MKL library.



                            • #15
                              Quick question in the same vein as this thread...

                              I have some deep sequencing results from a virus-infected sample. We know the viral sequence - kinda. We know that there are differences between our reference sequence and what is actually in the cells. If I allow for a couple of mismatches in the alignment I do with bowtie, I seem to have more or less complete coverage of the viral genome in our reads. I'd like to assemble the reads to get a "consensus" sequence of the virus. Any recommendations for what program to use for this small-scale assembly? Reads are about 25 bp, and the total viral genome should be <10 kb.

