  • tonybolger
    replied
    Originally posted by eslondon View Post
    In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.
    Interesting - did you have to hack much to get SOAP to work on CLC contigs?

    Originally posted by eslondon View Post
    All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.
    Also interesting. Did you check for broken pairs as part of QC?

    We've noticed a tendency for CLC de novo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into a single contig. These regions don't tend to be in coding sequence, so you won't find them by mapping genes or RNA-Seq reads.
    Last edited by tonybolger; 02-11-2011, 05:56 AM.



  • tonybolger
    replied
    Originally posted by avtsanger View Post
    Using a kmer of 45 and the default of 8 threads, this job only took 14 mins but it did use a pretty impressive 61Gb of memory, which seems extreme (if anyone knows how to reduce this for SOAP please let me know!). Luckily I have access to a large memory machine... The results weren't great as the contig N50 was only 10100 and the largest contig came out at 57645.
    You can try pre-filtering the dataset to reduce SOAPdenovo's appetite - we use a sliding-window approach and trim off any runs of 'B' qualities. That reduces our memory requirements from 250GB+ to around 50-100GB (on a rather large data set), depending on the quality cut-off.
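    A minimal sketch of that kind of pre-filter (not tonybolger's actual code; it assumes Phred+64 FASTQ qualities, where 'B' encodes quality 2, Illumina's low-quality segment marker, and the window size and cut-off here are arbitrary):

```python
def trim_read(seq, qual, window=4, min_q=20, offset=64):
    """Trim the 3' end of a read using a sliding window of mean quality.

    Drops trailing bases while the mean quality in the window falls
    below min_q, then strips any remaining run of 'B' qualities
    (Phred+64 score 2, Illumina's low-quality segment marker).
    """
    end = len(seq)
    # Slide from the 3' end toward the 5' end
    while end >= window:
        window_quals = [ord(q) - offset for q in qual[end - window:end]]
        if sum(window_quals) / window >= min_q:
            break
        end -= 1
    # Strip any residual trailing 'B' run
    while end > 0 and qual[end - 1] == 'B':
        end -= 1
    return seq[:end], qual[:end]

seq = "ACGTACGTACGTACGT"
qual = "hhhhhhhhhhhhBBBB"    # 'h' = Q40, 'B' = Q2 in Phred+64
print(trim_read(seq, qual))  # -> ('ACGTACGTACGT', 'hhhhhhhhhhhh')
```

    Trimming the low-quality tails before assembly shrinks the number of distinct (mostly erroneous) k-mers the assembler has to store, which is where the memory saving comes from.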

    Also, does SOAPdenovo actually use k-mers above 31?

    Originally posted by avtsanger View Post
    MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts. Memory usage was also high at 51Gb. Again, I am happy to be corrected if there is a better way of running MIRA.
    MIRA doesn't play well with NFS. It also likes a lot of spare disk space, if you have it.

    And it could use some threading...



  • lcollado
    replied
    Originally posted by seb567 View Post
    In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.

    While I would like to agree with you, since then I could use this tool in our de novo assembly projects, I understand that part of their success is based on the different kinds of input data; after all, better data should produce better results. Moreover, in their latest paper they argue that there are different kinds of de novo assembly projects: some don't mind fragmented assemblies (which, in most cases, still contain 95% or more of the genes), while others want to produce the least fragmented (and most accurate) assembly and are willing to spend more $$.

    Leo



  • avtsanger
    replied
    Originally posted by seb567 View Post
    Just out of curiosity, did you include paired-end reads ?

    If yes, what was the distance separating the paired reads ?
    This was a paired-end run. The insert size is 200bp.

    Originally posted by seb567
    Also, do you have the reference sequence ?

    '~4.6mb' reminds me of E. coli K-12 MG1655 though it is meaningless to presume the identity based solely on genome length.
    This is a de novo assembly with the aim of creating a reference, so unfortunately there is no reference to compare it to. It isn't an E. coli, but I can't say any more than that...


    Originally posted by seb567
    What was the peak memory usage ?
    I couldn't find the output file for this, as it was done some time ago by a colleague. However, I wouldn't have thought it exceeded 8Gb for any single k-mer run.



    Originally posted by seb567
    Why did you choose to reutilize the k-mer length optimized for Velvet ?
    This was done to save time. I am looking at several different organisms so getting Velvet to choose an optimal k-mer for Abyss to use seemed simpler




    Originally posted by seb567
    I think 146 minutes using 32 nodes for 5.2 million reads is pretty long.

    What was the interconnection between the nodes ?

    I would say it is definitely not Infiniband although I might be wrong.

    My guess: gigabit ethernet or 100BASE-TX.


    Why did you choose 32 ?


    Using a smaller k-mer size, like 21, will accommodate higher error rates.

    What was the peak coverage ? (grep peak RayLog)



    (I am the author of Ray -- feel free to send me an email !)
    As far as I am aware we have a gigabit ethernet connection, but I might be wrong about that. I believe 32 is the highest available k-mer value in Ray. As Velvet and Abyss used higher values, I figured I would use the highest k-mer I could. Peak coverage was 37.


    Originally posted by seb567
    The Genome Research paper -- which might be totally outdated (I don't know), indicates that MIRA utilizes overlap-layout-consensus, not de Bruijn graphs.
    MIRA is an overlap-layout consensus assembler so the time taken may be normal.

    Originally posted by seb567
    Did you give a try to EULER ?
    Not yet...

    Originally posted by seb567
    In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.
    I totally agree!



  • seb567
    replied
    Originally posted by lcollado View Post
    Something interesting is the way they treat small bubbles in the de Bruijn graphs.
    Yes, I think it is very clever to store genome variations as they are encountered.



  • seb567
    replied
    Originally posted by avtsanger View Post
    Like Elia, I have been looking at a few different assemblers to see what's best really. We have tended to use Velvet mostly for our Illumina data.

    I should add a massive disclaimer: whilst I am not entirely ignorant when it comes to running these assemblers on a UNIX system, I am not an expert. I would consider myself fairly representative of an average user who wants to assemble data using assemblers that work more or less straight "out of the box." As some of the results below vary so dramatically, for one reason or another, it may be an artefact of my attempts to use these programs rather than of the assemblers themselves... Any help or advice would be gratefully accepted.

    In order to get a reasonable comparison I have used the same data set:

    ~4.6mb bacterial genome



    The samples were sequenced on a GAII machine as part of a 76bp multiplex library. The shuffled paired end fastq file contains ~5.2 million reads.

    Just out of curiosity, did you include paired-end reads ?

    If yes, what was the distance separating the paired reads ?

    Also, do you have the reference sequence ?

    '~4.6mb' reminds me of E. coli K-12 MG1655 though it is meaningless to presume the identity based solely on genome length.

    Originally posted by avtsanger View Post

    Velvet optimisation script

    6 kmer values attempted with a kmer of 53 being used for the optimal assembly. Analysis took 22 mins and gave a contig N50 of 29438 with the largest contig being 107403


    What was the peak memory usage ?

    Originally posted by avtsanger View Post


    Abyss

    Using the same kmer value that Velvet determined, Abyss ran in ~20mins (1gb of memory used) and gave a contig N50 of 28875 with the largest contig being 107380. This was a standard abyss run, I guess running the job in parallel may well speed up the process.


    Why did you choose to reutilize the k-mer length optimized for Velvet ?


    Originally posted by avtsanger View Post

    RAY

    Using the maximum allowed kmer of 32 and utilising openmpi to run the job across 32 nodes this job took 146 mins (1.2gb of memory). This gave a contig N50 of 33906 and a largest contig of 116140. So far this is the "best assembly" though not the fastest.

    I think 146 minutes using 32 nodes for 5.2 million reads is pretty long.

    What was the interconnection between the nodes ?

    I would say it is definitely not Infiniband although I might be wrong.

    My guess: gigabit ethernet or 100BASE-TX.


    Why did you choose 32 ?


    Using a smaller k-mer size, like 21, will accommodate higher error rates.
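    The trade-off behind this advice is the standard back-of-the-envelope calculation (not from the thread): with per-base error rate e, a k-mer is error-free with probability (1-e)^k, so shorter k-mers survive sequencing errors more often:

```python
def error_free_fraction(k, per_base_error):
    """Probability that a single k-mer contains no sequencing errors."""
    return (1 - per_base_error) ** k

def expected_clean_kmers(read_len, k, per_base_error):
    """Expected number of error-free k-mers in one read of length read_len."""
    return (read_len - k + 1) * error_free_fraction(k, per_base_error)

# 76 bp reads (as in this data set), assuming a 1% per-base error rate:
for k in (21, 32, 45):
    print(k,
          round(error_free_fraction(k, 0.01), 3),
          round(expected_clean_kmers(76, k, 0.01), 1))
# k=21 keeps ~81% of k-mers error-free; k=45 keeps only ~64%
```

    The flip side is that shorter k-mers resolve fewer repeats, which is why the optimal k is usually found empirically (as the VelvetOptimiser runs in this thread do).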

    What was the peak coverage ? (grep peak RayLog)



    (I am the author of Ray -- feel free to send me an email !)

    Originally posted by avtsanger View Post



    SOAPdenovo

    Using a kmer of 45 and the default of 8 threads, this job only took 14 mins but it did use a pretty impressive 61Gb of memory, which seems extreme (if anyone knows how to reduce this for SOAP please let me know!). Luckily I have access to a large memory machine... The results weren't great as the contig N50 was only 10100 and the largest contig came out at 57645.

    In their Genome Research paper, the authors used k=17.

    "De novo assembly of human genomes with massively parallel short read sequencing"


    Originally posted by avtsanger View Post


    MIRA

    MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts. Memory usage was also high at 51Gb. Again, I am happy to be corrected if there is a better way of running MIRA. The results were fairly similar to Velvet, Abyss and RAY, and gave me an N50 of 31814 and a largest contig of 117960.

    I think that using a network-mounted file system will slow down any assembler that writes files to disk. ABySS uses on-disk indexes, with google-sparsehash if available.

    I don't know for MIRA though.

    The Genome Research paper -- which might be totally outdated (I don't know), indicates that MIRA utilizes overlap-layout-consensus, not de Bruijn graphs.

    "Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs"



    Therefore, that might explain the running time and the memory usage.

    Originally posted by avtsanger View Post

    Other assemblers

    Allpaths isn't an option due to its requirement for a second "jumping" Illumina library. I might have a play with SSAKE as well. I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...

    You can't tell whether any assembler outperformed the others here, because you need to do some short-range and long-range validation.

    If you happen to have the fasta file of the reference, you can utilize the print-latex.sh script, which is in the scripts directory of Ray.

    To use it, you need MUMmer and ruby.

    print-latex.sh reference.fasta contigs.fasta AssemblerName

    Originally posted by avtsanger View Post


    As I have 454 data for this genome, I have used Newbler to combine some of the Illumina assemblies with the 454 data. The results are all pretty similar, with the Velvet, RAY and SOAP combined assemblies coming out in 6 scaffolds, with one giant scaffold of 4.3mb covering the majority of the genome.
    Newbler is pretty good with homopolymers.

    Did you give a try to EULER ?

    In my opinion, the researchers at the Broad Institute should remove requirements for jumping libraries for their otherwise splendid approach to genome assembly.



  • lcollado
    replied
    Originally posted by avtsanger View Post
    I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...
    This claim got my attention as we've had good results using the current VelvetOptimiser.

    By the way, there is a new version of ALLPATHS available: ALLPATHS-LG (Larger Genomes). It requires several "jumping" libraries, which makes it hard to use in a small/medium project. Something interesting is the way they treat small bubbles in the de Bruijn graphs.

    Leo



  • avtsanger
    replied
    Like Elia, I have been looking at a few different assemblers to see what's best really. We have tended to use Velvet mostly for our Illumina data.

    I should add a massive disclaimer: whilst I am not entirely ignorant when it comes to running these assemblers on a UNIX system, I am not an expert. I would consider myself fairly representative of an average user who wants to assemble data using assemblers that work more or less straight "out of the box." As some of the results below vary so dramatically, for one reason or another, it may be an artefact of my attempts to use these programs rather than of the assemblers themselves... Any help or advice would be gratefully accepted.

    In order to get a reasonable comparison I have used the same data set:

    ~4.6mb bacterial genome

    The samples were sequenced on a GAII machine as part of a 76bp multiplex library. The shuffled paired end fastq file contains ~5.2 million reads.

    Velvet optimisation script

    6 kmer values attempted with a kmer of 53 being used for the optimal assembly. Analysis took 22 mins and gave a contig N50 of 29438 with the largest contig being 107403

    Abyss

    Using the same kmer value that Velvet determined, Abyss ran in ~20mins (1gb of memory used) and gave a contig N50 of 28875 with the largest contig being 107380. This was a standard abyss run, I guess running the job in parallel may well speed up the process.

    RAY

    Using the maximum allowed kmer of 32 and utilising openmpi to run the job across 32 nodes this job took 146 mins (1.2gb of memory). This gave a contig N50 of 33906 and a largest contig of 116140. So far this is the "best assembly" though not the fastest.

    SOAPdenovo

    Using a kmer of 45 and the default of 8 threads, this job only took 14 mins but it did use a pretty impressive 61Gb of memory, which seems extreme (if anyone knows how to reduce this for SOAP please let me know!). Luckily I have access to a large memory machine... The results weren't great as the contig N50 was only 10100 and the largest contig came out at 57645.

    MIRA

    MIRA took an incredible 2007 mins (33.4 hrs) to run. This may be to do with the fact that we use NFS mounts. Memory usage was also high at 51Gb. Again, I am happy to be corrected if there is a better way of running MIRA. The results were fairly similar to Velvet, Abyss and RAY, and gave me an N50 of 31814 and a largest contig of 117960.

    Other assemblers

    Allpaths isn't an option due to its requirement for a second "jumping" Illumina library. I might have a play with SSAKE as well. I have also recently been made aware of another Velvet script that works in a similar way to VelvetOptimiser but according to the creators, outperforms it. We will see...

    As I have 454 data for this genome, I have used Newbler to combine some of the Illumina assemblies with the 454 data. The results are all pretty similar, with the Velvet, RAY and SOAP combined assemblies coming out in 6 scaffolds, with one giant scaffold of 4.3mb covering the majority of the genome.
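    For anyone reproducing these comparisons: the contig N50 figures quoted throughout this thread can be computed from a list of contig lengths with a few lines of code (this follows the usual definition; some tools differ slightly at ties):

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: total span 280; 100 + 80 = 180 >= 140, so N50 = 80
print(n50([100, 80, 50, 30, 20]))  # -> 80
```

    Note that N50 alone says nothing about correctness, which is why the QC steps mentioned elsewhere in the thread (mapping known genes, checking pairs) matter.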



  • eslondon
    replied
    I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

    We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

    -ABYSS: with our limited 5kb reads, we never managed to get ABySS to use them properly for scaffolding. The contig N50 was a bit poor, whatever we tried. It took a fair while, and we never got it to parallelize.

    -SOAPdenovo: very fast, because using multiple threads is as simple as passing -p <number of processors>, and VERY good at scaffolding. The contig N50 was not great, but better than ABySS (around 600bp).

    -CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

    In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.

    Finally, we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!
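    A quick way to check that kind of improvement yourself is to compare the N fraction of the assembly before and after gap closing. A minimal sketch (the FASTA strings here are toy data, not eslondon's assembly):

```python
def n_fraction(fasta_text):
    """Fraction of assembled bases that are Ns, given FASTA text."""
    bases = "".join(
        line.strip() for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    return bases.upper().count("N") / len(bases)

before = ">scaf1\nACGTNNNNACGT\n>scaf2\nNNNNACGT\n"
after = ">scaf1\nACGTNNACGT\n>scaf2\nNNACGT\n"
print(n_fraction(before), n_fraction(after))  # -> 0.4 0.25
```

    For real assemblies you would read the files from disk, but the counting logic is the same.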

    All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.

    Now we are going to throw more data at it, hoping for a much better assembly

    best regards

    Elia



  • seb567
    replied
    What they say.

    ALLPATHS 2

    "The five genomes were assembled on a 16-processor Dell server having 128 GB of memory. Some of the code is parallelized. The wall-clock times for the assemblies were: S. aureus, 1.7 hours; E. coli, 8.2 hours; R. sphaeroides, 10.2 hours; S. pombe, 80.5 hours; N. crassa, 86.6 hours."

    Genome Biology 2009, http://dx.doi.org/doi:10.1186/gb-2009-10-10-r103




    SOAPdenovo

    "For the computational intensive steps, we used threaded parallelization. The preassembly error correction of the raw reads was the most time consuming step, which cost 24 and 22 h, respectively, on the Asian and African data set."

    "To manage the huge number of short reads effectively and handle them in a standard supercomputer with 512 Gb memory installed, we modularized the assembly method and organized it as a pipeline by loading only the necessary data at each step."

    Genome Research 2010, http://dx.doi.org/10.1101/gr.097261.109




    ABySS

    "With the novel distributed de Bruijn graph approach in ABySS, we are able to parallelize the assembly of billions of short reads over a cluster of commodity hardware. This method allows us to cost effectively increase the amount of memory available to the assembly process, which can scale up to handle genomes of virtually any size."

    Genome Research 2008, http://dx.doi.org/10.1101/gr.089532.108



  • flobpf
    replied
    Benchmarking between different de novo assemblers

    I am reviving this thread after almost a year in the hope that someone has tried to benchmark different de novo assemblers in the past year, especially for large genomes. I found an excellent review of the different algorithms:

    J.R. Miller, et al., Assembly algorithms for next-generation sequencing data, Genomics (2010)

    But I haven't seen any benchmarking/comparisons except for general views such as "X takes more memory" and "Y is faster".

    I was wondering if anyone knows (or has experience of) which software works best, and under what conditions, with special reference to large genomes (>300Mb)?

    Also, I was curious: what benchmarking criteria would one use, apart from computational requirements, if one was working with a completely unknown genome? What is the general practice now? Do people run all the tools (e.g. ABySS, ALLPATHS, SOAPdenovo) and see which one works best?

    Any thoughts welcome

    Thanks
    Flobpf
    Last edited by flobpf; 05-03-2010, 07:16 PM.



  • Benchmark (or experience) between SOAPdenovo, Velvet, Abyss, and ALLPATHS2

    Hello,

    I'm aware that accurate benchmarking is very tough to do. But, with all the recent hype around the panda genome, has anyone done, or have a link to, some benchmarking between SOAPdenovo, the upgraded Velvet, ALLPATHS2 and ABySS? Or, if you have any experience with them, please share your impressions. Anyhow, from the SOAPdenovo paper it would seem that it needs more RAM than ABySS but performs better on the African genome data set -- they didn't do a comparison using the "stats" developed by the ALLPATHS team.

    Also, can you use Velvet (or ABySS or ALLPATHS2) for an initial assembly and then use the GapCloser module (bullet #4 on the SOAPdenovo page) from SOAPdenovo? I'm guessing that you'll need to modify the format of the "scaffold file". Or maybe GapCloser's input "scaffold file" needs some specific information that only SOAPdenovo records.

    I guess that at least you can use the "Correction" module from SOAPdenovo prior to using any of the other assemblers.
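    For what it's worth, GapCloser takes a scaffold FASTA plus the same library-description file that SOAPdenovo uses, so in principle the scaffolds don't need SOAPdenovo-specific information. A sketch, assuming GapCloser's usual flags and the standard config format (all file names here are placeholders; check your version's help output):

```shell
# Write a SOAPdenovo-style library description
# (GapCloser reads the same format).
cat > soap.config <<'EOF'
max_rd_len=76
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
q1=reads_1.fastq
q2=reads_2.fastq
EOF

# Close gaps in scaffolds produced by any assembler;
# -l is the maximum read length in the data set.
GapCloser -a scaffolds.fa -b soap.config -o gapclosed.fa -l 76
```

    Whether GapCloser behaves as well on scaffolds from other assemblers as on SOAPdenovo's own output is exactly the open question here.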

    Thank you!!
    Leonardo
