  • ultrafast de novo assembly?

    Is there a way to run a "quick and dirty" de novo assembly of Illumina reads from a genome? All we need is to obtain contigs at least several hundred nucleotides long. Our current runs with SOAPdenovo and Velvet give good results but are far too time-consuming for what we need.
    Thank you for any suggestions.

  • #2
    Minia is a quick and memory-efficient de novo assembler, but the results are not so accurate.


    • #3
      CLC Assembly Cell is probably one of the quickest out there with reasonable results. It's expensive, but they have a 2 week trial version so you can see if it meets your needs.


      • #4
        Originally posted by TiborNagy:
        "Minia is a quick and memory-efficient de novo assembler, but the results are not so accurate."
        Minia may be memory efficient, but I found it to take orders of magnitude longer than Velvet.

        For fast assembly, I suggest subsampling or normalizing the input reads first, to reduce the input volume - that speeds things up a lot with Velvet, at least. Subsampling is faster but normalizing is often better. You can do either with BBTools. I find that a target depth of around 40 works well with Velvet.

        subsample:
        reformat.sh in=reads.fq out=sampled.fq samplerate=0.1

        normalize:
        bbnorm.sh in=reads.fq out=normalized.fq target=40

        If you have paired reads in 2 files, you can use the in1, in2, out1, and out2 flags, and pairs will stay together.
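A rough way to pick the samplerate value for a given target depth is sketched below. This is only an illustration: the function name and the example numbers (read count, read length, genome size) are hypothetical, and it assumes you have at least a rough estimate of the genome size.

```python
def samplerate_for_depth(n_reads, read_length, genome_size, target_depth=40.0):
    """Return the fraction of reads to keep so average depth is about target_depth."""
    if min(n_reads, read_length, genome_size) <= 0:
        raise ValueError("n_reads, read_length and genome_size must be positive")
    # Average depth = total sequenced bases / genome size.
    current_depth = n_reads * read_length / genome_size
    # Never ask for more reads than exist: cap the rate at 1.0.
    return min(1.0, target_depth / current_depth)


# Hypothetical run: 20 million 100 bp reads from a 5 Mbp genome (about 400x depth).
rate = samplerate_for_depth(20_000_000, 100, 5_000_000)
print(f"samplerate={rate:.3f}")
```

The resulting fraction is what would be passed as samplerate to reformat.sh. For paired data, count the reads from both files in n_reads.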


        • #5
          normalization

          Thanks for the suggestions.

          As for CLC: we do have CLC Genomics Workbench. It works great but is still too slow for what we need, not much different from Velvet.

          As for normalization of reads before assembly: I do not understand the methods well enough, but I was told that normalization is not good for assembly methods based on k-mers, possibly because those methods need information about the abundance of reads containing each k-mer, and that would be lost by normalization. I am not sure whether it counts as normalization, but I wanted to use the Usearch program to reduce the number of reads (dereplication or UCLUST). Usearch is fast enough for our planned throughput.
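To make "k-mer abundance" concrete, here is a minimal sketch of counting k-mers across a handful of reads. It is not how any particular assembler stores its counts; the function name and the toy reads are made up for illustration. K-mer based assemblers commonly use counts like these to separate well-supported sequence from likely sequencing errors, which is why changes in read abundance can matter.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy data: three identical reads plus one read with a single-base difference.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = kmer_counts(reads, k=4)
for kmer, n in counts.most_common():
    print(kmer, n)
```

In this toy input the k-mers unique to the odd read have a count of 1, while the shared k-mers are seen three or four times.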


          • #6
            The effect of normalization really depends on the normalizer and the assembler.

            In my testing of BBNorm, normalization universally improves the L50 with Velvet, and some other metrics (total number of errors, total size, longest contig length, total number of contigs) may go up or down slightly but generally there is a positive trend. There's also typically a positive trend with Soap. AllPathsLG appears to be much more sensitive to read abundance patterns, and normalization seems to have a negative impact just as often as a positive one.
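For readers unfamiliar with the metric, here is a small sketch of the length-based statistic that many tools report as N50, assuming that is the sense in which L50 is used above (naming differs between tools). The function name and contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Two toy assemblies with the same total size (1000 bp) but different contiguity.
fragmented = [300, 250, 200, 150, 100]
contiguous = [700, 200, 100]
print(n50(fragmented), n50(contiguous))
```

A higher value means that more of the assembly sits in long contigs. It says nothing about whether those contigs are correct.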

            But subsampling does not change the relative read abundance, it just scales it down by a constant factor across the whole genome, so if you are worried about the effects of normalization then subsampling is a better option. It's extremely fast. Dereplication is not a bad idea, but if you only remove identical read pairs, it won't decrease your data volume much. If you treat data as single-ended and remove all duplicate individual reads, it will reduce your data much more. However, dereplication DOES increase the error rate, since reads with errors are less likely to be duplicates. You may wish to error-correct first, which BBNorm can also do - that will cause more reads to be removed.
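The point about dereplication raising the error rate can be seen in a toy example. This is a sketch with made-up reads, not a model of real data: the helper names are hypothetical, and keeping every second read is only a deterministic stand-in for random subsampling.

```python
def dereplicate(reads):
    """Keep the first copy of each distinct read sequence, preserving order."""
    return list(dict.fromkeys(reads))


def subsample_every(reads, step):
    """Deterministic stand-in for random subsampling: keep every step-th read."""
    return reads[::step]


def error_fraction(reads, true_seq):
    """Fraction of reads that differ from the error-free sequence."""
    return sum(r != true_seq for r in reads) / len(reads)


TRUE_SEQ = "ACGTACGT"
# Eight error-free copies plus two reads that each carry a different error.
reads = [TRUE_SEQ] * 8 + ["ACGTACGA", "ACCTACGT"]

print("original:     ", len(reads), error_fraction(reads, TRUE_SEQ))
sub = subsample_every(reads, 2)
print("subsampled:   ", len(sub), error_fraction(sub, TRUE_SEQ))
derep = dereplicate(reads)
print("dereplicated: ", len(derep), error_fraction(derep, TRUE_SEQ))
```

The eight correct copies collapse to one read while each erroneous read survives, so the share of reads with errors rises after dereplication. In this toy input the subsampled set keeps the original share.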


            • #7
              Minia dev here. I regret to hear that for some of you Minia has been inaccurate or too slow.

              Minia is IO-intensive, so it can be slow if you run it on a network-mounted folder (e.g. your cluster's home directory). On a regular hard drive, or even better an SSD, it will be quick; I stand by the claim that human-sized genomes are assembled in a day on a plain desktop computer.

              Regarding the quality of Minia results, in my tests (using QUAST) I never noticed more misassemblies than with other assemblers. TiborNagy, could you elaborate on your comment?

              To contribute to the thread: if all you have is a single machine with many CPUs, then SOAPdenovo2 or Velvet using all CPU cores are likely to be the fastest assemblers. Minia is pretty fast using just 1 thread. I recall that ABySS was able to assemble a human genome in half a day using a cluster, and it's likely that the Ray assembler will match this performance as well.


              • #8
                Originally posted by rchikhi:
                "TiborNagy, could you elaborate your comment?"
                Of course. I tried Minia 3 years ago, when I was trying to assemble human HLA genes with different assemblers. (Yes, it is a very hard task, I know.) I mapped the contigs back to the human reference and looked at the coverage. Minia was the fastest program, but the contigs were too small. (Sorry, I cannot remember the exact values.)

                I have read the Minia article. It is a clever algorithm, but it does not fit every task.


                • #9
                  Thanks for your comment.

                  There's a difference between accuracy of contigs (misassembly, mismatches) and contiguity (how long the contigs are).

                  Yes, it makes sense to say that Minia doesn't always produce the most contiguous results, given that it has a very simple contig construction algorithm that doesn't use read information or paired-end information. However, in terms of accuracy (misassemblies, mismatches), it should perform reasonably well.


                  • #10
                    I should clarify that I've only tried Minia once, and it was on a metagenome of unknown size and composition (the assemblies came out at 30 to 60 Mbp). I ran Velvet, Spades, Soap, and Minia. Soap crashed; Velvet was the fastest, and Minia took a long time. None of the assemblies were any good (L50 much shorter than read length). Our disk subsystem is very unpredictable and often extremely slow, which could have been the problem.

                    So, that could be an anomalous result compared to running it on an isolate using local disk.


                    • #11
                      Thanks for the details. A slow disk system is the only reason why Minia can be slow, so this makes sense.
