  • question about de novo assembly

    Hi guys,
    I have a question regarding de novo assembly.

    Firstly, some info on my machine:
    64-bit Slackware Linux
    64 GB RAM
    i7-3930K

    For small genomes there is no problem. But what about 3 Gb genomes? How can one handle such a task given the available power of this machine?

    I once had an RNA-seq dataset of 70M PE Illumina reads which I tried to assemble with SOAPdenovo-Trans, but the program broke after loading 100M reads. So I am concerned that for a 3 Gb genome at 10x coverage or more I will have a huge number of reads which I won't be able to handle with SOAPdenovo. I saw that they have a '-a' option for the pregraph step which supposedly restricts memory usage, but I have the feeling I will still have problems.

    I then had the idea to split the raw reads into smaller files, make mini de novo assemblies of the split reads, and then merge them somehow. But so far I could not find software which would allow me to do that.

    So I would like to ask: is there software which would let me do the procedure described above? Or is there some other strategy to tackle this problem? Or do I just need a machine with lots of RAM?

    Thank you for your time and any help!!

  • #2
    Try Titus Brown's Diginorm (digital normalization) program on your sample before you run SOAPdenovo. It should reduce the number of reads without reducing the complexity of the sample. I do not have a reference to the program at hand, but a Google search or looking through this forum should bring up a link.
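
    A minimal sketch of the digital normalization idea (not the actual khmer/Diginorm implementation, which uses a constant-memory counting structure): a read is kept only if the median abundance of its k-mers is still below a coverage cutoff, so redundant reads are discarded without (ideally) throwing away novel sequence. The k and cutoff values here are illustrative.

    Code:
    from collections import Counter
    from statistics import median

    def digital_normalize(reads, k=20, cutoff=20):
        """Keep a read only while its median k-mer abundance is below `cutoff`."""
        kmer_counts = Counter()
        kept = []
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if not kmers:
                continue  # read shorter than k
            if median(kmer_counts[km] for km in kmers) < cutoff:
                kept.append(read)          # read still adds coverage: keep it
                kmer_counts.update(kmers)  # only kept reads update the counts
        return kept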



    • #3
      An efficient de novo assembly algorithm such as SGA might be of interest:

      "De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index der …"


      Otherwise, you can't beat more RAM. Perhaps a high-RAM cloud service such as the one from BGI might be the cost-effective solution, since you're not going to be doing this every day.
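
      For intuition, here is a toy illustration of the FM-index idea the abstract refers to: build the Burrows-Wheeler transform of a text, then count pattern occurrences by backward search. Assemblers like SGA use compressed, sampled structures over millions of reads; this naive sketch only shows the mechanics.

      Code:
      def bwt(text):
          """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
          text += "$"  # unique terminator, lexicographically smallest
          rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
          return "".join(rot[-1] for rot in rotations)

      def fm_count(bwt_str, pattern):
          """Count occurrences of `pattern` in the original text by backward search."""
          first_col = sorted(bwt_str)
          C = {}  # C[c] = number of characters in the text strictly smaller than c
          for i, c in enumerate(first_col):
              C.setdefault(c, i)
          occ = lambda c, i: bwt_str[:i].count(c)  # naive; real FM-indexes sample this
          lo, hi = 0, len(bwt_str)
          for c in reversed(pattern):
              if c not in C:
                  return 0
              lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
              if lo >= hi:
                  return 0
          return hi - lo

      print(fm_count(bwt("ACGTACGTGACG"), "ACG"))  # -> 3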



      • #4
        Does BGI actually provide large-memory machines? I do not use them, and just now I tried looking through their offerings but could not find information on memory limits. I know that Amazon's (admittedly not bioinformatics-oriented) EC2 cloud only goes up to 64 GB.



        • #5
          Re BGI, I'm not sure whether they offer it commercially either, but they do have the resources to do it and plenty of experience assembling 3 Gb genomes from Illumina data.

          They do talk about de novo assembly with Hecate here:



          • #6
            Hi,
            Thank you all for the prompt replies.

            I downloaded and compiled SGA and will give it a try. It seems that it might do the job.

            I will also try Diginorm. It might help in other de novo projects as well.

            The last option is to talk to the boss about buying a computer with 192 GB RAM. That would do the job for sure.

            Thanks again,
            Cheers



            • #7
              Gossamer claims to come close to the theoretical lower limit on memory usage for de Bruijn graph de novo assemblers. It is available for non-commercial use from http://www.genomics.csse.unimelb.edu.au/product-gossamer.php.
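
              For a sense of what "theoretical lower limit" means here: Conway and Bromage (the Gossamer authors) bound the space for a de Bruijn graph by the information needed to specify which (k+1)-mers are present. A small sketch of that bound, with illustrative numbers:

              Code:
              import math

              def log2_binomial(n, r):
                  """log2 of C(n, r) via log-gamma, to avoid huge integers."""
                  return (math.lgamma(n + 1) - math.lgamma(r + 1)
                          - math.lgamma(n - r + 1)) / math.log(2)

              k = 27                  # k-mer size; graph edges are (k+1)-mers
              edges = 3_000_000_000   # idealized: ~one distinct edge per base of a 3 Gb genome

              bits = log2_binomial(4 ** (k + 1), edges)
              print(f"~{bits / edges:.1f} bits/edge, ~{bits / 8 / 2**30:.1f} GiB total")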



              • #8
                Originally posted by kenietz View Post
                Hi,
                Thank you all for the prompt replies.

                I downloaded and compiled SGA and will give it a try. It seems that it might do the job.

                I will also try Diginorm. It might help in other de novo projects as well.

                The last option is to talk to the boss about buying a computer with 192 GB RAM. That would do the job for sure.

                Thanks again,
                Cheers
                Quality trimming your data will drastically reduce the memory usage of de Bruijn graph assemblers. Unfortunately, 192 GB of RAM is nowhere close to what you need to assemble a 3 Gb genome unless you use a string graph assembler like SGA or Readjoiner. The trade-off is running time, though Readjoiner is very fast compared to SGA (but does not produce the same quality assemblies as de Bruijn graph assemblers in my experience). Regardless, I'm not sure you have the resources to assemble a 3 Gb genome based on your first post. What is your genome coverage?
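
                A toy illustration of the kind of 3' quality trimming SES means, assuming Phred+33 quality encoding; a real pipeline would use a dedicated trimmer, and the Q20 threshold is just an example.

                Code:
                def trim_3prime(seq, qual, min_q=20):
                    """Trim a read back to the last base with Phred quality >= min_q (Phred+33)."""
                    end = len(seq)
                    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
                        end -= 1
                    return seq[:end], qual[:end]

                # the low-quality '#' (Q2) tail is removed; 'I' is Q40
                print(trim_3prime("ACGTACGT", "IIIII###"))  # -> ('ACGTA', 'IIIII')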



                • #9
                  @SES:
                  Thank you for the information. The client wants to try 10x at first and then proceed to higher coverage. Yeah, I got it that SGA would probably be able to do the job. Now I am reading about Readjoiner. I'm still considering whether to take the job at all.

                  Btw, what kind of power would I really need to assemble a 3 Gb genome?



                  • #10
                    If by "power" you mean "memory", this thread might be relevant:

                    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                    Talks about memory requirement for velvet, which is pretty memory-hungry. So if you can do it with velvet you could probably do it with any de-bruijn based assembly program (like gossamer that I mentioned above). Some programs are based on other methods (e.g. overlap-consensus) and I am not sure how to calculate memory requirements, although I know MIRA has a memory-requirement estimation program that comes with it.

                    If by power you mean processor speed, this is usually not the limiting factor in my experience.



                    • #11
                      Hi DFJ111,
                      thanks for the info. By power I meant mainly the memory. Yeah, MIRA is a pretty good program but requires a lot of memory when working with Illumina reads: something like 1-1.5 GB per million reads (see the rough sketch below).
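
                      A quick back-of-the-envelope using that rule of thumb for the 3 Gb project; the 100 bp read length is an assumption, the per-million-reads figure is the one quoted above.

                      Code:
                      # MIRA rule of thumb from above: 1-1.5 GB RAM per million Illumina reads.
                      genome_bp = 3_000_000_000
                      coverage = 10
                      read_len = 100  # assumed

                      million_reads = genome_bp * coverage / read_len / 1e6  # ~300M reads
                      print(f"~{million_reads:.0f}M reads -> {million_reads:.0f}-{1.5 * million_reads:.0f} GB RAM")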

                      So for now, if I have to do this job I should try SGA or Readjoiner. Or find a cluster. Btw, I'm not sure, but do most assemblers run on clusters? I have never used an assembler on a cluster yet.



                      • #12
                        Originally posted by kenietz View Post
                        Btw, I'm not sure, but do most assemblers run on clusters? I have never used an assembler on a cluster yet.
                        Most? I am not sure about that. Practically speaking you only need one or two good cluster-aware assemblers, so who really cares about the others?

                        Velvet is not, as far as I know, cluster-aware. ABySS is cluster-aware. Not sure about SGA, etc.



                        • #13
                          Originally posted by kenietz View Post
                          @SES:
                          Thank you for the information. The client wants to try 10x at first and then proceed to higher coverage. Yeah, I got it that SGA would probably be able to do the job. Now I am reading about Readjoiner. I'm still considering whether to take the job at all.

                          Btw, what kind of power would I really need to assemble a 3 Gb genome?
                          With 10X coverage, you will likely not get an "assembly." With coverage that low you will just be clustering reads, and then you'll find that the "assembly" is far shorter than you expected. If you already have a reference then this approach makes sense, but not if this will be the reference.

                          If you have sufficient coverage and a mixture of 454 and Illumina data then you will need as much memory as you can get. The Broad reports that AllPaths uses 1.7 bytes of memory per read base, so that can serve as a rough guide (see the sketch below). That suggests 512 GB should be sufficient to assemble a 3 Gb genome, but I don't think that holds for large plant genomes. I have seen a number of talks in the last year where people (colleagues included) are doing assemblies of genomes >3 Gb on machines with 1 TB of memory. Of course, this all depends heavily on the amount and type of data you have, as well as the unique properties (i.e., repeat structure) of your species.
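
                          As a sanity check on that figure, the arithmetic (coverage values illustrative):

                          Code:
                          # 1.7 bytes of memory per read base (the Broad's AllPaths figure quoted above)
                          bytes_per_base = 1.7
                          genome_bp = 3_000_000_000

                          for coverage in (10, 50, 100):
                              ram_gb = bytes_per_base * genome_bp * coverage / 1e9
                              print(f"{coverage:>3}x: ~{ram_gb:,.0f} GB")  # 100x of a 3 Gb genome -> ~510 GB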



                          • #14
                            Originally posted by westerman View Post
                            Most? I am not sure about that. Practically speaking you only need one or two good cluster-aware assemblers, so who really cares about the others?

                            Velvet is not, as far as I know, cluster-aware. ABySS is cluster-aware. Not sure about SGA, etc.
                            Ray will use multiple processors, though I could never get Ray to produce assemblies comparable to those from Velvet and SOAPdenovo.



                            • #15
                              Originally posted by kenietz View Post
                              Hi guys,
                              I have a question regarding de novo assembly.
                              [...]
                              So I would like to ask: is there software which would let me do the procedure described above? Or is there some other strategy to tackle this problem? Or do I just need a machine with lots of RAM?

                              The answer to your question depends on whether you are assembling a genome or a transcriptome. Could you please clarify that?
                              http://homolog.us

