  • SOAPdenovo running for 48 h?

    Hi,

    I am doing de novo assembly of a ~4Mb bacterial genome. I used ABySS and it took around 8 hours to produce ~1000 scaffolds (N50 80kb).

    I used paired-end reads 100bp (total ~300,000,000 reads), mate-paired reads 100 bp (~60,000,000 reads total) and long single end reads ~3kb with 50x coverage. SOAPdenovo has been running for 48 hours now on Linux 64-bit with 8 cpus and 33GB RAM.


    Is this running time normal? The log currently says: 32400000000th reads. I don't know how long the program will take to finish the analysis. Is there any way of checking this?

    Thanks

  • #2
    You have ~30Gbp of paired end data and ~6Gbp of mate pair data; 36Gbp total to assemble a 4Mbp genome??? That is 9,000X coverage (plus your 50X coverage of long reads). Beyond a certain level of coverage you will not improve an assembly, and in fact it is likely that you will make the assembly worse. The more reads you have, the more random sequencing errors you have, which the assembler will have to resolve. While I don't use SOAPdenovo myself, I'm not surprised that it is taking forever with the extreme depth you are using.

    In my experience, the assembly of a bacterial genome will not improve beyond 30-50X total coverage.
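
For anyone wanting to check the arithmetic, here is a minimal Python sketch of the coverage estimate (total sequenced bases divided by genome size). The read counts, read length, and genome size are the approximate figures quoted in this thread; the function name `coverage` is just illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Return fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


GENOME_SIZE = 4_000_000  # ~4 Mb bacterial genome (approximate)

# Approximate library sizes quoted in the original post.
paired_end = coverage(300_000_000, 100, GENOME_SIZE)
mate_pair = coverage(60_000_000, 100, GENOME_SIZE)

print(f"paired-end: {paired_end:,.0f}X")
print(f"mate-pair:  {mate_pair:,.0f}X")
print(f"total:      {paired_end + mate_pair:,.0f}X")
```

The long single-end reads (about 50X) come on top of this total.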


    • #3
      Hi kmcarr,
      You are right. How can I select a subset of reads for the assembly?


      • #4
        Here is a past thread with ideas/scripts for sub-sampling data. http://seqanswers.com/forums/showthread.php?t=16505
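
As a rough illustration of one common approach (not necessarily what the linked thread recommends), the Python sketch below keeps each read pair with a fixed probability, so the two mate files stay in sync. It assumes uncompressed FASTQ with exactly four lines per record and both files listing mates in the same order; the function names and the default seed are made up for this example.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one 4-line FASTQ record at a time as a list of lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction`; return pairs kept.

    Assumes both input files list mates in the same order.
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            # One random draw per pair, so both mates are kept or dropped together.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

A call might look like `subsample_pairs("reads_1.fastq", "reads_2.fastq", "sub_1.fastq", "sub_2.fastq", fraction=0.01)`, where the file names are placeholders. Because each pair is drawn independently, the number of pairs kept is only approximately `fraction` times the input.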


        • #5
          Thank you both, that was very helpful


          • #6
            Hi again,

            So I took a subsample of paired-end reads (approx 10%) and now I am using a powerful computer with 64 cores and 512GB RAM. I am running SOAPdenovo-127mer with the new subset of data plus the mate pairs and the single end reads. I set the option to -all to go as far as scaffolding.

            After running for 18 hours, SOAPdenovo has advanced only as far as it did last time on the "smaller" computer. When I check the usage, it says that SOAPdenovo is using 900% CPU but only 1.4% memory. Is there a way I can allocate more memory for this process?

            Has anyone experienced such running times with SOAPdenovo on bacterial genomes?

            Thanks


            • #7
              If I'm reading correctly, you only subsampled the paired end reads, going from 30Gbp to 3Gbp. Add to that your 6Gbp of mate pairs and you have 9Gbp, which is still 2,250X coverage, plus your long reads. This is still way, way too much. As I said above, you should only be using 30-50X TOTAL coverage when trying a de novo assembly of a bacterium. Beyond that amount of input, throwing more data at your assembly is not going to help.
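
To put a number on how much further the data would need to be cut, here is a small Python sketch. It assumes the same fraction is applied uniformly to the short-read libraries and leaves the long reads out of the calculation; `keep_fraction` is an illustrative name, not part of any tool.

```python
def keep_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep so coverage drops to the target (capped at 1)."""
    return min(1.0, target_coverage / current_coverage)


current = 2250  # short-read coverage after subsampling only the paired-end reads

for target in (30, 50):
    frac = keep_fraction(current, target)
    print(f"{target}X target: keep {frac:.2%} of the reads")
```

In other words, only a percent or two of the current 9Gbp would be kept to land in the 30-50X range.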


              • #8
                I will try to reduce it even further then. According to what you say, I will have to sacrifice data in the mate-pair and long-read sets as well. I wonder why the sequencing company went so deep?

                The thing is that SOAPdenovo only uses the mate pairs for scaffolding, so at the moment it is only producing contigs. What surprises me is that ABySS took only a few hours to produce contigs and scaffolds (with the big data set), while other posts claim that ABySS requires more memory than SOAPdenovo. I am not having the same experience.
                What I am wondering now is whether the memory usage of SOAPdenovo (1.4%) could somehow be increased.


                • #9
                  Originally posted by illinu
                  I wonder why the sequencing company went so deep?
                  You did not discuss the coverage needs with the provider before you sent the sample in for sequencing?

                  I hope they did not charge you by the base.

                  Originally posted by illinu
                  What I am wondering now is whether the memory usage of SOAPdenovo (1.4%) could somehow be increased.
                  In general programs will use as much memory as they need. Increasing the amount of RAM available would only be a consideration if the program was aborting due to a lack of memory (not a problem in your case).

                  You can bring in the data for mate pairs and long reads in some sort of staged manner. Assemble the short reads first and then use the other info to close the gaps that remain.


                  • #10
                    Originally posted by GenoMax
                    You did not discuss the coverage need with the provider before you sent the sample in for sequencing?
                    Hi GenoMax,
                    I only came into contact with this data while trying to lend a hand, so I was not involved in discussing the sequencing for this project. Unfortunately, I have to work with what I have, which is essentially raw data. However, I have just subsampled all the files to get a total coverage of around 100x. I got input from other groups assembling bacterial genomes, for whom such a final coverage apparently worked fine.

                    Just to add to my experience with SOAPdenovo: I am running it again on the subset of data (100x), and it is still running and not making great progress in terms of speed. I ran Ray on the "smaller" computer (8 cores, 33GB RAM) and it gave me an impressive 82 contigs in 9 minutes!!

                    I will leave SOAPdenovo running overnight, but I am definitely not happy about how slow it is.

                    Thanks for your help


                    • #11
                      Velvet (http://www.ebi.ac.uk/~zerbino/velvet/) works pretty well for bacterial genomes (based on previous discussions on here). Give that a try with the short reads only.
