
**kmcarr** · 08-09-2013, 02:26 AM

Originally posted by illinu

I used 100 bp paired-end reads (~300,000,000 reads total), 100 bp mate-pair reads (~60,000,000 reads total), and long single-end reads (~3 kb) at 50x coverage. SOAPdenovo has now been running for 48 hours on 64-bit Linux with 8 CPUs and 33 GB RAM.

You have ~30 Gbp of paired-end data and ~6 Gbp of mate-pair data; 36 Gbp total to assemble a 4 Mbp genome??? That is 9,000X coverage (plus your 50X coverage of long reads). Beyond a certain level of coverage you will not improve an assembly, and in fact it is likely that you will make the assembly worse. The more reads you have, the more random sequencing errors you have, which the assembler will have to resolve. While I don't use SOAPdenovo myself, I'm not surprised that it is taking forever with the extreme depth you are using.

In my experience, the assembly of a bacterial genome will not improve beyond 30-50X total coverage.
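
The coverage arithmetic above can be written out in a few lines of Python. This is only a minimal sketch: the function name `coverage` is made up for illustration, and the read counts, the 100 bp read length, and the 4 Mbp genome size are the approximate figures quoted in this thread.

```python
def coverage(num_reads, read_length_bp, genome_size_bp):
    """Return the average depth of coverage (X) for a set of reads."""
    return num_reads * read_length_bp / genome_size_bp


# Approximate figures quoted in the thread (assumed, not measured).
GENOME_SIZE_BP = 4_000_000

paired_end = coverage(300_000_000, 100, GENOME_SIZE_BP)
mate_pair = coverage(60_000_000, 100, GENOME_SIZE_BP)
total = paired_end + mate_pair

print(paired_end)  # 7500.0
print(mate_pair)   # 1500.0
print(total)       # 9000.0
```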

**illinu** · 08-09-2013, 03:11 AM

Originally posted by kmcarr

You have ~30 Gbp of paired-end data and ~6 Gbp of mate-pair data; 36 Gbp total to assemble a 4 Mbp genome??? That is 9,000X coverage (plus your 50X coverage of long reads). Beyond a certain level of coverage you will not improve an assembly, and in fact it is likely that you will make the assembly worse. The more reads you have, the more random sequencing errors you have, which the assembler will have to resolve. While I don't use SOAPdenovo myself, I'm not surprised that it is taking forever with the extreme depth you are using.

In my experience, the assembly of a bacterial genome will not improve beyond 30-50X total coverage.

Hi kmcarr
You are right. How can I select a subset of reads for the assembly?

**GenoMax** · 08-09-2013, 03:18 AM

Originally posted by illinu

Hi kmcarr
You are right. How can I select a subset of reads for the assembly?

Here is a past thread with ideas/scripts for sub-sampling data. http://seqanswers.com/forums/showthread.php?t=16505
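
For readers who cannot reach that thread, one common way to sub-sample paired reads is to make a single random keep/discard decision per pair, so both mates stay in sync. The sketch below is not one of the scripts from the linked thread. It assumes uncompressed FASTQ files with exactly four lines per record and the two mate files in the same order; the names `read_fastq` and `subsample_pairs` are hypothetical.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as lists of 4 lines (assumes 4-line records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return pairs kept.

    One random draw is made per pair, so mate 1 and mate 2 are always
    kept or dropped together.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

The number of pairs kept is only approximately `fraction` times the input, since each pair is sampled independently; for coverage reduction that is usually close enough.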

**illinu** · 08-09-2013, 04:37 AM

Thank you both, that was very helpful

**illinu** · 08-15-2013, 02:02 AM

Hi again,

So I took a subsample of paired-end reads (approx 10%) and now I am using a powerful computer with 64 cores and 512GB RAM. I am running SOAPdenovo-127mer with the new subset of data plus the mate pairs and the single end reads. I set the option to -all to go as far as scaffolding.

After 18 hours of running, SOAPdenovo has advanced only as far as it did last time on the "smaller" computer. When I check the usage, it says that SOAPdenovo is using 900% CPU but only 1.4% memory. Is there a way I can allocate more memory to this process?

Has anyone experienced such running times with SOAPdenovo on bacterial genomes?

Thanks

**kmcarr** · 08-15-2013, 02:36 AM

Originally posted by illinu

Hi again,

So I took a subsample of paired-end reads (approx 10%) and now I am using a powerful computer with 64 cores and 512GB RAM. I am running SOAPdenovo-127mer with the new subset of data plus the mate pairs and the single end reads. I set the option to -all to go as far as scaffolding.

After 18 hours of running, SOAPdenovo has advanced only as far as it did last time on the "smaller" computer. When I check the usage, it says that SOAPdenovo is using 900% CPU but only 1.4% memory. Is there a way I can allocate more memory to this process?

Has anyone experienced such running times with SOAPdenovo on bacterial genomes?

Thanks

If I'm reading correctly, you only subsampled the paired-end reads, going from 30 Gbp to 3 Gbp. Add to that your 6 Gbp of mate pairs and you have 9 Gbp, which is still 2,250X coverage, plus your long reads. This is still way, way too much. As I said above, you should only be using 30-50X TOTAL coverage when attempting a de novo assembly of a bacterial genome. Beyond that amount of input, throwing more data at your assembly is not going to help.
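
To turn a target depth into a sampling fraction, divide the target coverage by the current coverage. The sketch below assumes the figures quoted in this thread (9 Gbp of remaining short-read data, a 4 Mbp genome) and a 50X target; the function name `subsample_fraction` is hypothetical.

```python
def subsample_fraction(total_bp, genome_size_bp, target_coverage):
    """Return the fraction of reads to keep to reach `target_coverage`.

    The result is capped at 1.0, since you cannot keep more reads
    than you have.
    """
    current_coverage = total_bp / genome_size_bp
    return min(1.0, target_coverage / current_coverage)


# Figures from the thread: 3 Gbp paired-end + 6 Gbp mate-pair, 4 Mbp genome.
fraction = subsample_fraction(9_000_000_000, 4_000_000, 50)
print(f"keep about {fraction:.1%} of the reads")
```

With these numbers the fraction comes out at roughly 2%, which shows how far the 10% subsample of the paired-end reads alone still is from the 30-50X range.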

**illinu** · 08-15-2013, 04:34 AM

Originally posted by kmcarr

If I'm reading correctly, you only subsampled the paired-end reads, going from 30 Gbp to 3 Gbp. Add to that your 6 Gbp of mate pairs and you have 9 Gbp, which is still 2,250X coverage, plus your long reads. This is still way, way too much. As I said above, you should only be using 30-50X TOTAL coverage when attempting a de novo assembly of a bacterial genome. Beyond that amount of input, throwing more data at your assembly is not going to help.

I will try to reduce it even further then. According to what you say, I will have to sacrifice data in the mate-pair and long-read sets. I wonder why the sequencing company went so deep?

The thing is that SOAPdenovo only uses the mate pairs for scaffolding, so at the moment it is only producing contigs. What surprises me is that ABySS took only a few hours to produce contigs and scaffolds (with the big data set), even though other posts claim that ABySS requires more memory than SOAPdenovo. That has not been my experience.
What I am wondering now is whether SOAPdenovo's memory usage (1.4%) could somehow be increased.

**GenoMax** · 08-15-2013, 04:57 AM

Originally posted by illinu

I wonder why the sequencing company went so deep?

You did not discuss the coverage need with the provider before you sent the sample in for sequencing?

I hope they did not charge you by the base

Originally posted by illinu

What I am wondering now is whether SOAPdenovo's memory usage (1.4%) could somehow be increased.

In general, programs will use as much memory as they need. Increasing the amount of RAM available would only be a consideration if the program were aborting due to a lack of memory (not a problem in your case).

You can bring in the mate-pair and long-read data in a staged manner: assemble the short reads first, and then use the other data to close the gaps that remain.

**illinu** · 08-15-2013, 10:33 AM

Originally posted by GenoMax

You did not discuss the coverage need with the provider before you sent the sample in for sequencing?

Hi GenoMax,
I only came into contact with this data while trying to lend a hand, so I was not involved in discussing the sequencing for this project. Unfortunately, I have to work with what I have, which is roughly raw data. However, I have just subsampled all the files to get a total coverage of around 100x. I got input from other groups assembling bacterial genomes, and this final coverage apparently worked fine for them.

Just to add to my experience with SOAPdenovo: I am running it again on the 100x subset, and it is still running and not making much progress. I ran Ray on the "smaller" computer (8 cores, 33GB RAM) and it gave me an impressive 82 contigs in 9 minutes!!

I will leave SOAPdenovo running overnight, but I am definitely not happy about how slowly it works.

Thanks for your help

**GenoMax** · 08-15-2013, 10:38 AM

Velvet (http://www.ebi.ac.uk/~zerbino/velvet/) works pretty well for bacterial genomes (based on previous discussions on here). Give that a try with the short reads only.

SOAPdenovo running for 48 h?
