**Brian Bushnell** · 05-15-2014, 09:30 AM

Are your reads trimmed? They should be, at least for adapter removal. If you then set the length cutoff at 250, you are probably using very few reads and have too little data to assemble.

I would generally expect the negative correlation you observe, down to a cutoff equal to your kmer size, particularly when you don't have very much data - you only had around 150x coverage before trimming, and if you throw away a lot of it, you will drop too low. Don't set a length cutoff unless you have plenty of data. In your case even 63bp reads might be useful.

And don't worry about the L50 being lower than the read length... contigs come from the kmer graph, which is not directly related to individual reads, so that's not a bug, it's a feature.

**ju.lie** · 05-16-2014, 04:27 AM

Thank you for your quick reply!

I'm not sure if I understand you correctly - from your answer it sounds as if the read length cutoff specifies the minimum length reads should have to be included in the assembly.
From the manual and the example.config file given on the SOAP website I had been assuming that the rd_len_cutoff indicates, for each read, the position of the last base used for assembly (i.e. the position after which the reads will be cut).

**mastal** · 05-16-2014, 04:49 AM

I think you're right. My understanding of the SOAPdenovo web page is that rd_len_cutoff is the position of the last base kept in the read; SOAPdenovo trims off all the bases after that point.

In that case, if you set the rd_len_cutoff to 250, nothing will be trimmed from the 3' ends of the reads, which often have low quality bases, so that could explain why the assembly is worse.
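To make the two readings of the cutoff concrete, here is a minimal Python sketch. It is not SOAPdenovo code; the function names and the toy reads are made up for illustration. `truncate_reads` follows the interpretation from the manual discussed above (every read is kept and cut at the cutoff position), while `filter_reads` follows the other reading (reads shorter than the cutoff are discarded).

```python
def truncate_reads(reads, cutoff):
    """Keep every read, but only its first `cutoff` bases (hypothetical helper)."""
    return [read[:cutoff] for read in reads]


def filter_reads(reads, min_len):
    """Drop reads shorter than `min_len`, leave the rest untouched (hypothetical helper)."""
    return [read for read in reads if len(read) >= min_len]


# Toy reads of different lengths, purely illustrative.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGTACGT"]

truncated = truncate_reads(reads, 8)  # all three reads survive, none longer than 8
filtered = filter_reads(reads, 8)     # the 6 bp read is discarded

print(truncated)
print(filtered)
```

With truncation, a cutoff equal to the full read length changes nothing, which matches the point above that a cutoff of 250 leaves the 3' ends untrimmed.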

In your original post you mentioned that you are working with simulated MiSeq data.

How are you simulating the MiSeq data? What software are you using? Do you end up with reads that are all the same length (250) or not?

**ju.lie** · 05-16-2014, 06:42 AM

Again a prompt reply, thank you!

Yes, all sequences are of the same length, and they were generated using ART.

Initially, I had the same thought as you. That's why I conducted a series of runs with varying rd_len_cutoff. Yet, to me, these didn't explain why results from rd_len_cutoff = 175 with

total scaffold length: 35,545,564
average scaffold length: 9,007
N50: 14,637

would still be far worse than results from rd_len_cutoff = 150 with

total scaffold length: 36,161,245
average scaffold length: 12,362
N50: 25,376
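For readers comparing the N50 values above, this is a small Python sketch of the usual N50 definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. It is an illustration only; assemblers may differ in details such as minimum length filters, and the example lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Invented scaffold lengths, not taken from the assemblies in this thread.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```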

You're right, of course: in most cases, base quality decreases towards the 3' end. But judging from the FastQC report for my data (see attachment), I wouldn't have expected the assembly to be affected so distinctly. Even though the 25 bases omitted in the example above are of lower quality, I would still have thought that the overall quality is rather good and that the loss of information from trimming should lead to at least similar results in both cases.

Moreover, Velvet, which also works on De Bruijn graphs, performs perfectly well on the set of 250 bp sequences. Do you happen to know which computational difference between these two programs might lead to such divergent results?

Attached Files

per_base_quality.png (9.7 KB, 30 views)

**mastal** · 05-16-2014, 06:59 AM

I agree your per base quality seems pretty good, which is not what you would see in real data.

I have used velvet quite a bit, but have never used Soapdenovo or ART.

With real data you would also get adapters at the 3' ends of some reads, and with long MiSeq PE reads, sometimes R1 and R2 overlap. Does ART add adapters to some reads or not, and what is the average insert length for the simulated reads?

Of course, making the reads shorter also decreases the coverage. I don't know about SOAPdenovo, but Velvet doesn't do so well with very high coverage.
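The effect of the cutoff on coverage is simple arithmetic, sketched here in Python. Base coverage is reads × read length / genome size, and a read of length L contributes L - k + 1 kmers, so kmer coverage is base coverage × (L - k + 1) / L. The 150x starting coverage comes from earlier in the thread; the genome size and k = 63 are assumptions made only for this example.

```python
def base_coverage(num_reads, read_len, genome_size):
    """Average number of reads covering each base."""
    return num_reads * read_len / genome_size


def kmer_coverage(base_cov, read_len, k):
    """Average kmer coverage; each read of length L yields L - k + 1 kmers."""
    return base_cov * (read_len - k + 1) / read_len


GENOME_SIZE = 36_000_000  # assumed genome size, not stated in the thread
NUM_READS = 21_600_000    # chosen so that 250 bp reads give 150x
K = 63                    # assumed kmer size

for cutoff in (250, 175, 150):
    cov = base_coverage(NUM_READS, cutoff, GENOME_SIZE)
    print(cutoff, round(cov, 1), round(kmer_coverage(cov, cutoff, K), 1))
```

Under these assumptions, cutting 250 bp reads to 150 bp takes base coverage from 150x to 90x and roughly halves the kmer coverage.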

**Wallysb01** · 05-16-2014, 07:05 AM

Have you looked at what the kmer frequency plots are doing as you change the read length? Those can be helpful in diagnosing a problem. And have you tried using SOAP’s error correction method?
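A kmer frequency plot can be approximated for a small read sample with a few lines of Python. This is a rough sketch, not what SOAPdenovo or a dedicated kmer counter does internally: it holds everything in memory and does not merge reverse-complement kmers. The `cutoff` argument is a hypothetical stand-in for truncating reads before counting.

```python
from collections import Counter


def kmer_spectrum(reads, k, cutoff=None):
    """Map kmer multiplicity -> number of distinct kmers seen that many times."""
    counts = Counter()
    for read in reads:
        if cutoff is not None:
            read = read[:cutoff]  # mimic truncating each read at the cutoff
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


# Toy input; real data would be streamed from a FASTQ file.
reads = ["ACGTACGT", "ACGTACGT"]
print(sorted(kmer_spectrum(reads, 4).items()))
print(sorted(kmer_spectrum(reads, 4, cutoff=6).items()))
```

Plotting multiplicity against the number of distinct kmers for each cutoff shows whether the main coverage peak shifts or the low-multiplicity error peak grows as reads get longer.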

**Brian Bushnell** · 05-16-2014, 08:36 AM

Originally posted by ju.lie:

Thank you for your quick reply!

I'm not sure if I understand you correctly - from your answer it sounds as if the read length cutoff specifies the minimum length reads should have to be included in the assembly.
From the manual and the example.config file given on the SOAP website I had been assuming that the rd_len_cutoff indicates, for each read, the position of the last base used for assembly (i.e. the position after which the reads will be cut).

Sorry, my answer was completely wrong; I misinterpreted the meaning of that flag. Since you are using synthetic data of high quality, I can't explain the results unless the reads have adapters inserted, or an intentional base-composition bias toward the end of the reads, or there is some problem with the read generator. I suggest you post a base-composition by position plot.
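For anyone wanting to produce the base-composition-by-position data without extra tools, here is a minimal Python sketch. It assumes the reads are already loaded as plain strings; the function name and the toy reads are illustrative only.

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read position, the fraction of A, C, G, T and N."""
    if not reads:
        return []
    max_len = max(len(read) for read in reads)
    columns = [Counter() for _ in range(max_len)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            columns[pos][base] += 1
    composition = []
    for column in columns:
        total = sum(column.values())
        composition.append({base: column[base] / total for base in "ACGTN"})
    return composition


# Toy reads; real input would come from the simulated FASTQ files.
comp = base_composition(["ACGT", "ACGA", "TCGA"])
for pos, fractions in enumerate(comp, start=1):
    print(pos, fractions)
```

A flat profile along the read is what unbiased simulated data should give; a drift in any base toward the 3' end, or a sudden shift at a fixed position, would point to inserted adapters or a composition bias in the read generator.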

SOAPdenovo2: rd_len_cutoff and assembly quality
