  • wired assembly result--abysmally short contigs

    Hi, all:

We sequenced two Bacillus strains to about 300X coverage using Illumina MiSeq paired-end sequencing. Our library was built with the Nextera XT kit, which produces a wide range of fragment sizes: the average fragment size is about 1,000 bp, ranging from 300 to 3,000 bp.
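For scale, the coverage arithmetic can be sketched in a few lines of Python (the ~4.2 Mb genome size is an assumption for a typical Bacillus strain; adjust for the actual strains):

```python
# Rough sequencing-coverage arithmetic for a 2x250 bp MiSeq run.
# Genome size is an assumption (typical Bacillus ~4.2 Mb).
genome_size = 4_200_000      # bp, assumed
read_length = 250            # bp per read
reads_per_pair = 2

def pairs_needed(target_coverage):
    """Read pairs required to reach a given fold-coverage."""
    return target_coverage * genome_size / (read_length * reads_per_pair)

print(round(pairs_needed(300)))  # -> 2520000 pairs for 300x
```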

I ran the Velvet assembler with these command lines:

Code:
./bin/velvet_1.2.10/velveth out 33 -fastq -short ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq
velvetg out -exp_cov auto -ins_length 1044 -scaffolding yes

The contig length distribution is as below:

Length (bp)  Count
100-199      199670
200-299      9825
300-399      287
400-499      73
500-599      24
600-699      18
700-799      9
800-899      2
900-999      4
1000-1099    1
1100-1199    2
1200-1299    2
Extremely short contigs, especially considering the reads are 250 bp long!
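From that distribution one can approximate the N50 with a short script (using bin midpoints, which is only an approximation):

```python
# Approximate assembly stats from the binned contig-length distribution above.
# Using each bin's midpoint is an approximation, but enough to see the problem.
bins = {  # (low, high): count, taken from the Velvet output above
    (100, 199): 199670, (200, 299): 9825, (300, 399): 287,
    (400, 499): 73, (500, 599): 24, (600, 699): 18,
    (700, 799): 9, (800, 899): 2, (900, 999): 4,
    (1000, 1099): 1, (1100, 1199): 2, (1200, 1299): 2,
}

def approx_n50(bins):
    """N50: length L such that contigs >= L hold half of all assembled bases."""
    lengths = []
    for (lo, hi), count in bins.items():
        lengths.extend([(lo + hi) // 2] * count)
    lengths.sort(reverse=True)
    half = sum(lengths) / 2
    running = 0
    for l in lengths:
        running += l
        if running >= half:
            return l

print(approx_n50(bins))  # -> 149: almost all assembled bases sit in 100-199 bp contigs
```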

1) First I suspected the sequenced samples might be contaminated with DNA from other organisms, so I mapped the reads to a close reference: 90% of reads map with Bowtie under default parameters. We were also afraid the sequencing might be biased, so we plotted the coverage depth distribution (shown in the attached Rplot.jpeg). Though some loci are strongly biased and were sequenced up to 7,000-fold, they are only a small proportion.

2) We also plotted the k-mer distribution to screen for problematic repeats or sequencing errors (shown in the attached histogram-k29.histo.pdf), but the figure looks normal.

What would you do if you ran into these problems?

  • #2
My first guess (based on seeing your histogram-k29 chart):

Your coverage is very high, and that is making the assembler believe that some noisy k-mers are real. One way to check whether that is really the case: try assembling only 1/6th of the reads and see whether the quality goes up. Alternatively, you can assemble the entire set but play with the Velvet parameter that filters out k-mers below a certain frequency (-cov_cutoff in velvetg).
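A sketch of what that 1/6 subsampling has to do (a tool like seqtk sample does this in practice; the key point is applying the same random decisions to both mates so pairing is preserved):

```python
import random

def subsample_pairs(r1_records, r2_records, fraction=1/6, seed=42):
    """Keep the same random subset of records from both mates so pairs
    stay in sync. Each record stands for one FASTQ entry; r1_records and
    r2_records must be in matching order."""
    rng = random.Random(seed)
    keep = [rng.random() < fraction for _ in r1_records]
    r1_out = [rec for rec, k in zip(r1_records, keep) if k]
    r2_out = [rec for rec, k in zip(r2_records, keep) if k]
    return r1_out, r2_out
```

With seqtk you would run `seqtk sample` on each mate file with the same seed for the same effect.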

    Let's see the results, and then we can brainstorm other possibilities/solutions.

P.S. Weird, not wired. We call it wired when the assembly works
    http://homolog.us



    • #3
      A couple of things could be going on here.

First, make sure that your reads have had the Nextera adapter sequence trimmed off. When setting up the sample sheet for sequencing XT libraries, adapter trimming should normally be turned on, but as some users have found out, some core facilities don't do that. For any reads that are longer than the sequenced fragment, the read-through adapter sequence will either cause the assembler to choke and die or give lots of small contigs because it can't properly resolve the graph.
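The core of adapter clipping is just cutting the read at the transposase sequence; a bare-bones Python sketch (real trimmers like Trimmomatic or cutadapt also handle partial and error-containing adapter matches):

```python
# Minimal 3'-adapter clipping: if the read runs through the insert into the
# Nextera transposase sequence, cut at its first occurrence.
NEXTERA = "CTGTCTCTTATACACATCT"  # Nextera read-through adapter

def clip_adapter(read, adapter=NEXTERA):
    i = read.find(adapter)
    return read if i == -1 else read[:i]

print(clip_adapter("ACGTACGTACGT" + NEXTERA + "GGG"))  # -> ACGTACGTACGT
```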

If you're sure that the adapter sequences have been trimmed, try merging reads that can be overlapped into single reads. That will remove some of the range of the pair distance, and if you feed in a mix of merged single reads and paired reads, you should get a better assembly than with just paired reads at a large pair distance. There are a bunch of programs that can do the merging for you; we use SeqPrep, but FLASH and numerous others are widely used as well.
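The overlap-merge idea, roughly what SeqPrep and FLASH do (they additionally weigh base qualities and use tuned thresholds; min_overlap and max_mismatch_frac here are illustrative values):

```python
def merge_pair(r1, r2_revcomp, min_overlap=15, max_mismatch_frac=0.1):
    """Try to merge a pair whose fragment is shorter than 2x the read length.
    r2_revcomp is read 2 already reverse-complemented. Returns the merged
    sequence, or None if no acceptable overlap is found."""
    for olen in range(min(len(r1), len(r2_revcomp)), min_overlap - 1, -1):
        a, b = r1[-olen:], r2_revcomp[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * olen:
            return r1 + r2_revcomp[olen:]  # take the longest acceptable overlap
    return None

print(merge_pair("AACCGGTTAACCGGTTAACC", "GGTTAACCGGTTAACCTTTT"))
# -> AACCGGTTAACCGGTTAACCTTTT
```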

It doesn't look like you did any quality trimming, based on the commands you listed (P06_S1_L001_R1_001.fastq would be a direct output file from MiSeq Reporter, based on its name), so you should probably give that a try. With 300x coverage, bad reads and bad parts of reads will prevent Velvet from collapsing a lot of bubbles in the graph, which will result in lots of small contigs. Additionally, increasing the k-mer size after quality trimming should help the assembly as well. You can also try using khmer to down-sample/normalize your data so that erroneous reads/k-mers don't cause issues.
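The down-sampling khmer does is "digital normalization"; a rough sketch of the idea (khmer itself uses a probabilistic counting structure rather than an exact dictionary, but the decision rule is the same):

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, as seen so far,
    is still below the cutoff; otherwise the coverage it represents is
    already saturated and the read is discarded."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept
```

Identical high-coverage reads stop being kept after the cutoff is reached, flattening the 300x peaks while keeping rare (real) sequence.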



      • #4
        Originally posted by chjp0632 View Post

I ran the Velvet assembler with these command lines:

Code:
./bin/velvet_1.2.10/velveth out 33 -fastq -short ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq
velvetg out -exp_cov auto -ins_length 1044 -scaffolding yes

If you want Velvet to treat the reads as paired, you need to use the flags -shortPaired and -separate.

        Code:
        ./bin/velvet_1.2.10/velveth out 33 -fastq -shortPaired -separate ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq



        • #5
          Originally posted by samanta View Post
My first guess (based on seeing your histogram-k29 chart):

Your coverage is very high, and that is making the assembler believe that some noisy k-mers are real. One way to check whether that is really the case: try assembling only 1/6th of the reads and see whether the quality goes up. Alternatively, you can assemble the entire set but play with the Velvet parameter that filters out k-mers below a certain frequency (-cov_cutoff in velvetg).

          Let's see the results, and then we can brainstorm other possibilities/solutions.

P.S. Weird, not wired. We call it wired when the assembly works
I subsampled the sequencing data to 300k read pairs, 500k, 1 million, 2M, 3M, and 4M. The SOAPdenovo results are shown below:

#ReadPairs  300k   500k   1m     2m    3m   4m
N50         13740  24563  16256  2721  941  522

The result indicates that 500k pairs of 250 bp reads produce the most contiguous contigs with SOAPdenovo. Velvet's performance on these data sets shows a similar trend. Therefore, 5 million sequencing reads may be more than enough.



          • #6
            Originally posted by mcnelson.phd View Post
            A couple of things could be going on here.

If you're sure that the adapter sequences have been trimmed, try merging reads that can be overlapped into single reads. That will remove some of the range of the pair distance, and if you feed in a mix of merged single reads and paired reads, you should get a better assembly than with just paired reads at a large pair distance. There are a bunch of programs that can do the merging for you; we use SeqPrep, but FLASH and numerous others are widely used as well.

It doesn't look like you did any quality trimming, based on the commands you listed (P06_S1_L001_R1_001.fastq would be a direct output file from MiSeq Reporter, based on its name), so you should probably give that a try. With 300x coverage, bad reads and bad parts of reads will prevent Velvet from collapsing a lot of bubbles in the graph, which will result in lots of small contigs. Additionally, increasing the k-mer size after quality trimming should help the assembly as well. You can also try using khmer to down-sample/normalize your data so that erroneous reads/k-mers don't cause issues.
1) Merging the reads into longer single reads is a good suggestion, but I am also worried about false merges. Can SeqPrep and FLASH distinguish true read extensions from false ones?

2) Read trimming and down-sampling can be skipped if we use the SPAdes assembler. I used it and saved a lot of time. It automatically assembles with several different k-mer sizes and merges the results.

3) Compared with SOAPdenovo and Velvet, our results show SPAdes performed best on contig contiguity for this data set.



            • #7
WRT FLASH, SeqPrep, etc. -- you will get a lot of valid merges. You will also get some false merges if the reads overlap repeats in an unfortunate way.

I would also recommend trying a k-mer-based error corrector such as MUSKET.

A lot of people like the SPAdes assembler, which has a cleanup step built in. I have had very good luck with Ray. Velvet is the old workhorse, but many newer programs deliver more contiguous assemblies.

