**Brian Bushnell** · 04-14-2015, 08:54 AM

The optimal kmer length should be less than the read length.

Other than that, there are not really any strict rules; generally, the longer the reads and the higher the coverage, the longer the kmer you can use. SPAdes, as far as I know, does not select a kmer length, but rather makes a combined assembly (by default) using multiple pre-selected kmer lengths of 21, 33, and 55. I think these values are low and a higher value, particularly for the max, would be better, at least for 150bp reads and good coverage.

We have not had a good experience with tools that automatically try to determine the best kmer length based on kmer frequency histograms, and I don't think there is any theoretical validity to that approach. Running multiple assemblies with different kmer lengths, and selecting the one with the best metrics, seems like the best approach - at least, if you are using a fast assembler, like Velvet. SPAdes is too slow for that approach.
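The "run several kmer lengths and keep the best one" step can be sketched in a few lines. This is only an illustration, not taken from any assembler: the function names `n50` and `pick_best_k` are made up, the contig lengths are invented, and ranking by N50 with contig count as a tie-breaker is just one possible choice of metrics.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    cover at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def pick_best_k(assemblies):
    """Pick the k whose assembly has the highest N50, breaking ties
    in favour of the lower contig count.

    `assemblies` maps k -> list of contig lengths (hypothetical input;
    in practice these would be parsed from each assembly's FASTA file).
    """
    return max(
        assemblies,
        key=lambda k: (n50(assemblies[k]), -len(assemblies[k])),
    )


# Made-up contig lengths, for illustration only.
example = {
    21: [400, 300, 200, 100],
    55: [600, 300, 100],
    77: [700, 300],
}
print(pick_best_k(example))
```

N50 and contig count say nothing about misassemblies, so where a reference genome exists it is worth checking the winning assembly against it as well.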

**Chipper** · 04-14-2015, 09:10 AM

You can restart SPAdes with additional k-mer sizes if you think the assembly is not good enough. It combines the results from multiple k-mers; to me that seems like the best approach.
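As a rough sketch of what such a restart could look like, the helper below builds (but does not run) a command line. It assumes the `--restart-from k<int>` checkpoint syntax described in the SPAdes manual, and that `-o` points at the output directory of the earlier run; the helper name and the directory name `spades_out` are hypothetical, so check the manual for your version before relying on it.

```python
def spades_restart_command(output_dir, k_values, restart_from_k):
    """Build (not run) a spades.py command that restarts an existing run.

    Assumption: the --restart-from k<int> checkpoint syntax from the
    SPAdes manual. output_dir must be the directory of the earlier run.
    """
    k_arg = ",".join(str(k) for k in k_values)
    return [
        "spades.py",
        "--restart-from", f"k{restart_from_k}",
        "-k", k_arg,
        "-o", output_dir,
    ]


# Restart from the k=55 iteration and add k=77 on top of the defaults.
cmd = spades_restart_command("spades_out", [21, 33, 55, 77], 55)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` once the options have been checked against the installed SPAdes version.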

**milw** · 04-14-2015, 09:20 AM

I've been doing a lot of bacterial assemblies with SPAdes 3.5, and in all cases I've seen, using kmer 99 or 127 works best in terms of contig count and N50. This has been with 2x150 fragment PE plus a mate-pair library. I had one trial set of K12 data that had fewer misassemblies using K77 than with K99 or K127.

**bio_informatics** · 04-14-2015, 09:57 AM

Hi Milw,
Thank you for sharing your experience.
I do not have mate-pair data. I hope that will not make a huge difference to the choice of k-mer and the resulting assembly?

I'll definitely try k-mers 99 and 127, as you have.

**bio_informatics** · 04-14-2015, 10:02 AM

Hi Chipper,
Oh, yes; I was unaware of this feature. Thank you for reminding me.

**bio_informatics** · 04-14-2015, 10:17 AM

Hi Brian,
Thanks for your valuable points.

Definitely, the k-mer won't be the read length, (un)fortunately :P
That's correct, SPAdes makes a combined assembly from the k-mer sizes used.
But then again, the k-mer sizes used are governed by the read length. Hence my question.

What I wanted to understand: should I let SPAdes choose its k-mer sizes itself, which it does based on read length? Or should I check the read length and, based on that, set them myself (example from its documentation):

spades.py -k 21,33,55,77 -

As milw and Chipper suggested, I will try the approaches they mentioned.
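For picking the list by hand, a small sanity check before launching a run might look like the sketch below. The rules encoded here are assumptions taken from the SPAdes manual's description of `-k` (odd values, below 128, ascending order) plus the "shorter than the read" rule from this thread; the function name `check_k_values` is made up.

```python
def check_k_values(k_values, read_length):
    """Return a list of problems with a proposed k list (empty if none).

    Assumed rules: every k is odd, below 128, and below the read
    length, and the list is in strictly ascending order.
    """
    problems = []
    for k in k_values:
        if k % 2 == 0:
            problems.append(f"k={k} is even")
        if k >= 128:
            problems.append(f"k={k} is not below 128")
        if k >= read_length:
            problems.append(f"k={k} is not below the read length {read_length}")
    if list(k_values) != sorted(set(k_values)):
        problems.append("k values are not strictly ascending")
    return problems


# An empty list means the k values pass these checks for 150 bp reads.
print(check_k_values([21, 33, 55, 77], 150))
# A k of 127 is too long for 100 bp reads.
print(check_k_values([21, 33, 55, 127], 100))
```

Passing these checks only means the list is well-formed; whether 77, 99 or 127 actually gives the best assembly still has to be tested on the data.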

Thanks.

**milw** · 04-15-2015, 05:05 AM

Originally posted by bio_informatics:

Hi Milw,
Thank you for sharing your experience.
I do not have mate paired data. I hope that should not make a huge difference for k-mer outputs and resulting assembly?

It depends, of course, on what you're trying to assemble. I've had good luck with microbial BAC clones assembling completely with just fragment (paired-end) data; those are only ~100-150 kb. Microbial genomes will probably give you a bunch of contigs.

Here's an example from fragment-only data for a 2.3 Mb microbial genome, showing bigger contigs with increasing k-mer size:

[attached plot not preserved]

(final 'scaffolds' overlays final 'contigs' because there's no scaffolding without a mate-pair library)

cheers- Scott

**bio_informatics** · 04-20-2015, 04:32 AM

Thanks much Scott. :-)

SPAdes: selecting K-mer based on read length