Hi,
I want to use digital normalization on a set of single-cell sequencing data as well as metagenomic data from low-complexity communities. I'm probably missing some really obvious point, but I'm just not sure how to apply the recommended diginorm cutoffs to my relatively long MiSeq reads.
Both our single-cell and our low-complexity metagenomic sequencing data were produced on a MiSeq, yielding several million paired-end reads of ~250-300 bp each.
The general recommendation in the khmer documentation is to normalize to a coverage of 1x to 5x using three-pass normalization and a k-mer size of 20.
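For reference, here's roughly what I'm planning to run, pieced together from the khmer protocols as I understand them (khmer 2.x-style flags, which may differ between versions; the file names and memory limit are just placeholders):

    # pass 1: normalize to C=20 and keep the countgraph for the next step
    normalize-by-median.py -p -k 20 -C 20 -M 8e9 --savegraph graph.ct reads.pe.fq.gz
    # pass 2: trim low-abundance k-mers from the now high-coverage reads
    filter-abund.py -V graph.ct reads.pe.fq.gz.keep
    # pass 3: normalize down to C=5
    normalize-by-median.py -p -k 20 -C 5 reads.pe.fq.gz.keep.abundfilt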
My question is: are those recommendations really suited to modern "long read" Illumina data? If I reduce the coverage of all k-mers of length 20 to 5x or less, won't that reduce the coverage of larger k-mers far too drastically?
Without diginorm, the optimal k-mer size with e.g. MetaVelvet is usually around k=81-101 for my datasets. How can there be enough k-mer coverage left at those sizes for de Bruijn graph based assemblies if the k-mers of length 20 are already reduced to less than 5x coverage?
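Here's a quick back-of-the-envelope version of my worry, assuming the usual relation C_k ~ C_read * (L - k + 1) / L between per-base coverage and k-mer coverage (L = 300 bp from our reads, the rest is just arithmetic):

    # expected 101-mer coverage if the median 20-mer coverage is normalized to 5x:
    # 5 * (L - 101 + 1) / (L - 20 + 1)
    echo "5 * (300 - 101 + 1) / (300 - 20 + 1)" | bc -l    # ~3.56

So by this estimate I'd be left with roughly 3.5x at k=101, before even accounting for sequencing errors.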
My version of khmer doesn't seem to support k-mers larger than 31, so apparently larger k-mer sizes are simply not needed for diginorm. I just don't understand why...