
  • kmer size and coverage cutoff for digital normalization using the khmer suite


    I want to use digital normalization on a set of single-cell sequencing data as well as metagenomic data from low-complexity communities. I'm probably missing some really obvious point, but I'm just not sure how to apply the recommended diginorm cutoffs to my relatively long MiSeq reads.

    Both our single-cell and our low-complexity metagenomic sequencing data were produced on a MiSeq, yielding several million paired-end reads of ~250-300 bp each.

    The general recommendations in the khmer documentation state that you should normalize to a coverage of 1x to 5x using three-pass normalization and a kmer size of 20.

    My question is: are those recommendations really suited for modern "long read" Illumina data? If I reduce the kmer coverage for all kmers of length 20 to 5x or less, won't that reduce the coverage for larger kmers far too drastically?

    Without diginorm, the optimal kmer size using e.g. MetaVelvet is mostly around k=81-101 for my datasets. How can there be enough kmer coverage left at that size for de Bruijn graph based assemblies if the kmers of length 20 are already reduced to less than 5x coverage?

    My version of khmer doesn't seem to support kmers larger than 31, so apparently larger kmer sizes are simply not needed for diginorm. I just do not understand why...

  • #2
    diginorm k-mer size/coverage doesn't directly correlate with assembly parameters

    Hi jov14,

    the short answer is that because khmer/diginorm retains or rejects entire reads, the k-mer size and coverage of that process are only weakly connected with what the assembler sees and does. That having been said, we are working on increasing k size and doing things like memory efficient error correction instead, which would give you more choices.

    A slightly longer answer: what diginorm is actually doing is aligning the reads to the De Bruijn graph, and while the alignment process depends on k, the alignment itself is not so sensitive to k. Then, diginorm looks at the coverage of the alignment in the graph and decides whether to accept or reject the read. This changes the coverage from random/whole genome shotgun to systematic/smooth, which has many (often good) effects on the resulting assembly. But it also tweaks the coverage distribution - while a coverage of 5 would be disastrous for whole genome shotgun (because you'd miss ~5% of bases!) the variance on the diginormed data is much lower, so you get a reduced set of reads that still contain all the information of the original set of reads.
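The accept/reject step described above can be sketched in a few lines. This is only a toy illustration, assuming the median-k-mer-count rule that gives `normalize-by-median.py` its name: a read is kept while the median count of its k-mers is below the cutoff, and only kept reads add to the counts. The function names are made up for this example, exact `Counter` counts stand in for khmer's memory-efficient probabilistic counting, reverse complements are ignored, and skipping reads shorter than k is an assumption of the sketch.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below cutoff."""
    counts = Counter()  # exact counts; a stand-in for khmer's counting table
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            # Read shorter than k: skipped here (an assumption of this sketch).
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept
```

Because whole reads are kept or dropped, a read from a region that is already well covered is discarded, while a read whose k-mers are mostly new is always kept. That is why the output coverage is smoothed toward the cutoff rather than thinned at random.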

    I hope that helps!


    • #3
      Oh, sorry, to answer your original question:

      I would suggest running a single pass with C=20/k=20, and only doing further error trimming etc. if you are running into out-of-memory problems. We've found C=20/k=20 works pretty well for most sequence data.
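For a rough feel of what a k=20 cutoff leaves at assembly-sized k, the usual relation between base coverage C and k-mer coverage, C_k = C * (L - k + 1) / L for reads of length L, can be turned into a small calculator. This is a back-of-the-envelope sketch only: it assumes error-free reads of uniform length, the helper names are invented for the example, and k=91 is just a value picked from the k=81-101 range mentioned in the question.

```python
def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage for error-free reads: C_k = C * (L - k + 1) / L."""
    return base_coverage * (read_len - k + 1) / read_len


def rescale_kmer_coverage(cov_at_k1, read_len, k1, k2):
    """Convert a k-mer coverage measured at k1 into the expected value at k2."""
    return cov_at_k1 * (read_len - k2 + 1) / (read_len - k1 + 1)


if __name__ == "__main__":
    for read_len in (100, 250):
        cov = rescale_kmer_coverage(20, read_len, k1=20, k2=91)
        print(f"L={read_len}: 20x at k=20 is roughly {cov:.1f}x at k=91")
```

Under these idealized assumptions, 20x at k=20 corresponds to about 13.9x at k=91 for 250 bp reads, but only about 2.5x for 100 bp reads, so longer reads lose much less coverage when moving to a large assembly k.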


      • #4
        Thanks for your answer and suggestion!
        After I posted this "problem" and had some more time to think, it came back to me:
        Since, as you say, diginorm only starts to exclude reads if ALL kmers in a read already have counts higher than the cutoff, and reads are always kept if even one new kmer is present in the read, the final coverage for each individual kmer will of course be much higher than the cutoff. I simply forgot that, and my problem is really nonexistent.

        Actually, I already used three-pass normalization on previous data (where I had read lengths of 100 bp), using C=20 in the first pass and C=5 in the third (I must have picked that up in one of your tutorials somewhere).
        I usually then do two assemblies, one with first-pass-normalized data and one with third-pass-normalized data, and then just pick the assembly that looks best (at least for single-cell data, both are usually way better than with non-normalized data).

        However, would you say that for longer reads, higher kmer values would bring some advantages (I would expect that at least the identification of unique kmers for the kmer-trimming/error-correction step would be more specific), or should the values just be left as they are?


        • #5
          You can probably get slightly better performance on nasty large repetitive genomes with larger k-mers, for sure! I balance that in my lab against the point that we feel very comfortable with k=20/C=20 for transcriptomes and metagenomes based on our personal experience.

          Report back if you play around - I'd love to hear more!

