  • jov14
    Member
    • Oct 2014
    • 18

    kmer size and coverage cutoff for digital normalization using the khmer suite

    Hi,

    I want to use digital normalization on a set of single-cell sequencing data as well as metagenomic data from low-complexity communities. I'm probably missing some really obvious point, but I'm just not sure how to apply the recommended diginorm cutoffs to my relatively long MiSeq reads.

    Both our single-cell sequencing data and our low-complexity metagenomic data were produced on a MiSeq, yielding several million paired-end reads of ~250-300 bp each.

    The general recommendations in the khmer documentation state that you should normalize to a coverage of 1x to 5x using three-pass normalization and a k-mer size of 20.

    My question is: are those recommendations really suited for modern "long read" Illumina data? If I reduce the coverage of all k-mers of length 20 to 5x or less, won't that reduce the coverage for larger k-mers far too drastically?

    Without diginorm, the optimal k-mer size using e.g. MetaVelvet is mostly around k81-101 for my datasets. How can there be enough k-mer coverage left at that size for de Bruijn graph-based assemblies if the k-mers of length 20 are already reduced to less than 5x coverage?

    My version of khmer doesn't seem to support k-mers larger than 31, so apparently larger k-mer sizes are simply not needed for diginorm. I just don't understand why...
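For concreteness, the relationship behind this worry is simple arithmetic: a read of length L contributes L - k + 1 k-mers, so k-mer coverage falls off with k roughly as C_k ≈ C_base · (L - k + 1)/L. A quick illustrative check (not khmer code; the numbers are hypothetical) for 250 bp reads:

```python
# Illustrative arithmetic, not part of khmer: how k-mer coverage
# scales with k for a fixed per-base coverage and read length.

def kmer_coverage(c_base, read_len, k):
    """Approximate k-mer coverage: each read of length read_len
    contributes read_len - k + 1 k-mers of size k."""
    return c_base * (read_len - k + 1) / read_len

# For 100x per-base coverage and 250 bp reads:
for k in (20, 81, 101):
    print(f"k={k:>3}: ~{kmer_coverage(100, 250, k):.1f}x k-mer coverage")
```

So on 250 bp reads, going from k=20 to k=101 costs roughly a third of the k-mer coverage, which is the drop-off the question is getting at.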
  • titusbrown
    Junior Member
    • Aug 2013
    • 8

    #2
    diginorm k-mer size/coverage doesn't directly correlate with assembly parameters

    Hi jov14,

    the short answer is that because khmer/diginorm retains or rejects entire reads, the k-mer size and coverage of that process are only weakly connected with what the assembler sees and does. That having been said, we are working on increasing k size and doing things like memory efficient error correction instead, which would give you more choices.

    A slightly longer answer: what diginorm is actually doing is aligning the reads to the De Bruijn graph, and while the alignment process depends on k, the alignment itself is not so sensitive to k. Then, diginorm looks at the coverage of the alignment in the graph and decides whether to accept or reject the read. This changes the coverage from random/whole genome shotgun to systematic/smooth, which has many (often good) effects on the resulting assembly. But it also tweaks the coverage distribution - while a coverage of 5 would be disastrous for whole genome shotgun (because you'd miss ~5% of bases!) the variance on the diginormed data is much lower, so you get a reduced set of reads that still contain all the information of the original set of reads.

    I hope that helps!


    • titusbrown
      Junior Member
      • Aug 2013
      • 8

      #3
      Oh, sorry, to answer your original question:

      I would suggest running a single pass C=20/k=20, and only doing further error trimming etc if you are running into out-of-memory problems. We've found C=20/k=20 works pretty well for most sequence.
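      A single-pass run with those parameters might look like the following (a sketch based on khmer's normalize-by-median.py; the table-size settings -x/-N are placeholders you would tune to your available memory, and the file names are hypothetical):

```
normalize-by-median.py -k 20 -C 20 -p \
    -x 4e9 -N 4 \
    -o reads.keep.fq reads.pe.fq
```

      Here -p tells khmer to treat the input as interleaved paired-end reads, so both reads of a pair are kept or dropped together.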


      • jov14
        Member
        • Oct 2014
        • 18

        #4
        Thanks for your answer and suggestion!
        After I posted this "problem" and had some more time to think it over, it came back to me:
        Since, as you say, diginorm only starts to exclude reads if ALL k-mers in a read already have counts higher than the cutoff, and reads are always kept if even a single new k-mer is present, the final coverage for each individual k-mer will of course be much higher than the cutoff. I simply forgot that, so my problem is really nonexistent.
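For the record, the keep/reject decision in the published diginorm algorithm is made on the median abundance of a read's k-mers, which gives the behavior described above. A toy sketch of that rule, using a plain dict where khmer actually uses a probabilistic Count-Min Sketch:

```python
from collections import defaultdict
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    """Toy digital normalization: keep a read if the median count of
    its k-mers seen so far is below the cutoff, then count its k-mers.
    (khmer uses a fixed-memory Count-Min Sketch instead of a dict.)"""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept

# A read repeated many times is kept only until its k-mer counts
# reach the cutoff; later copies are rejected.
reads = ["ACGTACGTACGTACGTACGTACGT"] * 30
print(len(diginorm(reads, k=20, cutoff=20)))  # 20
```

Because the test is on the median, a read with even a few novel (zero-count) k-mers tends to stay below the cutoff and be retained, which is what keeps low-coverage regions intact.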

        Actually, I have already used three-pass normalization on previous data (with read lengths of 100 bp), using C=20 in the first pass and C=5 in the third (I must have picked that up in one of your tutorials somewhere).
        I usually then do two assemblies, one with the first-pass-normalized data and one with the third-pass-normalized data, and just pick the assembly that looks best (at least for single-cell data, both are usually way better than with non-normalized data).

        However, would you say that for longer reads higher k-mer values would bring some advantages (I would expect that at least the identification of unique k-mers in the k-mer-trimming/error-correction step would be more specific), or should the values just be left as they are?


        • titusbrown
          Junior Member
          • Aug 2013
          • 8

          #5
          You can probably get slightly better performance on nasty large repetitive genomes with larger k-mers, for sure! I balance that against the fact that, in my lab, we feel very comfortable with k=20/C=20 for transcriptomes and metagenomes, based on our personal experience.

          Report back if you play around - I'd love to hear more!

