Unexpected kmer occurrence distribution across 10 bacterial genomes




  • Unexpected kmer occurrence distribution across 10 bacterial genomes

    I'm trying to identify unique regions across a large collection of bacterial genomes. My first step is to use tallymer (from the genometools package) to identify unique kmers across the entire database. To familiarize myself with the tools, I started with a small set of 10 genomes. After building the suffix index, I ran tallymer occratio to get the distribution of unique and non-unique kmers per mer size, but the distribution is roughly the opposite of what I expected:

    # distribution of unique mers
    10 520
    11 1490
    12 2403
    13 2887
    14 3077
    15 3151
    16 3178
    17 3191
    18 3197
    19 3200
    20 3203
    21 3205
    22 3207

    # distribution of non unique mers (counting each non unique mer only once)
    10 689042
    11 1306536
    12 1748019
    13 1944353
    14 2012896
    15 2034879
    16 2041859
    17 2044257
    18 2045175
    19 2045601
    20 2045875
    21 2046092
    22 2046265
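
    One way to read the two tables together is to add them up per mer size, which gives the number of distinct kmers, and to compare that with the 4^k sequences that are possible at all. The following is only a minimal sketch using the numbers copied from the output above; the names `distinct_total`, `unique_fraction`, and `saturation` are made up for illustration, and the 4^k ceiling assumes a plain A/C/G/T alphabet with no ambiguity codes.

```python
# Counts copied from the tallymer occratio output above (mer size -> number of distinct kmers).
unique = {
    10: 520, 11: 1490, 12: 2403, 13: 2887, 14: 3077, 15: 3151, 16: 3178,
    17: 3191, 18: 3197, 19: 3200, 20: 3203, 21: 3205, 22: 3207,
}
non_unique = {
    10: 689042, 11: 1306536, 12: 1748019, 13: 1944353, 14: 2012896,
    15: 2034879, 16: 2041859, 17: 2044257, 18: 2045175, 19: 2045601,
    20: 2045875, 21: 2046092, 22: 2046265,
}


def distinct_total(k):
    """Distinct kmers of size k: those seen once plus those seen more than once."""
    return unique[k] + non_unique[k]


def unique_fraction(k):
    """Share of the distinct kmers of size k that occur exactly once."""
    return unique[k] / distinct_total(k)


def saturation(k):
    """Distinct kmers observed relative to the 4**k possible (assumes A/C/G/T only)."""
    return distinct_total(k) / 4 ** k


if __name__ == "__main__":
    print("k\tdistinct\tunique_frac\tsaturation")
    for k in sorted(unique):
        print(f"{k}\t{distinct_total(k)}\t{unique_fraction(k):.4%}\t{saturation(k):.3e}")
```

    Both columns count each distinct kmer once, so these totals describe how many different kmers exist at each size, not how many positions in the genomes they cover.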

    Naively, I had expected to find more instances of the smaller kmers in my test set than of the larger ones. That assumption came from the thought that as the mer size approaches the genome size, the number of possible instances goes down (down to just one 'mer' whose size is the length of the genome).

    Can anyone comment on what I am seeing, based on their own experience? My expectation rests on that one very flimsy thought (a mer the length of the genome can only occur once), and I want to make sure my results are not unexpected before moving ahead. Note that I have no reason to believe there was any problem with the execution of tallymer (or of suffixerator, which I ran before the tallymer occratio command). The jobs finished without warnings or errors and produced the expected outputs.
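
    The expectation above is about the number of kmer positions, which for a single sequence of length L is L - k + 1 and does shrink as k grows. The tables instead count distinct kmers (the second header says each non-unique mer is counted only once). The toy sketch below separates the two quantities. It is not how tallymer works internally: it counts the forward strand only, and the function name `kmer_classes` and the random test sequence are assumptions made purely for illustration.

```python
import random
from collections import Counter


def kmer_classes(seq, k):
    """Return (positions, unique, non_unique) for kmers of size k in seq.

    positions  -- number of kmer start positions, len(seq) - k + 1
    unique     -- distinct kmers that occur exactly once
    non_unique -- distinct kmers that occur more than once (each counted once)
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    return sum(counts.values()), unique, len(counts) - unique


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical stand-in for a genome: 20,000 random bases.
    toy = "".join(random.choice("ACGT") for _ in range(20000))
    print("k\tpositions\tunique\tnon_unique")
    for k in range(4, 13):
        positions, uniq, non_uniq = kmer_classes(toy, k)
        print(f"{k}\t{positions}\t{uniq}\t{non_uniq}")
```

    On a sequence like this the positions column drops by one per step, while the number of distinct kmers can keep growing with k until nearly every position carries its own kmer.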

  • #2
    I have now looked at the distribution of kmers of size 10-30 across the full set of bacterial genomes (~4600), and I'm seeing 6-8 billion kmers for each size in the range I was considering for qPCR primer length (18-22 bp). I decided to dump the actual kmers with tallymer (from genometools) to explore the data further, but the final step, tallymer search, would take an unreasonable amount of time to dump 6-8 billion kmers: after watching it run for a day, I estimate it would take several years to dump any single kmer size between 18 and 22.

    Can anyone suggest a strategy for identifying qPCR primers that can uniquely identify bacterial genomes in gut genome samples? The methods I've looked at so far (a home-brew tallymer approach, and RUCS) seem like they would work well for a smaller collection of genomes, but they don't scale to the number of genomes I want to use as a background.

