
  • Unexpected kmer occurrence distribution across 10 bacterial genomes

    I'm trying to identify unique regions across a large collection of bacterial genomes, and my first step is to use tallymer (from the genometools package) to identify unique kmers across the entire db. I started out with a small set of 10 genomes (to familiarize myself with the tools). After building the suffix index, I ran tallymer occratio to get the distribution of unique and non-unique kmers per mer-size, but the distribution is roughly the opposite of what I expected:

    # distribution of unique mers
    10 520
    11 1490
    12 2403
    13 2887
    14 3077
    15 3151
    16 3178
    17 3191
    18 3197
    19 3200
    20 3203
    21 3205
    22 3207

    # distribution of non unique mers (counting each non unique mer only once)
    10 689042
    11 1306536
    12 1748019
    13 1944353
    14 2012896
    15 2034879
    16 2041859
    17 2044257
    18 2045175
    19 2045601
    20 2045875
    21 2046092
    22 2046265
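
    To make the trend in the two tables easier to read, here is a minimal sketch that combines them. It assumes the two columns are mer-size and number of distinct mers, as the comment headers suggest; the names `occratio` and `distinct_totals` are mine and are not part of tallymer.

```python
# Counts copied from the occratio output above: mer-size -> (unique, non_unique).
occratio = {
    10: (520, 689042),
    11: (1490, 1306536),
    12: (2403, 1748019),
    13: (2887, 1944353),
    14: (3077, 2012896),
    15: (3151, 2034879),
    16: (3178, 2041859),
    17: (3191, 2044257),
    18: (3197, 2045175),
    19: (3200, 2045601),
    20: (3203, 2045875),
    21: (3205, 2046092),
    22: (3207, 2046265),
}


def distinct_totals(table):
    """Return a list of (k, distinct mers, fraction unique), sorted by k."""
    rows = []
    for k in sorted(table):
        unique, non_unique = table[k]
        distinct = unique + non_unique
        rows.append((k, distinct, unique / distinct))
    return rows


for k, distinct, frac_unique in distinct_totals(occratio):
    print(f"{k}\t{distinct}\t{frac_unique:.4%}")
```

    Both the number of distinct mers and the share of them that are unique rise with mer-size and then level off.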

    Naively, I had expected to find more instances of the smaller kmers in my test set than of the larger mer-sizes. That assumption was based on my thinking that as the mer-size approaches the genome size, the number of possible instances goes down (down to just 1 'mer' whose size is the length of the genome).

    Can anyone comment on what I am seeing based on their own experience? My assumption is based on that one very flimsy thought (a mer size of genome length can only occur once), but I wanted to make sure my results are not unexpected before moving ahead. Note that I have no reason to believe there was any problem with the execution of tallymer (or suffixerator prior to the tallymer occratio command). The jobs finished without warning or error and produced the expected outputs.
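
    One way to sanity-check the expectation is a toy count on a random sequence. This is only a sketch under stated assumptions: it uses a plain Python Counter on a single made-up 20,000 bp sequence (`toy_genome`), not tallymer or real genomes, and `kmer_occurrence_summary` is a hypothetical helper. The point it illustrates is that a sequence of length L contains L - k + 1 k-mer positions, and at most min(4^k, L - k + 1) distinct k-mers, so for small k the count of distinct mers is capped by 4^k and grows with k until it is capped by the sequence length instead. The decline toward a single mer only sets in when k is close to L.

```python
import random
from collections import Counter


def kmer_occurrence_summary(seq, k):
    """Return (unique, non_unique) counts of distinct k-mers in seq.

    unique:     distinct k-mers that occur exactly once
    non_unique: distinct k-mers that occur more than once (each counted once)
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    non_unique = len(counts) - unique
    return unique, non_unique


# Hypothetical toy data: one random 20,000 bp sequence with a fixed seed.
random.seed(0)
toy_genome = "".join(random.choice("ACGT") for _ in range(20000))

# Columns: k, unique mers, non-unique mers, upper bound on distinct mers.
for k in range(4, 13):
    unique, non_unique = kmer_occurrence_summary(toy_genome, k)
    upper_bound = min(4 ** k, len(toy_genome) - k + 1)
    print(k, unique, non_unique, upper_bound)
```

    Real genomes contain repeats and shared regions, so the actual numbers will differ from a random sequence, but the upper bound holds either way.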

  • #2
    I have now looked at the distribution of kmers of size 10-30 across the full set of bacterial genomes (~4600), and I'm seeing 6-8 billion kmers for each size in the range I was considering for qPCR primer length (18-22bp). I decided to try dumping the actual kmers using tallymer (from genometools) to explore my data further, but I've found that the final step, tallymer search, is going to take an unreasonable amount of time to dump 6-8 billion kmers. After watching it run for a day, I estimate it would take several years to dump any single kmer size between 18 and 22.

    Can anyone suggest a strategy I could use to identify primers for qPCR that can uniquely identify bacterial genomes from gut genome samples? The methods I've looked at so far (a home-brew tallymer approach and RUCS) seem like they would be good for a smaller collection of genomes, but they don't scale well to the number of genomes I want to use as a background.

