Unexpected kmer occurrence distribution across 10 bacterial genomes

jmartin

Member

Join Date: Dec 2009

Posts: 78
#1

04-29-2022, 01:20 PM

I'm trying to identify unique regions across a large collection of bacterial genomes. My first step is to use tallymer (from the genometools package) to identify unique kmers across the entire database. To familiarize myself with the tools, I started with a small set of 10 genomes. After building the suffix index, I ran tallymer occratio to get the distribution of unique and non-unique kmers per mer size, but the distribution is roughly the opposite of what I expected:

# distribution of unique mers
10 520
11 1490
12 2403
13 2887
14 3077
15 3151
16 3178
17 3191
18 3197
19 3200
20 3203
21 3205
22 3207

# distribution of non unique mers (counting each non unique mer only once)
10 689042
11 1306536
12 1748019
13 1944353
14 2012896
15 2034879
16 2041859
17 2044257
18 2045175
19 2045601
20 2045875
21 2046092
22 2046265
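
As a reading aid, the two tables above can be added together to give the total number of distinct kmers per mer size and the share of them that are unique. This is only a sketch: the numbers are copied from the tables, and the names `unique`, `non_unique`, and `totals` are made up for illustration.

```python
# Counts copied from the two tallymer occratio tables above.
unique = {10: 520, 11: 1490, 12: 2403, 13: 2887, 14: 3077, 15: 3151, 16: 3178,
          17: 3191, 18: 3197, 19: 3200, 20: 3203, 21: 3205, 22: 3207}
non_unique = {10: 689042, 11: 1306536, 12: 1748019, 13: 1944353, 14: 2012896,
              15: 2034879, 16: 2041859, 17: 2044257, 18: 2045175, 19: 2045601,
              20: 2045875, 21: 2046092, 22: 2046265}

# Total distinct kmers per mer size (each non-unique kmer is counted once).
totals = {k: unique[k] + non_unique[k] for k in unique}

for k in sorted(totals):
    share_unique = unique[k] / totals[k]
    # 4**k is the number of possible kmers of length k over A, C, G, T.
    print(k, totals[k], f"{share_unique:.3%}", 4 ** k)
```

Read this way, the total number of distinct kmers rises with mer size and then flattens out, and at k = 10 it is still below the 4^10 = 1,048,576 possible 10-mers.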

Naively, I had expected to find more instances of the smaller kmers in my test set than of the larger mer sizes. That expectation came from my thinking that as the mer size approaches the genome size, the number of possible instances goes down (down to just 1 'mer' whose size is the length of the genome).

Can anyone comment on what I am seeing, based on their own experience? My expectation rests on that one very flimsy thought (a mer the length of the genome can occur only once), so I wanted to make sure my results are not unexpected before moving ahead. Note that I have no reason to believe there was any problem with the execution of tallymer (or of suffixerator before the tallymer occratio command): the jobs finished without warnings or errors and produced the expected outputs.
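
For checking the expectation on a toy input, here is a minimal sketch of the kind of counting that the two tables report: for each mer size, the number of distinct kmers seen exactly once and the number seen more than once. It is not tallymer's implementation. It assumes forward-strand counting only, and the function name `occurrence_distribution` and the two toy sequences are hypothetical.

```python
from collections import Counter

def occurrence_distribution(sequences, k_values):
    """Map each k to (unique, non_unique) counts of distinct kmers.

    unique: distinct kmers occurring exactly once across all sequences.
    non_unique: distinct kmers occurring more than once (each counted once).
    Assumption: forward strand only, no reverse complements.
    """
    result = {}
    for k in k_values:
        counts = Counter()
        for seq in sequences:
            # Slide a window of width k over the sequence.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
        unique = sum(1 for c in counts.values() if c == 1)
        result[k] = (unique, len(counts) - unique)
    return result

# Two made-up 12 bp "genomes" that differ at a single position.
toy_genomes = ["ACGTACGTTGCA", "ACGTACGATGCA"]
dist = occurrence_distribution(toy_genomes, range(2, 13))
for k, (unique, non_unique) in dist.items():
    print(k, unique, non_unique)
```

On these toy sequences, k = 2 gives 3 unique and 7 non-unique kmers, while k = 12 gives 2 unique kmers (one per sequence) and no non-unique ones.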
jmartin

Member

Join Date: Dec 2009

Posts: 78
#2

05-04-2022, 11:40 AM

I have now looked at the distribution of kmers of size 10-30 across the full set of bacterial genomes (~4600), and I'm seeing 6-8 billion kmers for each size in the range I was considering for qPCR primer length (18-22 bp). I decided to dump the actual kmers using tallymer (from genometools) and explore the data further, but the final step, tallymer search, is going to take an unreasonable amount of time to dump 6-8 billion kmers: after watching it run for a day, I estimate it would take several years to dump any single kmer size between 18 and 22.

Can anyone suggest a strategy for identifying qPCR primers that can uniquely identify bacterial genomes in gut genome samples? The methods I've looked at so far (a home-brew tallymer approach, and RUCS) seem like they would work for a smaller collection of genomes, but they don't scale well to the number of genomes I want to use as a background.