I'm trying to identify unique regions across a large collection of bacterial genomes, and my first step is to use tallymer (from the genometools package) to identify unique kmers across the entire database. I started out with a small set of 10 genomes (to familiarize myself with the tools). After building the suffix index, I ran tallymer occratio to get the distribution of unique and non-unique kmers per mer-size, but the distribution is roughly the opposite of what I expected:
# distribution of unique mers
10 520
11 1490
12 2403
13 2887
14 3077
15 3151
16 3178
17 3191
18 3197
19 3200
20 3203
21 3205
22 3207
# distribution of non unique mers (counting each non unique mer only once)
10 689042
11 1306536
12 1748019
13 1944353
14 2012896
15 2034879
16 2041859
17 2044257
18 2045175
19 2045601
20 2045875
21 2046092
22 2046265
Naively, I had expected to find more instances of the smaller kmers in my test set than of the larger mer-sizes. That assumption was based on my thinking that as the mer-size approaches the genome size, the number of possible instances goes down (down to just 1 'mer' whose size is the length of the genome).
Can anyone comment on what I am seeing, based on their own experience? My assumption rests on that one very flimsy thought (a mer of genome length can only occur once), so I wanted to make sure my results are not unexpected before moving ahead. Note that I have no reason to believe there was any problem with the execution of tallymer (or of suffixerator prior to the tallymer occratio command). The jobs finished without warning or error and produced the expected outputs.
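To make the two counts concrete, here is a minimal Python sketch of what I understand the output above to report. This is an assumption on my part and not tallymer's actual implementation: for a given k it tallies the distinct kmers that occur exactly once (unique) and the distinct kmers that occur more than once (non-unique, each counted only once). The function name `occ_distribution` and the toy sequence are made up for illustration, and the sketch only looks at one strand.

```python
from collections import Counter


def occ_distribution(seq, k):
    """Return (unique, non_unique) counts of distinct k-mers in seq.

    unique     = distinct k-mers that occur exactly once
    non_unique = distinct k-mers that occur more than once (each counted once)
    Hypothetical helper: forward strand only, no handling of ambiguous bases.
    """
    # Slide a window of width k over the sequence and count every k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    non_unique = sum(1 for c in counts.values() if c > 1)
    return unique, non_unique


if __name__ == "__main__":
    toy = "ACGTACGTTTACGA"  # made-up sequence, not real data
    for k in range(1, 6):
        unique, non_unique = occ_distribution(toy, k)
        print(k, unique, non_unique)
```

Both numbers are counts of distinct kmers, not of total occurrences, and for a given k there can be at most 4^k distinct kmers over the A/C/G/T alphabet.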