That breaks down into these kmers:
CA, AT, TT, TA, AT, TT, TT
After reverse-complementing to store only a single canonical copy, either forward or reverse, we get this:
CA, AT, AA, TA, AT, AA, AA
Note that TA, like AT, is its own reverse complement, so it stays as TA rather than becoming AT.
So the kmer counts stored by the program would be:
AA: 3
AT: 2
CA: 1
TA: 1
This would equate to 4 unique kmers. The histogram would look like this:
Code:
#Depth Raw_Count Unique_Kmers
1 2 2
2 2 1
3 3 1
So line 1 means there were 2 unique kmers (CA and TA) that each occurred exactly once, for a total of 2 occurrences. Line 2 means there was a single kmer (AT) that occurred twice, for a total of 2 occurrences. Line 3 means there was a single kmer (AA) that occurred 3 times, for a total of 3 occurrences.
Therefore, if you want to plot the coverage with respect to the genome, I suggest plotting the "unique" column. And to clarify, the number of "unique kmers" is not the same as the number of kmers that only occur once (I would call those "singleton kmers"). The second number of row 1 gives you the number of singleton kmers counted (2, in this case).
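If it helps to see the bookkeeping spelled out, here is a minimal Python sketch of the steps above: canonicalize each kmer, count the canonical kmers, then tabulate how many distinct kmers occur at each depth. This is not the program's actual code. The function names are made up for illustration, and I am assuming the canonical copy is the alphabetically smaller of a kmer and its reverse complement (the choice of representative does not change the histogram).

```python
from collections import Counter

# Assumption: the canonical form is the alphabetically smaller of a kmer
# and its reverse complement. Any consistent choice gives the same histogram.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return one representative for a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_histogram(kmers):
    """Return rows of (depth, raw_count, unique_kmers), sorted by depth."""
    counts = Counter(canonical(k) for k in kmers)  # kmer -> occurrences
    unique_at_depth = Counter(counts.values())     # depth -> distinct kmers
    return [(d, d * u, u) for d, u in sorted(unique_at_depth.items())]


if __name__ == "__main__":
    # The kmers from the breakdown above, before canonicalization.
    kmers = ["CA", "AT", "TT", "TA", "AT", "TT", "TT"]
    print("#Depth\tRaw_Count\tUnique_Kmers")
    for depth, raw, unique in kmer_histogram(kmers):
        print(f"{depth}\t{raw}\t{unique}")
```

In each row, Raw_Count is just Depth times Unique_Kmers, so summing the Raw_Count column gives the total number of kmers counted, and summing the Unique_Kmers column gives the number of distinct kmers.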