**Brian Bushnell** · 11-18-2014, 10:54 AM

If you download the BBMap package and run kmercountexact, it will tell you the size of the single-copy genomic peak, which is roughly the genome size for bacteria.

kmercountexact.sh in1=read1.fastq in2=read2.fq khist=khist.txt peaks=peaks.txt

You can plot the khist.txt if you want to look at it visually to see how many error kmers there are. "peaks.txt" will give you the location and size of the largest peaks; the number indicating the volume of the first peak should be approximately the genome size, though you can make it more accurate by factoring in the repeat peaks.

You can do some of this with Jellyfish, too, but I don't know the specifics.

**mandar.bobade60** · 11-18-2014, 11:24 PM

Thanks, Brian. BBMap worked well: the volume it reported in the first row is very close to the genome size of species closely related to mine, unlike the Jellyfish result. The only issue is that I had to subsample my very large dataset, because it threw exceptions on the full data.

I have just one question about something you mentioned in your message: "though you can make it more accurate by factoring in the repeat peaks". If you could clarify this, it would be a great help.

**Brian Bushnell** · 11-19-2014, 09:27 AM

If you plot the kmer frequency histogram of a single haploid organism, you will typically see a peak at the very lowest depths, near 1 (the error peak). The next peak is typically the most prominent and corresponds to the unique (single-copy) portion of the genome.

Then there will typically be more, smaller peaks at higher kmer depths. Let's say the first peak is at 40, and contains 3 million kmers. That implies that you have roughly 40x coverage, and the genome contains 3Mbp of unique sequence. There will probably be another peak at 80, which is a lot smaller. This is the peak from 2-copy repeat regions of the genome. If it contains 100,000 kmers, then the expected genome size would actually be increased by 200,000bp - because each of those kmers occurs twice. The three-copy peak would be at 120, and if it contained 100,000 kmers, it would contribute 300,000bp to the expected genome size.

So - if your first peak P1 is at depth D1, with volume V1, then let's label subsequent peaks at depth N*D1 as PN with volume VN - e.g. P2 is the peak at 2*D1, and has a volume of V2. The total genome size would be:

V1*1 + V2*2 + V3*3 + ... + VN*N.

In each case, you can get the multiplier N by round(DN/D1) where DN is the depth (center) of the current peak and D1 is the depth of the first peak.
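The sum above can be sketched in a few lines of Python. This is only an illustration, not part of BBMap: the function name `estimate_genome_size` and the `(depth, volume)` tuple layout are assumptions, and the numbers are the hypothetical ones from the example above (peaks at 40, 80, and 120).

```python
def estimate_genome_size(peaks):
    """Sum V_N * N over peaks given as (depth, volume) tuples.

    Assumption: the first tuple is the single-copy peak (D1, V1).
    """
    if not peaks:
        raise ValueError("need at least one peak")
    d1 = peaks[0][0]
    total = 0
    for depth, volume in peaks:
        # Copy-number multiplier N = round(DN / D1), never below 1.
        copies = max(1, round(depth / d1))
        total += volume * copies
    return total


# Hypothetical peaks from the example: 3 Mbp unique, plus 2-copy and 3-copy repeats.
example_peaks = [(40, 3_000_000), (80, 100_000), (120, 100_000)]
print(estimate_genome_size(example_peaks))  # 3500000
```

With the example numbers this gives 3,000,000 + 200,000 + 300,000 = 3,500,000 bp. Peak depths that are slightly off (say 79 instead of 80) still round to the intended multiplier.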

But, my program does not have a very sophisticated peak-calling algorithm, so it won't find all of the peaks - it will probably find the first 2, and then maybe some more. You'll have to do that manually from looking at the graph of the kmer histogram, or use a better peak-caller. For most small organisms, though, the vast majority of the genome is single-copy so the higher peaks can be ignored if you just want a rough estimate.
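If you want to try finding the peaks yourself, here is a deliberately naive sketch of a peak-caller. It is not the algorithm kmercountexact uses. It assumes a hypothetical list `hist` where `hist[d]` is the number of distinct kmers seen exactly `d` times, skips the initial error peak, and then reports each local maximum with the summed counts between the surrounding valleys as its volume. Real histograms are noisy, so you would probably need to smooth the counts or raise `min_height` before trusting the result.

```python
def call_peaks(hist, min_height=1):
    """Return (depth, volume) for each peak after the error peak.

    Assumption: hist[d] = number of distinct kmers with depth d; hist[0] is unused.
    """
    # Skip the error peak: walk right from depth 1 while counts keep falling.
    start = 1
    while start + 1 < len(hist) and hist[start + 1] < hist[start]:
        start += 1
    if start >= len(hist) - 1:
        return []  # counts only ever fall, so there is no genomic peak

    peaks = []
    while start < len(hist):
        # Climb to the local maximum (plateaus count as still climbing).
        top = start
        while top + 1 < len(hist) and hist[top + 1] >= hist[top]:
            top += 1
        # Descend to the next valley.
        end = top
        while end + 1 < len(hist) and hist[end + 1] < hist[end]:
            end += 1
        if hist[top] >= min_height:
            # Volume = all counts from this peak's left edge to its right valley.
            peaks.append((top, sum(hist[start:end + 1])))
        start = end + 1
    return peaks


# Toy histogram: error peak at depth 1, main peak at depth 6, small peak at depth 11.
toy_hist = [0, 50, 10, 2, 5, 20, 30, 20, 5, 1, 3, 6, 3, 1]
print(call_peaks(toy_hist))  # [(6, 83), (11, 13)]
```

Each valley bin is assigned to the peak on its left, which is an arbitrary choice. For a rough estimate it makes little difference, because valleys hold few kmers.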

**mandar.bobade60** · 11-19-2014, 08:56 PM

Thanks a lot once again, Brian, for your comprehensive explanation. After the first (error) peak and the second (actual) peak, there are no other apparent peaks. So I assume the volume in the first row is my genome size.

Genome size estimation for paired end data
