  • How to estimate the genome size

    Hi everyone, how can we actually estimate the genome size when there is no reference genome, or even a closely related genome, for our sample?

    One estimation method, used by BGI in assembling the giant panda genome, relies on 17-mer frequencies. I don't quite get the idea; would anyone help explain it?

    From their supplementary: "Distribution of 17-mer frequency in the raw sequencing reads. We used all reads from the short insert-size libraries (<500 bp). The peak depth is at 15X. The peak of 17-mer frequency (M) in reads is correlated with the real sequencing depth (N), read length (L), and k-mer length (K); their relation can be expressed by an empirical formula: M = N * (L – K + 1) / L. Then, we divided the total sequence length by the real sequencing depth and obtained an estimated genome size of 2.46 Gb."

    For reference, the paper is titled "The sequence and de novo assembly of the giant panda genome".
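The relation quoted above can be sanity-checked numerically. Below is a minimal sketch (the function name and the numbers are illustrative, not values from the panda paper) showing why the k-mer peak M sits a bit below the true read depth N: each read of length L contributes only L - K + 1 k-mers.

```python
# M = N * (L - K + 1) / L: expected k-mer depth M vs. real sequencing
# depth N. Illustrative only; not values from the giant panda paper.

def kmer_depth(real_depth, read_length, k):
    """Expected peak k-mer depth M for real depth N, read length L, k-mer size K."""
    return real_depth * (read_length - k + 1) / read_length

# With 100 bp reads and 17-mers, a true 20X depth gives a 17-mer
# peak around 20 * 84 / 100 = 16.8X.
print(kmer_depth(20, 100, 17))
```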

  • #2
    I would like to know this as well. K-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA.

    The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

    • #3
      Originally posted by figure002 View Post
      I would like to know this as well. K-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA.

      The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
      Hey, thanks for the information. So far I have just tried the program Tallymer.

      It works great, but with a very large read set the suffix tree files become tremendously large. There is another paper, from Waterman's group, showing how to predict genome size from k-mer counts: "Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples".

      Do you have any idea of the speed and memory consumption of Jellyfish?

      • #4
        Thanks for pointing me to Tallymer. I was looking for more tools to test on our data.

        Another tool I found is GSP, but it doesn't look like a very decent tool. The source package is messy and I couldn't find an accompanying publication.

        I've only just finished compiling jellyfish (jellyfish 1.1 has some compilation issues, but we just received a patch from the developer that fixes these issues). So I can't tell you anything about performance right now. I'll report back as soon as I have some results.

        • #5
          Originally posted by figure002 View Post
          Thanks for pointing me to Tallymer. I was looking for more tools to test on our data.

          Another tool I found is GSP, but it doesn't look like a very decent tool. The source package is messy and I couldn't find an accompanying publication.

          I've only just finished compiling jellyfish (jellyfish 1.1 has some compilation issues, but we just received a patch from the developer that fixes these issues). So I can't tell you anything about performance right now. I'll report back as soon as I have some results.
          Hi figure002, I guess the compilation error you've encountered is "warnings being treated as errors". In my case, simply removing all the "-Werror" flags from the Makefiles did the trick.

          Hope this may help.

          • #6
            Originally posted by yanij View Post
            Hi figure002, I guess the compilation error you've encountered is "warnings being treated as errors". In my case, simply removing all the "-Werror" flags from the Makefiles did the trick.

            Hope this may help.
            True, that's what I did at first. But the developer was kind enough to fix the source, which makes removing the -Werror flags unnecessary. He said he would upload the new package, which should at least make things easier.

            • #7
              Originally posted by yanij View Post
              Do you have any idea of the speed and memory consumption of Jellyfish?
              I did some runs of both Jellyfish and Tallymer on test data, and I noticed that Jellyfish is much faster at k-mer counting (it was running with 32 threads). According to the Jellyfish paper, "Jellyfish offers a much faster and more memory-efficient solution" than suffix arrays, which I believe are used in Tallymer.

              At the moment I'm running "tallymer suffixerator" and "jellyfish count" side by side on a machine with 32 cores. "jellyfish count" is using around 0.2% memory, while "tallymer suffixerator" is using around 3.0%.

              So far I can confirm that Jellyfish is indeed faster and more memory efficient.

              • #8
                Originally posted by figure002 View Post
                I would like to know this as well. K-mer distribution in short-read data seems to be the key, but that's as far as my understanding goes. I did find a tool for k-mer counting and genome size estimation: JELLYFISH - Fast, Parallel k-mer Counting for DNA.

                The accompanying article for Jellyfish: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers
                I think this estimation is based on a gamma distribution, is similar to the calculations made by Quake, Ray, ABySS, etc., and depends on something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and k-mer length. Does anyone else have experience trying to make these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.

                • #9
                  Originally posted by SES View Post
                  I think this estimation is based on a gamma distribution, is similar to the calculations made by Quake, Ray, ABySS, etc., and depends on something like 10-15X coverage. With low coverage, my experience is that the distribution of k-mer copies shows an exponential decay, with the rate of decay depending on the repeat content and k-mer length. Does anyone else have experience trying to make these calculations? It would be great if there were a way to make a reasonable estimate from lower-coverage shotgun data.
                  This paper presents a good way to estimate genome size with low-coverage shotgun data, though it requires an assembled transcriptome as a reference...

                  • #10
                    Originally posted by Qingl View Post
                    This paper presents a good way to estimate genome size with low-coverage shotgun data, though it requires an assembled transcriptome as a reference...
                    Our results provide the first global view of venom-duct transcription in any cone snail. A notable feature of Conus bullatus venoms is the breadth of A-superfamily peptides expressed in the venom duct, which are unprecedented in their structural diversity. We also find SNP rates within conopeptides …

                    • #11
                      This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.

                      • #12
                        Originally posted by SES View Post
                        This is an interesting approach that I had not seen. It is not clear how/if the efficacy of the method was evaluated but it is something to explore. Thanks for the response.
                        Sure. The method includes a control sample that supports its efficacy.

                        • #13
                          Originally posted by Qingl View Post
                          Sure. The method includes a control sample that supports its efficacy.
                          Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.

                          • #14
                            Originally posted by SES View Post
                            Excellent. Do you mean control with known genome size? That is what I was wondering. I'll take a closer look at the paper and see if I can apply the methods to my system.
                              Yes, I agree it's an excellent method.

                            • #15
                              Hi yanij

                               I don't know how useful this will be to you given the time since your post, but just in case....

                               The BGI method is based on the observation that the coverage achieved for a genome depends on the size of the genome and the total amount of sequence data generated. So if you sequence 100 Mb of data for a 10 Mb genome, you should get ~10-fold coverage.

                              Or as a simple equation: depth of coverage = total data / genome length.

                              If you have any two of these parameters (i.e., you know the amount of data you generated and you know the genome size) obviously you can calculate the third.

                              Usually when doing de novo genome sequencing you don't know the genome size, and since you don't have the genome, you don't know the coverage, but you do know how much data you've generated (i.e., the 'total sequence length' to use BGI's term). To estimate the genome size, you then need to estimate the coverage depth (N).

                               To do this, you calculate the k-mer frequency within your read data (most people will do this for one of their small insert libraries, for which they have the most information). That means you chop all of the reads you've generated into k-mers. A k-mer of 17 is the most common choice, as it is long enough to yield fairly specific sequences (meaning it is unlikely the k-mer is repeated throughout the genome by chance) but short enough to give you lots of data.

                               You then count the frequency with which each 17-mer represented in your data is found among all of the reads, and create a frequency histogram of this information. For non-repetitive regions of the genome, this histogram should be normally distributed around a single peak (although in real data you will also see a spike near a frequency of 1, caused by rare sequencing errors, etc.). This peak value (or peak depth) is the mean k-mer coverage for your data.
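The counting step described above can be sketched in a few lines. In practice you would use a dedicated counter such as Jellyfish or Tallymer; this pure-Python version (function name is my own) only illustrates the idea on a toy read set.

```python
from collections import Counter

def kmer_histogram(reads, k=17):
    """Count k-mers across all reads, then tally how many distinct
    k-mers occur at each frequency (the k-mer frequency histogram)."""
    counts = Counter()
    for read in reads:
        # Each read of length L yields L - k + 1 k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Map: frequency -> number of distinct k-mers seen at that frequency.
    return Counter(counts.values())

# Toy example: two overlapping reads from the same repeated sequence.
print(kmer_histogram(["ACGTACGTACGTACGTACGT", "CGTACGTACGTACGTACGTA"]))
```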

                               You can relate this value to the actual coverage of your genome using the formula M = N * (L - K + 1) / L, where M is the mean k-mer coverage, N is the actual coverage of the genome, L is the mean read length and K is the k-mer size.

                               L - K + 1 gives you the number of k-mers created per read.

                               So basically the formula says that the k-mer coverage for a genome is equal to the mean read coverage multiplied by the number of k-mers per read, divided by the read length.

                               Because you know L (your mean read length) and K (the k-mer size you used to estimate peak k-mer coverage), and you've calculated M (SOAPdenovo comes with a script called kmerfreq that will do this), you simply solve the equation for N:

                               N = M / ((L - K + 1) / L)

                              Once you have that, divide your total sequence data by N and you have your genome estimate.
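The recipe above can be put together end to end. The numbers below are illustrative, not from any real project:

```python
# Solve N = M / ((L - K + 1) / L), then divide the total data by N.
# All values here are made up for illustration.
peak_kmer_depth = 25    # M, peak of the 17-mer frequency histogram
read_length = 100       # L, mean read length (bp)
k = 17                  # K, k-mer size
total_bases = 50e9      # total sequence data generated (bp)

kmers_per_read = read_length - k + 1                           # L - K + 1
real_depth = peak_kmer_depth / (kmers_per_read / read_length)  # N
genome_size = total_bases / real_depth

print(f"Real depth N: {real_depth:.1f}X")          # ~29.8X
print(f"Genome size:  {genome_size / 1e9:.2f} Gb")  # 1.68 Gb
```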

                              Hope that helps.
                              Last edited by aaronrjex; 01-10-2013, 05:03 PM.
