**Brian Bushnell** · 06-09-2014, 08:36 AM

Depending on the length and insert size of the reads, you can get an insert size histogram via overlap, which is fast and does not require assembly or mapping. You can do that like this:

bbmerge.sh in1=1.fastq in2=2.fastq ihist=ihist.txt reads=2000000

...which will just process the first two million reads. However, if the insert size is long enough that they don't overlap, it won't work and you need to assemble and map. Whether or not you can assemble only 10% of the reads depends on how much coverage you have. Do you know what kind of organism it is, or is it a metagenome?
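As a rough illustration of what can be read off the insert-size histogram, here is a minimal Python sketch. It assumes `ihist.txt` is a two-column text file (insert size, count) with header lines starting with `#`; that layout is an assumption here, so check your own file first. The function names are hypothetical and not part of BBTools.

```python
def parse_hist(text):
    """Parse two-column whitespace-separated text, skipping '#' header lines.

    The file layout is an assumption, not a documented format guarantee.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def summarize_insert_hist(rows):
    """Return merged-pair count, mean and mode insert size from (size, count) rows."""
    rows = [(size, count) for size, count in rows if count > 0]
    total = sum(count for _, count in rows)
    if total == 0:
        raise ValueError("no merged pairs in histogram")
    mean = sum(size * count for size, count in rows) / total
    mode = max(rows, key=lambda row: row[1])[0]
    return {"merged_pairs": total, "mean": mean, "mode": mode}


def merge_rate(merged_pairs, pairs_processed):
    """Fraction of processed pairs that overlapped and merged."""
    return merged_pairs / pairs_processed
```

A low merge rate suggests that most inserts are longer than the overlap method can detect, in which case the histogram only describes the short-insert tail of the library.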

You can estimate coverage via kmer-counting, like this:

khist.sh in1=1.fastq in2=2.fastq hist=hist.txt

Then look at the histogram and find the first major peak (after the initial spike of low-count error kmers), which gives the approximate coverage. You could also speed this up by processing only a fraction of the reads and then dividing the resulting estimate by that fraction.
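Finding the peak by eye is usually enough, but the idea can be sketched in Python as below. This is a simple heuristic under stated assumptions, not what `khist.sh` does internally: the histogram is given as `(depth, count)` pairs sorted by depth, the low-depth end is dominated by error kmers, and the data are smooth enough that no smoothing step is needed. Both function names are hypothetical.

```python
def first_major_peak(hist):
    """Return the depth of the first peak after the low-depth error trough.

    hist: list of (depth, count) pairs sorted by increasing depth.
    """
    counts = [count for _, count in hist]
    # Walk down the initial slope of error kmers to the first local minimum.
    i = 0
    while i + 1 < len(counts) and counts[i + 1] < counts[i]:
        i += 1
    # Climb from that trough to the next local maximum.
    j = i
    while j + 1 < len(counts) and counts[j + 1] >= counts[j]:
        j += 1
    if j == i:
        raise ValueError("no peak found after the error trough")
    return hist[j][0]


def scale_coverage(peak_depth, fraction_sampled):
    """Scale a coverage estimate from a subsample back to the full dataset."""
    if not 0 < fraction_sampled <= 1:
        raise ValueError("fraction_sampled must be in (0, 1]")
    return peak_depth / fraction_sampled
```

For example, a peak at depth 7 found in a 25% subsample corresponds to roughly 28x in the full dataset. Real histograms can be noisy, so treat the result as a starting point and confirm it against a plot.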

Both of these are in the BBTools package. Note that these command lines are for Linux. If your computer uses Windows, the commands would be slightly different.

**bioman1** · 06-11-2014, 01:12 AM

@Brian Bushnell- Thanks for your suggestion.

I am working with the genome of a fruit crop plant. Can I use 10% of the reads for de novo assembly and then map the reads used for the assembly back to it to estimate coverage, insert size and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
Also, is there a tool that can estimate the heterozygosity rate from mapped reads?

**avo** · 06-11-2014, 02:58 AM

I would rather do this with the whole dataset. If you have enough coverage (approx. >30-fold) the k-mer graph should not only be able to give you a hint about the genome size and coverage but also heterozygosity.

I haven't worked with the BBTools package yet, only with Jellyfish and SOAPec. There is also a tool available for estimating these characteristics (see the attached paper). The figures in the paper may also help with understanding the k-mer graph:

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

**Brian Bushnell** · 06-11-2014, 09:17 AM

Originally posted by bioman1:

> @Brian Bushnell- Thanks for your suggestion.
>
> I am working with the genome of a fruit crop plant. Can I use 10% of the reads for de novo assembly and then map the reads used for the assembly back to it to estimate coverage, insert size and heterozygosity? Would this analysis be enough for an approximate estimate of these metrics?
> Also, is there a tool that can estimate the heterozygosity rate from mapped reads?

First, try merging the reads by overlap; you will know in under a minute whether the reads overlap or not (based on the percentage merged). If they do, then the insert size question is solved.

The kmer histogram can give you an estimate of the genome size, repetitiveness, AND the heterozygosity. There's really no way to tell whether 10% is enough for assembly without a genome size estimate. If you have 200 Gbp of data, 10% of it (20 Gbp) would give 30x coverage of a ~700 Mbp genome, which is very small for a tree (even ignoring the ploidy).
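The arithmetic behind that example can be written out as a short Python sketch. It reads the numbers above as 10% of a 200 Gbp dataset, which is an interpretation of the post, and the function names are hypothetical.

```python
def coverage(total_bases, genome_size, fraction=1.0):
    """Expected average coverage when using `fraction` of the sequenced bases."""
    return total_bases * fraction / genome_size


def genome_size_for_coverage(total_bases, target_coverage, fraction=1.0):
    """Largest genome size that `fraction` of the data covers at target_coverage."""
    return total_bases * fraction / target_coverage


# 10% of 200 Gbp at 30x covers a genome of about 667 Mbp.
print(round(genome_size_for_coverage(200e9, 30, fraction=0.1) / 1e6))
```

Turned around, once the kmer histogram gives a genome size estimate, `coverage` tells you whether a given fraction of the reads is likely to be enough for assembly.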

By the way, you can also do normalization and subsampling with BBTools, either of which will reduce the read count. For example, you could normalize to approximately 30x coverage like this:

bbnorm.sh in1=1.fastq in2=2.fastq hist=hist.txt out=normalized.fq target=30

...which will automatically determine how many reads you need to get a uniform 30x coverage. It's slower than sampling, but not too bad. The output from that command would be interleaved.
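To give a feel for why normalization flattens coverage, here is a generic Python sketch of depth-based down-sampling. It is not BBNorm's actual algorithm, and it assumes a per-read depth estimate is already available (in practice this would come from kmer counts). All names are hypothetical.

```python
import random


def keep_probability(depth, target):
    """Probability of keeping a read whose estimated depth is `depth`."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    # Reads at or below the target depth are always kept;
    # deeper reads are thinned in proportion to their excess depth.
    return min(1.0, target / depth)


def normalize(reads_with_depth, target, seed=0):
    """Keep each (read, depth) entry with probability target/depth, capped at 1."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        read
        for read, depth in reads_with_depth
        if rng.random() < keep_probability(depth, target)
    ]
```

Under this scheme a region sequenced at 60x keeps about half of its reads for a 30x target, while a region at 10x keeps all of them, which is why the result is more uniform than plain subsampling.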

Coverage & insert size estimation
