**westerman** · 05-29-2009, 06:23 AM

Maybe it is early in the morning but I am not able to parse the phrase "I would like to sequence coverage along the genome from aligned reads". Perhaps you meant "... like to determine sequence coverage ...". In which case, yes, binning would work. Or just take the number of reads times the average length of the reads, all divided by the genome length.

Many of the alignment programs will give you something that you can toss into a spreadsheet and come up with a fancy graph. Basically via binning with a bin size of 1.

Really it all depends on exactly what you want in the end. A single "X coverage" number? A graph? A mean with standard deviation? In any case the computational portion does not seem that complex to me.
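
For the "single number" and "mean with standard deviation" options above, a minimal Python sketch might look like the following. It assumes you already have a list of per-base read depths (one integer per genome position); the function name `summarize_depth` is made up for illustration.

```python
from statistics import mean, pstdev


def summarize_depth(depths):
    """Summarize a list of per-base read depths.

    Returns the mean depth, the population standard deviation, and the
    fraction of positions covered by at least one read.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d > 0)
    return {
        "mean": mean(depths),
        "sd": pstdev(depths),
        "fraction_covered": covered / len(depths),
    }


# Toy example: five positions, one of them uncovered.
summary = summarize_depth([0, 2, 4, 2, 2])
print(summary)
```

The same per-base list can be written out as a two-column table and dropped into a spreadsheet if a graph is what you are after.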

**arendon** · 05-29-2009, 06:45 AM

Terribly sorry, I did make little sense.

I would like to know at each base along the genome how many reads saw that base. Alternatively, one can ask not at the single base level but at some interval distance, say 10bp. The output could be a wig file. This does not sound terribly complicated to do if the reads are sorted. I am more wondering whether there are tools that already do this.

Many thanks,

a
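
A rough Python sketch of what is being asked for here: per-base depth from read intervals, optional averaging over fixed intervals (say 10 bp), and output in wiggle fixedStep form. It assumes reads are given as 0-based, half-open (start, end) pairs on a single chromosome; all function names are hypothetical. This particular approach does not need the reads to be sorted.

```python
def per_base_depth(reads, genome_length):
    """Count how many reads cover each base.

    reads: iterable of (start, end) pairs, 0-based, end exclusive.
    Uses a difference array so the cost is O(reads + genome_length).
    """
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1   # coverage rises where the read begins
            diff[end] -= 1     # and falls just past where it ends
    depth = []
    running = 0
    for change in diff[:genome_length]:
        running += change
        depth.append(running)
    return depth


def binned_mean_depth(depth, step):
    """Average the per-base depth over consecutive bins of `step` bases."""
    bins = []
    for i in range(0, len(depth), step):
        chunk = depth[i:i + step]
        bins.append(sum(chunk) / len(chunk))
    return bins


def to_wiggle(chrom, values, step):
    """Format binned values as a wiggle fixedStep block (1-based start)."""
    lines = [f"fixedStep chrom={chrom} start=1 step={step} span={step}"]
    lines.extend(f"{v:g}" for v in values)
    return "\n".join(lines)


depth = per_base_depth([(0, 4), (2, 6), (5, 10)], genome_length=10)
bins = binned_mean_depth(depth, step=5)
print(to_wiggle("chr1", bins, step=5))
```

Note that the last bin can be shorter than `step` when the genome length is not a multiple of it; you may want to trim or pad that bin before loading the file into a browser.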

**ewilbanks** · 06-02-2009, 01:41 PM

Hi! I was working on just this issue a while back and was surprised by the relative lack of tools. Binning should work just fine. I'd recommend Aaron Quinlan's "BEDTools". The coverageBed and genomeCoverageBed tools seem applicable, though I haven't used them yet.

http://people.virginia.edu/~arq5x/bedtools.html

A friend of mine wrote a nifty script in R for me that calculates the coverage at each base pair across the genome, using as input a text file with read genome coordinates. I'd be happy to send it to you, if that seems helpful. The data from this is quite noisy and usually needs smoothing of some sort (I did rolling means, using the R "zoo" package). I don't know what genome size you're working with, but I was using this on microbial genomes (~4 Mb) and the program runs in ~20-30 min.

Cheers,
Lizzy
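
As an illustration of the rolling-mean smoothing mentioned above, here is a small Python sketch. It is not the R script from this post, just a centered moving average written from scratch; `rolling_mean` is a hypothetical name, and positions that lack a full window are returned as None.

```python
def rolling_mean(values, window):
    """Centered rolling mean with an odd window size.

    Positions closer than window // 2 to either end have no full
    window and are returned as None.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = [None] * len(values)
    if len(values) < window:
        return out
    running = sum(values[:window])
    out[half] = running / window
    for i in range(window, len(values)):
        # slide the window one step: add the new value, drop the oldest
        running += values[i] - values[i - window]
        out[i - half] = running / window
    return out


smoothed = rolling_mean([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

Because the window sum is updated incrementally, the cost stays linear in the genome length regardless of the window size.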

**v_kisand** · 06-03-2009, 10:06 AM

I do not remember exactly, but some time ago I experimented with Velvet, and I think it calculated coverage (I must admit I do not remember whether that was single-base coverage).
There was also a tutorial on how to get nice graphs using R.

**strob** · 06-08-2009, 10:42 PM

Hello all,

instead of having a complete chromosome/genome as a reference, I use many gene-scale sequences as my reference. I now want to see what the coverage per base is when I map my Solexa reads against these many reference sequences. Is there already a tool out there that can do the job?
Can programs like SOAP, Bowtie, ... provide this type of information?
Is it also possible to do a BLAST search (with stringent parameters) and then parse the BLAST results?

Any help/comments are more than welcome

**Jonathan** · 06-09-2009, 02:17 AM

There's an easy way to do it using output from the maq-pipeline:

Code:

...
[maq-steps]
...
maq pileup -p [your bfa] [your map] > pileup.out

cut -f 2-4 pileup.out > croppedpileup.out

# then launch R
R
# the following are R commands
# croppedpileup.out holds columns 2-4 of the pileup
data <- read.table(file="croppedpileup.out", sep="\t", header=FALSE)
colnames(data) <- c("pos", "consensus", "coverage")
# depth now holds the mean (overall) coverage
depth <- mean(data[,"coverage"])
# set the smoothing window for the running median (must be odd)
window <- 101
# R indexes from 1, so the plotted range starts at row 1
rangefrom <- 1
rangeto <- nrow(data)
data.smoothed <- runmed(data[,"coverage"], k=window)
png(file="cov_out.png", width=1900, height=1000)
plot(x=data[rangefrom:rangeto,"pos"], y=data.smoothed[rangefrom:rangeto], xlab="bp position", ylab="depth", type="l")
dev.off()

Feel free to leave R afterwards;
you should (unless some error occurred) find a PNG file containing the coverage plot in your directory.
Of course window can be changed (it needs to be odd-numbered, though),
as can the rangefrom and rangeto values.

Edit:
Of course, when using many sequences in maq,
you will most likely be interested in keeping the first column of pileup.out.
However, this will lead to much bigger files (longer R load times) and will require some R handling,
as you probably want to slice and dice and plot them by sequence ID, I take it?

Any questions?
Best
-Jonathan
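
For the many-sequences case, a hedged Python sketch of the "slice by sequence ID" step is below. It assumes the uncropped pileup is tab-separated with the sequence ID in column 1 and the depth in column 4, which is what the cut and colnames commands above imply; check this against your own file. The function name is made up.

```python
from collections import defaultdict


def mean_depth_by_sequence(lines):
    """Mean read depth per reference sequence from pileup-style lines.

    Assumed layout (tab-separated): sequence ID, position, base, depth.
    Only positions present in the input are counted.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        seq_id = fields[0]
        depth = int(fields[3])
        totals[seq_id] += depth
        counts[seq_id] += 1
    return {seq_id: totals[seq_id] / counts[seq_id] for seq_id in totals}


# Toy input standing in for lines read from pileup.out
sample = [
    "geneA\t1\tA\t3\n",
    "geneA\t2\tC\t5\n",
    "geneB\t1\tG\t2\n",
]
per_gene = mean_depth_by_sequence(sample)
print(per_gene)
```

With a real file you would pass the open file handle instead of the toy list, e.g. `mean_depth_by_sequence(open("pileup.out"))`, which streams the file rather than loading it all at once.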

**Malabady** · 07-11-2009, 05:32 AM

hi all;
Some papers mention X coverage and some say % coverage. Is there any difference between the two? We get % coverage by dividing the total (reads*length) by the genome size. If these two terms are different, how is the X coverage calculated? Do we use the haploid genome size instead of the genome size?

**westerman** · 07-13-2009, 07:14 AM

Originally posted by Malabady:

hi all;
Some papers mention X coverage and some say % coverage. Is there any difference between the two? We get % coverage by dividing the total (reads*length) by the genome size. If these two terms are different, how is the X coverage calculated? Do we use the haploid genome size instead of the genome size?

From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 30M reads of 50 bases or 1.5 Gbases total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (1.5 Gbases / 150 Mbases)
Our '% coverage' is 66.6% (100 Mbases / 150 Mbases)

One way to think about this is that percentages generally range from 0% to 100%, so having a percentage greater than 100 can be confusing.

I use the haploid genome size or more specifically the C-value times 965Mbases/pg.
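
The arithmetic in this post can be written out as a short Python sketch. The function names are made up for illustration; the 965 Mbases/pg conversion factor is simply the one quoted above.

```python
def x_coverage(total_bases, genome_size):
    """'X' coverage: raw sequenced bases divided by genome size."""
    return total_bases / genome_size


def percent_coverage(covered_bases, genome_size):
    """'%' coverage: share of the genome actually covered after
    mapping or assembly, as a percentage."""
    return 100.0 * covered_bases / genome_size


def genome_size_from_c_value(c_value_pg, mbases_per_pg=965):
    """Haploid genome size in bases from a C-value in picograms,
    using the conversion factor quoted in this post."""
    return int(round(c_value_pg * mbases_per_pg * 1_000_000))


# Numbers from the example: 1.5 Gbases raw, 150 Mbase genome,
# 100 Mbases of non-overlapping contigs.
x = x_coverage(1_500_000_000, 150_000_000)
pct = percent_coverage(100_000_000, 150_000_000)
print(f"{x:g}X coverage, {pct:.1f}% coverage")
```

Keeping the two quantities in separate functions makes it harder to mix them up: one uses raw bases before mapping, the other uses covered bases after it.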

**nilshomer** · 07-13-2009, 07:38 AM

Originally posted by westerman:

From my understanding yes they are different and what you are calculating is the 'X' coverage. I.e., given the number of raw bases sequenced how many times (or X) does the sequencing potentially cover the genome.

% coverage is how well the genome is actually covered after all mapping and assembly is done.

As an example let's say we have 30M reads of 50 bases or 1.5 Gbases total. Our genome is 150M bases. After mapping (or assembly) we have a bunch of non-overlapping contigs that have 100M bases total.

So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)

One way to think about this is that percentages generally range from 0% to 100%, so having a percentage greater than 100 can be confusing.

I use the haploid genome size or more specifically the C-value times 965Mbases/pg.

Also, what is a genome? Is it the non-repetitive part? Is it the part that is sequenceable with your X base-pair reads? Most genome-sequencing papers report >98% coverage, but I highly doubt this given the large fraction of Alu, SINE, and other repeat elements that confound short reads.

**Malabady** · 07-13-2009, 08:10 AM

Thanks Westerman,

In the example you gave:
So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)

Shouldn't the X coverage be 0.1X in this case?

Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?

**westerman** · 07-13-2009, 10:24 AM

Originally posted by Malabady:

Thanks Westerman,

In the example you gave:
So our 'X coverage' is 10X (150Mbases / 1.5Gbases)
Our '% coverage' is 66.6% (100Mbases / 150Mbases)

Shouldn't the X coverage be 0.1X in this case?

That is what I get for posting early Monday morning.

'X' coverage is raw bases divided by genome size. So the above should be 1.5 Gbases / 150 Mbases or 10X. I will correct my original post.

Also, did you mean that you use the haploid genome size (NOT the diploid, triploid, etc)?

I believe that all published C-values are for haploid genomes. Although if someone knows for certain then please chime in.

**Malabady** · 07-13-2009, 12:14 PM

I agree that all published C-values are for haploid genomes. But one can roughly estimate the total genome size by multiplying the haploid size by the number of genomes. So if we are talking about a diploid plant, we multiply the haploid genome size by 2, then use this estimated genome size in the coverage calculation. Does this sound correct to you? Many thanks.

**quinlana** · 07-14-2009, 08:04 AM

computing coverage

Hi,
As ewilbanks suggested, BEDTools will do this for you.

http://people.virginia.edu/~arq5x/bedtools.html

If you want to compute coverage for "bins/windows" that march along the genome, you would use coverageBed. Let's say you've created a BED file called windows.bed for 10Kb windows across your genome and it looks like this (note BEDTools uses UCSC 0-based starts, with the end coordinate exclusive):
chr1 0 10000
chr1 10000 20000
...

Now, you also have a BED file of sequence reads called reads.bed. The following command will calculate, for each window in windows.bed:
1) The number of reads in reads.bed that overlap the window
2) Coverage density (i.e. the fraction of base pairs in the window that are covered by >= 1 read in reads.bed)

coverageBed -a reads.bed -b windows.bed | sortBed -i stdin > windows.bed.coverage

Sample output:
<chr> <start> <end> <#reads> <# window bases covered> <windowSize> <fraction of window covered>
chr1 0 10000 0 0 10000 0
chr1 10000 20000 33 1000 10000 0.10
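
To make the four output columns concrete, here is a small Python sketch of the same per-window quantities. It is only an illustration of what the numbers mean, not how BEDTools computes them; it assumes 0-based, half-open coordinates on a single chromosome, and `window_coverage` is a hypothetical name.

```python
def window_coverage(window, reads):
    """Per-window coverage summary.

    window: (start, end), 0-based, end exclusive.
    reads:  list of (start, end) pairs on the same chromosome.
    Returns (reads overlapping, bases covered, window size, fraction covered).
    """
    w_start, w_end = window
    clipped = []
    for r_start, r_end in reads:
        lo, hi = max(r_start, w_start), min(r_end, w_end)
        if lo < hi:                     # read overlaps the window
            clipped.append((lo, hi))
    clipped.sort()

    # Merge overlapping pieces so each covered base is counted once.
    covered = 0
    cur_start = cur_end = None
    for lo, hi in clipped:
        if cur_end is None or lo > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = lo, hi
        else:
            cur_end = max(cur_end, hi)
    if cur_end is not None:
        covered += cur_end - cur_start

    size = w_end - w_start
    return len(clipped), covered, size, covered / size


result = window_coverage((0, 100), [(10, 30), (20, 50), (90, 120), (200, 250)])
print(result)
```

In the toy call, three reads touch the window, the first two merge into one 40-base stretch, and the third contributes the 10 bases that fall inside the window.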

I hope this helps. If you need a script to create a BED file with windows of a given size, just let me know.

Best,
Aaron
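
A script of the kind offered above could be as simple as the following Python sketch, which writes non-overlapping fixed-size windows in BED coordinates (0-based start, exclusive end). The function name and the toy chromosome size are assumptions for illustration.

```python
def make_windows(chrom_sizes, window_size):
    """Yield tab-separated BED lines for fixed-size windows.

    chrom_sizes: dict mapping chromosome name to its length in bases.
    The last window on each chromosome is truncated at the chromosome end.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            end = min(start + window_size, size)
            yield f"{chrom}\t{start}\t{end}"


windows = list(make_windows({"chr1": 25000}, 10000))
print("\n".join(windows))
```

Redirecting the printed lines to a file (for example windows.bed) gives something that can be passed to coverageBed.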

**quinlana** · 07-14-2009, 08:52 AM

Also, "per base" coverage can be computed with genomeCoverageBed using the "-d" option. Unfortunately, it doesn't currently have an option to assume that the input BED file is sorted by chrom/start. Consequently, it loads the data into memory and sorts it internally. Thus, memory use is quite high if you have millions and millions of reads. A forthcoming release will fix this obvious limitation.

Note that genomeCoverageBed requires a "genome" file that tells it how long each chromosome is. One can quickly produce this by querying the UCSC databases.

For example, human (hg18):
> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg18.chromInfo" | grep -v ^chrom | head
chr1 247249719
chr1_random 1663265
chr10 135374737
chr10_random 113275
chr11 134452384
chr11_random 215294
chr12 132349534
chr13 114142980
chr13_random 186858
chr14 106368585

Or, of course, just use their browser.

Best,
Aaron
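
If your reference is not in the UCSC databases (a microbial genome or a set of gene-scale sequences, say), the same two-column "genome" file can be derived from the reference FASTA. A minimal Python sketch, with made-up function names, assuming a plain FASTA input where the sequence name is the first word of each header line:

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from the lines of a FASTA file."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def genome_file_text(lengths):
    """Format lengths as tab-separated 'name<TAB>size' lines."""
    return "".join(f"{name}\t{size}\n" for name, size in lengths.items())


# Toy FASTA standing in for lines read from a reference file
toy_fasta = [">chr1 example", "ACGT", "AC", ">chr2", "GGG"]
sizes = fasta_lengths(toy_fasta)
print(genome_file_text(sizes), end="")
```

With a real reference you would pass an open file handle instead of the toy list and write the result to a file for genomeCoverageBed.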

How to calculate coverage