**swbarnes2** · 02-28-2013, 02:16 PM

CoverageBed in BEDtools will do most of what you want. You may want a bit of scripting to massage its output to exactly what you want.

**EricHaugen** · 02-28-2013, 02:17 PM

The reference base and read depth at each position could be obtained in an easily-parseable VCF format with
samtools mpileup -uf reference.fasta alignment.bam | bcftools view -cg -

I use the BEDOPS tools for other interval slicing/dicing such as you describe. They generally keep memory use minimal by requiring sorted inputs, and the new "--max-mem" option in sort-bed may help make anything doable on your machine.

bedtools has the advantage of working with that VCF output directly though.

**Alphred** · 02-28-2013, 06:29 PM

Thanks for the responses!

I'm still a little concerned about the amount of memory I will need to obtain read depth at each position. I'm not trying to get coverage of just a list of SNPs or sequences, but of literally every base in every exon in the genome. I was hoping there might be some way to obtain averages of base-specific coverage across intervals without having to output the depth at every base. I think (though I could be wrong) the -d option of BEDTools' genomeCoverageBed can output the number of reads covering each and every base in each chromosome; for the entire genome this would equate to 3 billion rows of text. Is it possible to narrow the genomeCoverageBed command to cover only coding regions? Or is there a similar function (like coverageBed) that can take a BED file of exon locations (as opposed to just chromosomes) and compute their per-base coverage? In that case, maybe the number of rows would be narrowed down to 0.75 billion or so.

Am I approaching this problem correctly, or is there a more efficient solution?

Ultimately, the results I want would be something like this:

| Gene | Avg.Coverage.A | Avg.Coverage.T | Avg.Coverage.G.in.CG |
|------|----------------|----------------|----------------------|
| a    | 51             | 73             | 67                   |
| b    | 39             | 89             | 100                  |

**EricHaugen** · 02-28-2013, 07:17 PM

The number of rows will not be a memory problem if you only process one line at a time, and the final program in the piped command-line (e.g. your python script or maybe "groupby") just saves the counts per gene.
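A minimal Python sketch of that one-line-at-a-time idea. It assumes an upstream step has already produced tab-separated rows of gene name, reference base, and depth; that three-column layout and the function name `mean_depth_by_gene_and_base` are made up for illustration and are not something any of the tools above emit directly.

```python
from collections import defaultdict


def mean_depth_by_gene_and_base(lines):
    """Stream (gene, ref_base, depth) rows and keep only running totals.

    Memory grows with the number of distinct (gene, base) pairs,
    not with the number of input rows.
    """
    # (gene, base) -> [sum of depths, number of positions seen]
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        # Hypothetical input layout: gene <TAB> reference base <TAB> depth
        gene, base, depth = line.split("\t")[:3]
        entry = totals[(gene, base.upper())]
        entry[0] += int(depth)
        entry[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}
```

At the end of a pipe you would pass `sys.stdin` as `lines` and print the resulting dictionary, so the billions of per-base rows never have to sit in memory at once.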

That bcftools command gives every base, not just variants. The only reason I suggested it instead of "bedtools coverage" is that it also gives you the reference base, which is not in the BAM file.
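To pull the reference base and depth out of that output in a script, something like the following might do. It is only a sketch: it assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and that the depth you want is the `DP` key in the INFO column; `parse_vcf_site` is a name I made up.

```python
def parse_vcf_site(line):
    """Return (chrom, pos, ref_base, depth) for one VCF data line, or None.

    Header lines (starting with '#') and malformed lines return None.
    Depth is read from the DP key of the INFO column (an assumption about
    which depth value is wanted); records without DP get depth 0.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None
    chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    depth = 0
    for item in info.split(";"):
        if item.startswith("DP="):
            depth = int(item[3:])
            break
    return chrom, pos, ref.upper(), depth
```

Each returned tuple can then be matched to a gene and handed to whatever does the per-gene summing.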

**AlexReynolds** · 02-28-2013, 11:11 PM

Originally posted by EricHaugen:

bedtools has the advantage of working with that VCF output directly though.

The vcf2bed script that is part of BEDOPS can be used to pipe BED data as standard input into other BEDOPS tools, like sort-bed and bedops, e.g.:

$ vcf2bed < my.vcf \
| sort-bed --max-mem 2G - \
| bedops --element-of -1 - foo.bed \
> answer.bed

The hyphen character denotes the use of standard input in place of a regular file. (Of course, standard output from vcf2bed can be piped into bedtools utilities or any other app that reads standard input.)

Another option is to use named pipes, if there is more than one input, two or more of which are of non-BED formats, e.g.:

$ mkfifo my_vcf_pipe
$ vcf2bed < my.vcf | sort-bed - > my_vcf_pipe &
$ mkfifo my_gff_pipe
$ gff2bed < my.gff | sort-bed - > my_gff_pipe &
$ bedops --intersect my_vcf_pipe my_gff_pipe > answer.bed
$ rm my_vcf_pipe my_gff_pipe

**Alphred** · 03-01-2013, 02:23 PM

Cool, thanks guys!

Just to make sure I understand correctly, based on the comments, I could use:

samtools mpileup -u -l exons.bed -f ref.fasta my.bam | bcftools view | "some tool to annotate gene names to positions" | "simple script to sum rows by gene and nucleotide" > output.vcf

Should this accomplish what I’m after? If so, what would be the simplest (no frills) method to annotate gene names to Chrom + Pos?

**swbarnes2** · 03-01-2013, 04:26 PM

If you have a .bed file with gene positions, you can use that to assign each line in the .bam to a gene position. You might just want to use that, and then use something like cut | sort | uniq -c to count how many reads hit every gene. With read counts + gene lengths, you can calculate average coverage.
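The last step there can be sketched in Python as below. This is a rough approximation under assumptions not stated above: every read has the same length (`read_length`, a parameter I'm introducing) and each counted read lies entirely inside its gene. It gives a per-gene average only, not a per-nucleotide breakdown.

```python
def average_coverage(read_counts, gene_lengths, read_length):
    """Estimate mean per-base coverage for each gene.

    coverage ~= reads hitting the gene * read length / gene length
    """
    coverage = {}
    for gene, n_reads in read_counts.items():
        length = gene_lengths.get(gene)
        if not length:
            continue  # skip genes with unknown or zero length
        coverage[gene] = n_reads * read_length / length
    return coverage
```

Here `read_counts` would come from the `uniq -c` output and `gene_lengths` from end minus start in the .bed file.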


Automate nucleotide-specific coverage retrieval from BAMs?
