Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • elfuser
    Member
    • Dec 2009
    • 16

    Understanding coverage in VCF

    Hello and sorry if this question has been asked before or it is super obvious.
    I'm having a bit of a hard time finding out what the numbers mean in the vcf pileup.
    I created an alignment BAM file of my illumina paired-end reads to a bacterial genome using the following:

    Code:
    bwa aln ref.fa r1.fq > r1.sai
    bwa aln ref.fa r2.fq > r2.sai
    bwa sampe ref.fa r1.sai r2.sai r1.fq r2.fq > aln.sam
    samtools view -bS aln.sam > aln.bam
    Then I create the pileup with
    Code:
    samtools mpileup -uf aln.bam > pile.bcf
    bcftools view pile.bcf > pile.vcf
    I also create a depth file with
    Code:
    samtools depth aln.bam > depth.txt
    Let's take position 214 in the reference genome
    In the vcf
    Code:
    gi|218888746|ref|NC_011770.1|	214	.	G	C	202	.	DP=45;AF1=1;AC1=2;DP4=0,0,25,13;MQ=55;FQ=-141	GT:PL:GQ	1/1:235,114,0:99
    In the depth
    Code:
    gi|218888746|ref|NC_011770.1|	214	53
    I understand that the DP(45) value in the vcf file refers to the raw coverage of the given position with high quality(q13?). An that the coverage number in the depth file (53) refers to the number of bases called at that given position regardless of quality and/or conflict. I also understand that the DP4 values (0,0,25,13) reflect if the reads that cover that position match the reference or an alternative variant. In this case 38 (25+13) high quality reads match the alternative variant and 0 the reference. If this is right what happened to the other 7 (45-38) high quality reads that cover that position.

    Another doubt I have is when I find multiple genotypes at a position. For example:
    In the VCF
    Code:
    gi|218888746|ref|NC_011770.1|	3081	.	C	T,G	213	.	DP=70;AF1=1;AC1=2;DP4=0,0,31,32;MQ=50;FQ=-214	GT:PL:GQ	1/1:246,187,0,247,170,244:99
    My understanding is that in this case that 63 reads support either a T or a G at this position. So what I am trying to understand is how I can tell the support for each genotype. I tried looking at the GT:PL:GQ, but I dont quite understand it.
    Thanks!

Latest Articles

Collapse

  • GATTACAT
    Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
    by GATTACAT
    Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
    07-01-2026, 11:43 AM
  • SEQadmin2
    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
    by SEQadmin2


    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

    Here are nine questions we think about, in roughly the order they matter, before...
    06-18-2026, 07:11 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by SEQadmin2, Yesterday, 11:08 AM
0 responses
6 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-30-2026, 05:37 AM
0 responses
11 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-26-2026, 11:10 AM
0 responses
19 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-17-2026, 06:09 AM
0 responses
53 views
0 reactions
Last Post SEQadmin2  
Working...