Problem using GATK VariantEval to compare two VCFs

lemur2

Junior Member

Join Date: Aug 2010

Posts: 3
- Share
- Tweet
#1

Problem using GATK VariantEval to compare two VCFs

06-21-2011, 09:26 PM

Can't get GATK VariantEval to properly recognize overlap between two sets of VCF calls. Probably a VCF formatting issue, but I'm a little stumped.

To simplify the issue, here are two very simple example VCFs that I would like to compare:

The first VCF is comprised of just the header and first data line of the OMNI 764 calls for sample NA12878 (OMNI.764s.b37.annotated.vcf.gz), from the Broad's GSA ftp site (ftp -u gsapubftp-anonymous ftp.broadinstitute.org):

##fileformat=VCFv4.0
##FILTER=<ID=amb,Description="Ambiguous SNP. Could not determine true forward strand. May have ref/alt mismatches">
##FILTER=<ID=id10,Description="Within 10 bp of an known indel">
##FILTER=<ID=id20,Description="Within 20 bp of an known indel">
##FILTER=<ID=id5,Description="Within 5 bp of an known indel">
##FILTER=<ID=id50,Description="Within 50 bp of an known indel">
##FORMAT=<ID=GC,Number=.,Type=Float,Description="Gencall Score">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null reference_sequence=/seq/references/Homo_sapiens_assembly19/v0/Homo_\
sapiens_assembly19.fasta rodBind=[/broad/shptmp/chartl/tmp/0.159484125281782.sorted.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=fals\
e validation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_\
HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub"
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=CR,Number=.,Type=Float,Description="SNP Callrate">
##INFO=<ID=GentrainScore,Number=.,Type=Float,Description="Gentrain Score">
##INFO=<ID=HW,Number=.,Type=Float,Description="Hardy-Weinberg Equilibrium">
##SelectVariants="analysis_type=SelectVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=[resources/OMNI_flipped_sites.intervals.list] reference_sequence=/humgen/1kg/referen\
ce/human_b36_both.fasta rodBind=[vcfs/Illumina_HapMap_Omni_2.5_764samples.annot.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=false va\
lidation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEAD\
ER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sample=null select_expressions=[] excludeNonVariants=false excludeFiltered=false"
##VariantAnnotator="input_file=[] interval_merging=ALL read_buffer_size=null unsafe=null sample_metadata=[] read_filter=[] rodBind=[/humgen/illumina/1kg_seq_vcfs/Illumina_HapMap_Omni_2.5_764samples.vcf] read_group_black_list=null log_to_file=null loggin\
g_level=INFO intervals=null BTI_merge_rule=UNION debug_mode=false downsample_to_fraction=null DBSNP=null num_threads=1 quiet_output_mode=false analysis_type=VariantAnnotator rodToIntervalTrackName=null help=false validation_strictness=SILENT downsample_\
to_coverage=null excludeIntervals=null reference_sequence=/humgen/1kg/reference/human_b36_both.fasta useOriginalQualities=false phone_home=STANDARD downsampling_type=null group=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.br\
oadinstitute.sting.gatk.io.stubs.VCFWriterStub vcfContainsOnlyIndels=false annotation=[ChromosomeCounts] sampleName=null useAllAnnotations=false assume_single_sample_reads=null list=false"
##reference=human_b36_both.fasta
##source=SelectVariants
##source=infiniumFinalReportConverterV1.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491

The second is my test VCF that I made using samtools/bcftools, plus the same hit from the above OMNI vcf:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT coverage/coverage_2/master.bam
1 10327 . T C 43 . DP=36;AF1=0.5;CI95=0.5,0.5;DP4=1,7,0,11;MQ=22;FQ=46;PV4=0.42,0.11,1,1 GT:PL:GQ 0/1:73,0,98:76
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491

When I run this command:

java -jar /data/software/GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T VariantEval -R ../../reference/bwa-0.5.8c_index/human_g1k_\
v37.fasta -B:eval,VCF test.vcf -B:comp_omni764,VCF omni.test.vcf -o test.gatkreport

The report says there is no overlap between the two VCFs:

##:GATKReport.v0.1 CompOverlap : The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap comp_omni764 eval none all 1 1 1 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none known 0 0 0 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none novel 1 1 1 0 0.00000000 0 0.00000000

I have a couple of vague ideas about what might be causing the problem - the files are different VCF versions, and the header of the OMNI VCF claims it is against the b36 reference, even though it says in the name it's b37 (guess it was lifted over?)

Any ideas?
Tags: None

Previous template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Problem using GATK VariantEval to compare two VCFs

Latest Articles

ad_right_rmr

News