Can't get GATK VariantEval to properly recognize overlap between two sets of VCF calls. Probably a VCF formatting issue, but I'm a little stumped.
To simplify the issue, here are two very simple example VCFs that I would like to compare:
The first VCF is comprised of just the header and first data line of the OMNI 764 calls for sample NA12878 (OMNI.764s.b37.annotated.vcf.gz), from the Broad's GSA ftp site (ftp -u gsapubftp-anonymous ftp.broadinstitute.org):
##fileformat=VCFv4.0
##FILTER=<ID=amb,Description="Ambiguous SNP. Could not determine true forward strand. May have ref/alt mismatches">
##FILTER=<ID=id10,Description="Within 10 bp of an known indel">
##FILTER=<ID=id20,Description="Within 20 bp of an known indel">
##FILTER=<ID=id5,Description="Within 5 bp of an known indel">
##FILTER=<ID=id50,Description="Within 50 bp of an known indel">
##FORMAT=<ID=GC,Number=.,Type=Float,Description="Gencall Score">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null reference_sequence=/seq/references/Homo_sapiens_assembly19/v0/Homo_\
sapiens_assembly19.fasta rodBind=[/broad/shptmp/chartl/tmp/0.159484125281782.sorted.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=fals\
e validation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_\
HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub"
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=CR,Number=.,Type=Float,Description="SNP Callrate">
##INFO=<ID=GentrainScore,Number=.,Type=Float,Description="Gentrain Score">
##INFO=<ID=HW,Number=.,Type=Float,Description="Hardy-Weinberg Equilibrium">
##SelectVariants="analysis_type=SelectVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=[resources/OMNI_flipped_sites.intervals.list] reference_sequence=/humgen/1kg/referen\
ce/human_b36_both.fasta rodBind=[vcfs/Illumina_HapMap_Omni_2.5_764samples.annot.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=false va\
lidation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEAD\
ER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sample=null select_expressions=[] excludeNonVariants=false excludeFiltered=false"
##VariantAnnotator="input_file=[] interval_merging=ALL read_buffer_size=null unsafe=null sample_metadata=[] read_filter=[] rodBind=[/humgen/illumina/1kg_seq_vcfs/Illumina_HapMap_Omni_2.5_764samples.vcf] read_group_black_list=null log_to_file=null loggin\
g_level=INFO intervals=null BTI_merge_rule=UNION debug_mode=false downsample_to_fraction=null DBSNP=null num_threads=1 quiet_output_mode=false analysis_type=VariantAnnotator rodToIntervalTrackName=null help=false validation_strictness=SILENT downsample_\
to_coverage=null excludeIntervals=null reference_sequence=/humgen/1kg/reference/human_b36_both.fasta useOriginalQualities=false phone_home=STANDARD downsampling_type=null group=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.br\
oadinstitute.sting.gatk.io.stubs.VCFWriterStub vcfContainsOnlyIndels=false annotation=[ChromosomeCounts] sampleName=null useAllAnnotations=false assume_single_sample_reads=null list=false"
##reference=human_b36_both.fasta
##source=SelectVariants
##source=infiniumFinalReportConverterV1.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
The second is my test VCF that I made using samtools/bcftools, plus the same hit from the above OMNI vcf:
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT coverage/coverage_2/master.bam
1 10327 . T C 43 . DP=36;AF1=0.5;CI95=0.5,0.5;DP4=1,7,0,11;MQ=22;FQ=46;PV4=0.42,0.11,1,1 GT:PL:GQ 0/1:73,0,98:76
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
When I run this command:
java -jar /data/software/GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T VariantEval -R ../../reference/bwa-0.5.8c_index/human_g1k_\
v37.fasta -B:eval,VCF test.vcf -B:comp_omni764,VCF omni.test.vcf -o test.gatkreport
The report says there is no overlap between the two VCFs:
##:GATKReport.v0.1 CompOverlap : The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap comp_omni764 eval none all 1 1 1 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none known 0 0 0 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none novel 1 1 1 0 0.00000000 0 0.00000000
I have a couple of vague ideas about what might be causing the problem - the files are different VCF versions, and the header of the OMNI VCF claims it is against the b36 reference, even though it says in the name it's b37 (guess it was lifted over?)
Any ideas?
To simplify the issue, here are two very simple example VCFs that I would like to compare:
The first VCF is comprised of just the header and first data line of the OMNI 764 calls for sample NA12878 (OMNI.764s.b37.annotated.vcf.gz), from the Broad's GSA ftp site (ftp -u gsapubftp-anonymous ftp.broadinstitute.org):
##fileformat=VCFv4.0
##FILTER=<ID=amb,Description="Ambiguous SNP. Could not determine true forward strand. May have ref/alt mismatches">
##FILTER=<ID=id10,Description="Within 10 bp of an known indel">
##FILTER=<ID=id20,Description="Within 20 bp of an known indel">
##FILTER=<ID=id5,Description="Within 5 bp of an known indel">
##FILTER=<ID=id50,Description="Within 50 bp of an known indel">
##FORMAT=<ID=GC,Number=.,Type=Float,Description="Gencall Score">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null reference_sequence=/seq/references/Homo_sapiens_assembly19/v0/Homo_\
sapiens_assembly19.fasta rodBind=[/broad/shptmp/chartl/tmp/0.159484125281782.sorted.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=fals\
e validation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_\
HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub"
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=CR,Number=.,Type=Float,Description="SNP Callrate">
##INFO=<ID=GentrainScore,Number=.,Type=Float,Description="Gentrain Score">
##INFO=<ID=HW,Number=.,Type=Float,Description="Hardy-Weinberg Equilibrium">
##SelectVariants="analysis_type=SelectVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=[resources/OMNI_flipped_sites.intervals.list] reference_sequence=/humgen/1kg/referen\
ce/human_b36_both.fasta rodBind=[vcfs/Illumina_HapMap_Omni_2.5_764samples.annot.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=false va\
lidation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEAD\
ER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sample=null select_expressions=[] excludeNonVariants=false excludeFiltered=false"
##VariantAnnotator="input_file=[] interval_merging=ALL read_buffer_size=null unsafe=null sample_metadata=[] read_filter=[] rodBind=[/humgen/illumina/1kg_seq_vcfs/Illumina_HapMap_Omni_2.5_764samples.vcf] read_group_black_list=null log_to_file=null loggin\
g_level=INFO intervals=null BTI_merge_rule=UNION debug_mode=false downsample_to_fraction=null DBSNP=null num_threads=1 quiet_output_mode=false analysis_type=VariantAnnotator rodToIntervalTrackName=null help=false validation_strictness=SILENT downsample_\
to_coverage=null excludeIntervals=null reference_sequence=/humgen/1kg/reference/human_b36_both.fasta useOriginalQualities=false phone_home=STANDARD downsampling_type=null group=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.br\
oadinstitute.sting.gatk.io.stubs.VCFWriterStub vcfContainsOnlyIndels=false annotation=[ChromosomeCounts] sampleName=null useAllAnnotations=false assume_single_sample_reads=null list=false"
##reference=human_b36_both.fasta
##source=SelectVariants
##source=infiniumFinalReportConverterV1.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
The second is my test VCF that I made using samtools/bcftools, plus the same hit from the above OMNI vcf:
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT coverage/coverage_2/master.bam
1 10327 . T C 43 . DP=36;AF1=0.5;CI95=0.5,0.5;DP4=1,7,0,11;MQ=22;FQ=46;PV4=0.42,0.11,1,1 GT:PL:GQ 0/1:73,0,98:76
1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
When I run this command:
java -jar /data/software/GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T VariantEval -R ../../reference/bwa-0.5.8c_index/human_g1k_\
v37.fasta -B:eval,VCF test.vcf -B:comp_omni764,VCF omni.test.vcf -o test.gatkreport
The report says there is no overlap between the two VCFs:
##:GATKReport.v0.1 CompOverlap : The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap comp_omni764 eval none all 1 1 1 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none known 0 0 0 0 0.00000000 0 0.00000000
CompOverlap comp_omni764 eval none novel 1 1 1 0 0.00000000 0 0.00000000
I have a couple of vague ideas about what might be causing the problem - the files are different VCF versions, and the header of the OMNI VCF claims it is against the b36 reference, even though it says in the name it's b37 (guess it was lifted over?)
Any ideas?