I'm running into a problem with GATK's VariantRecalibrator: it reports that the input VCF file has a malformed header. Any suggestions would be appreciated.
The error message (GATK version 1.4-21-g30b937d) is:
The provided VCF file has a malformed header: The FORMAT field was provided but there is no genotype/sample data
The VCF file looks fine and validates with vcftools. It was produced by GATK's UnifiedGenotyper followed by snpEff and the VariantAnnotator. I used the exact same set of commands to call and annotate another set of variants, for a set of 5 exome samples, and had no such errors.
##OriginalSnpEffCmd="SnpEff eff -c /net/homehost/project/evolgen/alison/EXOME/PIPELINE/prerequisites/snpEff_2_0_5/snpEff.config -o vcf -s TRANCHE2.snpeff.html -no-upstream -no-downstream -no-intergenic -no-intron -onlyCoding true GRCh37.65 TRANCHE2.grch37.vcf "
##OriginalSnpEffVersion="2.0.5 (build 2011-12-24), by Pablo Cingolani"
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[S0149.recal.bam] read_buffer_size=null phone_home=STANDARD read_filter=[BadCigar] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/net/homehost/project/evolgen/alison/EXOME/PIPELINE/data/hg19_index/hg19.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.001 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=10.0 computeSLOD=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05 multiallelic=false max_alternate_alleles=5 min_indel_count_for_genotyping=5 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10.0 indelGapOpenPenalty=45.0 indelHaplotypeSize=80 bandedIndel=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name=dbsnp source=/net/homehost/project/evolgen/alison/EXOME/PIPELINE/data/gatk_resources/dbsnp_132_hg19.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##VariantAnnotator="analysis_type=VariantAnnotator input_file=[S0149.recal.bam] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/net/homehost/project/evolgen/alison/EXOME/PIPELINE/data/hg19_index/hg19.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=TRANCHE2.vcf) snpEffFile=(RodBinding name=snpEffFile source=TRANCHE2.snpeff.vcf) dbsnp=(RodBinding name= source=UNBOUND) comp=[] resource=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub annotation=[SnpEff] excludeAnnotation=[] group=[StandardAnnotation] expression=[] useAllAnnotations=false list=false vcfContainsOnlyIndels=false MendelViolationGenotypeQualityThreshold=0.0 requireStrictAlleleMatch=false filter_mismatching_base_and_quals=false"
Here are the column header line and the first record of the file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S0149
chrM 195 . C T 48.08 PASS AC=2;AF=1.00;AN=2;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=78.85;MQ0=0;QD=12.02 GT:ADP:GQ:PL 1/1:0,4:2:6.02:79,6,0
A truncated version of the file containing only the first variant also causes the error.
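Since the one-variant file still fails, a quick check of the `#CHROM` line itself might help narrow things down. Below is a minimal sketch, not a confirmed diagnosis: it assumes the parser splits the column header line on tabs (the VCF specification requires that line to be tab-delimited), so a space-delimited header would hide the sample column even though it looks fine on screen. The function name `check_vcf_header` is hypothetical and not part of GATK or vcftools.

```python
def check_vcf_header(lines):
    """Return a list of problems found in the #CHROM column header line.

    `lines` is any iterable of text lines, e.g. an open VCF file.
    """
    # Meta-information lines start with "##"; the column header starts with "#CHROM".
    header = next(
        (line.rstrip("\r\n") for line in lines if line.startswith("#CHROM")),
        None,
    )
    if header is None:
        return ["no #CHROM header line found"]

    problems = []
    fields = header.split("\t")

    # The eight fixed columns (#CHROM .. INFO) must always be present.
    if len(fields) < 8:
        problems.append(
            "fewer than 8 tab-separated columns (found %d)" % len(fields)
        )

    # Spaces in the header line suggest it was not written with tabs.
    if " " in header:
        problems.append("header line contains spaces")

    # FORMAT is the 9th column; at least one sample name should follow it.
    if "FORMAT" in fields and len(fields) < 10:
        problems.append("FORMAT column present but no sample columns")

    return problems
```

It could be run on the truncated file with something like `with open("TRANCHE2.vcf") as fh: print(check_vcf_header(fh))`, substituting the actual file name. An empty list would mean the header line passes these three checks and the cause lies elsewhere.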