Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem using GATK VariantEval to compare two VCFs

    Can't get GATK VariantEval to properly recognize overlap between two sets of VCF calls. Probably a VCF formatting issue, but I'm a little stumped.

    To simplify the issue, here are two very simple example VCFs that I would like to compare:

    The first VCF is comprised of just the header and first data line of the OMNI 764 calls for sample NA12878 (OMNI.764s.b37.annotated.vcf.gz), from the Broad's GSA ftp site (ftp -u gsapubftp-anonymous ftp.broadinstitute.org):

    ##fileformat=VCFv4.0
    ##FILTER=<ID=amb,Description="Ambiguous SNP. Could not determine true forward strand. May have ref/alt mismatches">
    ##FILTER=<ID=id10,Description="Within 10 bp of an known indel">
    ##FILTER=<ID=id20,Description="Within 20 bp of an known indel">
    ##FILTER=<ID=id5,Description="Within 5 bp of an known indel">
    ##FILTER=<ID=id50,Description="Within 50 bp of an known indel">
    ##FORMAT=<ID=GC,Number=.,Type=Float,Description="Gencall Score">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null reference_sequence=/seq/references/Homo_sapiens_assembly19/v0/Homo_\
    sapiens_assembly19.fasta rodBind=[/broad/shptmp/chartl/tmp/0.159484125281782.sorted.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=fals\
    e validation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_\
    HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub"
    ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=CR,Number=.,Type=Float,Description="SNP Callrate">
    ##INFO=<ID=GentrainScore,Number=.,Type=Float,Description="Gentrain Score">
    ##INFO=<ID=HW,Number=.,Type=Float,Description="Hardy-Weinberg Equilibrium">
    ##SelectVariants="analysis_type=SelectVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=[resources/OMNI_flipped_sites.intervals.list] reference_sequence=/humgen/1kg/referen\
    ce/human_b36_both.fasta rodBind=[vcfs/Illumina_HapMap_Omni_2.5_764samples.annot.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null useOriginalQualities=false va\
    lidation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEAD\
    ER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sample=null select_expressions=[] excludeNonVariants=false excludeFiltered=false"
    ##VariantAnnotator="input_file=[] interval_merging=ALL read_buffer_size=null unsafe=null sample_metadata=[] read_filter=[] rodBind=[/humgen/illumina/1kg_seq_vcfs/Illumina_HapMap_Omni_2.5_764samples.vcf] read_group_black_list=null log_to_file=null loggin\
    g_level=INFO intervals=null BTI_merge_rule=UNION debug_mode=false downsample_to_fraction=null DBSNP=null num_threads=1 quiet_output_mode=false analysis_type=VariantAnnotator rodToIntervalTrackName=null help=false validation_strictness=SILENT downsample_\
    to_coverage=null excludeIntervals=null reference_sequence=/humgen/1kg/reference/human_b36_both.fasta useOriginalQualities=false phone_home=STANDARD downsampling_type=null group=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.br\
    oadinstitute.sting.gatk.io.stubs.VCFWriterStub vcfContainsOnlyIndels=false annotation=[ChromosomeCounts] sampleName=null useAllAnnotations=false assume_single_sample_reads=null list=false"
    ##reference=human_b36_both.fasta
    ##source=SelectVariants
    ##source=infiniumFinalReportConverterV1.0
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
    1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491

    The second is my test VCF that I made using samtools/bcftools, plus the same hit from the above OMNI vcf:

    ##fileformat=VCFv4.1
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
    ##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
    ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
    ##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability that sample chromosomes are not all the same">
    ##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
    ##INFO=<ID=CI95,Number=2,Type=Float,Description="Equal-tail Bayesian credible interval of the site allele frequency at the 95% level">
    ##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
    ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
    ##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
    ##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT coverage/coverage_2/master.bam
    1 10327 . T C 43 . DP=36;AF1=0.5;CI95=0.5,0.5;DP4=1,7,0,11;MQ=22;FQ=46;PV4=0.42,0.11,1,1 GT:PL:GQ 0/1:73,0,98:76
    1 534247 SNP1-524110 C T . PASS AC=16;AF=0.01;AN=1526;CR=99.86842;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491

    When I run this command:

    java -jar /data/software/GenomeAnalysisTK-1.0.5974/GenomeAnalysisTK.jar -T VariantEval -R ../../reference/bwa-0.5.8c_index/human_g1k_\
    v37.fasta -B:eval,VCF test.vcf -B:comp_omni764,VCF omni.test.vcf -o test.gatkreport

    The report says there is no overlap between the two VCFs:

    ##:GATKReport.v0.1 CompOverlap : The overlap between eval and comp sites
    CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
    CompOverlap comp_omni764 eval none all 1 1 1 0 0.00000000 0 0.00000000
    CompOverlap comp_omni764 eval none known 0 0 0 0 0.00000000 0 0.00000000
    CompOverlap comp_omni764 eval none novel 1 1 1 0 0.00000000 0 0.00000000

    I have a couple of vague ideas about what might be causing the problem - the files are different VCF versions, and the header of the OMNI VCF claims it is against the b36 reference, even though it says in the name it's b37 (guess it was lifted over?)

    Any ideas?

Latest Articles

Collapse

  • seqadmin
    Best Practices for Single-Cell Sequencing Analysis
    by seqadmin



    While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
    06-06-2024, 07:15 AM
  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 06-17-2024, 06:54 AM
0 responses
10 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-14-2024, 07:24 AM
0 responses
20 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-13-2024, 08:58 AM
0 responses
17 views
0 likes
Last Post seqadmin  
Started by seqadmin, 06-12-2024, 02:20 PM
0 responses
20 views
0 likes
Last Post seqadmin  
Working...
X