Multi-sample VCF generated in GATK is not working with novoutil IUPAC

sdarko

Member

Join Date: Apr 2009
Posts: 51

Multi-sample VCF generated in GATK is not working with novoutil IUPAC

07-19-2012, 04:55 AM

Hi,

We are trying to find SNPs in a few loci of interest in rhesus macaque. To that end, we have sequenced exome sequences and I've gone through the process of calling SNPs using GATK and I've come up with a multi-sample VCF file.

Now we want to actually create alternative fasta references for each macaque at our loci of interest. I thought that Novocraft utility "novoutil iupac" would do exactly what I want it to do, which is (from the most recent version):

Code:

Novoutil iupac
      1. Add option ( -g ) to substitute genotype for the reference base. Without
         the -g option ambiguous bases are formed by combining the reference base
         and the genotype.
      2. Add option -i to substitute homozygous indel calls into the new reference.
         This is useful for reference guided assembly.
      3. Add option -q 99 that sets a lower quality limit (default 30) on usable VCF
         data lines.
      4. If the REF column of VCF file shows an N then IUPAC code is formed from the
         ALT alleles.

However, when I run novoutil iupac with my VCF file and the reference genome, I get the following error:

Code:

# novoutil iupac (V2.08.02) - Build Jun 28 2012 @ 16:23:10
# (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
Loading VCF file...
novoutil: src/vcfsnps.cc:152: bool vcffile::hetero(tabbed&): Assertion `*gt == '|' || *gt == '/'' failed.

So it clearly doesn't like something about my VCF file, which looks something like this:

Code:

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[sorted_realigned_RG_20120712_unmasked_lane_1_index_3.bam, sorted_realigned_RG_20120712_unmasked_lane_1_index_8.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_1.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_2.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_6.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_9.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_4.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_5.bam, sorted_realigned_RG_20120712_unmasked_lane_3_index_7.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=30.0 noSLOD=false annotateNDA=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05 max_alternate_alleles=3 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedIndel=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##contig=<ID=IgH1,length=610659>
##contig=<ID=IgH2,length=260169>
##contig=<ID=IgK1,length=977354>
##contig=<ID=IgK2,length=4152696>
##contig=<ID=IgL1,length=20629>
##contig=<ID=IgL2,length=7376870>
##contig=<ID=IgL3,length=2230>
##contig=<ID=IgL4,length=7654437>
##contig=<ID=IgL5,length=8461>
##contig=<ID=IgL6,length=8824>
##contig=<ID=IgL7,length=1082>
##contig=<ID=IgL8,length=4740>
##contig=<ID=IgL9,length=13131495>
##contig=<ID=TRA1,length=23168555>
##contig=<ID=TRB1,length=19618912>
##contig=<ID=TRD1,length=1136629>
##contig=<ID=TRG1,length=17450015>
##reference=file:///vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	monkey1_3	monkey1_8	monkey2_1	monkey2_2	monkey3_7	monkey5_6	monkey5_9	monkey6_4	monkey6_5
TRG1	2855	.	G	T	75.71	PASS	AC=6;AF=1.00;AN=6;DP=3;Dels=0.00;FS=0.000;HRun=3;HaplotypeScore=0.0000;MQ=41.02;MQ0=0;QD=25.24;SB=-0.02	GT:AD:DP:GQ:PL	./.	1/1:0,1:1:3.01:34,3,0	./.	1/1:0,1:1:3.01:34,3,0	./.	./.	./.	./.	1/1:0,1:1:3.01:42,3,0
TRG1	3864	.	A	G	42.75	LowQual	AC=4;AF=0.50;AN=8;BaseQRankSum=0.000;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=150.00;MQ0=0;MQRankSum=-1.026;QD=21.37;ReadPosRankSum=1.026;SB=2.99	GT:AD:DP:GQ:PL	./.	1/1:0,1:1:3.01:42,3,0	./.	1/1:0,1:1:3.01:42,3,0	0/0:1,0:1:3.01:0,3,41	./.	0/0:1,0:1:3.01:0,3,42	./.	./.
TRG1	4614	.	T	C	110.43	PASS	AC=7;AF=0.70;AN=10;BaseQRankSum=0.000;DP=6;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=101.00;MQ0=0;MQRankSum=1.380;QD=22.09;ReadPosRankSum=0.000;SB=-30.01	GT:AD:DP:GQ:PL	./.	0/1:1,1:2:32.56:35,0,33	1/1:0,1:1:3.01:41,3,0	./.	./.	0/0:1,0:1:3.01:0,3,34	1/1:0,1:1:3.01:37,3,0	./.	1/1:0,1:1:3.01:42,3,0
TRG1	4628	.	C	T	34.82	LowQual	AC=3;AF=0.75;AN=4;BaseQRankSum=-0.736;DP=3;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=51.96;MQ0=0;MQRankSum=0.736;QD=11.61;ReadPosRankSum=0.736;SB=-29.60	GT:AD:DP:GQ:PL	./.	0/1:1,1:2:35.21:35,0,35	./.	./.	./.	1/1:0,1:1:3.01:34,3,0	./.	./.	./.
TRG1	5257	.	T	A,G	75.47	PASS	AC=3,2;AF=0.38,0.25;AN=8;BaseQRankSum=0.622;DP=114;Dels=0.00;FS=0.000;HaplotypeScore=0.1835;MQ=13.79;MQ0=4;MQRankSum=-2.795;QD=1.54;ReadPosRankSum=-0.487;SB=-24.83	GT:AD:DP:GQ:PL	./.	1/1:2,4,14:21:3.01:34,3,0,34,3,34	./.	0/0:2,2,10:14:3.01:0,3,34,3,34,34	0/1:1,5,14:20:25.32:73,0,25,76,31,107	2/2:2,3,3:8:3.01:42,42,42,3,3,0	./.	./.	./.
TRG1	5268	.	C	T	61.81	PASS	AC=3;AF=0.30;AN=10;BaseQRankSum=-0.431;DP=99;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.5441;MQ=14.90;MQ0=2;MQRankSum=2.302;QD=1.63;ReadPosRankSum=0.825;SB=-24.38	GT:AD:DP:GQ:PL	0/0:11,1:12:3:0,3,25	1/1:16,1:18:3.01:34,3,0	./.	0/0:9,0:10:3.01:0,3,34	0/1:17,3:20:25.32:69,0,25	0/0:8,0:8:3.01:0,3,42	./.	./.	./.
TRG1	5275	.	G	C	62.34	PASS	AC=3;AF=0.30;AN=10;BaseQRankSum=-0.945;DP=56;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.0882;MQ=19.71;MQ0=1;MQRankSum=2.439;QD=2.97;ReadPosRankSum=0.154;SB=-24.38	GT:AD:DP:GQ:PL	0/0:6,1:8:3:0,3,25	1/1:9,1:11:3.01:34,3,0	./.	0/0:9,0:9:3.01:0,3,34	0/1:8,2:10:25.32:70,0,25	0/0:5,0:5:3.01:0,3,41	./.	./.	./.

Does anyone have any experience with this sort of issue?

Thanks,
Sam

Tags: None

Previous template Next

Genetic Variation in Immunogenetics and Antibody Diversity

by seqadmin

The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
- Channel: Articles
11-06-2024, 07:24 PM
Choosing Between NGS and qPCR

by seqadmin

Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
- Channel: Articles
10-18-2024, 07:11 AM

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Multi-sample VCF generated in GATK is not working with novoutil IUPAC

Latest Articles

ad_right_rmr

News