Hi,
We are trying to find SNPs in a few loci of interest in rhesus macaque. To that end, we have sequenced exome sequences and I've gone through the process of calling SNPs using GATK and I've come up with a multi-sample VCF file.
Now we want to actually create alternative fasta references for each macaque at our loci of interest. I thought that Novocraft utility "novoutil iupac" would do exactly what I want it to do, which is (from the most recent version):
However, when I run novoutil iupac with my VCF file and the reference genome, I get the following error:
So it clearly doesn't like something about my VCF file, which looks something like this:
Does anyone have any experience with this sort of issue?
Thanks,
Sam
We are trying to find SNPs in a few loci of interest in rhesus macaque. To that end, we have sequenced exome sequences and I've gone through the process of calling SNPs using GATK and I've come up with a multi-sample VCF file.
Now we want to actually create alternative fasta references for each macaque at our loci of interest. I thought that Novocraft utility "novoutil iupac" would do exactly what I want it to do, which is (from the most recent version):
Code:
Novoutil iupac 1. Add option ( -g ) to substitute genotype for the reference base. Without the -g option ambiguous bases are formed by combining the reference base and the genotype. 2. Add option -i to substitute homozygous indel calls into the new reference. This is useful for reference guided assembly. 3. Add option -q 99 that sets a lower quality limit (default 30) on usable VCF data lines. 4. If the REF column of VCF file shows an N then IUPAC code is formed from the ALT alleles.
Code:
# novoutil iupac (V2.08.02) - Build Jun 28 2012 @ 16:23:10 # (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd. Loading VCF file... novoutil: src/vcfsnps.cc:152: bool vcffile::hetero(tabbed&): Assertion `*gt == '|' || *gt == '/'' failed.
Code:
##fileformat=VCFv4.1 ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions"> ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias"> ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias"> ##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[sorted_realigned_RG_20120712_unmasked_lane_1_index_3.bam, sorted_realigned_RG_20120712_unmasked_lane_1_index_8.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_1.bam, sorted_realigned_RG_20120712_unmasked_lane_2_index_2.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_6.bam, sorted_realigned_RG_20120712_unmasked_lane_5_index_9.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_4.bam, sorted_realigned_RG_20120712_unmasked_lane_6_index_5.bam, sorted_realigned_RG_20120712_unmasked_lane_3_index_7.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=30.0 noSLOD=false annotateNDA=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05 max_alternate_alleles=3 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedIndel=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false" ##contig=<ID=IgH1,length=610659> ##contig=<ID=IgH2,length=260169> ##contig=<ID=IgK1,length=977354> ##contig=<ID=IgK2,length=4152696> ##contig=<ID=IgL1,length=20629> ##contig=<ID=IgL2,length=7376870> ##contig=<ID=IgL3,length=2230> ##contig=<ID=IgL4,length=7654437> ##contig=<ID=IgL5,length=8461> ##contig=<ID=IgL6,length=8824> ##contig=<ID=IgL7,length=1082> ##contig=<ID=IgL8,length=4740> ##contig=<ID=IgL9,length=13131495> ##contig=<ID=TRA1,length=23168555> ##contig=<ID=TRB1,length=19618912> ##contig=<ID=TRD1,length=1136629> ##contig=<ID=TRG1,length=17450015> ##reference=file:///vrc1_data/douek_lab/reference_sets/novoalign/alt_macaque/rhesus_TR_IG_unmasked.fa #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT monkey1_3 monkey1_8 monkey2_1 monkey2_2 monkey3_7 monkey5_6 monkey5_9 monkey6_4 monkey6_5 TRG1 2855 . G T 75.71 PASS AC=6;AF=1.00;AN=6;DP=3;Dels=0.00;FS=0.000;HRun=3;HaplotypeScore=0.0000;MQ=41.02;MQ0=0;QD=25.24;SB=-0.02 GT:AD:DP:GQ:PL ./. 1/1:0,1:1:3.01:34,3,0 ./. 1/1:0,1:1:3.01:34,3,0 ./. ./. ./. ./. 1/1:0,1:1:3.01:42,3,0 TRG1 3864 . A G 42.75 LowQual AC=4;AF=0.50;AN=8;BaseQRankSum=0.000;DP=4;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ=150.00;MQ0=0;MQRankSum=-1.026;QD=21.37;ReadPosRankSum=1.026;SB=2.99 GT:AD:DP:GQ:PL ./. 1/1:0,1:1:3.01:42,3,0 ./. 1/1:0,1:1:3.01:42,3,0 0/0:1,0:1:3.01:0,3,41 ./. 0/0:1,0:1:3.01:0,3,42 ./. ./. TRG1 4614 . T C 110.43 PASS AC=7;AF=0.70;AN=10;BaseQRankSum=0.000;DP=6;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=101.00;MQ0=0;MQRankSum=1.380;QD=22.09;ReadPosRankSum=0.000;SB=-30.01 GT:AD:DP:GQ:PL ./. 0/1:1,1:2:32.56:35,0,33 1/1:0,1:1:3.01:41,3,0 ./. ./. 0/0:1,0:1:3.01:0,3,34 1/1:0,1:1:3.01:37,3,0 ./. 1/1:0,1:1:3.01:42,3,0 TRG1 4628 . C T 34.82 LowQual AC=3;AF=0.75;AN=4;BaseQRankSum=-0.736;DP=3;Dels=0.00;FS=0.000;HRun=2;HaplotypeScore=0.0000;MQ=51.96;MQ0=0;MQRankSum=0.736;QD=11.61;ReadPosRankSum=0.736;SB=-29.60 GT:AD:DP:GQ:PL ./. 0/1:1,1:2:35.21:35,0,35 ./. ./. ./. 1/1:0,1:1:3.01:34,3,0 ./. ./. ./. TRG1 5257 . T A,G 75.47 PASS AC=3,2;AF=0.38,0.25;AN=8;BaseQRankSum=0.622;DP=114;Dels=0.00;FS=0.000;HaplotypeScore=0.1835;MQ=13.79;MQ0=4;MQRankSum=-2.795;QD=1.54;ReadPosRankSum=-0.487;SB=-24.83 GT:AD:DP:GQ:PL ./. 1/1:2,4,14:21:3.01:34,3,0,34,3,34 ./. 0/0:2,2,10:14:3.01:0,3,34,3,34,34 0/1:1,5,14:20:25.32:73,0,25,76,31,107 2/2:2,3,3:8:3.01:42,42,42,3,3,0 ./. ./. ./. TRG1 5268 . C T 61.81 PASS AC=3;AF=0.30;AN=10;BaseQRankSum=-0.431;DP=99;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.5441;MQ=14.90;MQ0=2;MQRankSum=2.302;QD=1.63;ReadPosRankSum=0.825;SB=-24.38 GT:AD:DP:GQ:PL 0/0:11,1:12:3:0,3,25 1/1:16,1:18:3.01:34,3,0 ./. 0/0:9,0:10:3.01:0,3,34 0/1:17,3:20:25.32:69,0,25 0/0:8,0:8:3.01:0,3,42 ./. ./. ./. TRG1 5275 . G C 62.34 PASS AC=3;AF=0.30;AN=10;BaseQRankSum=-0.945;DP=56;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.0882;MQ=19.71;MQ0=1;MQRankSum=2.439;QD=2.97;ReadPosRankSum=0.154;SB=-24.38 GT:AD:DP:GQ:PL 0/0:6,1:8:3:0,3,25 1/1:9,1:11:3.01:34,3,0 ./. 0/0:9,0:9:3.01:0,3,34 0/1:8,2:10:25.32:70,0,25 0/0:5,0:5:3.01:0,3,41 ./. ./. ./.
Thanks,
Sam