I am trying to extract consensus DNA sequences for 144 short subsequences out of a large genome, with IUPAC ambiguity codes showing polymorphic sites based on a BCF file produced by samtools mpileup. I am using samtools version 1.1-42-g0cc7b47 and bcftools 1.1-151-g095b6f2, and trying to follow the HOWTO on the bcftools wiki (https://github.com/samtools/bcftools/wiki/).
I was not able to pipe output from 'samtools mpileup' directly into 'bcftools call' (as shown in the HOWTO); this produced error messages about the lack of an index for the file. Saving the BCF output from mpileup and indexing with tabix got past that error, and running 'bcftools call -R regions.bed' on the indexed BCF file produced a compressed VCF.gz file for the regions of interest. So far, so good.
Trying to use 'bcftools consensus -i' to produce consensus sequences only for the regions of interest, with IUPAC ambiguity codes at polymorphic sites, has not yet worked. I extracted FASTA sequences for the regions of interest from the genome assembly using BEDtools fastaFromBed, and used those sequences as input to bcftools consensus, along with the vcf.gz file:
bcftools consensus -i -f targetregions.fa -o output.consensus.fa targetregions.vcf.gz
This returns an error message:
"The fasta sequence does not match the REF allele at <chr: pos>:
.vcf: [C]
.vcf: [S] <- (ALT)
.fa: [G]"
S is the appropriate ambiguity code for a C:G polymorphism - how can I get this sequence information written to the output file?
If I take out the -i option to use IUPAC ambiguity codes, the error message then uses the nucleotides rather than the IUPAC codes, but still complains about the fact that they don't match, and still produces no output.
I was not able to pipe output from 'samtools mpileup' directly into 'bcftools call' (as shown in the HOWTO); this produced error messages about the lack of an index for the file. Saving the BCF output from mpileup and indexing with tabix got past that error, and running 'bcftools call -R regions.bed' on the indexed BCF file produced a compressed VCF.gz file for the regions of interest. So far, so good.
Trying to use 'bcftools consensus -i' to produce consensus sequences only for the regions of interest, with IUPAC ambiguity codes at polymorphic sites, has not yet worked. I extracted FASTA sequences for the regions of interest from the genome assembly using BEDtools fastaFromBed, and used those sequences as input to bcftools consensus, along with the vcf.gz file:
bcftools consensus -i -f targetregions.fa -o output.consensus.fa targetregions.vcf.gz
This returns an error message:
"The fasta sequence does not match the REF allele at <chr: pos>:
.vcf: [C]
.vcf: [S] <- (ALT)
.fa: [G]"
S is the appropriate ambiguity code for a C:G polymorphism - how can I get this sequence information written to the output file?
If I take out the -i option to use IUPAC ambiguity codes, the error message then uses the nucleotides rather than the IUPAC codes, but still complains about the fact that they don't match, and still produces no output.
Comment