Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • rwhet052
    Junior Member
    • Jan 2011
    • 8

    bcftools consensus from BCF/VCF

    I am trying to extract consensus DNA sequences for 144 short subsequences out of a large genome, with IUPAC ambiguity codes showing polymorphic sites based on a BCF file produced by samtools mpileup. I am using samtools version 1.1-42-g0cc7b47 and bcftools 1.1-151-g095b6f2, and trying to follow the HOWTO on the bcftools wiki (https://github.com/samtools/bcftools/wiki/).

    I was not able to pipe output from 'samtools mpileup' directly into 'bcftools call' (as shown in the HOWTO); this produced error messages about the lack of an index for the file. Saving the BCF output from mpileup and indexing with tabix got past that error, and running 'bcftools call -R regions.bed' on the indexed BCF file produced a compressed VCF.gz file for the regions of interest. So far, so good.

    Trying to use 'bcftools consensus -i' to produce consensus sequences only for the regions of interest, with IUPAC ambiguity codes at polymorphic sites, has not yet worked. I extracted FASTA sequences for the regions of interest from the genome assembly using BEDtools fastaFromBed, and used those sequences as input to bcftools consensus, along with the vcf.gz file:
    bcftools consensus -i -f targetregions.fa -o output.consensus.fa targetregions.vcf.gz
    This returns an error message:
    "The fasta sequence does not match the REF allele at <chr: pos>:
    .vcf: [C]
    .vcf: [S] <- (ALT)
    .fa: [G]"

    S is the appropriate ambiguity code for a C:G polymorphism - how can I get this sequence information written to the output file?

    If I take out the -i option to use IUPAC ambiguity codes, the error message then uses the nucleotides rather than the IUPAC codes, but still complains about the fact that they don't match, and still produces no output.
  • rwhet052
    Junior Member
    • Jan 2011
    • 8

    #2
    bcftools consensus from BCF/VCF

    Update - further exploration found a method that works, although it may not be optimal.
    The sequence of steps is different from that outlined in the original query.

    # New version of samtools view can subset a BAM file based on regions in a BED file
    samtools view -b -o output_1.bam -L regions.bed input.bam

    # Sort and index output BAM file:
    samtools sort -m 30G -O 'bam' -o sorted.output_1.bam -@ 6 -T tmp.sort output_1.bam
    samtools index sorted.output_1.bam

    # Call SNPs from subset BAM file using mpileup, limiting pileup to regions in regions.bed
    samtools mpileup -l regions.bed -f ref.fa -go output.bcf sorted.output_1.bam

    # Index the BCF file with tabix, then use bcftools call to call SNPs:
    tabix output.bcf
    bcftools call -mv -Oz -o output.calls.vcf.gz output.bcf

    # Index the output VCF file with tabix, then call the consensus sequences for the reference genome assembly:
    tabix output.calls.vcf.gz
    bcftools consensus -f ref.fa -i output.calls.vcf.gz > all.cns.fa

    # Extract the regions of interest from all.cns.fa using BEDtools fastaFromBed:
    ~/software/bedtools/bin/fastaFromBed -fi all.cns.fa -bed regions.bed -fo regions.cns.fa

    Comment

    Latest Articles

    Collapse

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by SEQadmin2, Today, 11:58 AM
    0 responses
    9 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-05-2026, 10:09 AM
    0 responses
    25 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-04-2026, 08:59 AM
    0 responses
    35 views
    0 reactions
    Last Post SEQadmin2  
    Started by SEQadmin2, 06-02-2026, 12:03 PM
    0 responses
    56 views
    0 reactions
    Last Post SEQadmin2  
    Working...