Hi Everyone,
I am trying to generate a FASTA sequence from a BAM file using a BCF/VCF file as an intermediate. I would like to create a temporary BCF/VCF file from which I can make a FASTA sequence as well as estimate nucleotide polymorphism using the vcftools package.
Fasta sequences can only generated for a single sample at a time, so I need to create a BCF/VCF file with a single sample. To do this, I have two commands: one that converts from BAM to BCF/VCF and the other which converts fro BCF/VCF to FASTA. Frustratingly, when I select a single sample in the second command, my FASTA sequence contains lots of ambiguities. But when I select it in the first, it works just fine. I would like to create a single BCF/VCF file for a region of interest and then generate FASTA sequences for each sample, rather than having to make a BCF/VCF file for every sample (waste of space!).
Bear with me while I give each pair of commands with the FASTA formatted output. The first two use a VCF intermediate, the last two use a BCF.
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -g - > all.vcf
bcftools view -Ss mel.txt all.vcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>0:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -gs mel.txt - > mel.vcf
bcftools view -S tmp.vcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -bgs mel.txt - > mel.bcf
bcftools view mel.bcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATAAAAAGGAACTAGAATTTAATCCTAGATCTGAATAAGTATCAAAACAAG
ACCTCCGCGTACTTGCTAAGACTCGACATTTAAAAAACGTATTTCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -bg - > all.bcf
bcftools view -s mel.txt all.bcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
The third set of commands gives me the right output: no ambiguities, and I've check that this is 100% correct from an independent source.
Does anyone have any idea why this happens???
Thanks in advance.
Sarah
I am trying to generate a FASTA sequence from a BAM file using a BCF/VCF file as an intermediate. I would like to create a temporary BCF/VCF file from which I can make a FASTA sequence as well as estimate nucleotide polymorphism using the vcftools package.
Fasta sequences can only generated for a single sample at a time, so I need to create a BCF/VCF file with a single sample. To do this, I have two commands: one that converts from BAM to BCF/VCF and the other which converts fro BCF/VCF to FASTA. Frustratingly, when I select a single sample in the second command, my FASTA sequence contains lots of ambiguities. But when I select it in the first, it works just fine. I would like to create a single BCF/VCF file for a region of interest and then generate FASTA sequences for each sample, rather than having to make a BCF/VCF file for every sample (waste of space!).
Bear with me while I give each pair of commands with the FASTA formatted output. The first two use a VCF intermediate, the last two use a BCF.
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -g - > all.vcf
bcftools view -Ss mel.txt all.vcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>0:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -gs mel.txt - > mel.vcf
bcftools view -S tmp.vcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -bgs mel.txt - > mel.bcf
bcftools view mel.bcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATAAAAAGGAACTAGAATTTAATCCTAGATCTGAATAAGTATCAAAACAAG
ACCTCCGCGTACTTGCTAAGACTCGACATTTAAAAAACGTATTTCTTTTTAG
samtools mpileup -gIq20 -r 2L:53202-53313 -f /home/data/mau12/mau12.fa /home/data/maupoly/maupolyI.bam | bcftools view -bg - > all.bcf
bcftools view -s mel.txt all.bcf | vcfutils.pl vcf2fq | seqtk fq2fa - | seqtk cutN -
>2L:53202-53313
TGTAAGTAAATARAAAGGAACTAGAATTTAATCCTAGATCTGRATAAGTATSAAAACAAG
ACCTMCRSGTACTTGCTARRACTCGACATTYAAAAAAYRTATTKCTTTTTAG
The third set of commands gives me the right output: no ambiguities, and I've check that this is 100% correct from an independent source.
Does anyone have any idea why this happens???
Thanks in advance.
Sarah
Comment