Hi
I am trying to build a consensus sequence for the chloroplast from a BAM file of reads that aligned to the reference chloroplast. I will ultimately repeat this process for many populations and use the consensus sequences in a phylogenetic analysis.
After reading several threads online I'm using the following commands:
However, I still have questions about a few of the flags/settings in samtools mpileup and bcftools.
Samtools mpileup:
1) Does setting -d to 5000 mean that only information from 5000 reads is considered for the output?
2) Am I correct that using -E means that BAQ scores for each base are calculated anew and scores associated with the BAM file are ignored? What is the advantage of this?
3) Should I be setting -Q in order to avoid information from reads that have a low BAQ at the position in question? How does one determine a cutoff for -Q?
4) The -F, -L and -m flags deal with indels. I do not expect many of these given how close my species is to the reference. Should I be setting -m higher than the default in that case? I really don't understand what -F does... can someone please clarify?
Bcftools/vcfutils:
1) Does the -g setting force the algorithm to call a single base?
2) What settings do I need to use to include ambiguity codes? My samples are pools of individuals, so I would like my consensus to include the IUPAC ambiguity codes in cases where some reads have one base and others have another.
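To make point 2 concrete, here is a minimal Python sketch of the behaviour I am after at a single position. It is not how samtools or bcftools work internally; the function name `iupac_consensus_base` and the `min_fraction` threshold are hypothetical placeholders I made up for illustration.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

VALID_BASES = set("ACGT")


def iupac_consensus_base(bases, min_fraction=0.2):
    """Return one IUPAC code for the read bases observed at a single position.

    `min_fraction` is an assumed cutoff: a base must make up at least this
    share of the usable reads to be included in the ambiguity code.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    total = sum(counts.values())
    if total == 0:
        return "N"  # no usable reads at this position
    kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
    # If the cutoff is so strict that no base passes, fall back to N.
    return IUPAC.get(kept, "N")
```

For example, a position covered by five A reads and five G reads would give `R`, while nine A reads and one G read would give plain `A` with the 0.2 cutoff assumed above.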
Sorry for all the questions: I've had a hard time finding clear answers/documentation on this topic. If I've overlooked something please let me know!
Thanks!