I've just produced a consensus sequence for a pool of Illumina reads mapped against a reference assembly using samtools mpileup. After extracting FASTA from the VCF, I get sequence for most (95%?) of the contigs in the reference assembly, but a fair portion of that sequence consists of ambiguous bases. The unambiguous portion mostly appears to match the reference exactly, but some positions contain IUPAC code symbols that represent multiple possible bases.
I have 3 basic questions:
1) Are the regions of ambiguous bases in the output consensus areas where there were no queries that mapped? Or are they regions where my queries differed significantly from the reference?
2) In the generated consensus, is everywhere I see an IUPAC multi-base nucleotide a SNP?
3) For every contig I've checked, the output consensus has the proper number of ambiguous bases all the way up to the very start of the contig in the reference assembly. But on the tail end, sometimes the output consensus just stops early, short of the end that I see in the reference. Why is that?
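To put numbers on the above, here is a minimal sketch of how the ambiguous symbols could be tallied per contig, keeping N separate from the IUPAC codes that stand for two or three bases. It assumes the consensus is plain FASTA text; the function names `parse_fasta` and `tally` and the example records are made up for illustration, and the script only counts symbols without saying anything about why they are there.

```python
from collections import Counter

# IUPAC ambiguity codes (other than N) and the bases each one stands for.
IUPAC_MULTI = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tally(seq):
    """Count unambiguous bases, N, and multi-base IUPAC codes (case-insensitive)."""
    counts = Counter(seq.upper())
    return {
        "length": len(seq),
        "unambiguous": sum(counts[b] for b in "ACGT"),
        "n": counts["N"],
        "multi_base": sum(counts[c] for c in IUPAC_MULTI),
    }


if __name__ == "__main__":
    # Hypothetical records; in practice pass an open file handle instead.
    demo = [">contig1", "ACGTNNRY", "ACGT", ">contig2", "NNNN"]
    for contig, sequence in parse_fasta(demo):
        print(contig, tally(sequence))
```

Comparing the `length` reported for each contig against the length of the same contig in the reference would also show which contigs are cut short at the tail end.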