Hello-
Here's a tricky one...I've been beating my head for several days trying to figure out what's going on. Here's the project: we're looking at a nuclear gene panel of 198 genes, enriched using RainDance (so there's no risk capturing homologous genes outside the gene set...RainDance reported that their primer sets are all unique). We ran one sample per HiSeq lane.
That part was all easy. The alignment, which I expected to be straightforward has become a challenge. Aligning against the full human genome a lot of coverage is lost do regions of homology in the genome outside the gene set. To get around this problem and preserve genomic positions, I masked (replaced nucleotide with N) at nonamplified positions in the genome. The result is a genome full of N's except for the exons of the 198 genes of interest. The alignment (using BWA) works fine and looks great. However, variant detection has been an issue. I tried a variety of different settings, but ultimately settled on what's listed on the mpileup page:
1. samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
2. bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
The only modification is I set -D to something like 1M (I have very deep coverage and didn't want to rule anything out because of high coverage).
After running vcfutils there are zero variants detected. When I look at the raw bcf file there are >400000 variants. When I look at them, the reference allele at every position is 'N'. So, I expect some reference positions to be 'N' since since I masked the sequences under the primers, but these should be minimal to be sure. I double checked sequences as well by looking at the masked reference genome I built and even where the reference is not N, bcftools is reporting that it is.
Any ideas about what could be happening to make bcftools think every position is N? Any suggestions would be greatly appreciated...thanks.
Here's a tricky one...I've been beating my head for several days trying to figure out what's going on. Here's the project: we're looking at a nuclear gene panel of 198 genes, enriched using RainDance (so there's no risk capturing homologous genes outside the gene set...RainDance reported that their primer sets are all unique). We ran one sample per HiSeq lane.
That part was all easy. The alignment, which I expected to be straightforward has become a challenge. Aligning against the full human genome a lot of coverage is lost do regions of homology in the genome outside the gene set. To get around this problem and preserve genomic positions, I masked (replaced nucleotide with N) at nonamplified positions in the genome. The result is a genome full of N's except for the exons of the 198 genes of interest. The alignment (using BWA) works fine and looks great. However, variant detection has been an issue. I tried a variety of different settings, but ultimately settled on what's listed on the mpileup page:
1. samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
2. bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
The only modification is I set -D to something like 1M (I have very deep coverage and didn't want to rule anything out because of high coverage).
After running vcfutils there are zero variants detected. When I look at the raw bcf file there are >400000 variants. When I look at them, the reference allele at every position is 'N'. So, I expect some reference positions to be 'N' since since I masked the sequences under the primers, but these should be minimal to be sure. I double checked sequences as well by looking at the masked reference genome I built and even where the reference is not N, bcftools is reporting that it is.
Any ideas about what could be happening to make bcftools think every position is N? Any suggestions would be greatly appreciated...thanks.
Comment