Hello,
I simulated a mutated Drosophila Melanogaster genome from the reference one and then I used a sequencer simulator to produce reads on that genome (with potential sequencing errors). I launched BWA to map the reads on the reference and then I used samtools pileup for discovering the SNPs. Since my genome is simulated I can check if the found SNPs are correct.
The results were quite surprising since samtools found 149,126 SNPs whereas only 19,976 were actually sequenced. Even if samtools found 94.24% of the SNPs, 87.38% of the results were false positives.
For example on the chromosome 2L, in the 10,000 first bp, I should have 3 SNPs at positions 8405, 9049 and 9290.
Here is the content of the pileup file produced by samtools for this region:
To produce the pileup file, I launched the following commands:
Am I misinterpreting the pileup file or am I executing samtools in the wrong way?
I've read that topic. I tried VarScan which doesn't fit my needs. If I understood well, we have to tell it under which criteria a variation has to be considered as a SNP. I also tried mpileup. Surprisingly in the output file I only have indels, it didn't detect any substitution. And once again there are plenty of false positives among these indels.
If you have any idea of what's wrong, it would be very helpful to tell me
Thanks in advance!
I simulated a mutated Drosophila Melanogaster genome from the reference one and then I used a sequencer simulator to produce reads on that genome (with potential sequencing errors). I launched BWA to map the reads on the reference and then I used samtools pileup for discovering the SNPs. Since my genome is simulated I can check if the found SNPs are correct.
The results were quite surprising since samtools found 149,126 SNPs whereas only 19,976 were actually sequenced. Even if samtools found 94.24% of the SNPs, 87.38% of the results were false positives.
For example on the chromosome 2L, in the 10,000 first bp, I should have 3 SNPs at positions 8405, 9049 and 9290.
Here is the content of the pileup file produced by samtools for this region:
Code:
2L 8117 G A 33 33 37 2 A$a ~~ 2L 8118 T C 30 30 37 1 c$ ~ 2L 8225 C A 30 30 37 1 ^FA ~ 2L 8226 C A 30 30 37 1 A ~ 2L 8398 A G 54 54 37 9 GggGGGGgG ~~~~~~~~~ 2L 8552 A W 1 1 37 3 T,, ~~~ 2L 8556 C S 1 1 37 3 .,g ~~~ 2L 9049 C A 66 66 37 13 aaaAAaAAaaAAa ~~~~~~~~~~~~~ 2L 9051 A R 8 8 37 13 ,,,G.,..g,.., ~~~~~~~~~~~~~ 2L 9290 C A 81 81 37 18 AaaAaAaAAAAaAAAAAa ~~~~~~~~~~~~~~~~~~
Code:
samtools view -buSt dmel5.fai file.sam | samtools sort - file.sorted samtools pileup -vcf dmel5.fa file.sorted.bam > file.pileup
I've read that topic. I tried VarScan which doesn't fit my needs. If I understood well, we have to tell it under which criteria a variation has to be considered as a SNP. I also tried mpileup. Surprisingly in the output file I only have indels, it didn't detect any substitution. And once again there are plenty of false positives among these indels.
If you have any idea of what's wrong, it would be very helpful to tell me
Thanks in advance!
Comment