  • SAMtools mpileup prediction - few nucleotides off

    Hi all,

    I am using samtools mpileup (version 0.1.18) to predict SNPs and indels in bacterial genomes. When I apply this pipeline to artificially generated Illumina data (simulated with dwgsim), mpileup seems able to find the region where an indel occurs, but in some cases it reports the wrong indel. In total around 2000 indels are reported, and for ~300 of them I observe behavior like the example below.

    E.g. at position 3211144, dwgsim artificially introduced a deletion of 3 nucleotides (-GAG) in the reference sequence.

    However, at position 3211140 mpileup gives TGCCG as ref and TG as alternative, which would correspond to a deletion of CCG:
    ref_id 3211140 . TGCCG TG 214 . INDEL;DP=97;VDB=0.0710;AF1=1;AC1=2;DP4=0,0,24,28;MQ=39;FQ=-192 GT:PL:GQ 1/1:255,157,0:99
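
To make the comparison with the simulated indel explicit, the deleted bases and their coordinates can be derived from the POS, REF and ALT columns. The sketch below is only an illustration: `deleted_bases` is a hypothetical helper, and it assumes a pure deletion in which ALT is REF with some bases removed after a shared prefix. VCF POS refers to the first base of REF, so the deleted bases start a few positions after POS.

```python
def deleted_bases(pos, ref, alt):
    """Return (first_deleted_pos, deleted_sequence) for a simple deletion.

    pos is the 1-based VCF POS, i.e. the coordinate of the first base of ref.
    Assumes a pure deletion: alt equals ref with one contiguous run removed.
    """
    if len(alt) >= len(ref):
        raise ValueError("not a deletion: ALT is not shorter than REF")
    # Count the bases shared at the start of REF and ALT.
    shared = 0
    while shared < len(alt) and ref[shared] == alt[shared]:
        shared += 1
    n_deleted = len(ref) - len(alt)
    # Whatever follows the deleted run in REF must match the rest of ALT.
    if ref[shared + n_deleted:] != alt[shared:]:
        raise ValueError("record is not a pure deletion")
    return pos + shared, ref[shared:shared + n_deleted]


# The record quoted above: REF=TGCCG, ALT=TG at POS 3211140.
first_pos, bases = deleted_bases(3211140, "TGCCG", "TG")
print(first_pos, bases)
```

For this record the helper gives CCG starting at 3211142, so the called deletion covers 3211142-3211144, while the simulated -GAG is listed at 3211144.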

    However, when I look at the alignment in the BAM file (with tview), this does not seem to be correct (see attached).

    It seems to me that samtools detects the indel correctly but makes an error in reporting it: it reports an indel of the same size a few nucleotides upstream of the real indel location. Or am I missing something?

    Thank you in advance!


  • #2
    One possibility: I'm wondering whether all your software is on the same page with regard to what the reference sequence actually is.

    samtools mpileup uses an index of your fasta. If you changed the fasta and didn't remake this index, that might explain why the software is confused about what the sequence is at that position.

    So, try remaking that index with samtools faidx, then rerun samtools mpileup.
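
A quick way to check for this situation is to compare file modification times. The sketch below is a heuristic only: `fai_is_stale` is a hypothetical helper, and it assumes the index sits next to the FASTA with `.fai` appended to the file name, which is where `samtools faidx` writes it. A newer index does not prove the two files match, so re-indexing is still the safe option.

```python
import os


def fai_is_stale(fasta_path):
    """Return True if the .fai index is missing or older than the FASTA."""
    fai_path = fasta_path + ".fai"
    if not os.path.exists(fai_path):
        return True
    # An index written before the last FASTA modification is suspect.
    return os.path.getmtime(fai_path) < os.path.getmtime(fasta_path)
```

If this returns True for your reference, rebuild the index before rerunning mpileup.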


    • #3
      Thanks for the suggestion!
      I re-ran the whole pipeline, starting from the BWA alignment (including the faidx indexing of the reference sequence), but no luck.
      Moreover, the mpileup / bcftools / varFilter pipeline predicts around 81% (1891 out of 2321) of the simulated indels correctly. For around 15% it finds the correct "region" (+- 5 nt) but does not report the correct indel.
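
The kind of tally described here can be sketched as follows. This is an illustration under assumptions, not the script actually used: `classify_calls` is a hypothetical helper, and it assumes the simulated indels have already been converted to the same (pos, ref, alt) representation as the calls, which in practice is the hard part.

```python
def classify_calls(truth, calls, window=5):
    """Count simulated indels that are called exactly, nearby, or not at all.

    truth and calls are lists of (pos, ref, alt) tuples in the same
    representation. "nearby" means some call lies within +- window nt.
    """
    call_set = set(calls)
    counts = {"exact": 0, "nearby": 0, "missed": 0}
    for pos, ref, alt in truth:
        if (pos, ref, alt) in call_set:
            counts["exact"] += 1
        elif any(abs(call_pos - pos) <= window for call_pos, _, _ in calls):
            counts["nearby"] += 1
        else:
            counts["missed"] += 1
    return counts


# Toy data, made up for illustration only.
truth = [(100, "GCT", "G"), (200, "CTGT", "CT"), (300, "AT", "A")]
calls = [(100, "GCT", "G"), (198, "GTT", "G")]
print(classify_calls(truth, calls))

# The exact-match fraction quoted above: 1891 of 2321.
print(f"{1891 / 2321:.1%}")
```

Note that an exact string match on (pos, ref, alt) is strict: the same deletion written with a different anchor position would land in the "nearby" bin.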

      It seems that for these 15% of indels, the full mpileup output (before running varFilter) contains other indel candidates in the neighborhood of the "reported" indel, and one of these unreported candidates seems to be the true one.
      A few lines of the complete mpileup output are given below. The reported deletion is -GT (fourth line) while the correct one is -TT (last line). The wrong one might be selected because it has more reads supporting it (40 + 32 versus 38 + 30).

      NC_000913       84533   .       G       .       209     .       DP=109;VDB=0.0392;AF1=0;AC1=0;DP4=52,55,0,2;MQ=48;FQ=-282;PV4=0.5,1,0.075,1     PL      0
      NC_000913       84533   .       GCT     G       54.5    .       INDEL;DP=109;VDB=0.0808;AF1=1;AC1=2;DP4=1,5,21,14;MQ=57;FQ=-63.5;PV4=0.08,0.27,1,1      GT:PL:GQ        1/1:95,29,0:55
      NC_000913       84534   .       C       .       209     .       DP=104;VDB=0.0318;AF1=0;AC1=0;DP4=50,52,1,1;MQ=48;FQ=-282;PV4=1,1,0.081,0.076   PL      0
      NC_000913       84534   .       CTGT    CT      214     .       INDEL;DP=104;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,40,32;MQ=52;FQ=-252 GT:PL:GQ        1/1:255,217,0:99
      NC_000913       84535   .       T       .       207     .       DP=96;VDB=0.0413;AF1=0;AC1=0;DP4=48,47,0,1;MQ=47;FQ=-282;PV4=1,0,0.17,0.15      PL      0
      NC_000913       84536   .       G       .       205     .       DP=96;VDB=0.0639;AF1=0;AC1=0;DP4=47,45,0,0;MQ=47;FQ=-282        PL      0
      NC_000913       84536   .       GTT     G       214     .       INDEL;DP=96;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,38,30;MQ=52;FQ=-240  GT:PL:GQ        1/1:255,205,0:99
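
The two candidates can be compared directly from the lines above. The sketch below is an illustration with hypothetical helpers (`alt_support`, `apply_record`): it sums the alt-supporting reads from DP4 (ref-forward, ref-reverse, alt-forward, alt-reverse) and applies each record to the reference window 84533-84538, which is reconstructed here from the REF columns of the quoted records rather than read from the FASTA.

```python
def alt_support(info):
    """Sum forward and reverse alt-supporting reads from the DP4 INFO entry."""
    for field in info.split(";"):
        if field.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, field[4:].split(","))
            return alt_fwd + alt_rev
    raise ValueError("no DP4 field in INFO")


def apply_record(window, window_start, pos, ref, alt):
    """Apply one REF->ALT record to a reference window (1-based coordinates)."""
    i = pos - window_start
    if window[i:i + len(ref)] != ref:
        raise ValueError("REF does not match the window")
    return window[:i] + alt + window[i + len(ref):]


# Reference bases 84533-84538 as implied by the REF columns quoted above.
WINDOW, START = "GCTGTT", 84533

reported_info = "INDEL;DP=104;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,40,32;MQ=52;FQ=-252"
expected_info = "INDEL;DP=96;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,38,30;MQ=52;FQ=-240"

reported_hap = apply_record(WINDOW, START, 84534, "CTGT", "CT")
expected_hap = apply_record(WINDOW, START, 84536, "GTT", "G")

print(alt_support(reported_info), alt_support(expected_info))
print(reported_hap, expected_hap, reported_hap == expected_hap)
```

Within this window the two records produce different sequences (GCTT versus GCTG), so they are not the same deletion written at a different position; they are competing candidates, with 72 versus 68 supporting reads.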

