No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAMtools mpileup prediction - few nucleotides off

    Hi all,

    I am using samtools - mpileup (version 0.1.18) to predict SNP and indels in bacterial genomes. When applying this pipeline to artificially generated Illumina data (using dwgsim), it seems to me that mpilup is able to predict the region where indels appear, but is reporting in some cases the wrong indel. In total around 2000 indels are reported, and for ~ 300 of them, a behavior like below is observed.

    E.g. on position 3.211.144, dwgsim artificially introduced a deletion of 3 nucleotides (-GAG) in the reference sequence.

    However, mpileup is giving on position 3211140 as ref TGCCG and as alternative TG, which would result in a deletion of CCG:
    ref_id 3211140 . TGCCG TG 214 . INDEL;DP=97;VDB=0.0710;AF1=1;AC1=2;DP4=0,0,24,28;MQ=39;FQ=-192 GT:PL:GQ 1/1:255,157,0:99

    However, when looking at the alignment of the genes in the BAM file (with tview), this does not seem to be correct (see attached).

    It seems to me that samtools is predicting the indel correctly, but is rather making an error in reporting, by reporting an indel with the same size a few nucleotides upstream of the real indel location. Or am I missing something?

    Thank you in advance!

    Attached Files

  • #2
    On possibility; I'm wondering if all your software is on the same page with regard to what the reference sequence actually is.

    You know that samtools mpileup uses an index for your fasta, and if you changed your fasta, and didn't remake this index, that might explain why the software is confused about what the sequence exactly is there.

    So, try remaking that index with samtools faidx, then rerun samtools mpileup.


    • #3
      Thanks for the suggestion!
      Tried to re-run the whole pipeline, starting from the BWA alignment (including the faidx indexing of the reference sequence), but no luck.
      Moreover, mpileup - bcf - varfilter is able to predict around 81% (1891 out of 2321) of the present indels correctly, and for around 15% it's able to predict the correct "region" (+- 5 nt), but does not report the correct indel.

      Seems that for these 15% indels, there are other indels reporterd in the neighborhood of the "reported" indel in the total mpileup-output (before running varFilter), and one of these other indels (not reported) seem to be the true one.
      A few lines of the complete mpileup-output is given below, the reported deletion is -GT (fourth line) while the correct one is -TT (last line). The wrong one might be selected because it has more reads supporting it (40 + 32 versus 38 + 30).

      NC_000913       84533   .       G       .       209     .       DP=109;VDB=0.0392;AF1=0;AC1=0;DP4=52,55,0,2;MQ=48;FQ=-282;PV4=0.5,1,0.075,1     PL      0
      NC_000913       84533   .       GCT     G       54.5    .       INDEL;DP=109;VDB=0.0808;AF1=1;AC1=2;DP4=1,5,21,14;MQ=57;FQ=-63.5;PV4=0.08,0.27,1,1      GT:PL:GQ        1/1:95,29,0:55
      NC_000913       84534   .       C       .       209     .       DP=104;VDB=0.0318;AF1=0;AC1=0;DP4=50,52,1,1;MQ=48;FQ=-282;PV4=1,1,0.081,0.076   PL      0
      NC_000913       84534   .       CTGT    CT      214     .       INDEL;DP=104;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,40,32;MQ=52;FQ=-252 GT:PL:GQ        1/1:255,217,0:99
      NC_000913       84535   .       T       .       207     .       DP=96;VDB=0.0413;AF1=0;AC1=0;DP4=48,47,0,1;MQ=47;FQ=-282;PV4=1,0,0.17,0.15      PL      0
      NC_000913       84536   .       G       .       205     .       DP=96;VDB=0.0639;AF1=0;AC1=0;DP4=47,45,0,0;MQ=47;FQ=-282        PL      0
      NC_000913       84536   .       GTT     G       214     .       INDEL;DP=96;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,38,30;MQ=52;FQ=-240  GT:PL:GQ        1/1:255,205,0:99


      Latest Articles


      • seqadmin
        Multiomics Techniques Advancing Disease Research
        by seqadmin

        New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

        A major leap in the field has
        02-08-2024, 06:33 AM
      • seqadmin
        The 3D Genome: New Technologies and Emerging Insights
        by seqadmin

        The study of three-dimensional (3D) genomics explores the spatial structure of genomes and their role in processes like gene expression and DNA replication. By employing innovative technologies, researchers can study these arrangements to discover their role in various biological processes. Scientists continue to find new ways in which the organization of DNA is involved in processes like development1 and disease2.

        Basic Organization and Structure
        01-22-2024, 03:25 PM





      Topics Statistics Last Post
      Started by seqadmin, Today, 08:52 AM
      0 responses
      Last Post seqadmin  
      Started by seqadmin, Yesterday, 08:57 AM
      0 responses
      Last Post seqadmin  
      Started by seqadmin, 02-14-2024, 09:19 AM
      0 responses
      Last Post seqadmin  
      Started by seqadmin, 02-12-2024, 03:37 PM
      0 responses
      Last Post seqadmin