Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAMtools mpileup prediction - few nucleotides off

    Hi all,

    I am using samtools - mpileup (version 0.1.18) to predict SNP and indels in bacterial genomes. When applying this pipeline to artificially generated Illumina data (using dwgsim), it seems to me that mpilup is able to predict the region where indels appear, but is reporting in some cases the wrong indel. In total around 2000 indels are reported, and for ~ 300 of them, a behavior like below is observed.

    E.g. on position 3.211.144, dwgsim artificially introduced a deletion of 3 nucleotides (-GAG) in the reference sequence.

    However, mpileup is giving on position 3211140 as ref TGCCG and as alternative TG, which would result in a deletion of CCG:
    ref_id 3211140 . TGCCG TG 214 . INDEL;DP=97;VDB=0.0710;AF1=1;AC1=2;DP4=0,0,24,28;MQ=39;FQ=-192 GT:PL:GQ 1/1:255,157,0:99

    However, when looking at the alignment of the genes in the BAM file (with tview), this does not seem to be correct (see attached).

    It seems to me that samtools is predicting the indel correctly, but is rather making an error in reporting, by reporting an indel with the same size a few nucleotides upstream of the real indel location. Or am I missing something?

    Thank you in advance!

    Pieter
    Attached Files

  • #2
    On possibility; I'm wondering if all your software is on the same page with regard to what the reference sequence actually is.

    You know that samtools mpileup uses an index for your fasta, and if you changed your fasta, and didn't remake this index, that might explain why the software is confused about what the sequence exactly is there.

    So, try remaking that index with samtools faidx, then rerun samtools mpileup.

    Comment


    • #3
      Thanks for the suggestion!
      Tried to re-run the whole pipeline, starting from the BWA alignment (including the faidx indexing of the reference sequence), but no luck.
      Moreover, mpileup - bcf - varfilter is able to predict around 81% (1891 out of 2321) of the present indels correctly, and for around 15% it's able to predict the correct "region" (+- 5 nt), but does not report the correct indel.

      Seems that for these 15% indels, there are other indels reporterd in the neighborhood of the "reported" indel in the total mpileup-output (before running varFilter), and one of these other indels (not reported) seem to be the true one.
      A few lines of the complete mpileup-output is given below, the reported deletion is -GT (fourth line) while the correct one is -TT (last line). The wrong one might be selected because it has more reads supporting it (40 + 32 versus 38 + 30).

      Code:
      NC_000913       84533   .       G       .       209     .       DP=109;VDB=0.0392;AF1=0;AC1=0;DP4=52,55,0,2;MQ=48;FQ=-282;PV4=0.5,1,0.075,1     PL      0
      NC_000913       84533   .       GCT     G       54.5    .       INDEL;DP=109;VDB=0.0808;AF1=1;AC1=2;DP4=1,5,21,14;MQ=57;FQ=-63.5;PV4=0.08,0.27,1,1      GT:PL:GQ        1/1:95,29,0:55
      NC_000913       84534   .       C       .       209     .       DP=104;VDB=0.0318;AF1=0;AC1=0;DP4=50,52,1,1;MQ=48;FQ=-282;PV4=1,1,0.081,0.076   PL      0
      NC_000913       84534   .       CTGT    CT      214     .       INDEL;DP=104;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,40,32;MQ=52;FQ=-252 GT:PL:GQ        1/1:255,217,0:99
      NC_000913       84535   .       T       .       207     .       DP=96;VDB=0.0413;AF1=0;AC1=0;DP4=48,47,0,1;MQ=47;FQ=-282;PV4=1,0,0.17,0.15      PL      0
      NC_000913       84536   .       G       .       205     .       DP=96;VDB=0.0639;AF1=0;AC1=0;DP4=47,45,0,0;MQ=47;FQ=-282        PL      0
      NC_000913       84536   .       GTT     G       214     .       INDEL;DP=96;VDB=0.0777;AF1=1;AC1=2;DP4=0,0,38,30;MQ=52;FQ=-240  GT:PL:GQ        1/1:255,205,0:99

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Exploring the Dynamics of the Tumor Microenvironment
        by seqadmin




        The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
        07-08-2024, 03:19 PM
      • seqadmin
        Exploring Human Diversity Through Large-Scale Omics
        by seqadmin


        In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
        06-25-2024, 06:43 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Today, 07:20 AM
      0 responses
      15 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-16-2024, 05:49 AM
      0 responses
      35 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-15-2024, 06:53 AM
      0 responses
      38 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 07-10-2024, 07:30 AM
      0 responses
      41 views
      0 likes
      Last Post seqadmin  
      Working...
      X