  • markusli
    Junior Member
    • Feb 2015
    • 2

    Indels in bacterial RNA-seq data

    Hello everyone! It's my first time posting here.

    In the lab I'm at, we work mostly with bacteria (a bit of yeast here and there, but rarely), and I got involved in a project using Illumina RNA-seq data that consists of a decent amount of E. coli RNA sequences.

    Currently I'm trying to look for indels in these data, and it took me shamefully long to notice that VarScan, which I was using for variant calling, in fact reports only one alternate allele per position and somehow just loses the others. I only noticed because the reads supporting the alternate allele and those supporting the reference allele do not add up to the total coverage at many positions. I wasn't that surprised, as most tools are optimized (or in fact created solely) for analyzing human data, which is quite different from bacterial data, so one has to be careful when sticking bacteria where they don't belong.

    In my troubles I happened upon this site: http://www.oliverelliott.org/article...t_mpileup2vcf/ There, a man of knowledge was troubled by the same problem and wrote a C++ program that takes the output of samtools mpileup and turns it into a .vcf file from which I can sort out indels.
    Now, I'm not all that good at reading C++, so I wouldn't notice any mistakes and would have to rely on user feedback, but lo and behold, there is none!

    Has anyone ever seen or used this piece of software, or does anyone know of another tool that could help me in my troubles with my bacterial data?

    Now bear in mind that I got my BSc in biology just last spring and couldn't write even a line of Python for the life of me at the time, so you could say I'm quite wet behind the ears.
    Last edited by markusli; 02-26-2015, 11:10 AM.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Be very careful with that program, as its author fundamentally misunderstands what he's doing. BAM->pileup->VCF is absolutely NOT a format conversion. Actually, bcftools doesn't even call variants, samtools mpileup does, so he didn't even get that part right. The biggest mistake the author makes is believing that if a variant exists in a read, it should be in the VCF file. This is simply false. Typical NGS reads are full of sequencing mistakes, and you need multiple alignments giving evidence of a variant in order to reasonably declare that it actually exists. Not doing this is a complete fail.

    The reason many tools don't accurately report multiallelic variants is that they're designed around diploid organisms and use a model dependent on that. If you're doing an experiment that requires calling rare variants in pooled data, then use a tool intended for that (I don't know of any off-hand, but that's not what I work on).
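
    The point above about needing multiple alignments as evidence can be illustrated with a rough sketch. This is not how samtools, bcftools, or VarScan decide anything; the function name, the "ref" key, and both thresholds are made up for illustration, and a real caller also weighs base qualities, mapping qualities, and strand bias.

```python
def supported_alleles(allele_counts, min_reads=3, min_fraction=0.05):
    """Keep non-reference alleles backed by enough reads.

    allele_counts maps allele -> number of reads showing it; the key "ref"
    holds the reference-supporting count. Both thresholds are arbitrary
    illustrative values, not recommendations.
    """
    depth = sum(allele_counts.values())
    kept = {}
    for allele, n in allele_counts.items():
        if allele == "ref":
            continue
        # A single read carrying a variant is most likely a sequencing error,
        # so require both an absolute and a relative level of support.
        if depth > 0 and n >= min_reads and n / depth >= min_fraction:
            kept[allele] = n
    return kept


# Hypothetical counts at one position with depth 100.
counts = {"ref": 90, "+A": 8, "-CT": 1, "T": 1}
print(supported_alleles(counts))  # {'+A': 8}
```

    Note that more than one alternate allele can pass the filter at the same position, which is the situation a diploid model does not describe well.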

    • markusli
      Junior Member
      • Feb 2015
      • 2

      #3
      But he doesn't expect you to use it to somehow generate a VCF from a BAM file, but to pipe samtools mpileup into this tool and have it do the heavy lifting.
      It generates a VCF file that has total depth, reference-supporting read count, variant-supporting read count, and qualities for each reported position, and it doesn't really seem to just report all inconsistencies between the reads and the reference sequence.

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        The processing that samtools mpileup does before its output is piped into that script is non-existent. There are two ways of running mpileup:
        1. You can have mpileup call variants in VCF or BCF format
        2. You can have mpileup create a pileup (or mpileup) of each base

        His script processes the output of #2. VarScan takes the same input, but it doesn't just perform a trivial conversion of that to VCF either.
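
        To make option 2 concrete, here is a minimal sketch of tallying every allele in the read-bases column of a single pileup line, based on the pileup notation documented for samtools mpileup. The function name is hypothetical and the example string is made up. It only counts what the reads show; it does no quality filtering and is in no way a variant caller.

```python
from collections import Counter


def count_pileup_alleles(ref_base, bases):
    """Tally alleles in one pileup read-bases string.

    Indels are keyed as '+SEQ' or '-SEQ'. A read carrying an indel also
    contributes a base at the same position, so indel counts overlap with
    the base counts rather than adding to the depth.
    """
    counts = Counter()
    i, n = 0, len(bases)
    while i < n:
        c = bases[i]
        if c == "^":        # start of a read; the next character is its mapping quality
            i += 2
        elif c == "$":      # end of a read
            i += 1
        elif c in "+-":     # indel: sign, length in digits, then the sequence
            j = i + 1
            while j < n and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts[c + bases[j:j + length].upper()] += 1
            i = j + length
        elif c in ".,":     # match to the reference on the forward/reverse strand
            counts[ref_base.upper()] += 1
            i += 1
        elif c in "<>":     # reference skip (e.g. a spliced alignment)
            i += 1
        else:               # mismatching base, or '*' for a base deleted upstream
            counts[c.upper()] += 1
            i += 1
    return counts


# Made-up read-bases string for a position whose reference base is A.
print(count_pileup_alleles("A", ".,+2AT.-1C,^].$T"))
```

        Two different indels can show up at the same position this way, which is exactly the information that gets lost when only one alternate allele per position is reported.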
