Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • jmartin
    Member
    • Dec 2009
    • 78

    Understanding samtools mpileup consensus

    I've just produced consensus sequence for a pool of illumina reads mapped against a reference assembly using samtools mpileup. After extracting fasta from the vcf I'm presented with sequence for most (95%?) of the contigs from the reference assembly, but with a fair portion of that sequence appearing as ambiguous bases. The unambiguous portion mostly appears to be the exact reference, but at some positions appear the IUPAC code symbols that represent multiple possible bases.

    I have 3 basic questions:

    1) Are the regions of ambiguous bases in the output consensus areas where there were no queries that mapped? Or are they regions where my queries differed significantly from the reference?

    2) In the generated consensus, is everywhere I see an IUPAC multi-base nucleotide a SNP?

    3) For every contig I've checked, the output consensus has the proper number of ambiguous bases all the way up to the very start of the contig in the reference assembly. But on the tail end, sometimes the output consensus just stops early, short of the end that I see in the reference. Why is that?
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    You are using vcfutils.pl from samtools?

    Here's the relevant code from that program:



    Code:
    my %het = (AC=>'M', AG=>'R', AT=>'W', CA=>'M', CG=>'S', CT=>'Y',
    			 GA=>'R', GC=>'S', GT=>'K', TA=>'W', TC=>'Y', TG=>'K');
    
            $q = $1 if ($t[7] =~ /FQ=(-?[\d\.]+)/);
    	  if ($q < 0) {
    		$_ = ($t[7] =~ /AF1=([\d\.]+)/)? $1 : 0;
    		$b = ($_ < .5 || $alt eq '.')? $ref : $alt;
    		$q = -$q;
    	  } else {
    		$b = $het{"$ref$alt"};
    		$b ||= 'N';
    So if the FQ value is positive, and concatenating together the REF and the ALT makes a combination found in that %het, then W's and K's and whatever get put in the consensus.

    Comment

    • jmartin
      Member
      • Dec 2009
      • 78

      #3
      I am using vcfutils.pl vcf2fq to build fastq from the vcf. But I only have a tenuous grasp on the full extent of the code thats going on. Under what condition does an ambiguous base get used? Looking at that snippit, I'm not getting what $q is, so I wasn't able to follow why it goes into one branch over the other. I do get what is being assigned and why in both branches though.

      I guess I basically need to know when an 'n' is used in the consensus output.

      Comment

      • swbarnes2
        Senior Member
        • May 2008
        • 910

        #4
        Originally posted by jmartin View Post
        I am using vcfutils.pl vcf2fq to build fastq from the vcf. But I only have a tenuous grasp on the full extent of the code thats going on. Under what condition does an ambiguous base get used? Looking at that snippit, I'm not getting what $q is, so I wasn't able to follow why it goes into one branch over the other. I do get what is being assigned and why in both branches though.
        $q is the FQ value.

        I guess I basically need to know when an 'n' is used in the consensus output.
        Ah, that's different.

        First:

        Code:
        if ($t[1] - $last_pos > 1) {
        	  $seq .= 'n' x ($t[1] - $last_pos - 1);
        	  $qual .= '!' x ($t[1] - $last_pos - 1);
        	}
        So if there is for some reason a gap in your all-points vcf, it puts n's there.

        Or

        Code:
        else {
        		$b = $het{"$ref$alt"};
        		$b ||= 'N';
        	  }
        If REF concatenated with ALT isn't in that starting %het, then it puts an N instead.

        That seems to be the only two points in that script where N's or n's get used.

        Note that indels get handled differently, they are not put into the consensus, there is instead a window of lowercase letters around the putative indel.

        Comment

        • jmartin
          Member
          • Dec 2009
          • 78

          #5
          I get it now, thank you very much!

          Comment

          Latest Articles

          Collapse

          • GATTACAT
            Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by GATTACAT
            Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
            07-01-2026, 11:43 AM
          • SEQadmin2
            Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by SEQadmin2


            I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

            Here are nine questions we think about, in roughly the order they matter, before...
            06-18-2026, 07:11 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, Yesterday, 11:08 AM
          0 responses
          6 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-30-2026, 05:37 AM
          0 responses
          11 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-26-2026, 11:10 AM
          0 responses
          19 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-17-2026, 06:09 AM
          0 responses
          53 views
          0 reactions
          Last Post SEQadmin2  
          Working...