  • dagarfield
    Member
    • Aug 2010
    • 39

    Calling tri-allelic SNPs using samtools (or similar)

    Hi folks,

    We work with sea urchin larvae in our lab. They are very, very, very tiny and, thus, we need to collect a whole bunch of them at a time to get sufficient starting material for NGS. Urchins are also highly polymorphic.

    RESULT: There are times in which some SNPs are effectively tri-allelic in a single sample, something that simply isn't ever going to happen if your sample consists of a happily diploid individual human (or medical model system of your choice).

    To see what happens when one has three alleles at a polymorphic site, I constructed a fake dataset (which I can provide) consisting of three reads from each of three different haplotypes. Using samtools mpileup, I can generate the following line for the base in question:

    Code:
    samtools mpileup -f mySeqs.fa combined.bam > combined.pileup
    
    dgarfield$ less combined.pileup | grep 124217
    Scaffold1200	124217	G	9	aaattt,,,	=========p
    Great, the program sees that there are three alleles at 124217
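For anyone who wants to check a pileup line like the one above programmatically, here is a minimal Python sketch that tallies alleles from the read-bases column. The function name `count_pileup_alleles` is made up for this illustration and is not part of samtools; the sketch assumes the usual pileup conventions (`.`/`,` for a reference match, letters for mismatches, `^` plus one character for a read start, `$` for a read end, `+n`/`-n` followed by n bases for indels).

```python
import re
from collections import Counter


def count_pileup_alleles(ref_base, read_bases):
    """Count the alleles observed in one mpileup read-bases column."""
    # Drop read-start markers ("^" plus one mapping-quality character)
    # and read-end markers ("$").
    cleaned = re.sub(r"\^.", "", read_bases).replace("$", "")
    counts = Counter()
    i = 0
    while i < len(cleaned):
        ch = cleaned[i]
        if ch in "+-":
            # Indel: sign, length, then that many bases. Skip all of it.
            match = re.match(r"\d+", cleaned[i + 1:])
            i += 1 + len(match.group()) + int(match.group())
            continue
        if ch in ".,":
            counts[ref_base.upper()] += 1  # matches the reference
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1  # mismatch, on either strand
        # "*" (deleted base) and any other symbol is ignored in this sketch
        i += 1
    return counts


counts = count_pileup_alleles("G", "aaattt,,,")
print(dict(counts))  # {'A': 3, 'T': 3, 'G': 3}
```

Run on the line for position 124217, this gives three reads each for A, T and the reference G, which matches what the pileup shows by eye.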

    Now, let's take a look at the results of bcftools view:

    Code:
    samtools mpileup -uf mySeqs.fa combined.bam > combined.pileup_u
    
    dgarfield$ bcftools view -cg combined.pileup_u | grep 124217
    Scaffold1200	124217	.	G	T	19.1	.	DP=9;AF1=0.5;CI95=0.5,0.5;DP4=0,3,0,6;MQ=60;FQ=19.1;PV4=1,1,1,1	GT:PL:GQ	0/1:49,0,49:49
    T? That was not what I was expecting. I was hoping for A,T,G

    That brings me to my two questions.

    1) Given the equal balance of alleles at SNP 124217, why does bcftools choose 'T'?
    2) Are there any situations in which bcftools can return more than two alleles at a single SNP?

    Any insights would be greatly appreciated.

    Thanks,

    David
  • dagarfield
    Member
    • Aug 2010
    • 39

    #2
    Here's a response I got from the samtools mailing list. It is not overly encouraging about using samtools for this problem. Any suggestions for other good SNP calling programs?

    You should have been hoping for "A,T" not "T" or "A,T,G" because G is the reference so not an alternate allele.

    But samtools and bcftools can't handle your situation. The current version always assumes the sample is diploid. I understand there is some experimentation at handling haploid samples (good for X and Y chromosomes as well as true haploid situations), but handling high ploidy/arbitrary mixtures is something else that needs its own calculations with a prior over distributions on the four nucleotides (or more if you want to consider overlapping indels).
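To illustrate why a diploid assumption struggles here, the sketch below scores every diploid genotype against the 3/3/3 read counts from the fake dataset. This is not the samtools/bcftools model: the function name, the flat `error_rate=0.01`, and the assumption that a wrong base is equally likely to be any of the other three are all choices made for this example.

```python
import math
from itertools import combinations_with_replacement


def diploid_log_likelihoods(counts, error_rate=0.01):
    """Log-likelihood of each diploid genotype given per-base read counts."""
    results = {}
    for a, b in combinations_with_replacement("ACGT", 2):
        loglik = 0.0
        for base, n in counts.items():
            # Probability of reading `base` from each of the two chromosome copies
            p_a = 1 - error_rate if base == a else error_rate / 3
            p_b = 1 - error_rate if base == b else error_rate / 3
            # A read comes from either copy with probability 1/2
            loglik += n * math.log(0.5 * p_a + 0.5 * p_b)
        results[a + b] = loglik
    return results


ll = diploid_log_likelihoods({"A": 3, "T": 3, "G": 3})
# Show the four best-scoring genotypes
for genotype, value in sorted(ll.items(), key=lambda kv: -kv[1])[:4]:
    print(genotype, round(value, 2))
```

Under this toy model the three heterozygous genotypes AG, AT and GT score the same (up to rounding) and well ahead of everything else, so any diploid call has to explain one third of the reads as errors. Which of the tied genotypes a real caller reports depends on details this sketch does not model.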


    • dagarfield
      Member
      • Aug 2010
      • 39

      #3
      Another response, from Heng Li on the samtools help list:

      Samtools-0.1.13 always assumes the sample is diploid, and on a diploid genome it is impossible to have three different alleles. Nonetheless, you may still see two alternative alleles in a single sample. This indicates that the sample has two alleles, both different from the reference.

      However, samtools does not handle triallelic alleles properly. Although you may occasionally see them in the VCF report, the QUAL and the GT are not computed in the proper way. Perhaps glfmultiples is better in this case. Note that glfmultiples also assumes the input is diploid. Multi-ploidy and multi-allele are two different issues.

      Heng


      • dagarfield
        Member
        • Aug 2010
        • 39

        #4
        Oh, all kinds of good things. More from Heng Li.

        Samtools is not designed for pooling experiments. There are a few callers designed for that, but I do not know which is the best. For estimating allele frequency from DNA pools, someone used to point me to:

        Recent statistical analyses suggest that sequencing of pooled samples provides a cost effective approach to determine genome-wide population genetic parameters. Here we introduce PoPoolation, a toolbox specifically designed for the population genetic analysis of sequence data from pooled individuals …


        I have never read the paper carefully, though.

        Heng
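As a rough idea of what estimating allele frequency from a pool involves at a single site, here is a naive Python sketch. It is not how PoPoolation or any other tool mentioned in this thread works; the function name and the `min_count` threshold (used to drop alleles seen so rarely that they may be sequencing errors) are assumptions for illustration only.

```python
def pooled_allele_frequencies(counts, min_count=2):
    """Naive per-site allele frequencies from read counts in a pooled sample."""
    # Discard alleles supported by fewer than `min_count` reads
    kept = {base: n for base, n in counts.items() if n >= min_count}
    depth = sum(kept.values())
    if depth == 0:
        return {}
    # Frequency is simply the share of retained reads carrying each allele
    return {base: n / depth for base, n in kept.items()}


# Three real alleles at three reads each, plus one stray C read
freqs = pooled_allele_frequencies({"A": 3, "T": 3, "G": 3, "C": 1})
for base, freq in freqs.items():
    print(base, round(freq, 3))
```

With only nine reads the estimates are very noisy, and read counts only track allele frequencies if each individual contributes a similar amount of DNA to the pool.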

        And from Manuel Rivas at the Broad

        Hello David,

        You can also use Syzygy for pooled data available at:



        Best,
        Manuel
        Manuel distinguishes between Syzygy and GATK:

        Syzygy is used for targeted sequencing applications (customized seq, pooled seq, and is applicable to whole exome sequencing as well) with both individual and pooled level data. For small genomes it would work well.

        GATK's functionality is for whole genome applications and whole exome applications with individual level data.
        Poking around a bit on the web, it seems like VARiD might be a good option for some kinds of reads, but I've not used it myself. I'd be keen to hear from anyone who knows how VARiD does with pooled samples.



        Happy Computing,

        DG
        Last edited by dagarfield; 03-11-2011, 07:11 AM.


        • Jose Blanca
          Member
          • Aug 2009
          • 70

          #5
          We work with mixed samples coming from different individuals and we have developed an SNP caller to work with them. You can take a look at:



          Best regards,

          Jose Blanca


          • mrxcm3
            Junior Member
            • Oct 2010
            • 9

            #6
            Your specific problem is not one that I have had to consider, but I have been working with anonymous (unbarcoded) sample pools. I have found these SNP callers useful:

            Syzygy
            FreeBayes
            VarScan

            My understanding of VARiD is that it was not suitable for my application (non-barcoded data) in that it treats DNA from the same read group as originating from the same sample. I may have this wrong though.

            Good Luck.

