  • Calling variants when mapping reads from different species than reference

    Hi all,

    My SNP calling pipeline seems to perform poorly when mapping reads from a sample that belongs to a species related to, but different from, the species of the reference assembly. (This sort of thing is common in plant breeding.) Obviously, the more variants that differentiate reference and sample, the harder it is to map reads correctly, so it makes sense that this is harder than finding intraspecies SNPs.

    Nonetheless, I am 100% confident it can do better than it is doing. I can see from the BAM that reads are mapping to a given region, yet most of the time my SNP caller (the last free version of GATK) fails to emit genotypes, even for sites that simply match the reference. Interestingly, the few times that it does emit a genotype, it is heavily biased towards A and T calls.

    Is any of this a classic symptom of a parameter I need to adjust, or a sign that I need to switch SNP callers? Can anyone cite literature that compares SNP callers' performance (or influence of different parameters) in this particular challenging task? I haven't had much luck searching.

    Thanks!

    Jonathan
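
One way to put a number on the A/T bias described above is to tally the ALT alleles of the emitted SNP records. This is only a minimal sketch, assuming an uncompressed VCF with the standard column order (REF in column 4, ALT in column 5); the function name `count_alt_bases` and the file name `vars.vcf` are hypothetical and not part of any pipeline mentioned in this thread.

```shell
# Sketch: count how often each base appears as the ALT allele of a SNP record.
# Assumes an uncompressed, tab-delimited VCF; indels and multi-allelic
# records are skipped because REF or ALT is not a single base.
count_alt_bases() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        length($4) == 1 && $5 ~ /^[ACGT]$/ { n[$5]++ }  # single-base REF and ALT only
        END { for (b in n) print b, n[b] }
    ' "$1" | sort
}

# Usage (hypothetical file name):
# count_alt_bases vars.vcf
```

If A and T dominate the counts far beyond what the base composition of the reference would suggest, that supports the suspicion that the bias comes from the mapping or calling step rather than from the biology.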

  • #2
    I don't have any papers, but I'd be interested in seeing whether you get better results using BBMap for mapping and BBMap's variant caller (CallVariants) for variant calling. BBMap can align short reads at quite low identity, so it's good for cross-species alignment (you can increase sensitivity with the "minid" flag). Assuming you have paired interleaved reads in a single file, and starting with the raw reads, the commands would be something like:

    Code:
    bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
    bbmap.sh in=clean.fq ref=ref.fa out=mapped.sam
    callvariants.sh in=mapped.sam ref=ref.fa ploidy=2 out=vars.vcf

    For multiple samples, the callvariants command would be:

    Code:
    callvariants.sh in=sample1.sam,sample2.sam,sample3.sam ref=ref.fa ploidy=2 out=vars.vcf multisample

    At JGI we are using CallVariants a lot these days for various strains mapped to a divergent reference (largely Aspergillus, E. coli, and some plants from cross-breeding experiments). I don't really do variant calling in production any more (just for testing and development), but the people who do at JGI switched to CallVariants because, as they told me, it performs much better than their prior pipelines (GATK, FreeBayes, and a couple of others), particularly for variants with low allele fractions (using the "rarity" flag). In general I think it's best to use the defaults, though of course you need to specify ploidy correctly.
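
The "minid" flag mentioned above could be added to the mapping step roughly as follows. This is a sketch only: the wrapper name `map_divergent` is hypothetical, and the value 0.6 is an illustrative assumption rather than a setting recommended in this thread. Lower values ask BBMap to look for lower-identity alignments, at the cost of speed.

```shell
# Sketch: wrap the bbmap.sh call from the post and pass an explicit minid.
# The default of 0.6 below is an illustrative assumption; tune it to the
# expected divergence between sample and reference.
map_divergent() {
    reads="$1"
    ref="$2"
    out="$3"
    minid="${4:-0.6}"
    bbmap.sh in="$reads" ref="$ref" out="$out" minid="$minid"
}

# Usage (requires BBMap on PATH):
# map_divergent clean.fq ref.fa mapped.sam 0.6
```

The rest of the pipeline (bbduk.sh before, callvariants.sh after) stays as written in the commands above.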

    • #3
      Brian-- Thanks. I'll give BBMap a shot and will certainly report success if I find it!

      • #4
        Great - I look forward to your feedback.
