  • Indel detection in NGS high coverage amplicons

    I am attempting to detect indels from a panel of clones resulting from CRISPR targeted deletion. Regions around the target were PCR amplified to produce a roughly 160bp amplicon, which was then sequenced as a PE150 run.
    I've been banging my head against finding a tool that can:

    1) Detect which clones have indels
    2) Identify the location of these indels, ideally in a VCF or similar file such that the full panel of 96 clones can be visualized (e.g. in IGV).
    3) Provide annotation details about the read depth at the indel position and percentage of the sequences that contain an indel.
    4) Is a stand-alone tool that I can install on my Unix box.

    This question is very similar to one asked back in 2012, which was never sufficiently answered.


    So far, I have tried a number of methods, each of which has some key failings.

    CRISPR Genome Analyzer:
    crispr-ga.net
    Doesn't localize the indel detected. Only available as a web app; I need something that I can install locally and hammer on.

    Outknocker:
    Doesn't give localizations. Web based.

    Pindel:
    Pindel ends up missing a number of indels that are clearly there when viewing the reads from the BAM file. Also, pindel format is not good for looking across many clones, and the conversion pindel-to-vcf loses any depth and read-percentage information.

    GATK HaplotypeCaller:
    Misses a lot of very obvious indels. The caller appears to not be optimized for non-tiled non-fragmented reads.

    GASVPRO and SVseq:
    Simply don't detect any indels where they clearly exist. Again, the callers seem non-optimized for amplicon sequencing.

    XHMM:
    Refuses to do the PCA normalization with fewer targets than samples.

    CONTRA:
    Doesn't deal well with small analysis windows for detecting indels shorter than 11bp.

    CRISPR-GA uses a BLAT based alignment for their backend, but I can't find any additional information on how I would do further indel detection after a basic BLAT alignment. Any ideas on a BLAT based indel calling solution?
    This seems like it should be a dead simple problem, something that was solved back in the 80's, but I can't find a good reference to a tool to use. Any help is appreciated.

    Thanks

  • #2
    I have the same experience as you - all popular variant calling tools out there are optimized for shotgun libraries. For targeted resequencing with PCR amplicons I do a pileup with samtools and parse the output, calling variants at suitable thresholds.
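The pileup-and-threshold approach described above can be sketched roughly as follows. This is not the poster's actual script; it is a minimal illustration that assumes the standard text output of `samtools mpileup` (tab-separated chromosome, position, reference base, depth, read bases, qualities). The function names and the `min_depth` / `min_fraction` thresholds are hypothetical and would need tuning for real amplicon data.

```python
from collections import Counter


def count_indels(bases):
    """Count indel events in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # '^' marks a read start and is followed by one mapping-quality character.
            i += 2
        elif ch in "+-":
            # Indels are written as +<len><seq> or -<len><seq>, e.g. -3GTA.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts[ch + bases[j:j + length].upper()] += 1
            i = j + length
        else:
            i += 1
    return counts


def call_indels(pileup_line, min_depth=20, min_fraction=0.1):
    """Return (chrom, pos, indel, depth, fraction) for indels passing both thresholds.

    The thresholds are illustrative assumptions, not recommended values.
    In pileup output the indel follows the reported position.
    """
    chrom, pos, _ref, depth, bases = pileup_line.rstrip("\n").split("\t")[:5]
    depth = int(depth)
    if depth < min_depth:
        return []
    calls = []
    for indel, n in count_indels(bases).items():
        fraction = n / depth
        if fraction >= min_fraction:
            calls.append((chrom, int(pos), indel, depth, fraction))
    return calls
```

Each line of the pileup would be passed through `call_indels`, which gives the depth at the indel position and the fraction of reads supporting it, the two annotations asked for in the original question.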



    • #3
      Pindel is able to call indels from many samples at the same time and give you an indication of which samples are the carriers. The new pindel2vcf gives you a clean way to view the result with the numbers of reads supporting ref and alt alleles.

      You could put a list of BAM files in the config file and run all samples together.
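As a rough sketch of the config-file suggestion above: to my understanding, Pindel's BAM config file has one line per sample with the BAM path, the expected insert size, and a sample label, separated by whitespace. The helper name below is hypothetical, and the default insert size of 160 is only an assumption taken from the amplicon length mentioned in the first post.

```python
from pathlib import Path


def pindel_config_lines(bam_paths, insert_size=160):
    """Build one config line per BAM: path, insert size, sample label.

    The sample label is derived from the file name (hypothetical convention);
    insert_size=160 is an assumption based on the ~160bp amplicon.
    """
    lines = []
    for bam in bam_paths:
        sample = Path(bam).stem  # "bams/clone_A01.bam" -> "clone_A01"
        lines.append(f"{bam}\t{insert_size}\t{sample}")
    return lines
```

Writing `"\n".join(pindel_config_lines(bams))` to a text file would give a single config covering the whole panel of clones.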



      • #4
        Just wondering if you've tried Geneious? It's commercial software, so maybe that doesn't meet your requirements, and I'm biased since I wrote the variant caller in it, but I've never seen it fail to call an obvious variant.

        If you're willing to share your data along with the locations of some obvious indels that aren't getting called, then I can run it through Geneious for you and let you know how it goes.



        • #5
          Well, after much frustration, I stumbled onto the package freebayes "Bayesian haplotype-based polymorphism discovery and genotyping"


          The caller seems to work well on amplicon data and ended up producing the cleanest and most complete VCF file (with ref and alt allele frequencies).
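For anyone wanting to pull the depth and alt-allele fraction out of a freebayes VCF, here is a minimal sketch. It assumes the `DP` (total depth) and `AO` (alternate observation count, one value per ALT allele) INFO fields that freebayes writes; the function name is hypothetical, and treating any ALT whose length differs from REF as an indel is a simplification that also picks up complex alleles.

```python
def indel_summary(vcf_line):
    """Summarize indel alleles on one VCF data line using INFO DP and AO."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # INFO is a ';'-separated list of KEY=VALUE items (flags without '=' are skipped).
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    depth = int(info["DP"])
    alt_counts = [int(x) for x in info["AO"].split(",")]
    rows = []
    for allele, count in zip(alt.split(","), alt_counts):
        if len(allele) != len(ref):  # length change => insertion or deletion
            rows.append({
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": allele,
                "depth": depth,
                "alt_fraction": count / depth,
            })
    return rows
```

Lines starting with `#` are header lines and should be skipped before calling this function.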





            • #7
              Alex shared some of his data with me to run through Geneious and we corresponded a bit by email, so I thought I'd share the results with anyone else who's interested.

              Geneious called the indels, although in some situations it split them into multiple adjacent indels, which isn't ideal. I hope to improve this soon.

              I also ran his data through FreeBayes, which also found the obvious indels he expected, but it didn't find other 'obvious' indels he wasn't aware of.

              The main problem was that the data was poorly aligned. For example, in one sample, one allele had a 29bp deletion and the other allele a 44bp deletion in the same region. The alignment created using BWA mem had failed to span the 44bp deletion, so neither Geneious nor FreeBayes would call this indel from this alignment. I generated a better alignment using Geneious, and then Geneious called both indels, although it split them in a way that made it difficult to infer the two alleles. FreeBayes still failed to identify the two alleles in this case even when provided with an improved alignment.

              For indels like this, I recommend aligning with either Geneious or BBMap, both of which successfully span large indels. Other aligners may also have settings that can be tweaked to improve results around indels.
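One way to check whether an aligner actually spanned a large deletion is to look at the CIGAR strings in the BAM. The sketch below is only an illustration: it parses standard SAM CIGAR operations and flags reads that carry a long soft clip but no gap, which is what a read tends to look like when the aligner clips it instead of opening a large deletion. The function names and the `min_clip` threshold are hypothetical.

```python
import re

# Standard SAM CIGAR operations: a length followed by one operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_summary(cigar):
    """Return the longest deletion, longest insertion and total soft-clipped bases."""
    summary = {"longest_deletion": 0, "longest_insertion": 0, "soft_clipped": 0}
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D":
            summary["longest_deletion"] = max(summary["longest_deletion"], length)
        elif op == "I":
            summary["longest_insertion"] = max(summary["longest_insertion"], length)
        elif op == "S":
            summary["soft_clipped"] += length
    return summary


def looks_clipped_instead_of_gapped(cigar, min_clip=20):
    """Heuristic: heavy soft clipping with no gap suggests an unspanned indel.

    min_clip is an arbitrary assumption, not a validated cutoff.
    """
    s = cigar_summary(cigar)
    no_gap = s["longest_deletion"] == 0 and s["longest_insertion"] == 0
    return no_gap and s["soft_clipped"] >= min_clip
```

A read spanning a 44bp deletion might have a CIGAR such as `60M44D56M`, whereas the same read clipped at the breakpoint would look like `60M56S`. If most reads in a clone look like the second case, the alignment rather than the caller is probably the thing to fix.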

              And for variant calling on this type of data, both Geneious and FreeBayes do OK, although neither works perfectly on the data Alex provided even when I generated a better alignment.



              • #8
                Originally posted by Matt Kearse
                I also ran his data through FreeBayes, which also found the obvious indels he expected, but it didn't find other 'obvious' indels he wasn't aware of.
                That's my experience with Freebayes as well; I haven't found a good set of parameters that works with the amplicon data I have. I typically use GMAP (when I have Sanger data) and GSNAP (for Illumina data) to align this kind of data, and I am quite happy with my own custom caller on pileups from samtools.



                • #9
                  Originally posted by Matt Kearse
                  For indels like this, I recommend aligning with either Geneious or BBMap, both of which successfully span large indels. Other aligners may also have settings that can be tweaked to improve results around indels.
                  One of my colleagues has a Geneious license, so I might try it. However, I'm a command line guy - is there a way to automate Geneious for amplicon data? (I typically have thousands of samples with custom inline barcoding schemes)



                  • #10
                    Originally posted by sarvidsson
                    One of my colleagues has a Geneious license, so I might try it. However, I'm a command line guy - is there a way to automate Geneious for amplicon data? (I typically have thousands of samples with custom inline barcoding schemes)
                    Unfortunately no, there isn't a Geneious command line interface. You can align or variant call in bulk by selecting all the data sets and choosing the options once.

                    Or if that's not sufficient you can put together workflows with optional custom code fragments. See https://www.youtube.com/watch?v=uvgB2_YBmD4 for a short demo of workflows.

                    Also, one limitation of Geneious is that you can't yet export to VCF format so you'll have to settle for CSV export for now.



                    • #11
                      Originally posted by Matt Kearse
                      Unfortunately no, there isn't a Geneious command line interface. You can align or variant call in bulk by selecting all the data sets and choosing the options once.
                      Or if that's not sufficient you can put together workflows with optional custom code fragments. See https://www.youtube.com/watch?v=uvgB2_YBmD4 for a short demo of workflows.
                      I'll have a look at the custom workflows. Would it be possible to bulk import thousands of (typically paired) FASTQ files and assign sample IDs to them?

                      Originally posted by Matt Kearse
                      Also, one limitation of Geneious is that you can't yet export to VCF format so you'll have to settle for CSV export for now.
                      VCF would be nice but is not a must. If the CSV format contains enough data, I could generate VCF from it where needed. First I'd like to compare the aligner/caller to our current amplicon re-sequencing pipeline.



                      • #12
                        Originally posted by sarvidsson
                        Would it be possible to bulk import thousands of (typically paired) FASTQ files and assign sample IDs to them?
                        If, prior to import, you give the FASTQ files names that match their sample IDs, then the file name becomes the effective sample ID. Paired files should have a suffix (e.g. 1 or 2), which gets stripped from the name when you pair them within Geneious; pairing can be done in bulk.
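Before importing thousands of files, it can help to check outside Geneious that every sample really has both mates under a consistent naming scheme. The sketch below does not reproduce Geneious's own pairing rules; it simply assumes names like `sample_1.fastq.gz` / `sample_2.fastq.gz` or `sample_R1.fq` / `sample_R2.fq`, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>[_ or .][optional R]<1 or 2>.fastq|.fq[.gz]
MATE_SUFFIX = re.compile(r"^(?P<sample>.+?)[_.]R?(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def pair_fastq(filenames):
    """Group FASTQ file names by sample ID; keep only samples with both mates."""
    mates_by_sample = defaultdict(dict)
    for name in filenames:
        match = MATE_SUFFIX.match(name)
        if match is None:
            continue  # name does not follow the assumed scheme
        mates_by_sample[match.group("sample")][match.group("mate")] = name
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in sorted(mates_by_sample.items())
        if set(mates) == {"1", "2"}
    }
```

Samples missing from the returned dictionary are either unpaired or named differently, which is worth resolving before a bulk import.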

                        It's probably best to just try it with a sample or two to start with, to see if Geneious gives acceptable results on your data.



                        • #13
                          You may also find Scalpel (http://scalpel.sourceforge.net/) useful. It uses an assembly step during indel calling (http://www.ncbi.nlm.nih.gov/pubmed/25128977) that may help with some of the alignment-derived false negatives.



                          • #14
                            Originally posted by alexholman
                            I am attempting to detect indels from a panel of clones resulting from CRISPR targeted deletion. Regions around the target were PCR amplified to produce a roughly 160bp amplicon, which was then sequenced as a PE150 run.
                            I've been banging my head against finding a tool that can:

                            1) Detect which clones have indels
                            2) Identify the location of these indels, ideally in a VCF or similar file such that the full panel of 96 clones can be visualized (e.g. in IGV).
                            3) Provide annotation details about the read depth at the indel position and percentage of the sequences that contain an indel.
                            4) Is a stand-alone tool that I can install on my Unix box.
                            Thank you for this very informative topic with a happy ending.



                            • #15
                              Originally posted by alexholman
                              Well, after much frustration, I stumbled onto the package freebayes "Bayesian haplotype-based polymorphism discovery and genotyping"


                              The caller seems to work well on amplicon data and ended up producing the cleanest and most complete VCF file (with ref and alt allele frequencies).
                              I have a CRISPR dataset sequenced as PE 250bp. I tried freebayes, but I see very few indels in the resulting VCF file, even though I can see the indels when I load the BAM file into IGV. Could you share some details with me?
                              1) I used bwa-mem as the aligner. Which one did you use?
                              2) My freebayes command line:
                              freebayes -f /home/db/chr5.fa --region chr5:112818960-112819204 my_sorted.bam > results.vcf
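For a panel of many clones, the same command can be generated once per BAM file. The sketch below only builds the command strings and does not run freebayes; it uses just the `-f` and `--region` options that appear in the command above, and the function names and the `<sample>.vcf` output naming are hypothetical.

```python
import shlex
from pathlib import Path


def freebayes_command(reference, region, bam, vcf_out):
    """Build a shell command string equivalent to the one posted above."""
    argv = ["freebayes", "-f", reference, "--region", region, bam]
    return " ".join(shlex.quote(a) for a in argv) + " > " + shlex.quote(vcf_out)


def commands_for_panel(reference, region, bam_paths):
    """One command per clone; the output VCF is named after the BAM (assumed convention)."""
    return [
        freebayes_command(reference, region, bam, Path(bam).stem + ".vcf")
        for bam in bam_paths
    ]
```

The resulting strings can be written to a shell script or handed to a job scheduler, which keeps the reference and region identical across all clones.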

