models and softwares for SNP and indel detections


  • models and softwares for SNP and indel detections

    Hello,

    I'm rather new to the NGS field. I previously did an internship on RNA-seq data: gene and isoform expression level estimation, and differential expression between two conditions. So I know some of the models and tools used for those problems.

    I am now starting my PhD and I have to work with DNA-seq data. I must first focus on SNP and indel detection. I haven't found much information yet, because I don't know where to look. So far I have only found SNVMix ("predicting single nucleotide variants from next-generation sequencing of tumors"), which seems interesting.

    Is there some kind of blog for DNA-seq, like RNA-Seq Blog?

    I should mention that my PhD is in the cancer field. I'm interested in the models and software developed to solve these problems.

    Can you give me references to papers or software that you have read or used in this field?

    Thanks for your help,
    Jane

  • #2
    Two SNP and indel callers that you can search for on SeqAnswers are samtools mpileup:

    http://samtools.sourceforge.net/mpileup.shtml

    and GATK:

    http://www.broadinstitute.org/gsa/wi...alysis_Toolkit

    sections: 5.1, 5.4 (Unified Genotyper) and 5.5.

    Chris



    • #3
      Thanks for your answer.

      I was also wondering about the reliability of the Illumina pipeline, especially for SNP and indel detection: I have the results of two DNA-seq experiments and, for each, the list of SNPs and indels.
      The results were produced by the Illumina pipeline. I haven't managed to find information about the model Illumina uses for such analyses.

      Do you know where I can find this information? Do you have any information about the quality of these analyses?



      • #4
        There is something about CASAVA vs GATK here:

        http://biostar.stackexchange.com/que...va-1-8-vs-gatk

        It's probably worth reading all the replies - and there is a pre-publication paper as well about the comparison.

        Chris



        • #5
          Thanks for your answers, Chris! I read the papers about the comparison between CASAVA and GATK and I am starting to get an overview of the matter.
          I also read the paper about SNVMix and it seems to be a very interesting model!

          I will try to summarize what I've understood so far. Please correct me or add extra information. I have understood that Illumina's tool for alignment, SNP and indel detection is CASAVA, and that the newest version seems to be CASAVA1.8.

          Other tools are:
          • GATK, which is related to BWA (aligner tool), for the three issues. What is the relation between GATK and BWA? I understood that both of them can be used to align the reads.
          • SNVMix for SNV detection, which seems suited to cancer data (such as mine)
          • mpileup for SNV and indel detection.


          Does anyone know of other tools?


          About the comparison between GATK and CASAVA, the conclusion is:
          Code:
          We conclude that CASAVA1.8 has come a long way and can be considered a mature SNP calling approach. However, CASAVA1.8 does not deliver the same quality in the indel calling set compared to the newly incorporated Dindel-algorithm of GATK. It hence remains the best practice to use CASAVA1.8 for producing fastq files and switch at this stage to the academic tools for mapping, alignment improvement and variant calling.
          It seems that I should study indel detection with a tool other than the one from Illumina, but the results for SNP detection should be acceptable.

          Finally, do you know whether models that are not adapted to cancer data should be avoided when working in this field?





            • #7
              GATK, which is related to BWA (aligner tool), for the three issues. What is the relation between GATK and BWA? I understood that both of them can be used to align the reads.

              BWA is for mapping reads to a genomic reference. There are other tools like bowtie, stampy, novoalign that can do this as well. BWA is the standard in many places as it is open source, fast and can do gapped alignments.

              GATK is a set of tools for analysing exome and genomic DNA datasets. It does things like realigning reads around indels using multiple sequence alignment, where the mapper on its own doesn't do as well. This eliminates some false positive SNPs by correcting alignments. GATK also recalibrates the base quality scores reported by the sequencer. Mainly it is used to call SNPs and indels.

              The usual pipeline is:

              sequencer -> fastq -> BWA -> SAM -> samtools/picard -> sorted, dedupped BAM -> GATK -> realigned, recalibrated BAM -> GATK -> SNPs and indels (in VCF format) -> GATK -> recalibrated, filtered SNPs and indels.
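To make the hand-offs in that chain explicit, here is a minimal Python sketch that records each stage as data and checks that every stage reads the file type the previous one wrote. The `Stage` type, the coarse format labels and the helper names are my own for illustration; it does not run any of the tools and makes no claim about their command-line options.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    tool: str      # tool named in the pipeline above
    consumes: str  # coarse file format the stage reads
    produces: str  # coarse file format the stage writes
    note: str      # what the stage contributes


# Stages transcribed from the pipeline line above (formats simplified).
PIPELINE = [
    Stage("BWA", "fastq", "SAM", "map reads to the reference"),
    Stage("samtools/picard", "SAM", "BAM", "sort and mark duplicates"),
    Stage("GATK", "BAM", "BAM", "realign and recalibrate base qualities"),
    Stage("GATK", "BAM", "VCF", "call SNPs and indels"),
    Stage("GATK", "VCF", "VCF", "recalibrate and filter the calls"),
]


def is_consistent(stages: List[Stage]) -> bool:
    """Return True if each stage consumes what the previous stage produced."""
    return all(prev.produces == nxt.consumes
               for prev, nxt in zip(stages, stages[1:]))


def describe(stages: List[Stage]) -> str:
    """Render the chain as 'fastq -> [BWA] -> SAM -> ...'."""
    parts = [stages[0].consumes]
    for stage in stages:
        parts.append(f"[{stage.tool}] -> {stage.produces}")
    return " -> ".join(parts)


if __name__ == "__main__":
    print(describe(PIPELINE))
    print("consistent:", is_consistent(PIPELINE))
```

Skipping the samtools/picard step, for example, makes `is_consistent` return False, because the next GATK stage expects a BAM rather than the SAM that BWA writes.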

              Chris



              • #8
                Feels strange to promote that, but have a look here:

                http://seqanswers.com/wiki/How-to/exome_analysis



                • #9
                  Thanks, I'd not seen that page on the SeqAnswers wiki. Looks like a nice summary of the GATK stuff on one page!

                  Chris



                  • #10
                    The one bad thing about CASAVA is that, since it was quite bad for some time (there were bugs in the Perl scripts!), no one got into the habit of using it, so everyone learned SAMtools, Maq or GATK. Now that CASAVA is better, no one really wants to go back and use it when they already know other suites. So if you have questions about CASAVA, you won't get nearly as much support here as you would if you were asking about SAMtools or GATK. You could always ask Illumina... but in my experience, they aren't terribly helpful.



                    • #11
                      Thanks for your answers.
                      I didn't know Maq. I read a few things about it and understood that it's a mapping tool. Is it also for SNP and indel detection?

                      I have the feeling that there are not many tools for SNP and indel detection: 3-5. What would you suggest I start with? Which is the simplest to install and run?

                      Finally, has anyone tried SNVMix? Do you know if there are forums about NGS tools in the cancer field?



                      • #12
                        Just for completeness: rather recently, papers describing the variant callers in both GATK and samtools have appeared. Reading them might clarify some of the questions.

                        on samtools:
                        Heng Li: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (2011) 27 (21): 2987-2993. doi:10.1093/bioinformatics/btr509

                        on GATK:
                        Mark A DePristo et al.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491–498 (2011). doi:10.1038/ng.806



                        • #13
                          Heng Li mentions a few SNP callers in this biostar thread:

                          http://biostar.stackexchange.com/que...notype-callers

                          Maq is for mapping and SNP detection, I believe. It was written by Heng Li, who also did BWA and samtools mpileup, so Maq may be out of date now, especially for SNP detection, but I'm not sure as I have never used it. It may still be good for mapping, but it is slower than BWA.

                          Also, I have no experience with SNVMix or SNP detection in cancer. It may be harder to look for SNPs in cancer cells, as they may no longer be diploid and you may have heterogeneous cell populations: samtools and GATK rely on the fact that you are looking for SNPs and indels in diploid cells (so caution needs to be applied when looking for SNPs on the X and Y chromosomes as well).
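To illustrate why the diploid assumption matters for tumour samples, here is a minimal Python sketch of a textbook-style diploid genotype likelihood. It is a deliberate simplification and not the actual model in samtools or GATK: it assumes two alleles only, independent reads, and one fixed per-base error rate (the `error_rate` default is an arbitrary illustrative value), whereas real callers use per-base qualities and priors.

```python
import math
from typing import Dict


def genotype_log_likelihoods(ref_count: int, alt_count: int,
                             error_rate: float = 0.01) -> Dict[str, float]:
    """Log-likelihood of the read counts under each diploid genotype.

    Assumed probability that a single read shows the ALT allele:
      RR (hom. reference) -> error_rate
      RA (heterozygous)   -> 0.5
      AA (hom. alternate) -> 1 - error_rate
    """
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}
    return {
        genotype: alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        for genotype, p in p_alt.items()
    }


def best_genotype(ref_count: int, alt_count: int,
                  error_rate: float = 0.01) -> str:
    """Return the genotype with the highest likelihood."""
    scores = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # A clean heterozygous site versus a variant present in few reads.
    for ref, alt in [(50, 50), (90, 10)]:
        print(ref, alt, best_genotype(ref, alt))
```

Under these assumptions a site with 50 of 100 reads supporting the variant is called heterozygous, but a variant seen in only 10 of 100 reads (as might happen with a subclonal mutation or normal-cell contamination) is scored as homozygous reference, because the model only allows allele fractions near 0, 0.5 or 1.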

                          One of the main differences between GATK and samtools is that GATK tends to give many more SNPs and relies on the variant recalibration to find better quality SNPs. But in my experience, both perform well for finding most of the likely candidate SNPs in exome data. You'll still need to verify any novel SNPs you find with something like Sanger sequencing.
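If you want to see how much two callers agree on your own data, a rough comparison of their VCF outputs can be sketched in a few lines of Python. This relies only on the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT); the function names are made up for illustration, and it compares variants by exact position and alleles, so the same indel written differently by two callers would count as a disagreement.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def read_variants(vcf_text: str) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) tuples from VCF-formatted text."""
    variants = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one entry per alternate allele
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare_callsets(a: Set[Variant], b: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split two call sets into shared and caller-specific variants."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


if __name__ == "__main__":
    # Tiny made-up call sets, for illustration only.
    calls_a = "\n".join([
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tG",
        "chr1\t200\t.\tC\tT,G",
    ])
    calls_b = "\n".join([
        "chr1\t100\t.\tA\tG",
        "chr2\t5\t.\tG\tGA",
    ])
    result = compare_callsets(read_variants(calls_a), read_variants(calls_b))
    for name, variants in result.items():
        print(name, len(variants))
```

The variants found by only one caller are usually the ones worth inspecting first, and the shared set is a reasonable starting list for validation.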

                          Chris



                          • #14
                            Thanks again !

                            From the given list, I've found new tools for SNP or indel detection:
                            • Atlas SNP
                            • Dindel
                            • FreeBayes
                            • QCALL
                            • Slider II
                            • SNP Seeker
                            • SPLINTER
                            • Syzygy
                            • VARiD
                            • VarScan


                            And there are probably more...
                            I will go through this list to see what is worth trying, especially with my cancer data.



                            • #15
                              Here's a new one I stumbled across. No idea if it's any good, but it might be applicable to cancer data if you have a heterogeneous population. http://nar.oxfordjournals.org/conten...kr599.abstract
