Variant Annotation Tools

  • Variant Annotation Tools

    Hi all,
    I'm looking for suggestions of variant annotation tools for large data sets.
    For example, I've called variants using Samtools pileup and now I want to go from a huge list of variants to a list of annotations and a simple method for filtering them.
    Any thoughts on things I might try?
    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
    Projects: U87MG whole genome sequence [Website] [Paper]

  • #2
    We've done similar things at my lab. We haven't dealt with very large datasets, just a couple of Illumina and 454 datasets together. If you want to take a look at our documentation you can. We also did a small tutorial session on the topic.
    I hope that could serve you as inspiration.


    • #3
      You can try VarScan; it takes pileup as input.


      • #4
        Thanks for the feedback.

        I'm going to try VarScan because I've already done the variant calling and have the pileup files.

        Does anyone have suggestions for annotation and filtering programs to run downstream of VarScan?

        For example, going from the list of variants to coding consequences (marking whether and how variants affect coding sequences), and parsing by type of variant (indels vs SNVs) or coverage/quality?

        I'm also having trouble getting VarScan to work:

        I used samtools 0.1.7-5 (r528) to generate the pileup, using the -c -a -f hg18.fa -r 0.0000007 options.
        When I try running one of the "pileup2" commands in VarScan, this is what happens:
        java -jar /home/mclark/varScan/VarScan.v2.2.jar pileup2indel chr21.pileup
        Min coverage: 8
        Min reads2: 2
        Min var freq: 0.01
        Min avg qual: 15
        P-value thresh: 0.99
        Reading input from chr21.pileup
        Chrom Position Ref Var Reads1 Reads2 VarFreq Strands1 Strands2 Qual1 Qual2 Pvalue
        Parsing Exception on line:
        chr21 9719766 N A 68 0 59 3 ^Z.^~,^~, `2/
        For input string: "A"
        Any ideas what's going on and how I can get around it?

        I'm also wondering what the possible Options are when running each command in VarScan. I don't see a list on the site (and if it's in the code, I'm afraid I may not be savvy enough to figure that out myself so assistance is appreciated). For example, can I play with "min avg qual" and such? Thanks.
        Last edited by Michael.James.Clark; 06-18-2010, 01:45 PM.

        • #5
          The VarScan manual site says that it cannot process pileup created with the -c option:

          "Do NOT use the -c parameter. It generates consensus format, which is different from pileup format. The next release of VarScan will recognize both formats. Note, to save disk space and file I/O, you can redirect pileup output directly to VarScan with a "pipe" command. For example:

          samtools pileup -f reference.fasta myData.bam | java -jar VarScan.v2.1.jar pileup2snp"

          The c stands for consensus, and it looks as though the parsing exception was caused by that consensus "A". So you should run pileup without -c if you want to use the output with VarScan. Or wait for the promised next release, or for someone to do a clever hack to the code ...
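
One possible workaround, if re-running pileup is not convenient: strip the extra consensus columns so each row looks like plain pileup again. This is only a sketch in Python. It assumes the samtools 0.1.x consensus layout (chrom, position, ref, consensus base, consensus quality, SNP quality, max mapping quality, depth, read bases, base qualities), and the function name consensus_to_basic is made up for illustration. Indel rows in consensus output use a different layout, so the sketch simply skips them; re-running pileup without -c remains the safer route.

```python
def consensus_to_basic(line):
    """Convert one consensus-format pileup row to the basic 6-column layout.

    Returns None for rows that cannot be converted.
    """
    fields = line.rstrip("\n").split("\t")
    # Assumed consensus layout (10 columns): chrom, pos, ref, consensus,
    # consensus quality, SNP quality, max mapping quality, depth,
    # read bases, base qualities.
    if len(fields) != 10 or fields[2] == "*":
        return None  # indel rows (ref "*") use a different layout; skip them
    chrom, pos, ref = fields[:3]
    depth, bases, quals = fields[7:10]
    return "\t".join([chrom, pos, ref, depth, bases, quals])


# The row from the error message above, with tabs restored between columns.
row = "chr21\t9719766\tN\tA\t68\t0\t59\t3\t^Z.^~,^~,\t`2/\n"
print(consensus_to_basic(row))
```

In the original row the fourth column is the consensus base "A", which is where a plain-pileup parser would expect the integer depth; that would be consistent with the 'For input string: "A"' message. After conversion the fourth column is the depth (3).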


          • #6
            Try samtools pileup -vcf
            It gives only variants.


            • #7
              Great, thanks guys. I think last week I was only seeing the "Documentation" not the "Manual" from the site. The Manual describes just what I wanted to know.

              Rao, the -c option's consensus output appears to be the issue. Can still potentially use -v to only output variants, though.
              Last edited by Michael.James.Clark; 06-21-2010, 09:36 AM.

              • #8
                Alright, that worked and I got output. It looks believable to me, but I've encountered some issues.

                For one thing, I can't get the "filter" command to report anything. No matter what settings I use, it reports 0 variants passing filter, even though when I delve into the variant file I can find variants that should pass. Has anyone gotten it to work?
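
For checking by hand which rows ought to pass, something like the following Python sketch could help. It is not VarScan's own filter logic. It only re-applies three of the thresholds printed earlier in the thread (min coverage 8, min reads2 2, min var freq 0.01) to a tab-separated table with the column header shown in that output. The assumptions are that VarFreq is a percentage string, that coverage can be approximated as Reads1 + Reads2, and that indel alleles start with "+" or "-". The function names and the sample rows are invented for illustration.

```python
import csv
import io


def passes(row, min_coverage=8, min_reads2=2, min_var_freq=0.01):
    """Re-apply simple thresholds to one row of VarScan-style output."""
    reads1 = int(row["Reads1"])
    reads2 = int(row["Reads2"])
    # Assumption: VarFreq is a percentage string such as "33.33%".
    var_freq = float(row["VarFreq"].rstrip("%")) / 100.0
    # Assumption: coverage is approximated as Reads1 + Reads2.
    return (reads1 + reads2 >= min_coverage
            and reads2 >= min_reads2
            and var_freq >= min_var_freq)


def variant_type(row):
    # Assumption: indel alleles start with "+" (insertion) or "-" (deletion).
    return "indel" if row["Var"].startswith(("+", "-")) else "SNV"


def filter_variants(handle, **thresholds):
    """Yield (type, row) for every row that passes the thresholds."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if passes(row, **thresholds):
            yield variant_type(row), row


# Invented rows for illustration only; these are not real calls.
header = "Chrom Position Ref Var Reads1 Reads2 VarFreq Strands1 Strands2 Qual1 Qual2 Pvalue"
rows = [
    "chr21 100 A G 10 5 33.33% 2 2 30 30 0.98",
    "chr21 200 C +AT 3 2 40.00% 2 1 30 30 0.98",
    "chr21 300 T -G 20 4 16.67% 2 2 30 30 0.98",
]
sample = "\n".join("\t".join(line.split()) for line in [header] + rows) + "\n"

for kind, row in filter_variants(io.StringIO(sample)):
    print(kind, row["Chrom"], row["Position"], row["Var"])
```

With the invented rows, the second one is dropped because its coverage (3 + 2) is below 8. If a hand check like this finds passing rows where the real filter reports none, that points at the input format or the options rather than at the data.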

                I also tried the somatic command, and it looks like it worked, but some of the output puzzles me as well. Example output:

                Min coverage: 8x for Normal, 6x for Tumor
                Min reads2: 2
                Min strands2: 1
                Min var freq: 0.2
                Min freq for hom: 0.75
                Min avg qual: 15
                P-value thresh: 0.99
                Somatic p-value: 0.05
                127671560 shared positions
                122884470 had sufficient coverage for comparison
                121991210 were called Reference
                12445 were mixed SNP-indel calls and filtered
                176060 were called Germline
                8887 were called LOH
                685647 were called Somatic
                10221 were called Unknown
                0 were called Variant
                I'm thrown by the "0 were called Variant". Anyone know what that means?
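
This does not say what "Variant" means, but as a quick sanity check the category counts above can be added up and compared with the "sufficient coverage" line. A small Python sketch, using only the numbers from the output above:

```python
# Category counts copied from the somatic output above.
counts = {
    "Reference": 121991210,
    "mixed SNP-indel (filtered)": 12445,
    "Germline": 176060,
    "LOH": 8887,
    "Somatic": 685647,
    "Unknown": 10221,
    "Variant": 0,
}
sufficient_coverage = 122884470

total = sum(counts.values())
print(total, total == sufficient_coverage)
```

The categories add up to exactly 122884470, so every position with sufficient coverage landed in one of the other buckets; the 0 looks like an empty category rather than missing positions.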

                • #9
                  Originally posted by Michael.James.Clark
                  Hi all,
                  I'm looking for suggestions of variant annotation tools for large data sets.
                  For example, I've called variants using Samtools pileup and now I want to go from a huge list of variants to a list of annotations and a simple method for filtering them.
                  Any thoughts on things I might try?
                  Hi,

                  You could try SVA in DUKE (http://people.genome.duke.edu/~dg48/sva/index.php).

                  I think this big guy can satisfy your request if you have a big computer.

                  Wu


                  • #10
                    Originally posted by wuhoucdc
                    Hi,

                    You could try SVA in DUKE (http://people.genome.duke.edu/~dg48/sva/index.php).

                    I think this big guy can satisfy your request if you have a big computer.

                    Wu
                    My only concern is that I have heard it hard-codes dbSNP 127 or something similar (can anyone confirm? N=1). Even so, it is a great piece of software!


                    • #11
                      You might want to try www.gene-talk.de


                      • #12
                        Is there an update to this post recommending tools for variant annotation and analysis? I'm trying to use R's VariantAnnotation package but the learning curve is frustrating me and I'm not sure it's worth the effort...


                        • #13
                          I believe the two most commonly used tools are ANNOVAR and SnpEff. ANNOVAR handles many types of annotations and is built for filtering. SnpEff produces some nice HTML files for your web-viewing enjoyment in addition to text files.


                          • #14
                            I agree that ANNOVAR and SnpEff seem to be the most widely used for variant annotation. For variant analysis there are, e.g., Ingenuity (commercial), Annotate-it and www.gene-talk.de. We are using GeneTalk at the Institute for Medical Genetics at Berlin Charité and are collaborating with their R&D team. The platform seems to be rather commonly used now: currently about one hundred single exomes are analyzed per day by about 500 unique users. The annotation is based on ANNOVAR. The filtering and interpretation tools are co-developed by us, but it is generally a project open to any kind of collaboration. We just added a new filter for compound heterozygous variants, so if this is something you are interested in, just try it out.


                            • #15
                              I agree that ANNOVAR and snpEff are widely used. I had a chance to use SeattleSeq Annotation recently when I had to calculate some Grantham scores (http://snp.gs.washington.edu/SeattleSeqAnnotation137/). You could check it out if you'd prefer a web interface to submit jobs to. Galaxy does a bit of annotation as well (I've used Galaxy for obtaining PhyloP scores).
