  • ulz_peter
    Senior Member
    • Feb 2010
    • 219

    Exome sequencing analysis manual

    Hi Folks,

    While writing a short guide to exome analysis for our institute, I thought it might be of some use to others, especially newbies who need some kind of starting point for analysing exome data (pretty much like the RNA-seq manual I once read in an older thread...). Instead of explaining everything in 100 new threads, one could then point to that manual...

    It is the way we do exome analysis at our Institute, but I would be happy if people help improve the manual, add their knowledge and expand it, like a common knowledge base for exome-level analysis.

    I attached the pdf version and a .doc version within a zip folder, as the filesize was too large for uploading the doc file alone.

    The most updated version can be found in the SeqWiki (http://seqanswers.com/wiki/How-to/exome_analysis)
    (just to make it clear, it is not regularly updated and its only goal is to get people started on the use of tools often used in exome sequencing)

    Any comments highly appreciated!

    P.S. I added a (very) short visualization chapter
    Attached Files
    Last edited by ulz_peter; 04-12-2012, 10:08 PM. Reason: updated manual
  • liu_xt005
    Member
    • Jun 2011
    • 24

    #2
    Thanks a lot.

    I have learned a lot.

    Comment

    • maubp
      Peter (Biopython etc)
      • Jul 2009
      • 1544

      #3
      You have a weird typo "foolwong" just above the FASTQ example.

      Also your introduction about the different FASTQ encodings is out of date now. Illumina now follow the Sanger convention. They also changed the read naming convention; in particular, the old /1 and /2 suffixes are gone.

      See this thread for details:
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      Also I've never heard FASTQ called a "FastAlignment and Quality file" (glossary on last page).
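For anyone unsure which convention their files use, here is a minimal Python sketch (not taken from the manual). It guesses the quality offset from the lowest quality character and reads the mate number from either header style. The function names and the two example headers are made up for illustration, and the offset check is only a heuristic: a file containing nothing but high qualities could be misjudged.

```python
import re

# Old style: the mate number is a "/1" or "/2" suffix on the read name.
OLD_STYLE = re.compile(r"^@?(?P<name>\S+)/(?P<mate>[12])$")
# Newer style (CASAVA 1.8 onwards): the mate number follows a space,
# as "<mate>:<filtered Y/N>:<control bits>:<index>".
NEW_STYLE = re.compile(r"^@?(?P<name>\S+) (?P<mate>[12]):[YN]:\d+:\S*$")


def guess_phred_offset(quality_lines):
    """Heuristic: return 33 (Sanger) or 64 (older Illumina) for a sample of quality lines."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Characters below ';' (ASCII 59) cannot occur in any +64 encoding.
    return 33 if lowest < 59 else 64


def read_name_and_mate(header):
    """Return (read name, mate number or None) for a FASTQ header line."""
    header = header.strip()
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(header)
        if match:
            return match.group("name"), int(match.group("mate"))
    # Neither convention matched (e.g. single-end data): no mate information.
    return header.lstrip("@").split()[0], None


# Hypothetical example headers, one per convention.
old_header = "@HWUSI-EAS100R:6:73:941:1973#0/1"
new_header = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
```

Checking the first few thousand reads is usually enough for the offset guess. If the guess comes out as 64 on recent Illumina data, it is worth looking at the file by eye before converting anything.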

      Comment

      • ulz_peter
        Senior Member
        • Feb 2010
        • 219

        #4
        Thanks for the hints. As we do not produce Illumina data in our lab (yet), I hadn't heard of those changes, although they seem to have been implemented a while ago...

        The typo should read "following". I will rewrite that part and repost it...

        Comment

        • pc2009open
          Junior Member
          • Jun 2009
          • 3

          #5
          This is a great document.

          Thanks a lot. This is a great document. I wish I had read this document earlier.

          Comment

          • Heisman
            Senior Member
            • Dec 2010
            • 534

            #6
            As the GATK local realignment around indels portion of the website does not explicitly state to "FixMateInformation", I am curious if that will affect downstream analysis in any way?

            Great document, by the way.

            Comment

            • Jon_Keats
              Senior Member
              • Mar 2010
              • 279

              #7
              Very nice document, thanks for sharing.

              Comment

              • NGSfan
                Senior Member
                • Apr 2009
                • 181

                #8
                Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...

                Comment

                • pc2009open
                  Junior Member
                  • Jun 2009
                  • 3

                  #9
                  Hi ulz_peter,

                  I have gone through almost the whole process according to your suggestions. However, at "3.2. Variant quality score recalibration", I encountered some problems. (I used the GATK-1.0.5506 version.)

                  I got the error message: "Argument with name '--cluster_file' is missing." However, I did not put "--cluster_file" at all.

                  I looked at some help documents, and found that this kind of "cluster_file" is supposed to be generated by "GenerateVariantClusters". Have you used GenerateVariantClusters before? Is it necessary?

                  Thanks again for the wonderful manual.

                  Comment

                  • Heisman
                    Senior Member
                    • Dec 2010
                    • 534

                    #10
                    Originally posted by NGSfan
                    Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                    I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...
                    We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?

                    Comment

                    • ulz_peter
                      Senior Member
                      • Feb 2010
                      • 219

                      #11
                      Hi guys,
                      Thanks for all your responses. I must admit that the GATK parts are a little outdated (already). I'm going to switch to the new version this week and will update the manual accordingly...

                      @pc2009open: I can't find any hint of a cluster_file argument being used in variant quality score recalibration... Has anyone else seen that?

                      Comment

                      • NGSfan
                        Senior Member
                        • Apr 2009
                        • 181

                        #12
                        Originally posted by Heisman
                        We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?
                        Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                        I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                        We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better; it just takes some effort to test the parameters further to see if it can handle all cases.

                        GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
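A comparison against a training set like the one described above can be scored with a few lines of Python. This is only a rough sketch, not the code used in our lab: it assumes variants are represented as (chrom, pos, ref, alt) tuples, and it reports precision rather than specificity, because true negatives are not well defined for variant calls. The function name and the example variants are made up.

```python
def compare_calls(called, truth):
    """Score a call set against a training/truth set of known variants."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # known variants that were called
    false_pos = len(called - truth)   # calls not present in the training set
    false_neg = len(truth - called)   # known variants that were missed
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    precision = true_pos / (true_pos + false_pos) if called else float("nan")
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Made-up example: four known variants, three of them recovered, one extra call.
truth_set = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
             ("chr2", 500, "G", "GA"), ("chr3", 42, "T", "C")]
call_set = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
            ("chr2", 500, "G", "GA"), ("chr5", 7, "A", "T")]
result = compare_calls(call_set, truth_set)
```

Exact tuple matching is strict for indels, since the same event can be written at different positions, so indel representation would need to be normalised before a comparison like this means much.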

                        Comment

                        • raonyguimaraes
                          Member
                          • Jun 2010
                          • 38

                          #13
                          Thank you so much for posting this pipeline, I've been doing the same for some time. Tomorrow I will post some comments about my results so far.

                          I think you could add this pipeline to yours:



                          Let's make this thread a big reference for anyone doing exome sequencing... Please!!!

                          Comment

                          • raonyguimaraes
                            Member
                            • Jun 2010
                            • 38

                            #14
                            One question. How many raw SNPs are you getting after running UnifiedGenotyper for the first time?

                            Here I'm getting about 300,000 SNPs and I think there is something wrong with these numbers... Shouldn't it be around 20,000 SNPs?

                            I'm running my analysis again using a BED file from SeqCap EZ Human Exome Library v2.0 (http://www.nimblegen.com/products/se...tml#annotation) but still... 300 thousand SNPs is a lot...
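To see how many of the raw calls actually fall inside the capture targets, something like the following Python sketch could be used. It is only an illustration under a few assumptions: the BED file is 0-based and half-open, the VCF positions are 1-based, chromosome names match between the two files, and a variant is counted by its POS field only. All function names are made up.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into merged, sorted interval lists per chromosome."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    targets = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:            # overlapping or touching: extend
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        targets[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return targets


def on_target(targets, chrom, pos):
    """True if 1-based position `pos` lies inside a 0-based, half-open target."""
    if chrom not in targets:
        return False
    starts, ends = targets[chrom]
    i = bisect.bisect_right(starts, pos - 1) - 1  # last interval starting at or before pos
    return i >= 0 and pos - 1 < ends[i]


def count_on_target(vcf_lines, targets):
    """Return (records on target, total records) for an iterable of VCF lines."""
    hits = total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        total += 1
        hits += on_target(targets, chrom, int(pos))
    return hits, total
```

With real files this would be called as `count_on_target(open("raw.vcf"), load_bed(open("targets.bed")))`, where both file names are placeholders. If most of the 300,000 calls turn out to be off target, the surplus is probably coming from reads outside the captured regions rather than from the caller itself.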
                            Last edited by raonyguimaraes; 10-10-2011, 05:41 PM. Reason: english mistaken ... :)

                            Comment

                            • Heisman
                              Senior Member
                              • Dec 2010
                              • 534

                              #15
                              Originally posted by NGSfan
                              Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                              I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                              We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better; it just takes some effort to test the parameters further to see if it can handle all cases.

                              GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
                              Interesting. Thank you for your post. We do a pretty good job (I think) using the latest SAMtools mpileup command with the -A and -B options and a minimum mapping quality of 50 per read, but I haven't done anything rigorous to determine what our sensitivity/specificity is. I may go ahead and look at comparing it with GATK.
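For reference, the mpileup settings mentioned above could be assembled from Python roughly as below. This is a sketch, not our actual script: it assumes the mapping-quality cutoff is applied through mpileup's own -q option (it could equally be applied in a later filtering step), the file names are placeholders, and the command is only built and printed, not run. The downstream calling step is not shown.

```python
import shlex


def mpileup_command(reference, bam, min_mapq=50):
    """Build (but do not run) a samtools mpileup command line as an argument list."""
    return [
        "samtools", "mpileup",
        "-A",                  # also count anomalous (orphan) read pairs
        "-B",                  # disable BAQ computation
        "-q", str(min_mapq),   # skip alignments with mapping quality below this value
        "-f", reference,       # indexed reference FASTA
        bam,
    ]


# Placeholder file names, for illustration only.
command = mpileup_command("ref.fa", "sample.bam")
command_line = shlex.join(command)
print(command_line)
```

Passing the list to `subprocess.run(command, check=True)` would execute it, provided samtools is installed and on the PATH.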

                              Comment
