
  • Exome sequencing analysis manual

    Hi Folks,

    As I was writing a short guide to exome analysis for our institute, I thought it might be of some use to others, especially newbies who need a starting point for analyzing exome data (pretty much like the RNA-seq manual I once read in an older thread...). Instead of explaining everything in 100 new threads, one could then point to that manual...

    It is the way we do exome analysis at our Institute, but I would be happy if people help improve the manual, add their knowledge and expand it, like a common knowledge base for exome-level analysis.

    I attached the pdf version and a .doc version within a zip folder, as the filesize was too large for uploading the doc file alone.

    The most updated version can be found in the SeqWiki (http://seqanswers.com/wiki/How-to/exome_analysis)
    (Just to make it clear: it is not regularly updated, and its only goal is to get people started with the tools commonly used in exome sequencing.)

    Any comments highly appreciated!

    P.S. I added a (very) short visualization chapter
    Last edited by ulz_peter; 04-12-2012, 10:08 PM. Reason: updated manual

  • #2
    Thanks a lot.

    I have learned a lot.



    • #3
      You have a weird typo "foolwong" just above the FASTQ example.

      Also, your introduction about the different FASTQ encodings is out of date now. Illumina now follow the Sanger convention (Phred+33 quality encoding). They also changed the read naming convention; in particular, the old /1 and /2 suffixes are gone.

      See this thread for details:
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      Also I've never heard FASTQ called a "FastAlignment and Quality file" (glossary on last page).
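
For anyone updating scripts after these changes, here is a minimal Python sketch (not taken from the manual) of two checks: guessing the quality offset from the characters seen, and reading the mate number from either header style. The function names and the example headers are illustrative assumptions. The ASCII thresholds follow the commonly documented ranges: the Sanger convention starts at 33, while the older Solexa/Illumina encodings start at 59 or higher.

```python
import re


def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the characters observed.

    Returns None when the observed range fits both conventions.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' only occurs with the Sanger convention
        return 33
    if max(codes) > 74:   # above 'J' only occurs with the older offset-64 encodings
        return 64
    return None           # ambiguous: need more reads to decide


def mate_number(header):
    """Return 1 or 2 for the mate number encoded in a FASTQ header, else None."""
    name, _, comment = header.lstrip("@").partition(" ")
    # Newer style: '<name> <mate>:<filtered Y/N>:<control bits>:<index>'
    m = re.match(r"([12]):[YN]:\d+:", comment)
    if m:
        return int(m.group(1))
    # Older style: the read name itself ends in '/1' or '/2'
    m = re.search(r"/([12])$", name)
    return int(m.group(1)) if m else None


# Illustrative headers only, not taken from any real run
old_style = "@HWUSI-EAS100R:6:73:941:1973#0/1"
new_style = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(mate_number(old_style), mate_number(new_style))
print(guess_phred_offset(["IIII#5"]), guess_phred_offset(["hhhhgg"]))
```

The offset guess is deliberately conservative: a handful of high-quality reads can fit both conventions, so it is safer to scan many reads before trusting the answer.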



      • #4
        Thanks for the hints. As we do not produce Illumina data in our lab (yet), I hadn't heard of those changes, although they seem to have been implemented a while ago...

        The typo should read "following". I will rewrite that part and repost it...



        • #5
          This is a great document.

          Thanks a lot. This is a great document. I wish I had read this document earlier.



          • #6
            The GATK website's section on local realignment around indels does not explicitly say to run "FixMateInformation". I am curious whether skipping it will affect downstream analysis in any way?

            Great document, by the way.



            • #7
              Very nice document, thanks for sharing.



              • #8
                Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...



                • #9
                  Hi ulz_peter,

                  I have gone through almost the whole process according to your suggestions. However, at step "3.2. Variant quality score recalibration", I encountered some problems. (I used the GATK-1.0.5506 version.)

                  I got the error message: "Argument with name '--cluster_file' is missing." However, I did not use "--cluster_file" at all.

                  I looked at some help documents, and found that this kind of "cluster_file" is supposed to be generated by "GenerateVariantClusters". Have you used GenerateVariantClusters before? Is it necessary?

                  Thanks again for the wonderful manual.



                  • #10
                    Originally posted by NGSfan
                    Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                    I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...
                    We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?



                    • #11
                      Hi guys,
                      Thanks for all your responses. I must admit that the GATK parts are a little outdated (already). I'm gonna switch to the new version this week and will update the manual accordingly...

                      @pc2009open: I can't find any hint of a cluster_file argument being used in variant quality score recalibration... Has anyone else seen that?



                      • #12
                        Originally posted by Heisman
                        We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?
                        Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                        I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                        We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think perhaps with tweaking the gap open penalty for indels, Novoalign might have performed better - it just takes some effort to test the parameters more to see if it can handle all cases.

                        GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
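
A comparison of this kind can be sketched in a few lines of Python. This is an assumed setup, not the script used in the post above: variants are represented as hypothetical (chrom, pos, ref, alt) tuples, and the function name is made up. Since a list of calls does not define true negatives, the sketch reports precision next to sensitivity rather than specificity in the strict sense.

```python
def compare_calls(called, truth):
    """Compare a call set against a training (truth) set of variants.

    Both arguments are iterables of hashable variant keys, for example
    (chrom, pos, ref, alt) tuples.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # in the truth set and called
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if truth else float("nan"),
        "precision": tp / (tp + fp) if called else float("nan"),
    }


# Toy data, purely illustrative
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "GA"), ("chr2", 300, "T", "C")]
calls = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA"),
         ("chr3", 10, "A", "C")]
print(compare_calls(calls, truth))
```

Running the same function on the call set from each aligner/caller combination gives numbers that can be put side by side, provided indels are represented the same way (for example left-aligned) in every set.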



                        • #13
                          Thank you so much for posting this pipeline, I've been doing the same for some time. Tomorrow I will post some comments about my results so far.

                          I think you could add this pipeline to yours:



                          Let's make this thread a big reference for anyone doing exome sequencing... Please!!!



                          • #14
                            One question: how many raw SNPs are you getting after running UnifiedGenotyper for the first time?

                            Here I'm getting about 300,000 SNPs, and I think there is something wrong with these numbers... Shouldn't it be around 20,000 SNPs?

                            I'm running my analysis again using a BED file from the SeqCap EZ Human Exome Library v2.0 (http://www.nimblegen.com/products/se...tml#annotation), but still... 300 thousand SNPs is a lot...
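
One thing worth checking with a count like this is how many of the raw calls actually fall inside the capture targets. Below is a minimal Python sketch under stated assumptions: the function names are hypothetical, the BED file uses the usual 0-based half-open coordinates, the VCF uses 1-based positions, and chromosome names match between the two files.

```python
from bisect import bisect_right
from collections import defaultdict


def load_targets(bed_lines):
    """Parse BED lines into per-chromosome sorted, merged (start, end) lists."""
    raw = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:   # overlaps or touches the previous interval
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def on_target(targets, chrom, pos):
    """True if the 1-based position lies inside a 0-based half-open interval."""
    intervals = targets.get(chrom, [])
    # Index of the last interval whose start is <= pos - 1
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def count_on_target(vcf_lines, targets):
    """Return (calls inside targets, total calls) for tab-separated VCF lines."""
    inside = total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        total += 1
        inside += on_target(targets, chrom, int(pos))
    return inside, total
```

Usage would look like `count_on_target(open("raw.vcf"), load_targets(open("targets.bed")))`, where both file names are placeholders.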
                            Last edited by raonyguimaraes; 10-10-2011, 05:41 PM. Reason: english mistaken ... :)



                            • #15
                              Originally posted by NGSfan
                              Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                              I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                              We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think perhaps with tweaking the gap open penalty for indels, Novoalign might have performed better - it just takes some effort to test the parameters more to see if it can handle all cases.

                              GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
                              Interesting. Thank you for your post. We do a pretty good job (I think) using the latest SAMtools mpileup command with the -A and -B options and a minimum mapping quality per read of 50, but I haven't done anything rigorous to determine what our sensitivity/specificity is. I may go ahead and look at comparing it with GATK.
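
The mpileup settings mentioned above could be assembled roughly like this. It is only a sketch: the file names and the helper function are hypothetical, and the flag meanings (-A keeps anomalous read pairs, -B disables BAQ, -q sets the minimum mapping quality, -f names the reference FASTA) follow the samtools mpileup usage text, so check them against your installed version. Newer releases point to bcftools mpileup for variant calling.

```python
def mpileup_command(bam, reference, min_mapq=50):
    """Build (but do not run) a samtools mpileup argument list."""
    return [
        "samtools", "mpileup",
        "-A",                  # do not skip anomalous read pairs
        "-B",                  # disable BAQ (base alignment quality)
        "-q", str(min_mapq),   # minimum mapping quality for a read to be used
        "-f", reference,       # indexed reference FASTA
        bam,
    ]


# Hypothetical file names; pass the list to subprocess.run() to execute it
cmd = mpileup_command("sample.bam", "ref.fa")
print(" ".join(cmd))
```

Keeping the arguments in a list makes it easy to log the exact settings used for each run, which helps when comparing the results against another caller.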
