**liu_xt005** · 09-22-2011, 07:59 AM

Thanks a lot.

I have learned a lot.

**maubp** · 09-22-2011, 09:26 AM

You have a weird typo "foolwong" just above the FASTQ example.

Also, your introduction about the different FASTQ encodings is now out of date. Illumina now follows the Sanger convention (Phred+33). They also changed the read naming convention; in particular, the old /1 and /2 suffixes are gone.

See this thread for details:

Upcoming changes in CASAVA - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8895
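
For readers who need to tell old and new Illumina files apart, here is a minimal Python sketch, not taken from the manual or from CASAVA itself. It guesses the quality offset from the lowest quality character seen and reads the mate number from either the old `/1` `/2` suffix or the newer header comment field. The function names and the example headers are illustrative assumptions, and the offset guess is only a heuristic: a file containing nothing but high-quality Phred+33 reads would be misjudged.

```python
def guess_phred_offset(quality_lines):
    """Heuristically guess the FASTQ quality offset (33 or 64).

    Returns None when the characters seen do not settle the question.
    """
    lowest = min(
        (ord(ch) for line in quality_lines for ch in line.strip()),
        default=None,
    )
    if lowest is None:
        raise ValueError("no quality characters given")
    if lowest < 59:
        # Characters below ';' cannot occur in the old +64 encodings.
        return 33
    if lowest >= 64:
        # Likely the old Illumina +64 encoding (still only a guess).
        return 64
    # 59..63: could be old Solexa-scaled scores or plain Phred+33.
    return None


def mate_number(header):
    """Return 1 or 2 for a paired read header, or None if unknown."""
    name, _, comment = header.strip().partition(" ")
    # Newer style: "@name 1:N:0:INDEX" (mate number starts the comment).
    if comment and comment[0] in "12" and comment[1:2] == ":":
        return int(comment[0])
    # Older style: "@name/1" or "@name/2".
    if name.endswith(("/1", "/2")):
        return int(name[-1])
    return None


if __name__ == "__main__":
    print(guess_phred_offset(["!!!IIII"]))
    print(mate_number("@HWUSI-EAS100R:6:73:941:1973#0/1"))
    print(mate_number("@EAS139:136:FC706VJ:2:2104:15343:197393 2:Y:18:ATCACG"))
```

In practice you would feed `guess_phred_offset` the quality lines of the first few thousand reads rather than a single record.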


Also I've never heard FASTQ called a "FastAlignment and Quality file" (glossary on last page).

**ulz_peter** · 09-22-2011, 11:50 PM

Thanks for the hints. As we do not produce Illumina data in our lab (yet), I hadn't heard of those changes, although they seem to have been implemented a while ago...

The typo should read "following". I will rewrite that part and repost it...

**pc2009open** · 10-08-2011, 05:23 PM

This is a great document.

Thanks a lot. This is a great document. I wish I had read this document earlier.

**Heisman** · 10-08-2011, 05:50 PM

As the GATK local realignment around indels portion of the website does not explicitly say to run "FixMateInformation", I am curious whether that will affect downstream analysis in any way.

Great document, by the way.

**Jon_Keats** · 10-09-2011, 09:50 AM

Very nice document, thanks for sharing.

**NGSfan** · 10-09-2011, 01:03 PM

Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...

**pc2009open** · 10-09-2011, 04:50 PM

Hi ulz_peter,

I have gone through almost the whole process according to your suggestions. However, at "3.2. Variant quality score recalibration", I encountered some problems. (I used the GATK-1.0.5506 version.)

I got the error message: "Argument with name '--cluster_file' is missing." However, I did not put "--cluster_file" at all.

I looked at some help documents, and found that this kind of "cluster_file" is supposed to be generated by "GenerateVariantClusters". Have you used GenerateVariantClusters before? Is it necessary?

Thanks again for the wonderful manual.

**Heisman** · 10-09-2011, 07:35 PM

Originally posted by NGSfan

Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...

We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?

**ulz_peter** · 10-09-2011, 09:52 PM

Hi guys,
Thanks for all your responses. I must admit that the GATK parts are a little outdated (already). I'm gonna switch to the new version this week and will update the manual accordingly...

@pc2009open: I can't find any hint for the use of a cluster_file argument in variant quality score recalibration... Has anyone else seen that?

**NGSfan** · 10-10-2011, 01:49 PM

Originally posted by Heisman

We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?

Papers covering all variations and combinations have been hard to find. I did find one under review (Nature Precedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but the Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better; it just takes some effort to test the parameters further to see if it can handle all cases.

GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.

**raonyguimaraes** · 10-10-2011, 05:25 PM

Thank you so much for posting this pipeline, I've been doing the same for some time. Tomorrow I will post some comments about my results so far.

I think you could add this pipeline to yours:

https://www.vlsci.org.au/sites/default/files/GatkBestPrac_V3Pipeline_20Sept11.pdf

Let's make this thread a big reference for anyone doing exome sequencing... Please!!!

**raonyguimaraes** · 10-10-2011, 05:37 PM

One question: how many raw SNPs are you getting after running UnifiedGenotyper for the first time?

Here I'm getting about 300 000 SNPs and I think there is something wrong with these numbers...

Shouldn't it be around 20 000 snps?

I'm running my analysis again using a BED file from SeqCap EZ Human Exome Library v2.0 (http://www.nimblegen.com/products/se...tml#annotation) but still... 300 thousand SNPs is a lot...
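
One way to check how many of the raw calls actually fall inside the capture targets is to count VCF records against the BED intervals. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the inputs are plain-text lines rather than real files, and chromosome names must match between the two files (for example "chr1" in both, not "chr1" in one and "1" in the other). It is not part of GATK or of the manual.

```python
from bisect import bisect_right
from collections import defaultdict


def load_targets(bed_lines):
    """Parse BED lines into merged, sorted intervals per chromosome."""
    raw = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def count_on_target(vcf_lines, targets):
    """Return (on_target, off_target) counts of VCF records."""
    on_target = off_target = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        pos0 = int(pos) - 1  # VCF is 1-based; BED is 0-based, half-open
        intervals = targets.get(chrom, [])
        # Last interval whose start is <= pos0.
        i = bisect_right(intervals, (pos0, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]:
            on_target += 1
        else:
            off_target += 1
    return on_target, off_target


if __name__ == "__main__":
    bed = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t10"]
    vcf = [
        "##fileformat=VCFv4.1",
        "chr1\t101\t.\tA\tG\t50\tPASS\t.",
        "chr1\t301\t.\tC\tT\t50\tPASS\t.",
    ]
    print(count_on_target(vcf, load_targets(bed)))
```

If most of the calls turn out to be off target, restricting the count to the capture regions is the number to compare against other people's exome results.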

**Heisman** · 10-10-2011, 06:10 PM

Originally posted by NGSfan

Papers covering all variations and combinations have been hard to find. I did find one under review (Nature Precedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but the Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better; it just takes some effort to test the parameters further to see if it can handle all cases.

GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.

Interesting. Thank you for your post. We do a pretty good job (I think) using the latest SAMtools mpileup command with the -A and -B options and setting a minimum mapping quality per read at 50, but I haven't done anything rigorous to determine what our sensitivity/specificity is. I may go ahead and look at comparing it with GATK.
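
For reference, the mpileup settings described above could be assembled roughly as follows. This is only a sketch in Python: the helper name and the file names `ref.fa` and `sample.bam` are hypothetical, and the thread does not say which other options or downstream calling steps were used, so none are shown. Mapping the "minimum mapping quality" to `-q` is an assumption about what was meant.

```python
def build_mpileup_command(reference_fasta, bam_path, min_mapq=50):
    """Build an argument list for `samtools mpileup` (does not run it)."""
    return [
        "samtools", "mpileup",
        "-A",                 # count anomalous read pairs instead of skipping them
        "-B",                 # disable BAQ (base alignment quality) computation
        "-q", str(min_mapq),  # skip alignments with mapping quality below this
        "-f", reference_fasta,
        bam_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_mpileup_command("ref.fa", "sample.bam")))
```

The list form can be passed directly to `subprocess.run` once samtools is installed and the paths point at real files.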


Exome sequencing analysis manual
