Exome sequencing analysis manual

cllorens replied

02-03-2012, 02:04 PM
Hi Jon
Thank you for your prompt and instructive answer. I going to check the posts and links you indicate here.
Leave a comment:
Jon_Keats replied

02-03-2012, 11:51 AM
Hi Carlos,

1)The dbSNP to knownsites change is correct as you figured out the thread is using an older version than the current GATK 1.4.x

2)As you notice the use of properly formated VCF files is pretty uniform now in the GATK so I would move to using those files.

3)You should not have an issue using a VCF from NCBI though you may need to modify to match by adding chr to each chromosome postion. Anyone that follows my threads would see I'd strongly recommend not using UCSC based references or annotations, remember GATK is largely related to 1000 genomes which uses Sanger accepted references (ie. GRCh37).

4) Most of the GATK steps will work better if you use the provided files on the GSA ftp site as they are all formatted correctly though they expect reference files from NCBI,ensembl,1000 genomes. However, last I check they were dbSNP132 versions.

5) If you are interested we have posted a pipeline that implements most of the suggestions on this thread less a couple of changes to match the current GATK best practice guidelines for exomes. (http://seqanswers.com/forums/showthr...?t=4589&page=4) or (www.keatslab.org)
Leave a comment:
cllorens replied

02-03-2012, 09:48 AM
Hi Peter and everybody

First of all. I want to thank you for this excellent pipeline and all explanations. I downloaded the pdf you gave in the first post and was able to reproduce the steps therein until arriving to the section 2.6 "Quality score recalibration" where i
am now stopped and would really appreciate if you or anybody else could clarify me some questions.

In this section 2.6. you suggest to run the Step Count covariates of GATK using the option --DBSNP with the lastest SNP release version of UCSC (in your example dbsnp132.txt i took snp135 from UCSC). About this i am wondered about how to resolve the following troubles I met in order to keep going the protocol:

1) In the version (i think the last one) i have of GATK the option -DBSNP does not exist. I saw the options and check threads and i think that the option DBSNP is now called "knownSites". I guess so could you confirm me that this is correct?

2) "KnownSites" let me to run Count covariates but it gives me the error above, asking me to use a SNP database in the VCF format of the 1000 genomes project.

------------------------------------------------------------------------------------------------------------
~/assembling/tools/gatk/GenomeAnalysisTK-1.4-14-g2e47336> java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
INFO 17:45:22,015 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,016 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.4-14-g2e47336, Compiled 2012/01/11 11:42:24
INFO 17:45:22,016 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:45:22,016 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 17:45:22,016 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 17:45:22,017 HelpFormatter - Program Args: -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
INFO 17:45:22,017 HelpFormatter - Date/Time: 2012/02/03 17:45:22
INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:45:22,027 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:45:22,059 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:45:22,071 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 17:45:23,776 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.4-14-g2e47336):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
##### ERROR ------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------

3) My question (but only if my question to step 1 is that i was doing the correct think) here is about if UCSC does provide any tool to convert plain files to VCF or if is there any conversor available to convert UCSC SNPs DBs into VCF?

4) If there is not a conversor from UCSC to VCF i think I could get the dbsnp from the NCBI FTP but now the question is if i do that could I use this NCBI dbSNP over exome data mapped over the hg19 of the UCSC?? or i should repeat the whole protocol
from the beginning using this time the human genome downloaded from NCBI as a reference?

thank you in advance
Carlos
Leave a comment:
Mali Salmon replied

01-18-2012, 05:39 AM
Wow, thanks a lot Peter for the helpful information
Mali
Leave a comment:
ulz_peter replied

01-18-2012, 05:30 AM
Hi Mali,
The 1000Genomes people tend to put many samples in a single BAM file separated through their Readgroup tags. GATK states VariantRecalibration needs a lot of SNPs to work smoothly, so since we analyse very few Exomes I had to switch to VariantFiltrator.
I'd do the analysis separately for each sample and look afterwards to combine the data.
In case you're searching for a disease causing gene in a set of patients I'd recommend to have a look at IBD2 (http://compbio.charite.de/contao/ind...cles/xibd.html). It could extract regions with same haplotypes for different patients. THat decreases the amount of SNPs drastically...

You're very welcome,
Peter
Leave a comment:
Mali Salmon replied

01-18-2012, 05:00 AM
Thanks a lot Peter for quick reply.
What do you mean by "single-exome" analysis? Do you mean single-end reads (as in my case) or single sample?
I actually have data from 4 patients, and I thought of finding variants for each patient separately. Would you recommend to run them all in a single analysis?
Thanks again
Mali
Leave a comment:
ulz_peter replied

01-18-2012, 04:49 AM
Hi Mali Salmon,

I imported the document to the Seq-Wiki (see http://seqanswers.com/wiki/How-to/exome_analysis). The problem is: Variant recalibration doesn't really work for single-exome analyses. In the updated version (which is already kind of out-of-date) in the Seq Wiki, I already wrote that I got back to SNP filtering using VariantFiltration as I often got a lot of error messages trying to do Variant recalibration on a single sample.

There is a link to the GATK Homepage on the wiki. If you still want to do Variant Quality Score Recalibration I'd recommend you stick to their guidelines.

Hope that helps (and I am really happy some people really read the guideline)
Best regards,
Peter
Leave a comment:
Mali Salmon replied

01-18-2012, 03:15 AM
Hi All
I have been trying to follow the exome-analysis pipeline written by ulz_peter (thanks petter for this clear and nice document). I have encountered the same problem as pc2009open when trying to run "VariantRecalibrator" (using GATK version 1.0.5336)
I got the same error message: "Argument with name '--cluster_file' is missing.
I am wondering if you solved the problem, and if there is an updated analysis pipeline working with latest GATK version
Thanks
Leave a comment:
godfrey1 replied

12-27-2011, 09:07 AM
Thanks !

Looks useful- thanks for distributing this.
GP
Leave a comment:
blackgore replied

11-24-2011, 06:39 AM
The I/O exceptions are just slight warnings that the software can't "phone home" as it has to deal with proxies and firewalls and the like. It doesn't affect the result, and you can silence the errors by providing the right parameters. This is just an example output I grabbed.
Leave a comment:
ulz_peter replied

11-24-2011, 06:26 AM
I've never seen that error...
Also I've never seen that HttpMethodDirector - I/O Exceptions...

Maybe you should talk to the GATK people at the GetSatisfaction Page:

503 Service Temporarily Unavailable

http://getsatisfaction.com/gsa
Leave a comment:

blackgore replied

11-24-2011, 06:17 AM

In following the workflow mentioned above, I've come up against an error, and I'm wondering if I'm alone in this. Has anyone experienced difficulty with using CountCovariates tool, specifically with errors regarding accessing information from the input BAM file? I've tried this with several samples, but keep getting the same error, "Bad input: Could not find any usable data in the input BAM file(s)"

(for those interested, the BAM files in question are not empty, and work just fine with samtools view).

Code:

java -Xmx16g -jar /$Software/GenomeAnalysisTK-1.3-17-gc62082b/GenomeAnalysisTK.jar -T CountCovariates -R /$Genomes/Broad/Human/b37/human_g1k_v37.fasta -I $Projects/data/SampleA_bowtie.gatk.realign.bam -nt 8 -l INFO -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -log RECAL.log -recalFile RECAL.csv --knownSites $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf

INFO  14:01:25,870 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:25,875 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.3-17-gc62082b, Compiled 2011/11/18 15:24:46
INFO  14:01:25,875 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  14:01:25,876 HelpFormatter - Please view our documentation at [url]http://www.broadinstitute.org/gsa/wiki[/url]
INFO  14:01:25,876 HelpFormatter - For support, please view our support site at [url]http://getsatisfaction.com/gsa[/url]
INFO  14:01:25,877 HelpFormatter - Program Args: -T CountCovariates -R /$Genomes/Broad/Human/b37/human_g1k_v37.fasta -I $Projects/data/SampleA_bowtie.gatk.realign.bam -nt 8 -l INFO -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -log RECAL.log -recalFile RECAL.csv --knownSites $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf
INFO  14:01:25,878 HelpFormatter - Date/Time: 2011/11/24 14:01:25
INFO  14:01:25,878 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:25,878 HelpFormatter - ---------------------------------------------------------------------------------
INFO  14:01:26,052 RodBindingArgumentTypeDescriptor - Dynamically determined type of $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf to be VCF
INFO  14:01:26,064 GenomeAnalysisEngine - Strictness is SILENT
INFO  14:01:26,815 RMDTrackBuilder - Loading Tribble index from disk for file $Genomes/Broad/Human/b37/dbsnp_132.b37.vcf
INFO  14:01:30,532 MicroScheduler - Running the GATK in parallel mode with 8 concurrent threads
INFO  14:01:32,326 CountCovariatesWalker - The covariates being used here:
INFO  14:01:32,327 CountCovariatesWalker -      ReadGroupCovariate
INFO  14:01:32,327 CountCovariatesWalker -      QualityScoreCovariate
INFO  14:01:32,327 CountCovariatesWalker -      CycleCovariate
INFO  14:01:32,328 CountCovariatesWalker -      DinucCovariate
INFO  14:01:41,189 CountCovariatesWalker - Writing raw recalibration data...
INFO  14:01:44,145 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,146 HttpMethodDirector - Retrying request
INFO  14:01:44,149 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,149 HttpMethodDirector - Retrying request
INFO  14:01:44,152 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,153 HttpMethodDirector - Retrying request
INFO  14:01:44,155 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,155 HttpMethodDirector - Retrying request
INFO  14:01:44,158 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
INFO  14:01:44,158 HttpMethodDirector - Retrying request
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.3-17-gc62082b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation [url]http://www.broadinstitute.org/gsa/wiki[/url]
##### ERROR Visit our forum to view answers to commonly asked questions [url]http://getsatisfaction.com/gsa[/url]
##### ERROR
##### ERROR MESSAGE: Bad input: Could not find any usable data in the input BAM file(s).
##### ERROR ------------------------------------------------------------------------------------------

Leave a comment:

christophpale replied

11-22-2011, 12:48 AM
I think the -L argument expects the intervals file in SAM format
(http://www.broadinstitute.org/gsa/wi...line_arguments). If yes, use bedtools' bedToBam

Originally posted by liu_xt005 View Post

Following ulz_peter's original doc, I have some problem when doing the SNP-calling.

java -Xmx4g -jar /path/GenomeAnalysisTK-1.1-35-ge253f6f/GenomeAnalysisTK.jar \
-glm BOTH \
-R hg18.fa \
-T UnifiedGenotyper \
-I myinput.marked.realigned.fixed.recal.bam \
-D dbsnp132_hg18.txt \
-o myoutput.snps.vcf \
-metrics snps.metrics \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 1000 \
-A DepthOfCoverage \
-A AlleleBalance \
-L hg18_exonIntervals.bed

This "-L" option does not work.
I got the hg18_exonIntervals.bed from UCSC as ulz_peter's original doc shows.
I run the SNP-calling without the "-L" line.
Then the variant quality score recalibration step does not work, generating an empty output.tranches file.

Can somebody help me out? Thanks a lot.
Leave a comment:
raonyguimaraes replied

10-27-2011, 07:56 AM
dad

Hello all, I have good news ...

After using annovar, I finally got to the number of 22709 variants on my data.

From there I'm now trying to filter based on this approach:

The numbers are pretty close so I think I'm on the right track

22709 Variants
11.179 Variants
4766 Variants
4222 Variants
removed frequency > 0.01
878 Variants
427 Variants
Leave a comment:
liu_xt005 replied

10-27-2011, 07:36 AM
Filtering exon+10bp bed file

Originally posted by raonyguimaraes View Post

I rerunned all my analysis several times since the alignment using BWA -I option till UnifiedGenotyper and I'm still getting this output from UnifiedGenotyper:

Visited bases 3102559836
Callable bases 2864301370
Confidently called bases 1142400
% callable bases of all loci 92.321
% confidently called bases of all loci 0.037
% confidently called bases of callable loci 0.040
Actual calls made 350263

Since it's the same number of variants I'm getting from DNANexus I'm starting to believe this is the right number before filtering the variants ...

I decided to download the BED File from UCSC Table Browser as suggested on the manual using "Exon plus 10bp", and tried to use this file with UnifiedGenotyper since I was using a bedfile from "SeqCap EZ Human Exome Library v2.0".

Running this part again I was receiving a message saying that there were overlaps on the intervals, so decided to use bedtools to merge the intervals with the command:

Does anyone else had to do it ?

After doing this and trying again i'm receiving the message:

Sortbed didn't help me so I wrote an script in python to parse this bedfile and put everything as it should be chr1/1, chr2/2 and so on ... Does anyone else had to do the same?

I checked the quality of my reads with FASTQC and they looked ok, so I didn't do any clean on my reads before using BWA->GATK.

What I could use to clean my illumina reads ? NGS Backbone, SeqClean, CleanSeq, Prinseq, FastQX ? Does anyone improved the number of calls by doing it ?

For the Variant Quality Score Recalibration they suggest that "in order to achieve the best exome results one needs to use an exome SNP callset with at least 30 samples."

Does anyone tried to merge other exomes and got better results from it ?

Still looking for my 20k variants

The exon+10bp bed file helps. I got 150k variants from an individual without it, and reduced the number to 30k with it. The UCSC style bed file contains random and hapmap-derived positions. My reference genome for alignment does not contain such bases, so I manually filtered out (by perl) those from the exon+10bp bed file and made it work.
Leave a comment:

Previous 1 3 4 5 6 7 8 9 10 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, Yesterday, 02:46 PM	0 responses 11 views 0 likes	Last Post by seqadmin Yesterday, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 13 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 23 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News