Exome sequencing analysis manual

rahilsethi replied

09-07-2012, 07:01 AM
GATK -dcov option???

I have additional question to raonyguimaraes's post
Does anyone know in details about GATK -dcov option in UnifiedGenotyper. I tried to look in GATK Manual but could not find much about it other than the following information:
-dcov [50 for 4x, 200 for >30x WGS or Whole exome]
in the link:

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html

Also if not specified what default value this option takes?

If you anyone knows about it could you please send me the link to the information resource?

Thanks in advance
Leave a comment:
Jane M replied

09-07-2012, 02:14 AM
Originally posted by fongchun View Post

I think I might have figured out at least part of your question with regard to the file recal_data.grp. If you look at the GATK methods and workflow page under the "Base Quality Score Recalibrator" section, it shows the recal_data.grp being used as part of the -BQSR parameter:

Code:

java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam

\

Interesting thing is the documentation for the PrintReads program doesn't include the -BQSR parameter...

Ah, interesting.. I only noticed this information about PrintReads (http://www.broadinstitute.org/gatk/g...ntReads.html):

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T PrintReads \
-o output.bam \
-I input1.bam \
-I input2.bam \
--read_filter MappingQualityZero

I didn't check where you suggested me. And here it's much clearer:

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam

The grp file is used and there is an output bam file
Thanks fongchun!
Leave a comment:
Jane M replied

09-07-2012, 01:58 AM
Originally posted by AJERYC View Post

I'm not sure if we are running the same version of GATK. For Quality score recalibration I use the following instructions

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar -I input.marked.realigned.fixed.bam -R hg19/hg19.fa -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv -knownSites:dbsnp,VCF dbsnp135.hg19.vcf

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \-l INFO \-R hg19.fa \-I input.marked.realigned.fixed.bam \-T TableRecalibration \--out input.marked.realigned.fixed.recal.bam \-recalFile input.recal_data.csv

You can see I get 2 files, one is the bam file and the other one is the recal_data (that you get in the first instruction. Maybe you are missing the second instruction and that is why you dont get the bam file.

The point is that we are not using the same version. You probably have a version before v2.0 and have a version after 2.0. From this 2.0 version, CountCovariates and TableRecalibration do not exist anymore. That's a pity because the process was rather clear. The csv file generated at the CountCovariates step is then used at the TableRecalibration step...

Last edited by Jane M; 09-07-2012, 02:05 AM.
Leave a comment:
AJERYC replied

09-06-2012, 10:13 PM
Originally posted by Jane M View Post

Thank you AJERYC!

Because of some troubles with my version of dbSNP, I haven't managed to run:

but I am still wondering if I should run the PrintReads step since I only have one bam file and if my recalibrated bam file will be the recal_data.grp file. Any idea?

I'm not sure if we are running the same version of GATK. For Quality score recalibration I use the following instructions

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar -I input.marked.realigned.fixed.bam -R hg19/hg19.fa -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv -knownSites:dbsnp,VCF dbsnp135.hg19.vcf

java -Xmx16G -jar gatk/GenomeAnalysisTK.jar \-l INFO \-R hg19.fa \-I input.marked.realigned.fixed.bam \-T TableRecalibration \--out input.marked.realigned.fixed.recal.bam \-recalFile input.recal_data.csv

You can see I get 2 files, one is the bam file and the other one is the recal_data (that you get in the first instruction. Maybe you are missing the second instruction and that is why you dont get the bam file.
Leave a comment:
fongchun replied

09-06-2012, 08:19 PM
Originally posted by Jane M View Post

Not yet...

I think I might have figured out at least part of your question with regard to the file recal_data.grp. If you look at the GATK methods and workflow page under the "Base Quality Score Recalibrator" section, it shows the recal_data.grp being used as part of the -BQSR parameter:

Code:

java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam

\

Interesting thing is the documentation for the PrintReads program doesn't include the -BQSR parameter...
Leave a comment:
Jane M replied

09-06-2012, 12:57 PM
Originally posted by fongchun View Post

Did you ever get an answer to these questions? I am running into the same issues with the newer version of GATK.

Thanks,

Fong

Not yet...
Leave a comment:
fongchun replied

09-06-2012, 11:47 AM
What is the option -l INFO? Is it still in use in the new version?
I guess that -o recal_data.grp is equivalent to -recalFile input.recal_data.csv. Am I right? What is the interest of this file, I don't see when it is used...
Finally, where is the recalibrated bam file ? There is only .grp output file at the BaseRecalibrator step.

I am a bit confused by these changes between versions...

Did you ever get an answer to these questions? I am running into the same issues with the newer version of GATK.

Thanks,

Fong
Leave a comment:
Jane M replied

09-06-2012, 08:47 AM
Thank you AJERYC!

Because of some troubles with my version of dbSNP, I haven't managed to run:

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I my_reads.bam \
-R resources/Homo_sapiens_assembly18.fasta \
-knownSites bundle/hg18/dbsnp_132.hg18.vcf \
-knownSites another/optional/setOfSitesToMask.vcf \
-o recal_data.grp

but I am still wondering if I should run the PrintReads step since I only have one bam file and if my recalibrated bam file will be the recal_data.grp file. Any idea?
Leave a comment:
AJERYC replied

09-06-2012, 04:17 AM
Originally posted by Jane M View Post

Thank both of you.
I managed to run by using a different temporary folder :

My output file is generating finally Thanks !
But I have this output:

Did you get also this message?

Code:

Ignoring SAM validation error: ERROR: Record 104416914, Read name HWI-ST584_0081:4:1204:9783:76003#AGTCAA, MAPQ should be 0 for unmapped read.

Jane

This error messages means these sequences can not be aligned against the reference genome. That is why you use the VALIDATION_STRINGENCY=LENIENT option so that the programs points you the sequences but dont stop running.
Leave a comment:
vivek_ replied

09-05-2012, 08:55 AM
Originally posted by frewise View Post

Hi, raonyguimaraes, why did you remove genes with multiple variants in your last 2 steps?

I think its the opposite, he's taking genes with multiple variants into his regions of interest and treating the others with less importance.
Leave a comment:
Jane M replied

09-05-2012, 08:36 AM
One additional question regarding the quality score recalibration: from the v2.0 of GATK, this seems to be performed in one step only.

Previously, it was:

java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa --DBSNP dbsnp132.txt -I input.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile input.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa -I input.marked.realigned.fixed.bam -T TableRecalibration --out input.marked.realigned.fixed.recal.bam -recalFile input.recal_data.csv

but now, it's rather

java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa -knowSites dbsnp132.txt -I input.marked.realigned.fixed.bam -T BaseRecalibrator -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate --out input.recal_data.csv

From GATK doc:
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-I my_reads.bam \
-R resources/Homo_sapiens_assembly18.fasta \
-knownSites bundle/hg18/dbsnp_132.hg18.vcf \
-knownSites another/optional/setOfSitesToMask.vcf \
-o recal_data.grp

with this additional step only if using several bam files (if I understand well the documentation: "PrintReads can dynamically merge the contents of multiple input BAM files, resulting in merged output sorted in coordinate order" )

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T PrintReads \
-o output.bam \
-I input1.bam \
-I input2.bam \
--read_filter MappingQualityZero

What is the option -l INFO? Is it still in use in the new version?
I guess that -o recal_data.grp is equivalent to -recalFile input.recal_data.csv. Am I right? What is the interest of this file, I don't see when it is used...
Finally, where is the recalibrated bam file ? There is only .grp output file at the BaseRecalibrator step.

I am a bit confused by these changes between versions...
Leave a comment:
Jane M replied

09-05-2012, 06:03 AM
Thank both of you.
I managed to run by using a different temporary folder :

java -Djava.io.tmpdir=/mnt/seq3/seq3/LMMC/GAR/GAR_sain -jar /share/apps/picard-tools-1.76/FixMateInformation.jar INPUT=/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.bam OUTPUT=/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

My output file is generating finally Thanks !
But I have this output:

[Wed Sep 05 14:53:57 CEST 2012] net.sf.picard.sam.FixMateInformation INPUT=[/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.bam] OUTPUT=/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.fixed.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
[Wed Sep 05 14:53:57 CEST 2012] Executing as [...]; Java HotSpot(TM) Server VM 1.6.0_26-b03; Picard version: 1.76(1261)
INFO 2012-09-05 14:53:57 FixMateInformation Sorting input into queryname order.
Ignoring SAM validation error: ERROR: Record 104416902, Read name HWI-ST584_0081:4:2206:2021:2237#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416903, Read name HWI-ST584_0081:4:1102:17248:85426#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416904, Read name HWI-ST584_0081:4:1102:2000:84379#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416905, Read name HWI-ST584_0081:4:1103:2169:69666#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416906, Read name HWI-ST584_0081:4:1103:8346:98868#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416907, Read name HWI-ST584_0081:4:1104:12598:60315#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416908, Read name HWI-ST584_0081:4:1105:1489:49925#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416909, Read name HWI-ST584_0081:4:1105:16587:151639#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416910, Read name HWI-ST584_0081:4:1105:8354:181686#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416911, Read name HWI-ST584_0081:4:1107:17717:4065#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416912, Read name HWI-ST584_0081:4:1108:4813:156146#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416913, Read name HWI-ST584_0081:4:1108:9333:173330#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416914, Read name HWI-ST584_0081:4:1204:9783:76003#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416915, Read name HWI-ST584_0081:4:1206:4951:59169#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416916, Read name HWI-ST584_0081:4:2104:7277:38433#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416917, Read name HWI-ST584_0081:4:2106:5867:60527#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416918, Read name HWI-ST584_0081:4:2202:1737:149838#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416919, Read name HWI-ST584_0081:4:2205:16036:86626#AGTCAA, MAPQ should be 0 for unmapped read.
Ignoring SAM validation error: ERROR: Record 104416920, Read name HWI-ST584_0081:4:2207:1759:176116#AGTCAA, MAPQ should be 0 for unmapped read.
INFO 2012-09-05 15:13:24 FixMateInformation Sorting by queryname complete.
INFO 2012-09-05 15:13:24 FixMateInformation Output will be sorted by coordinate
INFO 2012-09-05 15:13:24 FixMateInformation Traversing query name sorted records and fixing up mate pair information.
INFO 2012-09-05 15:13:34 FixMateInformation Processed 1,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 10s. Last read position: chr14:74,489,555
INFO 2012-09-05 15:13:52 FixMateInformation Processed 2,000,000 records. Elapsed time: 00:00:27s. Time for last 1,000,000: 17s. Last read position: chr4:104,072,163
INFO 2012-09-05 15:14:05 FixMateInformation Processed 3,000,000 records. Elapsed time: 00:00:41s. Time for last 1,000,000: 13s. Last read position: chr21:31,654,746

Did you get also this message?

Code:

Ignoring SAM validation error: ERROR: Record 104416914, Read name HWI-ST584_0081:4:1204:9783:76003#AGTCAA, MAPQ should be 0 for unmapped read.

Jane
Leave a comment:
AJERYC replied

09-05-2012, 03:00 AM
IOException: No space left on device

Originally posted by Jane M View Post

Thank you for your help oyvindbusk,
I changed -T CountCovariates by BaseRecalibrator and -T TableRecalibration by PrintReads.

Nevertheless, I am stuck before this step, with Picard, when running:
I got:
I googled it, it seems to be a problem of space... I don't know if it's because the folder /tmp is too small or if I should add -Xmx4g... On the cluster that I'm using, this -Xmx4g is not working.
Any idea?

I have got the same errors and it seems I ran out of space in the hard drive. If you are running a very large exome file (i.e. 100 x exome) you may need a large space for your temporary files. Try to increase your hard drive space and see what happens.
Leave a comment:
oyvindbusk replied

09-05-2012, 01:28 AM
I think I had a similar problem resolved by increasing the size of temp-folder. I would try this first. If it were the memory it would probably say something like "not sufficient memory".

Ø
Leave a comment:
Jane M replied

09-05-2012, 12:02 AM
Thank you for your help oyvindbusk,
I changed -T CountCovariates by BaseRecalibrator and -T TableRecalibration by PrintReads.

Nevertheless, I am stuck before this step, with Picard, when running:

java -Djava.io.tmpdir=/tmp -jar /share/apps/picard-tools-1.76/FixMateInformation.jar INPUT=$path/$i/$i.realigned.bam OUTPUT=$path/$i/$i.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

I got:

[Wed Sep 05 09:59:27 CEST 2012] net.sf.picard.sam.FixMateInformation INPUT=[/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.bam] OUTPUT=/mnt/seq3/seq3/LMMC/GAR/GAR_sain/GAR_sain.realigned.fixed.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false
[Wed Sep 05 09:59:27 CEST 2012] Executing as [..]; Java HotSpot(TM) Server VM 1.6.0_26-b03; Picard version: 1.76(1261)
INFO 2012-09-05 09:59:27 FixMateInformation Sorting input into queryname order.
[Wed Sep 05 10:15:30 CEST 2012] net.sf.picard.sam.FixMateInformation done. Elapsed time: 16.06 minutes.
Runtime.totalMemory()=954466304
FAQ: http://sourceforge.net/apps/mediawik...itle=Main_Page
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:228)
at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
at net.sf.picard.sam.FixMateInformation.doWork(FixMateInformation.java:147)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at net.sf.picard.sam.FixMateInformation.main(FixMateInformation.java:75)
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.write(Native Method)
at org.xerial.snappy.SnappyOutputStream.writeInt(SnappyOutputStream.java:105)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:126)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:219)
... 5 more

I googled it, it seems to be a problem of space... I don't know if it's because the folder /tmp is too small or if I should add -Xmx4g... On the cluster that I'm using, this -Xmx4g is not working.
Any idea?

Last edited by Jane M; 09-05-2012, 12:57 AM.
Leave a comment:

Previous 1 2 3 4 5 6 10 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, Yesterday, 02:46 PM	0 responses 11 views 0 likes	Last Post by seqadmin Yesterday, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 13 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 17 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 23 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News