**kenietz** · 05-09-2012, 10:36 PM

hi aituka,
I was trying to figure out how to use GATK with Torrent data as well. More or less I ended up with something similar to your pipeline. But my customers gave me some info about expected deletions which I could not find using that pipeline. It turned out to be a problem with the MarkDuplicates step. After the marking, lots of SNPs and indels disappear from the VCF, even though one can see the expected deletion in the BAM file when using IGV.

So for now, after I add the read group and sort and index the BAM file, I go directly to UnifiedGenotyper and VariantFiltration from GATK, skipping recalibration and the rest. Interestingly, the results with and without recalibration, both excluding MarkDuplicates, are almost identical.

Btw, I use 'bwa bwasw' to align my data to human.

**colinmolter** · 05-10-2012, 12:20 AM

Hi tuka,
I do agree with kenietz. The MarkDuplicates step might remove too many reads. The problem relates to this post: http://seqanswers.com/forums/showthread.php?t=6854 (Removing duplicates is it really necessary?). When sequencing the whole exome, you don't expect too many duplicates, so removing them can be a good way to get rid of PCR duplicates. In targeted sequencing, however, it can be natural to have a lot of duplicates, and removing them might not be a good option.
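To illustrate why targeted data can look almost entirely "duplicated", here is a minimal sketch. It assumes a simplified notion of duplicate marking for single-end reads (same chromosome, same 5' position, same strand) and a made-up amplicon panel where every read starts at one of the two amplicon ends. The `Read` type, the helper names, and all coordinates are hypothetical; this is not the Picard implementation.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    five_prime_pos: int  # 5' coordinate of the alignment
    strand: str          # "+" or "-"


def count_position_duplicates(reads):
    """Count reads sharing (chrom, 5' position, strand) with another read."""
    groups = Counter((r.chrom, r.five_prime_pos, r.strand) for r in reads)
    # One read per group is kept as the representative; the rest count as duplicates.
    return sum(n - 1 for n in groups.values())


def simulate_amplicon_reads(n_amplicons, reads_per_amplicon, amplicon_len=150):
    """Hypothetical amplicon panel: each read starts at one of the two amplicon ends."""
    reads = []
    for a in range(n_amplicons):
        start = 10_000 + a * 1_000
        for i in range(reads_per_amplicon):
            if i % 2 == 0:
                reads.append(Read("chr1", start, "+"))
            else:
                reads.append(Read("chr1", start + amplicon_len - 1, "-"))
    return reads


reads = simulate_amplicon_reads(n_amplicons=10, reads_per_amplicon=1500)
dups = count_position_duplicates(reads)
print(f"{dups} of {len(reads)} reads flagged ({100 * dups / len(reads):.1f}%)")
```

Under these assumptions only two reads per amplicon survive, regardless of depth, which is why position-based duplicate marking can discard nearly all of the evidence in deep amplicon data.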

In an example of Ion PGM data, I got:
429,424 reads, among which I could align (bwa bwasw) 383,091 (89.21%) reads.
But there were 371,738 duplicates!

As for Kenietz, results with or without duplicates were identical.
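For reference, the arithmetic behind those numbers can be checked with a few lines. This is only a sketch, and it assumes the duplicate count refers to the aligned reads, which the post does not state explicitly.

```python
# Read counts quoted in the post above.
total_reads = 429_424
aligned = 383_091
duplicates = 371_738

aligned_pct = 100 * aligned / total_reads

# Assumption: duplicates were counted among the aligned reads.
dup_pct_of_aligned = 100 * duplicates / aligned
non_duplicate_reads = aligned - duplicates

print(f"aligned: {aligned_pct:.2f}% of all reads")
print(f"duplicates: {dup_pct_of_aligned:.2f}% of aligned reads")
print(f"reads left after duplicate removal: {non_duplicate_reads}")
```

So roughly 97% of the aligned reads would be marked, leaving on the order of eleven thousand reads for variant calling.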

**kenietz** · 05-10-2012, 12:36 AM

@colinmolter:
exactly the same situation. I have this targeted reseq with coverage of 1000-2000x, and around 95% of the reads were marked as duplicates.
It's sad that the GATK documentation is not enough. But unfortunately that is the situation: one cannot describe all possibilities, and hence when one follows the general guide one sometimes gets lost. It's good that there are many forums around, though.

Cheers

**kenietz** · 05-17-2012, 10:53 PM

Hi guys,
what arguments did you use for UnifiedGenotyper and VariantFiltration?

**colinmolter** · 05-17-2012, 11:46 PM

Originally posted by kenietz

Hi guys,
what arguments did you use for UnifiedGenotyper and VariantFiltration?

I used something like this:

Code:

java -Xmx4g -jar gatk.jar -T UnifiedGenotyper -R bwa6.1/h_g1k_v37.fasta -L targetintervals.bed -nt 16 -A AlleleBalance -A DepthOfCoverage -stand_call_conf 30.0 -stand_emit_conf 10 -glm BOTH --dbsnp dbsnp_135.b37.vcf -o DS.SNV.all.vcf -metrics DS.SNV.all.vcf.metric -I s1.bam -I s2.bam

what about yours?
any comments?
colin

**kenietz** · 05-17-2012, 11:55 PM

Hi colin,
I followed a guide for GATK for Illumina which I found on this site, but had to modify it just a tiny bit.

`java -jar /opt/GATK/GenomeAnalysisTK.jar -R $refgen -T UnifiedGenotyper -I realigned.recal.bam --dbsnp /mnt/hd/GATK_recource_bundle/hg19/dbsnp_135.hg19.vcf.reordered -o gatk_var.raw.vcf --num_threads 8 -L pbrm1_intervals.bed --genotype_likelihoods_model BOTH --metrics_file snp.metrics -stand_call_conf 30 -stand_emit_conf 10 -A DepthOfCoverage -A AlleleBalance`;

`java -Xmx4g -jar /opt/GATK/GenomeAnalysisTK.jar -R $refgen -T VariantFiltration --variant gatk_var.raw.vcf -o gatk_var_filtered.vcf --clusterWindowSize 10 --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" --filterExpression "DP < 5 " --filterName "LowCoverage" --filterExpression "QUAL < 30.0 " --filterName "VeryLowQual" --filterExpression "QUAL > 30.0 && QUAL < 50.0 " --filterName "LowQual" --filterExpression "QD < 1.5 " --filterName "LowQD" --filterExpression "SB > -10.0 " --filterName "StrandBias"`;
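To make it easier to see what the hard filters in the VariantFiltration command do, here is a rough sketch that applies the same thresholds to a plain dictionary of annotations. It is not how GATK evaluates its JEXL expressions: the `FILTERS` table, the `applied_filters` helper, and the example annotation values are all hypothetical, the cluster filter from `--clusterWindowSize` is not modelled, and treating a missing annotation as "does not match" is an assumption.

```python
# Each entry mirrors one --filterExpression / --filterName pair from the command above.
FILTERS = [
    ("HARD_TO_VALIDATE", lambda v: v["MQ0"] >= 4 and (v["MQ0"] / (1.0 * v["DP"])) > 0.1),
    ("LowCoverage", lambda v: v["DP"] < 5),
    ("VeryLowQual", lambda v: v["QUAL"] < 30.0),
    ("LowQual", lambda v: 30.0 < v["QUAL"] < 50.0),
    ("LowQD", lambda v: v["QD"] < 1.5),
    ("StrandBias", lambda v: v["SB"] > -10.0),
]


def applied_filters(variant):
    """Return the names of all filters whose expression matches, or ["PASS"]."""
    names = []
    for name, predicate in FILTERS:
        try:
            if predicate(variant):
                names.append(name)
        except (KeyError, ZeroDivisionError):
            # Assumption: an expression that cannot be evaluated does not match.
            pass
    return names or ["PASS"]


# Hypothetical annotation values, for illustration only.
deep_site = {"QUAL": 30.0, "DP": 1200, "MQ0": 0, "QD": 12.0, "SB": -250.0}
shallow_site = {"QUAL": 42.0, "DP": 3, "MQ0": 0, "QD": 14.0, "SB": -0.5}

print(applied_filters(deep_site))
print(applied_filters(shallow_site))
```

One detail the sketch makes visible: a site with QUAL of exactly 30.0 matches neither `QUAL < 30.0` nor `QUAL > 30.0 && QUAL < 50.0`, so it slips past both quality filters as written.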

It seems that the results from GATK and the VariantCaller plugin from Torrent Suite differ.

Now trying to get the standalone variant caller to run. Only partial success for now, though.

**adaptivegenome** · 05-18-2012, 10:41 AM

Has anyone tried TMAP over BWA?

**aggp11** · 05-18-2012, 12:15 PM

@genericforms

I am sorry, but I am not sure if your question is more general or specifically towards this problem.

In case it is more general, then yes, I have tried both TMAP and BWA on our datasets, and it seems there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw), as TMAP selects the appropriate algorithm based on the read characteristics (most probably read length), unless of course TMAP decides to run SSAHA, which would give different results.

Thanks,
Praful

**adaptivegenome** · 05-18-2012, 12:17 PM

Originally posted by aggp11

@genericforms

I am sorry, but I am not sure if your question is more general or specifically towards this problem.

In case it is more general, then yes, I have tried both TMAP and BWA on our datasets, and it seems there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw), as TMAP selects the appropriate algorithm based on the read characteristics (most probably read length), unless of course TMAP decides to run SSAHA, which would give different results.

Thanks,
Praful

Yes, I was curious how, in your experience, BWA-SW matches up to TMAP...

**nilshomer** · 05-20-2012, 05:40 PM

Originally posted by aggp11

@genericforms

I am sorry, but I am not sure if your question is more general or specifically towards this problem.

In case it is more general, then yes, I have tried both TMAP and BWA on our datasets, and it seems there is not much of a difference if you select the right BWA algorithm (aln-samse / bwasw), as TMAP selects the appropriate algorithm based on the read characteristics (most probably read length), unless of course TMAP decides to run SSAHA, which would give different results.

Thanks,
Praful

Simulate some Ion 100-200bp human data and compare the two programs for yourself (the simulator, the simulation results evaluator, and the plotting software are included). Feel free to create a new post to give feedback.

**adaptivegenome** · 05-21-2012, 04:08 PM

We are working on it, Nils. In a moment of weakness I got lazy and wanted to know if someone had already looked into this...


pipeline for ion torrent data
