Seqanswers Leaderboard Ad

**fjrossello** · 12-07-2012, 04:32 PM

Originally posted by stephlo View Post

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104 and GATK v2.2-8-gec077cd.
I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched.
Marking 20272927 records as duplicates.
Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to miss.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
-> 428072 reads (0.26% of total) failing BadMateFilter
-> 16077607 reads (9.65% of total) failing DuplicateReadFilter
-> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to miss

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me since
1) there are even more reads now than input sequences,
2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
-> 16010068 reads (9.32% of total) failing DuplicateReadFilter
-> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?

Thanks for any comments!

Hi Stephio,

Just a comment. I use a similar approach - no CleanSam step used - for exome capture analysis and got comparable results. IMHO, the protocol We are currently following is standard and sound. At the same time, I found no logical explanation to the variation in the no. of read counts. I am also concerned about the high no. of failing MappingQualityZeroFilter reads. Any logical/clear explanation for that?

Cheers,

Fernando

**stephlo** · 12-08-2012, 05:38 AM

Hi Fernando,

I posted the same question on the GATK forum and was told not to worry as long as those counts are "roughly silimar" :

Just a moment...

http://gatkforums.broadinstitute.org/discussion/1891/picard-gatk-read-counts-varying-between-programs#latest

Best,

Stephan

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

GATK: read counts varying between programs

Comment

Comment

Latest Articles

ad_right_rmr

News