Hey everyone,
I've been using Novoalign to map Illumina reads (TruSeq capture, HiSeq paired-end sequencing), followed by GATK base quality recalibration in the hope of getting better results. Strangely, after GATK recalibration both ends of the reads show very high reported quality scores. Novoalign has its own recalibration, which does not show this effect, but if GATK recalibration is then run on top of it, the very high quality scores at both ends appear again.
These quality scores, particularly at the 3' end, don't seem right for this data. In addition, the same effect is not seen with a BWA alignment of the same data. I have seen it on every dataset I have tried so far (all TruSeq, HiSeq). Because the -H option was set in Novoalign, many reads were trimmed (some down to as few as 16 bases), but the effect is still observed after removing all trimmed reads.
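For anyone who wants to check the per-position qualities outside FastQC, here is a minimal sketch of the comparison I am describing. It is not part of my actual pipeline: the function name `mean_quality_per_position` and the `full_length_only` switch are made up for illustration, and it assumes Phred+33 quality strings that have already been pulled out of the FASTQ/BAM records in sequencing order.

```python
from typing import Iterable, List

# Assumption: qualities are Phred+33 encoded (Sanger / Illumina 1.8+ style).
PHRED_OFFSET = 33


def mean_quality_per_position(
    quality_strings: Iterable[str], full_length_only: bool = False
) -> List[float]:
    """Return the mean Phred score at each read position.

    If full_length_only is True, reads shorter than the longest read
    (e.g. reads that were trimmed) are left out of the averages.
    """
    quals = [q for q in quality_strings if q]
    if not quals:
        return []

    max_len = max(len(q) for q in quals)
    if full_length_only:
        quals = [q for q in quals if len(q) == max_len]

    totals = [0] * max_len
    counts = [0] * max_len
    for q in quals:
        for i, ch in enumerate(q):
            totals[i] += ord(ch) - PHRED_OFFSET
            counts[i] += 1

    # Every position has at least one read, because max_len comes from a real read.
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # Toy input: two full-length reads and one trimmed read.
    example = ["IIII", "II55", "5"]
    print(mean_quality_per_position(example))
    print(mean_quality_per_position(example, full_length_only=True))
```

Running it on the quality strings before and after recalibration, with and without `full_length_only`, gives the same kind of per-position comparison as the plots, and shows whether the jump at the read ends survives once trimmed reads are excluded.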
Novoalign mapped reads (without Novoalign recalibration) before GATK recalibration

Novoalign mapped reads (without Novoalign recalibration) after GATK recalibration

Novoalign mapped reads (with Novoalign recalibration) before GATK recalibration

Novoalign mapped reads (with Novoalign recalibration) after GATK recalibration

BWA mapped reads before GATK recalibration

BWA mapped reads after GATK recalibration

My pipeline has been:
alignment, sort/order, FastQC, duplicate removal (MarkDuplicates), GATK base quality recalibration, FastQC. I would have preferred to run the first FastQC step after duplicate removal, but this order was easier to implement in this case, and I don't think it is hiding anything (duplicate levels are ~17%).
Any enlightenment would be appreciated.