Hello folks,
After downloading the reads of the ChIP-seq sample GSM721212 and realigning them with bwa, I get a rather strange mismatch profile for the reads with high mapping quality. Please have a look at the attached figure from samstat.
The overall mapping quality is unremarkable and, according to fastqc, there are no overrepresented sequences in the sample, yet 70k reads have a mismatch exactly at position 16. I am surprised that all four bases mismatch at this site in roughly equal proportions. Could the sequencing run have been disturbed at this cycle? However, the "Per base sequence quality" at base 16 does not differ from that of the neighboring bases.
I have not yet tested other aligners (that is my next step), but I would like to ask whether you have ever, or even frequently, encountered such samples. I would be glad to hear any thoughts on how such a profile could be explained.
My second question: do you know a tool to extract a defined subset of reads (e.g. all reads with a mismatch at position 16) from the SAM file? I know this should be possible based on the CIGAR strings, but I would like to avoid reinventing the wheel if a nice tool is already available.
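To make the second question concrete, here is a minimal Python sketch of the kind of filter meant, not an existing tool. It rests on a few assumptions: the aligner wrote an MD tag for each mapped read (the CIGAR alone only marks mismatches when it uses the =/X operators, so the MD tag is read alongside it), "position 16" means sequencing cycle 16, so reverse-strand reads are flipped, and reads are not hard-clipped. The function names `mismatch_positions` and `reads_with_mismatch_at` are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# MD tag tokens: a run of matches, a deletion (^BASES), or one mismatched base.
MD_RE = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def mismatch_positions(cigar, md, reverse=False):
    """Return 1-based read positions of mismatches, in sequencing-cycle order.

    Assumes no hard clipping (H), since hard-clipped bases are missing
    from SEQ and would shift the cycle numbering.
    """
    # One flag per aligned read base: True where the MD tag reports a mismatch.
    flags = []
    for matches, deletion, base in MD_RE.findall(md):
        if matches:
            flags.extend([False] * int(matches))
        elif base:
            flags.append(True)
        # Deletions consume no read base, so they add no flag.

    positions = []
    read_pos = 0  # 0-based offset into SEQ
    i = 0         # index into flags
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            for _ in range(n):
                if flags[i]:
                    positions.append(read_pos + 1)
                i += 1
                read_pos += 1
        elif op in "IS":
            # Insertions and soft clips consume read bases not covered by MD.
            read_pos += n
        # D, N, H and P consume no bases of SEQ.

    if reverse:
        # SEQ is stored reverse-complemented, so flip to cycle order.
        read_len = read_pos
        positions = sorted(read_len + 1 - p for p in positions)
    return positions


def reads_with_mismatch_at(sam_lines, cycle):
    """Yield SAM records whose read has a mismatch at the given cycle."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue  # unmapped read
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        if md is None:
            continue  # no MD tag, cannot locate mismatches
        if cycle in mismatch_positions(cigar, md, reverse=bool(flag & 0x10)):
            yield line
```

With the SAM file opened as a text handle, looping over `reads_with_mismatch_at(handle, 16)` and writing each yielded line to a new file would give the subset. If the MD tags are missing, they can usually be added beforehand, for example with `samtools calmd`.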
Thanks a lot. I am looking forward to your answers.
Matthias