Hi all,
I am working with a BS-Seq dataset and I came across this result that puzzles me a bit.
I ran fastqc on the fastq files first and I got an estimated duplication level of 36.83% (fastqc plot attached).
Afterwards, I mapped the data using Bismark; here's the mapping report:
Number of paired-end alignments with a unique best hit: 165375035
Mapping efficiency: 71.3%
Sequences with no alignments under any condition: 52756927
Sequences did not map uniquely: 13328411
The number of sequences that did not map uniquely is less than 10% of the number of mapped sequences (13,328,411 / 165,375,035 ≈ 8.1%); a quick check of the report's numbers is below.
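For what it's worth, the report's own numbers are internally consistent. Here's a quick sanity check in Python (values copied from the report above; I'm assuming Bismark computes efficiency as unique hits over all pairs analysed, which may explain the small rounding difference):

```python
# Counts copied from the Bismark mapping report above
unique     = 165_375_035  # paired-end alignments with a unique best hit
unmapped   =  52_756_927  # no alignments under any condition
non_unique =  13_328_411  # did not map uniquely

total = unique + unmapped + non_unique
print(f"total pairs analysed: {total:,}")                  # 231,460,373
print(f"mapping efficiency:   {unique / total:.1%}")       # 71.4%, close to the reported 71.3%
print(f"non-unique / mapped:  {non_unique / unique:.1%}")  # 8.1%, i.e. under 10%
```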
So I can only think of two possibilities here:
1- Our dataset really contains a high level of polyclonality (in which case we'll have to worry about it and improve the protocol we use to prepare the BS-Seq library). This would imply that >20% of the duplicate reads are not mapped at all, which would explain the difference in duplication levels between fastqc and bismark. Have any bismark users come across something like this before?
2- Could it be that there is something about the way fastqc estimates duplication levels that artificially boosts the number of duplicates in our dataset? I'm not really sure about this, because I've used fastqc in the past and it always seemed to work really well, but I wonder if there is something about bisulfite-converted reads that could cause this behaviour (see the sketch below for what I mean).
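To make possibility 2 concrete: as far as I understand it, fastqc estimates duplication by exact-sequence counting over a tracked subset of reads (its docs describe analysing sequences seen in the first 100,000 reads, truncating long reads to 50 bp, though the exact parameters may differ by version). Since bisulfite conversion turns unmethylated Cs into Ts, two genuinely different fragments can become letter-for-letter identical after conversion, so an exact-match counter scores them as duplicates. A minimal sketch of that idea in Python (this is not fastqc's actual code; estimated_duplication and the toy fragments are made up for illustration):

```python
from collections import Counter

def estimated_duplication(reads, track_limit=100_000, prefix_len=50):
    """Rough FastQC-style estimate: percentage of reads that are exact
    duplicates of another read, counted over the first `track_limit`
    reads truncated to `prefix_len` bp. (FastQC's real module is more
    elaborate; this only captures the exact-match idea.)"""
    counts = Counter(read[:prefix_len] for read in reads[:track_limit])
    total = sum(counts.values())
    return 100.0 * (total - len(counts)) / total

def bisulfite(seq):
    # Fully converted, unmethylated top strand: every C reads as T
    return seq.replace("C", "T")

# Two distinct genomic fragments that differ only at positions where
# frag_a has a C; after conversion they are indistinguishable.
frag_a = "ACGTACGTAC"
frag_b = "ATGTACGTAT"

print(bisulfite(frag_a) == bisulfite(frag_b))                         # True
print(estimated_duplication([bisulfite(frag_a), bisulfite(frag_b)]))  # 50.0
```

The same effect applies genome-wide: the effective three-letter alphabet of converted reads makes chance exact matches more likely, which could plausibly inflate the fastqc estimate relative to what bismark sees after alignment.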
Thanks a lot in advance for your answers!