I ran Bismark to align a set of BS-Seq data. Some (not all) of the samples had low mapping efficiency (~20%). I then mapped R1 and R2 separately and found that R1 mapped at >70% while R2 mapped at ~30% (both in undirectional mode). Next I tried bsseeker, which reported a 72.2% mapping rate. Checking the CIGAR strings, I saw that most of the R2 reads had a long soft-clipped stretch at one end (e.g. 91M60S). An example of such a read:
A00437:548:HN5NMDSX3:1:1101:24939:1344 (aligned by bsseeker but not by Bismark; CIGAR: 60M91S; POS: chr1:159204290)
AAGTTTTTTATATATAGATATGTGTATAATGATATATAGTAAATGTATATAGAGTTTAGTGTGAGAGTGGGAGGGTTGGGGTGGTTGTTGAGGTTGTATAATGAAGTTATTTTAGGGAGTTATTGGGTGTTTGTTTAGTTATTTATGGGTT
The last 91 nt were soft-clipped, while the first 60 nt mapped to chr1:159204290-159204349 after converting all Cs to Ts in the reference.
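In case it helps anyone look at the clipped part on its own, here is a minimal Python sketch that splits a read into its soft-clipped and aligned pieces based on the CIGAR string. The function name `split_soft_clips` is made up for illustration and is not part of Bismark or bsseeker. It assumes a standard SAM CIGAR and that the sequence is given as stored in the SAM record.

```python
import re

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the read sequence.
QUERY_OPS = set("MIS=X")


def split_soft_clips(seq, cigar):
    """Return (left_clip, aligned, right_clip) for a read and its CIGAR.

    Hypothetical helper for illustration only.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases are not present in the stored sequence, so drop them.
    ops = [(n, op) for n, op in ops if op != "H"]

    query_len = sum(n for n, op in ops if op in QUERY_OPS)
    if query_len != len(seq):
        raise ValueError("CIGAR covers %d bases but read has %d" % (query_len, len(seq)))

    # Soft clips can only sit at the ends of the alignment.
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0

    end = len(seq) - right
    return seq[:left], seq[left:end], seq[end:]
```

Called with the read above and `"60M91S"`, the third element of the returned tuple would be the clipped tail, which can then be inspected separately from the aligned part.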
I checked the FastQC report for these reads but saw no adaptor contamination or over-represented sequences in R2, so it is unclear what the clipped sequences are and why they occur only in R2. Does anyone have any ideas? Thanks.