Originally posted by zeam
Thanks for your inquiry; I will try to explain everything as briefly and fully as possible.
I just had a quick look at the paper you mentioned. Bismark is most similar to their algorithm (1); however, it does not only perform a C->T conversion of the Watson and Crick strands of the reference and the reads, but also a G->A conversion of both the reads and the reference. The method described in their algorithm (1) works fine for what we call directional libraries, i.e. libraries where you only expect to see C->T converted reads originating from either the Watson or the Crick strand, as in Lister et al., 2009 (if you have a library like this, specify --directional when running Bismark). If libraries are non-directional, however, you will see reads from all four possible bisulfite strands (which is the limitation they describe as the G-poor strand), as in Popp et al., 2010. Thus, Bismark also works very well for non-directional libraries, which the approach described in (1) does not.
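To make the four-strand idea concrete, here is a minimal sketch (my own illustration, not Bismark's actual code) of the in-silico conversions: for a non-directional library, the converted read is aligned against all four combinations of converted read and converted reference, whereas a directional library only needs the C->T-converted read.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion as seen on the Watson strand: C -> T."""
    return seq.upper().replace("C", "T")

def g_to_a(seq):
    """Equivalent conversion as seen on the Crick strand: G -> A."""
    return seq.upper().replace("G", "A")

reference = "ACGTCGGA"
read      = "ATGTTGGA"   # bisulfite-converted read (unmethylated Cs appear as T)

# Non-directional libraries are aligned against all four combinations;
# directional libraries only need the C->T-converted read.
conversions = {
    "read_CT_vs_ref_CT": (c_to_t(read), c_to_t(reference)),
    "read_CT_vs_ref_GA": (c_to_t(read), g_to_a(reference)),
    "read_GA_vs_ref_CT": (g_to_a(read), c_to_t(reference)),
    "read_GA_vs_ref_GA": (g_to_a(read), g_to_a(reference)),
}

for name, (r, g) in conversions.items():
    print(name, r, g)
```

The function names and example sequences are made up for illustration only.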
The number of tolerated non-bisulfite mismatches can mainly be adjusted using the parameters -n and -l (and possibly also -e). In general, I recommend keeping the tolerated number of non-BS mismatches to a minimum: the reduced complexity of the reads makes mapping hard enough, and allowing extra non-BS mismatches makes false mappings even more likely.
Bismark converts both the read sequence and the reference sequence into BS-space to avoid mapping bias due to the methylation state of the read. In this way, "BS mismatches" will in fact be perfect matches during the mapping process. For the unique best alignment, methylation is called at every position where there is a C in the reference genome and a C or T in the read (or where there is a G in the genome and a G or A in the read, depending on which strand the sequence mapped to).
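The calling rule for a read mapped to the Watson strand could be sketched like this (a simplified illustration under my own naming, not Bismark's implementation; the Crick-strand case would use G in the reference and G/A in the read instead):

```python
def call_methylation_watson(ref, read):
    """Per-position call for a Watson-strand alignment:
    'M' = reference C retained as C in the read (methylated),
    'U' = reference C read as T (bisulfite-converted, unmethylated),
    '.' = no call (not a reference C, or a genuine non-BS mismatch)."""
    calls = []
    for r_base, q_base in zip(ref.upper(), read.upper()):
        if r_base != "C":
            calls.append(".")   # methylation is only called at reference Cs
        elif q_base == "C":
            calls.append("M")   # C protected from conversion -> methylated
        elif q_base == "T":
            calls.append("U")   # C converted to T -> unmethylated
        else:
            calls.append(".")   # non-BS mismatch: no methylation call
    return "".join(calls)

print(call_methylation_watson("ACGTC", "ACGTT"))  # -> .M..U
```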
In some cases there can be a C in the read sequence and a T in the reference sequence, which should in fact count as a non-BS mismatch but is no longer detectable after the conversion. There are two ways to handle this.
(a) one can look at these positions and decide to discard the reads in question
(b) one can leave these reads in, because removing them would preferentially remove methylated reads (as they contain more Cs; this also applies to the G/A case) and thus introduce a bias. By doing so you would not perform methylation calls at the positions in question and thus would not introduce artificial methylation calls (as they are a T in the reference), but you accept that there might be some degree of mismapped reads (keeping the non-BS mismatch allowance to a minimum helps remove such mismaps)
I personally tried to introduce as few arbitrary filtering steps as possible that might bias the methylation level in either direction, which is why Bismark keeps these reads in. If one wanted to get rid of these sequences, one could run a simple script on the Bismark result files to remove such reads. I haven't checked the prevalence of these read species yet, but I reckon that if the read length is reasonably long and the non-BS mismatch allowance is kept to a reasonable minimum, mismaps shouldn't be a problem at all.
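The kind of simple post-filter mentioned above, in the spirit of option (a), might look like this (a hypothetical sketch of my own, not an official Bismark tool; it assumes you can pull the aligned reference and read sequences out of the result files):

```python
def has_c_over_ref_t(ref, read):
    """True if any aligned position is a T in the reference but a C in the
    read: a non-BS mismatch that the C->T conversion hides during mapping."""
    return any(r == "T" and q == "C"
               for r, q in zip(ref.upper(), read.upper()))

# Keep only alignments without such hidden mismatches.
alignments = [("ACTTG", "ACTTG"), ("ACTTG", "ACCTG")]
kept = [(ref, rd) for ref, rd in alignments if not has_c_over_ref_t(ref, rd)]
print(len(kept))  # the second read has a C over a reference T and is dropped
```

Whether the filtering is worth the methylation-level bias it may introduce is exactly the trade-off between options (a) and (b) above.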
I hope this answered your questions; if I was unclear, you can also send me an email directly at
[email protected].
Best wishes,
Felix