We have tested a number of the publically available bisulfite mapping programs, however we found that most of these software packages suffered from being slow, poorly documented or not very flexible regarding alignment parameters; in addition, the tested programs only performed mapping to a reference genome but none of them generated an output that can be readily used by wet lab scientists to look at actual methylation levels in their experiments (apologies but we did not yet look at the new tool MethylCoder which was posted here on SEQanswers only last week).
We have therefore developed a new tool, called Bismark, for the time-efficient analysis of BS-seq data. Bismark is a software package written in Perl that is based on the short read aligner Bowtie. All associated files can be downloaded at:
How does it work?
Bismark supports both single-end and paired-end read mapping in either fastA or fastQ format produced by e.g. the Illumina Genome Analyser IIx, while retaining much of the flexibility of Bowtie (adjust the seed length, number of mismatches, --best...). Sequence reads are first transformed into fully bisulfite-converted forward (C > T) and reverse reads (G > A conversion of the forward strand),
before aligning them to similarly converted versions of the genome (also C>T and G>A). Sequence reads that produce a unique best alignment from the four alignment processes against the bisulfite genome (which are running in parallel), are then compared to the normal genomic sequence and the methylation state of all cytosine positions in the read is determined.
What does the output look like?
In the first instance, Bismark produces a comprehensive output file which can either be imported into a genome viewer (e.g. SeqMonk) or be further mined by downstream scripts. The output includes (amongst others) the following information:
(1) Seq ID
(2) strand
(2) chromosome
(3) start
(4) end
(5) observed bisulfite sequence
(6) equivalent genomic sequence
(7) methylation call string
(8) read conversion
(9) genome conversion
More details about the output format can be found at: http://seqanswers.com/wiki/Custom_Bismark_output_format.
In the current version, Bismark discriminates between cytosines in CpG context or in any other context.
Bismark comes with a supplementary methylation extractor script that operates on Bismark result files and extracts the methylation call for every single C analysed. It comes with a few options, such as ignoring the first x number of positions in the methylation call string, e.g. to remove a restriction enzyme site. The position of every single C will be written out to a new output file, depending on context (CpG or any other), whereby methylated Cs will be labelled as forward reads, non-methylated Cs as reverse reads. These positions can then be imported into a genome viewer (e.g. SeqMonk) and data analysis can commence. This output is regularly analysed by the researchers at our institute themselves!
Because the output files can become very large (an Illumina lane with 35 million reads can potentially contain a LOT of Cs!), one can either specify to get a large, comprehensive output or obtain strand-specific outputs (every read can potentially originate from each of the four strands generated by a bisulfite PCR. Please be aware that it does not make any sense to look at strand-specific methylation if the bisulfite experiment did not conserve strand-specificity in the first place. In these cases, the individual files can and should later on in SeqMonk be merged into a data group again).
Some stats about Bismark
So far we have successfully used Bismark for bisulfite mapping against various genomes (mouse NCBIM37, human NCBI36 and GRCh37, or Yeast SGD1.01) and confirmed its functionality by re-analysing a number of published results. Depending on the type of experiment we typically see alignment rates of 55-65%.
Bismark holds the reference genome in memory, in addition the 4 parallel instances of bowtie need some memory as well. Therefore I estimate the memory usage to be around 10GB (we can run 2 instances of Bismark on a 2 year old quad-core, 24GB RAM Linux server at the same time). Alignment speed depends largely the read length and on the specified bowtie parameters; allowing many mismatches and specifying --best is rather slow, whereas only looking for perfect matches can easily reach 5-10 million sequences per hour.
We have initially designed Bismark to support the kinds of analyses we require, thus if you have some ideas or suggestions which should be implemented please let us know.
If you have any comments about Bismark we would like to hear them. You can either enter them in our bug tracking system at: http://www.bioinformatics.bbsrc.ac.uk/bugzilla/ or send me a message directly.
Bismark at ISMB
I will also be presenting Bismark with a poster at ISMB in Boston soon (July 11-13) where I am happy to answer any questions.
We have therefore developed a new tool, called Bismark, for the time-efficient analysis of BS-seq data. Bismark is a software package written in Perl that is based on the short read aligner Bowtie. All associated files can be downloaded at:
How does it work?
Bismark supports both single-end and paired-end read mapping in either fastA or fastQ format produced by e.g. the Illumina Genome Analyser IIx, while retaining much of the flexibility of Bowtie (adjust the seed length, number of mismatches, --best...). Sequence reads are first transformed into fully bisulfite-converted forward (C > T) and reverse reads (G > A conversion of the forward strand),
before aligning them to similarly converted versions of the genome (also C>T and G>A). Sequence reads that produce a unique best alignment from the four alignment processes against the bisulfite genome (which are running in parallel), are then compared to the normal genomic sequence and the methylation state of all cytosine positions in the read is determined.
What does the output look like?
In the first instance, Bismark produces a comprehensive output file which can either be imported into a genome viewer (e.g. SeqMonk) or be further mined by downstream scripts. The output includes (amongst others) the following information:
(1) Seq ID
(2) strand
(2) chromosome
(3) start
(4) end
(5) observed bisulfite sequence
(6) equivalent genomic sequence
(7) methylation call string
(8) read conversion
(9) genome conversion
More details about the output format can be found at: http://seqanswers.com/wiki/Custom_Bismark_output_format.
In the current version, Bismark discriminates between cytosines in CpG context or in any other context.
Bismark comes with a supplementary methylation extractor script that operates on Bismark result files and extracts the methylation call for every single C analysed. It comes with a few options, such as ignoring the first x number of positions in the methylation call string, e.g. to remove a restriction enzyme site. The position of every single C will be written out to a new output file, depending on context (CpG or any other), whereby methylated Cs will be labelled as forward reads, non-methylated Cs as reverse reads. These positions can then be imported into a genome viewer (e.g. SeqMonk) and data analysis can commence. This output is regularly analysed by the researchers at our institute themselves!
Because the output files can become very large (an Illumina lane with 35 million reads can potentially contain a LOT of Cs!), one can either specify to get a large, comprehensive output or obtain strand-specific outputs (every read can potentially originate from each of the four strands generated by a bisulfite PCR. Please be aware that it does not make any sense to look at strand-specific methylation if the bisulfite experiment did not conserve strand-specificity in the first place. In these cases, the individual files can and should later on in SeqMonk be merged into a data group again).
Some stats about Bismark
So far we have successfully used Bismark for bisulfite mapping against various genomes (mouse NCBIM37, human NCBI36 and GRCh37, or Yeast SGD1.01) and confirmed its functionality by re-analysing a number of published results. Depending on the type of experiment we typically see alignment rates of 55-65%.
Bismark holds the reference genome in memory, in addition the 4 parallel instances of bowtie need some memory as well. Therefore I estimate the memory usage to be around 10GB (we can run 2 instances of Bismark on a 2 year old quad-core, 24GB RAM Linux server at the same time). Alignment speed depends largely the read length and on the specified bowtie parameters; allowing many mismatches and specifying --best is rather slow, whereas only looking for perfect matches can easily reach 5-10 million sequences per hour.
We have initially designed Bismark to support the kinds of analyses we require, thus if you have some ideas or suggestions which should be implemented please let us know.
If you have any comments about Bismark we would like to hear them. You can either enter them in our bug tracking system at: http://www.bioinformatics.bbsrc.ac.uk/bugzilla/ or send me a message directly.
Bismark at ISMB
I will also be presenting Bismark with a poster at ISMB in Boston soon (July 11-13) where I am happy to answer any questions.
Comment