Announcement

Collapse
No announcement yet.

Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

    We have tested a number of the publically available bisulfite mapping programs, however we found that most of these software packages suffered from being slow, poorly documented or not very flexible regarding alignment parameters; in addition, the tested programs only performed mapping to a reference genome but none of them generated an output that can be readily used by wet lab scientists to look at actual methylation levels in their experiments (apologies but we did not yet look at the new tool MethylCoder which was posted here on SEQanswers only last week).

    We have therefore developed a new tool, called Bismark, for the time-efficient analysis of BS-seq data. Bismark is a software package written in Perl that is based on the short read aligner Bowtie. All associated files can be downloaded at:

    http://www.bioinformatics.bbsrc.ac.uk/projects/


    How does it work?

    Bismark supports both single-end and paired-end read mapping in either fastA or fastQ format produced by e.g. the Illumina Genome Analyser IIx, while retaining much of the flexibility of Bowtie (adjust the seed length, number of mismatches, --best...). Sequence reads are first transformed into fully bisulfite-converted forward (C > T) and reverse reads (G > A conversion of the forward strand),
    before aligning them to similarly converted versions of the genome (also C>T and G>A). Sequence reads that produce a unique best alignment from the four alignment processes against the bisulfite genome (which are running in parallel), are then compared to the normal genomic sequence and the methylation state of all cytosine positions in the read is determined.


    What does the output look like?

    In the first instance, Bismark produces a comprehensive output file which can either be imported into a genome viewer (e.g. SeqMonk) or be further mined by downstream scripts. The output includes (amongst others) the following information:

    (1) Seq ID
    (2) strand
    (2) chromosome
    (3) start
    (4) end
    (5) observed bisulfite sequence
    (6) equivalent genomic sequence
    (7) methylation call string
    (8) read conversion
    (9) genome conversion

    More details about the output format can be found at: http://seqanswers.com/wiki/Custom_Bismark_output_format.


    In the current version, Bismark discriminates between cytosines in CpG context or in any other context.

    Bismark comes with a supplementary methylation extractor script that operates on Bismark result files and extracts the methylation call for every single C analysed. It comes with a few options, such as ignoring the first x number of positions in the methylation call string, e.g. to remove a restriction enzyme site. The position of every single C will be written out to a new output file, depending on context (CpG or any other), whereby methylated Cs will be labelled as forward reads, non-methylated Cs as reverse reads. These positions can then be imported into a genome viewer (e.g. SeqMonk) and data analysis can commence. This output is regularly analysed by the researchers at our institute themselves!

    Because the output files can become very large (an Illumina lane with 35 million reads can potentially contain a LOT of Cs!), one can either specify to get a large, comprehensive output or obtain strand-specific outputs (every read can potentially originate from each of the four strands generated by a bisulfite PCR. Please be aware that it does not make any sense to look at strand-specific methylation if the bisulfite experiment did not conserve strand-specificity in the first place. In these cases, the individual files can and should later on in SeqMonk be merged into a data group again).


    Some stats about Bismark

    So far we have successfully used Bismark for bisulfite mapping against various genomes (mouse NCBIM37, human NCBI36 and GRCh37, or Yeast SGD1.01) and confirmed its functionality by re-analysing a number of published results. Depending on the type of experiment we typically see alignment rates of 55-65%.

    Bismark holds the reference genome in memory, in addition the 4 parallel instances of bowtie need some memory as well. Therefore I estimate the memory usage to be around 10GB (we can run 2 instances of Bismark on a 2 year old quad-core, 24GB RAM Linux server at the same time). Alignment speed depends largely the read length and on the specified bowtie parameters; allowing many mismatches and specifying --best is rather slow, whereas only looking for perfect matches can easily reach 5-10 million sequences per hour.

    We have initially designed Bismark to support the kinds of analyses we require, thus if you have some ideas or suggestions which should be implemented please let us know.

    If you have any comments about Bismark we would like to hear them. You can either enter them in our bug tracking system at: http://www.bioinformatics.bbsrc.ac.uk/bugzilla/ or send me a message directly.


    Bismark at ISMB

    I will also be presenting Bismark with a poster at ISMB in Boston soon (July 11-13) where I am happy to answer any questions.
    Last edited by fkrueger; 05-12-2014, 12:27 PM. Reason: Changed the title - not that new a tool anymore ...

  • #2
    Thanks to some feedback which I got so far I fixed some of the initial release bugs.

    I also changed the output format for both single-end and paired-end alignments slightly so that it is more concise and smaller. More details can be found at http://seqanswers.com/wiki/Custom_Bismark_output_format.


    I'd like to mention that in order to get the most reliable output I recommend specifying the "--best" option for bowtie alignments. This will however take more time to complete. (For most applications we used Bismark so far we found the difference of specifying "--best" to be only marginal (less than 1% difference in mapped reads), therefore running it with the default parameters means a 2-3 x faster alignment altogether).


    I am still happy for any kind of feedback!

    Comment


    • #3
      Is there support for SAM/BAM alignment format in Bismark. If so you could just plug and play with other bisulfite aligners.

      Comment


      • #4
        Bismark is not a mere bisulfite alignment application, but it is also a methylation caller at the same time. To perform the methylation calls correctly it needs to know the conversion state of both the bisulfite read and the genome, as this will determine whether we have to look for C->T or G->A substitutions and whether we need to look up- or downstream to determine if the C was in CpG context or not.

        To our knowledge most methylation mappers do only align reads in a C->T converted form and do not perform all four possible alignments in parallel (CT converted reads to CT converted genome, GA converted reads to CT converted genome, CT converted reads to GA converted genome and GA converted reads to GA converted genome), making this 4-alignment output unique to Bismark. As the SAM/BAM output of other mapping programs can't supply this information I don't see an easy way to mix in the output of other aligners and perform the methylation calls in Bismark.

        Conversely, even though it might be possible to modify the code to get the four instances of Bowtie to report SAM output instead (or convert it to SAM/BAM format later on), this would possibly not make any difference to the downstream methylation call as it is not based on the alignment format but on read/genome conversion state. The Bismark output format does now contain only essential information, and as far as we are aware SAM format doesn't offer to store methylation information as such.

        We experienced that wet lab scientists want to know the methylation state of Cytosines at base pair-resolution rather than getting read alignments which do not necessarily tell them much about the underlying methylation levels. Thus, the final format which will ultimately be handed out to the researchers themselves just contains information about the chromosome/position of an individual C, and whether it was methylated or not.

        In summary , Bismark uses a unique logic for its alignments and methylation calling procedure which won't allow a lot of plug and play with other aligner outputs.

        Comment


        • #5
          Although I agree that methylation states can't be utilized in SAM/BAM, we (in our insubstantial experience) have found that people wish to visualize both the mappings and methylation status on a system such as the UCSC genome browser. While I'm aware of your program (SeqMonk) and fully agree with your approach in providing visualization through that means, it might be worthwhile to provide the mapping results obtained by Bismark as a SAM/BAM format through a simple conversion script to generate the necessary information (including the CIGAR string showing conversion relative to genome), thus giving an option for people to visualize their results on a system of their choice (which can handle BAM, for example).

          On an unrelated note:
          How do you present methylation status to the wet-lab biologists? One of the ways we analyze our data is to calculate bisulfite conversion rate by identifying C --> T conversion in a non-CpG context to get an idea of the quality of bisulfite conversion. We also like to give methylation "levels" per CpG based on the genomic position, and I'm wondering if you have a good way of converting your output into these variables, or if it's completely embedded into SeqMonk. Our ultimate goal is to show methylation as a BEDgraph-like track over the full genome.

          Thanks for a great program. Looking forward to seeing the results.

          Comment


          • #6
            Hi Oliver,

            thanks for your comments. I am aware that many people are using the UCSC genome browser, therefore I am going look into the option of generating a more universal output in SAM/BAM format.

            After an analysis is completed, Bismark will give a methylation call summary which will provides a very general overview of the methylation state in either CpG context or non-CpG context which might look like this:

            Final Cytosine Methylation Report
            =================================
            Total number of C's analysed: 170715821
            Total methylated C's in non-CpG context: 10243954
            Total methylated C's in CpG context: 8576784
            Total C to T conversions in non-CpG context: 116666488
            Total C to T conversions in CpG context: 35228595

            C methylated but not in CpG context: 8.1%
            C methylated in CpG context: 19.6%

            This can be used to estimate the bisulfite conversion rate.

            But as you guessed correctly, SeqMonk provides us with all the data quantitation and calculation (we) can currently think of. This includes filtering for a certain amount of reads per C, calculating methylation levels e.g. as % methylated, trends over certain features (e.g. CpG islands) etc.

            I'll probably put up a quick example file of a methylation analysis to illustrate some of these aspects next week. Our goal is that the wet-lab scientists can - after the initial Bismark analysis - toy around with their data in SeqMonk as much as they like, which takes off work from our shoulders and gives them the feeling that they are analysing their data themselves rather than just handing it over to the Bioinformatics department.
            Last edited by fkrueger; 06-20-2010, 11:08 AM.

            Comment


            • #7
              I have made a few screenshots of different ways to analyse and present methylation data. We basically use SeqMonk for most of our graphical as well as quantitative tasks as it offers a large repertoire of useful tools to handle this kind of data.

              All graphical results can be exported into annotated report files which can be used for plotting purposes or used in further analyses. Just have a look at the attached example file. (The data shown was taken from Meissner et al, Nature, 2008, PMID: 18600261; GSE11034)
              Attached Files
              Last edited by fkrueger; 06-21-2010, 12:56 AM. Reason: added reference to the data used.

              Comment


              • #8
                Hi,

                I have attached a beta version of a perl script that can convert Bismark single-end mapping to SAM format.

                Usage:
                bismark_to_SAM.pl -c [chrom sizes file] -i [bismark mapping output] -o [SAM output]

                -c [chrom sizes file] - file containing length of chromosomes/sequences used for Bowtie mapping
                -i [bismark mapping output] - file containing bismark mapping output
                -o [SAM output] - name for output file in SAM format (default: [input].sam)

                The chrom sizes file should be in the format of:

                <chr> TAB <length> TAB <2bit file>

                This file is typically obtained from UCSC genome browser download if you're using a model organism genome. However, as long as you have the file with the chromosome/sequence name in column 1 and length in column 2, the program should work. Please ensure the name of the chromosome in the chrom sizes file is the same as the name that bismark outputs (i.e. The name of the sequences you mapped against).

                Please let me know if there's an issue.

                Cheers,
                Oliver
                Attached Files
                Last edited by olivertam; 06-24-2010, 09:24 AM. Reason: Updated perl script to generate a SAM file sorted by chromosome and start co-ordinate (allows indexing)

                Comment


                • #9
                  Thanks to feedback we got both via email and at ISMB in Boston we worked on some bugfixes and suggestions for Bismark, which are now available for download from http://www.bioinformatics.bbsrc.ac.uk/projects/.

                  The following features received some attention:


                  Bismark Genome Preparation

                  - If the specified genome directory does already contain a bisulfite genome folder, all contents of this directory will be removed before creating and indexing a new bisulfite genome

                  - The genome indexer will now convert DNA ambiguity code into N's before making the bisulfite genomes (anything other than C, A, T or G will appear as N afterwards)

                  - The indexer will now also handle fastA files with mutltiple sequence entries in addition to (a list of) fastA files in the specified genome folder


                  Methylation Extractor

                  - Fixed a bug whereby the single-end strand-specific output got two of the four possible strands mixed up. Also, the --ignore <int> option did previously offset some of the positions of the methylation calls by the <int> specified. Both features should now work as intended

                  - For paired-end alignments with rather short fragment length it is theoretically possible to read stretches of overlapping sequence with both read 1 and read 2. In order not to score the methylation calls for overlapping sequence twice, we added an option (--no_overlap) to score overlapping methylation calls only from the first read of a given alignment


                  Further comments and suggestions are most welcome!

                  Comment


                  • #10
                    I'm interested to know if anyone looked further into getting data from bismark into different genome browsers. I have used bismark with paired end reads and I need to visualise the data in GBrowse. Usually I do this using a BAM file. The bismark2sam perl script attached above in this thread was designed for single end reads and I also had a problem when using it if I treat my data as single end because it didn't like the non ACGTN characters in my reference genome.

                    As well as needing to get the mapping results into GBrowse I also need to supply the methylated sequence of regions found to be differentially methylated. For non BS data I generate a pileup file and call a consensus but that relies on having a SAM file. How are other people handling this is give the lab people the BS converted sequence for subsequent PCR confirmation etc?

                    I also really like the suggestion made previously to have a bedgraph (or I would prefer BigWig) file for showing % conversion/methylation across chromosomes.

                    Comment


                    • #11
                      Hi Natstreet,

                      I have included an updated version of the script that will handle any degenerate nucleotide in the reference. I'm still assuming the input (from Illumina output) is still A, C, T, G or N.
                      As for making the tool to handle paired end Bismark output, it's yet to be done. I'm afraid that I don't have much experience with paired-ends output, but if you have some Bismark paired-ends output that you don't mind using as a test dataset, I'd be happy to try and make it work.

                      Please let me know if there are problems with the new script

                      Cheers,
                      Oliver
                      Attached Files

                      Comment


                      • #12
                        To follow up on the BEDGraph/BigWig idea, we have developed a workaround for this. Again, this is based on single-end analysis, so I haven't tested paired-ends

                        1) Use methylation-extractor on the Bismark output, with '--comprehensive' option ('--merge_non_CpG' is optional)
                        2) Run the following script (genome_methylation_bismark2bedGraph.pl). This script sorts the methylation extractor output, then parses the results to generate an "overall methylation level" as a BEDGraph file, with one sampled cytosine site per line.
                        3) Use the bedGraphToBigWig program (available online, I believe) to convert the BEDGraph to BigWig.

                        Here's the usage for the genome_methylation_bismark2bedGraph.pl

                        Usage: genome_methylation_count.pl (--cutoff [threshold] ) [Bismark methylation caller output] > [output]

                        --cutoff [threshold] - The minimum number of times a methylation state was
                        seen for that nucleotide before its methylation
                        percentage is reported.
                        Default is no threshold

                        The output file is a tab-delimited BedGraph file with the following information:

                        <Chromosome> <Start Position> <End Position> <Methylation Percentage>

                        Bismark methylation caller (v0.2.0 or later) should produce three output files
                        (CpG, CHG and CHH) when using the "--comprehensive" option
                        (Two files if using the "--merge_non_CpG" option).
                        To count both CpG and Non-CpG, combine the output files.

                        Bismark methylation caller (v0.1.5 or earlier) should produce two output files
                        (CpG and Non-CpG) when using the "--comprehensive" option.
                        To count both CpG and Non-CpG, combine the two output files.


                        Let me know if you have any questions, issues or bugs (since this is a workaround)

                        Cheers,
                        Oliver
                        Attached Files

                        Comment


                        • #13
                          Thanks for the replies, it's very much appreciated (and shows the power of this forum!). I'll test both scripts today. I can re-map the data as single ends anyway because I make no use of the fact that it's paired - this was just the easiest way to increase the sequencing output.

                          [QUOTE=olivertam;26595I'm afraid that I don't have much experience with paired-ends output, but if you have some Bismark paired-ends output that you don't mind using as a test dataset, I'd be happy to try and make it work.[/QUOTE]

                          I've put an example bismark paired end output file here. If you have a chance to take a look it would be great.

                          Comment


                          • #14
                            What organism is this?
                            If possible, could you provide all the chromosome names that you used for mapping, plus their length?

                            Thanks heaps.

                            Cheers,
                            Oliver

                            Comment


                            • #15
                              Sorry - I should have realised some basic info would help!

                              The data is form Arabidopsis thaliana. I've just added a file called length.tab to the same ftp location that has then chrom lengths.

                              I've tried the bismark2sam and genome_methylation2bed scripts and both seem to be working perfectly. I used bedGraph2BigWig and the track looks great in GBrowse.

                              Do you also have any advice about the best way to extract the consensus BS-treated sequence from the bismark results files? I was thinking to use samtools after making a pileup file but I haven't actually tried it yet.

                              Again, all your help is much appreciated.

                              Comment

                              Working...
                              X