Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

cnoirot replied

05-14-2013, 05:55 AM
Hi
I'd like to clearly understand the meaning of PE directional read in bismark.

Here is the description of the sequencing kit provider : "First of all our BS-SEQ kit is not directional [...]. Secondly, you are correct in saying that read 1 is for the original strands and the second read is for the neo synthesized DNA."

So do I have to use --non_directional option or not ?

Thanks for your help.
Leave a comment:
oria34 replied

05-14-2013, 03:51 AM
Hi all,

I finally managed to increase the allowed number of filehandles in my systems and Bismark Methyl Extractor ran like a champ
It took something around 12-15h to write down all the five files for a BC whole genome assembled in 18xx linking groups, so, in my opinion, it was really fast (I used --buffer_size 6GB)

An error message apeared at the end of the run but all the files seem to be ok:

Can't move to /XX/XX/XX/XX/Bisulfite/Bismark_Genome/XXXXXXX/: Not a directory at bismark_methylation_extractor line 3300.

Thank you so much for your support

PS. Just for the record; I used Bowtie 2 for the aligment and took no more than one day (~50 millions read pairs), quite fast I would say!

Last edited by oria34; 05-14-2013, 06:06 AM.
Leave a comment:
fkrueger replied

05-14-2013, 12:26 AM
Originally posted by luuloi View Post

Hi Felix,
Can I run Bismark, bowtie1 in multi threads -p option to tune the performance faster? I did it with bowtie2, but as you memtioned bowtie2 seems to be slow than bowtie1 with your experience. I have been waiting it for 4 days with size of .Bam file is 21M, it is so slow. BTW, when you will release multi thread Bismark? I have really looking forward to it. I have 14 WGBS samples for it

Hi there, unfortunately there is no simple way to use the -p option for Bowtie1 in the current implementation of Bismark. This is because Bismark requires the alignments to be reported in the same order as the sequences appear in the input file, and multi-threaded Bowtie1 (-p > 1) does not guarantee this order. In the multi-threaded version of Bismark we are planning to take care of this shortcoming by splitting the input file into several chunks while transcribing the sequence files into bisulfite converted versions, and then essentially run several instances of Bismark at the same time. I can currently not promise however when I will find the time for the implementation though.

We do indeed see a somwhat slower speed of Bowtie2, but what you are describing sounds more like there is something going wrong. I have just had 4 lanes of HiSeq with ~200M sequences each that aligned in parallel in less than 30h. Also, with a nice cluster you should be able to align all your 14 samples with Bowtie1 overnight... maybe you want to contact me via email so we can talk about what is currently taking so long?

Cheers,
Felix
Leave a comment:
luuloi replied

05-13-2013, 10:37 PM
Hi Felix,
Can I run Bismark, bowtie1 in multi threads -p option to tune the performance faster? I did it with bowtie2, but as you memtioned bowtie2 seems to be slow than bowtie1 with your experience. I have been waiting it for 4 days with size of .Bam file is 21M, it is so slow. BTW, when you will release multi thread Bismark? I have really looking forward to it. I have 14 WGBS samples for it
Leave a comment:
fkrueger replied

05-10-2013, 02:02 AM
We have just released a new release of Bismark (v0.7.12) that is intended to fix the single-end alignment mode for Bowtie 2 which was accidentally slowed down by forgetting to remove a sleep() command while debugging... The changes in more detail:

- Bismark: Removed a rogue sleep(1) command that would slow down single-end Bowtie 2 alignments for a single lane of HiSeq (200M sequences) from ~1 day to 6 years and 4 months (roughly)
- bismark2bedGraph: keeps now track of the temp files it just created in a session instead of using all files in the output folder ending in ".methXtractor.temp". This lets you kick off the bedGraph conversion step from already sorted, individual methXtractor.temp files if desired

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/.
Leave a comment:
fkrueger replied

04-22-2013, 03:12 AM
We have just released a new version of Bismark v0.7.11 (so it wasn't quite the last release before the introduction of multi-threading after all...). This version addresses some bugs and splits out the bedGraph and genome-wide cytosine report coversion options from the methylation extractor to the modules bismark2bedGraph and bedGraph2cytosine. These modules replace the former scripts 'genome_methylation_bismark2bedGraph.pl' and 'genome_wide_cytosine_report.pl'. The Bismark methylation does now call these modules, but they can also be run independently as stand alone tools.

Here are all changes in some more detail:

• Bismark: Fixed non-functional single-end alignments with Bowtie2 which were accidentally broken by introducing the option '--pbat' in v0.7.10 (an evil 'if' instead of 'elsif'...)
• For paired-end alignments with Bowtie 1, the option '--non_bs_mm' would accidentally confuse the number of mismatches of read 1 and read 2 whenever the first read aligned in reverse orientation, i.e. for OB and CTOT alignments. This has now been corrected
• Previously, the option '--non_bs_mm' would potentially output non-integer values for Bowtie 2 alignments if the read (or reference) contained 'N' characters. Alignment scores from 'N's are now adjusted so that they count as mismatches similar to what Bowtie 1 does. This works for fine reads with up to and including 5 N's (which is quite a lot...)
• Methylation extractor: To avoid duplication and keep code modular, the bedGraph conversion step invoked by the option '--bedGraph' is now been farmed out to the module 'bismark2bedGraph'. This script is independent of the methylation extractor and also works as a stand-alone tool from the methylation extractor output (compressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
• bismark2bedGraph: Temporary chromosome files now have an input file name included in their file name to enable parallel processing of several files in the same directory at the same time
• To avoid duplication and keep code modular, the bedGraph to genome-wide cytosine methylation report step invoked by the option '--cytosine_report' has now been split out to the module 'bedGraph2cytosine'. This script is independent of the methylation extractor and also works as a stand-alone tool from the Bismark bedGraph '--counts' output (compressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
• Deduplication script: Fixed some warnings that were thrown if '--bam' was not specified

Bismark is available for download from: http://www.bioinformatics.babraham.a...jects/bismark/
Leave a comment:
fkrueger replied

04-17-2013, 01:00 PM
We have just released a new version of Bismark (v0.7.10), which is probably the last release before the implementation of multi-threading for both Bismark and the methylation extractor.

This new release adds several new features, such as BAM or compressed temporary output as well as a dedicated option for aligning PBAT-Seq libraries. Most notably, the methylation extractor sees a 60% speed increase for the processing of ungapped reads (which is the default). Here are all changes in more detail:

• Bismark: Added new option '--gzip' that causes temporary bisulfite conversion files to be written out in a GZIP compressed form to save disk space. This option is available for most alignment modes with the exception of paired-end FastA files
• Added new option '--bam' that causes the output file to be written out in BAM format instead of the default SAM format. Bismark will attempt to use the path to Samtools that was specified with '--samtools_path', or, if it hasn't been specified explicitly, attempt to find Samtools in the PATH. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
• Added new option '--samtools_path' to point Bismark to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified explicitly if Samtools is in the PATH
• Added new option '--pbat' which is to be used for PBAT-Seq libraries (Post-Bisulfite Adapter Tagging; Kobayashi et al., PLoS Genetics, 2012). This is essentially the exact opposite of alignments in 'directional' mode, as it will only launch two alignment threads to the CTOT and CTOB strands instead of the normal OT and OB ones. The option '--pbat' works currently only for single-end and paired-end FastQ files for use with Bowtie1 and uncompressed temporary files only (there are no plans to extend this to other alignment modes at present)
• Methylation extractor: The methylation extractor does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the methylation extractor to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
• Added new option '--gzip' to write out the primary methylation extractor files (CpG_OT_..., CpG_OB_... etc) in a GZIP compressed form to save disk space. This option does not work on bedGraph and genome-wide cytosine reports as they are 'tiny' anyway
• The methylation extractor does now treat InDel free reads differently than before which leads to a ~60% increase in extraction speed for ungapped alignments in SAM format!
• Deduplication script: The deduplication script does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the script to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
• The deduplication script also received the new option '--bam' to write out deduplicated files directly in BAM format. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
• The Bismark User Guide and RRBS Guide have been updated

Bismark is available here: http://www.bioinformatics.babraham.a...jects/bismark/.
Leave a comment:
fkrueger replied

04-12-2013, 02:23 AM
Hi Oria,

The user last week was able to get the Linux file limit increased by his IT department; once this was done the bedGraph and cytosine report seems to have worked fine.
If you don't have the option to increase this limit, concentrating on a smaller number (a couple of hundred?) of contigs would probably be the easiest option. You would indeed have to process the SAM file and only write out the lines you are interested in, and then run the methylation extractor again on those. By the way I am sorry for the slow speed of the methylation extractor right now, I am in the process of speeding it up (will probably take a while though).

PS: Your library must be directional since you got a fairly high mapping efficiency with only OT and OB alignments (most libraries are directional anyway)
Leave a comment:
oria34 replied

04-12-2013, 02:09 AM
Thank you for the fast reply!

Yes, my genome has a lot of unassembled contigs so that is the problem.

I will extract particular chromosomes of interest from the output SAM file and work only with that subset of data (so I need to run again the methylation extractor, right?) .

I don't think that increasing the number of the aloowed filehandles is even an option since I am just a "remote user" (not sure about the term) at a supercomputing center...

This was really helpful, many thanks

Last edited by oria34; 07-04-2013, 10:18 AM.
Leave a comment:
simonandrews replied

04-11-2013, 11:48 PM
Originally posted by oria34 View Post

Failed to open filehandle: Too many open files at bismark_methylation_extractor line 2921

We actually had a separate report for this last week. The problem is that when it's extracting data for the bedGraph files bismark keeps a set of files open, one for each 'chromosome', so that it can write out all of the data in parallel.

If however you're working on a genome which isn't assembled into chromosomes but instead has a large number of assembly contigs then bismark will try to open a file for each contig. On all operating systems there is a limit to the number of files which can simultaneously be open for writing, and if the number of contigs is larger than the number of allowed filehandles then the script will fail. On the linux systems we checked here the limit is 1024 files, so if you had more contigs than this then this would trigger the problem.

The quick and ugly fix is to increase the number of allowed filehandles on your system. Another option would be to remove very short contigs from your assembly, as these usually make up the majority of total contigs, but contribute very little uniquely mappable sequence.

We have thought about whether we could easily fix this within bismark but the compromises in terms of efficiency to dynamically close and reopen filehandles as required are quite nasty and this probably isn't something we're going to implement in the near future.
Leave a comment:
oria34 replied

04-11-2013, 11:14 PM
HI all,

This is my first time using Bismark (actually, my first time in this NextGen stuff) and I really think is a very complete software that makes things pretty easy to the newbies. So, many thanks for you work Felix

I have some Fish BS converted DNA set to compare methylation state at some genes of interest so the BedGraph option is a really useful tools to my purposes. However I am having some problems when the script tries to write down the file.

My data is: paired-en, directional (I guess, not sure), It was trimmed and the encoding is Illumina 1.5 (Phred 64???). Roughly 50 million reads.

Warning: I am really new with all this stuff (coming from classical Pop. Genetics) so please people, be gentle with bioinformatic terms

Firstly I tried to run the scripts in two different files (in two different remote servers) but apparently the temporal files that the script creates have the same name so the program failed. Thas was pretty easy to fix by running the scripts in two different folders, one for each data set so the temporal files with the same name are now in different folder.

But...I will crash again but for different reasons:

Processed 32126386 lines from BPF1_methyl_1.fq_bismark_bt2_pe.sam in total
Total number of methylation call strings processed: 64252772

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 1297451318

Total methylated C's in CpG context: 121971031
Total methylated C's in CHG context: 869279
Total methylated C's in CHH context: 2305201

Total C to T conversions in CpG context: 71035093
Total C to T conversions in CHG context: 301484428
Total C to T conversions in CHH context: 799786286

C methylated in CpG context: 63.2%
C methylated in CHG context: 0.3%
C methylated in CHH context: 0.3%

================================================================
Methylation information will now be written into a bedGraph file
================================================================

Using the following files as Input:
CpG_context_BPF1_methyl_1.fq_bismark_bt2_pe.txt

Now writing methylation information for file CpG_context_BPF1_methyl_1.fq_bismark_bt2_pe.txt to individual files for each chromosome
Failed to open filehandle: Too many open files at bismark_methylation_extractor line 2921, <IN> line 65658042.
32584.414u 1126.687s 9:56:48.66 94.1% 0+0k 0+164137336io 0pf+0w

Any ideas? I will try the separate script from the Bismark site in the meanwhile.

For the record: Alignment took something like 20h and methylation_extractor around 12h before crash. I used Botwie 2 - I also naively assumed that was better to use a newer version.

Here is the aligment report:

Bowtie was run against the bisulfite genome of /fs/lustre/wrk/user/Bisulfite/Bismark_Runs/ with the specified options: -q --phred64 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500

Option '--directional' specified: alignments to complementary strands will be ignored (i.e. not performed)
Final Alignment report
======================
Sequence pairs analysed in total: 62785587
Number of paired-end alignments with a unique best hit: 40067266
Mapping efficiency: 63.8%
Sequence pairs with no alignments under any condition: 18464624
Sequence pairs did not map uniquely: 4253697
Sequence pairs which were discarded because genomic sequence could not be extracted: 281

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 20027636 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 20039349 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 1631659739

Total methylated C's in CpG context: 131857571
Total methylated C's in CHG context: 1142574
Total methylated C's in CHH context: 3268321

Total C to T conversions in CpG context: 80295548
Total C to T conversions in CHG context: 361837285
Total C to T conversions in CHH context: 1053258440

C methylated in CpG context: 62.2%
C methylated in CHG context: 0.3%
C methylated in CHH context: 0.3%
Leave a comment:
ruggedtextile replied

03-08-2013, 05:36 AM
Originally posted by fkrueger View Post

This speed seems very slow indeed, I'll give a few ideas (in no particular order) what I could think of:

- has the data been trimmed to remove adapter, poor quality and the RRBS bias? I would say a typical directional RRBS run with trimmed reads with Bowtie1 should align ~15-20 million reads/h, for non-directional mode probably a bit less. I don't know any numbers for non-directional Bowtie2 RRBS though.

- The use of Bowtie 2 in Bismark is in my experience indeed quite a bit slower than Bowtie 1, especially for shorter reads. How long is your read length, and do you really need to run alignments with Bowtie2?

- Is the data really non-directional? There are not many protocols that generate this type of data, and this can indeed slow down the alignments noticably. That is because if the data is not non-directional then the 2 threads running against the complementary strands are trying very hard but can't find many alignments. This might also explain why 2 threads are working very hard, while the other 2 are already idling as they are waiting for the other threads to catch up.

- is the data really encoded with the old Illumina qualities (phred64). Not sure if Bowtie2 would complain about it but the wrong quality encoding scheme could make alignments much slower.

If you could send me a FastQC report (the zip file) of your FastQ files via email I could take a quick look and advise further.

Thanks for the fast reply Felix. You've given me a few things to try straightaway. In answer to your questions:

1. Yes the data was trimmed, but by the sequencing center not by me, so I don't know the precise details. I will circle back and check with them, but the data seems to align fine with BSMap so I suspect this is not the problem.

2. These are 50bp PE reads. I have no reason to use bowtie2 over bowtie1. I had naively assumed that 2 was always better than 1, but clearly not in this case. I will rerun with bowtie1 and see if that resolves anything.

3. Hmmm. I will recheck about the directionality.

4. Again, I will recheck with the sequencing center. FastQC reports the encoding as 'Illumina 1.5', which I thought was Phred64.

I'll email you the FastQC report now. It all looked fine to me, but this is my first time with any form of BS data.
Leave a comment:
fkrueger replied

03-08-2013, 05:03 AM
This speed seems very slow indeed, I'll give a few ideas (in no particular order) what I could think of:

- has the data been trimmed to remove adapter, poor quality and the RRBS bias? I would say a typical directional RRBS run with trimmed reads with Bowtie1 should align ~15-20 million reads/h, for non-directional mode probably a bit less. I don't know any numbers for non-directional Bowtie2 RRBS though.

- The use of Bowtie 2 in Bismark is in my experience indeed quite a bit slower than Bowtie 1, especially for shorter reads. How long is your read length, and do you really need to run alignments with Bowtie2?

- Is the data really non-directional? There are not many protocols that generate this type of data, and this can indeed slow down the alignments noticably. That is because if the data is not non-directional then the 2 threads running against the complementary strands are trying very hard but can't find many alignments. This might also explain why 2 threads are working very hard, while the other 2 are already idling as they are waiting for the other threads to catch up.

- is the data really encoded with the old Illumina qualities (phred64). Not sure if Bowtie2 would complain about it but the wrong quality encoding scheme could make alignments much slower.

If you could send me a FastQC report (the zip file) of your FastQ files via email I could take a quick look and advise further.
Leave a comment:
ruggedtextile replied

03-08-2013, 04:33 AM
I'm running Bismark on some PE RRBS (non-directional) data that we've just recieved. This our first time looking at RRBS data and running Bismark and I'm surprised at the speed.

Running with bowtie2 specified and default parameters otherwise:

Code:

bismark --bowtie2 --non_directional --phred64-quals /mnt/nas_omics/shared_data/iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta -1 4_lib1_lane1_1.fq.gz -2 4_lib2_lane1_2.fq.gz

And it's taken 3 hours to process 5M sequences. I.e this point in the output has been reached:

Code:

... Processed 4600000 sequences so far Processed 4700000 sequences so far Processed 4800000 sequences so far Processed 4900000 sequences so far Processed 5000000 sequences so far Processed 5100000 sequences so far Processed 5200000 sequences so far ...

This machine has 96GB RAM and 8 CPU cores, so I feel it should be faster than this? Strangely when I run top I see 4 bowtie2-align processes (as expected), but with two of them on 100% and the other two on ~20%. Is it normal for two of the alignment processes to be throttled in this way or is this indicative of a bottleneck somewhere else (IO maybe?).

Other tools like BSMap were orders of magnitude faster than this, so I feel I must be missing something or have something misconfigured somewhere. Any advice welcome.
Leave a comment:
fkrueger replied

03-05-2013, 03:36 AM
We have just released a new version of Bismark (v0.7.9) which fixes the following three issues for the methylation extractor:

- The new function '--buffer_size <string>' for the bedGraph sort command is set to the new default value of 2G (sort would die if this option was not set)
- The replacement of pipe ('|') characters in the name of reference chromosomes for the bedGraph sorting step should now work as intended
- If multiple files are to be processed for genome-wide methylation reports in a single command, the reference genome is only read in once (instead of stopping due to the presence of several chromosomes with the same name...)

Bismark is available here.

Last edited by fkrueger; 03-05-2013, 03:37 AM. Reason: typo
Leave a comment:

Previous 1 20 27 28 29 30 31 32 33 34 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 21 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 21 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News