Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

fkrueger replied

03-01-2013, 06:00 AM
We have just released a new version of Bismark (v0.7.8) which adds a few more functions to Bismark and the methylation extractor. Here are the main changes:

• Bismark: Added new option '--non_bs_mm' which prints an extra column at the end of SAM files showing the number of non-nisulfite mismatches of a read. This option is not available in '--vanilla' format.
Format for single-end reads: "XA:Z:mismatches".
Format for paired-end reads: read 1: "XA:Z:mismatches", read 2: "XB:Z:mismatches"
• Bismark: The mapping report file names were changed to _bismark_(SE/PE)_report.txt (Bowtie 1) or _bt2_bismark_(SE/PE)_report.txt (Bowtie 2) to keep it more uniform
• Methylation extractor: The input file(s) may now be specified with a file path which abrogates the need to be in the same directory as the input file(s) when calling the methylation extractor
• Methylation extractor: Added new function '--buffer_size' to increase the physical memory used for the sorting the output by chromosomal positions (only needed for bedGraph output)
• Methylation extractor: Reference sequence files containing pipe ('|') characters were found to crash the methylation extractor as the chromosome name was used for filenames. These characters are now replaced with underscores when the reads are sorted during the bedGraph step
• Updated the Bismark User Guide with sections for the bedGraph and genome-wide methylation report outputs, and Appendix IV is now showing alignment stats for the test data

Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/
Leave a comment:
fkrueger replied

02-27-2013, 10:30 AM
I would always run adapter trimming, simply because it removes incorrect portions of the read that you can't spot between millions of other reads and it also allows you to use fairly strict mapping settings (such as the default parameters), which is a) much quicker and b) less prone to mismapping events. I believe that increasing -n from 1 to 2 for the test dataset in the Bismark paper would increase the alignment time from ~30 mins to 150 mins (!) without noticably improving the mapping results....

Can you please keep me posted whether the methylation extractor works with your two approaches and/or if it was the evil bash script...

Cheers,
Felix
Leave a comment:
mxenoph replied

02-27-2013, 08:13 AM
Thank you for your quick reply.

Based on the fastqc reports there wasn't any adapter contamination problem. However the PHRED scores were quite low towards the read end..didn't think that would affect the mapping so much though and was counting on -n set to a high number to offset any issues.
Is there an empirical rule that tells you that you should trim the data if a certain percentage of the bases in the read have low PHRED or should I always perform the trimming? I currently run bismark on the trimmed reads and hopefully that will increase the mapping efficiency.

About the error, I called the command from a bash script:

bismark_methylation_extractor -o "$out_dir/$exp/" --comprehensive --report --bedGraph -s "$infile"

and I am now running the same command from the terminal with absolute and relative path for the output directory, just to check if that's what causes the problem. Hopefully I am not doing more stupid than that!
Leave a comment:
fkrueger replied

02-27-2013, 02:05 AM
Originally posted by mxenoph View Post

Hi,

I just started using bismark (v0.7.7) for a set of publicly available dataset (GSE30199) from mouse and I am running on some errors and low mapping efficiency for paired-end reads.

The dataset contains both single-end and paired-end reads. Bismark maps the SE reads with an efficiency of about 55% for most of the samples, but for the PE data the mapping efficiency for all the samples is really low (less than 10% nearly in all samples). I initially ran bismark with the -q -n 1 -k 2 --best and the -I and -X as defined by the PE library and --directional, with the bowtie 1 aligner; however the mapping efficiency only increased by ~5% when I ran it again changing only the options -n to 2 and -I and -X set to the default values. A colleague of mine ran an older version of bismark on the same data, but treated the PE files as SE and got mapping efficiencies of ~60%. Also, the bisulfite genome was prepared for MM9 in that case whereas I am using MM10 (but don't think that caused the poor mapping).

I am wondering if treating PE as SE would be a good () solution to my problem or if I could use a different set of options to increase the mapping efficiency for the PE reads.

Now, I also get an error (see below) when running bismark_methylation_extractor on the SE sam files.

No such file or directory at /homes/mxenoph/bin/bismark_methylation_extractor line 175, <IN> line 35808595[/FONT]

the script has already processed 35808571 lines of the sam file and produced the output files except the bedGraph one. I briefly checked the bismark_methylation_extractor and on line 175 it tries to create the bedGraph file. What I don't get is why it tries to set the output directory for that file as the directory of the script and not the specified (by -o) one.

Any input would be very much appreciated. Thanks for bearing with me and my long post!

Hi there,

If you see a much lower mapping efficiency for paired end files compared to single end mode it is usually caused by one of the following factors:
- Adapter contamination or poor qualities, possibly especially in read 2
- The fragment size was larger than the allowed insert size (which is obviously not the problem here)
- The fragment length was so short that you started getting paired end reads that completely overlap

To rectify that I would suggest that you get Trim Galore and simply run it with the default parameters, like so:

trim_galore --paired --trim1 read_1.fastq read_2.fastq

This will
- remove poor quality base calls from the 3’ end of reads
- remove Illumina adapters from both reads
- trim 1 extra bp from the 3’ end so that completely overlapping fragments do still align with Bowtie 1

The Trim Galore documentation gives you more details about why these overlapping reads are a problem, and you can also take a look at this protocol of how we treat our samples internally. I am convinced that following these steps will fix the low mapping efficiencies you are seeing.

In general, using PE as SE reads would work, but it is obviously not making the most out of the data. In any case, for paired-end data you should use the option --no_overlap for the methylation extractor to disregard redundant information of overlapping reads.

About the error message in the methylation extractor: I would probably need some more information about the command you used, line 176 in my current development version (0.7.8) reads:
open (OUT,'>',$output_dir.$bedGraph_output) …
which should attempt to write the bedGraph output to the directory specified with –o. Please find the current version of the methylation extractor attached, and just send me and email if there are any further problems.
Attached Files

bismark_methylation_extractor.pl (144.4 KB, 28 views)
Leave a comment:
mxenoph replied

02-26-2013, 04:23 PM
Low mapping efficiency and bedGraph error

Hi,

I just started using bismark (v0.7.7) for a set of publicly available dataset (GSE30199) from mouse and I am running on some errors and low mapping efficiency for paired-end reads.

The dataset contains both single-end and paired-end reads. Bismark maps the SE reads with an efficiency of about 55% for most of the samples, but for the PE data the mapping efficiency for all the samples is really low (less than 10% nearly in all samples). I initially ran bismark with the -q -n 1 -k 2 --best and the -I and -X as defined by the PE library and --directional, with the bowtie 1 aligner; however the mapping efficiency only increased by ~5% when I ran it again changing only the options -n to 2 and -I and -X set to the default values. A colleague of mine ran an older version of bismark on the same data, but treated the PE files as SE and got mapping efficiencies of ~60%. Also, the bisulfite genome was prepared for MM9 in that case whereas I am using MM10 (but don't think that caused the poor mapping).

I am wondering if treating PE as SE would be a good () solution to my problem or if I could use a different set of options to increase the mapping efficiency for the PE reads.

Now, I also get an error (see below) when running bismark_methylation_extractor on the SE sam files.

No such file or directory at /homes/mxenoph/bin/bismark_methylation_extractor line 175, <IN> line 35808595[/FONT]

the script has already processed 35808571 lines of the sam file and produced the output files except the bedGraph one. I briefly checked the bismark_methylation_extractor and on line 175 it tries to create the bedGraph file. What I don't get is why it tries to set the output directory for that file as the directory of the script and not the specified (by -o) one.

Any input would be very much appreciated. Thanks for bearing with me and my long post!
Leave a comment:
pkachroo replied

02-18-2013, 01:01 AM
Thanks a lot Felix for such elaborate & clear reply. It surely helped me to clear these initial doubts and i can proceed accordingly

PS: Regarding alignment i was confused before but now it is clear to me & there was no memory warning after running the cluster job for Bismark.
Leave a comment:
fkrueger replied

02-17-2013, 01:18 AM
Originally posted by pkachroo View Post

Dear all,

Hello. My name is Priya. I am working on RRBS data. I used BISMARK for paired-end Illumina data. I had few queries regarding the ouput. Could someone kindly look into my output file data as I am not sure if the output is not unusual and I can proceed surely with further downstream analysis

1. I am attaching normal text document also containing mapping report for:

B1929_NoIndex_L008_R1_001_val_1.fq_Bismark_paired-end_mapping_report.txt

and few lines of CpG Methylation ratio and bedgraph file which do not seem properly aligned to me so I am a bit confused like mentioned by Felix Krueger above:
<Chromosome> <Start Position> <End Position> <Methylation Percentage>

2. Can methylation percentage be zero like I have in my text document ? In this case do I need to filter ratios with value < 5 or 10 < before proceeding further.

3. Is is better/compulsory to use --no_overlap option for paired end data because I did not use it for my results ?

4. Also I am not sure how to understand if C's are methylated by just looking at the Methylation ratio file as mentioned in the thread written by user "wilhelml":

Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data - SEQanswers

http://seqanswers.com/forums/showthread.php?p=46285

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

read_id:2:82:3541:4552#0 - 1 1343509 z : C : un-methylated C on top strand
read_id:2:82:3541:4552#0 + 1 1343515 Z : C : methylated C on top strand

Is there any Key in BISMARK for understanding this ?

I would be grateful if I get some help from this forum for my further analysis.

Thanks & sincere regards,

Priya

Hi Priya,

To 1) What exactly doesn’t seem properly aligned to you in the output files? Just as a reminder, bedGraph files are naturally 0-based, which might cause calls to look offset by 1. What is a bit puzzling is that the bedGraph file doesn’t appear to be sorted properly by chromosome and position, did you get any warning (memory etc) during the run?
As a general comment your files seem to have aligned with an alright mapping efficiency (looks a bit like ~40bp long reads?), however the methylation percentages especially in non-CG context are really quite high:
C methylated in CpG context: 73.5%
C methylated in CHG context: 29.9%
C methylated in CHH context: 29.7%

This normally indicates that something did not work out as expected with the bisulfite conversion efficiency. If this is the case then you should probably not start analyzing the results much further…

To 2) Yes, the methylation can be zero whenever you only have unmethylated but no methylated reads. You do not want to filter out on reads based on the percentage methylation, but you might want to filter position based on the read coverage. If you run the methylation extractor with the options --bedGraph and --counts you can see the exact number of methylated and unmethylated calls as well, and you could also use --cutoff_threshold to filter for a certain read depth per position. You could of course also filter later with the package you are using for the downstream analysis (we use the Bisfulfite methylation pipeline in SeqMonk for such a coverage filtering step).

To 3) We would recommend using --no_overlap for paired end data because you would get an unfair coverage of overlapping parts of reads, which might affect coverage filtering of reads. As overlapping calls are technical replicates of each other that were merely copied by PCR they should be the same anyway.

To 4) This is really simple. Cytosine positions receiving a '+' are methylated and '–' means unmethylated. The last column only indicates the cytosine context which may seem redundant but if you write several outputs into the same file (e.g. CHH and CHG context) it is useful if you want to separate them out later. So,
read_id:2:82:3541:4552#0 - 1 1343509 z is unmethylated (CpG)
read_id:2:82:3541:4552#0 + 1 1343515 Z is methylated (CpG)

It is not immediately obvious just from this position to tell the strand on which this C resides, however the methylation extractor output sorts the calls into files that carry the strand name in their file names.

If you have any further questions don’t hesitate to contact me further via email.

Best,
Felix
Leave a comment:
fkrueger replied

02-17-2013, 12:47 AM
Originally posted by kuul2jai View Post

Hi,
I tried to used the option to align multiple fastq.gz files at once.

bismark $refgenome_dir -1 lane1_r1.fastq.gz,lane2_r1.fastq.gz -2 lane1_r2.fastq.gz,lane2_r2.fastq.gz

Only lane1_r1 & r2 were recognized, but not lane2 r1 & r2. For now I think I can fix this by catting the fastq files together before running bismark.

PS: I tried different seperators: space, comma, comma with space

Paired-end mode requires files to be comma separated, so
-1 lane1_r1.fastq.gz,lane2_r1.fastq.gz -2 lane1_r2.fastq.gz,lane2_r2.fastq.gz
should have worked. Could you send me the on screen messages that you got to [email protected]
Leave a comment:
kuul2jai replied

02-16-2013, 02:46 PM
Hi,
I tried to used the option to align multiple fastq.gz files at once.

bismark $refgenome_dir -1 lane1_r1.fastq.gz,lane2_r1.fastq.gz -2 lane1_r2.fastq.gz,lane2_r2.fastq.gz

Only lane1_r1 & r2 were recognized, but not lane2 r1 & r2. For now I think I can fix this by catting the fastq files together before running bismark.

PS: I tried different seperators: space, comma, comma with space
Leave a comment:
pkachroo replied

02-15-2013, 07:34 AM
Bismark output details

Dear all,

Hello. My name is Priya. I am working on RRBS data. I used BISMARK for paired-end Illumina data. I had few queries regarding the ouput. Could someone kindly look into my output file data as I am not sure if the output is not unusual and I can proceed surely with further downstream analysis

1. I am attaching normal text document also containing mapping report for:

B1929_NoIndex_L008_R1_001_val_1.fq_Bismark_paired-end_mapping_report.txt

and few lines of CpG Methylation ratio and bedgraph file which do not seem properly aligned to me so I am a bit confused like mentioned by Felix Krueger above:
<Chromosome> <Start Position> <End Position> <Methylation Percentage>

2. Can methylation percentage be zero like I have in my text document ? In this case do I need to filter ratios with value < 5 or 10 < before proceeding further.

3. Is is better/compulsory to use --no_overlap option for paired end data because I did not use it for my results ?

4. Also I am not sure how to understand if C's are methylated by just looking at the Methylation ratio file as mentioned in the thread written by user "wilhelml":

Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data - SEQanswers

http://seqanswers.com/forums/showthread.php?p=46285

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

read_id:2:82:3541:4552#0 - 1 1343509 z : C : un-methylated C on top strand
read_id:2:82:3541:4552#0 + 1 1343515 Z : C : methylated C on top strand

Is there any Key in BISMARK for understanding this ?

I would be grateful if I get some help from this forum for my further analysis.

Thanks & sincere regards,

Priya
Attached Files

bismark_output.doc (3.8 KB, 9 views)
Leave a comment:
fkrueger replied

10-31-2012, 09:37 AM
Originally posted by shadow19c View Post

Hello,
I want to know how can I have the description of each position of the genome:
I have used Bsseeker before and the output I had

1.Read ID (from the header columns in Solexa seq/fastq/qseq/fasta file, or a serial number of the original input)

2.Number of mismatches between the genomic seq and the BS read list in columns 6 and 7. The bisulfite converted sites between read Ts to genomic Cs are not included.

3.The strand which the read may be from (+FW, +RC, -RC, -FW)

4.The coordinate of the mapped position: the first 2 digits indicate the chromosome, the "+" or "-" indicate the mapped strand. The last 10 digits are the 0-based, 5'-end coordinate of the mapped genomic sequence on the Watson strand.

5.The genomic sequence of the mapped region plus +2 and -2 bps.

6.BS read sequences from 5' to 3': if the reads are uniquely mapped as they were FW reads, the original reads are shown. If the reads are uniquely mapped as they were RC reads, their reverse complements are shown.

7.Summarised sequence of methylated sites: the methylated CG/CHG/CHH sites are marked as X/Y/Z (upper case), whereas the unmethylated CG/CHG/CHH sites are marked as x/y/z (lower case). This column is summarised directly from Columns 6 and 7.

8.Index=1 if three consecutive methylation non-CG sites appear. Index =0, otherwise.

Is similar as Bismark with vanilla output.

I developped a script whick I can obtain the number of unmethylated or methylated reads in a curretn stranbd , and also the total number of reads in the current strand and methylation level.

To resume it's like a coverage files which give me the mean or median of coverga cytosine and uncoverage cytosine!!!!

Any idea?

Hi Mohamed,

I am afraid I don't have a script that converts the Bismark output to BS-Seeker output so you can use your pre-existing pipeline; however pretty much all the points mentioned above are contained within the Bismark, methylation extractor, the full cytosine context output or several of them.

Specifically, the full genome cytosine report seems to be what you are looking for. The output can either be for all CpG positions or optionally for all genomic cytosines (all contexts). The genome-wide cytosine methylation output file (optional) is tab-delimited in the following format:

<chromosome> <position> <strand> <count methylated> <count non-methylated> <C-context> <trinucleotide context>

Please read the methylation extractor documentation or type 'bismark_methylation_extractor --help'.
Leave a comment:
shadow19c replied

10-31-2012, 05:37 AM
Hello,
I want to know how can I have the description of each position of the genome:
I have used Bsseeker before and the output I had

1.Read ID (from the header columns in Solexa seq/fastq/qseq/fasta file, or a serial number of the original input)

2.Number of mismatches between the genomic seq and the BS read list in columns 6 and 7. The bisulfite converted sites between read Ts to genomic Cs are not included.

3.The strand which the read may be from (+FW, +RC, -RC, -FW)

4.The coordinate of the mapped position: the first 2 digits indicate the chromosome, the "+" or "-" indicate the mapped strand. The last 10 digits are the 0-based, 5'-end coordinate of the mapped genomic sequence on the Watson strand.

5.The genomic sequence of the mapped region plus +2 and -2 bps.

6.BS read sequences from 5' to 3': if the reads are uniquely mapped as they were FW reads, the original reads are shown. If the reads are uniquely mapped as they were RC reads, their reverse complements are shown.

7.Summarised sequence of methylated sites: the methylated CG/CHG/CHH sites are marked as X/Y/Z (upper case), whereas the unmethylated CG/CHG/CHH sites are marked as x/y/z (lower case). This column is summarised directly from Columns 6 and 7.

8.Index=1 if three consecutive methylation non-CG sites appear. Index =0, otherwise.

Is similar as Bismark with vanilla output.

I developped a script whick I can obtain the number of unmethylated or methylated reads in a curretn stranbd , and also the total number of reads in the current strand and methylation level.

To resume it's like a coverage files which give me the mean or median of coverga cytosine and uncoverage cytosine!!!!

Any idea?
Leave a comment:
PeteH replied

10-30-2012, 12:08 AM
Originally posted by fkrueger View Post

Hi Pete,

The '|' characters in the file name do indeed seem to redirect the output instead of merely creating a temporary file name. I have amended the script to replace pipe characters with underscores now (attached to this note), hope it works.
As a side note, the same problem is likely to also affect the latest version of the Bismark methylation extractor; I shall have this fixed and put up on the Bismark project page upon my return from annual leave.

No hurry, and thanks!
Leave a comment:
fkrueger replied

10-29-2012, 10:38 PM
Originally posted by JCrooks

Bismark is an amazing tool for mapping and analyzing bisulfite-seq. I have used it by myself it is very helpful in this concern. I Will also suggest it for new users and researchers, because it could really help them a great deal.

Thanks for all the fish!
Leave a comment:
fkrueger replied

10-29-2012, 10:37 PM
Hi Pete,

The '|' characters in the file name do indeed seem to redirect the output instead of merely creating a temporary file name. I have amended the script to replace pipe characters with underscores now (attached to this note), hope it works.
As a side note, the same problem is likely to also affect the latest version of the Bismark methylation extractor; I shall have this fixed and put up on the Bismark project page upon my return from annual leave.
Attached Files

genome_methylation_bismark2bedGraph_v4_no_pipe.pl (9.0 KB, 31 views)
Leave a comment:

Previous 1 21 28 29 30 31 32 33 34 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News