Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

akramdi replied

05-16-2016, 09:37 AM
Hi Felix,

Thanks for the quick response!

I have another question: I recently tried to understand why my paired-end BS-seq data maps with a mapping efficiency of 40% (it seemed rather low to me), so I conducted some tests as suggested in previous messages dealing with low mappability efficiency. One thing I did is map a subset (-u 1000000) of read 1 and read 2 individually in single end mode to see if the mapping efficiency would go up. Plus, to be safe, I ran the mapping in --non_directional mode because I am not sure of the library type.

Here's the part I don't understand: I have different OT, OB, CTOT, CTOB ratios between read1 and read2.

Mapping report for read1
Final Alignment report
======================
Sequences analysed in total: 1000000
Number of alignments with a unique best hit from the different alignments: 602427
Mapping efficiency: 60.2%
Sequences with no alignments under any condition: 177212
Sequences did not map uniquely: 220361
Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 299725 ((converted) top strand)
CT/GA: 302357 ((converted) bottom strand)
GA/CT: 150 (complementary to (converted) top strand)
GA/GA: 195 (complementary to (converted) bottom strand)

Mapping report for read2
Final Alignment report
======================
Sequences analysed in total: 1000000
Number of alignments with a unique best hit from the different alignments: 457737
Mapping efficiency: 45.8%
Sequences with no alignments under any condition: 365712
Sequences did not map uniquely: 176551
Sequences which were discarded because genomic sequence could not be extracted: 1

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 3950 ((converted) top strand)
CT/GA: 3937 ((converted) bottom strand)
GA/CT: 223725 (complementary to (converted) top strand)
GA/GA: 226124 (complementary to (converted) bottom strand)

Have you come across a similar situation? what could cause such discordance?

Thanks again,
Amira
Leave a comment:
fkrueger replied

05-13-2016, 09:58 AM
Hi Amira,

The numbers do indeed suggest that the library was constructed in a directional manner. The alignments to the CTOT and CTOB strands are < 0.3% so this is not terrible (and it is quite expected).

To move on I would either suggest you re-align the data in the default mode, i.e. without --non_directional specified, or if you wanted to continue with the data as it is I would discard the CTO* files before continuing with the bismark2bedGraph step.

Cheers, Felix
Leave a comment:
akramdi replied

05-13-2016, 09:18 AM
Hi Felix,

I was sent some bismark outputs (+ reports) to analyze. I am looking at the mapping reports, and I wondering if the ratio of mapped reads between OT, OB, CTOT, CTOB is strange or not. Can I conclude that the library was constructed in a directionnal manner? If so, I would have expected to see less reads aligned to CTOT, CTOB.

Final Alignment report
======================
Sequences analysed in total: 44844469
Number of alignments with a unique best hit from the different alignments: 30974987
Mapping efficiency: 69.1%
Sequences with no alignments under any condition: 4171455
Sequences did not map uniquely: 9698027
Sequences which were discarded because genomic sequence could not be extracted: 36

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 15485820 ((converted) top strand)
CT/GA: 15393574 ((converted) bottom strand)
GA/CT: 47834 (complementary to (converted) top strand)
GA/GA: 47723 (complementary to (converted) bottom strand)

Thanks!

Amira
Leave a comment:
alim123 replied

05-03-2016, 06:45 AM
Sir, one more question, as you suggest to trim the sequences further..So would i used raw seq or already trimmed sequence for further trimming.

Thank you
Leave a comment:
fkrueger replied

05-03-2016, 04:53 AM
Thanks for sending some QC plots via email. The images were fairly small but it looks like you are using 2x100bp reads generated with the EpiGnome kit (there is a hefty bias at the start of reads that needs to be removed before mapping.

Also you are using a version of Bismark that is almost 3 years and 22 releases outdated. Just get the latest versions here: https://github.com/FelixKrueger/Bismark or here: http://www.bioinformatics.babraham.a...jects/bismark/.

You then need to run Trim Galore with --clip_r1 6 --clip_r2 6 (potentially more), and then run the alignments using Bismark v0.16.1 in default mode.

Also please make sure to read up on biases using paired-end PBAT at QC Fail. I am pretty sure that you will see nice mapping efficiencies after following the steps in the cookbooks attached. Cheers, Felix
Leave a comment:
fkrueger replied

05-03-2016, 04:02 AM
Originally posted by alim123 View Post

Dear Felix

I am getting really low mapping efficiencies nearly 37% when i am using Bismark ( bowtie1, n=3).
when I am using BSmap with same sequences (pair end) on the same parameter, i am getting mapping efficiency nearly 80%.

But the problem is with bsmap i am further struggling for data analysis, like how to get methylated and non methylated reads from the .txt file.

So plz suggest me for better mapping efficiency. it will be really helpful for me.

Thank you

I am copying below the reply I have sent you via email already:

Hi Alim,

In order to give you a better diagnosis of the issues it would be good to have more information about your experiment, such as the sequencing length, the type of library, the genome in question and the version numbers of all programs involved etc. It would also be good to get access to the FastQC profiles of your samples.

General rules you should adhere to are:

1) Adapter and quality trimming, e.g. using Trim Galore
2) Use Bowtie2 mode (which is the default mode anyway)
3) Update to the latest version (v0.16.1)

More detailed hands-on information may also be found here:

Page Not Found - Epigenesys - platform for Health, Medicine & Advisors

http://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57

or here:

Babraham Bioinformatics - Training Courses

http://www.bioinformatics.babraham.ac.uk/training.html#bsseq

Hope this helps,
Best, Felix
Leave a comment:
alim123 replied

05-03-2016, 03:51 AM
Dear Felix

I am getting really low mapping efficiencies nearly 37% when i am using Bismark ( bowtie1, n=3).
when I am using BSmap with same sequences (pair end) on the same parameter, i am getting mapping efficiency nearly 80%.

But the problem is with bsmap i am further struggling for data analysis, like how to get methylated and non methylated reads from the .txt file.

So plz suggest me for better mapping efficiency. it will be really helpful for me.

Thank you
Leave a comment:
fkrueger replied

04-07-2016, 12:26 AM
Given that the majority of reads come from the CTOT and CTOB strands it is to be expected that C is high while G is low. The fraction aligning to the OT and OB strands would do the exact opposite, so I think overall the picture you are seeing is quite alright.

If you were operating on SAM files then cat should have worked (because this works on a line by line basis and doesn't do any clever (or weird) things as samtools merge does without -n. You only might have extra header lines in the middle somewhere but they should be ignored by the methylation extractor. Merging right from the start is probably the cleanest solution though.
Leave a comment:
johnstonL replied

04-06-2016, 08:38 AM
Felix, thanks for the reply.

Yes this library was prepared with the Zymo Meth-Seq kit as you have guessed. I have been thinking perhaps the library was not prepared that well. I ran a assessment on a raw sample in Fastqc and got the following base sequence content:

The % of cytosine is awfully high compared to other reports I have seen and considering this is RRBS data.

Actually, I have been simply concatenating the output files from the alignment step for input into the methylation extractor rather than using any sort of merging tool (i.e. cat sample1_1.sam sample1_2.sam > sample1). Would this cause the extractor to make incorrect methylation calls? I'm going to switch to combining fastq (with cat) instead like you have suggested to avoid this issue altogether.
Attached Files

Screen shot 2016-04-06 at 12.25.18 PM.png (68.9 KB, 132 views)
Leave a comment:
fkrueger replied

04-06-2016, 08:06 AM
This is indeed a somewhat odd ratio of strands, it reminds me very much of the single-cell BS-Seq technique which undergoes several rounds of PBAT-type pull-down and strand-regeneration. I have not heard of an RRBS-method that would be doing this, do you happen to know how these libraries were generated (e.g. using the Zymo Pico Methyl-Seq kit?).
I would probably just try to align only Read 1 (also in --non-directional mode) just to see if the mapping efficiency goes up a bit because 38% seems rather low).

In principle it is fine to merge files at the BAM level and then feed the merged file to the methylation extractor. A word of caution is in order for paired-end end files though if you are using Samtools merge because this merging does not normally guarantee that Read 1 and Read 2 are kept in the same order in the following file unless you specify the option -n:

Code:

Options: -n sort by read names

Similarly, already merged files my be sorted by read name using

Code:

samtools sort -n

to bring them back into the correct format.

If the R1/R2 order is muddled with then the overlap detection and the strand assignments would go a little mad so this is to be avoided.

I generally tend to merge files right from the start at the FastQ file level because this also reduces the number of files generated in the process considerably. Hope this helps, Felix
Leave a comment:
johnstonL replied

04-06-2016, 07:46 AM
Hi Felix,

I am working on analyzing some pair ended non-directional RRBS libraries with Bismark and have come across a few confusions. I begin by trimming the pair end files with Trim Galore (making use of the -rrbs -non-directional options and everything else default) and then aligning with bismark. What strikes me the most from the alignment report is that I get a strange ratio of mapped reads between OT, OB, CTOT, CTOB:

Final Alignment report
======================
Sequence pairs analysed in total: 12664432
Number of paired-end alignments with a unique best hit: 4529206
Mapping efficiency: 35.8%
Sequence pairs with no alignments under any condition: 5506764
Sequence pairs did not map uniquely: 2628462
Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 608018 ((converted) top strand)
GA/CT/CT: 1626505 (complementary to (converted) top strand)
GA/CT/GA: 1673560 (complementary to (converted) bottom strand)
CT/GA/GA: 621123 ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 117899629

Total methylated C's in CpG context: 7232502
Total methylated C's in CHG context: 362290
Total methylated C's in CHH context: 1218612
Total methylated C's in Unknown context: 2

Total unmethylated C's in CpG context: 13074777
Total unmethylated C's in CHG context: 29946985
Total unmethylated C's in CHH context: 66064463
Total unmethylated C's in Unknown context: 3

C methylated in CpG context: 35.6%
C methylated in CHG context: 1.2%
C methylated in CHH context: 1.8%
C methylated in unknown context (CN or CHN): 40.0%

Also, the way the library was given to me is that each sample is split into 2 pairs (sample1_pair1_forward, sample1_pair1_reverse, sample1_pair2_forward, sample1_pair2_reverse). If I wanted to run the methylation extractor on the samples, would it be a problem if I simply gave it the concatenated output of the alignment pairs?

i.e
sample1_pair1_aligned + sample1_pair2_aligned > sample1_aligned
methylation extractor sample1_aligned ...other samples

Thanks
Leave a comment:
hsiehph replied

03-16-2016, 05:46 PM
Originally posted by fkrueger View Post

Hi hsiehph,

I believe this problem has been fixed by now in this issue. You can get the latest development version of Bismark by cloning it from Github.

Thanks Felix. I will test it with the latest development version.
Leave a comment:
fkrueger replied

03-16-2016, 02:03 AM
Hi hsiehph,

I believe this problem has been fixed by now in this issue. You can get the latest development version of Bismark by cloning it from Github.
Leave a comment:
hsiehph replied

03-15-2016, 11:19 PM
Originally posted by fkrueger View Post

Hi Dipro,

This was indeed a typo which will be fixed in the next release which is actually due out today or tomorrow (and will finally support parallel alignments – so stay tuned!).

A couple of things about the command you used:

bismark_methylation_extractor -s -o --samtools_path --bedGraph --counts --remove_spaces --buffer_size --cytosine_report --genome_folder

'Failed to read from file /path/to/file_fq.gz_bismark_bt2.bismark.cov: No such file or directory'
Sorry if it is a stupid question, but did you change the ‘/path/to/file’ by a valid path of the file on your system?

-s: not necessary (will be determined automatically)
-o /requires/path/to/output/folder
--samtools_path /requires/path/to/samtools/executable
--counts: not necessary (used by default)
--remove_spaces: only use this if really necessary, will otherwise cost time and temporary space
--buffer_size: requires input, e.g. 10G
--genome_folder /requires/path/to/genome/folder

input file is required

If you still struggle can you just send me the onscreen-text via email? This would make spotting mistakes in the command much easier. Cheers, Felix

I also have a similar error message:
gzip: output_folder/input.bismark.cov.gz: No such file or directory
No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!

However, I can see the input_folder.bismark.cov.gz file exist in my output_folder/.

The command I used as follows:
bismark_methylation_extractor -p --no_overlap --bedGraph --counts --buffer_size 10G --cytosine_report --CX --split_by_chromosome -o output_folder/ --genome_folder genome_bowtie1/ --multicore 6 input_folder/input_folder.sam

Interestingly, when I removed the -o output_folder/ (i.e. to the current directory), the script will finish properly. Does anyone have similar experience? The version of bismark is the latest one. Thanks for help.
Leave a comment:
fkrueger replied

03-15-2016, 08:57 AM
Thanks for spotting that, and yes you are absolutely right it needs to be methylated / (methylated + unmethylated) *100.
Leave a comment:

Previous 1 3 4 5 6 7 8 9 16 34 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News