Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

fkrueger replied

09-01-2014, 12:17 PM
I am keen to hear more about what you discover!
Leave a comment:
kevinrue replied

09-01-2014, 08:38 AM
Hi Felix,

Thanks for the feedback on the fastQC and alignment report. Plus the list of DMR tools. I only tested BiSeq so far during my initial attempt to an analytic pipeline, and didn't find any significant DMR, but Katja told me the pacakge was meant for RRBS analyses. Anyway, as you can see the proper alignment of my reads might have been the limiting factor.

Indeed, as far as I can see, bismark does indeed a good job aligning reads, with good quality reads and adequate parameters
Given my recent arrival in the bisulfite-sequencing field (I have only dealt with transcriptomics data for the past two years, microarray and RNA-seq), I can only strive to understand all the complexity of the bisulfite data, the sources during their obtention and their subsequent analysis which you are much more familiar with than I am.

New result

I have a little update which surprised me. I have attached a screenshot of the alignment rate for SingleEnd/PE reads mode, with or without truncating to 75bp. Two of them were only performed on the first million read/pairs using "-u 1000000" to post you this answer today

What surprised me, as I believed chimeric read/pairs were being discarded during the alignment, is that truncating both mates to 75 bp (or fewer if quality trimming left less than that) restored an alignment rate of 56.8% (of 1 million pair), compared to 28.4% (of 8.4 million pairs, in) for full-length read pairs.
If inserts were indeed chimeric, truncating the read pairs should identified those reads as chimeric and left them out again, shouldn't it? Or am I missing something?

Now, I think I will repeat tonight the alignment of truncated read pairs with the entire 8.4 million read pairs to satisfy my OCDs and give you a table with fully comparable numbers, but I am fairly certain that these figures are representative enough.

Many thanks for your help and time
Kevin
Attached Files

t1.png (4.7 KB, 42 views)
Last edited by kevinrue; 09-01-2014, 12:48 PM. Reason: missing image upload?!
Leave a comment:
fkrueger replied

09-01-2014, 07:37 AM
Hi Kevin,

The FastQC reports look completely normal, and the results you posted look quite decent as well. So for the record, can we just state that it's not really Bismark's fault but the PBAT protocol itself is generating wild artefacts .

Regarding DMR analysis: I don't think there anything is accepted as gold standard just yet, new tools are being published almost every week. Promising and widely used tools are: bsseq, BiSeq, methylKit, MOABS, RADMeth, DMAP, RnBeads and a few more, most of them interface rather nicely with the output of Bismark. Please find attached a review of the underlying concepts of some of these tools by Mark Robinson: http://biorxiv.org/content/biorxiv/e...07120.full.pdf, I hope it may help.
Leave a comment:
kevinrue replied

09-01-2014, 06:17 AM
Hi,

Thank you both for the very quick answers. Much appreciated. It took me a bit of time to get this answer together. I hope it's clear.

I didn't mention yet that I am working with the Bos taurus UMD3.1 genome. 3 billion base pairs on 29 autosomes + 2 sex chromosomes: comparable to human.

frozenlyse
I attached the fastQC archives (recompressed in tar format to satisfy the upload requirements of SeqAnswer), as I am not sure which part of it you were interested in. Due to the C->T conversion caused by the bisulfite treatment, I am not particularly concerned by:
the enriched proportion of T through the forward read

The enriched proportion of A in the reverse read

The low GC content in both reads

the enriched proportion of k-mers containing C and A in the forward read

the enriched proportion of k-mers containing G and T in the reverse read

bwameth sounds like a good alternative. I will look for a standard pipeline of analysis from alignment to differential methylation calls.. except if you have a link to offer

Felix

Sorry I missed that post, thanks for forwarding. I started reading it to understand the entire procedure, although it sounds convoluted. I suppose that after all those alignments, I would have to aggregate the methylation calls for each cytosine (from the paired alignment + single end forward + pbat single end reverse), and then feed it into an analytic software package.
What tools do you recommend for statistical analysis of differential methylation by the way?

Info
I just tried to align 1 million non-truncated forward reads in single end mode and got 60.7% again.
Looks like the problem of chimeric reads indeed.

Here is a copy paste of the full bismark report if it is any use:
Bismark report for: /workspace/scratch/krue/Methylation/2014-08-27_analysis/Trimmomatic/Paired/C1_ATCACG_1P.fastq (version: v0.12.1)
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)
Bismark was run with Bowtie 2 against the bisulfite genome of /workspace/storage/genomes/bostaurus/UMD3.1.75/source_file/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals

Final Alignment report
======================
Sequences analysed in total: 1000000
Number of alignments with a unique best hit from the different alignments: 606793
Mapping efficiency: 60.7%
Sequences with no alignments under any condition: 179520
Sequences did not map uniquely: 213687
Sequences which were discarded because genomic sequence could not be extracted: 5

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT: 303497 ((converted) top strand)
CT/GA: 303291 ((converted) bottom strand)
GA/CT: 0 (complementary to (converted) top strand)
GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 14776304

Total methylated C's in CpG context: 644197
Total methylated C's in CHG context: 32496
Total methylated C's in CHH context: 56579
Total methylated C's in Unknown context: 756

Total unmethylated C's in CpG context: 249170
Total unmethylated C's in CHG context: 3536727
Total unmethylated C's in CHH context: 10257135
Total unmethylated C's in Unknown context: 8798

C methylated in CpG context: 72.1%
C methylated in CHG context: 0.9%
C methylated in CHH context: 0.5%
C methylated in Unknown context (CN or CHN): 7.9%

C1_ATCACG_1P.fastq_bismark_bt2_SE_report.txt (END)
Attached Files

C1_ATCACG_1P_fastqc.tar (478.0 KB, 28 views)

C1_ATCACG_2P_fastqc.tar (421.5 KB, 22 views)
Leave a comment:
fkrueger replied

09-01-2014, 04:17 AM
A recent post in this thread was asking something quite similar, maybe the reply is helpful in some way:

Originally posted by fkrueger View Post

Hi eicht,

First of all, I can feel your pain, we have also spent quite a while trying to find the reason or solutions to poor mapping efficiencies for PE PBAT libraries. Generally, the original PBAT protocol only produces alignments to the complementary strands (CTOT and CTOB), however the EpiGnome kit produces standard, i.e. directional, libraries that align to the OT and OB strands only. This means that R1, and also paired-end alignments, should only align to the OT and OB strands; however R2 needs to be aligned to the CTOT and CTOB strands because it is the reverse complement of R1. This is exactly what the option --pbat does, and in that sense the mapping efficiencies you are describing make perfect sense.

We haven’t worked with data generated by EpiGnome but used a homebrew PBAT PE protocol, but I suppose the mechanisms will be the same since they seem to be caused by the random priming step in the protocol. Since we saw the drastically reduced mapping efficiencies in PE mode we reasoned that the protocol might generate chimeric reads, i.e. reads starting at some position in the genome and ending at another. To test this we mapped the data in SE mode and paired up pairs where both R1 and R2 produced alignments. We then imported this data into SeqMonk as Hi-C data, and looked at the amount of reads that had its partner in trans (on another chromosome) as a fairly crude measure. As you can see on the screenshot attached, when we quantitated where in the genome the partner reads aligned to of reads that aligned to chromosome 1, we found that they seem to be scattered pretty much all over the genome in no discernible pattern! In this case a bit more than 40% of reads had their partner read in trans on another chromosome (partner reads aligning to chr 1 but far away were not even considered here).

Now if you would perform some similar pairing up on your reads you might get a similar answer, and if you add up the numbers of properly aligning PE reads and the number of chimeric reads (which won’t be reported by Bismark since they are no valid PE alignments), you should be very close to the mapping efficiencies you achieve in SE mapping.

We came to a similar conclusion, and we by now also settled on the procedure you described (someone coined the term ‘Dirty Harry mode’ because it doesn’t seem to be the cleanest of methods…):

(1) directional PE alignment to start and grabbing the unmapped R1 and R2 reads out the end using the option ‘--unmapped’. Properly aligned PE reads should be methylation extracted using --no_overlap.
(2) unmappedR1 is then mapped SE directional
(3) unmappedR2 is mapped SE directional under -pbat conditions.
Single-end aligned R1 and R2 can then be methylation extracted normally as they should in theory map to different places in the genome anyway so don’t require attention to overlapping reads.

As another suggestion:
Like all PBAT applications, the first couple of bp (6bp in the case of the EpiGnome kit) produce a noticeable bias in sequence composition as well as later on in the M-bias plots (see an example here http://www.bioinformatics.babraham.a...SE_report.html). EpiCentre recommend ignoring the first 6bp of the methylation call string themselves, but it is probably a good idea to remove these residues even before mapping because we have seen that they may reduce the mapping efficiency somewhat. This could be done with --clip_r1 6 and --clip_r2 6 within Trim Galore.

I start to think that paired-end for bisulfite is more headache than it is worth sometimes.

At least for paired-end PBAT protocols I would probably second this notion, however even sequencing longer SE reads (e.g. 150bp SE) will likely start reading into the chimeric portion of the read so also have some mappability problems...

Just as another comment, we ran a couple of tests on 2x76bp EpiGnome paired end data as well and achieved mapping rates of >75% in all cases (using --bowtie2 default mode). So maybe the problem of chimaeric reads starts a little later which might explain the drop the efficiency for long reads? A tool supporting local alignment such as bwa-meth might indeed help in this case...
Leave a comment:
frozenlyse replied

09-01-2014, 03:58 AM
kevinrun - what does fastqc look like on your forward and reverse reads?

We've recently switched over to the same kit, and now only do 71bp PE reads (we can get that out of 2 50 cycle SBS kits). I similarly got not great alignment using bismark (even after trimming the 6bp from each end) - however I get good alignment using bwameth, and methylation % obtained on the same dataset run through either pipeline were very similar (r=0.97 IIRC) - so perhaps bwa-meth is worth a look? I get about 80% of reads aligning with a mapping quality >=40.
Leave a comment:
kevinrue replied

09-01-2014, 03:42 AM
Dear all,

Apologies if this was answered previously, but I couldn't find the information by Googling and browsing the thread. I am a postgraduate bioinformatics student dealing with the first methylation dataset in our group.

Issue
My issue is that I see strikingly low alignment rate by bismark (28-45% for bisulfite libraries, 49-55% for control non-bisulfite libraries).

Details
I was given a dataset of 2 x 150bp reads of whole-genome bisulfite sequencing libraries (WGBS) using the Epignome Methyl-Seq kit, sequencing using Illumina platform. I have clipped adapter sequences and trimmed low-quality bases with Trimmomatic, as suggested in the Epicentre guide (http://www.epibio.com/applications/b...eq-kit?details). This left over 90% of the reads in each sample with some workable sequence.

From reading I have done so far, I feel that the read length is much longer than necessary, and might even cause more problems than it brings information. It seem 50-75 read length is better suited to this type of analysis.
I would like to know if anyone could advise me appropriate non-default alignment parameters suited to my data. Do I have to control the number of mismatches allowed in the seed alignments? The length of the seeds? The overall alignment score? etc etc.
I prefer asking experienced users rather than test and overfit parameters of my own.

Other

In the meantime, I truncated the forward read file of the sample with worst alignment rate (28.4%) to 75 bp, and try to align that as single-end to see if I get a better alignment rate.
It turned out to align 63.9% !

Rather than clip all my reads, I'd be very grateful to learn how to salvage information from the full length reads

Thanks in advance!
Leave a comment:
fkrueger replied

08-23-2014, 02:09 AM
Originally posted by adeirossi View Post

Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew

I can only second Andrew's conclusions. Just for the record, you should also see a similar drop in CpG calls, but since these are less frequent than CHH or CHG calls it might be less obvious. If you mouse-over the HTML report you will see the actual number of calls involved. Thanks Andrew.
Leave a comment:
adeirossi replied

08-22-2014, 09:42 PM
Hi Christophe,

I have also seen a falling number of methylation calls for read 2 in my data. I have always assumed this was because I ran bismark_methylation_extractor with the --no_overlap option and many of the read pairs had overlaps (i.e. insert length less than twice the read length). With --no_overlap, when the extractor encounters an overlap between read 1 and read 2, I believe it uses the read 1 calls only and discards the read 2 calls, which would explain why the number of read 2 calls decreases as you go toward its 3' end. Because this "no overlap" filtering occurs after mapping, I assume there is no impact on the alignment of read 2.

If, on the other hand, you are not using the --no_overlap option or you don't have many reads pairs with inserts <200 bp, then I have no idea what's going on.

Andrew
Leave a comment:
christophe replied

08-22-2014, 03:21 PM
M-bias: CHH and CHG total calls drop off in Read2

Hi Everyone,

I have questions regarding M-bias plot and CHH,CHG methylation calls generated with bismark. I hope it is the right thread to ask otherwise let me know where should have I posted it.
Has anyone seen a drop off of CHH and CHG total calls within a M-bias plot generated by bismark tool?
I have observed this drop off only in the M-bias read2 plot.
(See Attached png plots)
The data are from Whole Genome BiSulphite with reads of 101 bases
Library Prep was performed using Ultralow Ovation MS kit.

How to interpret the falling number of calls after 50 bases? and ....
How this can impact the alignment of read2 in bismark? and the Methylation calls?

Thanks for your help,
Christophe
Attached Files

S1.fastq_bismark_bt2_pe.deduplicated.M-bias_R1.png (6.9 KB, 33 views)

S1.fastq_bismark_bt2_pe.deduplicated.M-bias_R2.png (8.7 KB, 33 views)
Leave a comment:
fkrueger replied

08-14-2014, 05:32 AM
Hi Sung,

This might be a useful feature but I think it would be more sensible to apply this on e.g. the coverage report which has all values sorted by position with percentage methylation and count methylated/unmethyalted already, rather than resticting the methylation extractor itself. It should be relatively straight forward to filter the .cov file with a small script and should only run for a few seconds, but I'll be off on holiday for a while now...
Leave a comment:
sunggong replied

08-14-2014, 04:48 AM
Hi Felix,

I thought it would be really nice if 'methyl_extrator' have an option to search a specified interval only (e.g. -L chr2, -L chr2:1000-2000, or -L <target_region.bed>).

Thanks,
Sung
Leave a comment:
fkrueger replied

08-07-2014, 12:43 AM
Hi Wilson,

If you've got many small FastQ files it should be absolutely fine to merge the BAM files and then run the methylation extractor. This is at least true for single-end files, however we have observed that samtools merge might interfere with the Read1/Read2 order of paired-end files, and they might need to be sorted by read name afterwards (e.g. with samtools sort -n). Again, since you seem to have single-end files it should be fine.

The only (minor) inconvenience is that you won't be able to generate the swish html report for the entire sample with bismark2report because the sample is split into many small reports.
Leave a comment:
wilson90 replied

08-06-2014, 06:00 PM
Hi. I am just wondering if it is advisable to merge bam files produced by Bismark using samtools merge, and then run methyl_extrator?

The reason why I am asking this question is because I have hundreds of gzipped single-end fastq files and I am very reluctant to merge all the fastq files. The most convenient way is to run bismark on every fastq files, and merge the bamfiles before the running methyl_extrator.

Is it "legal"?
Leave a comment:
fkrueger replied

07-29-2014, 11:11 AM
Hi Andrew,

I am currently away on holiday but I will definitely take a look once I am back. Thanks for your continued support of the methylationXtractor ;D
Leave a comment:

Previous 1 8 15 16 17 18 19 20 21 28 34 template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
The Role of Spliceosomes in RNA Splicing and Genome Evolution by seqadmin Started by seqadmin, 05-14-2024, 07:03 AM	0 responses 20 views 0 likes	Last Post by seqadmin 05-14-2024, 07:03 AM
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 44 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 54 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 42 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News