Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

fkrueger replied

10-05-2012, 04:28 AM
Originally posted by shadow19c View Post

hello,
thank you very much.
I have a question about deduplication: what does it mean?

If your sample contains several reads that start and end at exactly the same position, there is a good chance that you are not looking at genuinely unique sequences but are instead sequencing the same fragment repeatedly because it was amplified by PCR. A deduplication step reduces all alignments with the same genomic coordinates to a single read. Some more details should also be mentioned in the aforementioned guide document.
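
To make the idea concrete, here is a minimal Python sketch of that reduction. It is not Bismark's own deduplication code: the `Alignment` record and the `deduplicate` function are hypothetical names made up for illustration, and including the strand in the key is an assumption on my part (the post only speaks of identical genomic coordinates).

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record, for illustration only."""
    read_id: str
    chrom: str
    start: int
    end: int
    strand: str


def deduplicate(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep only the first alignment seen for each set of coordinates."""
    seen = set()
    kept = []
    for aln in alignments:
        # Assumption: strand is part of the key, so reads on opposite
        # strands are never collapsed into each other.
        key = (aln.chrom, aln.start, aln.end, aln.strand)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


reads = [
    Alignment("read1", "chr1", 100, 150, "+"),
    Alignment("read2", "chr1", 100, 150, "+"),  # same coordinates as read1
    Alignment("read3", "chr1", 120, 170, "+"),
]
unique = deduplicate(reads)
print([aln.read_id for aln in unique])
```

Keeping the first read of each group is an arbitrary choice in this sketch; any single representative would serve the same purpose.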
shadow19c replied

10-05-2012, 04:23 AM
hello,
thank you very much.
I have a question about deduplication: what does it mean?

If I am doing BS-seq in Arabidopsis thaliana, is deduplication needed?
fkrueger replied

10-04-2012, 07:08 AM
It is highly recommended to remove adapters and poor quality portions from reads to increase the mapping efficiency and confidence in the methylation data.

A typical workflow would be:

Raw data --> FastQC (quality control) --> Trim Galore (adapter/quality trimming) --> Bismark (alignments) --> deduplication --> downstream analysis of your choice

Here is a guide-document explaining all these steps in more detail.

Best,
Felix
shadow19c replied

10-04-2012, 07:02 AM
Hello everyone,
I'm new here and I have a lot of questions about BS-seq, and more precisely about Bismark.
I see that we can use Bismark to map paired-end data, but do you know whether I have to remove adapters and short fragments first?
If so, do you know of any programs that do this, or should I just remove them directly from my reads?

Thanks
fkrueger replied

10-02-2012, 02:04 AM
We have just released a new version of Bismark (version 0.7.7), which mainly extends the functionality of the Bismark methylation extractor, as recently discussed here on SeqAnswers. The methylation extractor now includes the functionality of the two additional scripts genome_methylation_bismark2bedGraph and genome_wide_cytosine_report; this means that, in addition to the standard methylation extractor output, it can generate sorted bedGraph and/or genome-wide cytosine report output files directly using the options --bedGraph or --cytosine_report, respectively.

Here are all changes in more detail:

Bismark
• When reading in the genome file Bismark does now automatically remove \r line ending characters as well. This sometimes caused problems when genome files had been edited on Windows machines.
• Added support for the Bowtie 2 options '--rdg int1,int2' and '--rfg int1,int2' to adjust the gap open and extension penalties for both read and reference sequence. This might be useful for very special conditions (e.g. PacBio data...)

Bismark methylation extractor
• Renamed methylation_extractor to bismark_methylation_extractor
• Added new function '-o/--output' to specify an output directory. This became necessary for integration into Galaxy
• Added new function '--no_header' to suppress the Bismark version header in the output files if plain alignment data is more desirable
• Added option '--bedGraph' to produce a bedGraph output file once the methylation extraction has finished; this reports the genomic location of a cytosine and its methylation state (in %). By default, only cytosines in CpG context will be sorted/reported
• Implemented option '--cutoff threshold' to set the minimum number of times a methylation state has to be seen for that nucleotide before its methylation percentage is reported
• Implemented option '--counts' which adds two additional columns to the bedGraph output file to enable further calculations:
Column 5: count of methylated calls per position
Column 6: count of unmethylated calls per position
• Implemented option '--CX_context' so that the sorted bedGraph output file contains information on every single cytosine that was covered in the experiment irrespective of its sequence context
• Added option '--cytosine_report' which produces a genome-wide methylation report for all cytosines. By default, the output uses 1-based chromosome coordinates and reports CpG context only. The output considers all Cs on both forward and reverse strands and reports their position, strand, trinucleotide context and methylation state
• Option '--CX_context' applies to the cytosine report as well. The output file will contain information on every single cytosine in the genome irrespective of its context. This applies to both forward and reverse strands
• Implemented option '--zero_based' to use zero-based coordinates, as used in e.g. BED files, instead of 1-based coordinates
• Implemented option '--genome_folder PATH' to specify the genome folder from which sequences are extracted. Accepted formats are FastA files ending with '.fa' or '.fasta'
• Added an option '--split_by_chromosome' which writes the cytosine report output to individual chromosome files instead of to one single very large file
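
As a rough illustration of how the '--bedGraph', '--cutoff' and '--counts' options listed above fit together, here is a minimal Python sketch. It is not taken from the Bismark source. The function name `format_row` is hypothetical, the first four columns (chromosome, start, end, methylation percentage) are an assumption based on the usual bedGraph layout, and the threshold is interpreted as the total number of calls at a position, which is only my reading of the option description.

```python
from typing import Optional


def format_row(chrom: str, start: int, end: int,
               methylated: int, unmethylated: int,
               cutoff: int = 1, counts: bool = False) -> Optional[str]:
    """Return one tab-separated bedGraph-style row, or None if below the cutoff."""
    total = methylated + unmethylated
    # Assumption: the cutoff applies to the total number of calls seen
    # at this position (methylated plus unmethylated).
    if total == 0 or total < cutoff:
        return None
    percent = 100.0 * methylated / total
    fields = [chrom, str(start), str(end), f"{percent:.1f}"]
    if counts:
        # Columns 5 and 6: methylated and unmethylated call counts.
        fields += [str(methylated), str(unmethylated)]
    return "\t".join(fields)


print(format_row("chr1", 100, 101, methylated=3, unmethylated=1, counts=True))
print(format_row("chr1", 200, 201, methylated=1, unmethylated=0, cutoff=2))
```

The start and end coordinates are passed in as-is so that the sketch makes no claim about how the extractor itself numbers positions.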

Bismark is available for download at www.bioinformatics.babraham.ac.uk/projects/
ELoomis replied

09-27-2012, 11:13 AM
Filtering out poorly converted reads

I'm using the non-CpG context cytosines as a measure of conversion efficiency for my sample, and I'd like to filter out any reads with a particularly low efficiency. This might be a bigger problem for my application (different sequencing platform, longer reads and locus specific) than most users doing RRBS, but I imagine this type of filter would be good for any bisulfite mapping application...
Ideally, you would be able to adjust a threshold in the command line and select which context to use as the measure (non-CpG vs. only CHH?).
Does this sound like something that other users would find useful?
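
To show what I have in mind, here is a minimal Python sketch of such a filter. It is not an existing Bismark option. It assumes each read comes with a per-base methylation call string in the letter code I understand Bismark to use (z/Z for CpG, x/X for CHG, h/H for CHH, upper case meaning methylated, i.e. unconverted). The function names and the `max_fraction` and `min_sites` parameters are hypothetical.

```python
from typing import Optional


def unconverted_fraction(calls: str, context: str = "non_cpg") -> Optional[float]:
    """Fraction of cytosines in the chosen context that were called methylated."""
    if context == "non_cpg":
        meth, unmeth = "XH", "xh"   # CHG and CHH
    elif context == "chh":
        meth, unmeth = "H", "h"     # CHH only
    else:
        raise ValueError(f"unknown context: {context}")
    n_meth = sum(calls.count(c) for c in meth)
    n_unmeth = sum(calls.count(c) for c in unmeth)
    total = n_meth + n_unmeth
    if total == 0:
        return None  # no informative cytosines in this read
    return n_meth / total


def passes_conversion_filter(calls: str, max_fraction: float = 0.1,
                             min_sites: int = 3, context: str = "non_cpg") -> bool:
    """Keep a read unless enough sites show it to be poorly converted."""
    letters = "XHxh" if context == "non_cpg" else "Hh"
    n_sites = sum(calls.count(c) for c in letters)
    if n_sites < min_sites:
        return True  # too few sites to judge; keeping the read is a design choice
    fraction = unconverted_fraction(calls, context)
    return fraction is not None and fraction <= max_fraction


print(passes_conversion_filter("..h..h..h..h..Z.."))   # well converted
print(passes_conversion_filter("..H..H..h..X..Z.."))   # poorly converted
```

Whether reads with too few informative sites should be kept or dropped is debatable; the sketch keeps them.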
my_bio replied

09-22-2012, 06:26 AM
Originally posted by fkrueger View Post

Some individual bases with a low base call quality can make it through indeed; this is due to the way the quality trimming is performed by Cutadapt. While I wouldn't have a certain number to go with it I would imagine that the amount of these is tiny (my guess is well below 0.1%). In the example you linked, the '#' is probably the Illumina flag for expressing that the pipeline had trouble determining the base signal, and this does not equal a general poor quality call.

After all, it would only matter if the base in question was a cytosine position in the genome (so ~20% of the time). And even then, the call might have been correct (albeit with a poor quality score), or might be one of the other bases that are not involved in methylation calling at all anyway, i.e. A, G or N (N is quite likely if the score was '#'). So overall I agree that this *might* result in an incorrect methylation call very occasionally, however if there were 10 such calls in ~5 billion correct calls (which you may easily get from one lane of HiSeq), I believe this is something one could live with (especially since quality checking would slow down the methylation extraction process quite noticeably...). Don't you think?

What you said is reasonable, thanks.
fkrueger replied

09-22-2012, 02:47 AM
Originally posted by my_bio View Post

Thank you for your prompt reply. Trim Galore is a powerful tool for quality control, and I have run our data through Trim Galore. However, a read may still contain a few bases with low sequencing quality after quality trimming. For example, the quality string of a read after trimming may look like this:
GEEEEEFFCFFEEEEEECEEEGGGGEFFFFDGDGFGGGGDGFGDFGGFDG#CCCCCBBBAA.
One base in this read has a low quality ("#"). If this base aligns uniquely to a cytosine in the reference genome, it may affect the accuracy of the methylation level, so we need to ignore this base.

Some individual bases with a low base call quality can make it through indeed; this is due to the way the quality trimming is performed by Cutadapt. While I wouldn't have a certain number to go with it I would imagine that the amount of these is tiny (my guess is well below 0.1%). In the example you linked, the '#' is probably the Illumina flag for expressing that the pipeline had trouble determining the base signal, and this does not equal a general poor quality call.

After all, it would only matter if the base in question was a cytosine position in the genome (so ~20% of the time). And even then, the call might have been correct (albeit with a poor quality score), or might be one of the other bases that are not involved in methylation calling at all anyway, i.e. A, G or N (N is quite likely if the score was '#'). So overall I agree that this *might* result in an incorrect methylation call very occasionally, however if there were 10 such calls in ~5 billion correct calls (which you may easily get from one lane of HiSeq), I believe this is something one could live with (especially since quality checking would slow down the methylation extraction process quite noticeably...). Don't you think?
my_bio replied

09-21-2012, 11:41 PM
Originally posted by fkrueger View Post

We strongly recommend adapter and quality trimming of sequencing files before the alignments are carried out in the first place, and indeed we run all our samples through Trim Galore to do this (a protocol is available here). If you adhere to this procedure there is no need to filter for good quality basecalls afterwards.

Thank you for your prompt reply. Trim Galore is a powerful tool for quality control, and I have run our data through Trim Galore. However, a read may still contain a few bases with low sequencing quality after quality trimming. For example, the quality string of a read after trimming may look like this:
GEEEEEFFCFFEEEEEECEEEGGGGEFFFFDGDGFGGGGDGFGDFGGFDG#CCCCCBBBAA.
One base in this read has a low quality ("#"). If this base aligns uniquely to a cytosine in the reference genome, it may affect the accuracy of the methylation level, so we need to ignore this base.
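
As a rough sketch of the per-base check I mean (not an existing methylation extractor option), the following Python snippet flags positions whose Phred score is below 20. It assumes Phred+33 encoding, under which '#' corresponds to a score of 2, and treats the final period of the string above as sentence punctuation. The function name `low_quality_positions` is hypothetical.

```python
from typing import List


def low_quality_positions(quality: str, min_phred: int = 20,
                          offset: int = 33) -> List[int]:
    """Return 0-based read positions whose Phred score is below min_phred."""
    return [i for i, ch in enumerate(quality) if ord(ch) - offset < min_phred]


# Quality string from the example above, without the trailing period.
quality = "GEEEEEFFCFFEEEEEECEEEGGGGEFFFFDGDGFGGGGDGFGDFGGFDG#CCCCCBBBAA"
flagged = low_quality_positions(quality)
print(flagged)
```

A methylation call at any flagged position would then be skipped instead of counted.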
fkrueger replied

09-21-2012, 10:53 PM
Originally posted by my_bio View Post

To calculate the methylation level of cytosines accurately, it would be necessary to add another option that filters out bases with low sequencing quality. That is to say, if a base's sequencing quality is lower than 20, the methylation extractor would ignore it.

We strongly recommend adapter and quality trimming of sequencing files before the alignments are carried out in the first place, and indeed we run all our samples through Trim Galore to do this (a protocol is available here). If you adhere to this procedure there is no need to filter for good quality basecalls afterwards.
my_bio replied

09-21-2012, 10:16 PM
To calculate the methylation level of cytosines accurately, it would be necessary to add another option that filters out bases with low sequencing quality. That is to say, if a base's sequencing quality is lower than 20, the methylation extractor would ignore it.
my_bio replied

09-21-2012, 06:09 PM
If a new version of the methylation extractor is released, please let us know. Thanks.
fkrueger replied

09-21-2012, 07:52 AM
Originally posted by my_bio View Post

It seems to work all right now, and I strongly suggest adding these functions to the methylation extractor. By the way, in my opinion it would be useful to split the output into separate files for each chromosome, so that we can process chromosomes in parallel in subsequent analyses.

I'll start thinking about implementing it in the methylation extractor if I find some time next week. Splitting the output per chromosome requires only a couple of extra lines, but we could have it as another option.
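
For anyone who wants to split an existing report in the meantime, here is a minimal Python sketch of the idea. It is not the code that will go into the extractor. It assumes the report is tab-delimited with the chromosome name in the first column, as in the format posted in this thread (<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>). The function name `split_by_chromosome` and the example lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_chromosome(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group report lines by the chromosome name in their first column."""
    per_chrom = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Assumption: columns are tab-separated and column 1 is the chromosome.
        chrom = line.split("\t", 1)[0]
        per_chrom[chrom].append(line)
    return dict(per_chrom)


# Made-up example lines in the seven-column format described above.
report = [
    "chr1\t10\t+\t3\t1\tCG\tCGA\n",
    "chr2\t55\t-\t0\t4\tCG\tCGT\n",
    "chr1\t42\t+\t2\t2\tCG\tCGG\n",
]
groups = split_by_chromosome(report)
for chrom, rows in groups.items():
    print(chrom, len(rows))
```

Each group could then be written to its own file and processed independently; for a very large report, writing rows to per-chromosome files as they are read would avoid holding everything in memory.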

Last edited by fkrueger; 09-21-2012, 08:14 AM.
my_bio replied

09-21-2012, 06:17 AM
Originally posted by fkrueger View Post

Hi my_bio,

I have now changed the output to be in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

I also fixed the compile errors; strangely enough, it ran without any warnings on our system... I hope it'll work nicely now.

It seems to work all right now, and I strongly suggest adding these functions to the methylation extractor. By the way, in my opinion it would be useful to split the output into separate files for each chromosome, so that we can process chromosomes in parallel in subsequent analyses.
fkrueger replied

09-21-2012, 01:42 AM
Hi my_bio,

I have now changed the output to be in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

I also fixed the compile errors; strangely enough, it ran without any warnings on our system... I hope it'll work nicely now.
Attached Files

genome_wide_cytosine_report.pl (12.5 KB, 51 views)
