Originally posted by zeam
Thanks for your inquiry; I will try to explain everything as briefly and fully as possible.
I just had a quick look at the paper you mentioned. Bismark is most similar to their algorithm (1); however, it does not only perform a C->T conversion of the Watson and Crick strands of the reference and the reads, but also a G->A conversion of both the reads and the reference. The method described in their algorithm (1) works fine for what we call directional libraries, i.e. libraries where you only expect to see C->T converted reads originating from either the Watson or the Crick strand, as in Lister et al., 2009 (if you have a library like this, specify --directional when running Bismark). If libraries are non-directional, however, you will see reads from all four possible bisulfite strands (which is the limitation they describe as the G-poor strand), as in Popp et al., 2010. Thus, Bismark also works very well for non-directional libraries, which the approach described in (1) does not.
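To make the four-strand idea concrete, here is a minimal sketch (my own illustration, not Bismark's actual code) of the in-silico conversions: for a non-directional library, the converted read is aligned against all four combinations of converted read and converted reference, whereas a directional library only needs the C->T-converted read.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion as seen on the Watson strand: C -> T."""
    return seq.upper().replace("C", "T")

def g_to_a(seq):
    """Equivalent conversion as seen on the Crick strand: G -> A."""
    return seq.upper().replace("G", "A")

reference = "ACGTCGGA"
read      = "ATGTTGGA"   # bisulfite-converted read (unmethylated Cs appear as T)

# Non-directional libraries are aligned against all four combinations;
# directional libraries only need the C->T-converted read.
conversions = {
    "read_CT_vs_ref_CT": (c_to_t(read), c_to_t(reference)),
    "read_CT_vs_ref_GA": (c_to_t(read), g_to_a(reference)),
    "read_GA_vs_ref_CT": (g_to_a(read), c_to_t(reference)),
    "read_GA_vs_ref_GA": (g_to_a(read), g_to_a(reference)),
}

for name, (r, g) in conversions.items():
    print(name, r, g)
```

The function names and example sequences are made up for illustration only.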
The number of tolerated non-bisulfite mismatches can mainly be adjusted using the parameters -n and -l (and possibly also -e). In general, I recommend keeping the tolerated number of non-BS mismatches to a minimum: the reduced complexity of the reads makes mapping hard enough, and allowing extra non-BS mismatches makes false mappings even more likely.
Bismark converts both the read sequence and the reference sequence into BS-space to avoid mapping bias due to the methylation state of the read. In this way, "BS mismatches" will in fact be perfect matches during the mapping process. For the unique best alignment, methylation is called at every position where there is a C in the reference genome and a C or T in the read (or where there is a G in the genome and a G or A in the read, depending on which strand the sequence mapped to).
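The calling rule for a read mapped to the Watson strand could be sketched like this (a simplified illustration under my own naming, not Bismark's implementation; the Crick-strand case would use G in the reference and G/A in the read instead):

```python
def call_methylation_watson(ref, read):
    """Per-position call for a Watson-strand alignment:
    'M' = reference C retained as C in the read (methylated),
    'U' = reference C read as T (bisulfite-converted, unmethylated),
    '.' = no call (not a reference C, or a genuine non-BS mismatch)."""
    calls = []
    for r_base, q_base in zip(ref.upper(), read.upper()):
        if r_base != "C":
            calls.append(".")   # methylation is only called at reference Cs
        elif q_base == "C":
            calls.append("M")   # C protected from conversion -> methylated
        elif q_base == "T":
            calls.append("U")   # C converted to T -> unmethylated
        else:
            calls.append(".")   # non-BS mismatch: no methylation call
    return "".join(calls)

print(call_methylation_watson("ACGTC", "ACGTT"))  # -> .M..U
```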
In some cases there can be a C in the read sequence and a T in the reference sequence, which should in fact count as a non-BS mismatch but is no longer detectable after the conversion. There are two ways to handle this.
(a) one can look at these positions and decide to discard the reads in question
(b) one can leave these reads in, because removing them would preferentially remove methylated reads (as they contain more Cs; this also applies to the G/A case) and thus introduce a bias. By doing so you would not perform methylation calls at the positions in question and thus would not introduce artificial methylation calls (as they are a T in the reference), but you accept that there might be some degree of mismapped reads (keeping the non-BS mismatch allowance to a minimum helps remove such mismaps)
I personally tried to introduce as few arbitrary filtering steps as possible that might bias the methylation level in either direction, which is why Bismark keeps these reads in. If one wanted to get rid of these sequences, one could run a simple script on the Bismark result files to remove such reads. I haven't checked the prevalence of these read species yet, but I reckon that if the read length is reasonably long and the non-BS mismatch allowance is kept to a reasonable minimum, mismaps shouldn't be a problem at all.
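The kind of simple post-filter mentioned above, in the spirit of option (a), might look like this (a hypothetical sketch of my own, not an official Bismark tool; it assumes you can pull the aligned reference and read sequences out of the result files):

```python
def has_c_over_ref_t(ref, read):
    """True if any aligned position is a T in the reference but a C in the
    read: a non-BS mismatch that the C->T conversion hides during mapping."""
    return any(r == "T" and q == "C"
               for r, q in zip(ref.upper(), read.upper()))

# Keep only alignments without such hidden mismatches.
alignments = [("ACTTG", "ACTTG"), ("ACTTG", "ACCTG")]
kept = [(ref, rd) for ref, rd in alignments if not has_c_over_ref_t(ref, rd)]
print(len(kept))  # the second read has a C over a reference T and is dropped
```

Whether the filtering is worth the methylation-level bias it may introduce is exactly the trade-off between options (a) and (b) above.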
I hope this answered your questions; if I was unclear, you can also send me an email directly at
[email protected].
Best wishes,
Felix