Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

adeirossi replied

07-28-2014, 04:11 PM
Hi Felix,

Sorry to perpetuate my reputation for breaking the methylation extractor, but I recently noticed an issue with the patch I submitted that was incorporated in the v0.12.2 release.

Background on the old patch: Running v0.12.1 of bismark_methylation_extractor, I experienced a failure when I used high --ignore or --ignore_r2 values that equaled or exceeded the alignment length. The old patch I came up with simply called "next" when it encountered this condition, and this indeed prevented the fatal error.

However, I recently noticed an unfortunate behavior: with --ignore set to 0 and --ignore_r2 set to a high value, when a read 2 alignment is encountered that is shorter than --ignore_r2, the read 1 alignment is discarded as well. Scrutinizing the code, the behavior is caused by the "next" call from the old patch. Since the while loop iterates over each read pair (not each read), calling "next" discards the whole read pair though only one of the two reads may be problematic.

I came up with a new patch that corrects this behavior of the old one (see attached): now when one read is too short for its "ignore" value, its mate is not automatically discarded. If you could take a look when you get the chance, that would be much appreciated. My apologies for not catching this the first time.

Andrew
Attached Files

bismark_methylation_extractor-0.12.5-fix_ignore_behavior.pl (184.1 KB, 27 views)
Last edited by adeirossi; 08-01-2014, 03:37 PM. Reason: Fix attached script to not spew Perl warnings
Leave a comment:
fkrueger replied

07-25-2014, 02:30 AM
Sorting the BAM file into smaller files by chromosome shouldn't matter, you only need to take care that you don't sort by position as well so that you do not disturb the Read 1/ Read 2 order of the file. Since read pairs should always align to the same chromosome you could probably resort the reads by name (samtools sort -n) in case they were sorted by position.
It is very unlikely that we will change the methylation extractor so that it will be able to cope with sorted or otherwise random R1/R2 order since this may be a huge (t)ask memory-wise because BAM files can easily reach several tens of GB in size.
Leave a comment:
sunggong replied

07-25-2014, 02:10 AM
Originally posted by fkrueger View Post

As it stands the extractor is only using a single core. There have been plans to extend this to multi-core usage for quite some time however I never quite got round to doing it. We were thinking however to lock ourselves away for a day or two in the not too distant future where we would all have a go at doing something that never gets done due to the day-to-day business, and multi-threading of Bismark is very high up on that list!

For the time being you might want to split up the BAM file into several smaller chunks and launch them individually to gain some speed. Cheers, Felix

I got this error while trying to run bismark_methylation_extractor by chromosome-wise (BAM files splitted by chromosome):
"This might be a result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead".

It seems I need to provide unsorted BAM files, but I thought it would be great if bismark could deal with a sorted BAM file as this makes splitting the original BAM file by chromosome easier using samtools.
Leave a comment:
fkrueger replied

07-23-2014, 05:33 AM
Originally posted by sunggong View Post

Hi,
I am wondering whether bismark_methylation_extractor is written to use multi threads? Any plan, if not?
Thanks,
Sung

As it stands the extractor is only using a single core. There have been plans to extend this to multi-core usage for quite some time however I never quite got round to doing it. We were thinking however to lock ourselves away for a day or two in the not too distant future where we would all have a go at doing something that never gets done due to the day-to-day business, and multi-threading of Bismark is very high up on that list!

For the time being you might want to split up the BAM file into several smaller chunks and launch them individually to gain some speed. Cheers, Felix
Leave a comment:
sunggong replied

07-23-2014, 03:49 AM
bismark_methylation_extractor

Hi,
I am wondering whether bismark_methylation_extractor is written to use multi threads? Any plan, if not?
Thanks,
Sung
Leave a comment:
fkrueger replied

07-21-2014, 12:36 PM
We had to add one additional check to the way ambiguous alignments are handled in Bowtie 2 mode. In more detail this adds a test whether the current ambiguous alignment is worse than the best alignment so far, in which case the sequence does not get flagged as ambiguous. A hotfix (v0.12.5) is available for download now. Sorry for that.
Leave a comment:
fkrueger replied

07-21-2014, 04:52 AM
Bismark v0.12.4 released. This release addresses several issues, most importantly the handling of which read alignments count as ambiguous in Bowtie 2 mode. This change might improve overall mapping efficiency slightly, so Bowtie 2 users are highly encouraged to upgrade to the latest version which is available here: https://www.bioinformatics.babraham....jects/bismark/!

Bismark: Improved the way ambiguous alignments are handled in Bowtie 2 mode. Previously, sequences were classified as ambiguously aligning as soon as a sequence produced several equally good alignments within the same alignment thread. Under certain circumstances however there may exist equally good alignments within the same alignment thread, but the sequence might have a better (unique) alignment in another thread. Such a unique alignment will now trump the ambiguous alignment as it should
Bismark: Got rid of 2 warning messages of MD-tag information for reads containing deletions (Bowtie 2 mode only) which accidentally made it through to the release
Bismark: Added '-x' to the invocation of Bowtie 2 for FastA sequences so that it works again (It used to work previously only because Bowtie 2 did not check it properly and automatically used bowtie2-align-s, but now it does check...)
Methylation Extractor: Line endings are now chomped at an earlier stage so that interfering with the optional fields in the Bismark BAM file doesn't break the methylation extractor (e.g. reordering of optional tags by Picard)
Leave a comment:
fkrueger replied

06-27-2014, 01:27 AM
Hi Andrew, thanks for breaking the methylation extractor in yet another completely new way! I have now implemented the new and earlier chomping, and it will be available in the next release. Cheers, Felix
Leave a comment:
adeirossi replied

06-26-2014, 05:00 PM
Hi Felix,

Surprisingly, replacement of the XX tag with the MD tag seems to have caused an error in my use of Bismark. Here is a bismark_methylation_extractor error that emerged when I upgraded from bismark-0.12.2 to 0.12.3:

$ perl bismark_methylation_extractor --paired-end --no_overlap --output STD --report --samtools_path /mnt/ngs/analysis/software/samtools-0.1.19 --mbias_only 99_26_1_GATCAG_003.bam

*** Bismark methylation extractor version v0.12.3 ***

Summarising Bismark methylation extractor parameters:
===============================================================
Bismark paired-end SAM format specified (default)
Output path specified as: STD/

Now testing Bismark result file meth/results/A73406CXU/99_26_1_GATCAG_003/99_26_1_GATCAG_003.bam for positional sorting (which would be bad...) ...passed!

Now reading in Bismark result file meth/results/A73406CXU/99_26_1_GATCAG_003/99_26_1_GATCAG_003.bam

skipping SAM header line: @HD VN:1.4 GO:none SO:queryname
...
Unexpected combination of read and genome conversion: CT
/ CT

From the error message above, $first_read_conversion (the value of the XR tag) is equal to "CT\n", and the trailing newline renders it unrecognizable. Currently, bismark_methylation_extractor removes the newline only when the line ends with the XG tag; when the line ends with the XR tag, it is not removed. For most people, this is probably a non-issue because BAM files directly output by bismark never have lines ending with the XR tag. However, I use Picard to manipulate BAM files between alignment and extraction, and that is where the trouble comes in.

Picard AddOrReplaceReadGroups (v1.93) has the unfortunate behavior of reordering the tags in the input BAM file. Under bismark-0.12.2, it changed the order from "NM XX XM XR XG" to "RG XG NM XM XR XX". Here, there was no problem because the XX tag was not used by bismark_methylation_extractor. But under bismark-0.12.3 with the replacement of the XX tag with the MD tag, Picard now changes the tag order from "NM MD XM XR XG" to "MD RG XG NM XM XR". With the XR tag occurring last, its value ends in an unchomped newline and causes the error.

Though Picard is not without blame here, I think it might be generally more robust for bismark_methylation_extractor not to depend on a specific order of tags. To this end, I have attached a modification for your review that should ensure consistent chomping of the newline from the SAM input stream.

Thanks,
Andrew
Attached Files

bismark_methylation_extractor-0.12.3-mod.pl (183.7 KB, 25 views)
Leave a comment:
fkrueger replied

06-20-2014, 06:36 AM
Hi Sung,

I am now off for weekend in Wells-next-the-Sea but to quickly answer your questions: to 1) Frank Wessely has come up with this table, so you need to ask him really. To 2) No, the BAM file does not store the AS, but it shouldn't be hard to get it to report it. To 3) The MAPQ value is calculated using the same algorithm that Bowtie 2 is using (or at least the Perl adaptation of the C code). More about this can be found further up here in the thread.
Cheers, Felix
Leave a comment:
sunggong replied

06-20-2014, 05:54 AM
Thanks Felix for bringing this up (again).

1. MAPQ before and after dedup
I don't have a proper explanation either, but again it seems true for low MAPQ value as well. Below shows % of reads filtered at different MAPQ threshold before and after dedup.

Code:

MAPQ<1 MAPQ<3 MAPQ<5 MAPQ<10 before dedup 6.57 6.59 12.83 18.33 after dedup 6.77 6.79 13.16 18.67

Anyway they are quite marginal, so could be not significant.

2. MAPQ & alignment score
I was hoping to plot MAPQ against alignment score from my data, but the score is not given from the bismark bam file (let me know if I'm wrong). Instead, I counted the number of 1) insertion (I), 2) deletion (D) and 3) mismatches (X) per CIGAR line. This is to see how the alignment performed for the reads of 0 MAPQ (~7%), which can be interpreted as near 'non-unique' as per Bowtie2 (http://bowtie-bio.sourceforge.net/bo...er-more-unique). As shown from you table above, I was sort of expecting quite large number of [IDX] for the reads of 0 MAPQ, but it was opposite; for most of 0 MAPQ reads there were no insertion, deletion or mismatches (see attached). The Pearson correlation shows -0.2331348 between MAPQ and sum(IDX). Did I miss something serious here?

Can I ask:
1) how the table above was drawn?
2) is it possible to extract alignment score from the Bismark bam file
3) how the MAPQ was calculated in the Bismark bam file? Is it from bowtie(2)?

Thanks,
Sung
PS. below is a perl code to extract IDX from BAM (via samtool view):

Code:

while (<STDIN>){ @a=split/\s+/; @b=split(/([0-9]+[MIDNSHPX=])/,$a[5]); $bad=0; foreach (@b){ if(/([0-9]+)[IDX]/){ #insertion, deletion, or mismatch $bad=$bad+1; } } print "$a[4]\t$bad\n"; # 'MAPQ' 'NO of I|D|X' }

Attached Files

A003.2014-06-20_12PM.pdf (7.0 KB, 21 views)
Leave a comment:
fkrueger replied

06-20-2014, 04:04 AM
We have just released a new version of Bismark (v0.12.3) which makes the following changes:

o Bismark: Replaced the XX-tag field (base-by-base mismatches to the reference, excluding indels) by an MD:Z: field that now properly reflects mismatches as well as indels
o Bismark: Fixed the hemming distance value (NM:i: field) for reads containing insertions (Bowtie 2 mode only) which was previously offset by the number of insertions in the read
o methylation extractor/bismark2bedGraph: Changed the '--zero_based' option of the methylation extractor and bismark2bedGraph to write out an additional coverage file (ending in .zero.cov) which uses the UCSC zero-based, half-open standard
o bismark2bedGraph: Changed the requirement of CpG context files to now start with CpG... (from CpG_...)

Bismark can be downloaded here: https://www.bioinformatics.babraham....jects/bismark/.
Leave a comment:
fkrueger replied

06-19-2014, 06:05 AM
Hi Sung,

I don’t really have an explanation for the increase in filtered alignments after deduplication, but one might speculate that some poorer quality sequences could possibly get sorted - at least to a certain extent - to the start of FastQ files by the Illumina pipeline (reads from the outer edges of a flowcell?), and thus also end up more towards the start of the resulting BAM file? Since the bismark deduplication uses the first alignment per position this might explain the slight increase you are seeing?

Since you are bringing the MAPQ filtering up here and I had similar discussions with at least five people over the past couple of weeks it might be worth going into a bit more detail here. I’d like to give credit to Frank Wessely here who initiated this discussion.

The MAPQ quality value of an alignment depends on the alignment itself (which is in turn governed by the --score_min function and the maximum penalty of -6 for mismatches which is used throughout (--ignore-quals)), as well as the alignment score of the next best read. Since we assume (and recommend) that reads have been vigorously adapter and quality trimmed we recommend using fairly stringent alignment parameters within Bismark to minimise the rate of mis-alignments, and thus erroneous methylation calls. This is why the default --score_min function is set to L,0,-0.2 in Bismark (roughly 3 mismatches or InDels per 100bp read), and not the relatively lax L,-0.6,-0.6 Bowtie 2 default (roughly equivalent to 10 mismatches or Indels per 100bp read).

Now even without considering any similar next best alignments, the MAPQ scores drop relatively quickly for best alignments with a couple of errors in the read, which is mainly a consequence of the stringent --score_min function (and --ignore-quals). If –score_min is relaxed to L,0,-0.4 or even L,-0.6,-0.6 you can see a much wider range of MAPQ values. Here is an example of this, a more detailed list is attached (courtesy of F. Wessely).

--score_min L,0,-0.2 (Bismark default)

Code:

x Min_sc AS=0 AS=-6 AS=-8 AS=-11 AS=-12 AS=-14 AS=-18 AS=-24 50 -10.00 42 8 0 0 0 0 0 0 100 -20.00 42 40 24 8 8 3 0 0

--score_min L,-0.6,-0.6 (Bowtie 2 default)

Code:

x Min_sc AS=0 AS=-6 AS=-8 AS=-11 AS=-12 AS=-14 AS=-18 AS=-24 50 -30.60 42 42 40 24 24 23 8 0 100 -60.60 42 42 42 42 42 40 40 24

As you can see in this example, a 50bp read would already get assigned MAPQ values of 0 for anything more than 1 mismatch or InDel in the read, while this would only be the case for 4 or more mismatches for the much relaxed Bowtie 2 setting. Very similar next alignments should bring the MAPQ values even further down.

It seems to me that filtering on high MAPQ values on already harshly trimmed as well as stringently aligned data is overly stringent, and I am not sure I would want to use much relaxed alignment settings (which may produce incorrect alignments) only to allow some additional quality filtering afterwards. I would still be interested to hear other opinions on this topic.
Attached Files

Bismark_bt2_MAPQ_tables.txt (4.4 KB, 38 views)
Leave a comment:
sunggong replied

06-19-2014, 04:54 AM
MAPQ and deduplicate

Hi,
I wanted to filter reads with poor MAPQ. It seems, with a same MAPQ value, reads are filtered a bit (1.5%) more if reads are de-duplicated (with deduplicate_bismark) then filtered by MAPQ than the other way (filter by MAPQ then dedup). What does this suggest? I've observed this from two bam files with MAPQ 20.

Also, after bismark alignment (before de-dup), ~7% of reads (out of 320M) show MAPQ<1 and ~13% for MAPQ<5. Not sure but 7% of zero MAPQ is a bit high? I used default parameters other than --maxins 564.

Last edited by sunggong; 06-19-2014, 06:44 AM.
Leave a comment:
reco replied

06-16-2014, 07:22 AM
Thanks a lot! fkrueger, that's a pleasant news for my RRBS data analysis.
Leave a comment:

Previous 1 9 16 17 18 19 20 21 22 29 34 template Next

Latest Developments in Precision Medicine

by seqadmin

Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

Somatic Genomics
“We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
- Channel: Articles
05-24-2024, 01:16 PM
Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM

Topics	Statistics	Last Post
SIX2 Protein Identified as a Key Player in Prostate Cancer Treatment Resistance by seqadmin Started by seqadmin, 06-03-2024, 06:55 AM	0 responses 12 views 0 likes	Last Post by seqadmin 06-03-2024, 06:55 AM
Genetic Mosaicism More Prevalent Than Previously Thought by seqadmin Started by seqadmin, 05-30-2024, 03:16 PM	0 responses 25 views 0 likes	Last Post by seqadmin 05-30-2024, 03:16 PM
Comprehensive Sequencing of Great Ape Sex Chromosomes Yields Insights into Evolution and Genetic Variability by seqadmin Started by seqadmin, 05-29-2024, 01:32 PM	0 responses 29 views 0 likes	Last Post by seqadmin 05-29-2024, 01:32 PM
New Toolkit Enhances Plant Mitochondrial Genome Research by seqadmin Started by seqadmin, 05-24-2024, 07:15 AM	0 responses 215 views 0 likes	Last Post by seqadmin 05-24-2024, 07:15 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News