M-Bias plot and alignment stats pie chart
We have just released a new version of Bismark (v0.8.0). The main purpose of this release is the introduction of an M-bias plot for the methylation extractor (thanks to K.D. Hansen for the suggestion, see reference below). The M-bias plot allows researchers to see whether there are fundamental technical biases in the methylation calling of their reads.
Attached are two examples:
a) the profile of a PBAT-Seq library (read1) which should in theory start with a 'random' 4-mer oligo that was used to bind to bisulfite converted reads. The M-bias (and also base composition profiles of these libraries as judged by FastQC (not shown here)) do show some serious biases primarily within, but not limited to, the first 4bp of the library.
b) Read 2 of any paired-end BS-Seq library. Due to the end-repair reaction and A-tailing procedure for Illumina libaries, the 3' ends of sonicated fragments are filled in by the Klenow enzyme with unmethylated cytosines. These will then be converted to Ts in the bisulfite treatment, resulting in a very high amount of apparently unmethylated cytosines at the first couple of positions of Read 2. (The problem is somewhat reminiscent of the fill-in problem in RRBS libraries).
In addition to just the methylation profile, the M-bias plot also shows the actual number of events for an entire read file. This should enable investigators to make an informed decision of whether to remove biased positions in the reads, e.g. by using the option --ignore <int> of the methylation extractor or by hard-trimming read 2 by 2bp prior to running the alignments, or whether it is OK to live with what they observe.
Other changes in the latest version include:
• Bismark: Added new option '--prefix' to add a prefix to the output filenames. For example, '--prefix test' with 'file.fq' would result in the output file 'test.file.fq_bismark.sam' etc.
• Bismark: Fixed a warning message that occurred when chromosomal sequences could not be extracted in paired-end Bowtie2 mode
• Bismark: will now generate a pie chart with the alignment statistics once a run has finished; this allows to get a quick overview of how many sequences aligned uniquely or sequences that did not align, either due to producing no alignment at all, multiple mapping or because it was impossible to extract the chromosomal sequence
• Methylation Extractor: upon completion, the methylation extractor will now produce an M-bias (methylation bias) plot, which shows the methylation proportion across each possible position in the reads (described in: Hansen et al., Genome Biology, 2012, 13:R83). The data for the M-bias plot will be written into a text file (to generate graphs by alternative means) and drawn into a .png file. The plot also contains the absolute number of methylation calls per position (methylated + unmethylated)
Bismark is avaialable for download here: https://www.bioinformatics.babraham....jects/bismark/
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Good news. So apparently the warning messages you saw amounted to a total of 215 (out of 49 million read pairs), indeed something you should be able to live with.
Bismark has an option '--qupto' to just use the first so many reads; just type bismark --help to see further explanations on it
So you could run single end alignments against your masked genome for the first 500000 reads with not-so-stringent criteria using the following command:
bismark --bowtie2 --qupto 500000 --score_min L,0,-0.6 /path/to/genome/ file_val_1.fq
Leave a comment:
-
Well, apparently errors started to appear also in the alignment run against the standard genome, but a bit later I left the office (see below). Not really concerning if I get a good enough mapping efficiency.... Still running
Well this is embarrassing…… No idea how "they" masked the genome for the interspersed repeats. I can't even ask till next Monday...Sorry for this also, but I have no idea how to extract the first 500K reads. I am quite newbie in all this
The reason for aligning to both versions of the genome is just to compare results...well, I am just curious about how the interspersed repeats can determine certain patterns of methylation in certain parts of a couple of particular chromosomes. I just thought it was worth the shot.
Processed 47700000 sequences so far
Processed 47800000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95642710.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95642710.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:2821:99210#TCTTATAT/1 scaffold_1878
Processed 47900000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 95989234.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 95989234.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:12271:44482#TCTTATAT/1 scaffold_232
Processed 48000000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96116284.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96116284.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2315:3124:23585#TCTTATAT/1 scaffold_297
Processed 48100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96230734.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96230734.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2314:15481:55756#TCTTATAT/1 scaffold_1105
Processed 48200000 sequences so far
Processed 48300000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 96765254.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 96765254.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:2316:2905:27999#TCTTATAT/1 scaffold_1711
Processed 48400000 sequences so far
Processed 48500000 sequences so far
Processed 48600000 sequences so far
Processed 48700000 sequences so far
Processed 48800000 sequences so far
Processed 48900000 sequences so far
FYI - Updated info: The alignment to standard genome is now finished. It took around 17h and the mapping efficiency was fairly high....thanks again for this worderful tool! I copy/paste the report:
Final Alignment report
======================
Sequence pairs analysed in total: 48900263
Number of paired-end alignments with a unique best hit: 31991336
Mapping efficiency: 65.4%
Sequence pairs with no alignments under any condition: 13461118
Sequence pairs did not map uniquely: 3447809
Sequence pairs which were discarded because genomic sequence could not be extracted: 215
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 15981506 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 16009615 ((converted) bottom strand)
Number of alignments to (merely theoretical) complementary strands being rejected in total: 0
Final Cytosine Methylation Report
=================================
Total number of C's analysed: 1298097901
Total methylated C's in CpG context: 122944840
Total methylated C's in CHG context: 870088
Total methylated C's in CHH context: 2311989
Total C to T conversions in CpG context: 71226620
Total C to T conversions in CHG context: 301248948
Total C to T conversions in CHH context: 799495416
C methylated in CpG context: 63.3%
C methylated in CHG context: 0.3%
C methylated in CHH context: 0.3%Last edited by oria34; 07-06-2013, 05:19 AM.
Leave a comment:
-
Good to hear that the alignments against the standard genome went well.
The errors from the masked genome are quite likely to be a result of the masking process itself, do you have an idea about how much sequence has been masked into Ns (presumably)?
As far as I am aware Bowtie2 supports indexing of Ns in there reference sequence (whereas Bowtie 1 does not), so in theory it should work in the same way. N's in the reads or reference do receive a mapping penalty as well though (albeit not as much as true mismatches), so if you simply have masked too much of your genome this might just exceed the fairly stringent penalty allowance of Bismark's default setting which is governed by the "--score_min" parameter. The longer your reads and the paired-end nature of your library could make it possible that effectively all reads would fall into a 'masked' region.
To see if it is the degree of masking you could just run a few sample reads (e.g. with the parameter --qupto 500000 for the firt 500K reads), run only Read 1 in single end mode and use --score_min L,0,-0.6. This should finish in a couple of minutes and tell you quickly if the mapping efficiency increases notably.
Just out of interest, why would you want to align reads to both a masked and unmasked genome in parallel (especially since the normal genome alignments were fine)?
Leave a comment:
-
Hi again,
OK, I have some more news. I made a mistake checking online my runs ....the one that is running error-free is the alignment against the standard genome.
All the errors I´ve reported have been produced in the alignment run against the masked genome. These runs are now over (see report below) and they didn´t work at all.
To my understanding, Bowtie doesn´t have any type of restrictions in the use of masked genomes, does it? Do you think I could miss something during the genome preparation step?
Cheers
Bismark report for: M24_FIS17_trim_galore_1.fq and M24_FIS17_trim_galore_2.fq (version: v0.7.12)
Bowtie was run against the bisulfite genome of /Bisulfite/BISMARK_REALIGNMENT/MASKED_Version_Genome/ with the specified options: -q --phred64 --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500
Option '--directional' specified: alignments to complementary strands will be ignored (i.e. not performed)
Final Alignment report
======================
Sequence pairs analysed in total: 62787165
Number of paired-end alignments with a unique best hit: 1587
Mapping efficiency: 0.0%
Sequence pairs with no alignments under any condition: 62785501
Sequence pairs did not map uniquely: 77
Sequence pairs which were discarded because genomic sequence could not be extracted: 0
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 557 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 1030 ((converted) bottom strand)
Number of alignments to (merely theoretical) complementary strands being rejected in total: 0
Final Cytosine Methylation Report
=================================
Total number of C's analysed: 50806
Total methylated C's in CpG context: 1118
Total methylated C's in CHG context: 99
Total methylated C's in CHH context: 172
Total C to T conversions in CpG context: 2166
Total C to T conversions in CHG context: 16576
Total C to T conversions in CHH context: 30675
C methylated in CpG context: 34.0%
C methylated in CHG context: 0.6%
C methylated in CHH context: 0.6%Last edited by oria34; 07-06-2013, 12:39 AM.
Leave a comment:
-
Thanks for your quick reply!
I am not really worried for the errors, I remember reading in this thread that they are due to the Cs located at the very end and, as you said, they are not relevant...Is the outcome what is killing me!
I have created two different folders for the two different alignments (unmasked and masked), each with copies of bismark´s scripts. Each folder has two subfolders, one per individual (two .fq files each)...I have even duplicated the bowtie2 folder just in case....
It is true that the two genomes are in different folders but each genome is used at the same time for the alignment of the different individuals (2x)...but this doesn´t seem to affect to the masked version of the genome.
I´ll post tomorrow how it is going.
Many thanks!
Leave a comment:
-
Hi oria,
Interestingly I encountered the same error messages yesterday myself; they are caused by trying to print out the starting position of the read which aligns so close to the edge of a chromosome or contig/scaffold that additional 2bp sequence can't be extracted. These warnings are harmless though and can be ignored safely; I have still fixed this bug yesterday (if you want the new version just send me an email).
From the output below it seems that you are getting roughly 1 of these warning per 100000 analysed sequences, so this can hardly be the reason for poor overall mapping results. Are you by any chance running the other alignment to the repeat masked genome from the same files in the same folder? This would certainly have the potential to interfere with the results... In any case, if you continue to have problems can you email me further details of your problems? If necessary I could also create an FTP server for you to upload your files in question so I could take a look myself.
Leave a comment:
-
Hi all
I am trying to re-align my reads again using a new assembly of the reference genome and I am finding many problems in the alignment step.
After running genome_preparation I am trying to run bismark for paired end reads trimmed with trim_galore (--phred64 -t --paired filename_2 filename_2). Right after the script creates the two version os the reads (c-t, a-g) it gives back a un ending number of errors. Many of them are related to the impossibility to determine the genomic context (due to Cs located at the end of the contigs) but others are quite new for me.
The resulting SAM file is fairly small and the number of Cs analyzed (from the report file) are ridiculously low; around 50 000.
I am using the last version of Bismark with the --bowtie2 option (also --botwtie2 option for the genome preparation step)
I am running in parallel an alignment with the same .fq files against a masked version of the genome and no errors so far (4 millions reads...)
Any thoughts?
Cheers
PS. some of the lines with multiple errors
Reading in the sequence files BPM24_FIS17_trim_galore_1.fq and BPM24_FIS17_trim_galore_2.fq
Processed 100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 203898.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 203898.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:2724:10559#ACCAGACT/1 scaffold_1495
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 223364.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 223364.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:7055:11458#ACCAGACT/1 scaffold_402
Processed 200000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 592170.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 592170.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:8410:27041#ACCAGACT/1 scaffold_1917
Processed 300000 sequences so far
Processed 400000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 957754.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 957754.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14266:27423#ACCAGACT/1 scaffold_1862
Processed 500000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1095048.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1095048.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:14067:33341#ACCAGACT/1 scaffold_297
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1171846.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1171846.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1102:20587:36545#ACCAGACT/1 scaffold_391
Processed 600000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 1323878.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 1323878.
Chromosomal sequence could not be extracted for FCD1LHLACXX:8:1101:10776:58374#ACCAGACT/1 scaffold_1028
Processed 700000 sequences so far
Processed 800000 sequences so far
Processed 900000 sequences so far
Processed 1000000 sequences so far
Processed 1100000 sequences so far
Use of uninitialized value in length at bismark line 2780, <$__ANONIO__> line 2259346.
Use of uninitialized value in concatenation (.) or string at bismark line 2781, <$__ANONIO__> line 2259346.
Last edited by oria34; 07-05-2013, 10:53 AM.
Leave a comment:
-
Hi Felix - thanks for the prompt reply, and sorry I forgot to include my library type/bowtie version. I was just making sure this was the intended behaviour before I updated my pipeline to ignore the /1 /2 tags in remove PCR duplicates - thanks
Leave a comment:
-
Hi forzenlyse,
It seems that you were running paired-end alignments with Bowtie2, correct? Bismark does indeed append /1/1 to read 1, one of which is removed during the alignment step. The second one is then removed at a later step, but to keep the output in a way that allows immediate identification of what was R1 and R2, /1 and /2 are appended just before the SAM/BAM output is created. If you don't want these trailing fragment numbers you can just locate the following two lines in the subroutine "paired_end_SAM_output":
Code:my $id_1 = $id.'/1'; my $id_2 = $id.'/2';
Code:# my $id_1 = $id.'/1'; # my $id_2 = $id.'/2';
Leave a comment:
-
Originally posted by fkrueger View PostHi pengchy,
To 1) It is true that Bismark appends segment numbers to the end of read. This is because Bowtie or Bowtie2 tend to delete these tags internally while aligning, and to make it more difficult they don't do it in the same way. To properly keep track of which read is doing what I had to do this change (btw also white spaces or tab characters are being replaced by _ in the read ID.
My input fastq does not have these tags
Code:@HWI-D00119:25:D1WYEACXX:3:1101:1169:2017 1:N:0: NGGTGTATTGGGTTAGTGTATTTTTTTGTTGATATTTTTTTTTTGTATTGTGGTTGATTTTTGATTAAGGACGTTTGTTTTTAAGGGTTTTTGGTGGG + #0<<BFFFFFFFFFIIFFIFFFIIIIIIIIIFIFFIIIIIIIIFFB<BFBBBFBBB<BBFFFF<BFBBBB<BBBFBFFBFFFBBBFF<BFFFBF<BBF
Code:@HWI-D00119:25:D1WYEACXX:3:1101:1169:2017_1:N:0:/1/1 NGGTGTATTGGGTTAGTGTATTTTTTTGTTGATATTTTTTTTTTGTATTGTGGTTGATTTTTGATTAAGGATGTTTGTTTTTAAGGGTTTTTGGTGGG + #0<<BFFFFFFFFFIIFFIFFFIIIIIIIIIFIFFIIIIIIIIFFB<BFBBBFBBB<BBFFFF<BFBBBB<BBBFBFFBFFFBBBFF<BFFFBF<BBF
Code:HWI-D00119:25:D1WYEACXX:3:1101:1169:2017_1:N:0:/1 67 chr6 151954983 255 98M = 151955045 162 NGGTGTATTGGGTTAGTGTATTTTTTTGTTGATATTTTTTTTTTGTATTGTGGTTGATTTTTGATTAAGGACGTTTGTTTTTAAGGGTTTTTGGTGGG #0<<BFFFFFFFFFIIFFIFFFIIIIIIIIIFIFFIIIIIIIIFFB<BFBBBFBBB<BBFFFF<BFBBBB<BBBFBFFBFFFBBBFF<BFFFBF<BBF NM:i:21 XX:Z:A4C1C4C8C3C2C7C3C1C4C5C3CC5CC7C3C1C9CC7 XM:Z:.....h.x....h........h...x..x.......h...h.x....x.....x...hh.....hh.....Z.h...h.h.........hx....... XR:Z:CT XG:Z:CT
Thanks
Leave a comment:
-
Originally posted by Ahra View PostBismark methylation extractor version v0.7.11
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/1 - Vr08 7498192 x
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/1 - Vr08 7498173 x
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/2 - Vr08 7498096 x
To see if there is something going wrong on the way could you please send the onscreen dialog via email?
Leave a comment:
-
bedGraph usage
hello. I am a newbie using bismark. I'm not a bioinformation and also i don't know much about NGS alignment, especially how to running alignment programs. But, i could use bismark thanks to kindly well made bismark user guide book.
i ran bismark and got the report file and sam file. so to get the position of methylC, i ran bismark_methylation_extractor especially used --bedGraph option to see methlylation %.
According to user guide book,
A typical command including the optional bedGraph --counts output could look like this:
bismark_methylation_extractor -s --bedGraph --counts --buffer_size 10G s_1_sequence.txt_bismark.sam
my data is paired-end so i edited little that one. like this:
bismark_methylation_extractor -p --no_overlap --comprehensive --CX
--bedGraph --counts --buffer_size 10G s_1_sequence.txt_bismark.sam
but, i couldn't get additional information about <chromosome> <start position> <end position> <methylation percentage>
my result was like thins
Bismark methylation extractor version v0.7.11
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/1 - Vr08 7498192 x
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/1 - Vr08 7498173 x
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/2 - Vr08 7498096 x
HWI-D00111:9926ADACXX:7:1101:1567:2165_1:N:0:CTTGTA/2 -
how can i get the methylation percentage information and run further bedGraph2Cytosines process!
i am looking forward anyone's reply
Leave a comment:
-
Solved: alignment on one strand only
I am now facing another problem: when I run bismark of some of my samples, reads end up mapped to only the top or the bottom strand...
Thanks Felix for your help
Julien
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...-
Channel: Articles
05-06-2024, 07:48 AM -
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 05-10-2024, 06:35 AM
|
0 responses
20 views
0 likes
|
Last Post
by seqadmin
05-10-2024, 06:35 AM
|
||
Started by seqadmin, 05-09-2024, 02:46 PM
|
0 responses
26 views
0 likes
|
Last Post
by seqadmin
05-09-2024, 02:46 PM
|
||
Started by seqadmin, 05-07-2024, 06:57 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
05-07-2024, 06:57 AM
|
||
Started by seqadmin, 05-06-2024, 07:17 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
05-06-2024, 07:17 AM
|
Leave a comment: