Seqanswers Leaderboard Ad

**sgrimm** · 02-08-2011, 11:50 AM

strand bias?

I have 3 lanes of of PE bisulf-seq data that I've aligned via bismark. I've merged the data and converted it to a UCSC track to visualize coverage. We've noticed what appears to be regions of significant strand bias randomly throughout the genome, with 1kb+ regions that have (for example) 5x coverage of reads mapped to the + strand and 40x coverage of reads mapped to the - strand.
Has anyone else noticed behavior like this? Any ideas on the source?
Thanks!

**fkrueger** · 02-08-2011, 12:23 PM

Hi Sara,

I can offer you a possible solution for this behavior, but I am not quite sure if it is really true for all cases you are describing. Bismark is as you know performing 4 alignments in parallel, and then deciding whether there is a uniquely best alignment from one of the four alignment instances.

There is however also the possibility that a sequence (or sequence pair) produces the same uniquely best alignment to the same position in the genome in two different alignment threads:

- Either a sequence (or pair) is completely devoid of any Cs or Gs. I once ran a test to find out how frequent this is, and it turned out to be not very frequent at all.

- Alternatively, if the sequence at a position you are looking at is fully methylated, i.e. all Cs are protected, you will also find 2 alignments to the same location in different threads. If a sequence containing only lets say 1 or 2 methylated CpGs (and no other Cs) is a 100% methylated sequence, it will thus produce 2 alignments.

In these cases both alignments will yield the same methylation information, however it is impossible to determine exactly whether the sequence originated from the top strand or from the bottom strand (as they look identical). Thus, if Bismark encounters two alignments to the same position, it will score the alignment which is reported first (which is normally on the OT or OB strand). The second case, namely a sequence which is fully methylated, seems to be the major factor which can produce such a "bias" in strand alignments. Please note that the underlying methylation information coming from two different read/genome conversions would be exactly the same whichever alignment would have been analysed, it thus only looks like there was a bias in the strands the reads came from.

Could you try to verify this hypothesis by looking at such a region and send me an email about the outcome?

Best wishes
Felix

**fkrueger** · 02-10-2011, 07:54 AM

We have just made two amendments to Bismark which were prompted by user requests.

The bismark genome preparation will now write the two bisulfite converted genomes out into multi-FastA files by default instead of generating a single file per chromosome (this is still possible though using --single_fasta). Concatenating single-entry FastA files for the bowtie-build indexing process could previously exceed the OS size limit a list can have (for newly assembled genomes with up to 50000 thousand scaffold and contig files...) and thus crash the indexing process.

The other one affects the way paired-end alignments are handled internally. In particular, this affected the reporting of two special kind of sequences when --directional was specified:

There are 2 special cases under which a sequence will align to the exact same location in two different alignment conditions:

a) If a read pair consists of purely AT reads
b) If a read pair has a methylation percentage of 100%, i.e. none of the Cs (or Gs) is converted.

These reads are now correctly preferentially assigned to the original top strand (OT) or the original bottom strand (OB) and not to their complementary strands. Using --directional will thus not remove any reads described in a) or b) and hence not introduce bias anymore (previously, CTOB strand alignments were preferred over OB strand alignments which would have caused a bias against fully methylated reads originating from the bottom strand).

The latest version (Bismark v0.4.1) can be downloaded at https://www.bioinformatics.bbsrc.ac....jects/bismark/ (If you can't see the latest files the BBSRC cache might need a force update (Ctrl+Refresh)).

Please let me know if you encounter any problems with the latest version.

**zeam** · 03-15-2011, 08:11 PM

Somebody might met a problem when use the scripts.

Originally posted by olivertam View Post

To follow up on the BEDGraph/BigWig idea, we have developed a workaround for this. Again, this is based on single-end analysis, so I haven't tested paired-ends

1) Use methylation-extractor on the Bismark output, with '--comprehensive' option ('--merge_non_CpG' is optional)
2) Run the following script (genome_methylation_bismark2bedGraph.pl). This script sorts the methylation extractor output, then parses the results to generate an "overall methylation level" as a BEDGraph file, with one sampled cytosine site per line.
3) Use the bedGraphToBigWig program (available online, I believe) to convert the BEDGraph to BigWig.

Here's the usage for the genome_methylation_bismark2bedGraph.pl

Usage: genome_methylation_count.pl (--cutoff [threshold] ) [Bismark methylation caller output] > [output]

--cutoff [threshold] - The minimum number of times a methylation state was
seen for that nucleotide before its methylation
percentage is reported.
Default is no threshold

The output file is a tab-delimited BedGraph file with the following information:

<Chromosome> <Start Position> <End Position> <Methylation Percentage>

Bismark methylation caller (v0.2.0 or later) should produce three output files
(CpG, CHG and CHH) when using the "--comprehensive" option
(Two files if using the "--merge_non_CpG" option).
To count both CpG and Non-CpG, combine the output files.

Bismark methylation caller (v0.1.5 or earlier) should produce two output files
(CpG and Non-CpG) when using the "--comprehensive" option.
To count both CpG and Non-CpG, combine the two output files.

Let me know if you have any questions, issues or bugs (since this is a workaround)

Cheers,
Oliver

First,thanks for your perl scripts!
When I use this to processing my own data,it works pefect;but when I use it to proccessing the data downloaded from NCBI SRA,It doesn't work.

My data was downloaded from the NCBI SRA,and after bismark methylation extractor,the output was as below:
######################################################
Bismark methylation extractor version v0.4.1
SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90 - chr05 6490339 x
SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90 - chr05 6490291 x
######################################################
The colums of chromosomes and coordinates are 5 and 6,respectively.So in your scripts,"sort -k3,3 -k4,4n" actually should be "sort -k5,5 -k6,6n",so I think you shoud call users' attention to this.Or you can do something to fix this.

**olivertam** · 03-15-2011, 08:24 PM

Hi zeam,

Thanks for finding the issue. I have never dealt with SRA data, so I was unaware of their "altered" format. Did you run the methylation extractor on the SRA data itself, or was it stored as methylation extractor output?
If you could give me the dataset from SRA that gave you the issue, as well as the cmd line parameters that you used to analyze the data, I'll be happy to try and fix the issue.

Again, thanks for bringing this bug up.

Cheers,
Oliver

**olivertam** · 03-15-2011, 08:30 PM

Hi zeam,

It seems like the SRA has a really bad naming system, where they add spaces in the name. This throws out all the column separation completely.
If you look at the data, you'll probably find that the sequence name is actually "SRRXXXXXX.4.1 HWUSI-EAS585:2:1:2:1917.1 length=90". Unfortunately, the name (which should be in one column) is now treated as three columns thanks to the spaces (hence chr column is column 5 instead of column 3). I'm not sure what would be the best way to fix it, other than to make methylation extractor to remove all spaces from the name (or replace them with underscores).
Really not sure what the best approach is.

Cheers,
Oliver

**zeam** · 03-15-2011, 11:29 PM

Originally posted by olivertam View Post

Hi zeam,

Thanks for finding the issue. I have never dealt with SRA data, so I was unaware of their "altered" format. Did you run the methylation extractor on the SRA data itself, or was it stored as methylation extractor output?
If you could give me the dataset from SRA that gave you the issue, as well as the cmd line parameters that you used to analyze the data, I'll be happy to try and fix the issue.

Again, thanks for bringing this bug up.

Cheers,
Oliver

Hi,Oliver,
You are right about the SRA name,thanks for your replying,and I just think you should let the users know how to modify your script when dealing with SRA data.

Best wishes,
Zeam

**fkrueger** · 03-16-2011, 01:38 AM

Hi Zeam and Oliver,

I took the freedom to modify the genome_methylation_bismark2bedGraph script so that you can now specify an option --remove_spaces which will replace all whitespaces with underscores before running the actual bedGraph conversion. Let me know if you spot any bug.

I will attach the file here, and it is also available for download at https://www.bioinformatics.bbsrc.ac....jects/bismark/

Best wishes,
Felix

Attached Files

genome_methylation_bismark2bedGraph.pl (5.4 KB, 26 views)

**olivertam** · 03-16-2011, 07:29 AM

Hi Felix,

Thanks for doing this. I guess I shouldn't be so lazy next time

The code looks good to me, but I'm sure zeam will be able to test it with his SRA data.

Again, thanks heaps for your help.

Cheers,
Oliver

**cwhelan** · 03-16-2011, 11:03 AM

bismark_to_SAM script for paired end data

Hi Oliver, Felix, and others,

I think there may be a small problem in the bismark_to_SAM_v2.pl script when converting paired end bismark output to SAM format. Specifically, I don't think it's calculating the insert size field in the SAM file correctly. For example, for the Bismark line:

GA2_0015_FC:1:1:2010:1104#0 + 1 208840227 208840334 GGTAGAAGAAGTTTTGATGAGGTGGAGAAGTTATTTATTTGTAGGTTTAT GGCAGAAGAAGCCCTGATGAGGTGGAGAAGTCATCCATTTGCAGGCCTATGA ..x........hhx.................h..hh.....x...hh... NNTTCTCTAAATTAAAACTATTAAAACAAACCTTCTTAAAATATTTAAAA TACTTTCTCTGAGTTAGAACTATTAAAACAGACCTTCTTAGGATATTTAAAA ........x.h...h.............x.........hh.......... CT CT

It's generating the SAM output:

GA2_0015_FC:1:1:2010:1104#0 99 1 208840227 255 50M 1 208840284 57 GGTAGAAGAAGTTTTGATGAGGTGGAGAAGTTATTTATTTGTAGGTTTAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:10 MD:Z:2T8TTT17T2TT5T3TT3

The insert size of 57 in that record doesn't seem to be taking into account the length of one of the reads - in this case 50bp, meaning that the insert size should really be 107bp, I believe.

You can see a visual example of what I'm talking about in the attached SeqMonk snapshot. In the first data set I imported the SAM output as single end data. In the second, I imported it as paired end data, and SeqMonk is displaying the location of the insert. (This was done with the latest version of SeqMonk which fixed a bug with where the inserts for paired end data were displayed.) The last data set is a paired-end import of a SAM file generated by a version of the script that I modified to try to get the right insert size - it's actually a simplification of the current calculations. I've attached my modified version, but I'd appreciate it if someone else could take a look at it and verify that my changes are correct - I find SAM format pretty confusing sometimes so please correct me if I've done something wrong!

Thanks,

Chris

Attached Files

**olivertam** · 03-16-2011, 11:42 AM

Hi cwhelan,

Thank you for picking up this error. I must admit that the SAM format for paired ends was very confusing when they first introduced it. I obtained a much better manual explaining the SAM format, and I completely agree with your changes. I do apologize for the issue, but when they previously used the term "insert size" (though it has been renamed to template length in the newest documentation), I thought it meant the size of the fragment inserted between the two sequenced ends.
Again, I apologize for the mistake, and thank you heaps for fixing it. I would totally recommend Felix to put up your version and give you credit for the change.

Cheers,
Oliver

**zeam** · 03-16-2011, 06:12 PM

Queries on Bismark

Hi,felix,

Recently,I have been using the Bismark,Thanks for your wonderful program.

There is something confused me:
1)In the user'guide,I only saw option to handle mismatch of the seed,but what about the left?
2)Whar criteria has been used to define a unique map read in Bismark?

Best wishes,
Zeam

**fkrueger** · 03-17-2011, 12:03 AM

Hi Zeam,

Thanks for your kind words.

To answer your question about seed length:
Bismark uses Bowtie and thus read alignments and seed length are handled exactly as in Bowtie, i.e. reads can potentially have more mismatches after the seed, however all mismatches together must not exceed the mismatch ceiling (which can be set with the -e parameter). Even though it is possible we wouldn't recommend to tolerate a large amount fo mismatches as this can have dramatic consequences on the inferred methylation levels. What we tend to do is set the seed length to the entire read length (with -l) and allow only 1 or 2 mismatches with the -n parameter, as this minimises the amount of mismappings and thus false mthylation calls.

Bismark considers alignments unique if a read has an alignment to the genome which has a lowest amount of mismatches. So if a read aligns to one position with 0 mimatches, and with 2 mismatches to another position, the perfect match will be picked, whereas a read with 2 different matches to the genome with 0 mismatches each will be removed as an ambiguous match. For paired-end alignments the lowest sum of mismatches of both reads determines whether a read pair can be placed uniquely.

Hope this helps,
Best wishes,
Felix

**fkrueger** · 03-17-2011, 01:16 AM

Dear Chris,

Thanks very much for providing the new bismark2SAM_v3 script, it is now also available for download from the bismark homepage.

Best wishes,
Felix

**zeam** · 03-24-2011, 06:41 PM

A query on bismark read trimming

Hi Felix,

I have noticed that the bismark didn't have a parameter for read trimming like bwa.If I need read trimming,am I use another tool or bismark have this function that I didn't see?

Best wishes,
Zeam

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News