Seqanswers Leaderboard Ad

**fkrueger** · 08-21-2012, 05:08 AM

Originally posted by shawpa View Post

I am using the bismarktobedgraph script from the website. Because it doesn't have a temp directory option I am running it from the directory that has the space. It seems that it is running out of memory still and my standard error file says "sort: write failed: /tmp/sortMpgIOA: No space left on device". I am not sure why it is still writing to this directory. Is there something in the script that needs to be adjusted?

Hi Shawpa,

your input files are probably quite big so that sorting fails. The script is sorting using the operating system's sort command, which defaults to the /tmp/ directory. This could be changed in the script with the sort -T option

Code:

 -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories

I have also written a new version of this script a few days ago, which lets you sort your files chromosome by chromosome if you run out of space. The new script is attached, it'll be up on the homepage shortly.

Here are the changes to the script i have implemented:

Added an option '--split_by_chromosome' to enable sorting of very large files. The methylation extractor output is first written into temporary files chromosome by chromosome. These files are then sorted by position and deleted afterwards.

Added an option '--counts' which adds 2 more lines to the bedGraph output file:

Column 5: count of methylated calls per position, and
Column 6: count of unmethylated calls per position.

Technically, this renders the output to be no longer in bedGraph format, but it might enable additional calculations with the output.

Attached Files

genome_methylation_bismark2bedGraph.pl (9.2 KB, 67 views)

**y_s** · 09-03-2012, 07:29 AM

threads optimization

Dear Felix,
In order to estimate the number of grids I should use in our cluster to run bismark with huge data sets, I am running a test to check processing times and number of processors used and found a problem; more processors took longer processing times.
My environment:
PowerMac OS10.7.4, 2x 2.4GHz 6-core Xeon, 64GB RAM
used bismark-v0.7.6

in-silico generated test data:
100,000 pairs of 100-bp PEreads randomly taken from mouse genome (UCSC mm9)
bisulfite conversion rate:0.999, CpG methylation rate: 0.7

test:
run bismark without option or -p 2, 4, 8, 10, and 12 options

result: (used command and time)
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 424.112706184
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 304.065144777
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 233.798403025
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 251.128180981
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 10 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 284.739043951

I had a similar result with 1 million pairs

bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 4068.50050616
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 2656.86580992
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 1997.74887085
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 2230.58184505
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 10 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 2388.17396808
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 12 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
took 2468.01015115

However, bowtie2 itself showed a good correlation with times and # of processor.
bowtie2 --sensitive-local -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 170.282114983
bowtie2 --sensitive-local -p 2 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 87.4923779964
bowtie2 --sensitive-local -p 4 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 45.6259729862
bowtie2 --sensitive-local -p 8 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 26.6778290272
bowtie2 --sensitive-local -p 10 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 23.4780960083
bowtie2 --sensitive-local -p 12 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
took 21.6073651314

In all conditions, the activity monitor indicated appropriate usage of computational power. (e.g. ~ 100 % CPU power with single processor and ~1000 % with 10 processors)
Is this specific to my environment or a general problem regarding algorithm/program structure?

Thanks,

**fkrueger** · 09-03-2012, 07:43 AM

That's an interesting observation; I have noticed though that you aren't using quite the same paramaters since Bismark is calling Bowtie2 in --reorder mode which is needed for Bismark's internal logic. The Bowtie2 manual states:

--reorder: Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory then if --reorder were not specified.

My guess is that it is a combination between Bowtie2 getting itself worked up somehow, and to some extent the way best alignments are worked out and methylation is extracted within Bismark. Since you are seeing an increase in CPU usage to 1000% for 10 cores it seems that it is really Bowtie2 which is thinking so hard, since Bismark itselft is single-threaded and thus uses only 100% max. Another option for parallelizing Bowtie2 alignments might also be to simply split up input files into 2 or more chunks... Does this help?

**y_s** · 09-03-2012, 11:22 AM

Dear Felix,
Thank you for your prompt reply.
As you suggested, the bottleneck seems to be the ability of a thread running bismark. While bismark is running, one perl-5.xx thread and two bowtie2 with p threads are working. The perl thread often gets to 100% CPU power with high p options, which suggest bismark is inundated with tons of incoming results from bowtie2. I got best results with -p 5, but it may be faster to split a job into two and run jobs with -p 3 or 4 independently when I deal with huge data sets, as you suggested.

Many thanks to your suggestion again.

bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 426.156659842
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 300.796103001
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 3 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 251.952316046
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 231.829093933
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 5 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 220.367155075
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 6 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 222.934455156
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 7 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 242.725275993
bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
took 257.643329144

**my_bio** · 09-19-2012, 04:23 AM

Dear fkrueger,
genome_methylation_bismark2bedGraph_v4.pl is a useful program to count methylation of cytosine genome-wide. if a cytosine site did not covered one read, information of this cytosine will not appear in output file. So I need a program which can count methylation level of every cytosine site in reference genome, this will be more useful in subsequent analysis. the output file may like this:
chromosome strand coordinate methylated_reads unmethylated_reads context context detaile_context
chr1 10 + 4 6 CG CGA
chr1 11 - 5 6 CG CGT
chr1 15 + 1 11 CHG CAG
chr1 17 - 2 10 CHG CTG
chr1 20 + 0 12 CHH CAA
Could you provide a program like this ?

**drdna** · 09-19-2012, 05:41 AM

Good point my_bio.

**fkrueger** · 09-19-2012, 05:49 AM

Originally posted by my_bio View Post

Dear fkrueger,
genome_methylation_bismark2bedGraph_v4.pl is a useful program to count methylation of cytosine genome-wide. if a cytosine site did not covered one read, information of this cytosine will not appear in output file. So I need a program which can count methylation level of every cytosine site in reference genome, this will be more useful in subsequent analysis. the output file may like this:
chromosome strand coordinate methylated_reads unmethylated_reads context context detaile_context
chr1 10 + 4 6 CG CGA
chr1 11 - 5 6 CG CGT
chr1 15 + 1 11 CHG CAG
chr1 17 - 2 10 CHG CTG
chr1 20 + 0 12 CHH CAA
Could you provide a program like this ?

This should be fairly straight forward to do. For a not covered CpG, would you envisage to have a record such as this:

chr1 10 + 0 0 CG CGA
chr1 11 - 0 0 CG CGT

?

**my_bio** · 09-20-2012, 12:44 AM

Originally posted by fkrueger View Post

This should be fairly straight forward to do. For a not covered CpG, would you envisage to have a record such as this:

chr1 10 + 0 0 CG CGA
chr1 11 - 0 0 CG CGT

?

For a not covered CpG, I would like to have a record. That is to say, for different samples, as long as the reference genome is same, their output files will have the same number of lines. This will be more convenient in subsequent analysis, especially in the comparison of different samples. So I think Bismark need to add these functions. By the way, Bismark will be more popular if it can do methylcytosines calling by binomial test. BiSS(http://www.cibiv.at/software/ngm/BiSS/) can do all these works, but run this software need NVIDIA Accelerated Linux Graphics Driver. I look forward to update of Bismark to do more analysis.

**fkrueger** · 09-20-2012, 07:58 AM

I have now written something that generates a cytosine methylation report for a genome of interest and a sorted methylation input file produced by the script "genome_methylation2bedGraph". By default, the output uses 1-based chromosome coordinates (zero-based cords are optional) and reports CpG context only (all cytosines optional). The output considers all Cs on both forward and reverse strands.
The input file needs to have been generated with the genome_methylation2bedGraph script while using the '--counts' option, or otherwise be exactly in the following format:
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
I didn’t have time for extensive testing but it seems to work alright.

The USAGE is: genome_wide_cytosine_report.pl [options] <input> > <output>, --help will show all options.

The output looks like this:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

Some additional comments:
- If this script turns out to be working nicely and this is useful for users, I could add to the methylation extractor an option to generate a bedGraph files or this output format straight away. I appreciate that it is somewhat tedious to run methylation_extractor > extractor2bedGraph > genome_wide_cytosine_report in succession, but I am sure this could be streamlined

- You need to be aware that if you are dealing with extremely large datasets that cover almost the entire genome, e.g. 2 billion WGSBS-reads, the whole task becomes very memory intensive (probably up to 20GB or so) even though it is working on a chromosome-by-chromosome basis (large chromosomes may have well over 100 million Cs). This is a limitation of using standard Perl and the overheads used for its data structures, and one would have to look at using different approaches or a different programming language to do the task if this solution fails.

- Running a binomial test on the methylated positions does not necessarily have anything to do with Bismark or any aligner per se (however I am not sure you want to call differential methylation on a per-cytosine basis instead of per feature or region of interest…). You could possibly even run the binomial test with the R script provided by the BiSS authors.

Please let me know what your experiences are.
Best,
Felix

Edit: Script amended to also output sequence context.

Attached Files

genome_wide_cytosine_report.pl (13.3 KB, 66 views)

**my_bio** · 09-20-2012, 07:12 PM

Originally posted by fkrueger View Post

I have now written something that generates a cytosine methylation report for a genome of interest and a sorted methylation input file produced by the script "genome_methylation2bedGraph". By default, the output uses 1-based chromosome coordinates (zero-based cords are optional) and reports CpG context only (all cytosines optional). The output considers all Cs on both forward and reverse strands.
The input file needs to have been generated with the genome_methylation2bedGraph script while using the '--counts' option, or otherwise be exactly in the following format:
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
I didn’t have time for extensive testing but it seems to work alright.

The USAGE is: genome_wide_cytosine_report.pl [options] <input> > <output>, --help will show all options.

The default output is:
<Chromosome> <Position> <Strand> <Count methylated> <Count non-methylated> <Trinucleotide context>
Also, there is an option ‘--dinucleotide_context’ to produce an additional column at the end with the di-nucleotide context. As the dinucleotide context is already contained within the tri-nucleotide context field and because it makes the output file considerably larger, this extra field is only optional.

Some additional comments:
- If this script turns out to be working nicely and this is useful for users, I could add to the methylation extractor an option to generate a bedGraph files or this output format straight away. I appreciate that it is somewhat tedious to run methylation_extractor > extractor2bedGraph > genome_wide_cytosine_report in succession, but I am sure this could be streamlined

- You need to be aware that if you are dealing with extremely large datasets that cover almost the entire genome, e.g. 2 billion WGSBS-reads, the whole task becomes very memory intensive (probably up to 20GB or so) even though it is working on a chromosome-by-chromosome basis (large chromosomes may have well over 100 million Cs). This is a limitation of using standard Perl and the overheads used for its data structures, and one would have to look at using different approaches or a different programming language to do the task if this solution fails.

- Running a binomial test on the methylated positions does not necessarily have anything to do with Bismark or any aligner per se (however I am not sure you want to call differential methylation on a per-cytosine basis instead of per feature or region of interest…). You could possibly even run the binomial test with the R script provided by the BiSS authors.

Please let me know what your experiences are.
Best,
Felix

Edit: I just realised that you were not interested in a separate dinucleotide context but in a CG/CHG/CHH classification which makes a lot more sense. I shall amend this tomorrow.

It is very kind of you to provide this script to generates a cytosine methylation report. But it does not seems to work, something error like this:
perl genome_wide_cytosine_report.pl --help
Type of arg 1 to keys must be hash (not hash element) at genome_wide_cytosine_report.pl line 119, near "},"
Type of arg 1 to keys must be hash (not hash element) at genome_wide_cytosine_report.pl line 206, near "},"
BEGIN not safe after errors--compilation aborted at genome_wide_cytosine_report.pl line 289.

In my opinion, It is completely necessary to add an option to generate a bedGraph files and cytosine methylation report straight away in methylation extractor .
traditional dinucleotide context may not suitable for our samples, I need trinucleotide classification. As you said, I will call differential methylation on a per-cytosine basis instead of a region. It is will be more meaningful.

**fkrueger** · 09-21-2012, 01:42 AM

Hi my_bio,

I have now changed the output to be in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

I also fixed the compile errors, strangely enough it ran without any warnings on our system... I hope it'll work nicely now.

Attached Files

genome_wide_cytosine_report.pl (12.5 KB, 84 views)

**my_bio** · 09-21-2012, 06:17 AM

Originally posted by fkrueger View Post

Hi my_bio,

I have now changed the output to be in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

I also fixed the compile errors, strangely enough it ran without any warnings on our system... I hope it'll work nicely now.

It seems to work alright by now and I strongly suggest you to add these functions to methylation extractor. By the way, to my opinion, it is needed to splits output into different files for each chromosome. So we can parallel process by chromosome in subsequent analysis.

**fkrueger** · 09-21-2012, 07:52 AM

Originally posted by my_bio View Post

It seems to work alright by now and I strongly suggest you to add these functions to methylation extractor. By the way, to my opinion, it is needed to splits output into different files for each chromosome. So we can parallel process by chromosome in subsequent analysis.

I'll start thinking about implementing it into the methylation extractor if I find some time next week. Splitting the output per chromosomes requires only a couple of extra lines, but we could have it as another option.

**my_bio** · 09-21-2012, 06:09 PM

If the new version of methylation extractor have been updated, please inform us, thanks.

**my_bio** · 09-21-2012, 10:16 PM

To accurately calculate methylation level of cytosine, it's necessary to add another option to filter low sequencing quality reads. that is to say, if a base's sequencing quality is lower than 20, methylation extractor will ignore it.

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News