Header Leaderboard Ad

Collapse

Bismark - A New Tool for Mapping and Analysis of Bisulfite-Seq Data

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by shawpa View Post
    I am using the bismarktobedgraph script from the website. Because it doesn't have a temp directory option I am running it from the directory that has the space. It seems that it is running out of memory still and my standard error file says "sort: write failed: /tmp/sortMpgIOA: No space left on device". I am not sure why it is still writing to this directory. Is there something in the script that needs to be adjusted?
    Hi Shawpa,

    your input files are probably quite big so that sorting fails. The script is sorting using the operating system's sort command, which defaults to the /tmp/ directory. This could be changed in the script with the sort -T option

    Code:
     -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
    I have also written a new version of this script a few days ago, which lets you sort your files chromosome by chromosome if you run out of space. The new script is attached, it'll be up on the homepage shortly.

    Here are the changes to the script i have implemented:

    Added an option '--split_by_chromosome' to enable sorting of very large files. The methylation extractor output is first written into temporary files chromosome by chromosome. These files are then sorted by position and deleted afterwards.

    Added an option '--counts' which adds 2 more lines to the bedGraph output file:

    Column 5: count of methylated calls per position, and
    Column 6: count of unmethylated calls per position.

    Technically, this renders the output to be no longer in bedGraph format, but it might enable additional calculations with the output.
    Attached Files

    Comment


    • threads optimization

      Dear Felix,
      In order to estimate the number of grids I should use in our cluster to run bismark with huge data sets, I am running a test to check processing times and number of processors used and found a problem; more processors took longer processing times.
      My environment:
      PowerMac OS10.7.4, 2x 2.4GHz 6-core Xeon, 64GB RAM
      used bismark-v0.7.6

      in-silico generated test data:
      100,000 pairs of 100-bp PEreads randomly taken from mouse genome (UCSC mm9)
      bisulfite conversion rate:0.999, CpG methylation rate: 0.7

      test:
      run bismark without option or -p 2, 4, 8, 10, and 12 options

      result: (used command and time)
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 424.112706184
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 304.065144777
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 233.798403025
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 251.128180981
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 10 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 284.739043951

      I had a similar result with 1 million pairs

      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 4068.50050616
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 2656.86580992
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 1997.74887085
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 2230.58184505
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 10 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 2388.17396808
      bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 12 -1 0-genome_1_bis.fastq -2 0-genome_2_bis.fastq
      took 2468.01015115

      However, bowtie2 itself showed a good correlation with times and # of processor.
      bowtie2 --sensitive-local -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 170.282114983
      bowtie2 --sensitive-local -p 2 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 87.4923779964
      bowtie2 --sensitive-local -p 4 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 45.6259729862
      bowtie2 --sensitive-local -p 8 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 26.6778290272
      bowtie2 --sensitive-local -p 10 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 23.4780960083
      bowtie2 --sensitive-local -p 12 -x genome -1 100k-genome_1.fastq -2 100k-genome_2.fastq -S 100k_genome.sam
      took 21.6073651314

      In all conditions, the activity monitor indicated appropriate usage of computational power. (e.g. ~ 100 % CPU power with single processor and ~1000 % with 10 processors)
      Is this specific to my environment or a general problem regarding algorithm/program structure?

      Thanks,

      Comment


      • That's an interesting observation; I have noticed though that you aren't using quite the same paramaters since Bismark is calling Bowtie2 in --reorder mode which is needed for Bismark's internal logic. The Bowtie2 manual states:

        --reorder: Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory then if --reorder were not specified.

        My guess is that it is a combination between Bowtie2 getting itself worked up somehow, and to some extent the way best alignments are worked out and methylation is extracted within Bismark. Since you are seeing an increase in CPU usage to 1000% for 10 cores it seems that it is really Bowtie2 which is thinking so hard, since Bismark itselft is single-threaded and thus uses only 100% max. Another option for parallelizing Bowtie2 alignments might also be to simply split up input files into 2 or more chunks... Does this help?

        Comment


        • Dear Felix,
          Thank you for your prompt reply.
          As you suggested, the bottleneck seems to be the ability of a thread running bismark. While bismark is running, one perl-5.xx thread and two bowtie2 with p threads are working. The perl thread often gets to 100% CPU power with high p options, which suggest bismark is inundated with tons of incoming results from bowtie2. I got best results with -p 5, but it may be faster to split a job into two and run jobs with -p 3 or 4 independently when I deal with huge data sets, as you suggested.

          Many thanks to your suggestion again.

          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 426.156659842
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 2 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 300.796103001
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 3 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 251.952316046
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 4 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 231.829093933
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 5 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 220.367155075
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 6 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 222.934455156
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 7 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 242.725275993
          bismark --bowtie2 /Users/Shared/NCBIM37genomeDir/ -p 8 -1 100k-genome_1_bis.fastq -2 100k-genome_2_bis.fastq
          took 257.643329144

          Comment


          • Dear fkrueger,
            genome_methylation_bismark2bedGraph_v4.pl is a useful program to count methylation of cytosine genome-wide. if a cytosine site did not covered one read, information of this cytosine will not appear in output file. So I need a program which can count methylation level of every cytosine site in reference genome, this will be more useful in subsequent analysis. the output file may like this:
            chromosome strand coordinate methylated_reads unmethylated_reads context context detaile_context
            chr1 10 + 4 6 CG CGA
            chr1 11 - 5 6 CG CGT
            chr1 15 + 1 11 CHG CAG
            chr1 17 - 2 10 CHG CTG
            chr1 20 + 0 12 CHH CAA
            Could you provide a program like this ?
            Last edited by my_bio; 09-19-2012, 04:30 AM.

            Comment


            • Good point my_bio.

              Comment


              • Originally posted by my_bio View Post
                Dear fkrueger,
                genome_methylation_bismark2bedGraph_v4.pl is a useful program to count methylation of cytosine genome-wide. if a cytosine site did not covered one read, information of this cytosine will not appear in output file. So I need a program which can count methylation level of every cytosine site in reference genome, this will be more useful in subsequent analysis. the output file may like this:
                chromosome strand coordinate methylated_reads unmethylated_reads context context detaile_context
                chr1 10 + 4 6 CG CGA
                chr1 11 - 5 6 CG CGT
                chr1 15 + 1 11 CHG CAG
                chr1 17 - 2 10 CHG CTG
                chr1 20 + 0 12 CHH CAA
                Could you provide a program like this ?
                This should be fairly straight forward to do. For a not covered CpG, would you envisage to have a record such as this:

                chr1 10 + 0 0 CG CGA
                chr1 11 - 0 0 CG CGT

                ?

                Comment


                • Originally posted by fkrueger View Post
                  This should be fairly straight forward to do. For a not covered CpG, would you envisage to have a record such as this:

                  chr1 10 + 0 0 CG CGA
                  chr1 11 - 0 0 CG CGT

                  ?
                  For a not covered CpG, I would like to have a record. That is to say, for different samples, as long as the reference genome is same, their output files will have the same number of lines. This will be more convenient in subsequent analysis, especially in the comparison of different samples. So I think Bismark need to add these functions. By the way, Bismark will be more popular if it can do methylcytosines calling by binomial test. BiSS(http://www.cibiv.at/software/ngm/BiSS/) can do all these works, but run this software need NVIDIA Accelerated Linux Graphics Driver. I look forward to update of Bismark to do more analysis.

                  Comment


                  • I have now written something that generates a cytosine methylation report for a genome of interest and a sorted methylation input file produced by the script "genome_methylation2bedGraph". By default, the output uses 1-based chromosome coordinates (zero-based cords are optional) and reports CpG context only (all cytosines optional). The output considers all Cs on both forward and reverse strands.
                    The input file needs to have been generated with the genome_methylation2bedGraph script while using the '--counts' option, or otherwise be exactly in the following format:
                    <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
                    I didn’t have time for extensive testing but it seems to work alright.

                    The USAGE is: genome_wide_cytosine_report.pl [options] <input> > <output>, --help will show all options.

                    The output looks like this:
                    <chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

                    Some additional comments:
                    - If this script turns out to be working nicely and this is useful for users, I could add to the methylation extractor an option to generate a bedGraph files or this output format straight away. I appreciate that it is somewhat tedious to run methylation_extractor > extractor2bedGraph > genome_wide_cytosine_report in succession, but I am sure this could be streamlined

                    - You need to be aware that if you are dealing with extremely large datasets that cover almost the entire genome, e.g. 2 billion WGSBS-reads, the whole task becomes very memory intensive (probably up to 20GB or so) even though it is working on a chromosome-by-chromosome basis (large chromosomes may have well over 100 million Cs). This is a limitation of using standard Perl and the overheads used for its data structures, and one would have to look at using different approaches or a different programming language to do the task if this solution fails.

                    - Running a binomial test on the methylated positions does not necessarily have anything to do with Bismark or any aligner per se (however I am not sure you want to call differential methylation on a per-cytosine basis instead of per feature or region of interest…). You could possibly even run the binomial test with the R script provided by the BiSS authors.

                    Please let me know what your experiences are.
                    Best,
                    Felix


                    Edit: Script amended to also output sequence context.
                    Attached Files
                    Last edited by fkrueger; 09-21-2012, 01:38 AM. Reason: changed the default output

                    Comment


                    • Originally posted by fkrueger View Post
                      I have now written something that generates a cytosine methylation report for a genome of interest and a sorted methylation input file produced by the script "genome_methylation2bedGraph". By default, the output uses 1-based chromosome coordinates (zero-based cords are optional) and reports CpG context only (all cytosines optional). The output considers all Cs on both forward and reverse strands.
                      The input file needs to have been generated with the genome_methylation2bedGraph script while using the '--counts' option, or otherwise be exactly in the following format:
                      <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
                      I didn’t have time for extensive testing but it seems to work alright.

                      The USAGE is: genome_wide_cytosine_report.pl [options] <input> > <output>, --help will show all options.

                      The default output is:
                      <Chromosome> <Position> <Strand> <Count methylated> <Count non-methylated> <Trinucleotide context>
                      Also, there is an option ‘--dinucleotide_context’ to produce an additional column at the end with the di-nucleotide context. As the dinucleotide context is already contained within the tri-nucleotide context field and because it makes the output file considerably larger, this extra field is only optional.

                      Some additional comments:
                      - If this script turns out to be working nicely and this is useful for users, I could add to the methylation extractor an option to generate a bedGraph files or this output format straight away. I appreciate that it is somewhat tedious to run methylation_extractor > extractor2bedGraph > genome_wide_cytosine_report in succession, but I am sure this could be streamlined

                      - You need to be aware that if you are dealing with extremely large datasets that cover almost the entire genome, e.g. 2 billion WGSBS-reads, the whole task becomes very memory intensive (probably up to 20GB or so) even though it is working on a chromosome-by-chromosome basis (large chromosomes may have well over 100 million Cs). This is a limitation of using standard Perl and the overheads used for its data structures, and one would have to look at using different approaches or a different programming language to do the task if this solution fails.

                      - Running a binomial test on the methylated positions does not necessarily have anything to do with Bismark or any aligner per se (however I am not sure you want to call differential methylation on a per-cytosine basis instead of per feature or region of interest…). You could possibly even run the binomial test with the R script provided by the BiSS authors.

                      Please let me know what your experiences are.
                      Best,
                      Felix


                      Edit: I just realised that you were not interested in a separate dinucleotide context but in a CG/CHG/CHH classification which makes a lot more sense. I shall amend this tomorrow.

                      It is very kind of you to provide this script to generates a cytosine methylation report. But it does not seems to work, something error like this:
                      perl genome_wide_cytosine_report.pl --help
                      Type of arg 1 to keys must be hash (not hash element) at genome_wide_cytosine_report.pl line 119, near "},"
                      Type of arg 1 to keys must be hash (not hash element) at genome_wide_cytosine_report.pl line 206, near "},"
                      BEGIN not safe after errors--compilation aborted at genome_wide_cytosine_report.pl line 289.

                      In my opinion, It is completely necessary to add an option to generate a bedGraph files and cytosine methylation report straight away in methylation extractor .
                      traditional dinucleotide context may not suitable for our samples, I need trinucleotide classification. As you said, I will call differential methylation on a per-cytosine basis instead of a region. It is will be more meaningful.

                      Comment


                      • Hi my_bio,

                        I have now changed the output to be in the following format:
                        <chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

                        I also fixed the compile errors, strangely enough it ran without any warnings on our system... I hope it'll work nicely now.
                        Attached Files

                        Comment


                        • Originally posted by fkrueger View Post
                          Hi my_bio,

                          I have now changed the output to be in the following format:
                          <chromosome> <position> <strand> <count methylated> <count non-methylated> <C context> <trinucleotide context>

                          I also fixed the compile errors, strangely enough it ran without any warnings on our system... I hope it'll work nicely now.

                          It seems to work alright by now and I strongly suggest you to add these functions to methylation extractor. By the way, to my opinion, it is needed to splits output into different files for each chromosome. So we can parallel process by chromosome in subsequent analysis.

                          Comment


                          • Originally posted by my_bio View Post
                            It seems to work alright by now and I strongly suggest you to add these functions to methylation extractor. By the way, to my opinion, it is needed to splits output into different files for each chromosome. So we can parallel process by chromosome in subsequent analysis.
                            I'll start thinking about implementing it into the methylation extractor if I find some time next week. Splitting the output per chromosomes requires only a couple of extra lines, but we could have it as another option.
                            Last edited by fkrueger; 09-21-2012, 08:14 AM.

                            Comment


                            • If the new version of methylation extractor have been updated, please inform us, thanks.

                              Comment


                              • To accurately calculate methylation level of cytosine, it's necessary to add another option to filter low sequencing quality reads. that is to say, if a base's sequencing quality is lower than 20, methylation extractor will ignore it.

                                Comment

                                Working...
                                X