Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bison: BISlfite alignment On Nodes of a cluster

    Greetings all,

    I'd like to announce the general availability of a program that I've recently written called Bison (BISulfite alignment On Nodes of a cluster), which is intended for those who need to align bisulfite-converted reads and have access to a computer cluster. Bison is quite similar to Bismark (I'm a former Bismark user and wrote Bison to get my alignments sooner), with major differences as follows:
    • Bison is often 5-10x faster, due largely to the fact that it allocates individual cluster nodes to aligning reads to each strand. It combines information from the “aligner nodes” on a separate “master node”.
    • Bison uses the samtools C API to output alignments directly to BAM format, thereby saving space and disk I/O.
    • Bison is written purely in C, results in a bit more of a speed gain.
    • Bison also decides upon the correct alignment in a slightly different way than Bismark, resulting in fewer misaligned reads (0.02-0.03% versus ~0.6% for Bismark).
    • Bison requires only enough RAM for a single instance of bowtie2, as opposed to enough for 2-4 instances.

    Otherwise, Bison will be quite familiar to those of you already accustomed to using Bismark. Both directional and non-directional libraries are supported. Bowtie2 is used for alignment on each of the nodes. Both paired-end and single-end libraries are supported. A methylation extractor is included that outputs into bedGraph format (if people would like a different format or different information, just ask). For those doing RRBS, I should note that the methylation extractor can be told that you are doing RRBS (currently MspI and TaqI digested libraries are supported) and it will then ignore methylation calls of bases added experimentally during fragment end-repair (this avoids needing to trim them off prior to alignment).

    I should note that Bison does not currently support color-space reads, as I've never actually had any. Further, it is generally less flexible than Bismark, so I encourage users interested in Bison to try some test data to see if Bison meets their needs.

    Bison source code and directions for compilation and usage are available via sourceforge. Samtools and Bowtie2 are required for installation. Likewise, MPI is required for compilation and usage, though you can run Bison on a single computer if desired. If people run into installation or usage problems, please feel free to post in this thread or submit a ticket on sourceforge.

    Devon Ryan

  • #2
    version 0.1.1 is now available

    I've finally gotten around to updating the version of bison on sourceforge (now 0.1.1) so it's current with my local version. Changes are as follows:
    • Fixed a number of minor bugs.
    • Added support for uncompressed fastq files, as well as bzipped files (previously, only gzipped fastq files worked properly).
    • --score-min is now parsed by bison prior to being sent to bowtie2, read MAPQ scores are recalculated accordingly by the same algorithm used by bowtie2 (N.B., this only bears a vague correspondence to -10*log10(probability the mapping position is wrong)!).
    • Added a bison_mbias function, to process the aligned BAM file and create a text file containing the percentage of methylated C's as a function of read position. For the utility of this, see: Hansen KD, Langmead B and Irizarry RA, BSmooth: from whole genome bisulfite sequencing to differentially methylated regions. Genome Biol 2012; 13(10):R83.
    • The methylation extractor now accepts the -q options, which sets the MAPQ threshold for a read to be included in the methylation results. The default is a minimum MAPQ of 20, which seems to be a reasonable threshold from a few simulations.
    • In DEBUG mode, the output BAM files used to have fixed names. This was a problem in cases where debugging was being performed on multiple input files. Now, the OT/OB/CTOT/CTOB.bam filename is prepended with an appropriate prefix (extracted from the input file name). In addition, the output directory is now respected in DEBUG mode.
    • Included an "auxiliary" directory, that includes functions for making an RRBS genome and other possibly useful functions.

    Unless some bugs crop up, I expect the next release will support a semi-arbitrary number of nodes. As is, only 3 or 5 nodes can be used at a time. This is fine for me, since I'm usually processing a number of samples in parallel and our cluster is relatively small. I can easily envision others finding more nodes useful. I have a few implementation ideas for this.


    • #3
      Version 0.2.0 now available

      I've just posted version 0.2.0 to sourceforge. This big change in this release is the inclusion of bison_herd, which can use a semi-arbitrary number of nodes (e.g., I'm using 17 nodes to simultaneously align some samples at the moment). bison_herd can also accept a list of input files and will write each of their alignments to separate files. This is useful when you have a number of samples and want to skip the overhead of loading the bowtie2 index and the genome into memory more than once. bison_herd also skips writing in-silico converted reads to a file, further increasing performance. Other changes:
      • Added a note to the methylation summary statistics output at the end of a run that the numbers will include double counting of any site covered by both mates in a pair. These metrics are only meant for general information and not further analysis, so I don't consider that a bug (it's actually a design decision for the sake of performance).
      • --ignore-quals is no longer passed to bowtie2 by default. Specifying this will marginally decrease both correct and incorrect alignments. It will also generally decrease the alignment rate.
      • Fixed --unmapped, which are now written to the directory specified by -o
      • --maxins was already 500 by default, so it is no longer set by default.
      • The methylation extractor now has a -phred option, to exclude methylation calls from low confidence base-calls. The default threshold is 20.
      • Added a script to convert bedGraph files to a format suitable for BSseq.
      • Fixed a bug in bison_merge_CpGs

      The only thing left on my "To Do" list is to add support for filename globbing (e.g. sample_*_1.fastq.gz and sample_*_2.fastq.gz) to make feeding bison_herd (and the auxiliary scripts) with multiple files easier.


      • #4
        Hi Devon, this seems very interesting. I have been having trouble with the amount of plain-text output from Bismark.

        this seems to require exactly mpich, openmpi doesn't seem to work, is that correct? Once I get past that, I get:

        mpicc -c -Wall -O3   main.c -o main.o
        mpicc -Wall -O3  aux.o fastq.o genome.o slurp.o master.o common.o MPI_packing.o worker.o main.o -o bison -L/home/brentp/src/samtools  -lm -lpthread -lmpich -lmpl -lbam -lz
        aux.o: In function `quit':
        aux.c:(.text+0xcf2): undefined reference to `ompi_mpi_comm_world'
        aux.o: In function `effective_nodes':
        aux.c:(.text+0xef5): undefined reference to `ompi_mpi_comm_world'
        slurp.o: In function `slurp':
        slurp.c:(.text+0x1f9): undefined reference to `ompi_mpi_comm_world'
        slurp.c:(.text+0x1fe): undefined reference to `ompi_mpi_int'
        slurp.c:(.text+0x23d): undefined reference to `ompi_mpi_comm_world'
        slurp.c:(.text+0x24d): undefined reference to `ompi_mpi_int'
        slurp.c:(.text+0x277): undefined reference to `ompi_mpi_comm_world'
        slurp.c:(.text+0x287): undefined reference to `ompi_mpi_byte'
        slurp.c:(.text+0x347): undefined reference to `ompi_mpi_byte'
        slurp.c:(.text+0x34d): undefined reference to `ompi_mpi_comm_world'
        slurp.c:(.text+0x3a6): undefined reference to `ompi_mpi_comm_world'
        slurp.c:(.text+0x3c3): undefined reference to `ompi_mpi_byte'
        slurp.c:(.text+0x3ff): undefined reference to `ompi_mpi_byte'
        slurp.c:(.text+0x405): undefined reference to `ompi_mpi_comm_world'
        worker.o: In function `worker_node':
        worker.c:(.text+0x1c2): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x1cc): undefined reference to `ompi_mpi_int'
        worker.c:(.text+0x390): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x39b): undefined reference to `ompi_mpi_byte'
        worker.c:(.text+0x3dd): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x3ea): undefined reference to `ompi_mpi_byte'
        worker.c:(.text+0x453): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x45e): undefined reference to `ompi_mpi_byte'
        worker.c:(.text+0x5be): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x5c9): undefined reference to `ompi_mpi_int'
        worker.c:(.text+0x5e3): undefined reference to `ompi_mpi_comm_world'
        worker.c:(.text+0x5ee): undefined reference to `ompi_mpi_byte'
        collect2: ld returned 1 exit status
        make: *** [align] Error 1
        How can I get past that error?


        • #5
          Hi brentp, both should work fine. Try removing "-lmpich -lmpl" and replacing that with "-lmpi". I just replaced mpich2 with open-mpi and that then compiled (i'll add this to my list of things to fix). Of course it now segfaults for me . I'll have to play around with openMPI since I guess I've never used that. Let me know if this works for you, since that'll just mean that openMPI isn't completely installed on my system.

          BTW, I think the most recent version of bismark has an option to output directly to BAM (it just pipes to samtools).


          • #6
            Ah, the ensuing segfault was just due to some mpich2 headers still being found by mpicc (mixing mpich2 and openmpi doesn't work well). Deleting those such that only openmpi headers were being used solved that. So, just changing the -l option should solve the problem for you. Let me know if you run into any other issues. I would like to get as many of the bugs ironed out as possible.


            • #7
              I had to add -lpthread to bias and methylation extractor but it installed fine with openmpi after that.



              • #8
                I've just posted version 0.2.1, which contains a number of bug fixes and a few feature enhancements, to to sourceforge. The changes were as follows:
                • Added support for file globbing in bison_herd. You may now input multiple files using a combination of wild-cards (*, ?, etc.) and commas. Remember to put these in quotes (e.g., "foo/*1.fq.gz","bar/*1.fq.gz") so the shell doesn't perform the expansion!). As before, specifying multiple inputs with the same file name (e.g., sample1/reads.fq,sample2/reads.fq) will cause the output from the first reads.fq alignment to be over-written by the second.
                • Fixed the text output, since "unique alignments" isn't really correct, given that alignments with scores of 0 or 1 can be output but aren't unique.
                • Added information in the Makefile and above about compiling with openmpi.
                • Fixed a bug in bison_herd wherein the -upto option wasn't being handled properly. -upto now accepts an unsigned long in bison_herd.
                • Fixed a bug in bison_herd when paired-end reads were used. This was due to how bowtie2 reads from FIFOs. Changing how things were written to the FIFOs on the worker nodes resolved the problem.
                • The bison_mbias program has been heavily revamped. It still outputs the number of methylated or unmethylated CpG calls per position, but now keeps the metrics for each strand (and read, when paired-end reads are used) separate. If R and the ggplot2 library are installed, the program can also run the bison_mbias2pdf program (see below).
                • Created an bison_mbias2pdf Rscript that will read in the output of bison_mbias and plot the results, indicating the region of each read that should be included in methylation extraction. This script also print these suggestions in the format used by bison_methylation_extractor, for convenience.
                • The methylation extractor can now be told to only include certain regions of each read in the output methylation metrics. This is needed when there is apparent bias in the methylation at one or both ends of a read.
                • Previously, the recalculated MAPQ was incorrect when only 1 read in a pair had a valid secondary alignment. This has been fixed.
                • Fixed another MAPQ recalculation bug, affecting reads with MAPQ 2 that have MAPQ=6.
                • Fixed a bug in writing unmapped reads.
                • Fixed a bug in bison_herd that allowed early termination without warning.

                For those curious, I'm attaching a couple example M-bias plots generated by bison_mbias2pdf. The experiment was RRBS, so you can see the bias in the first and last 2 bases in many of the reads that need to be trimmed (this was a relatively early experiment, so the reads were generally not the best quality).


                • #9
                  I've posted a quick update, version 0.2.2
                  • Properly fixed some wording on the textual output (i.e., removed the word "unique").
                  • Lowered the default MAPQ and Phred thresholds used by the methylation extractor to 10 each. That the MAPQ threshold was originally 20 was an error on my part.


                  • #10
                    I've upload version 0.2.3, which is mainly geared toward getting local alignment working properly (it worked before, but the methylation calls were completely off). My thanks to mvijayen in this thread for providing the impetus and some good example data to get this done.
                    • Fix how hard and soft-clipped bases are dealt with (previously, soft-clipped bases resulted in an error and hard-clipped bases in incorrect position assignments!).
                    • Multiple bug fixes related to local alignment, which previously didn't work correctly. These issues seem to generally now be resolved. May thanks to user mvijayen on seqanswers for providing a perfect usage example for testing (see thread
                    • The maximum length of a single contig is now (2^64)-1 (instead of the previous 2^64). I don't think bowtie2 would even support something that long, but if it did then bison wouldn't (internally, a position of 2^64 means a base is inserted, soft, or hard-clipped).
                    • A previously missing "*" caused Bison to use the entirety of the description line in the fasta file as the chromosome name. This caused errors since bowtie2 only uses every before the first space (the proper method). Bison now does the same.
                    • A note about creating methylation-bias metrics with locally aligned reads is in order. If a read is soft-clipped, that portion is still included in the M-bias metrics. Likewise, if you pass -OT X,X,X,X or similar parameters to the methylation extractor, the soft-clipped area is also included in there.
                    • Another note regarding local alignments is that the XX auxiliary tag (effectively the more verbose version of the MD tag) contains soft-clipped sequences. I could probably have these removed if someone would like.


                    • #11
                      I just posted version 0.2.4, which fixes a silly error on my part and adds a simple markduplicates program:
                      • Fixed an off-by-one error in bison_mbias.
                      • Added bison_markduplicates, which, as the name implies, marks apparent PCR duplicates. The methylation extractor and m-bias calculator have also been updated to ignore marked duplicates.

                      The bison_markduplicates program uses the chromosome and both 5' and 3' bounds of both mates (if there are paired-end reads) as well as the strand to determine PCR duplicates.

                      I just found and fixed a few other bugs in bison_mbias (at some point it started swapping the methylated and unmethylated metrics). The bug wouldn't have caused a big issue previously, since the only purpose was for determining trimming bounds, but it was wrong none-the-less. Another bug was in bison_CpG_coverage, which wasn't handling unmerged bedGraph files properly before (merged files were fine). Sorry about those!
                      Last edited by dpryan; 02-17-2014, 09:33 AM. Reason: More changes


                      • #12
                        Help with bison_index


                        I'm interested in using Bison instead of Bismark for my Bis-seq analysis. I think I have everything installed correctly, but I'm having trouble with indexing the reference genome.

                        bison_index will not accept a directory, but will accept a .fa file.

                        If I try to index multiple .fa files using /directory_of_reference/*.fa it seems to accept the first .fa as input and the second .fa file to create an outputfile.

                        I've looked at bison_index -h which suggests comma separated .fa files, but still no luck.

                        Any suggestions about what I am doing incorrect? I'm using human assembly GRCh37 as my reference.


                        • #13
                          Hi blam,

                          I have to admit that how I have bison_index do this is kind of silly. What would make more sense is, as you suggest, to just tell it what directory the fasta files are in and have it go from there, particularly since that's how all of the other bison tools work! I'll actually try to edit things to work that way tonight.

                          In the interim, using a comma-separated list should work (I just tested that and it at least works on my computer), keeping in mind that that means not including a space after the comma (this is also how bowtie2-build works and is done for purely logistic reasons). So, something like the following should work:

                          bison_index chr1.fa,chr10.fa,chr11.fa,chr12.fa
                          Yes, that's annoying and I will change it. Another possibility is to just:
                          cat *.fa > genome.fa
                          bison_index genome.fa
                          rm genome.fa
                          That's also not ideal, but should suffice while I make a better version. Let me know if you run into any other issues and I'll get them fixed.


                          • #14
                            bison_index THANKS

                            Thanks! The readme file made me think that bison_index took a directory. I am now indexing my reference with your suggestions above.


                            • #15
                              Looks like I'll be fixing the README file as well then :P


                              Latest Articles


                              • seqadmin
                                Exploring the Dynamics of the Tumor Microenvironment
                                by seqadmin

                                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                07-08-2024, 03:19 PM
                              • seqadmin
                                Exploring Human Diversity Through Large-Scale Omics
                                by seqadmin

                                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                06-25-2024, 06:43 AM





                              Topics Statistics Last Post
                              Started by seqadmin, 07-10-2024, 07:30 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 09:45 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 08:54 AM
                              0 responses
                              Last Post seqadmin  
                              Started by seqadmin, 07-02-2024, 03:00 PM
                              0 responses
                              Last Post seqadmin