Bison: BISulfite alignment On Nodes of a cluster

dpryan replied

10-27-2014, 12:56 AM
I've just posted version 0.3.3, which supports discordant and singleton alignments. The tutorial has also been updated to demonstrate how to suppress such alignments, if desired.

Bison has now been published. If you use it in your research, please cite the paper here.

I'll note that the next version will add support for CRAM files.
dpryan replied

08-26-2014, 06:53 AM
I've just posted version 0.3.2b, which fixes the Makefile so that bison will use the static htslib file. Otherwise, users would need to keep htslib around (convenient for me, but probably not for you).
dpryan replied

08-25-2014, 05:46 AM
It seems that I missed posting when I released version 0.3.1. Anyway, I've just released version 0.3.2. Changes of note are below, though the biggest one is support for HTSlib. I should note that I've also created a tutorial with compilation instructions and a couple example datasets available here.

Added bedGraph2MOABS to convert bedGraph files for use by MOABS.

Added support for HTSlib.

Fixed a small bug wherein --reorder wasn't being invoked when multiple output BAM files were to be used.

Fixed a small bug that only manifested in DEBUG mode.

There is now a tutorial.

The default minimum MAPQ and Phred scores used by bison_mbias have been updated to match bison_methylation_extractor.
dpryan replied

02-27-2014, 05:27 AM
v0.3.0

I've just released version 0.3.0, which should address the problem I mentioned in my last post as well as a few other small bugs. I should note that you can now track the development version(s) of bison on github. I have a few branches (some not yet on github), implementing discordant/mixed alignments and using the development version of samtools/htslib.

Note: The indices produced by previous versions are not guaranteed to be compatible unless you used a multi-fasta file. There was a serious implementation problem with how bison_index worked when given multiple files as input and how multiple files were read into memory in previous versions. If you used a multi-fasta file, then everything will continue to work correctly. However, if you used multiple fasta files in a list then I strongly encourage you to delete the previous indices (just remove the bisulfite_genome directory) and reindex. The technical reasons for this issue are that when the bison tools previously read multiple fasta files into memory, they would do so in whatever order they appeared in the directory structure, which can change over time and isn't guaranteed to match the order of files someone specified during indexing. While the alignments wouldn't be affected by this, the methylation calls could have been seriously compromised. In this version, bison_index will only accept a directory, not a list of files, and it will always alphasort() the list of files in that directory prior to processing. This should eliminate this problem. My apologies to anyone affected by this.
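The ordering issue can be illustrated with a short Python sketch. This is not Bison's code (Bison is written in C and, per the note above, uses alphasort()); the function name list_fasta_sorted and the file extensions are assumptions made up for the example. The point is only that a raw directory listing has no guaranteed order, so the list is sorted explicitly before use.

```python
import os
import tempfile

def list_fasta_sorted(genome_dir, extensions=(".fa", ".fasta")):
    """Return fasta paths from genome_dir in a fixed, sorted order.

    os.listdir() returns entries in arbitrary order, so the names are
    sorted explicitly to make the order reproducible between runs.
    """
    names = [n for n in os.listdir(genome_dir) if n.endswith(extensions)]
    return [os.path.join(genome_dir, n) for n in sorted(names)]

# Small demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as d:
    for name in ("chr3.fa", "chr1.fa", "chr2.fa", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    ordered = [os.path.basename(p) for p in list_fasta_sorted(d)]
    print(ordered)
```

Note that a plain lexicographic sort puts chr10.fa before chr2.fa. That is fine here: what matters is that indexing and later reading see the same order every time, not that the order looks natural.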

Added --genome-size option to a number of the tools. Many of the bison programs need to read the genome into memory. By default, 3 gigabases worth of memory are allocated for that and the size is increased as needed. For smaller genomes, this wasted space. For larger genomes, the constant reallocation of space could seriously slow things down. Consequently, this option was added to any tool that reads the genome into memory. It's convenient to overestimate this slightly, so if your genome is 3.8 gigabases, then just use 4000000000 as the genome size.
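If you don't know your genome size offhand, something like the following Python sketch can produce a value to pass to --genome-size. The helper names count_bases and padded_genome_size, and the rounding step, are assumptions for illustration and not part of Bison; the only thing taken from the post is the advice to overestimate slightly.

```python
def count_bases(fasta_lines):
    """Count sequence characters, skipping header lines that start with '>'."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total

def padded_genome_size(n_bases, step):
    """Round n_bases up to a multiple of step (integer ceiling division)."""
    return -(-n_bases // step) * step

# Tiny in-memory example; for a real genome, pass an open file object instead.
demo = [">chr1 some description", "ACGT", "ACG", ">chr2", "TT"]
print(count_bases(demo))

# The 3.8 gigabase case from the post, rounded up to the next gigabase.
print(padded_genome_size(3_800_000_000, 1_000_000_000))
```

For a genome stored across several fasta files, summing count_bases over each file gives the total to round up.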

bison_merge_CpGs can now take multiple input files at once.

A number of small bug fixes, such as when "genome_dir" doesn't end in a /.
dpryan replied

02-27-2014, 04:25 AM
@blam: A not so small correction. The only solution that is actually correct was to concatenate files together, but then only keep the resulting multi-fasta file (i.e. "cat *.fa > genome.fa" and then either move genome.fa to its own directory or delete the other .fa files). The other solution is not guaranteed to always work correctly. I'm finishing a new release that will fix this. In the new version, bison_index will in fact accept a directory of fasta files, rather than needing one to specify files individually (in fact, I will explicitly remove the ability for it to handle that since it can have unintended consequences).

The previous implementation could work incorrectly in cases where the input file list didn't match the order in which the files appeared in the directory entry, which can actually change over time. What that would mean is that the files could have been indexed in one order (e.g., chr1, then chr2, then chr3, ...) but then later read into memory in a different order (e.g., chr3, then chr1, then chr2, ...), which could cause all sorts of problems. This could only occur if you passed bison_index a list of files, rather than a single multi-fasta file. While I don't expect people to get bitten by this bug, it's very much possible and I consider it a major issue. I'm testing a fix and will upload a new version within the next couple hours.

For anyone who stores the genome in a single file, this won't be an issue for you. If, however, you store chromosomes/contigs in individual files, then I recommend deleting the current indices (just "rm -rf bisulfite_genome" in the directory with the fasta files) and reindexing. The version I'm testing will always process files in the same order, regardless of their order in the dirent structure on disk, so this problem will be resolved.

Last edited by dpryan; 02-27-2014, 04:28 AM.
dpryan replied

02-26-2014, 12:04 PM
Looks like I'll be fixing the README file as well then :P
blam replied

02-26-2014, 12:02 PM
bison_index THANKS

Thanks! The readme file made me think that bison_index took a directory. I am now indexing my reference with your suggestions above.
dpryan replied

02-26-2014, 11:43 AM
Hi blam,

I have to admit that how I have bison_index do this is kind of silly. What would make more sense is, as you suggest, to just tell it what directory the fasta files are in and have it go from there, particularly since that's how all of the other bison tools work! I'll actually try to edit things to work that way tonight.

In the interim, using a comma-separated list should work (I just tested that and it at least works on my computer), keeping in mind that that means not including a space after the comma (this is also how bowtie2-build works and is done for purely logistic reasons). So, something like the following should work:

Code:

bison_index chr1.fa,chr10.fa,chr11.fa,chr12.fa

Yes, that's annoying and I will change it. Another possibility is to just:

Code:

cat *.fa > genome.fa
bison_index genome.fa
rm genome.fa

That's also not ideal, but should suffice while I make a better version. Let me know if you run into any other issues and I'll get them fixed.
blam replied

02-26-2014, 11:04 AM
Help with bison_index

Hi,

I'm interested in using Bison instead of Bismark for my Bis-seq analysis. I think I have everything installed correctly, but I'm having trouble with indexing the reference genome.

bison_index will not accept a directory, but will accept a .fa file.

If I try to index multiple .fa files using /directory_of_reference/*.fa, it seems to accept the first .fa file as input and use the second .fa file as the name of the output file.

I've looked at bison_index -h which suggests comma separated .fa files, but still no luck.

Any suggestions about what I am doing incorrectly? I'm using human assembly GRCh37 as my reference.
dpryan replied

02-17-2014, 06:59 AM
I just posted version 0.2.4, which fixes a silly error on my part and adds a simple markduplicates program:

Fixed an off-by-one error in bison_mbias.

Added bison_markduplicates, which, as the name implies, marks apparent PCR duplicates. The methylation extractor and m-bias calculator have also been updated to ignore marked duplicates.

The bison_markduplicates program uses the chromosome and both 5' and 3' bounds of both mates (if there are paired-end reads) as well as the strand to determine PCR duplicates.
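To make the duplicate criteria concrete, here is a minimal Python sketch of key-based duplicate marking. It is not the bison_markduplicates implementation: the dictionary layout, the function names, and the choice to keep the first alignment seen for each key are all assumptions for illustration. Only the fields in the key (chromosome, strand, and the 5' and 3' bounds of each mate) come from the description above.

```python
def duplicate_key(aln):
    """Build the comparison key: chromosome, strand and the bounds of each mate.

    aln is a hypothetical dict; mate2_bounds is None for single-end reads.
    """
    return (aln["chrom"], aln["strand"], aln["mate1_bounds"], aln.get("mate2_bounds"))

def mark_duplicates(alignments):
    """Return one boolean per alignment; True means 'apparent duplicate'.

    Assumption: the first alignment seen with a given key is kept and any
    later alignment with the same key is marked.
    """
    seen = set()
    flags = []
    for aln in alignments:
        key = duplicate_key(aln)
        flags.append(key in seen)
        seen.add(key)
    return flags

alignments = [
    {"chrom": "chr1", "strand": "+", "mate1_bounds": (100, 150), "mate2_bounds": (300, 350)},
    {"chrom": "chr1", "strand": "+", "mate1_bounds": (100, 150), "mate2_bounds": (300, 350)},
    {"chrom": "chr1", "strand": "-", "mate1_bounds": (100, 150), "mate2_bounds": (300, 350)},
]
flags = mark_duplicates(alignments)
print(flags)
```

The third alignment shares its coordinates with the first two but lies on the other strand, so it is not treated as a duplicate.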

I just found and fixed a few other bugs in bison_mbias (at some point it started swapping the methylated and unmethylated metrics). The bug wouldn't have caused a big issue previously, since the only purpose was for determining trimming bounds, but it was wrong nonetheless. Another bug was in bison_CpG_coverage, which wasn't handling unmerged bedGraph files properly before (merged files were fine). Sorry about those!
Last edited by dpryan; 02-17-2014, 09:33 AM. Reason: More changes
dpryan replied

01-16-2014, 04:03 AM
I've uploaded version 0.2.3, which is mainly geared toward getting local alignment working properly (it worked before, but the methylation calls were completely off). My thanks to mvijayen in this thread for providing the impetus and some good example data to get this done.

Fix how hard and soft-clipped bases are dealt with (previously, soft-clipped bases resulted in an error and hard-clipped bases in incorrect position assignments!).

Multiple bug fixes related to local alignment, which previously didn't work correctly. These issues seem to generally now be resolved. Many thanks to user mvijayen on seqanswers for providing a perfect usage example for testing (see thread http://seqanswers.com/forums/showthread.php?t=39914).

The maximum length of a single contig is now (2^64)-1 (instead of the previous 2^64). I don't think bowtie2 would even support something that long, but if it did then bison wouldn't (internally, a position of 2^64 means a base is inserted, soft-clipped, or hard-clipped).

A previously missing "*" caused Bison to use the entirety of the description line in the fasta file as the chromosome name. This caused errors since bowtie2 only uses everything before the first space (the proper method). Bison now does the same.
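The naming rule in the last item is easy to show in a few lines of Python. This is only an illustration of the rule as described (take everything before the first space), not Bison's C code; the function name chrom_name is made up, and splitting on any whitespace rather than only a space character is an assumption.

```python
def chrom_name(header_line):
    """Return the sequence name from a fasta header line.

    The name is the text after the leading '>' up to the first whitespace;
    the rest of the line is treated as a free-text description.
    """
    fields = header_line.lstrip(">").split()
    return fields[0] if fields else ""

print(chrom_name(">chr1 Homo sapiens chromosome 1"))
print(chrom_name(">chrM"))
```

Using the full description line instead would give names that never match the ones bowtie2 reports in its alignments, which is the kind of mismatch the fix addresses.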

A note about creating methylation-bias metrics with locally aligned reads is in order. If a read is soft-clipped, that portion is still included in the M-bias metrics. Likewise, if you pass -OT X,X,X,X or similar parameters to the methylation extractor, the soft-clipped area is also included in there.

Another note regarding local alignments is that the XX auxiliary tag (effectively the more verbose version of the MD tag) contains soft-clipped sequences. I could probably have these removed if someone would like.
dpryan replied

01-08-2014, 02:36 AM
I've posted a quick update, version 0.2.2

Properly fixed some wording on the textual output (i.e., removed the word "unique").

Lowered the default MAPQ and Phred thresholds used by the methylation extractor to 10 each. That the MAPQ threshold was originally 20 was an error on my part.
dpryan replied

01-02-2014, 02:52 AM
I've just posted version 0.2.1, which contains a number of bug fixes and a few feature enhancements, to SourceForge. The changes were as follows:

Added support for file globbing in bison_herd. You may now input multiple files using a combination of wild-cards (*, ?, etc.) and commas. Remember to put these in quotes (e.g., "foo/*1.fq.gz","bar/*1.fq.gz") so the shell doesn't perform the expansion! As before, specifying multiple inputs with the same file name (e.g., sample1/reads.fq,sample2/reads.fq) will cause the output from the first reads.fq alignment to be over-written by the second.

Fixed the text output, since "unique alignments" isn't really correct, given that alignments with scores of 0 or 1 can be output but aren't unique.

Added information in the Makefile and above about compiling with openmpi.

Fixed a bug in bison_herd wherein the -upto option wasn't being handled properly. -upto now accepts an unsigned long in bison_herd.

Fixed a bug in bison_herd when paired-end reads were used. This was due to how bowtie2 reads from FIFOs. Changing how things were written to the FIFOs on the worker nodes resolved the problem.

The bison_mbias program has been heavily revamped. It still outputs the number of methylated or unmethylated CpG calls per position, but now keeps the metrics for each strand (and read, when paired-end reads are used) separate. If R and the ggplot2 library are installed, the program can also run the bison_mbias2pdf program (see below).

Created a bison_mbias2pdf Rscript that will read in the output of bison_mbias and plot the results, indicating the region of each read that should be included in methylation extraction. This script also prints these suggestions in the format used by bison_methylation_extractor, for convenience.

The methylation extractor can now be told to only include certain regions of each read in the output methylation metrics. This is needed when there is apparent bias in the methylation at one or both ends of a read.

Previously, the recalculated MAPQ was incorrect when only 1 read in a pair had a valid secondary alignment. This has been fixed.

Fixed another MAPQ recalculation bug, affecting reads with MAPQ 2 that have MAPQ=6.

Fixed a bug in writing unmapped reads.

Fixed a bug in bison_herd that allowed early termination without warning.
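As a rough illustration of the file globbing item at the top of this list, the Python sketch below splits a comma-separated specification, expands the wild-cards in each part, and reports base names that occur more than once (the case that leads to over-written output). It is not how bison_herd does this internally; the function names expand_inputs and find_name_collisions and the demo paths are assumptions.

```python
import glob
import os
import tempfile
from collections import Counter

def expand_inputs(spec):
    """Split a comma-separated spec and expand wild-cards in each part."""
    files = []
    for pattern in spec.split(","):
        files.extend(sorted(glob.glob(pattern)))
    return files

def find_name_collisions(files):
    """Return base names that appear more than once among the inputs."""
    counts = Counter(os.path.basename(f) for f in files)
    return sorted(name for name, n in counts.items() if n > 1)

# Demonstration with throwaway directories and empty files.
with tempfile.TemporaryDirectory() as d:
    for sub, name in (("foo", "a_1.fq.gz"), ("foo", "b_1.fq.gz"), ("bar", "a_1.fq.gz")):
        os.makedirs(os.path.join(d, sub), exist_ok=True)
        open(os.path.join(d, sub, name), "w").close()
    spec = os.path.join(d, "foo", "*1.fq.gz") + "," + os.path.join(d, "bar", "*1.fq.gz")
    files = expand_inputs(spec)
    names = [os.path.basename(f) for f in files]
    collisions = find_name_collisions(files)
    print(names)
    print(collisions)
```

Running a check like this on your own input specification before starting a long alignment is a cheap way to spot inputs whose results would overwrite each other.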

For those curious, I'm attaching a couple example M-bias plots generated by bison_mbias2pdf. The experiment was RRBS, so you can see the bias in the first and last 2 bases in many of the reads that need to be trimmed (this was a relatively early experiment, so the reads were generally not the best quality).
brentp replied

12-05-2013, 01:47 PM
I had to add -lpthread to bias and methylation extractor but it installed fine with openmpi after that.

Thanks.
dpryan replied

12-05-2013, 12:43 PM
Ah, the ensuing segfault was just due to some mpich2 headers still being found by mpicc (mixing mpich2 and openmpi doesn't work well). Deleting those such that only openmpi headers were being used solved that. So, just changing the -l option should solve the problem for you. Let me know if you run into any other issues. I would like to get as many of the bugs ironed out as possible.