  • #16
    @blam: A not-so-small correction. The only solution that is actually correct is to concatenate the files and then keep only the resulting multi-fasta file (i.e., "cat *.fa > genome.fa", then either move genome.fa to its own directory or delete the other .fa files). The other solution is not guaranteed to always work correctly. I'm finishing a new release that will fix this. In the new version, bison_index will accept a directory of fasta files rather than requiring you to specify files individually (in fact, I will explicitly remove the ability to pass individual files, since that can have unintended consequences).

    The previous implementation could work incorrectly when the input file list didn't match the order in which the files appeared in the directory entry, which can actually change over time. This means that the files could have been indexed in one order (e.g., chr1, then chr2, then chr3, ...) but later read into memory in a different order (e.g., chr3, then chr1, then chr2, ...), which could cause all sorts of problems. This could only occur if you passed bison_index a list of files rather than a single multi-fasta file. While I don't expect people to get bitten by this bug, it's very much possible and I consider it a major issue. I'm testing a fix and will upload a new version within the next couple of hours.

    For anyone who stores the genome in a single file, this won't be an issue for you. If, however, you store chromosomes/contigs in individual files, then I recommend deleting the current indices (just "rm -rf bisulfite_genome" in the directory with the fasta files) and reindexing. The version I'm testing will always process files in the same order, regardless of their order in the dirent structure on disk, so this problem will be resolved.
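For anyone who wants to build the multi-fasta file in a reproducible order, here is a minimal Python sketch of the idea. It is not bison code: the function names `list_fasta_sorted` and `concatenate` and the `.fa`/`.fasta` extension filter are assumptions made for illustration. The point is only that the raw directory listing has no guaranteed order, so the names are sorted before anything is read. Python's `sorted()` compares by code point, which may differ from a locale-aware sort for unusual file names, and chr10 will sort before chr2; what matters is that the order is the same every time.

```python
import os
import tempfile


def list_fasta_sorted(genome_dir, extensions=(".fa", ".fasta")):
    """Return FASTA paths in genome_dir in a stable, name-sorted order.

    os.listdir() returns entries in arbitrary order, so we sort the
    names explicitly instead of trusting the on-disk order.
    """
    names = [n for n in os.listdir(genome_dir) if n.endswith(extensions)]
    return [os.path.join(genome_dir, n) for n in sorted(names)]


def concatenate(paths, out_path):
    """Write the given files to out_path in exactly the order given."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                for line in fh:
                    # Make sure a file without a trailing newline does not
                    # glue its last line onto the next file's header.
                    out.write(line if line.endswith("\n") else line + "\n")


if __name__ == "__main__":
    # Hypothetical demo data: three tiny per-chromosome files.
    src = tempfile.mkdtemp()
    for name in ("chr3.fa", "chr1.fa", "chr2.fa"):
        with open(os.path.join(src, name), "w") as fh:
            fh.write(">" + name[:-3] + "\nACGT\n")

    # Write the merged file to a separate directory so it is never
    # picked up alongside the per-chromosome files.
    dest = os.path.join(tempfile.mkdtemp(), "genome.fa")
    concatenate(list_fasta_sorted(src), dest)
    print(open(dest).read())
```

Writing genome.fa into its own directory, as the demo does, matches the advice above about keeping only the multi-fasta file where the indexer will look.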
    Last edited by dpryan; 02-27-2014, 04:28 AM.



    • #17
      v0.3.0

      I've just released version 0.3.0, which should address the problem I mentioned in my last post as well as a few other small bugs. I should note that you can now track the development version(s) of bison on GitHub. I have a few branches (some not yet on GitHub) implementing discordant/mixed alignments and using the development version of samtools/htslib.
      • Note: The indices produced by previous versions are not guaranteed to be compatible unless you used a multi-fasta file. There was a serious implementation problem with how bison_index worked when given multiple files as input and how multiple files were read into memory in previous versions. If you used a multi-fasta file, then everything will continue to work correctly. However, if you used multiple fasta files in a list then I strongly encourage you to delete the previous indices (just remove the bisulfite_genome directory) and reindex. The technical reasons for this issue are that when the bison tools previously read multiple fasta files into memory, they would do so in whatever order they appeared in the directory structure, which can change over time and isn't guaranteed to match the order of files someone specified during indexing. While the alignments wouldn't be affected by this, the methylation calls could have been seriously compromised. In this version, bison_index will only accept a directory, not a list of files, and it will always alphasort() the list of files in that directory prior to processing. This should eliminate this problem. My apologies to anyone affected by this.
      • Added a --genome-size option to a number of the tools. Many of the bison programs need to read the genome into memory. By default, 3 gigabases worth of memory is allocated for that, and the size is increased as needed. For smaller genomes, this wastes space. For larger genomes, the constant reallocation of space can seriously slow things down. Consequently, this option was added to every tool that reads the genome into memory. It's convenient to overestimate the size slightly, so if your genome is 3.8 gigabases, then just use 4000000000 as the genome size.
      • bison_merge_CpGs can now take multiple input files at once.
      • A number of small bug fixes, such as when "genome_dir" doesn't end in a /.
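If you are unsure what number to pass as the genome size, the arithmetic can be sketched in Python as below. This is only an illustration under stated assumptions: `count_bases` and `suggest_genome_size` are hypothetical helpers, not part of bison, and the 500-megabase rounding step is an arbitrary choice that happens to reproduce the 3.8 gigabase to 4000000000 example above.

```python
def count_bases(lines):
    """Count sequence characters in FASTA text, skipping '>' header lines.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total


def suggest_genome_size(n_bases, step=500_000_000):
    """Round n_bases up to the next multiple of `step` (ceiling division).

    A value that is already an exact multiple is returned unchanged.
    """
    return -(-n_bases // step) * step


if __name__ == "__main__":
    # A 3.8 gigabase genome rounds up to 4 gigabases.
    print(suggest_genome_size(3_800_000_000))
```

For a real genome you would pass an open file to `count_bases`, for example `with open("genome.fa") as fh: n = count_bases(fh)`, and then give the rounded value to the --genome-size option.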



      • #18
        It seems that I missed posting when I released version 0.3.1. Anyway, I've just released version 0.3.2. Changes of note are below, though the biggest one is support for HTSlib. I should note that I've also created a tutorial with compilation instructions and a couple of example datasets, available here.
        • Added bedGraph2MOABS to convert bedGraph files for use by MOABS.
        • Added support for HTSlib.
        • Fixed a small bug wherein --reorder wasn't being invoked when multiple output BAM files were to be used.
        • Fixed a small bug that only manifested in DEBUG mode.
        • There is now a tutorial.
        • The default minimum MAPQ and Phred scores used by bison_mbias have been updated to match bison_methylation_extractor.



        • #19
          I've just posted version 0.3.2b, which fixes the Makefile so that bison will use the static htslib file. Otherwise, users would need to keep htslib around (convenient for me, but probably not for you).



          • #20
            I've just posted version 0.3.3, which supports discordant and singleton alignments. The tutorial has also been updated to demonstrate how to suppress such alignments, if desired.

            Bison has now been published. If you use it in your research, please cite the paper here.

            I'll note that the next version will add support for CRAM files.
