  • kevlim83
    Junior Member
    • Jan 2010
    • 9

    bowtie reference genome index: help required

    Dear all,

    We are having trouble indexing our reference genome with bowtie-build, because our reference is larger than 4 billion characters. According to the manual, this is not possible. Is there a solution that does not require modifying the source code?

    We would like to treat source code modification as a last resort. Even so, we would appreciate any insight into how we could modify the source code to handle a 6 billion character genome.

    Regards,
    Kevin
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    I am guessing it has something to do with 32-bit integers, and so you would have to change the index source code to store 64-bit integers, which would double the index size instantly.

    Could you split your reference and align to each separately and merge the results? This is not as faithful to the bowtie algorithm but seems like a practical solution.
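The guess about 32-bit integers can be checked with a little arithmetic. The sketch below is only an illustration, not bowtie code: `fits_in_32_bits` and `offset_table_bytes` are hypothetical helper names, and the doubling applies to the parts of an index that are stored as offsets, which may not be the whole index.

```python
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold


def fits_in_32_bits(reference_length: int) -> bool:
    """Return True if a reference of this length stays within the 2^32-1 limit."""
    return reference_length <= UINT32_MAX


def offset_table_bytes(num_entries: int, bits: int) -> int:
    """Bytes needed to store num_entries offsets of the given bit width."""
    return num_entries * (bits // 8)


genome_length = 6_000_000_000  # the 6 billion character genome from the question
print(UINT32_MAX)                      # 4294967295
print(fits_in_32_bits(genome_length))  # False

# Any table of offsets doubles in size when moving from 32-bit to 64-bit entries.
print(offset_table_bytes(1_000, 64) // offset_table_bytes(1_000, 32))  # 2
```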

    • kevlim83
      Junior Member
      • Jan 2010
      • 9

      #3
      Hi,

      Thanks for the reply.

      Can anyone guide me as to where the pointers I need to change are located?

      Regards,
      Kevin

      • sperry
        Junior Member
        • Feb 2010
        • 7

        #4
        Hi Kevin,

        Trying to update the source code could be more trouble than it is worth. If it were simply a matter of changing a few pointers, the author would likely have done that rather than adding this disclaimer to the manual:

        "Because bowtie-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each.

        If your computer has more than 3-4 GB of memory and you would like to exploit that fact to make index building faster, use a 64-bit version of the bowtie-build binary. The 32-bit version of the binary is restricted to using less than 4 GB of memory. If a 64-bit pre-built binary does not yet exist for your platform on the sourceforge download site, you will need to build one from source."

        Have you tried any of the other aligners? I have had good experiences with BWA, although I haven't tried it with a 6-billion-base reference sequence.

        If you are committed to Bowtie, splitting your reference sequence into two files will get you up and running, as others have pointed out.
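The manual's advice to divide the reference into batches could be sketched roughly as follows. This is a minimal illustration, not part of bowtie: `read_fasta` and `batch_records` are hypothetical helpers, only whole sequences are grouped (a single sequence longer than the limit would still need to be cut into chunks), and the limit should sit comfortably below 2^32-1 because the manual says the real ceiling is somewhat lower.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def batch_records(records: Iterable[Record], max_chars: int) -> List[List[Record]]:
    """Group whole sequences into batches whose total length stays within max_chars."""
    batches, current, size = [], [], 0
    for header, seq in records:
        if len(seq) > max_chars:
            raise ValueError(f"{header} alone exceeds {max_chars} characters")
        if current and size + len(seq) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        batches.append(current)
    return batches


# Toy input with a tiny limit so the grouping is easy to follow.
toy = [">chr1", "ACGTAC", ">chr2", "GGG", ">chr3", "TTTTT"]
batches = batch_records(read_fasta(toy), max_chars=8)
print([[header for header, _ in batch] for batch in batches])
# [['chr1'], ['chr2', 'chr3']]
```

Each batch would then be written to its own FASTA file and indexed separately.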

        • kevlim83
          Junior Member
          • Jan 2010
          • 9

          #5
          Yes, we also think that messing around with source code is a cumbersome task indeed.

          However, we want to do so because we want bowtie to find reads that align uniquely to a given reference genome using the "-m 1 --best --strata" options. If we split the reference genome into two, we are essentially running bowtie once per split. Even if we have a correct way to merge these result sets to obtain the unique alignments, this is not the same as running the same options on the combined reference, because we are looking for alignments that are unique at the "best strata" level. Splitting the reference lets bowtie report alignments that are "best strata" unique only within a subset.
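A toy model may make the concern concrete. The sketch below does not call bowtie; `unique_best_stratum` is a hypothetical function that imitates the described behaviour of "-m 1 --best --strata", with the mismatch count standing in for the stratum. It shows a read that is suppressed on the combined reference but reported as unique by one of the splits.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, int]  # (reference split the hit falls in, number of mismatches)


def unique_best_stratum(hits: List[Hit]) -> Optional[Hit]:
    """Keep a read only if exactly one hit falls in its best (lowest-mismatch) stratum."""
    if not hits:
        return None
    best = min(mismatches for _, mismatches in hits)
    in_best = [hit for hit in hits if hit[1] == best]
    return in_best[0] if len(in_best) == 1 else None


# One read with two perfect hits in split A and one 1-mismatch hit in split B.
hits = [("A", 0), ("A", 0), ("B", 1)]

combined = unique_best_stratum(hits)
per_split = {
    name: unique_best_stratum([hit for hit in hits if hit[0] == name])
    for name in ("A", "B")
}

print(combined)   # None: two hits share the best stratum, so the read is suppressed
print(per_split)  # {'A': None, 'B': ('B', 1)}: split B alone reports it as unique
```

Because split A reports nothing for this read, the merged output has no record that a better stratum existed, so the hit from split B cannot be recognised as spurious afterwards.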

          Hence, we are left with the last resort which is to modify the source code.

          Any form of help is truly appreciated here. Thanks.

          Regards,
          Kevin

          • nilshomer
            Nils Homer
            • Nov 2008
            • 1283

            #6
            What about using a different aligner?

            • sperry
              Junior Member
              • Feb 2010
              • 7

              #7
              Hi Kevin,

              Take a look at the ebwt.h file in the bowtie source distribution. This file outlines the ebwt-related classes. Searching for 'int', 'uint32_t', and 'int32_t' should give you an idea of where you can start to modify the code.

              You might also find it useful to compile bowtie using the '-ggdb' flag, and then try invoking bowtie-build with your large reference sequence within gdb to see exactly where things are breaking down.
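The search suggested above could be scripted along these lines. This is only a sketch: `find_int_types` is a hypothetical helper, and the sample lines are made up for illustration rather than taken from ebwt.h.

```python
import re
from typing import Iterable, List, Tuple

# Whole-word matches only, so "int" is not reported inside "uint32_t" or "print".
INT_TYPE = re.compile(r"\b(?:uint32_t|int32_t|int)\b")


def find_int_types(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, matched type) for every integer type keyword found."""
    found = []
    for number, line in enumerate(lines, start=1):
        for match in INT_TYPE.finditer(line):
            found.append((number, match.group(0)))
    return found


# Made-up sample lines; for the real header, pass an open file object instead,
# e.g. `with open("ebwt.h") as handle: find_int_types(handle)`.
sample = [
    "uint32_t _len;",
    "int32_t off = -1;",
    "for (int i = 0; i < n; i++) {",
    "// print status",
]
print(find_int_types(sample))
# [(1, 'uint32_t'), (2, 'int32_t'), (3, 'int')]
```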

              -Scott

              Last edited by sperry; 03-01-2010, 08:21 AM.

              • chadn737
                Senior Member
                • Jan 2009
                • 392

                #8
                An old thread, but I am currently in a similar situation. I have a polyploid genome of >10 Gb that I have to work with. Does anybody have recommendations on altering bowtie for this?

                Alternatively, are there any good strategies for post-processing data aligned to individual chunks to achieve the same result?

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  I think BWA can handle larger genomes; that would be the easiest solution.

                  BTW, you can split a genome, map all the reads to each of the chunks with bowtie2, and then process the results to produce results equivalent to what would have been produced had you aligned to the genome as a whole with bowtie2, but it's not completely trivial. This is effectively how bisulfite-seq aligners work (see the source code for Bison if you really want to see how to do this).
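One simplified way to picture the post-processing step is sketched below. It is not how Bison or bowtie2 do it, and it only makes a uniqueness call rather than reproducing a whole-genome run. It assumes each read's best hit per chunk has been parsed into a record holding the alignment score and the second-best score within that chunk (the values Bowtie 2 writes in the AS:i and XS:i SAM fields); `ChunkHit`, `pick_unique` and `merge_chunks` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Optional


class ChunkHit(NamedTuple):
    chunk: str
    position: int
    score: int                  # alignment score, higher is better
    second_best: Optional[int]  # next-best score in the same chunk, None if absent


def pick_unique(hits: List[ChunkHit]) -> Optional[ChunkHit]:
    """Return the single best hit across chunks, or None if the best score is tied."""
    if not hits:
        return None
    best_score = max(hit.score for hit in hits)
    best = [hit for hit in hits if hit.score == best_score]
    if len(best) != 1:
        return None  # equally good hits in different chunks
    winner = best[0]
    if winner.second_best is not None and winner.second_best >= best_score:
        return None  # equally good hit elsewhere in the same chunk
    return winner


def merge_chunks(per_chunk: Dict[str, Dict[str, ChunkHit]]) -> Dict[str, ChunkHit]:
    """Combine per-chunk results (chunk -> read -> hit) into one unique hit per read."""
    by_read: Dict[str, List[ChunkHit]] = {}
    for hits in per_chunk.values():
        for read, hit in hits.items():
            by_read.setdefault(read, []).append(hit)
    merged = {}
    for read, hits in by_read.items():
        winner = pick_unique(hits)
        if winner is not None:
            merged[read] = winner
    return merged


per_chunk = {
    "A": {
        "r1": ChunkHit("A", 100, 0, None),    # clear winner
        "r2": ChunkHit("A", 250, -6, None),   # ties with chunk B
        "r3": ChunkHit("A", 900, 0, 0),       # ties within chunk A
    },
    "B": {
        "r1": ChunkHit("B", 40, -12, None),
        "r2": ChunkHit("B", 77, -6, None),
    },
}
merged = merge_chunks(per_chunk)
print(sorted(merged))  # ['r1']
```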

                  • chadn737
                    Senior Member
                    • Jan 2009
                    • 392

                    #10
                    This is for bisulphite sequencing. The problem is that my lab uses a specific pipeline for our analysis, and we work closely with the developers. Bowtie is a standard part of that protocol, and I have already used this pipeline to analyze A LOT of data; this is the first time I have run into problems. I would really like to avoid using any other aligner, because achieving results identical to Bowtie's would be a headache in itself.

                    That being said, I think I have successfully modified bowtie-build. Whether or not this works I can't say until it's finished and I have had a chance to align some data, but it seems to be working.

                    • Timothy Amos
                      Junior Member
                      • Aug 2014
                      • 4

                      #11
                      Originally posted by kevlim83:
                      "We are having trouble indexing our reference genome with bowtie-build, because our reference is larger than 4 billion characters. According to the manual, this is not possible."

                      I know this is a four-year-old question, but the Bowtie 2 manual says it can now deal with this (the current version is Bowtie 2 2.2.4):

                      Small and large indexes

                      bowtie2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, bowtie2-build builds a "small" index using 32-bit numbers in various parts of the index. When the genome is longer, bowtie2-build builds a "large" index using 64-bit numbers. Small indexes are stored in files with the .bt2 extension, and large indexes are stored in files with the .bt2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.
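If you ever want to know which kind of index was built, the file extensions quoted above are enough to tell. A minimal sketch, where `index_kind` is a hypothetical helper and the file names in the example are made up:

```python
from typing import Iterable, Optional


def index_kind(filenames: Iterable[str]) -> Optional[str]:
    """Classify an index as 'large' (.bt2l), 'small' (.bt2), or None if neither is present."""
    names = list(filenames)
    if any(name.endswith(".bt2l") for name in names):
        return "large"
    if any(name.endswith(".bt2") for name in names):
        return "small"
    return None


# Made-up file listings for illustration.
print(index_kind(["genome.1.bt2", "genome.2.bt2"]))  # small
print(index_kind(["wheat.1.bt2l"]))                  # large
print(index_kind(["genome.fa"]))                     # None
```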

                      • zillur
                        Senior Member
                        • Sep 2014
                        • 106

                        #12
                        Hi,
                        I have to map reads to the yeast genome using bowtie2. Where can I download the reference genome?

                        The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae.

                        Best Regards
                        Zillur
