**nilshomer** · 02-24-2010, 10:00 AM

Originally posted by kevlim83:

Dear all,

We are facing some problems indexing our reference genome with bowtie-build, as our reference is larger than 4 billion characters. According to the manual, this is not possible. Is there a solution that does not require modifying the source code?

We would like to treat source code modification as a last resort. In any case, we would also appreciate any insights into how we could modify the source code to handle a 6 billion character genome.

Regards,
Kevin

I am guessing it has something to do with 32-bit integers, and so you would have to change the index source code to store 64-bit integers, which would double the index size instantly.

Could you split your reference and align to each separately and merge the results? This is not as faithful to the bowtie algorithm but seems like a practical solution.
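As a rough sketch of the splitting step, sequences can be grouped greedily into batches that each stay under a size limit, so that a separate index can be built per batch. The function name `plan_batches` and the input format (a list of name and length pairs) are illustrative assumptions, not part of bowtie; the limit is passed in as a parameter.

```python
def plan_batches(seq_lengths, max_chars):
    """Group sequences into batches whose total length is at most max_chars.

    seq_lengths: iterable of (name, length) pairs, in reference order.
    Whole sequences are kept together; a single sequence longer than
    max_chars cannot be placed and raises ValueError.
    """
    batches = []
    current, current_len = [], 0
    for name, length in seq_lengths:
        if length > max_chars:
            raise ValueError(f"{name} alone exceeds the limit of {max_chars}")
        # Start a new batch when adding this sequence would overflow the limit.
        if current and current_len + length > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(name)
        current_len += length
    if current:
        batches.append(current)
    return batches


# Toy lengths for illustration only.
print(plan_batches([("chr1", 3), ("chr2", 3), ("chr3", 2)], max_chars=5))
```

Greedy grouping in reference order is not guaranteed to give the fewest batches, but it keeps each sequence whole and is easy to reproduce.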

**kevlim83** · 02-24-2010, 07:04 PM

Hi,

Thanks for the reply.

Can anyone guide me as to where the pointers I need to change are located?

Regards,
Kevin

**sperry** · 02-26-2010, 10:32 AM

Hi Kevin,

Trying to update the source code could be more trouble than it is worth. If it were simply a matter of changing a few pointers, the author likely would have done that rather than adding this disclaimer to the manual:

Because bowtie-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each.

If your computer has more than 3-4 GB of memory and you would like to exploit that fact to make index building faster, use a 64-bit version of the bowtie-build binary. The 32-bit version of the binary is restricted to using less than 4 GB of memory. If a 64-bit pre-built binary does not yet exist for your platform on the sourceforge download site, you will need to build one from source.

Have you tried any of the other aligners? I have had good experiences with BWA, although I haven't tried it with a 6 billion base reference sequence.

If you are committed to Bowtie, splitting your reference sequence into two files will get you up and running, as others have pointed out.
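For reference, the arithmetic behind the limit quoted from the manual can be worked out as below. This is only a lower bound on the number of batches: the manual notes that the actual ceiling is somewhat less than 2^32-1, and the helper name `min_batches` is an illustrative assumption.

```python
MAX_INDEX_CHARS = 2**32 - 1  # theoretical ceiling quoted from the manual


def min_batches(reference_chars, limit=MAX_INDEX_CHARS):
    """Smallest number of batches such that each holds at most `limit` characters."""
    return -(-reference_chars // limit)  # integer ceiling division


print(MAX_INDEX_CHARS)             # 4294967295
print(min_batches(6_000_000_000))  # 2
```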

**kevlim83** · 02-28-2010, 06:48 PM

Yes, we agree that modifying the source code would be cumbersome.

However, we want bowtie to find reads that align uniquely to a given reference genome using the "-m 1 --best --strata" parameters. If we split the reference genome in two, we are essentially running bowtie once for each reference split. Even if we had a correct way to merge these result sets to obtain the unique alignments, this is not the same as running the same parameters on a combined reference, because we are looking for unique alignments at the "best strata" level. With a split reference, bowtie reports alignments that are "best strata" unique only within a subset.

Hence, we are left with the last resort which is to modify the source code.

Any form of help is truly appreciated here. Thanks.

Regards,
Kevin
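To make the merging concern concrete, here is a minimal sketch of what a post-hoc merge would have to do. It assumes each chunk was aligned so that all best-stratum hits per read are reported (for example with `-a --best --strata` instead of `-m 1`, since `-m 1` would suppress multi-hit reads within a chunk and lose that information). The tuple layout `(read_id, chunk, position, stratum)` and the function name are hypothetical, not a bowtie output format.

```python
from collections import defaultdict


def unique_best_stratum(hits):
    """Keep reads with exactly one alignment in their best stratum overall.

    hits: iterable of (read_id, chunk, position, stratum) tuples pooled
    from all chunks; a lower stratum means a better alignment.
    Returns a dict mapping read_id to its single best hit.
    """
    by_read = defaultdict(list)
    for hit in hits:
        by_read[hit[0]].append(hit)

    unique = {}
    for read_id, read_hits in by_read.items():
        best = min(h[3] for h in read_hits)           # best stratum across all chunks
        in_best = [h for h in read_hits if h[3] == best]
        if len(in_best) == 1:                         # unique at the best stratum
            unique[read_id] = in_best[0]
    return unique


# Toy data for illustration only.
hits = [
    ("r1", "A", 100, 0), ("r1", "B", 50, 1),  # best hit is unique: keep
    ("r2", "A", 10, 1), ("r2", "B", 20, 1),   # tie across chunks: drop
    ("r3", "B", 5, 2),                        # single hit: keep
]
print(sorted(unique_best_stratum(hits)))
```

Read r2 shows the problem described above: each chunk on its own sees a unique hit, and only the pooled comparison reveals the tie.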

**nilshomer** · 02-28-2010, 06:59 PM

What about using a different aligner?

**sperry** · 03-01-2010, 07:58 AM

Hi Kevin,

Take a look at the ebwt.h file in the bowtie source distribution. This file outlines the ebwt-related classes. Searching for 'int', 'uint32_t', and 'int32_t' should give you an idea of where you can start to modify the code.

You might also find it useful to compile bowtie using the '-ggdb' flag, and then try invoking bowtie-build with your large reference sequence within gdb to see exactly where things are breaking down.

-Scott

**chadn737** · 01-31-2014, 08:53 AM

An old thread, but I am currently in a similar situation. I have a polyploid genome of >10 Gb that I have to work with. Does anybody have any recommendations on altering bowtie for this?

Alternatively, any good strategies at post-processing data aligned to individual chunks to achieve the same result?

**dpryan** · 01-31-2014, 11:47 AM

I think BWA can handle larger genomes; that'd be the easiest solution.

BTW, you can split a genome, map all the reads to each of the chunks with bowtie2, and then process the results to produce results equivalent to what would have been produced had you aligned to the genome as a whole with bowtie2, but it's not completely trivial. This is effectively how bisulfite-seq aligners work (see the source code for Bison if you really want to see how to do this).

**chadn737** · 01-31-2014, 11:52 AM

This is for bisulphite sequencing. The problem is that my lab uses a specific pipeline for our analysis, and we work closely with the developers. Bowtie is a standard part of that protocol and I have already used this pipeline for analyzing A LOT of data, this being the first time I have run into problems. I would really like to avoid using any other aligner, because the effort of achieving results identical to Bowtie's would be a headache in itself.

That being said, I think I have successfully modified bowtie-build. Whether or not this works I can't say until it's finished and I have had a chance to align some data, but it seems to be working.

**Timothy Amos** · 11-27-2014, 02:24 PM

Originally posted by kevlim83:

We are facing some problems indexing our reference genome with bowtie-build, as our reference is larger than 4 billion characters. According to the manual, this is not possible.

I know this is a four-year-old question, but Bowtie 2 says it can now deal with this (the current version is Bowtie2 2.2.4):

Small and large indexes

bowtie2-build can index reference genomes of any size. For genomes less than about 4 billion nucleotides in length, bowtie2-build builds a "small" index using 32-bit numbers in various parts of the index. When the genome is longer, bowtie2-build builds a "large" index using 64-bit numbers. Small indexes are stored in files with the .bt2 extension, and large indexes are stored in files with the .bt2l extension. The user need not worry about whether a particular index is small or large; the wrapper scripts will automatically build and use the appropriate index.

Bowtie 2: Manual

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
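If you want to confirm which kind of index was built, one option is to look at the file extensions described in the manual excerpt. The sketch below assumes the usual `<prefix>.1.bt2` or `<prefix>.1.bt2l` file naming; the helper name `index_kind` is an illustrative assumption and not part of Bowtie 2.

```python
from pathlib import Path


def index_kind(prefix):
    """Return "large", "small", or None for a Bowtie 2 index prefix.

    Only checks whether the first index file exists on disk
    (<prefix>.1.bt2l for a large index, <prefix>.1.bt2 for a small one).
    """
    if Path(f"{prefix}.1.bt2l").exists():
        return "large"
    if Path(f"{prefix}.1.bt2").exists():
        return "small"
    return None


# "genome" is a placeholder index prefix.
print(index_kind("genome"))
```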

**zillur** · 12-09-2014, 07:36 PM

Hi,

I have to map against the yeast genome using bowtie2. Where can I download the genome?

http://www.ebi.ac.uk/ena/data/search?query=yeast

http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/

http://www.yeastgenome.org/download-data/sequence

http://www.yeastgenome.org/strain/S288C/overview

Where can I find the reference genome?

Best Regards
Zillur

bowtie reference genome index: help required