**GenoMax** · 10-07-2014, 07:48 AM

Although the TopHat web page no longer says so explicitly (the last mention was for v2.0.11), it is still likely that TopHat does not support 64-bit ("large") bowtie2 indexes. I think that is what you have generated.

According to the manual, bowtie2-build should generate normal (small) indexes if the reference has fewer than 4 billion nucleotides. I am not sure why you are getting large indexes.
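
To check which side of that limit a reference falls on, something like the following could be used. It is only a rough sketch: `count_bases` and `expected_index_type` are made-up helper names, the 4,000,000,000 cutoff is taken from the "4 billion nucleotides" wording in the bowtie2-build help, and the wrapper's actual rule may differ (it may, for instance, look at file size rather than sequence length).

```shell
#!/usr/bin/env bash

# Count sequence characters in a FASTA file, skipping header lines.
# Every residue character is counted, including N padding.
count_bases() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { printf "%.0f\n", total + 0 }' "$1"
}

# Print "small" or "large" depending on whether the base count is below
# the assumed 4 billion nucleotide limit (an assumption, see above).
expected_index_type() {
    local bases limit=4000000000
    bases=$(count_bases "$1")
    if [ "$bases" -lt "$limit" ]; then
        echo small
    else
        echo large
    fi
}
```

For example, `expected_index_type Homo_sapiens.GRCh38.dna.toplevel.fa` would print `small` or `large` for that file.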

**blancha** · 10-07-2014, 07:57 AM

How do I generate the smaller (32-bit) index files?

There is no option in bowtie2-build.

I used the following simple command to generate the index files.

Code:

bowtie2-build Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
&> bowtie2_build.sh.log

Code:

Bowtie 2 version 2.2.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
    reference_in            comma-separated list of files with ref sequences
    bt2_index_base          write bt2 data to files with this dir/basename
*** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
Options:
    -f                      reference files are Fasta (default)
    -c                      reference sequences given on cmd line (as
                            <reference_in>)
    --large-index           force generated index to be 'large', even if ref
                            has fewer than 4 billion nucleotides
    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
    -p/--packed             use packed strings internally; slower, less memory
    --bmax <int>            max bucket sz for blockwise suffix-array builder
    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
    --dcv <int>             diff-cover period for blockwise (default: 1024)
    --nodc                  disable diff-cover (algorithm becomes quadratic)
    -r/--noref              don't build .3/.4 index files
    -3/--justref            just build .3/.4 index files
    -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
    --seed <int>            seed for random number generator
    -q/--quiet              verbose output (for debugging)
    -h/--help               print detailed description of tool and its options
    --usage                 print this usage message
    --version               print version information and quit

**kmcarr** · 10-07-2014, 08:18 AM

Originally posted by blancha:

How do I generate the smaller (32-bit) index files?

There is no option in bowtie2-build.

I used the following simple command to generate the index files.

Code:

bowtie2-build Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
&> bowtie2_build.sh.log

bowtie2-build is really just a small wrapper script that calls either bowtie2-build-s (for 'small' genomes) or bowtie2-build-l (for 'large' ones). While it is not recommended, you could try running bowtie2-build-s directly, e.g.

Code:

bowtie2-build-s Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
&> bowtie2_build.sh.log

I do not know if this will work for GRCh38.
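
Either way, it is easy to check afterwards which kind of index was written, since small index files end in .bt2 and large ones in .bt2l. A minimal sketch (`index_kind` is just an illustrative helper name, not part of bowtie2):

```shell
#!/usr/bin/env bash

# Print "small", "large", or "none" depending on which index files exist
# for the given basename. Small indexes use .bt2, large ones use .bt2l.
index_kind() {
    local base=$1
    if [ -e "${base}.1.bt2" ]; then
        echo small
    elif [ -e "${base}.1.bt2l" ]; then
        echo large
    else
        echo none
    fi
}
```

Calling `index_kind Homo_sapiens.GRCh38.dna.toplevel` in the directory holding the index files would report which type is present.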

**GenoMax** · 10-07-2014, 08:53 AM

Homo_sapiens.GRCh38.dna.toplevel.fa from Ensembl is 36G in size. In addition to the chromosomes, it appears to contain alternate haplotypes for a number of locations/scaffolds. No wonder bowtie2 is building large indexes.

I am going to see if I can find a link for just the chromosomes.
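
In the meantime, a chromosomes-only file could be carved out of the toplevel FASTA. This is only a sketch: `filter_fasta` is a hypothetical helper, and it assumes that the sequence name is the first word of each header line and that the chromosomes are named 1-22, X, Y and MT.

```shell
#!/usr/bin/env bash

# Keep only records whose name (first word after '>') is in the wanted list.
# Usage: filter_fasta in.fa "1 2 3 X Y MT" > out.fa
filter_fasta() {
    awk -v wanted="$2" '
        BEGIN {
            # Build a lookup table of the names to keep.
            n = split(wanted, names, " ")
            for (i = 1; i <= n; i++) keep[names[i]] = 1
        }
        # On each header, decide whether the following lines are printed.
        /^>/ { name = substr($1, 2); printing = (name in keep) }
        printing { print }
    ' "$1"
}
```

In bash this could be run as `filter_fasta Homo_sapiens.GRCh38.dna.toplevel.fa "$(echo {1..22} X Y MT)" > chromosomes_only.fa`, where the output file name is a placeholder. Note that it also drops any unplaced scaffolds, which may or may not be what you want.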

**blancha** · 10-07-2014, 09:20 AM

@GenoMax, @kmcarr

Thank you both for your help.

I've downloaded Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, which excludes haplotypes and patches.
bowtie2-build built the smaller (.bt2) index files from this file.
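
For anyone landing here later, the whole sequence can be sketched roughly as below. `run`, `build_and_align`, `reads.fastq` and `tophat_out` are placeholders for illustration rather than anything from the actual run; set DRY_RUN=1 to print the commands instead of executing them.

```shell
#!/usr/bin/env bash

# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Decompress the reference, build the bowtie2 index, then align with TopHat.
# The index basename is the FASTA name without .fa, so the FASTA sits next
# to the index files under the same basename.
build_and_align() {
    local gz=$1 reads=$2
    local fa=${gz%.gz}
    local base=${fa%.fa}
    run gunzip "$gz"                            # replaces the .gz with the .fa
    run bowtie2-build "$fa" "$base"
    run tophat2 -o tophat_out "$base" "$reads"  # tophat_out is a placeholder
}
```

A dry run would look like `DRY_RUN=1 build_and_align Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz reads.fastq`.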

Since I was interested in novel transcript discovery in addition to gene expression quantification, I wanted to use the most complete genome version available, so I was using Homo_sapiens.GRCh38.dna.toplevel.fa.gz. In hindsight, Homo_sapiens.GRCh38.dna.primary_assembly.fa was probably the more appropriate choice.

The following description of the files uses GRCh37 in its examples, but I downloaded it from the GRCh38 directory on the Ensembl FTP site.

Code:

---------
TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromosomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.

EXAMPLES

  Toplevel sequences unmasked:
    Homo_sapiens.GRCh37.dna.toplevel.fa.gz
  
  Toplevel soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
    Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding haplotypes
and patches. This file is best used for performing sequence similarity searches
where patch and haplotype sequences would confuse analysis.   

EXAMPLES

  Primary assembly sequences unmasked:
    Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
  
  Primary assembly soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
    Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz

How to run Tophat2 with GRCh38?