  • How to run TopHat2 with GRCh38?

    Hi,

    This is a very simple question that I'm hopeful someone has resolved already.

    How does one run TopHat2 with GRCh38?

    I've downloaded the reference genome from Ensembl.
    I've indexed the reference genome with bowtie2-build.

    The problem is that bowtie2-build generates large index files with the extension .bt2l, which TopHat does not recognize.

    What should I do?
    Would an older version of Bowtie2 allow me to generate bt2 files?

    Someone must have resolved this problem.
    iGenomes does not yet provide indexes for GRCh38.
    I'm happy with TopHat and don't want to switch to STAR, although I find this issue annoying and perplexing.

    The problem has been reported in the Tuxedo user group, but no solution has been provided.


    TopHat v2.0.12
    Bowtie2 version 2.2.3

    Error: Could not find Bowtie 2 index files (/stockage/genomes/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/Homo_sapiens.GRCh38.dna.toplevel.*.bt2)

    Thank you for your help.

  • #2
    Although the TopHat web page no longer says so explicitly (the last mention was for v2.0.11), TopHat most likely still does not support 64-bit (large) Bowtie 2 indexes. I think that is what you have generated.

    According to the manual, bowtie2-build should generate normal (small) indexes if the reference is < 4 gigabases. Not sure why you are getting large indexes.
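
One way to check the reference against that limit is to count its bases directly. The following is only a rough sketch: `count_bases` and `expected_index` are hypothetical helper names, and the cutoff is taken from the "4 billion nucleotides" wording in the bowtie2-build usage text. How the wrapper actually measures the reference is not confirmed here.

```shell
# Count reference bases in a FASTA file; header lines (starting with ">") are skipped.
# count_bases is a hypothetical helper, not part of Bowtie 2.
count_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n + 0 }' "$1"
}

# Hypothetical helper: which index type to expect for a given base count,
# assuming the 4-billion-nucleotide cutoff quoted in the usage text.
expected_index() {
    if [ "$1" -lt 4000000000 ]; then
        echo small
    else
        echo large
    fi
}
```

For example, `expected_index "$(count_bases Homo_sapiens.GRCh38.dna.toplevel.fa)"` would print `small` or `large` for the file used in this thread. Note that N padding counts as sequence here.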
    Last edited by GenoMax; 10-07-2014, 07:58 AM.


    • #3
      How do I generate the smaller (32-bit) index files?

      There is no option in bowtie2-build.

      I used the following simple command to generate the index files.

      Code:
      bowtie2-build Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
      &> bowtie2_build.sh.log
      Code:
      Bowtie 2 version 2.2.3 by Ben Langmead ([email protected], www.cs.jhu.edu/~langmea)
      Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
          reference_in            comma-separated list of files with ref sequences
          bt2_index_base          write bt2 data to files with this dir/basename
      *** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
      Options:
          -f                      reference files are Fasta (default)
          -c                      reference sequences given on cmd line (as
                                  <reference_in>)
          --large-index           force generated index to be 'large', even if ref
                                  has fewer than 4 billion nucleotides
          -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
          -p/--packed             use packed strings internally; slower, less memory
          --bmax <int>            max bucket sz for blockwise suffix-array builder
          --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
          --dcv <int>             diff-cover period for blockwise (default: 1024)
          --nodc                  disable diff-cover (algorithm becomes quadratic)
          -r/--noref              don't build .3/.4 index files
          -3/--justref            just build .3/.4 index files
          -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
          -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
          --seed <int>            seed for random number generator
          -q/--quiet              verbose output (for debugging)
          -h/--help               print detailed description of tool and its options
          --usage                 print this usage message
          --version               print version information and quit
      Last edited by blancha; 10-07-2014, 08:07 AM.


      • #4
        Originally posted by blancha
        How do I generate the smaller (32-bit) index files?

        bowtie2-build is really just a small wrapper script that calls either bowtie2-build-s ('small' genomes) or bowtie2-build-l ('large'). While not recommended, you could try calling bowtie2-build-s directly, e.g.
        Code:
        bowtie2-build-s Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
        &> bowtie2_build.sh.log
        I do not know if this will work for GRCh38.
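
Whichever binary ends up being run, it may help to confirm which kind of index files were actually written before pointing TopHat at them. A minimal sketch, assuming the usual `<base>.1.bt2` / `<base>.1.bt2l` naming; `index_kind` is a made-up helper name, not a Bowtie 2 or TopHat command:

```shell
# Report whether a Bowtie 2 index basename has small (.bt2) or large (.bt2l) files.
# index_kind is a hypothetical helper, not part of Bowtie 2 or TopHat.
index_kind() {
    base=$1
    if [ -e "$base.1.bt2" ]; then
        echo small
    elif [ -e "$base.1.bt2l" ]; then
        echo large
    else
        echo missing
    fi
}
```

Usage would be along the lines of `index_kind Homo_sapiens.GRCh38.dna.toplevel`. Going by the error message in the first post, TopHat 2.0.12 only looks for the `*.bt2` files.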


        • #5
          Homo_sapiens.GRCh38.dna.toplevel.fa from Ensembl is 36G in size. In addition to the chromosomes, it appears to contain alternate haplotypes for a number of locations/scaffolds. No wonder bowtie2-build is producing large indexes.

          I am going to see if I can find a link for just the chromosomes.
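
To see what is actually inside a FASTA file like this, the sequence names and lengths can be listed with awk. This is just a sketch; `seq_lengths` is a hypothetical helper name, and it assumes the sequence name is the first word after `>` on each header line.

```shell
# Print "name<TAB>length" for every sequence in a FASTA file.
# seq_lengths is a hypothetical helper; the name is taken as the first
# whitespace-delimited word of the header, minus the leading ">".
seq_lengths() {
    awk '
        /^>/ {
            if (name != "") printf "%s\t%.0f\n", name, len
            name = substr($1, 2); len = 0; next
        }
        { len += length($0) }
        END { if (name != "") printf "%s\t%.0f\n", name, len }
    ' "$1"
}
```

Something like `seq_lengths Homo_sapiens.GRCh38.dna.toplevel.fa | sort -k2,2nr | head` would then show the longest entries first, and `seq_lengths ... | wc -l` gives the total number of sequences.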


          • #6
            @GenoMax, @kmcarr

            Thank you both for your help.

            I've downloaded Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, which excludes haplotypes and patches.
            bowtie2-build built the smaller .bt2 index files from this file.
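
For anyone repeating this, the steps can be sketched roughly as follows. It assumes gunzip and bowtie2-build are on the PATH and that the primary-assembly file has already been downloaded from the Ensembl FTP site. `build_primary_index` is a hypothetical wrapper name, and the log file name is arbitrary.

```shell
# Sketch only: decompress the reference, build the index, and confirm that a
# small (.bt2) index was produced. build_primary_index is a hypothetical name.
build_primary_index() {
    fa_gz=$1                 # e.g. Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    base=${fa_gz%.fa.gz}     # index basename: input path without the .fa.gz suffix
    gunzip -c "$fa_gz" > "$base.fa" || return 1
    bowtie2-build "$base.fa" "$base" > "$base.bowtie2_build.log" 2>&1 || return 1
    # TopHat 2.0.12 looked for *.bt2 files in the error message from the first
    # post, so succeed only if the small index exists.
    [ -e "$base.1.bt2" ]
}
```

After `build_primary_index Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` succeeds, TopHat can be given the basename `Homo_sapiens.GRCh38.dna.primary_assembly` as its index argument.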

            Since I was interested in novel transcript discovery in addition to gene expression quantification, I wanted to use the most complete genome version available, so I was using Homo_sapiens.GRCh38.dna.toplevel.fa.gz. In hindsight, Homo_sapiens.GRCh38.dna.primary_assembly.fa was probably more appropriate.

            The following description of the files says GRCh37, but it was downloaded from the GRCh38 directory on the Ensembl FTP site.
            Code:
            ---------
            TOPLEVEL
            ---------
            These files contains all sequence regions flagged as toplevel in an Ensembl
            schema. This includes chromsomes, regions not assembled into chromosomes and
            N padded haplotype/patch regions.
            
            EXAMPLES
            
              Toplevel sequences unmasked:
                Homo_sapiens.GRCh37.dna.toplevel.fa.gz
              
              Toplevel soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
                Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz
            
            -----------------
            PRIMARY ASSEMBLY
            -----------------
            Primary assembly contains all toplevel sequence regions excluding haplotypes
            and patches. This file is best used for performing sequence similarity searches
            where patch and haplotype sequences would confuse analysis.   
            
            EXAMPLES
            
              Primary assembly sequences unmasked:
                Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
              
              Primary assembly soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
                Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
            Last edited by blancha; 10-07-2014, 09:35 AM.
