
  • How to run Tophat2 with GRCh38?

    Hi,

    This is a very simple question that I'm hopeful someone has resolved already.

    How does one run Tophat2 with GRCh38?

    I've downloaded the reference genome from Ensembl.
    I've indexed the reference genome with bowtie2-build.

    The problem is that bowtie2-build generates large index files with the extension bt2l that are not recognized by TopHat.

    What should I do?
    Would an older version of Bowtie2 allow me to generate bt2 files?

    Someone must have resolved this problem.
    iGenomes does not yet provide indexes for GRCh38.
    I'm happy with Tophat, and don't want to switch to STAR, although I find this issue annoying and perplexing.

    The problem has been reported in the Tuxedo user group, but no solution has been provided.


    TopHat v2.0.12
    Bowtie2 version 2.2.3

    Error: Could not find Bowtie 2 index files (/stockage/genomes/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/Homo_sapiens.GRCh38.dna.toplevel.*.bt2)
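A quick way to confirm which kind of index is on disk is to look at the file extensions: small indexes end in `.bt2`, large ones in `.bt2l`. The sketch below is only an illustration; `index_kind` is a hypothetical helper name, not part of Bowtie 2 or TopHat.

```shell
# Report which kind of Bowtie 2 index exists for a given index base name.
# index_kind is a hypothetical helper, not part of Bowtie 2 or TopHat.
index_kind() {
    base=$1
    if [ -e "$base.1.bt2" ]; then
        echo "small"      # 32-bit index, the kind TopHat looks for
    elif [ -e "$base.1.bt2l" ]; then
        echo "large"      # 64-bit index, produces the error above
    else
        echo "none"
    fi
}

# Example: index_kind Homo_sapiens.GRCh38.dna.toplevel
```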

    Thank you for your help.

  • #2
    Although the TopHat web page does not say so explicitly (the last mention was for v2.0.11), TopHat most likely still does not support 64-bit (large) Bowtie 2 indexes. I think that is what you have generated.

    According to the manual, bowtie2-build should generate normal (small) indexes if the reference is < 4 gigabases. I'm not sure why you are getting large indexes.
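One way to check the reference against that limit is to count the bases in the FASTA file. This is a rough sketch: the function names are hypothetical, and the 4,000,000,000 cutoff is taken from the "4 billion nucleotides" wording in the bowtie2-build help text, so the exact threshold the wrapper applies may differ.

```shell
# Count sequence characters (everything except header lines) in a FASTA file.
count_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Hypothetical helper: guess which index type bowtie2-build will produce.
# ASSUMPTION: the cutoff is about 4 billion bases, per the help text.
expected_index() {
    if [ "$(count_bases "$1")" -lt 4000000000 ]; then
        echo "small (.bt2)"
    else
        echo "large (.bt2l)"
    fi
}

# Example: expected_index Homo_sapiens.GRCh38.dna.toplevel.fa
```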
    Last edited by GenoMax; 10-07-2014, 07:58 AM.


    • #3
      How do I generate the smaller (32-bit) index files?

      There is no option in bowtie2-build.

      I used the following simple command to generate the index files.

      Code:
      bowtie2-build Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
      &> bowtie2_build.sh.log
      Code:
      Bowtie 2 version 2.2.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
      Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
          reference_in            comma-separated list of files with ref sequences
          bt2_index_base          write bt2 data to files with this dir/basename
      *** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
      Options:
          -f                      reference files are Fasta (default)
          -c                      reference sequences given on cmd line (as
                                  <reference_in>)
          --large-index           force generated index to be 'large', even if ref
                                  has fewer than 4 billion nucleotides
          -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
          -p/--packed             use packed strings internally; slower, less memory
          --bmax <int>            max bucket sz for blockwise suffix-array builder
          --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
          --dcv <int>             diff-cover period for blockwise (default: 1024)
          --nodc                  disable diff-cover (algorithm becomes quadratic)
          -r/--noref              don't build .3/.4 index files
          -3/--justref            just build .3/.4 index files
          -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
          -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
          --seed <int>            seed for random number generator
          -q/--quiet              verbose output (for debugging)
          -h/--help               print detailed description of tool and its options
          --usage                 print this usage message
          --version               print version information and quit
      Last edited by blancha; 10-07-2014, 08:07 AM.


      • #4
        Originally posted by blancha:
        How do I generate the smaller (32-bit) index files?

        There is no option in bowtie2-build.

        bowtie2-build is really just a small wrapper script that calls either bowtie2-build-s (for 'small' genomes) or bowtie2-build-l (for 'large' ones). Although it is not recommended, you could try calling bowtie2-build-s directly, e.g.
        Code:
        bowtie2-build-s Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
        &> bowtie2_build.sh.log
        I do not know if this will work for GRCh38.


        • #5
          Homo_sapiens.GRCh38.dna.toplevel.fa from Ensembl is 36G in size. It appears to contain alternate haplotypes for a number of locations/scaffolds in addition to the chromosomes. No wonder bowtie2-build is building large indexes.

          I am going to see if I can find a link for just the chromosomes.
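To see what a FASTA file actually contains, the header lines (those starting with `>`) can be listed and counted. A minimal sketch; `seq_names` and `seq_count` are hypothetical helper names.

```shell
# Print the name (first word of each header line) of every sequence in a FASTA file.
seq_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}'
}

# Count the sequences in a FASTA file.
seq_count() {
    grep -c '^>' "$1"
}

# Example: seq_count Homo_sapiens.GRCh38.dna.toplevel.fa
```

A count far above the number of chromosomes suggests the file includes extra scaffolds, haplotypes or patches.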


          • #6
            @GenoMax, @kmcarr

            Thank you both for your help.

            I've downloaded Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, which excludes haplotypes and patches.
            bowtie2-build built the smaller bt2 index files from this file.
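The steps described above can be sketched roughly as follows. This is an outline under assumptions, not a tested pipeline: `run` and `index_and_align` are hypothetical helpers, and the read file names and the `tophat_out` output directory are placeholders. With `DRY_RUN` set, the commands are printed instead of executed.

```shell
# Print commands instead of running them when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Build a small Bowtie 2 index from the primary assembly, then align with TopHat.
# $1 = gzipped FASTA; remaining arguments = read files (placeholders).
index_and_align() {
    fa_gz=$1; shift
    fa=${fa_gz%.gz}      # e.g. Homo_sapiens.GRCh38.dna.primary_assembly.fa
    base=${fa%.fa}       # index base name, matching the FASTA name
    run gunzip "$fa_gz" &&
    run bowtie2-build "$fa" "$base" &&
    run tophat2 -o tophat_out "$base" "$@"
}

# Example (prints the three commands without running them):
# DRY_RUN=1; index_and_align Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz reads_1.fastq reads_2.fastq
```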

            Since I was interested in novel transcript discovery in addition to gene expression quantification, I wanted to use the most complete genome version available, so I was using Homo_sapiens.GRCh38.dna.toplevel.fa.gz. In hindsight, Homo_sapiens.GRCh38.dna.primary_assembly.fa was probably more appropriate.

            The following description of the files says GRCh37, but it was downloaded from the GRCh38 directory on the Ensembl FTP site.
            Code:
            ---------
            TOPLEVEL
            ---------
            These files contains all sequence regions flagged as toplevel in an Ensembl
            schema. This includes chromsomes, regions not assembled into chromosomes and
            N padded haplotype/patch regions.
            
            EXAMPLES
            
              Toplevel sequences unmasked:
                Homo_sapiens.GRCh37.dna.toplevel.fa.gz
              
              Toplevel soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
                Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz
            
            -----------------
            PRIMARY ASSEMBLY
            -----------------
            Primary assembly contains all toplevel sequence regions excluding haplotypes
            and patches. This file is best used for performing sequence similarity searches
            where patch and haplotype sequences would confuse analysis.   
            
            EXAMPLES
            
              Primary assembly sequences unmasked:
                Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
              
              Primary assembly soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
                Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
            Last edited by blancha; 10-07-2014, 09:35 AM.
