
  • Tophat2 with GFF3 annotation fails to produce Bowtie index.

    Hi all.

    I'm trying to map paired-end Illumina reads to a reference genome, using a GFF3 file for annotation info. I compiled the genome sequence from separate files for each linkage group, plus the scaffolds which couldn't be assigned to linkage groups.

    But running tophat2 like so:

    tophat -p 8 -G ~/path/to/annotation.gff3 index_name CAA_l1_1.fq.gz CAA_l1_2.fq.gz
    ends up giving me this error:

    [2013-11-21 12:19:21] Building transcriptome data files..
    [2013-11-21 12:19:24] Building Bowtie index from annotation.fa
    Error: Couldn't build bowtie index with err = 1
    I thought that maybe the names were off but it all looks like it matches.

    bowtie2-inspect -n 
    gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence
    bowtie2-inspect -s 
    Flags   1
    Reverse flags   5
    Colorspace      0
    2.0-compatible  1
    SA-Sample       1 in 16
    FTab-Chars      10
    Sequence-1      gi|339751252|ref|NC_015762.1| Bombus terrestris linkage group LG B01, Bter_1.0 chromosome, whole genome shotgun sequence        17153651
    and my GFF3 file looks like so:

    #!gff-spec-version 1.20
    #!processor NCBI annotwriter
    ##sequence-region NC_015762.1 1 17153651
    NC_015762.1     RefSeq  region  1       17153651        .       +       .       ID=id0;Dbxref=taxon:30195;gbkey=Src;genome=chromosome;linkage-group=LG B01;mol_type=genomic DNA;note=haploid drones;sex=male
    NC_015762.1     RefSeq  gene    2279    19877   .       -       .       ID=gene0;Name=LOC100649911;Dbxref=GeneID:100649911;gbkey=Gene;gene=LOC100649911
    Any suggestions here as to what I'm doing wrong would be most appreciated.

  • #2
    Have you pre-built the genome index for the genome you are searching against?

    This guide is helpful:
    Last edited by GenoMax; 11-21-2013, 08:45 AM.


    • #3
      Thanks Geno, I'll take a look at that paper. I did build the genome index ahead of running tophat.

      I have a hunch that it might be because sequence names are all filled with extra info. I'll try cleaning those out and rebuilding the genome index again.


      • #4
        Did you call your genome index "index_name" because that is what your command line is suggesting?

        Can you post the command line you used to build the index?


        • #5
          I called it that in the post because it might be unclear otherwise.

          I created the index by first concatenating all the .fa files and then running:
          #real name of the index
          bowtie2-build Bter_gDNA.fa Bter_gDNA

          BTW, it seems to do the mapping step if you exclude the GFF part. Also, running bowtie alone works fine. It seems to really be an issue with the GFF matching.


          • #6
            Got it.

            Can you validate your GFF file to make sure it is ok:


            • #7
              Hadn't seen those gff-tools. Will take a good look for future reference. Turns out it was an issue with the long names in the fasta file and short names in the GFF3 file. A silly issue really!
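
              A minimal sketch of the kind of cleanup described above, assuming the FASTA headers are NCBI-style (`>gi|...|ref|ACCESSION| description`, as in the first post) and the GFF3 first column holds only the accession. The function names `short_name` and `rename_headers` are hypothetical, not part of TopHat or Bowtie:

```python
def short_name(header: str) -> str:
    """Return the RefSeq accession from an NCBI-style FASTA header.

    Assumes headers shaped like the one in the first post:
    '>gi|339751252|ref|NC_015762.1| Bombus terrestris ...'.
    Falls back to the first whitespace-delimited token otherwise.
    """
    text = header.lstrip(">").strip()
    first_token = text.split()[0] if text else ""
    fields = first_token.split("|")
    if "ref" in fields:
        idx = fields.index("ref")
        # The accession is the field right after the 'ref' tag.
        if idx + 1 < len(fields) and fields[idx + 1]:
            return fields[idx + 1]
    return first_token


def rename_headers(lines):
    """Yield FASTA lines with each header replaced by its short name."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + short_name(line) + "\n"
        else:
            yield line
```

              The rewritten FASTA would then need a fresh `bowtie2-build` run so that the index carries the short names too.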


              • #8
                I recently ran into the same error:
                Couldn't build bowtie index with err = 1
                I ran bowtie2-inspect -n but the name exactly matched the name in the 1st column of the GTF. I then ran bowtie2-inspect genome_index > new.fa (without -n) to regenerate the fasta. This fixed it.

                P.S. A diff of the original genome.fa vs new.fa indicated that the whitespace was different (I had checked previously that there were no spaces after the sequence name, but apparently the newline character was different). Also, the regenerated FASTA had a different number of bases per line and no extra blank line at the end of the file. I'm not sure which of these differences was causing the error.
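
                Since it is unclear which difference mattered, one hedged option is to remove the whitespace differences listed above in one pass. The sketch below (the function name `normalize_fasta_text` is hypothetical) converts Windows and old Mac line endings to `\n`, strips trailing whitespace, and drops blank lines. It does not re-wrap the sequence lines, so the bases-per-line difference is left untouched:

```python
def normalize_fasta_text(text: str) -> str:
    """Return FASTA text with Unix newlines and no blank lines.

    - '\r\n' and bare '\r' become '\n'
    - trailing spaces/tabs on each line are removed
    - empty lines (including a trailing one) are dropped
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    kept = [ln.rstrip() for ln in lines if ln.strip()]
    # End with exactly one newline unless the input had no content at all.
    return "\n".join(kept) + "\n" if kept else ""
```

                For a whole-genome file this reads everything into memory; a line-by-line variant would be the obvious refinement.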


                • #9
                  Good to know bw, thanks.


                  • #10
                    Hello everyone,
                    I also ran into the same problem:
                    Error: Couldn't build bowtie index with err = 1
                    I created my index from my reference genome "ref_maize.fa", which produced these files:
                    maize_ebtw.1.bt2 maize_ebtw.3.bt2 maize_ebtw.rev.1.bt2
                    maize_ebtw.2.bt2 maize_ebtw.4.bt2 maize_ebtw.rev.2.bt2
                    which are all in one directory "bowtie_build2"
                    And then I ran tophat with the following command:
                    tophat -p 7 -o /nfshome/fhg2a/nature_maize/tophat_results_base -G ZmB73_5a.59_WGS.gff --no-novel-juncs bowtie_build2/maize_ebtw reads/SRR039501.fastq >tophat_base.log

                    Now I get the following error:
                    [2014-05-06 12:15:54] Building Bowtie index from ZmB73_5a.59_WGS.fa
                    Error: Couldn't build bowtie index with err = 1
                    MY QUESTION: I DON'T EVEN HAVE A FILE CALLED "ZmB73_5a.59_WGS.fa". Your help is appreciated.

                    My reference file looks like this:
                    head ref_maize.fa

                    ANNOTATION FILE
                    head ZmB73_5a.59_WGS.gff
                    9 ensembl chromosome 1 156750706 . . . ID=9;Name=chromosome:AGPv2:9:1:156750706:1
                    9 ensembl gene 19970 20093 . + . ID=GRMZM2G581216;Name=GRMZM2G581216;biotype=transposable_element
                    9 ensembl mRNA 19970 20093 . + . ID=GRMZM2G581216_T01;Parent=GRMZM2G581216;Name=GRMZM2G581216_T01;biotype=protein_coding
                    9 ensembl exon 19970 20093 . + . Parent=GRMZM2G581216_T01;Name=GRMZM2G581216_E01
                    9 ensembl CDS 19970 20092 . + 0 Parent=GRMZM2G581216_T01;Name=CDS.2
                    9 ensembl gene 23314 26371 . + . ID=GRMZM2G163722;Name=GRMZM2G163722;biotype=transposable_element
                    9 ensembl mRNA 23314 26371 . + . ID=GRMZM2G163722_T01;Parent=GRMZM2G163722;Name=GRMZM2G163722_T01;biotype=protein_coding
                    9 ensembl intron 23496 23939 . + . Parent=GRMZM2G163722_T01;Name=intron.3
                    9 ensembl intron 24061 24283 . + . Parent=GRMZM2G163722_T01;Name=intron.4
                    9 ensembl intron 24472 24540 . + . Parent=GRMZM2G163722_T01;Name=intron.5


                    • #11
                      @filmonhg: Do yourself a favor and grab a copy of the maize data from iGenomes (this will work unless your genome is non-standard). That way you will have sequence, annotation, and indexes that are all coordinated (chromosome names etc.) and will work together.

                      You are going to run into other issues even if you were to try renaming your GTF file to read maize_ebtw.gtf.

                      See the following quote from TopHat site.

                      Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat. You can get a list of the sequence names in a Bowtie index by typing:

                      bowtie-inspect --names your_index

                      So before using a known annotation file with this option please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command above.
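
                      The check described in the quote can be sketched as a small script. It assumes the index names have been saved one per line (for example from the `bowtie-inspect --names` output mentioned above) and that an exact, case-sensitive string match is what is required. The function names `gff_seqids` and `unmatched_seqids` are hypothetical:

```python
def gff_seqids(gff_lines):
    """Collect the first-column values of a GFF/GTF file.

    Comment/directive lines (starting with '#') and blank lines are skipped.
    """
    names = set()
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The sequence name is the first whitespace-delimited field.
        names.add(line.split(None, 1)[0])
    return names


def unmatched_seqids(gff_lines, index_names):
    """Return sorted GFF sequence names that have no exact match in the index."""
    index_set = {name.rstrip("\n") for name in index_names}
    return sorted(gff_seqids(gff_lines) - index_set)
```

                      An empty result means every name in the annotation has an exact counterpart in the index. With the files from the first post, `NC_015762.1` would be reported because the index name is the full `gi|...` header.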


                      • #12
                        Although this is very obvious, it took me a couple of days to realise that I couldn't simply create my own reference file from my genomic region of interest (I was humbly downloading the sequence from the UCSC browser by simply selecting the region of interest and clicking TOOLS - GET DNA haha :P ).

                        So, just to clarify for newbies like myself: the FASTA file used to create your Bowtie index needs to indicate which chromosome, and which bases on it, the sequence refers to, and this information needs to match the GTF file in both format and coordinates.

                        P.S.: I gave up on trying to minimise the genomic region to which my reads would map. First, it's not straightforward and I think it requires some programming, and second, you'll introduce biases in your analysis (reads that shouldn't map there may end up mapping there).
                        Last edited by rodrigo.duarte88; 05-19-2015, 07:20 AM.

