  • Tophat problem: failing reads alignment

    Hi all,
    I'm working with RNA-seq data for the first time and am practicing by running TopHat on the "wgEncodeCshlLongRnaSeqK562CellTotalFastqRep1.fastq" reads from the ENCODE project.
    I issued the command "tophat -p 24 -G genes.gtf -o K562_1 hg19 reads.fastq", where genes.gtf is the transcript annotation file (hg19) from Illumina, hg19 is the reference genome (Bowtie index), and reads.fastq is the single-end reads file mentioned above (reads are 152 nt long).
    After 15 hours, the output directory contains only a very small .bam file (around 300 KB).
    Looking at the log files, I find:

    bowtie.left_kept_reads.fixmap.log:
    ::::::::::::::
    # reads processed: 77046522
    # reads with at least one reported alignment: 82 (0.00%)
    # reads that failed to align: 77046440 (100.00%)
    Reported 82 alignments to 1 output stream(s)

    ::::::::::::::
    bowtie.left_kept_reads_seg1.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 17276886 (22.42%)
    # reads that failed to align: 59283777 (76.95%)
    # reads with alignments suppressed due to -m: 485777 (0.63%)
    Reported 83222144 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg2.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 22628110 (29.37%)
    # reads that failed to align: 53761233 (69.78%)
    # reads with alignments suppressed due to -m: 657097 (0.85%)
    Reported 119440312 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg3.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 18924892 (24.56%)
    # reads that failed to align: 57402732 (74.50%)
    # reads with alignments suppressed due to -m: 718816 (0.93%)
    Reported 122913616 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg4.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 12507630 (16.23%)
    # reads that failed to align: 64207195 (83.34%)
    # reads with alignments suppressed due to -m: 331615 (0.43%)
    Reported 56082256 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg5.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 19159264 (24.87%)
    # reads that failed to align: 57337091 (74.42%)
    # reads with alignments suppressed due to -m: 550085 (0.71%)
    Reported 100864131 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg6.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 7086758 (9.20%)
    # reads that failed to align: 69837946 (90.64%)
    # reads with alignments suppressed due to -m: 121736 (0.16%)
    Reported 26553396 alignments to 1 output stream(s)

    I also find many (around 200) "malformed closure" warnings in long_spanning_reads.log.

    Thanks a lot for any suggestions or advice.
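
For anyone comparing logs like the ones above, here is a minimal Python sketch that pulls the counts out of a Bowtie summary and computes the aligned fraction. The function names (`parse_bowtie_log`, `aligned_fraction`) are made up for this illustration, and the sketch assumes the log uses exactly the line wording quoted in this post.

```python
import re

# Line patterns copied from the Bowtie summaries quoted in this post.
PATTERNS = {
    "processed": r"# reads processed: (\d+)",
    "aligned": r"# reads with at least one reported alignment: (\d+)",
    "failed": r"# reads that failed to align: (\d+)",
    "suppressed": r"# reads with alignments suppressed due to -m: (\d+)",
}

def parse_bowtie_log(text):
    """Return a dict of read counts; a missing line is reported as 0."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        counts[key] = int(match.group(1)) if match else 0
    return counts

def aligned_fraction(counts):
    """Fraction of processed reads with at least one reported alignment."""
    if counts["processed"] == 0:
        return 0.0
    return counts["aligned"] / counts["processed"]

# The first summary from the post (bowtie.left_kept_reads.fixmap.log).
example = """\
# reads processed: 77046522
# reads with at least one reported alignment: 82 (0.00%)
# reads that failed to align: 77046440 (100.00%)
Reported 82 alignments to 1 output stream(s)
"""
counts = parse_bowtie_log(example)
print(counts, aligned_fraction(counts))
```

Running it over each `*.fixmap.log` file makes it easy to see that the full-length reads almost never align while the individual segments do.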

  • #2
    I would suggest including the genome in your TopHat run,

    so your new command would be

    "tophat -p 24 -G genes.gtf /path/to/genome -o K562_1 hg19 reads.fastq"

    The genome argument is the common prefix of the index files that you generate from the genome FASTA file with Bowtie:

    "bowtie-build genome.fa genome"

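A quick way to confirm that a prefix really points at a complete index is to look for the files bowtie-build writes. The sketch below assumes a standard (small) Bowtie 1 index, which consists of six `.ebwt` files; the helper name `missing_index_files` is made up for this example.

```python
from pathlib import Path

# Suffixes of the six files that make up a standard (small) Bowtie 1 index.
EBWT_SUFFIXES = [
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
]

def missing_index_files(prefix):
    """Return the expected index files that do not exist for this prefix."""
    return [prefix + suffix for suffix in EBWT_SUFFIXES
            if not Path(prefix + suffix).is_file()]

# "hg19" is the prefix used in this thread; an empty list means all six files were found.
print(missing_index_files("hg19"))
```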


    • #3
      I've included it.
      As mentioned above, it is "hg19". These are the .ebwt index files, and Bowtie correctly built the hg19.fa reference file.

      I tried the same procedure with paired-end FASTQ reads (2x76) from a different RNA-seq dataset (wgEncodeCshlLongRnaSeqK562CellLongnonpolyaFastqRd1Rep1.fastq.gz and wgEncodeCshlLongRnaSeqK562CellLongnonpolyaFastqRd2Rep1.fastq.gz) and it worked.
      Maybe the problem is the format of the single end reads data of 152 nt?

      Thanks
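
One way to test the "format" hypothesis is to scan the start of the FASTQ file and tally read lengths and obviously malformed records. This is only a sketch: it assumes plain four-line FASTQ records (no wrapped sequence lines), and `fastq_length_summary` is a made-up name.

```python
from collections import Counter
import io

def fastq_length_summary(handle, max_records=100000):
    """Tally read lengths and count records whose layout looks malformed.

    Assumes each record is exactly four lines: @header, sequence, +, qualities.
    """
    lengths = Counter()
    malformed = 0
    records = 0
    while records < max_records:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        records += 1
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            malformed += 1
            continue
        lengths[len(seq)] += 1
    return lengths, malformed

# Tiny in-memory example: the second record has 3 bases but 4 quality values.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIIII\n")
lengths, malformed = fastq_length_summary(demo)
print(lengths, malformed)
```

For a real file, pass `open("reads.fastq")` instead of the in-memory demo.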



      • #4
        Originally posted by Annibal
        Maybe the problem is the format of the single end reads data of 152 nt?
        This is probably part of the problem, since by default TopHat only allows 2 mismatches on the whole read. I had a similar problem when analyzing reads of 107 bp. Switching to --bowtie-n mode might help, since the mismatches are then counted only in the seed region (the first 28 bp). Still, I found no way to increase Bowtie's "-e" parameter from the TopHat command line, and I suspect it might be too restrictive for long reads.
        If you find a way to improve the alignment, please please keep us informed!
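
To make the difference concrete, here is a small Python sketch that counts mismatches over the whole read versus only the first 28 bases. It is a simplification for illustration: the real "-n" mode in Bowtie also applies the quality-based "-e" limit mentioned above, which this toy function ignores. `count_mismatches` is a made-up helper, not part of Bowtie or TopHat.

```python
def count_mismatches(read, reference, seed_length=None):
    """Count mismatching positions between two equal-length sequences.

    If seed_length is given, only the first seed_length bases are compared.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    end = len(read) if seed_length is None else min(seed_length, len(read))
    return sum(1 for a, b in zip(read[:end], reference[:end]) if a != b)

# A 152 nt toy reference and a read with three substitutions,
# one inside the 28 bp seed and two further along the read.
reference = "ACGT" * 38
bases = list(reference)
for pos in (10, 60, 120):
    bases[pos] = "C" if bases[pos] == "A" else "A"
read = "".join(bases)

whole = count_mismatches(read, reference)
seed = count_mismatches(read, reference, seed_length=28)
print(whole, seed)
```

With a limit of 2 mismatches, this read fails when the whole read is counted but passes when only the seed is counted, which is why long reads suffer more under the default setting.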



        • #5
          Can anybody help with this error?
          I tried to map Illumina paired-end RNA-seq reads from rice to the reference genome.
          I ran TopHat with the following command:

          /opt/tophat-1.4.1.Linux_x86_64/tophat -p 4 -o output -G osa.gtf /home/anurag.gautam/03_Genomes/Oryza_sativa_Indica/bowtie/osa SRR037735_1.fastq SRR037735_2.fastq
          [Mon Apr 9 18:11:39 2012] Beginning TopHat run (v1.4.1)
          -----------------------------------------------
          [Mon Apr 9 18:11:39 2012] Preparing output location output/
          [Mon Apr 9 18:11:39 2012] Checking for Bowtie index files
          [Mon Apr 9 18:11:39 2012] Checking for reference FASTA file
          [Mon Apr 9 18:11:39 2012] Checking for Bowtie
          Bowtie version: 0.12.7.0
          [Mon Apr 9 18:11:39 2012] Checking for Samtools
          Samtools Version: 0.1.16
          [Mon Apr 9 18:11:39 2012] Generating SAM header for /home/anurag.gautam/03_Genomes/Oryza_sativa_Indica/bowtie/osa
          format: fastq
          quality scale: phred33 (default)
          [Mon Apr 9 18:11:39 2012] Reading known junctions from GTF file
          Warning: TopHat did not find any junctions in GTF file
          [Mon Apr 9 18:11:39 2012] Preparing reads
          left reads: min. length=75, count=9884891
          right reads: min. length=75, count=9873028
          [Mon Apr 9 18:18:36 2012] Creating transcriptome data files..
          [FAILED]
          Error: gtf_to_fasta returned an error.

          Please help with this error.



          • #6
            Originally posted by Julien Roux
            This is probably part of the problem, since by default TopHat only allows 2 mismatches on the whole read. I had a similar problem when analyzing reads of 107 bp. Switching to --bowtie-n mode might help, since the mismatches are then counted only in the seed region (the first 28 bp). Still, I found no way to increase Bowtie's "-e" parameter from the TopHat command line, and I suspect it might be too restrictive for long reads.
            If you find a way to improve the alignment, please please keep us informed!
            I don't know if it works, since I haven't reviewed the code, but I suppose you can trick the program by editing the tophat script, since it only calls the bowtie executable.
            At line 717:
            if option == "--bowtie-n":
                self.bowtie_alignment_option = "-n"

            Just replace the "-n" with "-e your_value -n"; when you then run tophat with --bowtie-n, it will invoke bowtie with "-e your_value -n".
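
To show what the edited option would expand to, here is a standalone Python sketch. It is not TopHat's code: `bowtie_alignment_args` is a hypothetical helper, and it assumes the mismatch count is appended after the alignment option, as the suggestion above implies. The "-v", "-n" and "-e" flags themselves are real Bowtie options.

```python
def bowtie_alignment_args(use_n_mode, max_mismatches=2, maq_err=None):
    """Build the Bowtie mismatch arguments (hypothetical helper, not TopHat code).

    use_n_mode=False -> "-v N": N mismatches allowed over the whole read.
    use_n_mode=True  -> "-n N": N mismatches allowed in the seed only,
                        optionally preceded by "-e maq_err".
    """
    if not use_n_mode:
        return ["-v", str(max_mismatches)]
    args = []
    if maq_err is not None:
        args += ["-e", str(maq_err)]
    args += ["-n", str(max_mismatches)]
    return args

# 200 is an arbitrary illustrative value for -e, not a recommendation.
print(" ".join(bowtie_alignment_args(True, maq_err=200)))
```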



            • #7
              I would first check whether the GTF file is in the right format. Then I would check whether enough memory was allocated for TopHat. My job once got killed at exactly this step because it used much more memory than I had specified.
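
A rough format check could look like the Python sketch below. It is only a sketch under some assumptions: it expects nine tab-separated fields per line and a `transcript_id "..."` attribute on exon lines, and `gtf_exon_summary` is a made-up name. Transcripts with at least two exons are the ones that can contribute junctions, so zero of them would be consistent with the "did not find any junctions" warning. It is also worth comparing the sequence names it reports with the names in the Bowtie index.

```python
from collections import defaultdict
import re

def gtf_exon_summary(lines):
    """Return (bad_line_count, exons_per_transcript, seqnames) for GTF lines."""
    bad = 0
    exons = defaultdict(int)
    seqnames = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 9:
            bad += 1  # e.g. space-separated instead of tab-separated
            continue
        seqnames.add(fields[0])
        if fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            bad += 1  # exon line without a transcript_id attribute
            continue
        exons[match.group(1)] += 1
    return bad, dict(exons), seqnames

# Small made-up example: t1 has two exons, t2 has one, and the last line uses spaces.
demo = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chr1\tsrc\texon\t900\t950\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n',
    'chr1 src exon 1 2 . + . gene_id "g3"; transcript_id "t3";\n',
]
bad, exons, seqnames = gtf_exon_summary(demo)
multi_exon = [t for t, n in exons.items() if n >= 2]
print(bad, multi_exon, sorted(seqnames))
```

For a real annotation, pass `open("osa.gtf")` instead of the demo list.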

              Originally posted by anurag.gautam
              [Mon Apr 9 18:18:36 2012] Creating transcriptome data files..
              [FAILED]
              Error: gtf_to_fasta returned an error.

              Please help with this error.
