  • Tophat problem: failing reads alignment

    Hi all,
    I'm working with RNA-seq data for the first time and am practicing by running TopHat on the "wgEncodeCshlLongRnaSeqK562CellTotalFastqRep1.fastq" reads from the ENCODE project.
    I issued the command "tophat -p 24 -G genes.gtf -o K562_1 hg19 reads.fastq", where genes.gtf is the transcript annotation file (hg19) from Illumina, hg19 is the reference genome (Bowtie index), and reads.fastq is the single-end reads file mentioned above (reads are 152 nt long).
    After 15 hours, the output directory contains a very small .bam file (around 300 KB).
    The log files show:

    bowtie.left_kept_reads.fixmap.log:
    ::::::::::::::
    # reads processed: 77046522
    # reads with at least one reported alignment: 82 (0.00%)
    # reads that failed to align: 77046440 (100.00%)
    Reported 82 alignments to 1 output stream(s)

    ::::::::::::::
    bowtie.left_kept_reads_seg1.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 17276886 (22.42%)
    # reads that failed to align: 59283777 (76.95%)
    # reads with alignments suppressed due to -m: 485777 (0.63%)
    Reported 83222144 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg2.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 22628110 (29.37%)
    # reads that failed to align: 53761233 (69.78%)
    # reads with alignments suppressed due to -m: 657097 (0.85%)
    Reported 119440312 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg3.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 18924892 (24.56%)
    # reads that failed to align: 57402732 (74.50%)
    # reads with alignments suppressed due to -m: 718816 (0.93%)
    Reported 122913616 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg4.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 12507630 (16.23%)
    # reads that failed to align: 64207195 (83.34%)
    # reads with alignments suppressed due to -m: 331615 (0.43%)
    Reported 56082256 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg5.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 19159264 (24.87%)
    # reads that failed to align: 57337091 (74.42%)
    # reads with alignments suppressed due to -m: 550085 (0.71%)
    Reported 100864131 alignments to 1 output stream(s)
    ::::::::::::::
    bowtie.left_kept_reads_seg6.fixmap.log
    ::::::::::::::
    # reads processed: 77046440
    # reads with at least one reported alignment: 7086758 (9.20%)
    # reads that failed to align: 69837946 (90.64%)
    # reads with alignments suppressed due to -m: 121736 (0.16%)
    Reported 26553396 alignments to 1 output stream(s)

    In addition, long_spanning_reads.log contains many (around 200) "malformed closure" warnings.

    Thanks a lot for any suggestions or advice.
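To compare the per-segment logs above at a glance, the "# label: count" lines can be pulled into a dictionary. The following is only a minimal sketch: the function names are made up for illustration, and the sample text is the seg1 log copied from this post.

```python
import re

# Matches lines such as "# reads with at least one reported alignment: 82 (0.00%)".
STAT_LINE = re.compile(r"^#\s*(?P<label>[^:]+):\s*(?P<count>\d+)")


def parse_bowtie_log(text):
    """Return {label: count} for every '# label: count' line in a Bowtie log."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("label").strip()] = int(match.group("count"))
    return stats


def aligned_fraction(stats):
    """Fraction of processed reads with at least one reported alignment."""
    processed = stats["reads processed"]
    aligned = stats.get("reads with at least one reported alignment", 0)
    return aligned / processed if processed else 0.0


# Numbers copied from the seg1 log quoted in the post above.
SEG1_LOG = """\
# reads processed: 77046440
# reads with at least one reported alignment: 17276886 (22.42%)
# reads that failed to align: 59283777 (76.95%)
# reads with alignments suppressed due to -m: 485777 (0.63%)
Reported 83222144 alignments to 1 output stream(s)
"""

if __name__ == "__main__":
    seg1 = parse_bowtie_log(SEG1_LOG)
    print("seg1 aligned: {:.2%}".format(aligned_fraction(seg1)))
```

Running the same parser over each segment log makes it easy to see that the three categories (aligned, failed, suppressed by -m) add up to the number of reads processed, so the low rates are not a counting artifact.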

  • #2
    I would suggest including the genome in your TopHat run,

    so your new command will be

    "tophat -p 24 -G genes.gtf /path/to/genome -o K562_1 hg19 reads.fastq"

    The genome name is the common prefix of the index files that Bowtie generates from the genome FASTA file:

    "bowtie-build genome.fa genome"

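A quick way to confirm that a prefix really points at a complete Bowtie 1 index is to look for the six .ebwt files that bowtie-build writes. This is a sketch under assumptions: it only covers the small-index suffixes of Bowtie 1, and the prefix "hg19" in the demo is simply the name used in this thread.

```python
from pathlib import Path

# Suffixes written by bowtie-build (Bowtie 1) for a standard index.
EBWT_SUFFIXES = (
    ".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
    ".rev.1.ebwt", ".rev.2.ebwt",
)


def missing_index_files(prefix):
    """Return the expected index files for `prefix` that do not exist on disk."""
    prefix = str(prefix)
    return [prefix + suffix for suffix in EBWT_SUFFIXES
            if not Path(prefix + suffix).is_file()]


if __name__ == "__main__":
    # "hg19" is the index prefix mentioned in the thread; adjust the path as needed.
    missing = missing_index_files("hg19")
    if missing:
        print("Missing index files:", ", ".join(missing))
    else:
        print("All six .ebwt files found.")
```

An empty list means the prefix passed to tophat matches a full index; anything else points at a path or prefix typo rather than an alignment problem.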


    • #3
      I've included it.
      As mentioned above, it is "hg19". These are the .ebwt index files, and the hg19.fa reference file was built correctly by Bowtie.

      I've tried the same procedure using paired-end FASTQ reads (2x76) from a different RNA-seq dataset (wgEncodeCshlLongRnaSeqK562CellLongnonpolyaFastqRd1Rep1.fastq.gz and wgEncodeCshlLongRnaSeqK562CellLongnonpolyaFastqRd2Rep1.fastq.gz), and it worked.
      Maybe the problem is the format of the single end reads data of 152 nt?

      Thanks
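One way to test the "format of the reads" hypothesis is to summarize read lengths and the range of quality characters in the FASTQ file. The sketch below assumes plain four-line FASTQ records; the function names and the encoding thresholds (a common rule of thumb, not a guarantee) are illustrative assumptions.

```python
import io
from collections import Counter


def fastq_records(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        if not header.strip():
            continue  # tolerate stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def summarize_fastq(handle, limit=100000):
    """Summarize read lengths and quality range for the first `limit` reads."""
    lengths = Counter()
    lo = hi = None
    for index, (_name, seq, qual) in enumerate(fastq_records(handle)):
        if index >= limit:
            break
        lengths[len(seq)] += 1
        for char in qual:
            code = ord(char)
            lo = code if lo is None else min(lo, code)
            hi = code if hi is None else max(hi, code)
    # Rule of thumb: characters below ';' (59) only occur with Phred+33,
    # characters above 'J' (74) suggest a +64 offset.
    if lo is None:
        guess = "unknown"
    elif lo < 59:
        guess = "phred33"
    elif hi > 74:
        guess = "phred64"
    else:
        guess = "ambiguous"
    return {"lengths": dict(lengths), "min_qual": lo,
            "max_qual": hi, "encoding_guess": guess}


if __name__ == "__main__":
    sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n!!IIII\n")
    print(summarize_fastq(sample))
```

For a real file, open it with `open("reads.fastq")` (or `gzip.open(..., "rt")` for .gz) and pass the handle in. If the guess comes back as "phred64" while the TopHat log reports "phred33 (default)", the quality scale option in the TopHat manual (for example --solexa1.3-quals) would be worth checking.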



      • #4
        Originally posted by Annibal
        Maybe the problem is the format of the single end reads data of 152 nt?
        This is probably part of the problem, since by default TopHat allows only 2 mismatches over the whole read. I had a similar problem when analyzing 107 bp reads. Switching to --bowtie-n mode might help, since mismatches are then counted only in the seed region (the first 28 bp). Still, I found no way to increase Bowtie's "-e" parameter from the TopHat command line, and I suspect the default might be too restrictive for long reads.
        If you find a way to improve the alignment, please keep us informed!
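The difference between the two counting rules described in this post can be illustrated with a toy comparison. This is a deliberately simplified sketch, not Bowtie's algorithm: it ignores base qualities (and therefore the "-e" limit) and assumes the read and reference window have the same length. All function names are hypothetical.

```python
def count_mismatches(read, ref, limit=None):
    """Count mismatching positions, optionally only within the first `limit` bases."""
    pairs = list(zip(read, ref))
    if limit is not None:
        pairs = pairs[:limit]
    return sum(1 for a, b in pairs if a != b)


def passes_whole_read(read, ref, max_mm=2):
    """Rule 1: at most `max_mm` mismatches anywhere in the read."""
    return count_mismatches(read, ref) <= max_mm


def passes_seed_only(read, ref, seed_len=28, max_mm=2):
    """Rule 2: at most `max_mm` mismatches within the first `seed_len` bases."""
    return count_mismatches(read, ref, seed_len) <= max_mm


def with_mismatches(seq, positions):
    """Return `seq` with a different base substituted at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    bases = list(seq)
    for pos in positions:
        bases[pos] = swap[bases[pos]]
    return "".join(bases)


if __name__ == "__main__":
    ref = "ACGT" * 38                              # 152 nt, as in the original post
    read = with_mismatches(ref, [100, 120, 140])   # 3 mismatches outside the seed
    print("whole-read rule:", passes_whole_read(read, ref))
    print("seed-only rule: ", passes_seed_only(read, ref))
```

A 152 nt read with three mismatches late in the read fails the whole-read rule but passes the seed-only rule, which is why longer reads are hit harder by a fixed whole-read limit.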



        • #5
          Can anybody help with this error?
          I tried to map Illumina paired-end RNA-seq reads from rice to the reference genome.
          I ran TopHat with the following command:

          /opt/tophat-1.4.1.Linux_x86_64/tophat -p 4 -o output -G osa.gtf /home/anurag.gautam/03_Genomes/Oryza_sativa_Indica/bowtie/osa SRR037735_1.fastq SRR037735_2.fastq
          [Mon Apr 9 18:11:39 2012] Beginning TopHat run (v1.4.1)
          -----------------------------------------------
          [Mon Apr 9 18:11:39 2012] Preparing output location output/
          [Mon Apr 9 18:11:39 2012] Checking for Bowtie index files
          [Mon Apr 9 18:11:39 2012] Checking for reference FASTA file
          [Mon Apr 9 18:11:39 2012] Checking for Bowtie
          Bowtie version: 0.12.7.0
          [Mon Apr 9 18:11:39 2012] Checking for Samtools
          Samtools Version: 0.1.16
          [Mon Apr 9 18:11:39 2012] Generating SAM header for /home/anurag.gautam/03_Genomes/Oryza_sativa_Indica/bowtie/osa
          format: fastq
          quality scale: phred33 (default)
          [Mon Apr 9 18:11:39 2012] Reading known junctions from GTF file
          Warning: TopHat did not find any junctions in GTF file
          [Mon Apr 9 18:11:39 2012] Preparing reads
          left reads: min. length=75, count=9884891
          right reads: min. length=75, count=9873028
          [Mon Apr 9 18:18:36 2012] Creating transcriptome data files..
          [FAILED]
          Error: gtf_to_fasta returned an error.

          Please help with this error.



          • #6
            Originally posted by Julien Roux
            This is probably part of the problem, since by default TopHat allows only 2 mismatches over the whole read. I had a similar problem when analyzing 107 bp reads. Switching to --bowtie-n mode might help, since mismatches are then counted only in the seed region (the first 28 bp). Still, I found no way to increase Bowtie's "-e" parameter from the TopHat command line, and I suspect the default might be too restrictive for long reads.
            If you find a way to improve the alignment, please keep us informed!
            I don't know if it works, since I haven't reviewed the code, but I suppose you can trick the program by editing the tophat script, since it only calls the bowtie executable.
            At line 717:
                if option == "--bowtie-n":
                    self.bowtie_alignment_option = "-n"

            Just replace the "-n" with "-e your_value -n"; when you then run tophat with --bowtie-n, it will invoke bowtie with -e your_value -n.
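The edit suggested above could be applied to a copy of the tophat script with a small text substitution instead of by hand. This is an untested sketch in the same spirit as the post: the helper name and the value 200 are placeholders, the line number may differ between TopHat versions, and whether TopHat then passes "-e ... -n" to Bowtie correctly has not been verified.

```python
def patch_bowtie_n(source, maqerr):
    """Return script text with '-n' replaced by '-e <maqerr> -n' on the --bowtie-n line.

    Raises ValueError unless the target assignment occurs exactly once,
    so an unexpected script layout is not patched silently.
    """
    old = 'self.bowtie_alignment_option = "-n"'
    new = 'self.bowtie_alignment_option = "-e %d -n"' % maqerr
    if source.count(old) != 1:
        raise ValueError("expected exactly one --bowtie-n assignment, found %d"
                         % source.count(old))
    return source.replace(old, new)


# The two lines quoted in the post, with Python indentation restored.
SNIPPET = (
    'if option == "--bowtie-n":\n'
    '    self.bowtie_alignment_option = "-n"\n'
)

if __name__ == "__main__":
    # 200 is an arbitrary placeholder for "your_value".
    print(patch_bowtie_n(SNIPPET, 200))
```

To use it on the real script, read the file into a string, pass it through `patch_bowtie_n`, and write the result to a new file, keeping the original untouched.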



            • #7
              I would first check whether the GTF file is in the right format. Then I would check whether enough memory was allocated for TopHat. My job once got killed at exactly this step because it used much more memory than I had requested.

              Originally posted by anurag.gautam
              Can anybody help with this error?
              [...]
              Warning: TopHat did not find any junctions in GTF file
              [...]
              [Mon Apr 9 18:18:36 2012] Creating transcriptome data files..
              [FAILED]
              Error: gtf_to_fasta returned an error.

              Please help with this error.
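The GTF format check suggested in this reply could start with something like the sketch below. It rests on an assumption: that the "did not find any junctions" warning is consistent with a file in which no transcript has two or more well-formed exon records. The function names are illustrative, and the check does not compare chromosome names against the Bowtie index, which would be a separate thing to verify.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_counts(lines):
    """Count exon records per transcript_id and collect malformed lines.

    Returns (counts, problems), where problems is a list of
    (line_number, description) tuples.
    """
    counts = defaultdict(int)
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(
                (number, "expected 9 tab-separated fields, found %d" % len(fields)))
            continue
        if fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            problems.append((number, "exon line without transcript_id"))
            continue
        counts[match.group(1)] += 1
    return dict(counts), problems


def multi_exon_transcripts(counts):
    """Transcripts with at least two exons, i.e. at least one junction."""
    return sorted(name for name, n in counts.items() if n >= 2)


if __name__ == "__main__":
    # "osa.gtf" is the annotation file named in the quoted command.
    with open("osa.gtf") as handle:
        counts, problems = exon_counts(handle)
    print("transcripts with junctions:", len(multi_exon_transcripts(counts)))
    for number, message in problems[:10]:
        print("line %d: %s" % (number, message))
```

If the junction count is zero or the problem list is long (for example, space-separated instead of tab-separated fields), the annotation file is the first thing to fix before looking at memory.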
