Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • thurisaz
    Member
    • Jun 2011
    • 24

    Trouble getting TopHat to work -- empty junctions.bed

    Hi all,

    To prepare for analyzing some RNAseq data we are expecting to get soon, I'm trying to use TopHat to analyze a similar published dataset from the SRA (http://www.ncbi.nlm.nih.gov/geo/quer...i?acc=GSE21323). We are also using SOLiD and are expecting 36-bp reads, though our data will be paired-end.

    After downloading the .sra file and converting it (with fastq-dump and a perl script from another thread on the forum), I have a prmt5.csfasta file that looks like:

    >prmt5.1
    T12333110000213211212111200120201030
    >prmt5.2
    T22222230320112002111202002111030220
    >prmt5.3
    T00112120103200010122310030111101112
    >prmt5.4
    T33113031103310232122213233232311110
    >prmt5.5
    T32211100022011031031033221131332201
    [...]
    and a prmt5.qual file that looks like:

    >prmt5.1
    4 5 5 2 2 4 4 3 7 5 3 5 4 2 4 3 5 5 7 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2
    >prmt5.2
    5 5 7 9 5 5 4 6 3 5 4 5 5 5 5 5 5 12 10 5 5 8 5 6 3 5 11 5 9 5 7 8 5 5 4
    >prmt5.3
    3 3 2 4 4 3 3 4 5 5 2 5 4 5 4 5 5 4 3 3 5 5 5 5 5 5 5 4 5 5 3 4 4 4 5
    >prmt5.4
    2 10 5 4 5 2 4 2 5 5 2 2 2 2 3 2 5 2 2 4 2 4 4 5 5 5 4 4 4 5 3 4 2 4 4
    >prmt5.5
    3 5 8 14 4 2 5 7 5 5 2 4 10 11 5 2 7 11 11 10 2 2 3 10 4 5 5 13 14 3 2 5 5 10 2
    [...]
    My tophat command is:

    tophat -o prmt5_tophat-out --segment-length 17 --segment-mismatches 1 -CQ -a 4 a_thaliana-color prmt5.csfasta prmt5.qual 1> tophat.prmt5.out 2> tophat.prmt5.err
    TopHat seems to run fine:

    [Fri Jul 1 10:58:01 2011] Beginning TopHat run (v1.3.0)
    -----------------------------------------------
    [Fri Jul 1 10:58:01 2011] Preparing output location Col_tophat-out/
    [Fri Jul 1 10:58:01 2011] Checking for Bowtie index files
    [Fri Jul 1 10:58:01 2011] Checking for reference FASTA file
    [Fri Jul 1 10:58:01 2011] Checking for Bowtie
    Bowtie version: 0.12.7.0
    [Fri Jul 1 10:58:01 2011] Checking for Samtools
    Samtools Version: 0.1.16
    [Fri Jul 1 10:58:01 2011] Generating SAM header for a_thaliana-color
    [Fri Jul 1 10:58:05 2011] Preparing reads
    format: fasta
    Left reads: min. length=36, count=41286371
    [Fri Jul 1 11:15:33 2011] Mapping left_kept_reads against a_thaliana-color with Bowtie
    [Fri Jul 1 11:39:45 2011] Processing bowtie hits
    [Fri Jul 1 12:36:35 2011] Mapping left_kept_reads_seg1 against a_thaliana-color with Bowtie (1/2)
    [Fri Jul 1 13:50:13 2011] Mapping left_kept_reads_seg2 against a_thaliana-color with Bowtie (2/2)
    [Fri Jul 1 14:35:15 2011] Searching for junctions via segment mapping
    [Fri Jul 1 16:12:24 2011] Retrieving sequences for splices
    [Fri Jul 1 16:12:37 2011] Indexing splices
    [Fri Jul 1 16:12:38 2011] Mapping left_kept_reads_seg1 against segment_juncs with Bowtie (1/2)
    [Fri Jul 1 16:19:32 2011] Mapping left_kept_reads_seg2 against segment_juncs with Bowtie (2/2)
    [Fri Jul 1 16:25:30 2011] Joining segment hits
    [Fri Jul 1 16:37:26 2011] Reporting output tracks
    -----------------------------------------------
    Run complete [05:49:39 elapsed]
    but junctions.bed, deletions.bed and insertions.bed are empty (aside from the header). accepted_hits.bam is 539MB, so I guess it's OK. I had a look through the logs and found the following in long_spanning_reads.log:

    long_spanning_reads v1.3.0 (2398:2399)
    --------------------------------------------
    Opening Col_tophat-out/tmp/left_kept_reads_seg1.bwtout.z for reading
    Opening Col_tophat-out/tmp/left_kept_reads_seg2.bwtout.z for reading
    Opening Col_tophat-out/tmp/left_kept_reads_seg1.to_spliced.bwtout.z for reading
    Opening Col_tophat-out/tmp/left_kept_reads_seg2.to_spliced.bwtout.z for reading
    Loading reference sequences...
    reference sequences loaded.
    Loading spliced hits...done
    Loading junctions...done
    Loading deletions...done
    Warning: malformed closure
    Warning: malformed closure
    Warning: malformed closure
    Warning: malformed closure
    [...]
    I don't know what "malformed closure" means and didn't find anything when I searched the forum or google. There are no other obvious errors in the log files, though this is all new to me, so I might be missing something. segment_juncs.log says "Found 19801 potential split-segment junctions", so I'm confused that junctions.bed is empty...or am I misunderstanding something?

    Can anyone provide some pointers about what's going wrong? Am I making a simple, silly mistake or is there something fundamental I'm misunderstanding?

    I thought it would be fine to do these test runs on my desktop (Core2 Duo 3GHz with 4GB RAM); could that be causing trouble (eg, not enough memory)?

    Thanks for any help, guidance or suggestions!
  • GiladZil
    Junior Member
    • Jun 2011
    • 5

    #2
    Same problem - empty .bed files

    Hey,
    I have the exact same problem and data type as thurisaz's, except I got no warnings or error messages whatsoever.
    According to 'bowtie.left_kept_reads.fixmap.log' only 10% of the reads were aligned!!
    I should note that when I ran Tophat 1.0.12 on a sample dataset I did get meaningful .bed files, but there I had another problem when I ran it on the full dataset, as the temporary file: 'left_kept_reads.fq.candidate_hits.sam' became huge (>360GB) and the program was terminated by the server before it finished running.
    The Bowtie version was the same for both runs ( 0.12.3).
    I couldn't find significant changes between v1.0.12 and v1.3.0 regarding short reads handling. Were ther any?

    Please help us solve this, it's quite depressing coming home to an empty .bed...

    Comment

    • kathi
      Junior Member
      • Apr 2011
      • 1

      #3
      Tophat 1.3.0 - Warning: malformed closure - hardly any junctions found

      Hi all,

      I encounter a similar problem when running Tophat 1.3.0 on Illumina sequencing data (75nt PE).

      The dataset has ~30 mio reads and Tophat 1.1.4 (--num-threads 8 --mate-inner-dist 200 --solexa-quals --min-isoform-fraction 0 --coverage-search --segment-mismatches 1) predicts almost 9 mio junction-spanning reads among them.

      The same input file and the same Tophat parameters used with Tophat 1.3.0 finds only ~1000 junction-spanning reads, changing the parameters makes the number drop down to 0.

      Like reported by thurisaz, Tophat seems to run fine, but long_spanning_reads.log contains many many lines of Warning: malformed closure.

      I would be very happy for any explanations why the junction reads are lost or suggestions how to solve the problem.

      Thanks a lot,
      Kathi

      Comment

      • thurisaz
        Member
        • Jun 2011
        • 24

        #4
        Following kathi's post, I decided to try running the analysis with an old version of tophat and found the same thing. I ran TopHat 1.3.1 and 1.1.4 on paired end SOLid (50+25bp, about 36m reads) with a GTF file and exactly the same options (--coverage-search -r 131 --library-type fr-secondstrand -F 0.01 -p 16 -i 5 -I 6000 --segment-length 17 --segment-mismatches 1 -m 1 -CQ -a 4).

        TopHat 1.3.1:
        - empty junctions.bed
        - "Found 0 junctions from happy spliced reads" in reports.log
        - Reports 41% of F3 reads "with more than one alignment" in bowtie.left_kept_reads.fixmap.log and then 4%,4% & 19% in the _seg* logs
        - F5 reads are 40% aligned, followed by 7 & 1.6% for the segments

        TopHat 1.1.4:
        - junctions.bed is not empty
        - reports.log says "Found 26222 junctions from happy spliced reads"
        - F3 reads map 61% initially and F5 reads 40%. I'm not sure what's going on the with segments because the filenames look random and there seem to be two log files per segment with different percentages. My impression, though, is that the values look better (getting as high as 80-90%).

        These runs were done back-to-back on the same machine using the same data and GTF file; the only thing that changed is the version of Tophat. I'm a bit surprised to see such drastic differences, particularly since I thought v.1.3 is supposed to be better at handling SOLiD. Is there some good reason for this which I'm missing? Should I be doing something different with v. 1.3.1?

        I'm going to try 1.2.0 now and see how it behaves; hopefully I'll report back about that later today. In the meantime, any suggestions or insight would be greatly appreciated.

        Comment

        • thurisaz
          Member
          • Jun 2011
          • 24

          #5
          1.2.0 also didn't work well for me; it only found 20 junctions when run with precisely the same options. Based on the .bam file, it also seems to have aligned far fewer reads overall (9m vs. 15m & 19m for 1.3.1 & 1.1.4, respectively), which I find a bit strange. I guess it's possible that something odd happened during the run, but I'm not planning to troubleshoot it; I think I'll just use 1.1.4 for now.

          Comment

          • thurisaz
            Member
            • Jun 2011
            • 24

            #6
            One more thing -- in order to get tophat 1.1.4 to run without errors on my colorspace data, I had to replace line 1511 of the tophat script:

            Code:
            decode_dic = { 'A0':'A', 'A1':'C', 'A2':'G', 'A3':'T', 'A4':'N', 'A.':'N',                                                    
                              'C0':'C', 'C1':'A', 'C2':'T', 'C3':'G', 'C4':'N', 'C.':'N',                                                
                              'G0':'G', 'G1':'T', 'G2':'A', 'G3':'C', 'G4':'N', 'G.':'N',                                                
                              'T0':'T', 'T1':'G', 'T2':'C', 'T3':'A', 'T4':'N', 'T.':'N',                                                
                              'N0':'N', 'N1':'N', 'N2':'N', 'N3':'N', 'N4':'N', 'N.':'N' }
            with the corresponding code from v. 1.3.1:

            Code:
            decode_dic = { 'A0':'A', 'A1':'C', 'A2':'G', 'A3':'T', 'A4':'N', 'A.':'N', 'AN':'N',                                           
                     'C0':'C', 'C1':'A', 'C2':'T', 'C3':'G', 'C4':'N', 'C.':'N', 'CN':'N',                                           
                     'G0':'G', 'G1':'T', 'G2':'A', 'G3':'C', 'G4':'N', 'G.':'N', 'GN':'N',                                           
                     'T0':'T', 'T1':'G', 'T2':'C', 'T3':'A', 'T4':'N', 'T.':'N', 'TN':'N',                                           
                     'N0':'N', 'N1':'N', 'N2':'N', 'N3':'N', 'N4':'N', 'N.':'N', 'NN':'N',                                           
                    '.0':'N', '.1':'N', '.2':'N', '.3':'N', '.4':'N', '..':'N', '.N':'N' }

            Comment

            • biznatch
              Senior Member
              • Nov 2010
              • 124

              #7
              I am also getting lots of "Warning: malformed closure" warnings, does anyone know what this means?

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Pathogen Surveillance with Advanced Genomic Tools
                by seqadmin




                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                Yesterday, 11:48 AM
              • seqadmin
                New Genomics Tools and Methods Shared at AGBT 2025
                by seqadmin


                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                The Headliner
                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                03-03-2025, 01:39 PM
              • seqadmin
                Investigating the Gut Microbiome Through Diet and Spatial Biology
                by seqadmin




                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                02-24-2025, 06:31 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-20-2025, 05:03 AM
              0 responses
              34 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-19-2025, 07:27 AM
              0 responses
              43 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-18-2025, 12:50 PM
              0 responses
              35 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-03-2025, 01:15 PM
              0 responses
              190 views
              0 reactions
              Last Post seqadmin  
              Working...