  • how to run large inputs using Tophat quicker?

    Hi,

    I'm analysing some paired-end RNA-seq data from 20 healthy individuals using TopHat and Cufflinks. However, the novel isoforms detected in each individual are quite different from each other when I compare them across individuals. So now I'm thinking of merging these 20 samples into one mega file and then using TopHat and Cufflinks to detect isoforms. This brings a new problem. The mega file is very big, about 74 GB for read1 and another 74 GB for read2. When I ran TopHat, it took about three weeks, which is very risky: if my computer shuts down for any reason, the job is terminated and I need to run it again. Does anyone know how to make TopHat run quicker on input files this large?

    Many thanks

  • #2
    Hello
    You could use the iPlant Collaborative infrastructure... Set up an account, upload your data to the data store, and then use their cloud infrastructure to run your alignment on their supercomputers.

    • #3
      If I might ask, how large was the fabric/infrastructure you were using for the analysis of your data using Tophat/Cufflinks? I'm curious to know this.

      Thanks.

      • #4
        TopHat just aligns the reads to a reference. The alignment of any one read to that reference sequence is independent of all other reads so alignments will not be impacted by submitting the reads as 20 independent files or one single file. In other words if you already have the output of 20 TopHat runs there is no point in re-running TopHat on these reads.

        Novel isoforms are identified by Cufflinks. There are cufflinks parameters which define a coverage minimum to call a novel isoform. Have you run cuffmerge to combine the results of the individual cufflinks output into one unified set of isoform calls?
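For anyone following along, here is a minimal sketch of the cuffmerge step mentioned above. It assumes each sample already has its own Cufflinks output directory containing the usual `transcripts.gtf`; the directory names, `genes.gtf`, and `genome.fa` are hypothetical placeholders, and the helper functions are just illustrations, not part of Cufflinks. The sketch only builds the command line, it does not run anything.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def write_manifest(sample_dirs: Sequence[str], manifest_path: str) -> List[str]:
    """Write one transcripts.gtf path per line, the list format cuffmerge takes as input."""
    gtf_paths = [str(Path(d) / "transcripts.gtf") for d in sample_dirs]
    Path(manifest_path).write_text("\n".join(gtf_paths) + "\n")
    return gtf_paths


def cuffmerge_command(
    manifest_path: str,
    out_dir: str,
    ref_gtf: Optional[str] = None,
    genome_fasta: Optional[str] = None,
    threads: int = 1,
) -> List[str]:
    """Build a cuffmerge argument list (-o output dir, -p threads, -g reference GTF, -s genome)."""
    cmd = ["cuffmerge", "-o", out_dir, "-p", str(threads)]
    if ref_gtf:
        cmd += ["-g", ref_gtf]
    if genome_fasta:
        cmd += ["-s", genome_fasta]
    cmd.append(manifest_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical layout: cufflinks_out/sample01 ... cufflinks_out/sample20
    sample_dirs = [f"cufflinks_out/sample{i:02d}" for i in range(1, 21)]
    # write_manifest(sample_dirs, "assemblies.txt")  # uncomment to create the list file
    cmd = cuffmerge_command("assemblies.txt", "merged_asm",
                            ref_gtf="genes.gtf", genome_fasta="genome.fa", threads=8)
    print(" ".join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute it
```

The manifest is written separately from the command so you can inspect which assemblies go into the merge before starting it.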

        • #5
          Hi kmcarr,

          I have run cuffmerge, but the result is not very satisfactory. For example, in the individual assemblies I found the reference transcript in each of my samples, but after cuffmerge the reference transcript was lost and the merged isoforms tend to be longer.

          TopHat identifies junctions using reads, so I'm wondering: if I merge all the reads together at the first step, will it improve the accuracy of junction detection?

          • #6
            Originally posted by xy6699
            TopHat identifies junctions using reads, so I'm wondering: if I merge all the reads together at the first step, will it improve the accuracy of junction detection?
            No, TopHat does not identify junctions. This is the point I was trying to get across in my first post. TopHat aligns each read independently of all other reads, so it does not matter if you have more reads in your input. TopHat WILL NOT change how it aligns reads based on how many reads you give it.

            Cufflinks identifies junctions. There are some Cufflinks parameters which set a minimum number of reads supporting a junction. If you want to see what might happen, you could merge the BAM output from the 20 individual TopHat runs and run Cufflinks on that. But you really have to ask yourself: if the junction is sooooo rare that you need to go to these lengths to detect it, are you sure it's real?
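A rough sketch of the merge-then-assemble idea above, in case it helps. It assumes each TopHat run wrote its default `accepted_hits.bam` into its own output directory and that `samtools` and `cufflinks` are on your PATH; the directory and file names are hypothetical, and the two helper functions are only illustrations. It builds the two command lines without executing them.

```python
from pathlib import Path
from typing import List, Sequence


def samtools_merge_command(tophat_dirs: Sequence[str], merged_bam: str) -> List[str]:
    """Build 'samtools merge <out.bam> <in1.bam> <in2.bam> ...' from per-sample TopHat dirs."""
    bams = [str(Path(d) / "accepted_hits.bam") for d in tophat_dirs]
    return ["samtools", "merge", merged_bam] + bams


def cufflinks_command(bam: str, out_dir: str, threads: int = 1) -> List[str]:
    """Build a cufflinks argument list (-p threads, -o output dir) for one BAM file."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam]


if __name__ == "__main__":
    # Hypothetical layout: tophat_out/sample01 ... tophat_out/sample20
    tophat_dirs = [f"tophat_out/sample{i:02d}" for i in range(1, 21)]
    merge_cmd = samtools_merge_command(tophat_dirs, "all_samples.bam")
    assemble_cmd = cufflinks_command("all_samples.bam", "cufflinks_merged", threads=8)
    # Print only; run each with subprocess.run(cmd, check=True) once the paths are right.
    print(" ".join(merge_cmd))
    print(" ".join(assemble_cmd))
```

This reuses the existing alignments, so the three-week TopHat run on the combined FASTQ files is not needed for this experiment.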

            • #7
              Ah, I see. Thanks a lot for your information.

              So the problem I have now is that after cuffmerge I lost the reference annotated transcripts for some genes as I mentioned above. I attached an example here. Do you have any suggestions about how to assemble the isoforms more correctly?

              Many thanks

              • #8
                TopHat 2nd iteration with merged-junctions file as input to -j option?

                Hi kmcarr, does your comment still apply if I am running a 2nd TopHat iteration with a merged-junctions.bed file as input to the -j option? I ran the 1st iteration of TopHat with just the -G option and the 2nd iteration with -j and -G.

                Does this help detect junctions that would otherwise go missing with a single tophat iteration?
