  • Tophat: report *only* novel splice junctions?

    Okay, so I've finally got my bearings with TopHat (I think).

    First, I was able to shoehorn my Helicos RNA-Seq data so that it can be used with TopHat (thank you for the help Cole). Then I was finally able to work around the issues that I was having with getting a good gff3 file (thanks to the forum members who helped).

    Now I have an additional question: is there a way to report *only* the junctions that are found but aren't in the supplied gff3 file? I generated a list of junctions with and without the --no-novel-juncs parameter, and by comparing the two output junction files I know that I have about a thousand novel junctions.

    It wouldn't be too hard to write a script to compare the outputs and only keep junctions that are novel, but I'm lazy and would prefer not to do it if there is another way.

    Thanks,
    Sam
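For anyone who wants to script the comparison described above, here is a minimal sketch. It assumes both files are TopHat-style junctions.bed files in which each junction is a two-block BED record, so the intron runs from the end of the first block to the start of the second. The function names and file paths are hypothetical, not part of TopHat. Comparing intron coordinates rather than whole lines matters because the block lengths (read overhangs) and scores can differ between the two runs even for the same junction.

```python
def junction_key(fields):
    """Return (chrom, intron_start, intron_end, strand) for one BED row.

    Coordinates are 0-based, half-open, derived from the two BED blocks.
    """
    chrom, start, strand = fields[0], int(fields[1]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    intron_start = start + block_sizes[0]   # first base after the left block
    intron_end = start + block_starts[1]    # first base of the right block
    return (chrom, intron_start, intron_end, strand)


def read_junctions(path):
    """Map junction key -> original BED line, skipping track and blank lines."""
    junctions = {}
    with open(path) as handle:
        for line in handle:
            if line.startswith("track") or not line.strip():
                continue
            junctions[junction_key(line.rstrip("\n").split("\t"))] = line
    return junctions


def novel_junctions(all_path, known_only_path):
    """Lines from the unrestricted run whose intron is absent from the
    --no-novel-juncs run."""
    known = read_junctions(known_only_path)
    found = read_junctions(all_path)
    return [line for key, line in found.items() if key not in known]
```

Calling `novel_junctions("all/junctions.bed", "known/junctions.bed")` (both paths are placeholders) would return the BED lines to keep.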

  • #2
    Never mind, I figured it out.
    Last edited by sdarko; 01-19-2010, 06:16 AM.


    • #3
      I think you are doing it the right way. At present, there is no convenient way to output only the novel junctions.


      • #4
        Hi,

        From what you said, it sounds like the idea is to compare the junctions output from TopHat with a reference gff3 file; junctions that do not appear in the gff3 are potential novel splice forms.

        Can you explain in more detail:
        1. What is the gff3 file, and how can I get one for the human genome?
        2. Do I need to convert junctions.bed to gff3?
        3. Which TopHat command, or which other software, can do the comparison between my sample junctions and the reference gff3 file?

        Thanks a lot


        • #5
          Originally posted by jiwu2573
          Hi,

          From what you said, it sounds like the idea is to compare the junctions output from TopHat with a reference gff3 file; junctions that do not appear in the gff3 are potential novel splice forms.

          Can you explain in more detail:
          1. What is the gff3 file, and how can I get one for the human genome?
          2. Do I need to convert junctions.bed to gff3?
          3. Which TopHat command, or which other software, can do the comparison between my sample junctions and the reference gff3 file?

          Thanks a lot
          RE:
          1. For more information on gff3, a web search will turn up the format description. Getting a gff3 for the human genome takes a couple of steps: first download the GTF from Ensembl, then convert it to gff3 with a Perl script (gtf2gff3.pl).
          2. No.
          3. You will have to write a program yourself.

          You are welcome.
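As a rough illustration of point 3, here is a minimal sketch of such a program. It assumes the gff3 has `exon` features carrying a `Parent` attribute, and that junctions have already been reduced to (chrom, intron_start, intron_end, strand) tuples in 0-based, half-open coordinates. The function names are hypothetical. GFF3 coordinates are 1-based and inclusive, so the intron between two exons starts at the left exon's end value and ends at the right exon's start minus one once converted.

```python
from collections import defaultdict


def annotated_introns(gff3_lines):
    """Collect introns implied by consecutive exons of each parent transcript."""
    exons = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(item.split("=", 1) for item in f[8].split(";") if "=" in item)
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(f[0], f[6], parent)].append((int(f[3]), int(f[4])))

    introns = set()
    for (chrom, strand, _parent), spans in exons.items():
        spans.sort()
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            # Convert from 1-based inclusive exons to a 0-based half-open intron.
            introns.add((chrom, left_end, right_start - 1, strand))
    return introns


def split_novel(junction_keys, introns):
    """Return (known, novel) lists of junction tuples."""
    known = [k for k in junction_keys if k in introns]
    novel = [k for k in junction_keys if k not in introns]
    return known, novel
```

Chromosome names have to match between the two files (for example `chr1` versus `1`) for the lookup to work.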


          • #6
            Originally posted by sdarko
            Okay, so I've finally got my bearings with TopHat (I think).

            Then I was finally able to work around the issues that I was having with getting a good gff3 file (thanks to the forum members who helped).
            Would you kindly share the good gff3 file for the human genome?
            How big is the file?

            After I get tophat running with this gff3 file, I can write a program to compare outputs with and without the --no-novel-juncs parameter and share with you.

            In addition, have you ever thought about differential splice events between groups (2 conditions)? Maybe the program could count the percentage of reads supporting a particular novel splice form and then compute some statistics between the 2 groups? Is there any other way you can think of? Let me know so I can implement this function in the program too.
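One possible way to put numbers on that idea is sketched below. This is an assumption about how it could be done, not something any tool in this thread provides: for each junction, build a 2x2 table of reads supporting that junction versus all other junction reads, in condition A and condition B, and apply a two-sided Fisher's exact test. The function names are hypothetical, and the test treats reads as independent and ignores replicate variability.

```python
from math import comb


def junction_fraction(junction_reads, total_junction_reads):
    """Fraction of junction-spanning reads that support one junction."""
    if total_junction_reads == 0:
        return 0.0
    return junction_reads / total_junction_reads


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    a, b: reads on / off the junction in condition A
    c, d: reads on / off the junction in condition B
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)
```

With many junctions tested at once, the resulting p-values would also need a multiple-testing correction.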

            Looking forward to your reply!
            Last edited by jiwu2573; 01-25-2010, 01:01 PM.


            • #7
              Originally posted by sdarko
              I generated a list of junctions with and without the --no-novel-juncs parameter so I know that I have about a thousand novel junctions by comparing the two output junction files.
              May I just confirm with you:
              First, run: tophat -G <GFF3 file> --no-novel-juncs
              Second, for the same dataset, run: tophat -G <GFF3 file>
              Finally, compare the two junctions.bed files and pick out the differences.

              By the way, how are you going to deal with those 1000 novel junctions?

              Thanks!


              • #8
                Originally posted by jiwu2573
                May I just confirm with you:
                First, run: tophat -G <GFF3 file> --no-novel-juncs
                Second, for the same dataset, run: tophat -G <GFF3 file>
                Finally, compare the two junctions.bed files and pick out the differences.

                By the way, how are you going to deal with those 1000 novel junctions?

                Thanks!
                Essentially, yes, I was running with those options (plus a couple of other changes, such as allowing fewer multi-matches than the default).

                I was hoping to confirm some of those novel junctions by PCR. This is for my thesis project for my master's degree in bioinformatics.

                I think that I may take a slightly different approach now for novel junctions. I just built a bowtie index for mRNA, ESTs, and refmRNA from UCSC and I'm going to align my RNA-Seq tags to those with bowtie. Then I'm going to take what *doesn't* align to those and use those to search for novel junctions.

                My thinking is that by aligning to mRNA, ESTs, and refmRNA (essentially known splice junctions) and then taking what doesn't align and running that with tophat, then I'll be enriching for novel splice junctions in the unaligned file.
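The "take what doesn't align" step could look roughly like the sketch below. It assumes the reads are in FASTA format and that the names of the reads that aligned have already been collected (for example from the read-name column of the aligner's output). The function names are hypothetical. Bowtie can also write unaligned reads to a file itself via its `--un` option, which may make a script like this unnecessary.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token as the read name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def unaligned_reads(fasta_lines, aligned_names):
    """Return the (name, sequence) pairs whose name is not in aligned_names."""
    aligned = set(aligned_names)
    return [(n, s) for n, s in parse_fasta(fasta_lines) if n not in aligned]
```

The returned reads can then be written back out as FASTA and used as the input to the TopHat run.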


                • #9
                  Also, maybe the next version of cufflinks (same developer who made bowtie and tophat) will be able to do what we want it to do.

                  See this post: http://seqanswers.com/forums/showthread.php?t=3754


                  • #10
                    Originally posted by sdarko
                    Also, maybe the next version of cufflinks (same developer who made bowtie and tophat) will be able to do what we want it to do.

                    See this post: http://seqanswers.com/forums/showthread.php?t=3754
                    *EDIT* Whoops, that seems to be a reply in a thread you started, so I'm sure you've seen it.


                    • #11
                      Hi --

                      In the post that started this thread, Sam says,
                      > I was able to shoehorn my Helicos RNA-Seq data so that it can be used with TopHat (thank you for the help Cole)
                      I'd like to know a bit more about how you solved this, because I'm trying to do something similar. Helicos data has several distinctive characteristics. I'm specifically concerned about its 5% error rate: over half of the errors are deletions, caused by missing the light output of the un-amplified single DNA molecule, and many of the rest are insertions (presumably stray light from nearby molecules or electrical noise, since electro-optical sensitivity will be maxed out for the same reason). Helicos claims its alignment algorithms are designed to handle these, but Bowtie isn't, since it doesn't handle indels. Did you do something really impressive, like hack TopHat to call the Helicos alignment algorithms instead of Bowtie? Or did you just make the formats compatible, as you said in your very first SeqAnswers post, and decide to put up with the errors? I'd much appreciate any suggestions, whether they are file formats, parameter choices, or black-belt Python programming gems ;-)

                      Thanks much!
                      Howie


                      • #12
                        Originally posted by Howie Goodell
                        Did you do something really impressive, like hack TopHat to call the Helicos alignment algorithms instead of Bowtie? Or did you just make the formats compatible, as you said in your very first SeqAnswers post, and decide to put up with the errors? I'd much appreciate any suggestions, whether they are file formats, parameter choices, or black-belt Python programming gems ;-)
                        Hey there Howie. This is going to really disappoint you, but what I ended up doing probably isn't too impressive.

                        My first problem was that tophat requires reads of identical length (which was not documented when I started this project). As you know, Helicos reads are of variable length. So, after converting the sms file to a FASTA file, I just wrote a quick program that trims the reads down to a certain length. I know I'm losing information, but have decided to live with it. So far I've been trimming them to 25bp and splitting reads of 50+bp into two.
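For what it's worth, the trimming step can be sketched in a few lines. The post does not say how the second piece of a long read is chosen or what happens to reads shorter than the target length, so this sketch assumes the second piece is the next 25 bases and that shorter reads are dropped. The function names are hypothetical.

```python
def fixed_length_pieces(seq, length=25):
    """Cut one read into equal-length pieces.

    Reads of at least 2 * length give two pieces, reads of at least
    length give one, and shorter reads give none (an assumption).
    """
    if len(seq) >= 2 * length:
        return [seq[:length], seq[length:2 * length]]
    if len(seq) >= length:
        return [seq[:length]]
    return []


def trim_records(records, length=25):
    """Yield (name, piece) pairs; pieces of one read get _1 / _2 suffixes."""
    for name, seq in records:
        for i, piece in enumerate(fixed_length_pieces(seq, length), start=1):
            yield f"{name}_{i}", piece
```

Any bases beyond the second piece are discarded, which is the information loss mentioned above.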

                        As far as the indel situation goes, I just decided to be very conservative in my TopHat parameters. Using the 25bp reads, I require the anchor lengths to be 10bp with zero mismatches. I also don't allow any segment mismatches. In addition, multihits are set to zero. When I get my BED files, I also ignore any reported splice junctions that have fewer than 3 reads aligning to that particular junction. I figure that indels may be happening, but splice junctions reported under those criteria are probably not coincidence.
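The read-count filter on the BED file is simple to script. The sketch below assumes the fifth BED column (the score) holds the number of reads spanning the junction, which is how TopHat's junctions.bed is usually described. The function name is hypothetical.

```python
def well_supported(bed_lines, min_reads=3):
    """Keep junction lines whose score column is at least min_reads."""
    kept = []
    for line in bed_lines:
        if line.startswith("track") or not line.strip():
            continue  # skip the track header and blank lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_reads:
            kept.append(line)
    return kept
```

For example, `well_supported(open("junctions.bed"))` (the path is a placeholder) would return only the junctions backed by 3 or more reads.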

                        So far, I've had good luck. I probably (okay, certainly) don't have as many aligned reads as people who use Illumina data, but I'm okay with that. In my tests using the sample data from Helicos, I've found some very good evidence for novel spliceforms and novel transcripts.

                        I hope that helps and if you have any questions or need any further help, please feel free to PM me at any time.

                        Sam


                        • #13
                          Sdarko, I'm new to the next generation sequencing community and also use the Helicos platform. Question: is there a reason why you're using Tophat instead of the Helicos pipelines?


                          • #14
                            Originally posted by andrewj
                            Sdarko, I'm new to the next generation sequencing community and also use the Helicos platform. Question: is there a reason why you're using Tophat instead of the Helicos pipelines?
                            As far as I know, the Helicos software (Helisphere --> http://open.helicosbio.com/mwiki/index.php/Main_Page) won't align RNA-Seq reads across exon-intron junctions. And I'm looking for novel transcripts and alternative splice junctions in known genes.

                            For transcript quantification, yes I use Helisphere and align reads to mRNA. For novel splice junctions, I use TopHat and align to the whole genome.


                            • #15
                              Hi!

                              This is a somewhat old thread, but I would like to know more about the biological constraints TopHat uses to call a splice junction, and whether there is any way to override them.
