  • Cufflinks and Cuffcompare extended CDS?

    Hello,

    I have just finished running cufflinks on my dataset, and I would like to know how to calculate how far the cufflinks transcript models extend past the known annotated reference models. This would shed light on the 5' and 3' UTRs and help refine the previous annotation. I ran a cuffcompare analysis on my dataset, but as far as I can tell the cuffcompare class codes do not cover this.

    Does anyone have any ideas, comments, or suggestions on how to quantify this?

    Thanks,

  • #2
    Hello, apadr007
    Among the class codes in cuffcompare's tracking files, it seems to me that both code "o - Generic exonic overlap with a reference transcript" and code "x - Exonic overlap with reference on the opposite strand" can help you pick up the transcripts of interest.
    My question is: why don't the current cuffcompare class codes meet your needs? Thank you, and I hope for your reply.


    • #3
      I had thought the generic overlap with a reference transcript only referred to single exon overlaps. Thanks, it seems I can use these class codes for my analysis.


      • #4
        Originally posted by jiyan
        Hello, apadr007
        Among the class codes in cuffcompare's tracking files, it seems to me that both code "o - Generic exonic overlap with a reference transcript" and code "x - Exonic overlap with reference on the opposite strand" can help you pick up the transcripts of interest.
        My question is: why don't the current cuffcompare class codes meet your needs? Thank you, and I hope for your reply.

        I have checked my results. It seems that some of the items with code "o" are contained within reference transcripts, so this code does not work for finding extended CDS when predicting UTRs.


        • #5
          I want to revive this thread, as I am basically doing what apadr007 planned to do some time ago. I also observed that some class code "o" transcripts are contained within reference transcripts (the same observation as lixiangru). But why were those transcripts classified as "o" rather than "c"?

          I also want to see how the reference transcriptome annotation compares to the cufflinks transcript models. For that, is it right to use class code "o" only?

          Also, apadr007, have you finished your analysis? If so, could you share how you calculated how far the coding regions of the cufflinks transcript models extended past the known annotated reference models? Any help would be appreciated. Thanks
          Last edited by upendra_35; 10-15-2012, 10:47 PM.


          • #6
            Originally posted by upendra_35
            I want to revive this thread, as I am basically doing what apadr007 planned to do some time ago. I also observed that some class code "o" transcripts are contained within reference transcripts (the same observation as lixiangru). But why were those transcripts classified as "o" rather than "c"?

            I also want to see how the reference transcriptome annotation compares to the cufflinks transcript models. For that, is it right to use class code "o" only?

            Also, apadr007, have you finished your analysis? If so, could you share how you calculated how far the coding regions of the cufflinks transcript models extended past the known annotated reference models? Any help would be appreciated. Thanks

            Hi upendra_35,

            These transcripts get these general classifications based on criteria that were set by the developers, so take them with a grain of salt. Although the logic behind these class codes is sound, it is not all-encompassing. I have experimentally validated several "polymerase run-on fragments" and several other classes that cuffcompare considers false; I picked them up with RT-PCR. I would use the class codes more as a guide for your analysis. They are useful for determining whether you have detected a variant of a previously known gene or whether what you picked up is totally novel with respect to the reference annotation.

            To detect coding regions that extend past the known reference annotation, first extract from your cuffcompare data the reference genes you detected (by accession number). Place these accession numbers in a file, one per line; call it input.txt, for example. Also get the .gff3 file of your reference annotation, then do this in a unix shell:

            Code:
            while read -r A B; do grep -F -- "$A" reference-genome.gff3; done < input.txt > output.gff3
            This will generate a .gff3 containing only the accession numbers you detected with cuffcompare. (You would do this if you detected only a portion of your reference genes rather than all of them, for example in a tissue-specific transcriptome analysis, or if you do not trust the coverage mapping to parts of the reference annotation. Either way, you select only what you place into input.txt.) From here, remove all of the extra features in the .gff3, such as coding DNA sequences, like this:

            Code:
             awk -F'\t' '$3 == "gene"' output.gff3 > output.gff3.gene
            Then reformat this .gff3 file into a .bed file. You can take a look at the BED format here (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). You can do this in Excel if you are not familiar with awk or perl; it's just a matter of changing the columns around and adding the orientation of your transcripts with either a + or a -. Keep in mind that BED start coordinates are 0-based while GFF3 coordinates are 1-based, so subtract 1 from each start position.
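
For anyone who would rather stay in the shell, the GFF3-to-BED reformatting described above can be sketched with awk. This is only a minimal sketch: the function name gff3_genes_to_bed and the output file name are made up for illustration, and it assumes the gene accession sits in the ID= attribute of column 9, which depends on how your annotation was written.

```shell
# Convert the "gene" rows of a GFF3 file into 6-column BED.
# Assumption: the accession is stored in the ID= attribute (column 9).
gff3_genes_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip header/comment lines
         $3 == "gene" {
             name = "."
             n = split($9, attrs, ";")
             for (i = 1; i <= n; i++)
                 if (attrs[i] ~ /^ID=/) name = substr(attrs[i], 4)
             # GFF3 is 1-based inclusive; BED start is 0-based, end unchanged
             print $1, $4 - 1, $5, name, 0, $7
         }' "$1"
}

# Hypothetical usage:
#   gff3_genes_to_bed output.gff3 > output.genes.bed
```

The score column is filled with 0 because nothing in the GFF3 maps onto it; the strand is copied straight from column 7.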

            From here you can use BEDtools (http://code.google.com/p/bedtools/#BEDTools_Summary) to analyze how much coverage your bam files give within a certain distance upstream (i.e. 5') or downstream (i.e. 3') of each reference gene. Alternatively, you can do this at the transcript level as well, but in my experience people normally use the coverage from their reads.
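
As a rough illustration of the upstream/downstream windows, the sketch below builds strand-aware flanking intervals from a 6-column BED file with plain awk. The function name gene_flanks and the window size are hypothetical, and the downstream edge is not clipped at the chromosome end because no chromosome sizes are supplied here.

```shell
# Emit a 5' (upstream) and a 3' (downstream) window for every gene in a
# 6-column BED file. $1 = BED file, $2 = window size in bp (hypothetical).
gene_flanks() {
    awk -v size="$2" 'BEGIN { FS = OFS = "\t" }
        {
            left_s  = ($2 - size < 0) ? 0 : $2 - size   # clamp at position 0
            left_e  = $2
            right_s = $3
            right_e = $3 + size                         # not clipped at chromosome end
            if ($6 == "-") {       # minus strand: 5-prime side is to the right
                up_s = right_s; up_e = right_e; dn_s = left_s; dn_e = left_e
            } else {
                up_s = left_s; up_e = left_e; dn_s = right_s; dn_e = right_e
            }
            if (up_e > up_s) print $1, up_s, up_e, $4 "_5prime", 0, $6
            if (dn_e > dn_s) print $1, dn_s, dn_e, $4 "_3prime", 0, $6
        }' "$1"
}

# Hypothetical usage, followed by a read count per window. The bedtools
# call assumes a recent release (2.24 or later), where -a holds the
# windows and -b the reads; check "bedtools coverage -h" for your version.
#   gene_flanks output.genes.bed 500 > flanks.bed
#   bedtools coverage -a flanks.bed -b accepted_hits.bam > flank_coverage.txt
```

Windows with clear read coverage next to a gene are candidates for an extended UTR; the 500 bp size is only a placeholder to tune for your organism.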

            As far as comparing your models to the reference annotation goes, I would use all of the ones that have an associated accession number, except for "p", which can lie up to 2 kb away from a reference gene.
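
If you go the transcript-level route instead of read coverage, the extension past the reference can be computed directly from coordinates. The sketch below is one possible way to do it, not what cuffcompare itself reports: it assumes two 6-column BED files that you prepared yourself, one of reference genes and one of assembled transcripts whose name column already holds the matched reference accession. The function name transcript_extensions is made up. It measures how far each transcript model reaches beyond the reference gene boundaries, which is a proxy for extended UTRs rather than a CDS measurement.

```shell
# $1 = reference genes (BED6, name = accession)
# $2 = assembled transcripts (BED6, name = matched accession; hypothetical prep step)
# Output: accession, bp extended at the 5-prime end, bp extended at the 3-prime end
transcript_extensions() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR {                      # first file: remember reference bounds
             ref_s[$4] = $2; ref_e[$4] = $3; ref_strand[$4] = $6
             next
         }
         ($4 in ref_s) {                  # second file: compare each transcript
             left  = ref_s[$4] - $2; if (left  < 0) left  = 0
             right = $3 - ref_e[$4]; if (right < 0) right = 0
             if (ref_strand[$4] == "-") print $4, right, left
             else                       print $4, left, right
         }' "$1" "$2"
}

# Hypothetical usage:
#   transcript_extensions output.genes.bed cufflinks.transcripts.bed > extensions.tsv
```

A value of 0 means the transcript does not reach past the reference on that side; transcripts without a matching accession are skipped.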


            cheers,
            AP


            • #7
              Thanks a lot, AP. By accession number, do you mean class codes? I have the class codes below from my analysis, and as I understand it, I should take all of the class codes except "p" for validating the reference genome annotation (I have a gff file for the reference genome). Right?

              9982 =
              4230 c
              1 class_code
              2522 e
              122 i
              11275 j
              4591 o
              304 p
              39 s
              1882 u
              317 x
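
A tally like the one above can be reproduced with a short shell function. This is a sketch under assumptions: the function name count_class_codes is made up, and the column holding the class code has to be passed in because it differs between cuffcompare outputs (as far as I know it is column 4 in the .tracking file and column 3 in the .tmap files, so check your own files). Unlike the tally above, it skips the "class_code" header line.

```shell
# Count how often each class code occurs in a tab-separated cuffcompare file.
# $1 = file, $2 = number of the column that holds the class code (assumption)
count_class_codes() {
    awk -v col="$2" 'BEGIN { FS = "\t" }
         $col != "class_code" { print $col }   # drop the header row if present
        ' "$1" | sort | uniq -c
}

# Hypothetical usage:
#   count_class_codes cuffcmp.tracking 4
```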

              The main objective of this analysis is not to make a new annotation altogether but to take the existing annotation and improve it further using RNA-seq. So we thought of using only class "u" (novel transcripts) and class "o" (novel exons, or corrected transcript models compared to the reference?), and after validating some of these, we plan to combine them with the existing annotation. What do you think of the overall strategy? Also, for validating "u" and "o" I was planning to use qPCR and RT-PCR respectively. Does that make sense?


              • #8
                When I say accession number I mean the name of the reference genes in your organism. This is found in the file you downloaded from whatever database you used.

                No, you just get the accession numbers based on what your transcripts detected. If you have a transcript with an "o" or a "j", then it will have an associated accession number from the reference annotation, and that is the accession number you use. These accession numbers are reported in the cuffcompare.tracking output file. Obviously the "u" transcripts have no associated accession number, so they would not be used for the analysis you want to do.
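
To build the list of detected accession numbers from the tracking file, something like the following could work. It is a minimal sketch that assumes the usual tab-separated .tracking layout, where column 3 holds the reference gene and transcript separated by "|" (or "-" when there is no match) and column 4 holds the class code; the function name tracking_to_accessions is made up. It leaves out "u" and "p", as discussed above.

```shell
# Collect the reference gene accessions matched by assembled transcripts.
# Assumed .tracking layout: col 3 = ref_gene|ref_transcript (or "-"),
# col 4 = class code.
tracking_to_accessions() {
    awk 'BEGIN { FS = "\t" }
         $3 != "-" && $4 != "u" && $4 != "p" {
             split($3, ref, "|")        # ref[1] = gene, ref[2] = transcript
             print ref[1]
         }' "$1" | sort -u
}

# Hypothetical usage:
#   tracking_to_accessions cuffcmp.tracking > input.txt
```

Printing ref[2] instead would give transcript-level accessions, if your reference .gff3 is keyed that way.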

                Designing primers just to test whether your transcripts are real, based on read alignment, will not tell you very much in my opinion, since this has been confirmed already and you can simply cite a paper that has done it before (check out http://www.biomedcentral.com/1471-2164/12/587/). You could, however, test them to determine an FPKM cutoff for your data, i.e. the minimum expression level at which you call a transcript "real".

                If you want to test your models, you can design primers that span splice junctions, run a PCR, and then send the products for sequencing (traditional sequencing). This will tell you whether your novel junctions are real and will assess the overall accuracy of your isoforms.


                • #9
                  Thanks so much, AP. Very useful comments and suggestions. I have been struggling for the last few days with how to deal with the cuffcompare class codes, and your comments and suggestions are a relief. The problem is that there is not much information about this anywhere (not even on their website).

                  So to sum up:

                  Class code "u" transcripts do not need to be validated, because this kind of validation has already been published, unless someone wants to determine an FPKM cut-off. (By the way, I have been using an FPKM cut-off of 1 to filter the novel transcripts. Is that reasonable or too relaxed?)

                  Class code "o" transcripts do need to be validated, by designing primers across the splice junctions and sequencing the products.

                  Thanks again AP.
                  Last edited by upendra_35; 10-16-2012, 09:33 AM.


                  • #10
                    Hey all,

                    I am looking for transcripts that are extended or clipped at either the 5' or the 3' end. Could anyone help me with this? Which class code would be helpful for identifying these transcripts?


                    • #11
                      My best guess is that class code "o" transcripts are the ones you should be looking at.
