Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks annotation handling business

    Help me understand if you know, or comment if you find similar

    Cufflinks run with -g, using GTF (from illumina igenomes).



    ORIGINAL ANNOTATION
    Code:
    chr2	unknown	exon	152163161	152164423	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	stop_codon	152163943	152163945	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152163946	152164423	.	-	1	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152165451	152165743	.	-	0	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152165451	152165743	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	CDS	152168773	152169063	.	-	0	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152168773	152169063	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	start_codon	152169061	152169063	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    chr2	unknown	exon	152169593	152169796	.	-	.	gene_id "Trib3"; gene_name "Trib3"; p_id "P4503"; transcript_id "NM_175093"; tss_id "TSS19853";
    ;




    CUFFLINKS OUTPUT:
    What happened to gene ID?? where did it go
    Code:
    chr2	Cufflinks	transcript	152163161	152169796	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273"; full_read_support "yes";
    chr2	Cufflinks	exon	152163161	152164423	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "1"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152165451	152165743	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "2"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152168773	152169063	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "3"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";
    chr2	Cufflinks	exon	152169593	152169796	1000	-	.	gene_id "CUFF.9451"; transcript_id "NM_175093"; exon_number "4"; FPKM "35.0692808303"; frac "0.811271"; conf_lo "33.828739"; conf_hi "36.309823"; cov "154.402273";



    CUFFMERGE OUTPUT:
    ok, got my gene name back!
    What about gene_id? What happened to transcript ID? Where did CUFF.7092.2 come from? Where happened to p_ID?
    Code:
    chr2	Cufflinks	exon	152163157	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023065"; exon_number "1"; gene_name "Trib3"; oId "CUFF.7092.2"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16814";
    chr2	Cufflinks	exon	152165420	152165427	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023065"; exon_number "2"; gene_name "Trib3"; oId "CUFF.7092.2"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16814";
    chr2	Cufflinks	exon	152163157	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023066"; exon_number "1"; gene_name "Trib3"; oId "CUFF.7092.1"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16815";
    chr2	Cufflinks	exon	152166010	152166019	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023066"; exon_number "2"; gene_name "Trib3"; oId "CUFF.7092.1"; nearest_ref "NM_175093"; class_code "j"; tss_id "TSS16815";
    chr2	Cufflinks	exon	152163161	152164423	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "1"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152165451	152165743	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "2"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152168773	152169063	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "3"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";
    chr2	Cufflinks	exon	152169593	152169796	.	-	.	gene_id "XLOC_013512"; transcript_id "TCONS_00023067"; exon_number "4"; gene_name "Trib3"; oId "NM_175093"; nearest_ref "NM_175093"; class_code "="; tss_id "TSS16816"; p_id "P12953";




    CUFFDIFF OUTPUT:
    I will be damned!
    Whats with the new TSS ID. IN my original GTF, this ID belongs to an entirely different gene (LOC100040911) on chrY. For which gene am i looking at expression difference -- Trib3 or LOC100040911?
    Code:
    tracking_id	class_code	nearest_ref_id	gene_id	gene_short_name	tss_id	locus	length	coverage	q1_FPKM	q1_conf_lo	q1_conf_hi	q1_status	q2_FPKM	q2_conf_lo	q2_conf_hi	q2_status
    TSS16815	XLOC_013512	Trib3	chr2:152163156-152169796	q1	q2	OK	0.00575202	14.359	11.2856	-2.70945	0.0067394	0.0480599	yes
    Last edited by epi; 03-01-2012, 08:00 AM.

  • #2
    To all Cufflinks users and author, can anyone please comment if I have stated my problem clearly.

    Unfortunately it may be a serious bug in this software, which is causing it to mix FPKMs between different genes. Not only it obviously makes cufflinks highly unreliable, but a puts a question mark over the published findings.

    Thanks for your attention and replies.

    Comment


    • #3
      Hi,

      The option "-g" that you are using is usually used to discover new transcripts. So the CUFF.* names that you got there are for those newly found transcripts.
      If you only want to do it on known transcripts, use the "-G" option.

      That is, I don't know why CuffMerge gives us certain oId with CUFF.* names.

      Comment


      • #4
        For the tss_id, there is probably a bug, because it's numerated in ascending order but the TSS number as nothing to do with the gene itself.


        ***Update***
        Cuffmerge merge together multiple transcripts.gtf files coming from different analyses. The TSS information is not present in that file. Maybe it doesn't use the information contained in the given reference GTF file and just rename the tss_id...
        Last edited by jcgrenier; 03-09-2012, 12:24 PM.

        Comment


        • #5
          As far as I can tell, Cufflinks is working correctly and as described in the manual. As noted above, you want -G, not -g.

          Comment


          • #6
            There is a nature protocol published several days before, which may help you:
            "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks"

            Comment


            • #7
              Thanks you all for replies, and I apologize if I was panicking more than needed. Perhaps I should explain my question a bit further.

              As I understand, -g adds new transcripts "in addition to" existing ones, which is my goal under this analysis.

              From the manual:
              Output will include all reference transcripts as well as any novel genes and isoforms that are assembled
              While for many other genes, it is using the original gene name, for Trib3, it is converting it to CUFF.9451, and after cuffmerge to CUFF.7092.2. This is the source of all confusion, from manual and published protocol it is not clear why it should do that. After all, the annotation of Trib3 transcript and its exons is identical to CUFF.9451. They why this behaves like a novel gene now.

              From jcgrenier comments, I went back and found that it is indeed correct that transcript_id are incremented in ascending order, which has made a chance identity of IDs between chr2 and chrY, could very well be other cases which I don't know of.

              Actually, apart form a specific case, the real question here is how come cufflinks is not using the transcript_ID, p_ID and in some cases gene_ID from the original GFF as it is supposed to.
              Thanks again for comments.
              Last edited by epi; 03-12-2012, 08:02 AM.

              Comment


              • #8
                Originally posted by epi View Post
                Actually, apart form a specific case, the real question here is how come cufflinks is not using the transcript_ID, p_ID and in some cases gene_ID from the original GFF as it is supposed to.
                Hmm, that's annoying... can you show us how you're running cuffmerge? One limitation of the current version of Cuffmerge is that it doesn't do any of its own ORF prediction. So new isoforms of coding genes won't get assigned a new p_id or attached to an existing one. New genes will not be declared as coding or non-coding (this is a hard problem in general). However, they will get grouped under a tss_id. The thing is, Cuffmerge will reassign tss_ids just like it reassigns transcript_ids, because those also have to be unique. The only thing it really tries to propogate from the reference in terms of metadata is the gene short name.

                Comment


                • #9
                  Originally posted by Cole Trapnell View Post
                  Hmm, that's annoying... can you show us how you're running cuffmerge? One limitation of the current version of Cuffmerge is that it doesn't do any of its own ORF prediction. So new isoforms of coding genes won't get assigned a new p_id or attached to an existing one. New genes will not be declared as coding or non-coding (this is a hard problem in general). However, they will get grouped under a tss_id. The thing is, Cuffmerge will reassign tss_ids just like it reassigns transcript_ids, because those also have to be unique. The only thing it really tries to propogate from the reference in terms of metadata is the gene short name.
                  Hi Cole, Thanks again for replying. While your comment clarifies some things, it also raises some more question. First here is my command

                  Code:
                  cuffmerge -o test1 -g genes.gtf -p 10 -s ~/path/mm9 assemblies1.txt
                  Assemblies1.txt contains absolute path to cufflinks gtf
                  Code:
                  path-to-dir/transcripts_sample1.gtf
                  path-to-dir/transcripts_sample2.gtf
                  .
                  .
                  .
                  The behavior you described is what I am observing as well. As I mentioned earlier, I am using -g and not -G to allow known as well as novel genes. Should I not expect it to preserve transcript, TSS and pids for the known genes.

                  Also I am still unclear why there are new predictions (with CUFF. IDs) that are identical to existing annotations. I have manually looked at one of the test runs and all new predictions (with CUFF. IDs) were original genes/transcripts in the source GTF annotation (I must have looked only about 20-25 ). So far I have not encountered any novel predictions which do not correspond to existing annotation. In some cases, original annotation still exists and in some cases not. Similar to the example about Trib3 I posted above.

                  Comment


                  • #10
                    Has anyone figured this issue out? I'm running a small study with cancer samples and I don't even see TP53 annotated in the genes.fpkm_tracking using either the Illumina-provided UCSC or Ensembl files (with chromosome names edited for compatibility). I'm running using the "-g" option as well and I was expecting that existing genes observed by cufflinks would have entries with the annotated gene names.
                    Last edited by Lee Sam; 04-05-2012, 10:05 AM.

                    Comment


                    • #11
                      Hi, I also encountered the same issue. The ./cufflinks also gave me some *.CUFF id's.
                      I used -g option, as I am interested in novel transcripts. How can you get all novel transcripts or genes?

                      Comment


                      • #12
                        Thanks for commenting here, it is good to know I am not the only one in the this boat. I have pinged Cole again, hope he writes something here.

                        Comment


                        • #13
                          Hi all,
                          Epi, did you find a workaround ? I ran into the same problem. Running cufflinks with -g options, some original gene_ids were converted to new identifiers sharing the CUFF prefix.
                          Il2ra (NM_008367) became CUFF.265243...
                          Regards

                          Comment


                          • #14
                            Hi, does anyone have an answer to this question?

                            I am fairly new to this game, but I am running a small RNA-seq study on cancer samples.
                            I have mapped the reads using tophat, and quantified transcript/gene expression using cufflinks and the -g option with ENSEMBL annotation .gtf file.

                            However, I also don't find many of the expected expressed gene IDs in my genes.fpkm_tracking file, like TP53 etc. But when i do a manual inspection of the assembly using IGV, I clearly see that these genes are highly covered by reads....

                            Also, in my isoforms.fpkm_tracking file I can find some of the transcript Ids for these genes, but as i said not their gene IDs..

                            Can anyone explain this/give a solution to my problem?

                            Comment


                            • #15
                              Unfortunately, cuffdiff does not seem to carry over the original transcript ids from the merged.gtf file in to the output files. However, you can either use Excel (vlookup) or a perlscript or awk to populate an additional field in your 'diff' output files by searching merged.gtf. The original transcript ids are listed as 'oId' in the merged.gtf created by cuffmerge.

                              I will post a script if I find time to get one done.

                              hope this helps!

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X