  • Cufflinks outputs duplicate GTF entries?

    Hi all-

    First off, thank you Cole for Cufflinks. It will be very useful for the type of RPKM analysis I'd like to do.

    I'm having a bit of trouble with the GTF output files when I plug them into cuffcompare. I run cufflinks using a UCSC GTF file for the mouse genome as reference:

    $./cufflinks -G mm9.KnownGene.GTF accepted_hits.sam

    This works fine and outputs the appropriate files. However, when I plug two of the output GTFs into cuffcompare:

    $./cuffcompare -r mm9_KnownGene.GTF -V -o stats.txt Sample1.gtf Sample2.GTF

    I get the following error:
    Loading reference transcripts..
    64 duplicate reference transcripts discarded.
    ..ref data loaded
    Processing file: Sample1.gtf
    Loading transcripts from Sample1.gtf..
    Error: duplicate GFF ID 'uc007aji.1' encountered!

    Taking a look at the Sample1.gtf, it does in fact look like there are duplicate entries. Example:

    chr1 Cufflinks transcript 15795739 15833510 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15795739 15796038 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "1"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15797722 15797817 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "2"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15803122 15803243 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "3"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15809015 15809164 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "4"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15809732 15809844 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "5"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15821588 15821647 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "6"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15823482 15823570 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "7"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15828263 15828369 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "8"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks exon 15833002 15833510 1000 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "9"; RPKM "1.2249777386"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "1.781371";
    chr1 Cufflinks transcript 15795739 15833510 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15795739 15796038 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "1"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15797722 15797817 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "2"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15803122 15803243 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "3"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15809015 15809164 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "4"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15809732 15809844 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "5"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15821588 15821647 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "6"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15823482 15823570 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "7"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15828263 15828369 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "8"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";
    chr1 Cufflinks exon 15833002 15833510 154 + . gene_id "uc007aji.1"; transcript_id "uc007aji.1"; exon_number "9"; RPKM "0.1240106434"; frac "0.116576"; conf_lo "0.999925"; conf_hi "1.369358"; cov "0.180337";

    This is not a unique occurrence and seems to appear throughout the GTF file.

    It seems as though these 2 transcripts are identical in every way...why is Cufflinks outputting them twice?
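For anyone who wants to check how widespread this is in their own output, here is a minimal Python sketch. It assumes a tab-separated GTF in which every transcript has one `transcript` record carrying a `transcript_id` attribute, as in the excerpt above; the function name `duplicate_transcript_ids` is made up for illustration and is not part of Cufflinks.

```python
import re
from collections import Counter

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def duplicate_transcript_ids(gtf_lines):
    """Return {transcript_id: count} for IDs with more than one 'transcript' record."""
    counts = Counter()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines, and anything that is not a transcript record.
        if line.startswith("#") or len(fields) != 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return {tid: n for tid, n in counts.items() if n > 1}
```

Passing an open file handle (for example `duplicate_transcript_ids(open("Sample1.gtf"))`) should list every ID that cuffcompare would reject as a duplicate.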

    Any assistance would be much appreciated.

    Thanks!

  • #2
    me too

    Just wanted to say that the same thing happened to me.



    • #3
      ditto

      Same issue. Did anyone figure out a fix?

      any reason I couldn't just write a script to cull the redundant entries?

      Thanks-



      • #4
        That would probably work, but which RPKM is right and which would you delete?
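A culling script along those lines could look like the sketch below. It does not answer the question of which RPKM is right: it simply keeps the first block seen for each transcript_id and drops later ones, which is an arbitrary choice. It also assumes that each `transcript` record is immediately followed by its own exon records, as in the excerpt in the first post. The function name `drop_repeated_transcripts` is hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def drop_repeated_transcripts(gtf_lines):
    """Keep only the first block (transcript + exons) seen for each transcript_id."""
    seen = set()      # transcript IDs whose first block has already been kept
    dropping = None   # ID of the repeated block currently being skipped
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        match = None
        if not line.startswith("#") and len(fields) == 9:
            match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            kept.append(line)  # comments and unparseable lines pass through
            continue
        tid = match.group(1)
        if fields[2] == "transcript":
            # A second 'transcript' record for a known ID starts a block to drop.
            dropping = tid if tid in seen else None
            seen.add(tid)
        if tid != dropping:
            kept.append(line)
    return kept
```

Keeping the first block is only a workaround to get cuffcompare running; the expression values of the affected transcripts should still be treated with caution.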



        • #5
          workable...

          I'm working with Drosophila, and it turns out there are only 2 redundant entries... didn't even need to write a script, I just deleted them by hand.

          Not sure which RPKM is right, but I can live with 2 genes being a little wonky, especially because I can always double check by eye if they're something important. Sorry it doesn't help the mammalian folks, but at least CuffCompare can be made to work, if necessary.



          • #6
            Oh, in that case, no biggie...

            The mouse situation is a bit more complicated. I think I'm just going to not use cuffcompare for now.



            • #7
              Dear all,

              I am just encountering a very similar problem. I ran Cufflinks 0.8.3 on several samples, e.g.:

              Code:
              cufflinks --num-threads 8 --max-mle-iterations 10000 --GTF ./Homo_sapiens.GRCh37.58.name=id.gtf --output-dir ./2_cufflinks/PE_NN3 ./1_tophat/PE_NN3/accepted_hits.sam
              Please note that the GTF-file used originates from
              Code:
              ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh37.58.gtf.gz
              I did, however, modify the original GTF file with a script so that every gene_name matches the corresponding gene_id (and likewise for transcripts), to make cufflinks/cuffcompare/cuffdiff stick with IDs throughout all three analysis stages (I don't like gene names here). Finally, I double-checked the resulting GTF file for duplicate entries; there are none.
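The script used for that rewrite was not posted, so the following is only a guess at what such a step might look like: a minimal Python sketch that overwrites `gene_name` with the value of `gene_id` and `transcript_name` with the value of `transcript_id` on each tab-separated GTF line. The function name `names_to_ids` is an assumption made for illustration.

```python
import re


def names_to_ids(gtf_line):
    """Return the GTF line with gene_name/transcript_name set to the matching IDs."""
    stripped = gtf_line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 9:
        return stripped  # comments and malformed lines are left untouched
    attrs = fields[8]
    for id_key, name_key in (("gene_id", "gene_name"),
                             ("transcript_id", "transcript_name")):
        found = re.search(r'\b' + id_key + r' "([^"]*)"', attrs)
        if found:
            replacement = '%s "%s"' % (name_key, found.group(1))
            # A function is used as the replacement so the ID is inserted literally.
            attrs = re.sub(r'\b' + name_key + r' "[^"]*"',
                           lambda _m, text=replacement: text, attrs)
    fields[8] = attrs
    return "\t".join(fields)
```

Lines that have no `gene_name` or `transcript_name` attribute come back unchanged, so the step can be applied to every line of the file.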

              Now, when I run cuffcompare, I receive the following error message:


              Code:
              cuffcompare -T -R -o ./3_cuffcompare/REPL -r ./Homo_sapiens.GRCh37.58.name=id.gtf ./2_cufflinks/PE_NN3/transcripts.gtf $USSC/2_cufflinks/PE_NN5/transcripts.gtf
              Code:
              Error: duplicate GFF ID 'ENST00000415889' enountered!
              (The actual duplicate ID reported changes with the combination of samples I try to compare.)

              Looking into the "transcripts.expr" files produced by cufflinks, there are indeed duplicate lines (really duplicate!). They match both in transcript_id and in expression values. The corresponding "transcripts.gtf", also produced by cufflinks, contains the whole block for each duplicated transcript twice as well. Again, all letters and digits match, so these are real duplicates.

              I'm not quite sure, but I feel like I didn't have this problem in Cufflinks 0.8.2.

              Any ideas?

              Best,
              Uwe



              • #8
                Dear Winfred,

                I think I have a problem similar to yours.
                Please take a look at this thread and tell me whether it is relevant to your problem.
                Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


                Olivier



                • #9
                  Originally posted by oliviera
                  Dear Winfred,

                  I think I have a problem similar to yours.
                  Please take a look at this thread and tell me whether it is relevant to your problem.
                  Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


                  Olivier
                  Thanks for your quick reply, oliviera.

                  In my case, however, the transcript boundaries do not appear to differ. Admittedly, the RPKM values of the duplicate lines do differ (I claimed they were equal in my previous post). Please find below two excerpts, from "transcripts.expr" and "transcripts.gtf", respectively. Maybe this tells someone what could be wrong with my setup...

                  transcripts.expr:
                  Code:
                  ENST00000513107	408319	5	44388586	44389808	0.02483	1	1	0	0.0744899	0.0442087	1131
                  ENST00000513107	408321	5	44388586	44389808	0.0192059	1	0.670969	0	0.0592762	0.0341952	1131
                  transcripts.gtf:
                  Code:
                  5	Cufflinks	transcript	44388587	44389808	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; FPKM "0.0248299624"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.074490"; cov "0.044209";
                  5	Cufflinks	exon	44388587	44389158	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "1"; FPKM "0.0248299624"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.074490"; cov "0.044209";
                  5	Cufflinks	exon	44389250	44389808	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "2"; FPKM "0.0248299624"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.074490"; cov "0.044209";
                  5	Cufflinks	transcript	44388587	44389808	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; FPKM "0.0192058821"; frac "0.670969"; conf_lo "0.000000"; conf_hi "0.059276"; cov "0.034195";
                  5	Cufflinks	exon	44388587	44389158	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "1"; FPKM "0.0192058821"; frac "0.670969"; conf_lo "0.000000"; conf_hi "0.059276"; cov "0.034195";
                  5	Cufflinks	exon	44389250	44389808	1000	-	.	gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "2"; FPKM "0.0192058821"; frac "0.670969"; conf_lo "0.000000"; conf_hi "0.059276"; cov "0.034195";
                  There is definitely (I triple-checked) only a single entry for transcript "ENST00000513107" in the source GTF file. Below are all GTF lines that contain the transcript_id "ENST00000513107":
                  Code:
                  5	protein_coding	exon	44389250	44389808	.	-	.	 gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "1"; gene_name "ENSG00000070193"; transcript_name "ENST00000513107";
                  5	protein_coding	exon	44388587	44389158	.	-	.	 gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "2"; gene_name "ENSG00000070193"; transcript_name "ENST00000513107";
                  5	protein_coding	CDS	44388587	44388784	.	-	0	 gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "2"; gene_name "ENSG00000070193"; transcript_name "ENST00000513107"; protein_id "ENSP00000426406";
                  5	protein_coding	start_codon	44388782	44388784	.	-	0	 gene_id "ENSG00000070193"; transcript_id "ENST00000513107"; exon_number "2"; gene_name "ENSG00000070193"; transcript_name "ENST00000513107";
                  So I don't think it is a transcript boundary problem, all the more so because, if I remember correctly, this particular problem didn't appear in v0.8.2.
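To see exactly how the duplicated records differ from each other, something like the following minimal Python sketch could help. It groups the `transcript` records of a tab-separated GTF by transcript_id and reports, for each ID that occurs more than once, which attributes (plus the score column) do not agree across the copies. The function name `differing_attributes` and the extra key `score` are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the ninth GTF column.
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def differing_attributes(gtf_lines):
    """Map each duplicated transcript_id to the sorted attribute names that differ."""
    records = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) != 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        attrs["score"] = fields[5]  # sixth column, stored under a made-up key
        if "transcript_id" in attrs:
            records[attrs["transcript_id"]].append(attrs)
    report = {}
    for tid, copies in records.items():
        if len(copies) > 1:
            all_keys = set().union(*copies)  # every attribute name seen in any copy
            report[tid] = sorted(
                key for key in all_keys
                if len({copy.get(key) for copy in copies}) > 1
            )
    return report
```

An empty list for an ID means the copies are identical in every attribute; a non-empty list narrows down which values the two estimates disagree on.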

                  Any other ideas? Thanks in advance!
                  Uwe



                  • #10
                    Same problem with "duplicate GFF ID".

                    For the moment I have switched back to Cufflinks 0.8.2.



                    • #11
                      I also found the same issue with version 0.8.3 but not 0.8.2. It looks like it's introduced by cufflinks when running with the "-G" option.



                      • #12
                        This looks like a regression in the alignment+GTF record bundling code in Cufflinks. If anybody here can produce a relatively small test set that consistently reproduces this issue, I'll fix it pronto. Sorry for the annoying bug...



                        • #13
                          Originally posted by Cole Trapnell
                          This looks like a regression in the alignment+GTF record bundling code in Cufflinks. If anybody here can produce a relatively small test set that consistently reproduces this issue, I'll fix it pronto. Sorry for the annoying bug...
                          Hi Cole,

                          First of all, thanks for your engagement! By the way, I don't think it is necessary to apologize for being in beta. IMHO the Trapnell tool suite is one of the most complete and consistent ones out there, so thanks for that!

                          I've prepared a dataset that could serve as a test scenario for the issue discussed above. Hopefully it is somewhat useful, even though it's not that "relatively small" ;-) Please find the accepted_hits.sam and the GTF file I used in the following download (1.3 GB).
                          Code:
                          http://www.gtsg.org/_/COLE.tar.gz
                          With the following command line (Cufflinks v0.8.3):
                          Code:
                          cufflinks --num-threads 8 --max-mle-iterations 10000 --GTF ./Homo_sapiens.GRCh37.58.name=id.gtf --output-dir ./ ./accepted_hits.sam
                          two duplicate transcript entries should appear in transcripts.gtf as well as in transcripts.expr:
                          Code:
                          ENST00000320936
                          ENST00000423038
                          Thanks again and best!
                          Uwe



                          • #14
                            I'll second that. A great set of RNA-Seq tools even in their beta form.

                            Thanks Cole.

                            Chris



                            • #15
                              I just posted a pre-release build of Cufflinks 0.8.4 (svn r1370) that addresses this issue. I was able to locally reproduce the bug with Uwe's test set (thanks!), and the build I've posted corrects it (for me, anyways). Please let me know if this crops up again (ideally with another test download to reproduce it).

                              You should know before trying this build that it is an SVN snapshot and hasn't gone through the normal round of release testing, so you may run into other issues.

                              This build also includes additional assembler fixes and direct BAM file support. You can now supply BAM files as input to Cufflinks (SAM is still supported).

                              Please download one of the tarballs with "0.8.4" in the version tag from:
