Cuffdiff error - duplicate GFF ID encountered?

  • Cuffdiff error - duplicate GFF ID encountered?

    Hi all,

    I am very new to all things RNA-seq, so please bear with me if the questions are really basic.
    I am trying to compare two things for differential expression.

    The pipeline I am using is: Tophat -> Cuffdiff
    (with the newest versions of each, TopHat 2.0.4 and Cufflinks 2.0.2)

    I am skipping running Cufflinks separately before Cuffdiff, because I'm not really interested in new gene/transcript discovery.

    The problem is, when I try to run Cuffdiff, it quits with an error saying the reference annotation contains duplicate GFF IDs:

    Code:
    You are using Cufflinks v2.0.2, which is the most recent release.
    [16:22:49] Loading reference annotation.
    Error: duplicate GFF ID 'FBtr0100868' encountered!

    The reference annotation I am using is the GFF downloaded from FlyBase: dmel-all-r5.46.gff.

    However, when I searched this gff file, I didn't see duplicate lines containing this id, FBtr0100868.
    Just to experiment though, I tried removing the lines containing the offending GFF id from the gff file, and running Cuffdiff again to see if it would fix the problem, but then it just had the same error with a different GFF id.
    I tried doing this more times with each duplicate GFF id, but every time it just comes back with the same error and a different GFF id.
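Since the error keeps moving to a new ID, it may help to list every repeated ID in one pass instead of removing them one at a time. The following is a minimal Python sketch, assuming the file is tab-separated GFF3 with `ID=` attributes in column 9; the function name `duplicate_gff3_ids` is made up for illustration. Note that GFF3 allows several lines of one discontinuous feature to share an ID, so a repeated ID is not necessarily an error in the file, and whether Cuffdiff counts duplicates the same way is an assumption.

```python
from collections import Counter


def duplicate_gff3_ids(lines):
    """Return {ID: count} for every ID= value that appears on more than one line."""
    counts = Counter()
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Anything without 9 tab-separated columns (e.g. FASTA sequence) is ignored.
        if len(fields) < 9:
            continue
        for attr in fields[8].split(";"):
            key, _, value = attr.strip().partition("=")
            if key == "ID":
                counts[value] += 1
    return {gff_id: n for gff_id, n in counts.items() if n > 1}


# Example use (file name taken from the post above):
#   with open("dmel-all-r5.46.gff") as handle:
#       print(duplicate_gff3_ids(handle))
```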

    Has anyone else encountered this error using the gff file from Flybase, or anywhere else for that matter? I don't know if I'm doing the right thing by removing the "bad" IDs from the reference annotation either, especially since there seem to be an endless number of them. Is there any other way I should fix the reference annotation? Or would it be easier to just run Cufflinks and use its output gtf, instead of trying to fix the Flybase gff?

    Any help would be very much appreciated!

  • #2
    Same problem!

    Hi all!

    I have the same problem. Has anyone found a solution? Please help us.


    • #3
      Update: I wasn't able to exactly fix the issue, but I was able to get around it:

      So, I realized I couldn't use the Flybase gff in Cuffmerge either (same error), so my idea of possibly using the Cufflinks->Cuffmerge gtf didn't work out.

      However, I was reading through this thread (mostly on the 2nd page) where people were having similar problems:

      http://seqanswers.com/forums/showthread.php?t=3493

      Someone was able to fix theirs by removing the duplicates, sort of like what I was trying to do, since their gff only seemed to have a few problem lines.

      Others mentioned just trying a different gff file, if one was available. Since removing the duplicates by hand wasn't an option, I just tried the gtf from Ensembl instead, and it worked without a problem!

      The Flybase gff I had for some reason worked with an older version of Cufflinks, so I guess that's why I didn't think of trying a different gff/gtf before.
      Last edited by drosoform; 08-24-2012, 07:20 PM. Reason: typo


      • #4
        I had the same problem running Cufflinks v2.0 on dmel release 5.49.
        To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.
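The Perl script itself was not posted, so the following is only a Python sketch of the same idea, not the poster's code. It is not clear from the description whether the "gene" lines were extracted or discarded, so this hypothetical `split_by_feature` helper returns both groups. It assumes tab-separated GFF/GTF with the feature type in column 3.

```python
def split_by_feature(lines, feature="gene"):
    """Split lines into (lines whose 3rd column equals `feature`, all other lines)."""
    matching, rest = [], []
    for line in lines:
        fields = line.split("\t")
        is_record = not line.startswith("#") and len(fields) >= 3
        if is_record and fields[2] == feature:
            matching.append(line)
        else:
            # Comments, headers and every other feature type end up here.
            rest.append(line)
    return matching, rest


# Example use (file names are placeholders):
#   with open("annotation.gff") as handle:
#       gene_lines, other_lines = split_by_feature(handle, "gene")
#   with open("annotation.no_gene.gff", "w") as out:
#       out.writelines(other_lines)
```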

        There is no GTF file from Ensembl that corresponds to dmel release 5.49 from flybase.


        • #5
          One possible solution

          Dear All,

          I am using Cufflinks v2.1.1 and encountered the same error upon running CuffDiff with a mask file and the gencode v15 annotation ("Error: duplicate GFF ID 'ENST00000389680.2' encountered!"). (Might be interesting to note that when NOT using a mask file I did not get an error at this point at all, but CuffDiff got stuck at a locus for > 12 hours). However, there was no duplicated ID:

          Code:
          $ grep ENST00000389680.2 gencode.v15.annotation.gtf
          chrM	ENSEMBL	transcript	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; level 3; tag "basic";
          chrM	ENSEMBL	exon	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; exon_number 1;  level 3; tag "basic";
          As you can see, these entries are for a transcript and for an exon, not duplicates at all. Yet it did get me thinking: Cufflinks only really uses CDS and exon entries, so the existence of all the other feature types ('gene', 'transcript', etc.) in the third column of my GTF might be the cause of the hassle. So I removed all entries in the GTF except the 'exon' and 'CDS' lines and voila! Now it is working.

          If there are in fact no duplicated entries in your annotation file, I would suggest trying this approach.
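The filtering step described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the exact procedure used in the post: the function name `filter_gtf` is hypothetical, the input is assumed to be tab-separated GTF, and keeping `#` header lines is a choice made here.

```python
def filter_gtf(in_path, out_path, keep=("exon", "CDS")):
    """Copy only records whose 3rd column is in `keep`; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                # Header/comment lines are passed through unchanged.
                dst.write(line)
                continue
            fields = line.split("\t")
            if len(fields) >= 3 and fields[2] in keep:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Example use (output file name is a placeholder):
#   kept, dropped = filter_gtf("gencode.v15.annotation.gtf",
#                              "gencode.v15.exon_cds.gtf")
```

Returning the two counts makes it easy to see how much of the annotation the filter throws away before running Cuffdiff on the result.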

          Best,
          Boel


          • #6
            Dear all,

            I ran into the same problem with Cufflinks 2.1.1, and I do find duplicates in the GTF file produced by Cufflinks:
            Error: duplicate GFF ID 'SL1sc04444 | LOCATED IN chloroplast chloroplast inner encountered!

            I traced it back to the transcripts file and found the duplicates below:

            scaffold3562 Cufflinks transcript 123676 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
            scaffold3562 Cufflinks exon 123676 138285 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
            scaffold3562 Cufflinks exon 139443 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

            scaffold3562 Cufflinks transcript 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
            scaffold3562 Cufflinks exon 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

            The coordinates of the second transcript are exactly the same as in the annotation file, but the first one is extended at the 5' end, although the 3' end is the same. It should belong to another transcript of the same gene, right? Why does Cufflinks still give it the same transcript ID and cause this duplicate ID error?

            Does anyone have a suggestion for how to fix it?
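To see how many transcript IDs are affected in a Cufflinks output file, one option is to list every `transcript_id` that appears on more than one `transcript` line. This is a minimal Python sketch, assuming a tab-separated GTF with quoted `transcript_id "..."` attributes; the helper name `repeated_transcript_records` is made up. It only reports the repeats and does not decide how they should be renamed or merged.

```python
import re
from collections import defaultdict

# Matches the quoted value of the transcript_id attribute in column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def repeated_transcript_records(lines):
    """Map each transcript_id seen on 2+ 'transcript' lines to its coordinates."""
    records = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # Store (seqname, start, end, strand) for each occurrence.
            records[match.group(1)].append(
                (fields[0], int(fields[3]), int(fields[4]), fields[6])
            )
    return {tid: recs for tid, recs in records.items() if len(recs) > 1}


# Example use:
#   with open("transcripts.gtf") as handle:
#       for tid, recs in repeated_transcript_records(handle).items():
#           print(tid, recs)
```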

            Best,
            Zhihao


            • #7
              it works

              Originally posted by Boel
              Dear All,

              I am using Cufflinks v2.1.1 and encountered the same error upon running CuffDiff with a mask file and the gencode v15 annotation ("Error: duplicate GFF ID 'ENST00000389680.2' encountered!"). (Might be interesting to note that when NOT using a mask file I did not get an error at this point at all, but CuffDiff got stuck at a locus for > 12 hours). However, there was no duplicated ID:

              Code:
              $ grep ENST00000389680.2 gencode.v15.annotation.gtf
              chrM	ENSEMBL	transcript	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; level 3; tag "basic";
              chrM	ENSEMBL	exon	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; exon_number 1;  level 3; tag "basic";
              As you can see, these entries are for a transcript and for an exon, not duplicates at all. Yet it did get me thinking: Cufflinks only really uses CDS and exon entries, so the existence of all the other feature types ('gene', 'transcript', etc.) in the third column of my GTF might be the cause of the hassle. So I removed all entries in the GTF except the 'exon' and 'CDS' lines and voila! Now it is working.

              If there are in fact no duplicated entries in your annotation file, I would suggest trying this approach.

              Best,
              Boel

              Thanks for posting the solution. I had the same issue of duplicated GFF IDs when I ran cuffmerge. It started to work after I ran " awk '($3 == "exon" || $3 == "CDS")' " on all the input GTF files (both the GTF files from Cufflinks and the reference).


              • #8
                Originally posted by kwatts59
                I had the same problem running Cufflinks v2.0 on dmel release 5.49.
                To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.

                Hi, I am having this problem as well, but sadly I have no idea how to write Perl scripts. I was wondering if you'd be willing to share your Perl script with the rest of us who may be having this problem?

                Thanks


                • #9
                  Cuffdiff error-- duplicate GFF IDs & using VIM to edit transcript files

                  You can use the vi or vim editor (default on most unix systems) to edit the gtf files.
                  >vim transcript.gtf
                  :g/dup/d
                  This is a global command that finds every line containing "dup" and deletes the entire line.
                  Similarly, you can remove the lines containing "gene":
                  :g/gene/d

                  To save the changes:
                  :wq

                  Search online for vim commands to help you if you get stuck.

                  I edited the reference gene transcript file to remove lines with "gene".
                  After running cufflinks, I edited the transcript.gtf file to remove any lines with "dup" in them.
                  Cuffmerge ran quite happily after that.
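One thing to check before running `:g/gene/d`: it deletes every line that contains the text "gene" anywhere, and in a GTF that includes every record carrying a `gene_id` attribute. The Python sketch below previews which lines a plain-text match would hit and compares that with matching only the feature type in column 3. Both helper names are hypothetical, the input is assumed to be tab-separated, and the text match is a literal substring test, which behaves like the vim pattern only for plain words such as "gene" or "dup".

```python
def lines_containing(lines, text):
    """Lines that a literal ':g/text/d' would delete (substring match anywhere)."""
    return [line for line in lines if text in line]


def lines_with_feature(lines, feature):
    """Lines whose 3rd tab-separated column is exactly `feature`."""
    return [
        line
        for line in lines
        if not line.startswith("#") and line.split("\t")[2:3] == [feature]
    ]


# Example use:
#   with open("transcript.gtf") as handle:
#       lines = handle.readlines()
#   print(len(lines_containing(lines, "gene")), "lines contain 'gene'")
#   print(len(lines_with_feature(lines, "gene")), "lines are 'gene' features")
```

If the two counts differ a lot, filtering on column 3 is probably the safer of the two.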


                  • #10
                    Duplicate GFF ID: a possible solution

                    Hi,

                    I am using Cufflinks v2.1.1 with the Ensembl human genome annotation GTF and was getting the same duplicate GFF ID error. Following the previously posted solutions, I tried the following:

                    (1) awk '($3 == "exon" || $3 == "CDS")'
                    Although it worked, the resulting GTF loses 1/4 of the annotation lines (mostly UTR), and that might affect the transcript assembly in an unknown (minor or major?) way.

                    (2) Tried the iGenomes GTF and still got the same error.

                    When I tried to locate the problem in the GTF itself, the duplicate entries seemed to be associated with transcripts that have "Selenocysteine" annotation lines (114 lines in recent annotation GTFs from both Ensembl and iGenomes). Once I removed these 114 lines from the GTF file using
                    "awk '!/Selenocysteine/' Homo_sapiens.GRCh38.76.gtf >Homo_sapiens.GRCh38.76.gtf_seleno_filtered"
                    it worked without any error and without losing too much information from the annotation GTF.

                    Best
                    Dhirendra
                    Last edited by dhir_kumar; 11-07-2016, 07:55 AM.
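For anyone who prefers a script to the awk one-liner in the post above, the same filter could be sketched in Python as follows. The helper name `drop_lines_containing` is hypothetical, and like the awk command it matches the text anywhere on the line. As an addition not in the original post, it also collects the `transcript_id` values on the removed lines, assuming the usual quoted GTF attribute syntax, so you can see which transcripts were affected.

```python
import re

# Matches the quoted value of the transcript_id attribute.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def drop_lines_containing(lines, text="Selenocysteine"):
    """Return (kept_lines, removed_count, transcript_ids_on_removed_lines)."""
    kept = []
    affected = set()
    removed = 0
    for line in lines:
        if text in line:
            removed += 1
            match = TRANSCRIPT_ID.search(line)
            if match:
                affected.add(match.group(1))
        else:
            kept.append(line)
    return kept, removed, affected


# Example use (file names taken from the post above):
#   with open("Homo_sapiens.GRCh38.76.gtf") as handle:
#       kept, removed, affected = drop_lines_containing(handle)
#   with open("Homo_sapiens.GRCh38.76.gtf_seleno_filtered", "w") as out:
#       out.writelines(kept)
#   print(removed, "lines removed from", len(affected), "transcripts")
```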


                    • #11
                      Dhirendra's solution works

                      So +1 to Dhirendra.

                      Kind regards


                      • #12
                        Dear All,

                        I was having this issue while running "cuffmerge" on assemblies built with Cufflinks 2.1.1.

                        I checked whether the duplicated entry existed in my reference GTF, but it did not. Later I found out that the problem with duplicated entries was not in the GENCODE GTF file I was using as the reference, but in the "transcripts.gtf" file created during the Cufflinks step.

                        Updating Cufflinks to the newer version 2.2.1 and re-running the Cufflinks step resolved the issue.

                        Hope that helps.
                        Good luck
