  • amolkolte
    replied
    Dear All,

    I had this issue while running "cuffmerge" on assemblies built with Cufflinks 2.1.1.

    I checked my reference GTF for the duplicated entry, but there wasn't one. Later I found that the duplicated entries were not in the GENCODE GTF file I was using as reference, but in the "transcripts.gtf" file created during the cufflinks step.

    Updating Cufflinks to the newer version 2.2.1 and re-running the cufflinks step resolved the issue.

    Hope that helps.
    Good luck


  • Kristoffer Vitting-Seerup
    replied
    Dhirendra's solution works

    so +1 to Dhirendra.

    Kind regards


  • dhir_kumar
    replied
    Duplicate GFF ID: a possible solution

    Hi,

    I am using Cufflinks v2.1.1 with the Ensembl human genome annotation GTF and was getting the same duplicate GFF ID error. Following the previously posted solutions, I tried the following:

    (1) awk '($3 == "exon" || $3 == "CDS")'
    It worked, but the resulting GTF loses 1/4 of the annotation lines (mostly UTR), and that might affect the transcript assembly in an unknown (minor or major?) way.

    (2) I tried the iGenomes GTF and still got the same error.

    When I tried to locate the problem in the GTF itself, it seemed that the duplicate entries were associated with transcripts having "Selenocysteine" annotation lines (114 lines in the recent annotation GTFs from both Ensembl and iGenomes). Once I got rid of these 114 lines using
    "awk '!/Selenocysteine/' Homo_sapiens.GRCh38.76.gtf >Homo_sapiens.GRCh38.76.gtf_seleno_filtered", it worked without any error and without losing too much information from the annotation GTF.

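    The filter above drops every line that contains the word anywhere. A slightly stricter sketch, assuming "Selenocysteine" sits in the feature (third) column, matches on that column only and counts what it removes. The three-line sample file and its IDs are hypothetical stand-ins for a real GTF:

```shell
# Hypothetical sample standing in for a real Ensembl GTF.
tmp=$(mktemp -d)
printf '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$tmp/sample.gtf"
printf '1\tsrc\tSelenocysteine\t150\t152\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$tmp/sample.gtf"
printf '1\tsrc\tCDS\t100\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n' >> "$tmp/sample.gtf"

# Count the lines whose third column is Selenocysteine before touching anything.
dropped=$(awk -F'\t' '$3 == "Selenocysteine" { n++ } END { print n + 0 }' "$tmp/sample.gtf")

# Write a filtered copy; the original file is left unchanged.
awk -F'\t' '$3 != "Selenocysteine"' "$tmp/sample.gtf" > "$tmp/sample.seleno_filtered.gtf"
kept=$(awk 'END { print NR }' "$tmp/sample.seleno_filtered.gtf")

echo "dropped=$dropped kept=$kept"
```

    On a real annotation, replace the sample path with the GTF file name; header lines starting with "#" have no third column and are kept.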
    Best
    Dhirendra
    Last edited by dhir_kumar; 11-07-2016, 07:55 AM.


  • MDonlin
    replied
    Cuffdiff error-- duplicate GFF IDs & using VIM to edit transcript files

    You can use the vi or vim editor (available by default on most Unix systems) to edit the GTF files.
    >vim transcript.gtf
    :g/dup/d
    This is a global command that finds every line containing "dup" and deletes the entire line.
    Similarly, you can remove the lines containing "gene":
    :g/gene/d

    To save the changes:
    :wq

    Search online for vim commands to help you if you get stuck.

    I edited the reference gene transcript file to remove lines with "gene".
    After running cufflinks, I edited the transcript.gtf file to remove any lines with "dup" in them.
    Cuffmerge ran quite happily after that.
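
    For anyone who prefers not to edit interactively, the same deletion can be sketched with grep. This is only an illustration: the two-line sample and the "_dup1" suffix are hypothetical, and a bare pattern matches anywhere on the line, so it is worth counting the matches before deleting anything:

```shell
tmp=$(mktemp -d)
# Hypothetical Cufflinks-style output; the second ID carries a made-up "dup" tag.
printf 'chr1\tCufflinks\texon\t1\t50\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";\n' > "$tmp/transcripts.gtf"
printf 'chr1\tCufflinks\texon\t1\t50\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1_dup1";\n' >> "$tmp/transcripts.gtf"

# How many lines would be deleted?
matches=$(grep -c 'dup' "$tmp/transcripts.gtf")

# Keep a backup, then write every line that does NOT contain "dup".
cp "$tmp/transcripts.gtf" "$tmp/transcripts.gtf.bak"
grep -v 'dup' "$tmp/transcripts.gtf.bak" > "$tmp/transcripts.nodup.gtf"

# grep -c exits non-zero when nothing matches, hence the "|| true".
remaining=$(grep -c 'dup' "$tmp/transcripts.nodup.gtf" || true)
echo "matched=$matches remaining=$remaining"
```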


  • sugo
    replied
    Originally posted by kwatts59
    I had the same problem running Cufflinks v2.0 on dmel release 5.49.
    To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.
    Hi, I am having this problem as well, but sadly have no idea how to write PERL scripts. I was wondering if you'd be willing to share your PERL script with the rest of us who may be having this problem?

    Thanks


  • cylsae
    replied
    it works

    Originally posted by Boel
    Dear All,

    I am using Cufflinks v2.1.1 and encountered the same error upon running CuffDiff with a mask file and the gencode v15 annotation ("Error: duplicate GFF ID 'ENST00000389680.2' encountered!"). (Might be interesting to note that when NOT using a mask file I did not get an error at this point at all, but CuffDiff got stuck at a locus for > 12 hours). However, there was no duplicated ID:

    Code:
    $ grep ENST00000389680.2 gencode.v15.annotation.gtf
    chrM	ENSEMBL	transcript	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; level 3; tag "basic";
    chrM	ENSEMBL	exon	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; exon_number 1;  level 3; tag "basic";
    As you can see, these entries are for a transcript and for an exon, not duplicates at all. Yet it did get me thinking: Cufflinks only really uses CDS and exon entries, so the existence of all the other identifiers ('gene', 'transcript', etc.) in the third column of my GTF might be the cause of the hassle. So I removed all entries in the GTF except the 'exon' and 'CDS' lines and voilà! Now it is working.

    If there are in fact no duplicated entries in your annotation file, I would suggest trying this approach.

    Best,
    Boel

    Thanks for posting the solution. I had the same duplicated GFF ID issue when I ran cuffmerge. It started to work after I ran " awk '($3 == "exon" || $3 == "CDS")' " on all the input GTF files (both the GTF files from cufflinks and the reference).


  • lzhdennisdn
    replied
    Dear all,

    I ran into the same problem with Cufflinks 2.1.1, and I do find duplicates in the GTF file produced by cufflinks:
    Error: duplicate GFF ID 'SL1sc04444 | LOCATED IN chloroplast chloroplast inner encountered!

    I traced it back to the transcripts file and found the duplicates below:

    scaffold3562 Cufflinks transcript 123676 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
    scaffold3562 Cufflinks exon 123676 138285 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
    scaffold3562 Cufflinks exon 139443 142383 1 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

    scaffold3562 Cufflinks transcript 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner
    scaffold3562 Cufflinks exon 139443 142383 1000 - . gene_id "CUFF.12474"; transcript_id "SL1sc04444 | LOCATED IN chloroplast chloroplast inner

    The coordinates of the second transcript are exactly the same as in the annotation file, but the first one is extended from the 5' end. Although the 3' end is the same, it should belong to another transcript of the same gene, right? Why does Cufflinks still group it under the same transcript ID and cause this duplicate ID error?

    Does anyone have a suggestion for how to fix it?
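
    To see how widespread this is in a transcripts file, one option is to list every transcript_id that appears on more than one "transcript" line. A minimal sketch, assuming tab-separated GTF with the attributes in column 9; the sample records and the IDs TX_A and TX_B are hypothetical:

```shell
tmp=$(mktemp -d)
# Hypothetical sample: two separate transcript records share the ID "TX_A".
printf 's1\tCufflinks\ttranscript\t10\t90\t1\t-\t.\tgene_id "CUFF.1"; transcript_id "TX_A";\n' > "$tmp/transcripts.gtf"
printf 's1\tCufflinks\texon\t10\t90\t1\t-\t.\tgene_id "CUFF.1"; transcript_id "TX_A";\n' >> "$tmp/transcripts.gtf"
printf 's1\tCufflinks\ttranscript\t50\t90\t1000\t-\t.\tgene_id "CUFF.1"; transcript_id "TX_A";\n' >> "$tmp/transcripts.gtf"
printf 's1\tCufflinks\ttranscript\t200\t300\t1\t+\t.\tgene_id "CUFF.2"; transcript_id "TX_B";\n' >> "$tmp/transcripts.gtf"

# Count transcript-level records per transcript_id and print IDs seen more than once.
dups=$(awk -F'\t' '$3 == "transcript" {
    if (match($9, /transcript_id "[^"]*"/)) {
        # Skip the 15 characters of: transcript_id "  and drop the closing quote.
        id = substr($9, RSTART + 15, RLENGTH - 16)
        count[id]++
    }
}
END { for (id in count) if (count[id] > 1) print id }' "$tmp/transcripts.gtf" | sort)

echo "$dups"
```

    Exon lines are ignored on purpose, since several exons of one transcript legitimately share its transcript_id.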

    Best,
    Zhihao


  • Boel
    replied
    One possible solution

    Dear All,

    I am using Cufflinks v2.1.1 and encountered the same error upon running CuffDiff with a mask file and the gencode v15 annotation ("Error: duplicate GFF ID 'ENST00000389680.2' encountered!"). (Might be interesting to note that when NOT using a mask file I did not get an error at this point at all, but CuffDiff got stuck at a locus for > 12 hours). However, there was no duplicated ID:

    Code:
    $ grep ENST00000389680.2 gencode.v15.annotation.gtf
    chrM	ENSEMBL	transcript	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; level 3; tag "basic";
    chrM	ENSEMBL	exon	648	1601	.	+	.	gene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2"; gene_type "Mt_rRNA"; gene_status "KNOWN"; gene_name "J01415.23"; transcript_type "Mt_rRNA"; transcript_status "KNOWN"; transcript_name "J01415.23-201"; exon_number 1;  level 3; tag "basic";
    As you can see, these entries are for a transcript and for an exon, not duplicates at all. Yet it did get me thinking: Cufflinks only really uses CDS and exon entries, so the existence of all the other identifiers ('gene', 'transcript', etc.) in the third column of my GTF might be the cause of the hassle. So I removed all entries in the GTF except the 'exon' and 'CDS' lines and voilà! Now it is working.

    If there are in fact no duplicated entries in your annotation file, I would suggest trying this approach.
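
    For reference, the exon/CDS filter described above can be sketched as a single awk call. The file names are placeholders, and the two sample lines are shortened versions of the grep output shown earlier in this post:

```shell
tmp=$(mktemp -d)
# Shortened sample: one transcript line and one exon line for the same ID.
printf 'chrM\tENSEMBL\ttranscript\t648\t1601\t.\t+\t.\tgene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2";\n' > "$tmp/annotation.gtf"
printf 'chrM\tENSEMBL\texon\t648\t1601\t.\t+\t.\tgene_id "ENSG00000211459.2"; transcript_id "ENST00000389680.2";\n' >> "$tmp/annotation.gtf"

# Keep only lines whose third (feature) column is exon or CDS.
awk -F'\t' '$3 == "exon" || $3 == "CDS"' "$tmp/annotation.gtf" > "$tmp/annotation.exon_cds.gtf"

# Sanity check: list the feature types that survived the filter.
features=$(cut -f3 "$tmp/annotation.exon_cds.gtf" | sort -u | tr '\n' ' ')
echo "features kept: $features"
```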

    Best,
    Boel


  • kwatts59
    replied
    I had the same problem running Cufflinks v2.0 on dmel release 5.49.
    To get around the problem, I wrote a PERL script to pull out the lines from the GFF file containing the word "gene" in the third column. That seemed to fix the problem.

    There is no GTF file from Ensembl that corresponds to dmel release 5.49 from flybase.
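
    The Perl script itself was not posted. As a rough stand-in, a third-column match can be sketched in awk. "Pull out" could mean either extracting or removing the "gene" lines, so this sketch writes both variants; the sample GFF lines and their IDs are hypothetical:

```shell
tmp=$(mktemp -d)
# Hypothetical FlyBase-style GFF sample.
printf '2L\tFlyBase\tgene\t100\t900\t.\t+\t.\tID=FBgn0000001\n' > "$tmp/sample.gff"
printf '2L\tFlyBase\tmRNA\t100\t900\t.\t+\t.\tID=FBtr0000001;Parent=FBgn0000001\n' >> "$tmp/sample.gff"
printf '2L\tFlyBase\texon\t100\t400\t.\t+\t.\tParent=FBtr0000001\n' >> "$tmp/sample.gff"

# Variant 1: only the lines with "gene" in the third column.
awk -F'\t' -v feat="gene" '$3 == feat' "$tmp/sample.gff" > "$tmp/only_gene.gff"

# Variant 2: everything except those lines.
awk -F'\t' -v feat="gene" '$3 != feat' "$tmp/sample.gff" > "$tmp/without_gene.gff"

only=$(awk 'END { print NR }' "$tmp/only_gene.gff")
without=$(awk 'END { print NR }' "$tmp/without_gene.gff")
echo "only_gene=$only without_gene=$without"
```

    Because the comparison is on the whole third column, feature types such as "pseudogene" are not matched by feat="gene".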


  • drosoform
    replied
    Update: I wasn't able to exactly fix the issue, but I was able to get around it:

    So, I realized I couldn't use the Flybase gff in Cuffmerge either (same error), so my idea of possibly using the Cufflinks->Cuffmerge gtf didn't work out.

    However, I was reading through another thread (mostly its 2nd page) where people were having similar problems.

    Someone was able to fix theirs by removing the duplicates, sort of like what I was trying to do, since their gff only seemed to have a few problem lines.

    Others mentioned just trying a different gff file, if one was available. Since removing the duplicates by hand wasn't an option, I just tried the gtf from Ensembl instead, and it worked without a problem!

    The Flybase gff I had for some reason worked with an older version of Cufflinks, so I guess that's why I didn't think of trying a different gff/gtf before.
    Last edited by drosoform; 08-24-2012, 07:20 PM. Reason: typo


  • mticlla
    replied
    Same problem!

    Hi all!

    I have the same problem. Has anyone found a solution? Please help us.


  • drosoform
    started a topic Cuffdiff error - duplicate GFF ID encountered?

    Hi all,

    I am very new to all things RNA-seq, so please bear with me if the questions are really basic.
    I am trying to compare two things for differential expression.

    The pipeline I am using is: TopHat -> Cuffdiff
    (with the newest versions of each, TopHat 2.0.4 and Cufflinks 2.0.2)

    I am skipping running Cufflinks separately before Cuffdiff, because I'm not really interested in new gene/transcript discovery.

    The problem is, when I try to run Cuffdiff, it quits with an error saying the reference annotation contains duplicate GFF IDs:

    Code:
    You are using Cufflinks v2.0.2, which is the most recent release.
    [16:22:49] Loading reference annotation.
    Error: duplicate GFF ID 'FBtr0100868' encountered!
    The reference annotation I am using is the gff downloaded from Flybase: dmel-all-r5.46.gff.

    However, when I searched this gff file, I didn't see duplicate lines containing this id, FBtr0100868.
    Just to experiment though, I tried removing the lines containing the offending GFF id from the gff file, and running Cuffdiff again to see if it would fix the problem, but then it just had the same error with a different GFF id.
    I tried doing this more times with each duplicate GFF id, but every time it just comes back with the same error and a different GFF id.
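
    Rather than chasing one ID at a time, all repeated IDs can be listed in one pass. A minimal sketch, assuming a GFF3 file with ID=... in the attribute column; whether this corresponds to the check Cuffdiff performs is an assumption, and the sample lines and IDs are hypothetical:

```shell
tmp=$(mktemp -d)
# Hypothetical sample: the same ID is used by two mRNA records.
printf '2L\tFlyBase\tmRNA\t100\t900\t.\t+\t.\tID=FBtr0000001;Parent=FBgn0000001\n' > "$tmp/sample.gff"
printf '2L\tFlyBase\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=FBtr0000001\n' >> "$tmp/sample.gff"
printf '3R\tFlyBase\tmRNA\t500\t700\t.\t-\t.\tID=FBtr0000001;Parent=FBgn0000002\n' >> "$tmp/sample.gff"

# Pull the ID attribute from every feature line, then print IDs seen more than once.
dup_ids=$(awk -F'\t' '!/^#/ && NF >= 9 {
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
}' "$tmp/sample.gff" | sort | uniq -d)

echo "$dup_ids"
```

    GFF3 allows a multi-part feature to repeat its ID across lines, so hits from this check are leads to inspect rather than proof of an error.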

    Has anyone else encountered this error using the gff file from Flybase, or anywhere else for that matter? I don't know if I'm doing the right thing by removing the "bad" IDs from the reference annotation either, especially since there seem to be an endless number of them. Is there any other way I should fix the reference annotation? Or would it be easier to just run Cufflinks and use its output gtf, instead of trying to fix the Flybase gff?

    Any help would be very much appreciated!
