  • middlemale
    Member
    • Feb 2010
    • 16

    cufflinks errors of duplicates

    Hi,

    I have searched for a while for a solution to my problem running cufflinks, but have found no answer yet.

    I ran tophat + bowtie on RNA-seq data (single-end reads) and got a wild-type .sam file plus a treated .sam file. tophat was given the -G GFF option; the GFF file was converted from the Danio rerio GTF file downloaded from http://www.ensembl.org/info/data/ftp/index.html.

    Then I tried to run cufflinks with the following command:

    [mMi@devaP Felipa]$ cufflinks -G /home/RNASeq/FishGenome/Danio_rerio_Zv8_57.gtf ./WT_accepted_hits.sam

    Counting hits in map
    Error: duplicate GFF ID 'ENSDART00000099599' (or exons too far apart)!


    #####################

    I cannot find the string 'ENSDART00000099599' in the WT_accepted_hits.sam file, so I wrote a Perl script to look for it in the Danio_rerio_Zv8_57.gtf file:

    mMi@mMi-Ubuntu:/A01 RNA-seq$ perl FindTargetRecord.pl
    18 protein_coding exon 16261480 16262025 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 16261480 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding start_codon 16262023 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding exon 14234408 14234520 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14234408 14234520 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14234169 14234325 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14234169 14234325 . - 1 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14231851 14232003 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14231851 14232003 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14223590 14224135 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14223593 14224135 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding stop_codon 14223590 14223592 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **

    Does this mean I should delete some lines from the reference GTF file?

    #################

    Then I removed the -G option:

    [mMi@devaP Felipa]$ cufflinks WT_accepted_hits.sam

    Now it runs fine and produces the .gtf, gene.expr and transcripts.expr files, but all IDs are cufflinks-assigned IDs (cuffID), not the gene or transcript IDs from the annotation.

    #####################

    Any suggestions for sorting this out?

    cheers
  • gpertea
    Member
    • Jan 2010
    • 21

    #2
    It looks like there is an abnormally large intron there, over 2 Mb long, between the 1st and 2nd exons of that transcript.
    Removing that transcript from your reference annotation file (yes, deleting all lines mentioning ENSDART00000099599) should solve the problem.
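
A minimal sketch of how that removal could be automated when more than one transcript is affected. It assumes a standard tab-separated GTF file with a `transcript_id "..."` attribute in column 9. The function names and the 1 Mb `max_intron` cutoff are illustrative assumptions, not the limit cufflinks actually applies, so the threshold may need adjusting.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def transcripts_with_long_introns(gtf_lines, max_intron=1_000_000):
    """Return IDs of transcripts with a gap between consecutive exons > max_intron.

    max_intron is a hypothetical cutoff chosen for illustration.
    """
    exons = defaultdict(list)  # transcript_id -> [(start, end), ...]
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    bad = set()
    for tid, spans in exons.items():
        spans.sort()  # order exons by genomic start, regardless of strand
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - prev_end - 1 > max_intron:
                bad.add(tid)
                break
    return bad

def drop_transcripts(gtf_lines, bad_ids):
    """Yield every line that does not mention one of the flagged transcripts."""
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in bad_ids:
            continue
        yield line
```

To use it, read the annotation into a list of lines, pass the list to `transcripts_with_long_introns`, then write the output of `drop_transcripts` to a new file and give that file to `-G`. For the records shown above, the gap between exons 1 and 2 of ENSDART00000099599 is about 2.03 Mb, so it would be flagged at this cutoff.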


    • middlemale
      Member
      • Feb 2010
      • 16

      #3
      Thanks, gpertea and others. I have tried modifying the genome GTF file from Ensembl (deleting all lines mentioning ENSDART00000099599), but other duplicated IDs are then reported, and there are too many to delete manually. Additionally, the raw SAM file was generated by tophat, and I sorted it again. cufflinks still reports:

      "Processing bundle [ chr1:1203-1254 ] with 1 non-redundant alignments".

      Can anyone working with human genome RNA-seq data suggest which reference GTF file should be used here?

      cheers


      • makost
        Junior Member
        • Jun 2010
        • 5

        #4
        What I did was to change the names of the duplicated genes to ENSGxxxxxxxxxxx_dup1 in the GFF file I downloaded from Ensembl for the human genome.

        Once you have no records with the same name but in different positions you should be able to run Cufflinks without any problems.
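
A rough sketch of that renaming step, under one specific reading of "different positions": the same ID appearing on more than one chromosome/strand combination. It assumes a tab-separated GTF/GFF-style file with attributes like `gene_id "..."` in column 9. The function name `rename_duplicate_ids` is hypothetical, and the `_dupN` suffix simply follows the naming described above. It would not catch an ID whose exons are far apart on the same chromosome and strand.

```python
import re

def rename_duplicate_ids(gtf_lines, attr="gene_id"):
    """Append _dup1, _dup2, ... to an ID each time it shows up at a new location.

    A "location" here is (seqname, strand); this is an assumption made for
    illustration, not a rule taken from cufflinks itself.
    """
    pattern = re.compile(attr + r' "([^"]+)"')
    seen = {}  # ID -> list of (seqname, strand) in order of first appearance
    out = []
    for line in gtf_lines:
        fields = line.split("\t")
        match = pattern.search(line)
        if line.startswith("#") or len(fields) < 9 or not match:
            out.append(line)  # pass through comments and unparsable lines
            continue
        ident = match.group(1)
        location = (fields[0], fields[6])
        locations = seen.setdefault(ident, [])
        if location not in locations:
            locations.append(location)
        n = locations.index(location)
        if n > 0:
            # Second and later locations get a distinguishing suffix.
            line = line[:match.start(1)] + f"{ident}_dup{n}" + line[match.end(1):]
        out.append(line)
    return out
```

Records at the first location seen keep their original ID, so downstream tools that look up the unmodified name still find one match.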

        Cheers
