  • krapulaxdoctor
    Member
    • May 2015
    • 22

    Problem with UCSC GTF files?

    Hi,

    I would like to ask for opinions and advice on the different sources of GTF files for annotated genes (mm10, but other assemblies as well).
    I searched first to avoid posting a duplicate (sorry if this still is one).
    The topic is briefly mentioned on other forums, but I never found a thorough discussion that gave a satisfactory explanation.

    I wanted to download GTF files (mm10) from the UCSC Genome Browser to get reference genes and transcript variants for differential transcript variant expression and splicing analyses.

    However, no matter how I set up the Table Browser (UCSC genes, NCBI RefSeq, etc.), the GTF files I obtained from UCSC were not suitable for such analyses.
    I noticed that these UCSC GTF files treat each transcript variant as a separate gene, since the "transcript ID" is identical to the "gene ID" in these files. (Did I do something wrong?)
    For these analyses I need a GTF file where each gene ID is linked to (i.e. repeated across) its multiple transcript variants, if there are variants of course. The only sources where I found such GTF files are Gencode and Ensembl.
    However, these files contain approx. 50000 genes and 150000 transcript variants, which seems like too many to me, presumably because predictions are included. UCSC has approx. 38000 entries, which might be less redundant and speculative? (No idea.)
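
One way to confirm the symptom described above is to count the transcript IDs attached to each gene ID. The following is only a minimal sketch: the function names and the example identifiers (`NM_000001`, `GeneA`, `TxA.1`, ...) are made up for illustration, and it assumes a standard 9-column, tab-separated GTF with `gene_id` and `transcript_id` in the attribute column.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def transcripts_per_gene(gtf_lines):
    """Return {gene_id: set of transcript_ids} for an iterable of GTF lines."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene and tx:
            genes[gene].add(tx)
    return genes

def looks_collapsed(genes):
    """True when every gene_id has exactly one transcript with the same ID."""
    return bool(genes) and all(txs == {gene} for gene, txs in genes.items())

# Hypothetical records: gene_id repeats the transcript accession.
ucsc_like = [
    'chr1\tmm10_refGene\texon\t100\t200\t0\t+\t.\tgene_id "NM_000001"; transcript_id "NM_000001";\n',
    'chr1\tmm10_refGene\texon\t100\t250\t0\t+\t.\tgene_id "NM_000002"; transcript_id "NM_000002";\n',
]
# Hypothetical records: one gene_id shared by two transcripts.
ensembl_like = [
    'chr1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "TxA.1";\n',
    'chr1\tsource\texon\t100\t250\t.\t+\t.\tgene_id "GeneA"; transcript_id "TxA.2";\n',
]

print(looks_collapsed(transcripts_per_gene(ucsc_like)))
print(looks_collapsed(transcripts_per_gene(ensembl_like)))
```

On a real file, passing an open file handle (for example `open("mm10.refGene.gtf")`) to `transcripts_per_gene` should work the same way, since the function only iterates over lines.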

    I would like to ask for advice on where to find, or how to make, an optimal GTF file that is suitable for differential splicing / transcript variant expression analyses.

    Would you recommend avoiding UCSC GTF files for expression analyses in general?

    Thank you for your help.

    Best.
  • doraemon
    Junior Member
    • Nov 2013
    • 2

    #2
    Hi,

    I'm not an expert and my knowledge is limited to human genes, although I'd like to think that the principles outlined here extend to mouse genes as well.

    1) RefSeq - transcripts are well supported by evidence and heavily used (NM_ .. for known protein coding)
    2) Ensembl / Gencode Comprehensive - Contains both automatically annotated and manually curated transcripts
    3) Ensembl / Gencode Basic - Contains manually curated transcripts only

    I'm not terribly familiar with UCSC. In the literature I have come across so far, the authors have almost always leaned towards using RefSeq or Ensembl.

    So the choice of which transcript annotation to go with depends on what you're trying to do.

    If you're interested in performing variant analysis of transcripts and want to ensure that they're supported by evidence, RefSeq or Gencode Basic is your friend.

    If you're concerned that limiting yourself to evidence-supported annotations might cause you to miss other, possibly novel, transcripts, then Gencode Comprehensive is the way to go.

    These two papers go into significantly more detail on the pros and cons of using one annotation set versus another.
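
If only the comprehensive GENCODE GTF is at hand, the basic subset can be pulled out of it, because records belonging to that set carry a `tag "basic";` attribute. The sketch below is an illustration under that assumption; `keep_basic` is a made-up name, and GENCODE also offers a ready-made basic annotation file for download, which is the simpler route.

```python
def keep_basic(gtf_lines):
    """Yield header comments plus records whose attributes include tag "basic".

    Note: gene-level records do not carry this tag, so they are dropped here;
    tools that need 'gene' lines would require them to be added back.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # keep the file header
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and 'tag "basic"' in cols[8]:
            yield line

# Hypothetical records for illustration only.
example = [
    '##description: example header\n',
    'chr1\tHAVANA\ttranscript\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic";\n',
    'chr1\tHAVANA\ttranscript\t1\t20\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr1\tHAVANA\tgene\t1\t20\t.\t+\t.\tgene_id "G1";\n',
]

for kept in keep_basic(example):
    print(kept, end="")
```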

    RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression ...

    A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al ...


    • krapulaxdoctor
      Member
      • May 2015
      • 22

      #3
      Dear doraemon,

      Thank you for the response. I ended up with a similar conclusion. It is a bit confusing for a non-bioinformatician like me.


      • qiongyi
        Member
        • Nov 2010
        • 10

        #4
        Hi krapulaxdoctor,

        I hit the same problem you describe in this thread. I think it is a bug in the UCSC Table Browser. To work around it, I downloaded both the GTF file and the refFlat file using the Table Browser, and then applied a custom Perl script, "gtf_addGeneName_from_refFlat.pl", to add the gene name to the GTF file.

        For your and other people's convenience, I have put my custom Perl script at https://github.com/Qiongyi/custom_PERL_scripts
        Feel free to use it if you run into a similar problem.

        Usage: gtf_addGeneName_from_refFlat.pl mm10.refGene.gtf mm10.refGene.refFlat.txt output (the updated GTF file with gene ID)
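
For readers who prefer Python, the same idea can be sketched roughly as follows. This is not the Perl script above, only an illustration of the approach: it assumes refFlat is tab-separated with the gene name in column 1 and the transcript accession in column 2, and the function names and example identifiers (`GeneA`, `NM_000001`, ...) are made up.

```python
import re

TX_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "[^"]*"')

def load_refflat(refflat_lines):
    """Map transcript accession -> gene name from refFlat columns 2 and 1."""
    mapping = {}
    for line in refflat_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            mapping[cols[1]] = cols[0]
    return mapping

def rewrite_gene_ids(gtf_lines, tx_to_gene):
    """Yield GTF lines with gene_id replaced by the refFlat gene name.

    Lines whose transcript is missing from the mapping are passed through
    unchanged.
    """
    for line in gtf_lines:
        match = TX_RE.search(line)
        gene = tx_to_gene.get(match.group(1)) if match else None
        if gene is None:
            yield line
        else:
            # A function replacement avoids backslash handling in gene names.
            yield GENE_RE.sub(lambda _m, g=gene: f'gene_id "{g}"', line, count=1)

# Hypothetical inputs for illustration only.
refflat = [
    "GeneA\tNM_000001\tchr1\t+\t99\t250\t120\t240\t1\t99,\t250,\n",
    "GeneA\tNM_000002\tchr1\t+\t99\t300\t120\t290\t1\t99,\t300,\n",
]
gtf = [
    'chr1\tmm10_refGene\texon\t100\t250\t0\t+\t.\tgene_id "NM_000001"; transcript_id "NM_000001";\n',
    'chr1\tmm10_refGene\texon\t100\t300\t0\t+\t.\tgene_id "NM_000002"; transcript_id "NM_000002";\n',
    'chr1\tmm10_refGene\texon\t500\t600\t0\t+\t.\tgene_id "NM_000003"; transcript_id "NM_000003";\n',
]

for fixed in rewrite_gene_ids(gtf, load_refflat(refflat)):
    print(fixed, end="")
```

After the rewrite, the two example transcripts share one gene ID, which is the grouping that splicing and isoform tools expect. Transcripts mapped to more than one locus would still need separate handling, since this sketch keys on the accession alone.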

        Cheers,

        Qiongyi
