Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ftian
    Junior Member
    • Jan 2014
    • 8

    How to calculate transcript length?

    Dear all,

    Do you know how to calculate transcript length for calculating RPKM?
    For each UCSC gene (transcript), is its length equal to the sum of the length of its extrons?

    Thanks,

    Feng
  • andylemire
    Member
    • Jan 2014
    • 14

    #2
    That is correct, based on my understanding. Someone please correct me if I'm wrong.

    Tophat has the gffread utility which can take a .gtf and a genome reference and extract all transcripts into a new fasta, which you could then use to calculate the length for each. Using samtools faidx you could quickly get the lengths for each (columns 1 and 2).

    Comment

    • dpryan
      Devon Ryan
      • Jul 2011
      • 3478

      #3
      andylemire is correct. Another method for calculating lengths would be to just feed your annotation file to R:

      Code:
      #!/usr/bin/env Rscript
      library(GenomicRanges)
      library(rtracklayer)
      GTFfile = "some_annotation_file.GTF"
      GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
      grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
      reducedGTF <- unlist(grl, use.names=T)
      elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
      elementMetadata(reducedGTF)$widths <- width(reducedGTF)
      calc_length <- function(x) {
          sum(elementMetadata(x)$widths)
      }
      output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
      colnames(output) <- c("Length")
      #You can now do whatever you want with "output"
      The original version of the script above also calculated GC content (sometimes useful for RNAseq normalization).

      Comment

      • ftian
        Junior Member
        • Jan 2014
        • 8

        #4
        Dear all,
        Than you for your reply very much.

        By the way, how do you map UCSC gene ID to entrez ID? Which UCSC table should I use?

        Can all the UCSC genes (transcripts) be mapped to entrez gene?

        Comment

        • dpryan
          Devon Ryan
          • Jul 2011
          • 3478

          #5
          FYI, your life will be easier if you stick to Ensembl annotations, they're better in pretty much all respects.

          Regarding your actual question, have a look at the knownToLocusLink table. LocusLink is (at least normally) the same as the entrez ID, since locus link was the predecessor to entrez. You might also find the various kgXX (known gene, aka UCSC ID based) tables and the knownToXXX (known gene to something else conversion) useful.

          Comment

          • ftian
            Junior Member
            • Jan 2014
            • 8

            #6
            Hi dpryan,

            Thanks for your reply. The knownToLocusLink has the mapping from UCSC genes to Entrez ID. However, about 20% UCSC genes have no related Entrez ID. For these un-mapped UCSC, should I just ignore them when estimating gene expression from isoform expression?

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              Why are you trying to estimate gene expression from isoform expression instead of just counting over genes? That latter would seem much easier. For the 20% with no Entrez ID, have a look at what a few of them are. I recall most of them being Riken genes or pseudogenes (i.e., not interesting for you), but it's been quite a while...

              Comment

              • ftian
                Junior Member
                • Jan 2014
                • 8

                #8
                Do you mean sum the read counts which are aligned to gene?
                For my data, I only have the count number of isoforms.

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Yes, summing the read count per-gene is the normal method. You should have the raw alignments, so just use featureCounts or htseq-count to get the per-gene metrics.

                  Comment

                  • ftian
                    Junior Member
                    • Jan 2014
                    • 8

                    #10
                    Get it. Thanks!

                    Comment

                    • velt
                      Member
                      • Jun 2013
                      • 10

                      #11
                      Dear dpryan,

                      I wrote a script similar to your script, to calculate the size of ensembl genes. However, I have a question for you. How you handle transcripts that overlap ?
                      Indeed, by not taking into account 'transcripts that overlap' in my script, I wondered if I did not overestimated the length of genes.
                      In your opinion, is it more just to find these transcripts and count the region once rather than twice (or more)?

                      Thank you in advance for your answers.

                      (PS: Sorry if my english is bad, it is not my native language)

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        Hi velt,

                        The relevant portion of my script is the "reduce()" function. In GenomicRanges in R, this merges together any overlapping exons. So, if you have two overlapping transcripts, the bases in the overlap region are only counted once (as opposed to being double counted). As you mentioned, not doing this will lead to heavily spliced transcripts appearing much much longer than they actually are.

                        Hope that helps (and no worries about your English, it's quite good and vastly better than my French)!

                        Comment

                        • velt
                          Member
                          • Jun 2013
                          • 10

                          #13
                          Thank you very much for your reply ! I will proceed in this manner to treat transcripts that overlap.

                          Have a nice day.

                          Comment

                          Latest Articles

                          Collapse

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by SEQadmin2, 06-09-2026, 11:58 AM
                          0 responses
                          25 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-05-2026, 10:09 AM
                          0 responses
                          30 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-04-2026, 08:59 AM
                          0 responses
                          39 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-02-2026, 12:03 PM
                          0 responses
                          62 views
                          0 reactions
                          Last Post SEQadmin2  
                          Working...