Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate transcript length?

    Dear all,

    Do you know how to calculate transcript length for calculating RPKM?
    For each UCSC gene (transcript), is its length equal to the sum of the length of its extrons?

    Thanks,

    Feng

  • #2
    That is correct, based on my understanding. Someone please correct me if I'm wrong.

    Tophat has the gffread utility which can take a .gtf and a genome reference and extract all transcripts into a new fasta, which you could then use to calculate the length for each. Using samtools faidx you could quickly get the lengths for each (columns 1 and 2).

    Comment


    • #3
      andylemire is correct. Another method for calculating lengths would be to just feed your annotation file to R:

      Code:
      #!/usr/bin/env Rscript
      library(GenomicRanges)
      library(rtracklayer)
      GTFfile = "some_annotation_file.GTF"
      GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
      grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
      reducedGTF <- unlist(grl, use.names=T)
      elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
      elementMetadata(reducedGTF)$widths <- width(reducedGTF)
      calc_length <- function(x) {
          sum(elementMetadata(x)$widths)
      }
      output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
      colnames(output) <- c("Length")
      #You can now do whatever you want with "output"
      The original version of the script above also calculated GC content (sometimes useful for RNAseq normalization).

      Comment


      • #4
        Dear all,
        Than you for your reply very much.

        By the way, how do you map UCSC gene ID to entrez ID? Which UCSC table should I use?

        Can all the UCSC genes (transcripts) be mapped to entrez gene?

        Comment


        • #5
          FYI, your life will be easier if you stick to Ensembl annotations, they're better in pretty much all respects.

          Regarding your actual question, have a look at the knownToLocusLink table. LocusLink is (at least normally) the same as the entrez ID, since locus link was the predecessor to entrez. You might also find the various kgXX (known gene, aka UCSC ID based) tables and the knownToXXX (known gene to something else conversion) useful.

          Comment


          • #6
            Hi dpryan,

            Thanks for your reply. The knownToLocusLink has the mapping from UCSC genes to Entrez ID. However, about 20% UCSC genes have no related Entrez ID. For these un-mapped UCSC, should I just ignore them when estimating gene expression from isoform expression?

            Comment


            • #7
              Why are you trying to estimate gene expression from isoform expression instead of just counting over genes? That latter would seem much easier. For the 20% with no Entrez ID, have a look at what a few of them are. I recall most of them being Riken genes or pseudogenes (i.e., not interesting for you), but it's been quite a while...

              Comment


              • #8
                Do you mean sum the read counts which are aligned to gene?
                For my data, I only have the count number of isoforms.

                Comment


                • #9
                  Yes, summing the read count per-gene is the normal method. You should have the raw alignments, so just use featureCounts or htseq-count to get the per-gene metrics.

                  Comment


                  • #10
                    Get it. Thanks!

                    Comment


                    • #11
                      Dear dpryan,

                      I wrote a script similar to your script, to calculate the size of ensembl genes. However, I have a question for you. How you handle transcripts that overlap ?
                      Indeed, by not taking into account 'transcripts that overlap' in my script, I wondered if I did not overestimated the length of genes.
                      In your opinion, is it more just to find these transcripts and count the region once rather than twice (or more)?

                      Thank you in advance for your answers.

                      (PS: Sorry if my english is bad, it is not my native language)

                      Comment


                      • #12
                        Hi velt,

                        The relevant portion of my script is the "reduce()" function. In GenomicRanges in R, this merges together any overlapping exons. So, if you have two overlapping transcripts, the bases in the overlap region are only counted once (as opposed to being double counted). As you mentioned, not doing this will lead to heavily spliced transcripts appearing much much longer than they actually are.

                        Hope that helps (and no worries about your English, it's quite good and vastly better than my French)!

                        Comment


                        • #13
                          Thank you very much for your reply ! I will proceed in this manner to treat transcripts that overlap.

                          Have a nice day.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Best Practices for Single-Cell Sequencing Analysis
                            by seqadmin



                            While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                            06-06-2024, 07:15 AM
                          • seqadmin
                            Latest Developments in Precision Medicine
                            by seqadmin



                            Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                            Somatic Genomics
                            “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                            05-24-2024, 01:16 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 06-17-2024, 06:54 AM
                          0 responses
                          10 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-14-2024, 07:24 AM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-13-2024, 08:58 AM
                          0 responses
                          17 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 06-12-2024, 02:20 PM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X