Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate transcript length?

    Dear all,

    Do you know how to calculate transcript length for calculating RPKM?
    For each UCSC gene (transcript), is its length equal to the sum of the length of its extrons?

    Thanks,

    Feng

  • #2
    That is correct, based on my understanding. Someone please correct me if I'm wrong.

    Tophat has the gffread utility which can take a .gtf and a genome reference and extract all transcripts into a new fasta, which you could then use to calculate the length for each. Using samtools faidx you could quickly get the lengths for each (columns 1 and 2).

    Comment


    • #3
      andylemire is correct. Another method for calculating lengths would be to just feed your annotation file to R:

      Code:
      #!/usr/bin/env Rscript
      library(GenomicRanges)
      library(rtracklayer)
      GTFfile = "some_annotation_file.GTF"
      GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
      grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
      reducedGTF <- unlist(grl, use.names=T)
      elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
      elementMetadata(reducedGTF)$widths <- width(reducedGTF)
      calc_length <- function(x) {
          sum(elementMetadata(x)$widths)
      }
      output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
      colnames(output) <- c("Length")
      #You can now do whatever you want with "output"
      The original version of the script above also calculated GC content (sometimes useful for RNAseq normalization).

      Comment


      • #4
        Dear all,
        Than you for your reply very much.

        By the way, how do you map UCSC gene ID to entrez ID? Which UCSC table should I use?

        Can all the UCSC genes (transcripts) be mapped to entrez gene?

        Comment


        • #5
          FYI, your life will be easier if you stick to Ensembl annotations, they're better in pretty much all respects.

          Regarding your actual question, have a look at the knownToLocusLink table. LocusLink is (at least normally) the same as the entrez ID, since locus link was the predecessor to entrez. You might also find the various kgXX (known gene, aka UCSC ID based) tables and the knownToXXX (known gene to something else conversion) useful.

          Comment


          • #6
            Hi dpryan,

            Thanks for your reply. The knownToLocusLink has the mapping from UCSC genes to Entrez ID. However, about 20% UCSC genes have no related Entrez ID. For these un-mapped UCSC, should I just ignore them when estimating gene expression from isoform expression?

            Comment


            • #7
              Why are you trying to estimate gene expression from isoform expression instead of just counting over genes? That latter would seem much easier. For the 20% with no Entrez ID, have a look at what a few of them are. I recall most of them being Riken genes or pseudogenes (i.e., not interesting for you), but it's been quite a while...

              Comment


              • #8
                Do you mean sum the read counts which are aligned to gene?
                For my data, I only have the count number of isoforms.

                Comment


                • #9
                  Yes, summing the read count per-gene is the normal method. You should have the raw alignments, so just use featureCounts or htseq-count to get the per-gene metrics.

                  Comment


                  • #10
                    Get it. Thanks!

                    Comment


                    • #11
                      Dear dpryan,

                      I wrote a script similar to your script, to calculate the size of ensembl genes. However, I have a question for you. How you handle transcripts that overlap ?
                      Indeed, by not taking into account 'transcripts that overlap' in my script, I wondered if I did not overestimated the length of genes.
                      In your opinion, is it more just to find these transcripts and count the region once rather than twice (or more)?

                      Thank you in advance for your answers.

                      (PS: Sorry if my english is bad, it is not my native language)

                      Comment


                      • #12
                        Hi velt,

                        The relevant portion of my script is the "reduce()" function. In GenomicRanges in R, this merges together any overlapping exons. So, if you have two overlapping transcripts, the bases in the overlap region are only counted once (as opposed to being double counted). As you mentioned, not doing this will lead to heavily spliced transcripts appearing much much longer than they actually are.

                        Hope that helps (and no worries about your English, it's quite good and vastly better than my French)!

                        Comment


                        • #13
                          Thank you very much for your reply ! I will proceed in this manner to treat transcripts that overlap.

                          Have a nice day.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Genetic Variation in Immunogenetics and Antibody Diversity
                            by seqadmin



                            The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                            11-06-2024, 07:24 PM
                          • seqadmin
                            Choosing Between NGS and qPCR
                            by seqadmin



                            Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                            10-18-2024, 07:11 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Today, 11:09 AM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, Today, 06:13 AM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 11-01-2024, 06:09 AM
                          0 responses
                          30 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 10-30-2024, 05:31 AM
                          0 responses
                          21 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X