Seqanswers Leaderboard Ad

**andylemire** · 01-09-2014, 06:31 PM

That is correct, based on my understanding. Someone please correct me if I'm wrong.

Tophat has the gffread utility which can take a .gtf and a genome reference and extract all transcripts into a new fasta, which you could then use to calculate the length for each. Using samtools faidx you could quickly get the lengths for each (columns 1 and 2).

**dpryan** · 01-10-2014, 01:41 AM

andylemire is correct. Another method for calculating lengths would be to just feed your annotation file to R:

Code:

#!/usr/bin/env Rscript
library(GenomicRanges)
library(rtracklayer)
GTFfile = "some_annotation_file.GTF"
GTF <- import.gff(GTFfile, format="gtf", genome="GRCm38.71", asRangedData=F, feature.type="exon")
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)
calc_length <- function(x) {
    sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
colnames(output) <- c("Length")
#You can now do whatever you want with "output"

The original version of the script above also calculated GC content (sometimes useful for RNAseq normalization).

**ftian** · 01-10-2014, 06:51 AM

Dear all,
Than you for your reply very much.

By the way, how do you map UCSC gene ID to entrez ID? Which UCSC table should I use?

Can all the UCSC genes (transcripts) be mapped to entrez gene?

**dpryan** · 01-10-2014, 07:15 AM

FYI, your life will be easier if you stick to Ensembl annotations, they're better in pretty much all respects.

Regarding your actual question, have a look at the knownToLocusLink table. LocusLink is (at least normally) the same as the entrez ID, since locus link was the predecessor to entrez. You might also find the various kgXX (known gene, aka UCSC ID based) tables and the knownToXXX (known gene to something else conversion) useful.

**ftian** · 01-10-2014, 08:28 AM

Hi dpryan,

Thanks for your reply. The knownToLocusLink has the mapping from UCSC genes to Entrez ID. However, about 20% UCSC genes have no related Entrez ID. For these un-mapped UCSC, should I just ignore them when estimating gene expression from isoform expression?

**dpryan** · 01-10-2014, 08:31 AM

Why are you trying to estimate gene expression from isoform expression instead of just counting over genes? That latter would seem much easier. For the 20% with no Entrez ID, have a look at what a few of them are. I recall most of them being Riken genes or pseudogenes (i.e., not interesting for you), but it's been quite a while...

**ftian** · 01-10-2014, 08:40 AM

Do you mean sum the read counts which are aligned to gene?
For my data, I only have the count number of isoforms.

**dpryan** · 01-10-2014, 08:41 AM

Yes, summing the read count per-gene is the normal method. You should have the raw alignments, so just use featureCounts or htseq-count to get the per-gene metrics.

**ftian** · 01-10-2014, 08:49 AM

Get it. Thanks!

**velt** · 01-16-2014, 02:03 AM

Dear dpryan,

I wrote a script similar to your script, to calculate the size of ensembl genes. However, I have a question for you. How you handle transcripts that overlap ?
Indeed, by not taking into account 'transcripts that overlap' in my script, I wondered if I did not overestimated the length of genes.
In your opinion, is it more just to find these transcripts and count the region once rather than twice (or more)?

Thank you in advance for your answers.

(PS: Sorry if my english is bad, it is not my native language)

**dpryan** · 01-16-2014, 02:12 AM

Hi velt,

The relevant portion of my script is the "reduce()" function. In GenomicRanges in R, this merges together any overlapping exons. So, if you have two overlapping transcripts, the bases in the overlap region are only counted once (as opposed to being double counted). As you mentioned, not doing this will lead to heavily spliced transcripts appearing much much longer than they actually are.

Hope that helps (and no worries about your English, it's quite good and vastly better than my French)!

**velt** · 01-16-2014, 04:42 AM

Thank you very much for your reply ! I will proceed in this manner to treat transcripts that overlap.

Have a nice day.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

How to calculate transcript length?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News