**dpryan** · 10-12-2015, 03:06 AM

There isn't necessarily a "most important" transcript. The most relevant one will vary by tissue, developmental stage and treatment condition.

**litali** · 10-12-2015, 03:23 AM

transcripts ensembl

But where can I find this information? In Ensembl I only see the list of transcripts. However, articles in the literature usually address only one of the transcripts.

**dpryan** · 10-12-2015, 03:25 AM

Depends on the organism you're working on. Some of them have expression databases, others don't. You'll have to find that out and check them (if they exist).

**GenoMax** · 10-12-2015, 03:38 AM

You could refer back to the transcript that codes for the protein in RefSeq or CCDS (if you want just "one" transcript). If this is a non-human/mouse gene then CCDS won't work.

**litali** · 10-12-2015, 03:42 AM

I am working on human. The gene I am working on has several transcripts, and more than one of them is protein coding, but all the work in the literature relates to only one of them. The difference between the transcripts is one exon. Both transcripts have a CCDS.

**GenoMax** · 10-12-2015, 04:00 AM

Can you clarify what exactly you are looking to do with this?

If only one protein is referred to in literature then perhaps that is the dominant isoform. As Devon mentioned there could be tissue/cell/development specific need for other versions.

You could look in Illumina Bodymap data (or a more specific place like the TCGA data) to see if there is evidence for presence of specific versions in different tissues/conditions.

**litali** · 10-12-2015, 04:27 AM

transcript

Exactly, I think one of the transcripts is the dominant form. I just wonder how one can know which one is dominant?

**turnersd** · 10-12-2015, 04:27 AM

This may help.

{APPRIS} - Annotating principal splice isoforms

http://appris.bioinfo.cnio.es/

Explore and download data on alternative splicing annotations and principal isoforms with the APPRIS Database, WebServer and WebServices.

**blancha** · 10-12-2015, 04:28 AM

The exact term is the "canonical transcript", which is generally the longest transcript.
You'll find many posts on this somewhat controversial topic, if you google "canonical transcript".
Different databases may also not have the same canonical transcript for a given gene.

Here is the definition of the canonical transcript from Ensembl.
"For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."

http://uswest.ensembl.org/Help/Glossary?id=346;redirect=no
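The Ensembl hierarchy quoted above can be sketched in a few lines of Python. This is only an illustration of the four-step rule as quoted, not Ensembl's actual code: the `Transcript` record, its field names, and `pick_canonical` are all hypothetical, and ties within a tier simply go to whichever transcript comes first in the input.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transcript:
    # Hypothetical record; field names are assumptions for illustration only.
    transcript_id: str
    length: int                                # transcript length in bases
    translation_length: Optional[int] = None   # None means no translation
    has_ccds: bool = False                     # has a CCDS translation
    is_merged: bool = False                    # Ensembl/Havana merged model
    has_stop_codon: bool = False               # translation contains a stop codon


def pick_canonical(transcripts: List[Transcript]) -> Optional[Transcript]:
    """Apply the quoted four-step hierarchy to the transcripts of one gene."""

    def clean(t: Transcript) -> bool:
        # A usable translation: present and free of stop codons.
        return t.translation_length is not None and not t.has_stop_codon

    tiers = [
        [t for t in transcripts if clean(t) and t.has_ccds],   # step 1
        [t for t in transcripts if clean(t) and t.is_merged],  # step 2
        [t for t in transcripts if clean(t)],                  # step 3
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.translation_length)

    # Step 4: no usable translation, so take the longest non-coding transcript.
    non_coding = [t for t in transcripts if t.translation_length is None]
    if non_coding:
        return max(non_coding, key=lambda t: t.length)
    return None
```

Note that a shorter CCDS translation wins over a longer non-CCDS one here, which is one reason the canonical transcript is not always the longest.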

Strangely enough, you can't get the canonical transcript ID through Ensembl's biomaRt.
You can get it using the Ensembl Perl (ugh!) API though.

https://groups.google.com/forum/#!topic/biomart-users/skO4zgqzGBA

UCSC, on the other hand, has a table called knownCanonical, which you can download with the UCSC Table Browser.
I generally prefer using Ensembl, but in this case, UCSC is the one that provides the simplest method to get the canonical transcript.
Their method for defining the canonical transcript is murky though, since it is not always the longest transcript. There is some human curation involved.

**GenoMax** · 10-12-2015, 04:33 AM

@turnersd: Thanks for sharing that site.

Looking at the example they list on the site, there are still 2 principal isoforms listed for this gene (http://appris.bioinfo.cnio.es/#/data...099899?db=hg38), so @litali may be left with the same conundrum.

**blancha** · 10-12-2015, 04:41 AM

@GenoMax

The UCSC Genome Group suggests just picking one at random.

"Thank you for your question about the knownCanonical table. Unfortunately, the issue of a gene being assigned multiple transcripts is still present in our most recent versions of the knownCanonical table. We are looking at different solutions to this complex problem, and hope to have this resolved in a future version of the UCSC Genes track. For the transcript you mentioned in your email, one of our engineers suggests arbitrarily choosing which of the two transcripts to keep and which to discard.

[...]

Matthew Speir
UCSC Genome Bioinformatics Group"

https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/Ayqf3SRSTDk

Edit: Just clicked on the link you provided. The two longest transcripts have exactly the same length, so they're obviously both reported as being the canonical, or principal, transcript. So the computationally simplest way to determine the canonical transcript is to report the longest transcript. If there is a tie in length, simply report the first transcript in numeric order. Biologically, it doesn't make much sense, but it is computationally simple. Given that different databases report different transcripts, even this algorithm will not always return the same canonical transcript for different databases.
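A minimal sketch of that rule in Python, assuming you already have a mapping from transcript ID to length for one gene. The helper names are made up for illustration, and "numeric order" is interpreted here as the first run of digits in the ID, which is an assumption.

```python
import re
from typing import Dict, Optional


def numeric_part(transcript_id: str) -> int:
    # First run of digits in the ID; IDs without digits sort first (assumption).
    match = re.search(r"\d+", transcript_id)
    return int(match.group(0)) if match else 0


def longest_transcript(lengths: Dict[str, int]) -> Optional[str]:
    """Return the longest transcript; break ties by lowest numeric ID."""
    if not lengths:
        return None
    # Sort key: longest first (negated length), then smallest numeric ID.
    return min(lengths, key=lambda tid: (-lengths[tid], numeric_part(tid)))
```

For example, with two made-up IDs of equal length, `longest_transcript({"ENST00000000200": 1500, "ENST00000000100": 1500})` picks `ENST00000000100`. Lengths taken from different databases can differ, so the result can differ too.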

**blancha** · 10-12-2015, 05:22 AM

I've posted the APPRIS flags for principal isoforms below.
Really, I think the algorithm I posted above is the most computationally straightforward manner of identifying the canonical transcript.

{APPRIS} - Annotating principal splice isoforms

http://appris.bioinfo.cnio.es/#/downloads

Principal Isoform flags

APPRIS selects a single CDS variant for each gene as the 'PRINCIPAL' isoform based on the range of protein features. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. The definitions of the flags are as follows:

PRINCIPAL:1

Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.

PRINCIPAL:2

Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.

If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.

PRINCIPAL:3

Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.

Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.

PRINCIPAL:4

Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.

PRINCIPAL:5

Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
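If you have the flags for a gene's transcripts in hand, picking out the principal ones is straightforward. The sketch below assumes a plain dict from transcript ID to a label string such as `PRINCIPAL:3`; the function name and input format are hypothetical and do not reflect the layout of the actual APPRIS download files. Any label that is not of the form `PRINCIPAL:<number>` is ignored.

```python
from typing import Dict, List, Tuple


def principal_isoforms(annotations: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (transcript_id, level) pairs, most reliable (level 1) first."""
    picks = []
    for transcript_id, label in annotations.items():
        name, _, level = label.partition(":")
        # Keep only labels of the form PRINCIPAL:<number>.
        if name == "PRINCIPAL" and level.isdigit():
            picks.append((transcript_id, int(level)))
    # Lower level means more reliable; sort by ID within a level for stability.
    return sorted(picks, key=lambda pick: (pick[1], pick[0]))
```

It returns a list rather than a single ID because, as noted earlier in the thread, a gene can still end up with more than one transcript listed as principal.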

**GenoMax** · 10-12-2015, 07:36 AM

We are leaving @litali more or less where (s)he was when this thread was started.

Or maybe not. Dare we say that the most "important" transcript is the one that is most abundant/prevalent?

The possibility remains that the longest canonical variant may not be the most prevalent.
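If you go the abundance route, a rough way to check this is to take per-transcript expression estimates for each tissue and see which transcript dominates where. The sketch below assumes a nested dict of tissue to transcript to abundance (for instance values you extracted yourself from an expression dataset); the function name and input layout are hypothetical.

```python
from typing import Dict, Tuple


def dominant_by_tissue(
    expression: Dict[str, Dict[str, float]]
) -> Dict[str, Tuple[str, float]]:
    """For each tissue, return (top transcript, its share of the gene's total)."""
    result = {}
    for tissue, values in expression.items():
        total = sum(values.values())
        if total <= 0:
            # Gene not detectably expressed in this tissue; nothing to report.
            continue
        top = max(values, key=values.get)
        result[tissue] = (top, values[top] / total)
    return result
```

If the same transcript comes out on top with a large share in every tissue you care about, calling it the dominant isoform seems reasonable. If the top transcript changes between tissues, that is the tissue-specific situation described at the start of the thread.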

which is the main transcript

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News