For a specific gene, how can I tell which transcript Ensembl treats as the main one? Each gene has a list of several transcripts, and the main one is not always the first or the longest. So how is the most important transcript defined?
-
I am working on human. The gene I am working on has several transcripts, more than one of which is protein coding, but all the work in the literature relates to only one of them. The transcripts differ by one exon. Both transcripts have a CCDS ID.
-
Can you clarify what exactly you are looking to do with this?
If only one protein is referred to in literature then perhaps that is the dominant isoform. As Devon mentioned there could be tissue/cell/development specific need for other versions.
You could look in Illumina Bodymap data (or a more specific place like the TCGA data) to see if there is evidence for presence of specific versions in different tissues/conditions.
-
The exact term is the "canonical transcript", which is generally the longest transcript.
You'll find many posts on this somewhat controversial topic, if you google "canonical transcript".
Different databases may also not have the same canonical transcript for a given gene.
Here is the definition of the canonical transcript from Ensembl.
"For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."
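The hierarchy quoted above can be sketched in a few lines of Python. This is only an illustration of the four rules as quoted, not Ensembl's actual code: the `Transcript` record, its field names and the `pick_canonical` function are all made up for this example, and ties within a tier are broken simply by input order.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; field names are assumptions for this sketch.
    transcript_id: str
    transcript_length: int           # spliced transcript length in bp
    translation_length: int = 0      # protein length in aa, 0 = no translation
    has_stop_codons: bool = False    # translation contains stop codons
    has_ccds: bool = False           # transcript carries a CCDS ID
    is_merged: bool = False          # Ensembl/Havana merged model


def pick_canonical(transcripts: Sequence[Transcript]) -> Optional[Transcript]:
    """Apply the quoted four-step hierarchy to the transcripts of one gene."""
    # Translations that are usable for rules 1 to 3.
    clean = [t for t in transcripts
             if t.translation_length > 0 and not t.has_stop_codons]
    tiers = [
        [t for t in clean if t.has_ccds],    # rule 1: longest CCDS translation
        [t for t in clean if t.is_merged],   # rule 2: longest merged translation
        clean,                               # rule 3: longest translation
    ]
    for tier in tiers:
        if tier:
            # max() keeps the first transcript seen when lengths are tied.
            return max(tier, key=lambda t: t.translation_length)
    # Rule 4: no usable translation, fall back to the longest
    # non-protein-coding transcript (None if there is none).
    noncoding = [t for t in transcripts if t.translation_length == 0]
    return max(noncoding, key=lambda t: t.transcript_length, default=None)
```

Note that under rule 1 a shorter CCDS transcript wins over a longer non-CCDS one, which is one reason the canonical transcript is not always the longest in the list.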
Strangely enough, you can't get the canonical transcript ID through Ensembl's biomaRt.
You can get it using the Ensembl Perl (ugh!) API though.
UCSC, on the other hand, has a table called knownCanonical, which you can download with the UCSC Table Browser.
I generally prefer using Ensembl, but in this case, UCSC is the one that provides the simplest method to get the canonical transcript.
Their method for defining the canonical transcript is murky though, since it is not always the longest transcript. There is some human curation involved.
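If you go the UCSC route, reading the downloaded table is straightforward. Below is a minimal sketch that assumes a tab-separated Table Browser export with the columns chrom, chromStart, chromEnd, clusterId, transcript and protein. Check the header of your own download, since the layout may differ between assemblies and releases. The function name and the two example rows are made up.

```python
import csv
import io

# Column order assumed when the export has no "#..." header line.
ASSUMED_COLUMNS = ["chrom", "chromStart", "chromEnd",
                   "clusterId", "transcript", "protein"]


def read_canonical_ids(handle):
    """Return the set of transcript IDs listed in a knownCanonical export."""
    columns = ASSUMED_COLUMNS
    canonical = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            # Header line: take the column names from the file itself.
            columns = [row[0].lstrip("#")] + row[1:]
            continue
        record = dict(zip(columns, row))
        canonical.add(record["transcript"])
    return canonical


# Made-up rows, only to show the expected shape of the input.
example = io.StringIO(
    "#chrom\tchromStart\tchromEnd\tclusterId\ttranscript\tprotein\n"
    "chr1\t100\t900\t1\ttxA.1\tprotA\n"
    "chr2\t500\t1500\t2\ttxB.2\tprotB\n"
)
canonical_ids = read_canonical_ids(example)
```

For a real file, replace the `io.StringIO` object with `open("knownCanonical.txt")` (hypothetical file name) and test your transcript IDs for membership in the returned set.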
-
@turnersd: Thanks for sharing that site.
Looking at the example they list on the site, there are still 2 principal isoforms listed for this gene (http://appris.bioinfo.cnio.es/#/data...099899?db=hg38), so @litali may be left with the same conundrum.
-
@GenoMax
The UCSC Genome Group suggests just picking one at random.
"Thank you for your question about the knownCanonical table. Unfortunately, the issue of a gene being assigned multiple transcripts is still present in our most recent versions of the knownCanonical table. We are looking at different solutions to this complex problem, and hope to have this resolved in a future version of the UCSC Genes track. For the transcript you mentioned in your email, one of our engineers suggests arbitrarily choosing which of the two transcripts to keep and which to discard.
[...]
Matthew Speir
UCSC Genome Bioinformatics Group"
Edit: Just clicked on the link you provided. The two longest transcripts have exactly the same length, so they're obviously both reported as being the canonical, or principal, transcript. So the computationally simplest way to determine the canonical transcript is to report the longest transcript and, if there is a tie in length, to report the first transcript in numeric order. Biologically, it doesn't make much sense, but it is computationally simple. Given that different databases report different transcripts, even this algorithm will not always return the same canonical transcript for different databases.
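A minimal sketch of that rule in Python, assuming you already have a mapping from transcript ID to length for one gene. "Numeric order" is interpreted here as the first run of digits in the ID, which is my assumption; the function names and the IDs in the example are made up.

```python
import re


def numeric_part(transcript_id):
    """First run of digits in the ID as an integer (IDs without digits sort last)."""
    match = re.search(r"\d+", transcript_id)
    return int(match.group()) if match else float("inf")


def longest_transcript(lengths):
    """Pick the longest transcript; break length ties by numeric ID order.

    `lengths` maps transcript ID -> transcript length for a single gene.
    """
    if not lengths:
        return None
    # Sort key: longest first, then lowest numeric ID, then the ID string
    # itself so the result is deterministic.
    return min(lengths, key=lambda tid: (-lengths[tid], numeric_part(tid), tid))


# Made-up IDs: two transcripts tied at 2500 bp.
example = {
    "ENST00000000002": 2500,
    "ENST00000000001": 2500,
    "ENST00000000003": 1800,
}
choice = longest_transcript(example)  # "ENST00000000001"
```

The result depends entirely on which transcript set you feed in, so the same gene can still give different answers from different databases.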
-
I've posted the APPRIS flags for principal isoforms below.
Really, I think the algorithm I posted above is the most computationally straightforward manner of identifying the canonical transcript.
Principal Isoform flags
APPRIS selects a single CDS variant for each gene as the 'PRINCIPAL' isoform based on the range of protein features. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable. The flags are defined as follows:
PRINCIPAL:1
Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS database. The APPRIS core modules map protein structural and functional information and cross-species conservation to the annotated variants.
PRINCIPAL:2
Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
If one (but no more than one) of these candidates has a distinct CCDS identifier it is selected as the principal variant for that gene. A CCDS identifier shows that there is consensus between RefSeq and GENCODE/Ensembl for that variant, guaranteeing that the variant has cDNA support.
PRINCIPAL:3
Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
Consensus CDS annotated earlier are likely to have more cDNA evidence. Consecutive CCDS identifiers are not included in this flag, since they will have been annotated in the same release of CCDS. These are distinguished with the next flag.
PRINCIPAL:4
Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
PRINCIPAL:5
Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
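If you have the APPRIS label for each transcript of a gene, ranking by these flags might look like the sketch below. It assumes the labels are strings exactly of the form "PRINCIPAL:1" to "PRINCIPAL:5" as listed above and treats anything else as not principal. The function names and transcript IDs are made up, and how you obtain the labels (download, web service) is left out.

```python
import re

# Matches only the five flags described above.
FLAG_PATTERN = re.compile(r"^PRINCIPAL:([1-5])$")


def principal_rank(flag):
    """Return 1-5 for a PRINCIPAL flag, or None for any other label."""
    match = FLAG_PATTERN.match((flag or "").strip())
    return int(match.group(1)) if match else None


def best_principal(flags_by_transcript):
    """Return the transcript IDs carrying the most reliable PRINCIPAL flag.

    A list is returned because a gene can have more than one transcript
    with the same flag.
    """
    ranked = {}
    for transcript_id, flag in flags_by_transcript.items():
        rank = principal_rank(flag)
        if rank is not None:
            ranked[transcript_id] = rank
    if not ranked:
        return []
    best = min(ranked.values())  # 1 is the most reliable flag
    return sorted(tid for tid, rank in ranked.items() if rank == best)


# Made-up example: two transcripts share the best flag.
example = {
    "tx1": "PRINCIPAL:3",
    "tx2": "PRINCIPAL:3",
    "tx3": "",  # no principal flag
}
principal = best_principal(example)  # ["tx1", "tx2"]
```

When the list has more than one entry you are back to needing a tie-breaker, such as the longest-transcript rule from my earlier post.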
-
We are leaving @litali more or less where (s)he was when this thread was started.
Or maybe not. Dare we say that the most "important" transcript is the one that is most abundant/prevalent?
The possibility remains that the longest canonical variant may not be the most prevalent.