Dear all,
Before I dive into details, my question is whether it is a sound idea to envisage:
I describe here my situation using the Kallisto pseudo-aligner and Ensembl Homo_sapiens.GRCh38.cdna.all.fa, but this may well apply for all the tools out there using transcriptome fasta files.
After using the above cDNA Fasta file with Kallisto, I have realised that it includes transcripts encoded in haplotypes (e.g. CHR_HSCHR8_4_CTG7), and that those transcripts have distinct gene identifiers.
The presence of haplotypes clearly seems to confuse Kallisto for genes that have transcripts encoded in those haplotype regions. I illustrate my point with the TNF gene, that has one transcript on chr 6, while seven haplotypes exist that contain the TNF gene sequence:
GENENAME GENEID SEQNAME GENESEQSTART GENESEQEND
1 TNF ENSG00000204490 CHR_HSCHR6_MHC_COX_CTG1 31562973 31565742
2 TNF ENSG00000206439 CHR_HSCHR6_MHC_QBL_CTG1 31565793 31568564
3 TNF ENSG00000223952 CHR_HSCHR6_MHC_MCF_CTG1 31651872 31654641
4 TNF ENSG00000228321 CHR_HSCHR6_MHC_MANN_CTG1 31615015 31617784
5 TNF ENSG00000228849 CHR_HSCHR6_MHC_DBB_CTG1 31557707 31560476
6 TNF ENSG00000228978 CHR_HSCHR6_MHC_APD_CTG1 31643520 31645322
7 TNF ENSG00000230108 CHR_HSCHR6_MHC_SSTO_CTG1 31566312 31569081
8 TNF ENSG00000232810 6 31575567 31578336
Now, I show below Kallisto-estimated counts for 4 single cells (a subset for example here) from a single human donor (max. two haplotypes expected) using Kallisto.
It reveals identical gene expression levels for seven of those gene identifiers (which therefore most likely share identical pseudo-mappable sequence). Notably, when the remaining haplotype is detected, lower expression is estimated in the other haplotypes (I just checked and validated this observation in many other samples not shown here).
201202 201204 201206 201207
TNF_ENSG00000204490 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000206439 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000223952 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228321 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228849 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228978 0.00000 0.000 69.053237 0.00000
TNF_ENSG00000230108 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000232810 93.46909 2305.451 2.403247 91.27554
Now, my main worry is that the presence of TNF in multiple haplotypes forces Kallisto to assign 1/7th of the "true" expression level to each of the gene identifiers, minus the expression level assigned to the last haplotype.
Out of curiosity, I have plotted the colSum of the log10-transformed Kallisto-estimated counts above for all samples in the data set. In other words, I manually collapsed all haplotypes counts for TNF in a single value. There are two infection conditions (D and L), and one uninfected (Mock) condition. I also highlighted the samples were the "lone haplotype" was detected:
.
And to come back to my initial question:
I am aware of the gffread command available here that could be used to generate a subsetted cDNA Fasta file based on a GFF file, which itself could be restricted to exclude haplotypes and patches. But if we'd be going that way, rather than leaving the filtering open to interpretation by users, I thought that having a reference file provided by Ensembl would be preferable.
Thanks in advance for feedback.
Kévin
Before I dive into details, my question is whether it is a sound idea to envisage:
- an Ensembl "cdna" Fasta file that, similarly to the Ensembl FTP files "...dna...primary_assembly.fa", includes only transcripts excluding haplotypes and patches.
- a method to collapse expression levels of genes present in multiple haplotypes in the currently available Ensembl FTP full cDNA Fasta files "...cdna.all.fa" into a single gene expression level
I describe here my situation using the Kallisto pseudo-aligner and Ensembl Homo_sapiens.GRCh38.cdna.all.fa, but this may well apply for all the tools out there using transcriptome fasta files.
After using the above cDNA Fasta file with Kallisto, I have realised that it includes transcripts encoded in haplotypes (e.g. CHR_HSCHR8_4_CTG7), and that those transcripts have distinct gene identifiers.
The presence of haplotypes clearly seems to confuse Kallisto for genes that have transcripts encoded in those haplotype regions. I illustrate my point with the TNF gene, that has one transcript on chr 6, while seven haplotypes exist that contain the TNF gene sequence:
GENENAME GENEID SEQNAME GENESEQSTART GENESEQEND
1 TNF ENSG00000204490 CHR_HSCHR6_MHC_COX_CTG1 31562973 31565742
2 TNF ENSG00000206439 CHR_HSCHR6_MHC_QBL_CTG1 31565793 31568564
3 TNF ENSG00000223952 CHR_HSCHR6_MHC_MCF_CTG1 31651872 31654641
4 TNF ENSG00000228321 CHR_HSCHR6_MHC_MANN_CTG1 31615015 31617784
5 TNF ENSG00000228849 CHR_HSCHR6_MHC_DBB_CTG1 31557707 31560476
6 TNF ENSG00000228978 CHR_HSCHR6_MHC_APD_CTG1 31643520 31645322
7 TNF ENSG00000230108 CHR_HSCHR6_MHC_SSTO_CTG1 31566312 31569081
8 TNF ENSG00000232810 6 31575567 31578336
Now, I show below Kallisto-estimated counts for 4 single cells (a subset for example here) from a single human donor (max. two haplotypes expected) using Kallisto.
It reveals identical gene expression levels for seven of those gene identifiers (which therefore most likely share identical pseudo-mappable sequence). Notably, when the remaining haplotype is detected, lower expression is estimated in the other haplotypes (I just checked and validated this observation in many other samples not shown here).
201202 201204 201206 201207
TNF_ENSG00000204490 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000206439 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000223952 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228321 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228849 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000228978 0.00000 0.000 69.053237 0.00000
TNF_ENSG00000230108 93.46909 2305.451 2.403247 91.27554
TNF_ENSG00000232810 93.46909 2305.451 2.403247 91.27554
Now, my main worry is that the presence of TNF in multiple haplotypes forces Kallisto to assign 1/7th of the "true" expression level to each of the gene identifiers, minus the expression level assigned to the last haplotype.
- First there is the question of underestimating the expression level of the "true" gene, considering that the TPM for TNF are split into seven haplotypes.
- Then there is the issue of testing multiple times the same hypothesis, considering that seven of the haplotypes seem to carry the same gene information here.
- I also wonder about gene-set enrichment analyses, whether all those haplotypes are annotated to Gene Ontology categories for example, and how they may artificially enrich the categories by giving TNF an implicit weight of 7.
Out of curiosity, I have plotted the colSum of the log10-transformed Kallisto-estimated counts above for all samples in the data set. In other words, I manually collapsed all haplotypes counts for TNF in a single value. There are two infection conditions (D and L), and one uninfected (Mock) condition. I also highlighted the samples were the "lone haplotype" was detected:
.
And to come back to my initial question:
- Should a cdna Fasta file be made available, that contains only transcripts present in the "primary assembly" ? In this case, I I suppose that "true" expression would be assigned to the "main" gene identifier, at the expense of losing haplotype-specific expression, as illustrated by the lone TNF haplotype above.
- Given that after isoform quantitation is usually collapsed to gene-level after quantitation, should another level of collapsing consider haplotypes? If so, how? Although they share gene symbol, they have distinct gene identifiers. However, one cannot simply use the gene symbol as gene feature identifiers, as some gene symbols are annotated to true paralogs.
I am aware of the gffread command available here that could be used to generate a subsetted cDNA Fasta file based on a GFF file, which itself could be restricted to exclude haplotypes and patches. But if we'd be going that way, rather than leaving the filtering open to interpretation by users, I thought that having a reference file provided by Ensembl would be preferable.
Thanks in advance for feedback.
Kévin
Comment