I'm trying to find out which gene_biotype values occur in the human GTF file from Ensembl.
The head of the file looks like this:
Code:
1 havana gene 19072110 19075511 . - . gene_id "ENSG00000272084"; gene_version "1"; gene_name "RP5-1126H10.2"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000185684"; havana_gene_version "1";
1 havana transcript 19072110 19075511 . - . gene_id "ENSG00000272084"; gene_version "1"; transcript_id "ENST00000606379"; transcript_version "1"; gene_name "RP5-1126H10.2"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000185684"; havana_gene_version "1"; transcript_name "RP5-1126H10.2-001"; transcript_source "havana"; transcript_biotype "3prime_overlapping_ncRNA"; havana_transcript "OTTHUMT00000470990"; havana_transcript_version "1"; tag "basic"; transcript_support_level "NA";
1 havana exon 19072110 19075511 . - . gene_id "ENSG00000272084"; gene_version "1"; transcript_id "ENST00000606379"; transcript_version "1"; exon_number "1"; gene_name "RP5-1126H10.2"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000185684"; havana_gene_version "1"; transcript_name "RP5-1126H10.2-001"; transcript_source "havana"; transcript_biotype "3prime_overlapping_ncRNA"; havana_transcript "OTTHUMT00000470990"; havana_transcript_version "1"; exon_id "ENSE00003701142"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1 havana gene 201464383 201465146 . - . gene_id "ENSG00000224818"; gene_version "1"; gene_name "RP11-134G8.10"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000189503"; havana_gene_version "1";
1 havana transcript 201464383 201465146 . - . gene_id "ENSG00000224818"; gene_version "1"; transcript_id "ENST00000430471"; transcript_version "1"; gene_name "RP11-134G8.10"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000189503"; havana_gene_version "1"; transcript_name "RP11-134G8.10-001"; transcript_source "havana"; transcript_biotype "3prime_overlapping_ncRNA"; havana_transcript "OTTHUMT00000479807"; havana_transcript_version "1"; tag "basic"; transcript_support_level "5";
1 havana exon 201465061 201465146 . - . gene_id "ENSG00000224818"; gene_version "1"; transcript_id "ENST00000430471"; transcript_version "1"; exon_number "1"; gene_name "RP11-134G8.10"; gene_source "havana"; gene_biotype "3prime_overlapping_ncRNA"; havana_gene "OTTHUMG00000189503"; havana_gene_version "1"; transcript_name "RP11-134G8.10-001"; transcript_source "havana"; transcript_biotype "3prime_overlapping_ncRNA"; havana_transcript "OTTHUMT00000479807"; havana_transcript_version "1"; exon_id "ENSE00001739739"; exon_version "1"; tag "basic"; transcript_support_level "5";
I tried:
Code:
cat Homo_sapiens.GRCh38.85.gtf | cut -f 9 | awk -F ';' '{print $7}' | sort | uniq
...but since gene_biotype is not always the seventh attribute ($7), I also get entries like:
gene_source "ensembl"
gene_source "ensembl_havana"
gene_source "havana"
...which makes me suspect that I might be missing some biotypes.
Can somebody please help me improve the code so that it retrieves the full list of gene_biotype values?
Thank you in advance.
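For reference, here is a minimal sketch of one possible direction: match the gene_biotype key by name instead of relying on its position among the attributes, so records with extra fields (transcript_id, exon_number, and so on) cannot shift it out of view. The function name list_gene_biotypes and the three sample lines are made up for illustration and are not part of the real file; the sketch also assumes the attribute syntax shown in the excerpt above (key, space, double-quoted value) and a grep that supports -o.

```shell
# Hypothetical helper (name is an assumption, not from the post):
# print the distinct gene_biotype values found in the given file(s) or stdin.
list_gene_biotypes() {
    # grep -o prints only the matched text, e.g.  gene_biotype "lincRNA"
    # cut then keeps what sits between the first pair of double quotes.
    grep -o 'gene_biotype "[^"]*"' "$@" | cut -d '"' -f 2 | sort -u
}

# Made-up attribute strings in which gene_biotype sits at different positions.
sample='gene_id "G1"; gene_version "1"; gene_name "A"; gene_source "havana"; gene_biotype "lincRNA";
gene_id "G2"; gene_version "1"; gene_source "ensembl"; gene_biotype "protein_coding";
gene_id "G3"; gene_version "2"; transcript_id "T3"; gene_name "B"; gene_source "havana"; gene_biotype "lincRNA"; transcript_biotype "retained_intron";'

# Prints each distinct biotype once, regardless of attribute position.
printf '%s\n' "$sample" | list_gene_biotypes
```

On the real file this would presumably be called as list_gene_biotypes Homo_sapiens.GRCh38.85.gtf. The pattern does not match transcript_biotype, because that key does not contain the string gene_biotype.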