Hey folks,
I am working with RNA-seq data from HNSC (head and neck squamous cell carcinoma), downloaded from TCGA. I'm analysing so-called short open reading frames (sORFs) that are not part of the annotated human transcriptome.
I start by mapping the reads to the known transcript reference and discarding every read that maps successfully. I then map the remaining unmapped reads to the genome, but from there I get stuck: the reads map all over the genome and I don't know how to proceed. If possible, I would rather map the reads only to specific regions, namely the short ORFs.
So the question is: do you know where I can get data that identifies those regions, such as a BED file?
In the literature I have seen researchers perform proteogenomics, mapping MS spectra to a custom database that contains only those regions and not the annotated transcriptome, but they don't provide the data.
Also, I am not only interested in regions (sORFs) that have been confirmed to be protein coding, but in any region with the theoretical potential (e.g. a minimum length of 6 bases, i.e. a start codon followed directly by a stop codon, with alternative start codons allowed, etc.).
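To make that "theoretical potential" definition concrete, here is a minimal Python sketch of what such a scan could look like: it walks all six reading frames of a sequence and reports every start-to-stop stretch as a BED6 record. Only the 6-base minimum comes from the description above. The set of alternative start codons (CTG, GTG, TTG), the 300-base upper limit, the function name `find_sorfs` and the chromosome label `chr_demo` are assumptions made purely for illustration, not an established standard.

```python
# Hypothetical sketch: enumerate theoretical sORFs in a DNA sequence.
# Assumptions (not from the original post): the alternative start codons,
# the max_len cutoff, and all names used here.
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}  # ATG plus assumed alternative starts
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, chrom="chr_demo", min_len=6, max_len=300):
    """Return BED6 tuples (chrom, start, end, name, score, strand).

    Coordinates are 0-based, half-open, on the forward strand.
    Lengths are in bases and include both the start and the stop codon.
    Every in-frame start codon upstream of a stop is reported, so nested
    ORFs that share a stop codon appear as separate records.
    """
    seq = seq.upper()
    n = len(seq)
    records = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            open_starts = []  # start codons seen since the last in-frame stop
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if codon in START_CODONS:
                    open_starts.append(i)
                elif codon in STOP_CODONS:
                    end = i + 3
                    for st in open_starts:
                        if min_len <= end - st <= max_len:
                            if strand == "+":
                                g_start, g_end = st, end
                            else:
                                # convert back to forward-strand coordinates
                                g_start, g_end = n - end, n - st
                            name = f"sORF_{strand}{frame}_{st}"
                            records.append((chrom, g_start, g_end, name, 0, strand))
                    open_starts = []
    return sorted(records, key=lambda r: (r[1], r[2], r[5]))


if __name__ == "__main__":
    # Print BED lines for a toy sequence.
    for rec in find_sorfs("CCATGAAATAGCC"):
        print("\t".join(map(str, rec)))
```

Run on a real genome this produces a very large number of candidates, so in practice the output would probably need to be restricted to regions of interest or filtered by length before being used as a mapping target.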
Thanks in advance!