Hi all,
I've been working on this for a few days and don't seem to be getting anywhere.
I have a list of gene symbols (example list below) that I need to use to retrieve promoter sequences for a promoter analysis. Basically, I want to use each gene symbol to identify its promoter (I expect some genes will have several) and then use that location to retrieve 1000 nucleotides upstream and maybe 200-500 nt downstream of the promoter.
The two main strategies I have tried were:
1. Download extracted promoter sequences from the UCSC download site. Convert gene symbols to RefSeq IDs. Match my gene list to promoters in the pre-compiled FASTA of promoter sequences.
PROBLEM: At least some of my RefSeq IDs don't seem to be found in this pre-compiled promoter sequence dataset.
2. Use my GTF annotation file to select promoter coordinates from my gene symbols.
PROBLEM: My UCSC GTF files don't appear to contain 5' UTR or whole-transcript intervals (only exon and intron intervals). My annotation file does have the RefSeq NM_00xxxxx ID though, so I could retrieve those, but where do I find transcript intervals from that? And I only want the primary promoter for each transcript.
If it is helpful, I can program in Python - I just need specific help with the direction.
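To make strategy 2 concrete, here is a minimal sketch of how transcript intervals could be derived from exon-only GTF records and turned into promoter windows. It assumes the attribute column carries a `transcript_id "..."` entry, treats the 5' end of the outermost exon as the transcription start site, and uses 1000 nt upstream and 200 nt downstream as illustrative defaults. The function names `transcript_spans` and `promoter_window` are hypothetical, not part of any library.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "NM_..."
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_spans(gtf_lines):
    """Collapse exon records into one (start, end) span per transcript.

    Keys are (transcript_id, chrom, strand) so that a transcript placed on
    more than one chromosome or strand is not merged into a single span.
    Coordinates stay 1-based and inclusive, as in GTF.
    """
    spans = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx = attrs.get("transcript_id")
        if tx is None:
            continue
        key = (tx, fields[0], fields[6])
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


def promoter_window(strand, start, end, upstream=1000, downstream=200):
    """Return a 0-based, half-open (begin, stop) window around the TSS.

    The TSS is taken as the transcript start on "+" and the transcript end
    on "-" (an assumption: the annotated 5' end is the promoter of interest).
    The downstream count includes the TSS base itself.
    """
    if strand == "+":
        tss = start
        begin, stop = tss - 1 - upstream, tss - 1 + downstream
    else:
        tss = end
        begin, stop = tss - downstream, tss + upstream
    return max(begin, 0), stop  # clamp at the chromosome start
```

The resulting windows are in BED-style coordinates, so they could be written out as chrom, begin, stop, name, score, strand and passed to a tool that extracts sequence from a genome FASTA with strand awareness (for example `bedtools getfasta` with `-s`). Mapping gene symbols to transcript IDs and choosing one primary transcript per gene are separate steps not covered here.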
Thanks for the help guys. I really appreciate it.
Paul
Apologies if this is a repost - I've seen MANY similar posts, but nothing that I've found particularly helpful (that didn't lead to a dead end).
Example list of gene symbols for which I need promoter sequences:
Snrpd2
Snrpe
Snrpg
Snrpn
Snx11
Socs1
Sod1
Sox11
Sox12
Sox4
Sphk1
Spin2c