
**mikesh** · 11-15-2013, 01:33 PM

Originally posted by Fernas

Dear all,

I have a list of gene names; for example, the first five are:
ABCB1
ABCG2
ACHE
ACVR2A
ACVR2B

I want to get the genomic coordinates of these genes on the human genome (hg19). I want one record per gene, so I do not want information about exons, UTRs, etc. For example, I want the output to look as follows (or to be in any format, e.g. BED, GTF, GFF):
ABCB1 gene chr1 120434 134324 +
ABCG2 gene chr1 324312 393431 -
...etc

I tried to use UCSC to query the list, but it gave me information about all exons and UTRs, and no gene-level records.

Go to the UCSC Genome Browser, open the Table Browser, and download the RefGene track. It has genomic coordinates and a "name2" field that holds the HGNC gene symbol. Each gene can have several transcripts (NM_* identifiers), so either use all of them or keep only the longest one (often treated as the canonical transcript).
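The "keep only the longest transcript" option can be sketched in a few lines of Python. This is only an illustration, not a tool from this thread: it assumes the table was exported tab-separated with a header line containing the columns name, chrom, strand, txStart, txEnd and name2, and the function name and the sample rows are made up.

```python
import csv
import io


def longest_transcript_per_gene(table_text):
    """Return {gene_symbol: (chrom, txStart, txEnd, strand, transcript_id)}.

    Assumes a tab-separated export with a header row; only the columns
    name, chrom, strand, txStart, txEnd and name2 are read.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        start, end = int(row["txStart"]), int(row["txEnd"])
        gene = row["name2"]
        current = best.get(gene)
        # Keep the transcript with the largest genomic span seen so far.
        if current is None or (end - start) > (current[2] - current[1]):
            best[gene] = (row["chrom"], start, end, row["strand"], row["name"])
    return best


# Made-up rows, for illustration only (not real annotation).
SAMPLE = (
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tname2\n"
    "0\tNM_X1\tchr9\t+\t100\t500\tGENE_A\n"
    "0\tNM_X2\tchr9\t+\t100\t900\tGENE_A\n"
    "1\tNM_Y1\tchr1\t-\t2000\t2600\tGENE_B\n"
)

for gene, (chrom, start, end, strand, tx) in longest_transcript_per_gene(SAMPLE).items():
    print(gene, "gene", chrom, start, end, strand, tx)
```

One simplification to be aware of: a symbol that is annotated on more than one chromosome (or on an alternate haplotype sequence) is collapsed to a single record here, so you may want to key on (gene, chrom) instead.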

**Fernas** · 11-15-2013, 02:01 PM

Thanks @mikesh.
I think you mean "download RefSeq track" instead of "download RefGene track", correct?

I followed the steps you described above and got what I wanted. However, I am wondering whether there is any tool, on Galaxy or elsewhere, that picks the longest (canonical) transcript from the output, so that I have one record per gene.

many thanks!

**jameslz** · 11-15-2013, 10:08 PM

@Fernas,

For hg19, export the "RefSeq Gene track" and merge the overlapping transcripts of each gene; that will give you what you want.

For GRCh37.p13, download "ref_GRCh37.p13_top_level.gff3.gz" from NCBI FTP ( ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/ )

**GenoMax** · 11-16-2013, 05:11 AM

Originally posted by Fernas

I followed the steps you described above and got what I wanted. However, I am wondering whether there is any tool, on Galaxy or elsewhere, that picks the longest (canonical) transcript from the output, so that I have one record per gene.

many thanks!

You are going to have to do some parsing if you need only one record per gene. As people have pointed out, you can get the GFF or GTF file (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens). Then you could do the following (assuming that your file of gene names is called "genes"):

Code:

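$ grep -F -w -f genes ensembl_file.gtf > gtf_records_you_need   # ensembl_file.gtf stands for your GFF/GTF file; use a logical output file name. -F -w match whole, literal names so that e.g. ACHE does not also pull in longer names containing it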

Then you will have to parse the resulting file to get the longest entry (if that is what you need) in the exact format you need.
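A rough sketch of that parsing step in Python is given here. It is not a finished tool: it assumes a 9-column, tab-separated GTF whose last column carries a gene_name "..." attribute, and the function name and example records are invented for illustration. Gene names are compared exactly, so a short name never matches a longer one that merely contains it.

```python
import re

# Assumed attribute layout in column 9: ... gene_name "SYMBOL"; ...
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_spans(gtf_lines, wanted):
    """Collapse GTF records to one (chrom, start, end, strand) span per gene.

    `wanted` is a set of gene names; records for other genes are skipped.
    """
    spans = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = GENE_NAME.search(fields[8])
        if not match or match.group(1) not in wanted:
            continue
        gene = match.group(1)
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        if gene in spans:
            # Widen the span: smallest start and largest end seen so far.
            _, old_start, old_end, _ = spans[gene]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene] = (chrom, start, end, strand)
    return spans


# Invented records, for illustration only.
EXAMPLE = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "GENE_A";',
    '1\tprotein_coding\texon\t500\t650\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "GENE_A";',
    '2\tprotein_coding\texon\t50\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T3"; gene_name "GENE_AB";',
]

for gene, (chrom, start, end, strand) in gene_spans(EXAMPLE, {"GENE_A"}).items():
    print(gene, "gene", chrom, start, end, strand)
```

With a real file you would pass an open file handle as gtf_lines and build wanted from the "genes" file, one name per line. Note that this reports the outermost start and end across all records of a gene rather than the longest single transcript, and that Ensembl GTF files name chromosomes without the "chr" prefix.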

**Fernas** · 11-16-2013, 05:59 AM

Thanks a lot @jameslz and @GenoMax.

If I want to get one record per gene, what is the best strategy for processing the output file? Should I pick the longest entry, or should I merge overlapping entries (using the bedtools mergeBed tool)?

**jameslz** · 11-17-2013, 04:18 AM

I wrote a Perl script to parse the information from the "RefSeq Gene track".

And the result looks like:

#gene chromosome chromosome_length locus transcript_number transcripts transcript_location
FIBCD1 chr9 141213431 133777824-133814455 2 NM_032843;NM_001145106 133777824-133814239|133777824-133814455

**Fernas** · 11-17-2013, 04:28 AM

Thanks @jameslz for your reply.

Is this script available on the web?
One more question: how did you define the locus start/end positions? Is the locus start the start position of the transcript closest to the chromosome start, and the locus end the end position of the furthest transcript?

**jameslz** · 11-17-2013, 04:48 PM

@Fernas
I use the following procedure:

1. sort all transcripts of the same gene by location.
2. overlap and merge
3. use the leftmost position and rightmost position.

I can give the perl script and the final result. [email protected]
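The three steps can be illustrated with a short Python sketch. This is not the attached Perl script; the function names are made up, and it assumes each transcript is a (start, end) pair and that intervals which touch or overlap belong in the same merged block.

```python
def merge_intervals(intervals):
    """Sort (start, end) pairs by position and merge the overlapping ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous block: extend it if this one reaches further.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def locus(intervals):
    """Leftmost start and rightmost end across all merged blocks."""
    merged = merge_intervals(intervals)
    return merged[0][0], merged[-1][1]


# The two FIBCD1 transcript locations from the example output earlier in the thread.
fibcd1 = [(133777824, 133814239), (133777824, 133814455)]
print(merge_intervals(fibcd1))  # [(133777824, 133814455)]
print(locus(fibcd1))            # (133777824, 133814455)
```

If the transcripts of a gene fall into several non-overlapping blocks, locus() still returns a single range spanning all of them, which matches step 3 as described.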

**GenoMax** · 11-17-2013, 05:51 PM

Originally posted by jameslz

@Fernas
I use the following procedure:

1. sort all transcripts of the same gene by location.
2. overlap and merge
3. use the leftmost position and rightmost position.

I can give the perl script and the final result. [email protected]

If you are willing please post the script here. (Use Edit --> Go Advanced --> Then use the "paper clip" icon to attach the file to a post).

This way you would be helping others who may have a need for something similar.

**jameslz** · 11-17-2013, 07:16 PM

@GenoMax, OK!
The Perl script uses the UCSC RefSeq track (exported in the "all fields from selected table" format),

such as:

#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
0 NM_032291 chr1 + 66999824 67210768 67000041 67208778 25 66999824,67091529,67098752,67101626,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755, 67000051,67091593,67098777,67101698,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67210768, 0 SGIP1 cmpl cmpl 0,1,2,0,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,
1 NM_032785 chr1 - 48998526 50489626 48999844 50489468 14 48998526,49000561,49005313,49052675,49056504,49100164,49119008,49128823,49332862,49511255,49711441,50162984,50317067,50489434, 48999965,49000588,49005410,49052838,49056657,49100276,49119123,49128913,49332902,49511472,49711536,50163109,50317190,50489626, 0 AGBL4 cmpl cmpl 2,2,1,0,0,2,1,1,0,2,0,1,1,0,

usage: perl track_trans.pl hg19_refGene.tbl hg19_refGene_trans.tbl

Attached Files

track_trans.pl (1.9 KB, 72 views)

**gringer** · 11-17-2013, 07:24 PM

Originally posted by GenoMax

If you are willing please post the script here. (Use Edit --> Go Advanced --> Then use the "paper clip" icon to attach the file to a post).

This way you would be helping others who may have a need for something similar.

Here's my attempt [attached], which parses GTF output as linked by GenoMax. It might work with GFF files as well, but I haven't tested that (changes may be needed for the regular expression on line 112). It's a little bit over-engineered due to being derived from something else, for hackability purposes, and because I'm trying to get used to this pod documentation stuff. The usual "there will be bugs" disclaimer applies. Here's the command line syntax:

Code:

$ ./gtf2genePos.pl -help
Usage:
    ./gtf2genePos.pl <lookup GTF file>\n";

    output:
      a CSV file containing gene names, and locations

  Basic Options:
    -summarise
      Produce gene summaries, rather than individual region information

    -list *file*
      Filter gene names by including only genes from this list file

    -help
      Show this help message

    -v
      increase verbosity of output

Attached Files

gtf2genePos.pl (3.1 KB, 58 views)

**KE8** · 05-27-2015, 09:22 AM

Hey,

I am looking to get all the gene coordinates for Pseudomonas aeruginosa genes. I had a look at UCSC GB but I can't seem to use it as a reference genome.

Does anyone know of any software via PubMed that would allow me to upload an excel file containing the gene names, or paste the list of those names?

I would really appreciate any help you can provide me. I am a bit of a newbie with this bioinformatics technique.

**GenoMax** · 05-27-2015, 09:53 AM

Originally posted by KE8

Hey,

I am looking to get all the gene coordinates for Pseudomonas aeruginosa genes. I had a look at UCSC GB but I can't seem to use it as a reference genome.

Does anyone know of any software via PubMed that would allow me to upload an excel file containing the gene names, or paste the list of those names?

I would really appreciate any help you can provide me. I am a bit of a newbie with this bioinformatics technique.

Go here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Pick out the particular Pseudomonas strain you are interested in. Then go into that folder and get the ".gff" file (right click on the name and then save as, open in excel if you want). That will give you the gene coordinates (e.g. P. aeruginosa PAO1 ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_002516.gff)

**KE8** · 05-27-2015, 10:02 AM

Originally posted by GenoMax

Go here: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Pick out the particular Pseudomonas strain you are interested in. Then go into that folder and get the ".gff" file (right click on the name and then save as, open in excel if you want). That will give you the gene coordinates (e.g. P. aeruginosa PAO1 ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_002516.gff)

My genes are all listed in PA# format. Is there a way to convert these names into the NP format that PubMed uses?


get Gene Coordinates of human genes names
