Seqanswers Leaderboard Ad

**steven** · 04-30-2010, 11:22 AM

First quick and dirty try:
awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file
I guess a much better solution will come soon..

**Simon Anders** · 05-01-2010, 02:49 AM

Hi Siva

Originally posted by Siva View Post

How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

This is a typical use case for my HTSeq Python framework.

The following script does the job. Save it and run it with Python (after having installed the newest version, 0.4.3, of HTSeq):

Code:

import HTSeq

gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )

transcripts= {}

for feature in gff_file:
   if feature.type == "exon":
      transcript_id = feature.attr['Parent']
      if transcript_id not in transcripts:
         transcripts[ transcript_id ] = list()
      transcripts[ transcript_id ].append( feature )
      
for transcript_id in sorted( transcripts ):      
   transcript_length = 0
   for exon in transcripts[ transcript_id ]:
      transcript_length += exon.iv.length + 1
   print transcript_id, transcript_length, len( transcripts[ transcript_id ] )

After a minute or two, it outputs a list of transcripts, each line giving the transcript ID, the length (summed over all its exons), and the number of exons.

In order to advertise HTSeq a bit and demonstrate that it is easy to use I add here a commented version of the same script. Even if you've obtained only a minimal knowledge of Python (e.g., by reading only the first few chapters of Mark Lutz's very good book "Learning Python"), you will find that it iss easy to write such stuff yourself.

Code:

import HTSeq

# A GFF_Reader parses a GFF file and presents its content as
# a sequence of 'GenomicFeature' objects.
gff_file = HTSeq.GFF_Reader( "ZmB73_4a.53_WGS.gff.gz", end_included=True )
# In this GFF file, the 'end' coordinate is part of the feature,
# hence 'end_included=True'.

# The following dict will store, for each transcript, a list
# of its exons
transcripts= {}

# Go through the GFF file and look for all exons:
for feature in gff_file:
   # 'feature' is an HTSeq.GenomicFeature object, which describes a
   # line in the GFF file. It has several slots, for example, 'iv'
   # for the interval (where it is on the genome), 'type' for its
   # type (3rd column in the GFF file) and 'attr' for the
   # attributes (last column)
   # We are only interested in lines describing exons:
   if feature.type == "exon":
      # In this GFF file, the ID of the transcript to which the exons
      # belongs is stored in an attribute (i.e., something in the 9th
      # column) called "Parent"
      transcript_id = feature.attr['Parent']
      # Is this bthe first time we see this transcript?
      if transcript_id not in transcripts:
         # If so, add to the 'transcripts' dict an empty list 
         # store store the exons in
         transcripts[ transcript_id ] = list()
      # Add the exon to the list
      transcripts[ transcript_id ].append( feature )
      
# Now, go through the transcripts and calculate the length of each
# by adding up the lengths of its exons
for transcript_id in sorted( transcripts ):      
   transcript_length = 0
   # Go through the GenomicFeature
   for exon in transcripts[ transcript_id ]:
      # Each GenomicFeature object has a slot called 'iv', which is a
      # GenomicInterval object, which in turn contains slots such as
      # 'chrom', 'start', 'end', 'strand', 'length', etc., to describe
      # where the feature sits in the genome. We use 'length' here.
      transcript_length += exon.iv.length + 1
      # We add the 1 because this GFF file does no
   # Now, 'transcript_length' is the sum of the lengths of all the
   # transcripts exon.
   # Print this result, and also add the number of exons as (which is just the 
   # length of the list of exons for the transcript):
   print transcript_id, transcript_length, len( transcripts[ transcript_id ] )

All the classes mentioned here are, of course, explained in detail in HTSeq's reference documentation on the web site.

Simon

**JohnK** · 05-02-2010, 10:06 AM

Originally posted by Siva View Post

Hello
I have the maize reference annotation downloaded from http://ftp.maizesequence.org/current/working-set/

the first 10 lines are as under:

Code:

UNKNOWN	ensembl	chromosome	1	14680007	.	.	.	ID=UNKNOWN;Name=chromosome:AGPv1:UNKNOWN:1:14680007:1
UNKNOWN	ensembl	CDS	17354	17446	.	+	0	Parent=GRMZM2G524142_T01;Name=CDS.1254885
UNKNOWN	ensembl	exon	17354	17460	.	+	.	Parent=GRMZM2G524142_T01;Name=GRMZM2G524142_E01
UNKNOWN	ensembl	gene	17354	17460	.	+	.	ID=GRMZM2G524142;Name=GRMZM2G524142;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17354	17460	.	+	.	ID=GRMZM2G524142_T01;Parent=GRMZM2G524142;Name=GRMZM2G524142_T01;biotype=protein_coding
UNKNOWN	ensembl	exon	17537	17976	.	-	.	Parent=GRMZM2G369735_T01;Name=GRMZM2G369735_E03
UNKNOWN	ensembl	gene	17537	18259	.	-	.	ID=GRMZM2G369735;Name=GRMZM2G369735;biotype=protein_coding
UNKNOWN	ensembl	mRNA	17537	18259	.	-	.	ID=GRMZM2G369735_T01;Parent=GRMZM2G369735;Name=GRMZM2G369735_T01;biotype=protein_coding
UNKNOWN	ensembl	CDS	17716	17952	.	-	.	Parent=GRMZM2G369735_T01;Name=CDS.1254890
UNKNOWN	ensembl	exon	17972	18045	.	-	.	Parent=GRMZM2G369735_T02;Name=GRMZM2G369735_E02

How can I output transcript/feature length from the given genome coordinates? Is there a simple perl script out there? Or is there some script available in ENSembL API? Any help or suggestions appreciated?

thanks
Siva

If I misunderstood your question, I apologize. Here's a perl-way:

< <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; print $_, "\t", $length, "\n"; }' > <out file name> &

This will add the length on the end of each line and print to <out file name>. Use & to run in the background so you can carry on with life.

, And if you just want 'mRNA' "transcripts":

< <input file> perl -e 'while(<>){ @splitted=split('\t', $_); $length= $splitted[4]-$splitted[3] + 1; $_=~s/\n//g; if($_=~/^.*mRNA.*\n/){print $_, "\t", $length, "\n";} }' > <out file name> &

Only prints on lines with mRNA feature name included. Oh, and just so you can manipulate the above cml script to handle any various needs in the future, here's an explanation so you can expand:

^:= "from the beginning"
.*:= "zero or more of anything
\n:="newline char"
$_:= "perl special variable that refers to the current line of a file being read"
<>:="anonymous filehandle reading from stdin
\t:="tab char"
split(/pattern/, object or var):= "perl function that splits"
$_=~s/\n//g; #this code statement says to substitute any newline characters within $_ (the current line) globally (g), and sub them for nothing (notice empty /\n// forward slashes).

**Simon Anders** · 05-02-2010, 10:12 AM

Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.

**Siva** · 05-02-2010, 02:47 PM

Originally posted by Simon Anders View Post

Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.

Thanks John and Steven for your suggestions and explanations. Simon, I posted my query in a hurry. However, you are right I will need transcript lengths summed over exons. Thanks for sharing your script. I will post my feedback after running your script soon, I have updated HTSeq to the latest version. I will see if I can tweak John's command script. btw I have also just installed Bioperl 1.6.1 so I can use some its features too.

more later
many thanks
Siva

**Siva** · 05-02-2010, 10:25 PM

Hi Simon:
I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

Code:

"../simon_anders_gff.py", line 3, in <module>
    gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
TypeError: __init__() got an unexpected keyword argument 'end_included'

Siva

**steven** · 05-03-2010, 01:21 AM

Originally posted by Simon Anders View Post

Ok, if it's only about getting the lengths of the features, no need for more sophistication than the one-liners of #2 or #4. I thought Siva wants to get the transcript lengths, and that requires summing up the exon lengths.

Well, that's precisely what my one-liner (#2) is actually supposed to do

awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file

For each line with a third column "mRNA" and a 9th column starting with "ID", it extracts the ID (more precisely, what is between the "ID=" and the first "_T" in the column #9 -I assume this format is conserved throughout all the file). It adds to the total length of this ID (initially 0) the length of the corresponding exon. "L" stores the total length of the IDs. Each time an exon is found with a known ID, the total length is updated.
Once the whole file has been read ("END" loop), all IDs are printed with their corresponding length. The output should be a text file with two columns: one with the mRNA IDs, the other one with their sizes.

Please let me know if there is something wrong with this line and/or if something else than the mRNA lengths is required (in which case, a more elegant and clean package like HTSeq is probably the best option indeed).

**Siva** · 05-05-2010, 08:57 AM

Originally posted by steven View Post

Well, that's precisely what my one-liner (#2) is actually supposed to do

awk '$3=="mRNA" && $9~/^ID/{sub(/_T.*/,"",$9);L[substr($9,4)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' your_input_file

Steven your awk one liner works great. For example, I have 21 lines in the truncated GFF file depicting only a single gene (ID) with two transcripts on 'chr1'.

Code:

 Input file: ZmB73_4a.53_WGS_chr1_21.gff
chr1    ensembl chromosome      1       300239041       .       .       .       ID=1;Name=chromosome:AGPv1:1:1:300239041:1
chr1    ensembl exon    3       104     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E07
chr1    ensembl gene    3       3807    .       +       .       ID=GRMZM2G060082;Name=GRMZM2G060082;biotype=protein_coding
chr1    ensembl mRNA    3       3807    .       +       .       ID=GRMZM2G060082_T01;Parent=GRMZM2G060082;Name=GRMZM2G060082_T01;biotype=protein_coding
chr1    ensembl intron  105     199     .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl exon    200     313     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E06
chr1    ensembl CDS     230     313     .       +       .       Parent=GRMZM2G060082_T01;Name=CDS.98360
chr1    ensembl intron  314     421     .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     422     604     .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98361
chr1    ensembl exon    422     604     .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E05
chr1    ensembl intron  605     1015    .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     1016    1233    .       +       0       Parent=GRMZM2G060082_T01;Name=CDS.98362
chr1    ensembl exon    1016    1233    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E04
chr1    ensembl intron  1234    1620    .       +       .       Parent=GRMZM2G060082_T01
chr1    ensembl CDS     1621    1966    .       +       2       Parent=GRMZM2G060082_T01;Name=CDS.98363
chr1    ensembl exon    1621    3807    .       +       .       Parent=GRMZM2G060082_T01;Name=GRMZM2G060082_E01
chr1    ensembl exon    2607    3218    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E03
chr1    ensembl mRNA    2607    3754    .       +       .       ID=GRMZM2G060082_T02;Parent=GRMZM2G060082;Name=GRMZM2G060082_T02;biotype=protein_coding
chr1    ensembl CDS     2765    2947    .       +       .       Parent=GRMZM2G060082_T02;Name=CDS.98367
chr1    ensembl intron  3219    3300    .       +       .       Parent=GRMZM2G060082_T02
chr1    ensembl exon    3301    3754    .       +       .       Parent=GRMZM2G060082_T02;Name=GRMZM2G060082_E02

A friend and I made minor modifications to your one-liner to give transcript lengths summed over exons like this:

Code:

command:
awk '$3=="exon" && $9~/^Parent/{sub(/;.*/,"",$9);L[substr($9,8)]+=$5-$4+1}END{for(i in L){print i,L[i]}}' ZmB73_4a.53_WGS_chr1_21.gff

This is the output:

Code:

GRMZM2G060082_T01 2804
GRMZM2G060082_T02 1066

Awk is truly powerful!! Can you modify it further to add a column that gives the number of exons per transcript?

thanks
Siva

**Siva** · 05-05-2010, 09:13 AM

Originally posted by Siva View Post

Hi Simon:
I keep getting an unexpected keyword argument error while trying to compile your script. I have the latest HTSeq 0.4.3 installed. Here is the error:

Code:

"../simon_anders_gff.py", line 3, in <module>
    gff_file = HTSeq.GFF_Reader("ZmB73_4a.53_WGS.gff",end_included=True)
TypeError: __init__() got an unexpected keyword argument 'end_included'

Siva

Simon:
Do I need to install another package for your script to calculate length to work?
I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.

Siva

**Simon Anders** · 05-05-2010, 12:04 PM

Originally posted by Siva View Post

Simon:
Do I need to install another package for your script to calculate length to work?
I built HTSeq from source and installed it like a binary as per your instructions. htseq-count works fine.

No, you need to upgrade to the newest version of HTSeq.

The reason is an annoying lack of clarity in the GFF specs, which I noticed when I wrote my post above. If, in a GFF file, a feature is said to range from, say, 1000 to 1015, is the base at position 1015 part of the feature? In other words: Is the length of the feature 1015-1000=15, or is it 1015-1000+1=16?

If you look at GTF files from Ensembl, you will notice that the end is not included. You can see that from looking at CDS entries, because their length has to be a mutiple of three. Your GFF file however, by the same criterion, includes the ende.

So, who is right? The original GFF specification is silent on the issue. Hence, I added a new option to HTSeq.GFF_Reader to allow you to specify which convention should be used.

Simon

**Siva** · 05-05-2010, 02:54 PM

simon:
I see!! Although I installed the latest version of HTSeq over the old version. Every time I type >>import HTSeq.....the old version gets imported. I will see if I can upgrade it.

Siva

**Siva** · 05-05-2010, 08:27 PM

deleted this message as I posted a new one...!!

**Siva** · 05-05-2010, 08:50 PM

Hi Simon
I am not sure what is happening. I built and installed your May 4 release HTSeq 0.4.3-p1, but still I get the same error "unexpected keyword end_included" while calling HTSeq.GFF_Reader. I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased. That does not seem to be the case here. Should I upgrade PyPI on the lines of CPAN::upgrade? Is there something else I am overlooking?

Siva
ps: this is the error message

Code:

>>> gff_file = HTSeq.GFF_Reader ("ZmB73_4a.53_WGS.gff.gz", end_included=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'end_included'

**Simon Anders** · 05-05-2010, 11:41 PM

Originally posted by Siva View Post

I thought that when installing a new version of a Python package the older version is automatically upgraded and the old dependencies etc., are erased.

It seems that it's still there. Try the following. Start a Python session and type:

Code:

import HTSeq
print HTSeq.__file__

This will print out where Python found HTSeq and will indicate to you where the directory with the outdated version of HTSeq is hanging out on your hard disk. Just delete it and try again.

This here, by the way, gives you a list of all directories oin which Python searches (in the given order) for packages when doing an 'import':

Code:

import sys
print sys.path

Hope that helps
Simon

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 25 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 159 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Getting transcript lengths from GFF file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News