**maubp** · 09-22-2013, 03:54 AM

You should be able to convert the UniProt XML to FASTA using Biopython:

Code:

from Bio import SeqIO
count = SeqIO.convert("uniref90.xml", "uniprot-xml", "converted.fasta", "fasta")
print("Converted %i records" % count)

**emanlee** · 09-22-2013, 05:43 AM

Thank you for your quick reply. I'll try it out.

**emanlee** · 09-22-2013, 04:38 PM

Code:

>>> from Bio import SeqIO
>>> count = SeqIO.convert("uniref90.xml", "uniprot-xml", "uniref90converted.fasta", "fasta")
>>> print("Converted %i records" % count)

Converted 0 records

We checked the start of uniref90.xml (using more uniref90.xml):

Code:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<UniRef90 xmlns="http://uniprot.org/uniref" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://uniprot.org/uniref http://www.uniprot.org/support/docs/uniref.xsd" 
 releaseDate="2007-03-06" version="10.0"> 
<entry id="UniRef90_Q3ASY8" updated="2007-03-06">
<name>Cluster: Parallel beta-helix repeat</name>
<property type="member count" value="1"/>
<property type="common taxon" value="Chlorobium chlorochromatii CaD3"/>
<property type="common taxon ID" value="340177"/>
<representativeMember>
<dbReference type="UniProtKB ID" id="Q3ASY8_CHLCH">
<property type="UniProtKB accession" value="Q3ASY8"/>
<property type="UniParc ID" value="UPI00005D5563"/>
<property type="UniRef100 ID" value="UniRef100_Q3ASY8"/>
<property type="UniRef50 ID" value="UniRef50_Q3ASY8"/>
<property type="protein name" value="Parallel beta-helix repeat"/>
<property type="source organism" value="Chlorobium chlorochromatii (strain CaD3)"/>
<property type="NCBI taxonomy" value="340177"/>
<property type="length" value="36805"/>
<property type="isSeed" value="true"/>
</dbReference>
<sequence length="36805" checksum="A7A8EA21B9345FF9">
MKPRFYIEQLEPRILLSGDILSELVPLLSSREASQMQSDYLLEHPEARRVAPLSAVEAAR
....

Could you help us? Thanks.

**kmcarr** · 09-23-2013, 11:18 AM

Originally posted by emanlee:

Could you help us? Thanks.

Wouldn't it just be much easier to download the UniRef90 FASTA file directly?

**maubp** · 09-26-2013, 12:50 PM

Originally posted by kmcarr:

Wouldn't it just be much easier to download the UniRef90 FASTA file directly?

Indeed, I should have double-checked that it really didn't exist.

As to the Biopython conversion failing, that is probably a bug - I'd have replied earlier but missed the thread reply alert - sorry.

**GenoMax** · 09-26-2013, 05:07 PM

The file linked by kmcarr is not the "version 10.0" release that emanlee was asking about. Perhaps that is not important.

**maubp** · 09-27-2013, 07:44 AM

OK then... first this is how I just extracted the uniref90.xml file from the FTP site (multiple levels of bundling!):

Code:

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release10.0/uniref/uniref10.0.tar.gz
...
$ tar -zxvf uniref10.0.tar.gz 
uniref100.tar
uniref50.tar
$ tar -xvf uniref90.tar 
README
uniref90.dtd
uniref90.xml.gz
$ gunzip uniref90.xml.gz

And here is what the start of the file looks like for me too (same as emanlee reported):

Code:

$ head -n 25 uniref90.xml 
<?xml version="1.0" encoding="ISO-8859-1" ?>
<UniRef90 xmlns="http://uniprot.org/uniref" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://uniprot.org/uniref http://www.uniprot.org/support/docs/uniref.xsd" 
 releaseDate="2007-03-06" version="10.0"> 
<entry id="UniRef90_Q3ASY8" updated="2007-03-06">
<name>Cluster: Parallel beta-helix repeat</name>
<property type="member count" value="1"/>
<property type="common taxon" value="Chlorobium chlorochromatii CaD3"/>
<property type="common taxon ID" value="340177"/>
<representativeMember>
<dbReference type="UniProtKB ID" id="Q3ASY8_CHLCH">
<property type="UniProtKB accession" value="Q3ASY8"/>
<property type="UniParc ID" value="UPI00005D5563"/>
<property type="UniRef100 ID" value="UniRef100_Q3ASY8"/>
<property type="UniRef50 ID" value="UniRef50_Q3ASY8"/>
<property type="protein name" value="Parallel beta-helix repeat"/>
<property type="source organism" value="Chlorobium chlorochromatii (strain CaD3)"/>
<property type="NCBI taxonomy" value="340177"/>
<property type="length" value="36805"/>
<property type="isSeed" value="true"/>
</dbReference>
<sequence length="36805" checksum="A7A8EA21B9345FF9">
MKPRFYIEQLEPRILLSGDILSELVPLLSSREASQMQSDYLLEHPEARRVAPLSAVEAAR
ACMVVVQSEAPSLLTEDGLMYPFEVGVGEERSSEANAEPTLAADFSADYTFSKSEWDALE

And here's how many records there seem to be according to grep:

Code:

$ grep -c "^<entry id" uniref90.xml 
2781437

Biopython 1.61 and 1.62 do appear to have a problem parsing this - I suspect the XML is different in some way to what we expect.
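One way to test the suspicion that the XML differs from what the parser expects is to look at the namespace of the root element. The sketch below uses only the Python standard library; the helper name root_tag and the inline sample are illustrative, not part of Biopython. It rests on an assumption worth checking against the Biopython source: that the "uniprot-xml" parser only looks for entry elements in the http://uniprot.org/uniprot namespace, which would explain why a file whose header declares http://uniprot.org/uniref yields zero records.

```python
import io
import xml.etree.ElementTree as ET


def root_tag(handle):
    """Return the namespace-qualified tag of the XML root element."""
    # With "start" events the first element reported is the root.
    for _event, elem in ET.iterparse(handle, events=("start",)):
        return elem.tag


# Toy sample mimicking the header of uniref90.xml quoted in this thread.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<UniRef90 xmlns="http://uniprot.org/uniref" version="10.0">\n'
    b'<entry id="UniRef90_Q3ASY8" updated="2007-03-06"/>\n'
    b'</UniRef90>\n'
)
print(root_tag(sample))
```

On the real file, the same helper could be called with open("uniref90.xml", "rb"). If the printed namespace is the UniRef one rather than the UniProt one, the file is cluster-level UniRef XML, not the UniProtKB entry format.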

Update: Raised here: http://lists.open-bio.org/pipermail/...er/010909.html
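Until the parser issue is resolved, a possible workaround (other than downloading the FASTA file directly) is to stream the UniRef XML with the standard library and write the representative sequence of each entry as FASTA. This is only a sketch, not Biopython's method: the function name uniref_xml_to_fasta is hypothetical, the element layout (entry, name, representativeMember, sequence) is taken from the file excerpt shown above, and the header line format is an arbitrary choice that does not try to match the official UniRef FASTA headers.

```python
import io
import xml.etree.ElementTree as ET

# Namespace declared in the uniref90.xml header quoted in this thread.
UNIREF_NS = "{http://uniprot.org/uniref}"


def uniref_xml_to_fasta(xml_handle, fasta_handle, width=60):
    """Stream UniRef entries, writing each representative sequence as FASTA."""
    count = 0
    context = ET.iterparse(xml_handle, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != UNIREF_NS + "entry":
            continue
        name = elem.findtext(UNIREF_NS + "name", default="")
        seq_elem = elem.find(f"{UNIREF_NS}representativeMember/{UNIREF_NS}sequence")
        if seq_elem is not None and seq_elem.text:
            sequence = "".join(seq_elem.text.split())  # drop line breaks
            fasta_handle.write(f">{elem.get('id')} {name}\n")
            for start in range(0, len(sequence), width):
                fasta_handle.write(sequence[start:start + width] + "\n")
            count += 1
        root.clear()  # discard finished entries so memory use stays bounded
    return count


# Toy input: one entry with a sequence truncated to its first 20 residues.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<UniRef90 xmlns="http://uniprot.org/uniref" version="10.0">\n'
    b'<entry id="UniRef90_Q3ASY8" updated="2007-03-06">\n'
    b'<name>Cluster: Parallel beta-helix repeat</name>\n'
    b'<representativeMember>\n'
    b'<sequence length="20">\nMKPRFYIEQL\nEPRILLSGDI\n</sequence>\n'
    b'</representativeMember>\n'
    b'</entry>\n'
    b'</UniRef90>\n'
)
out = io.StringIO()
count = uniref_xml_to_fasta(sample, out)
print("Converted %i records" % count)
print(out.getvalue(), end="")
```

For the real file, the input would be opened with open("uniref90.xml", "rb") and the output with open("converted.fasta", "w"). The resulting FASTA file could then be formatted as a protein BLAST database, for example with makeblastdb -dbtype prot.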

format uniref90.xml to database for BLAST
