**maubp** · 09-06-2009, 02:05 PM

I have done this as a one off in python for some test SFF files - would that be of interest?

**jvhaarst** · 09-06-2009, 11:48 PM

Yes, that would be very helpful.
Would it be OK if I published it on a website somewhere if it works satisfactorily?

**maubp** · 09-07-2009, 01:56 AM

Do you care about the Roche XML manifest and/or the record index? If not, this makes life simpler (and if you want the index added later on, just put the SFF file through the Roche tool sfffile and it will generate the index).

Do you know any Python? You would need a line or two of Python to do the renaming. Can you give a few examples of the old names and the desired new names?

On re-reading your original question, I would guess the renaming could be based on the barcodes (i.e. you'll need to look at the called sequence). This would complicate things a little. If so, what do you do if the barcode isn't sequenced perfectly and does not match any of your expected barcodes?
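One possible answer to the imperfect-barcode question is to accept the closest expected barcode within a small mismatch budget and leave everything else unassigned. The sketch below is only an illustration: the function names, the `max_mismatches` threshold, and any barcode sequences passed in are hypothetical, and it assumes the barcode sits at the very start of the sequence given (i.e. any leading key sequence has already been removed).

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(seq, barcodes, max_mismatches=1):
    """Return the label of the best-matching barcode, or None.

    barcodes maps a label to its expected sequence (hypothetical values).
    Reads that are too short, too divergent, or tied between two
    barcodes are left unassigned (None).
    """
    best_name, best_dist, tie = None, max_mismatches + 1, False
    for name, barcode in barcodes.items():
        prefix = seq[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read too short to contain this barcode
        dist = hamming(prefix.upper(), barcode.upper())
        if dist < best_dist:
            best_name, best_dist, tie = name, dist, False
        elif dist == best_dist:
            tie = True  # ambiguous: two barcodes are equally close
    return None if tie else best_name
```

Unassigned reads could then keep their original names or be written to a separate file, depending on what the user prefers.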

**jvhaarst** · 09-08-2009, 01:43 AM

I don't care about the index, one could reconstruct that easily with sfffile.
I just had a look at iolib and sff_extract, and neither has any information about the manifest.
I can see that it is placed just before the index, so I guess that one could just grab the bytes between the last read and the index location, and reuse the info as is.

Did you find a description of the manifest block ?

My idea on renaming would be to let the user decide what they want.
My idea now would be to add something to the 454 identifier, so it stays unique and it remains identifiable which run it came from.

I'll have a look at the manifest block to see whether I can guess what the leading bytes mean.

**maubp** · 09-08-2009, 02:11 AM

Originally posted by jvhaarst:

I don't care about the index, one could reconstruct that easily with sfffile.

Yes - that works fine I've found.

Originally posted by jvhaarst:

I just had a look at iolib and sff_extract, and neither has any information about the manifest.

It is undocumented as far as I know.

Originally posted by jvhaarst:

I can see that it is placed just before the index, so I guess that one could just grab the bytes between the last read and the index location, and reuse the info as is.

Yes you can do that. Roche SFF files with a manifest use the "SFF index block" to hold both an XML manifest, and an actual index block.

Originally posted by jvhaarst:

Did you find a description of the manifest block?

No - but a little reverse engineering shows the length of the XML string is given (so you know where it is, and where the following index data is), and the length of the index data.

Originally posted by jvhaarst:

My idea on renaming would be to let the user decide what they want. My idea now would be to add something to the 454 identifier, so it stays unique and it remains identifiable which run it came from.

Just adding the same text to every read identifier? Should be easy...

Originally posted by jvhaarst:

I'll have a look at the manifest block to see whether I can guess what the leading bytes mean.

I've told you what I think it means above - very simple, just two lengths
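The two-length layout described above can be sketched as follows. This is a guess at the structure based on the reverse engineering mentioned in this thread, not a documented format: the function name is hypothetical, and the field order (XML length first, then index length) and the big-endian 32-bit encoding are assumptions.

```python
import struct


def split_mft_index_block(block):
    """Split a Roche ".mft1.00" index block into (xml, index_data).

    Assumed layout (reverse engineered, not a specification):
    8 bytes of magic number plus version, then two big-endian unsigned
    32-bit lengths (XML manifest, index data), then the XML itself,
    then the index data.
    """
    magic_and_version = block[:8]
    if magic_and_version != b".mft1.00":
        raise ValueError("Not a .mft1.00 index block: %r" % magic_and_version)
    xml_size, data_size = struct.unpack(">II", block[8:16])
    xml = block[16:16 + xml_size]
    index_data = block[16 + xml_size:16 + xml_size + data_size]
    if len(xml) != xml_size or len(index_data) != data_size:
        raise ValueError("Index block is shorter than its stated lengths")
    return xml, index_data
```

Given the raw bytes of the index block (read from the offset and length stored in the SFF file header), this returns the XML manifest so it can be reused as is.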


http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats

The above documentation (and the Roche 454 manual, which has similar content) doesn't actually cover the index. All the specification lays down is that the index starts with a four byte "magic number" (a format name) and a four byte version (typically a string). Thus different SFF index types can be distinguished by their first eight bytes.

I have only seen Roche SFF files with indexes starting ".srt1.00" (with no XML manifest) and more commonly ".mft1.00" (short for Manifest v1.00 is my guess). These both use the same index internally, working in base 255 so that 0xFF can be used as a separator character. As far as I know, neither of these index block formats is documented (although I have reverse engineered enough to understand most of the layout).

Looking at the Staden IO lib, their code knows about ".srt1.00" (454 sorted v1.00) and also supports ".hsh1.00" (hash table v1.00). They provide documentation of these hash tables too. I have no idea if these hash indexes are actually in widespread use or not.
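Telling these index types apart by their first eight bytes might look like the sketch below. The function name and the description strings are made up for illustration; only the three magic/version strings come from this thread.

```python
# Magic number + version strings mentioned in this thread.
KNOWN_INDEX_TYPES = {
    b".srt1.00": "Roche sorted index v1.00 (no XML manifest)",
    b".mft1.00": "Roche manifest v1.00 (XML manifest plus index)",
    b".hsh1.00": "Staden io_lib hash table v1.00",
}


def describe_index(block):
    """Return (magic, version, description) for an SFF index block.

    Only the first eight bytes are inspected: a four byte magic number
    followed by a four byte version string.
    """
    if len(block) < 8:
        raise ValueError("Index block too short to hold magic and version")
    magic, version = block[:4], block[4:8]
    description = KNOWN_INDEX_TYPES.get(block[:8], "unknown index type")
    return magic, version, description
```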

I'm working on support for SFF files in Biopython, including the indexes. This code is currently on GitHub and is not yet in the main trunk:


http://github.com/peterjc/biopython/tree/index


Once it is (or if you are happy using my branch for a one off conversion), then this should work if you don't care about the Roche XML manifest:

Code:

```python
from Bio import SeqIO


def rename(record):
    """Alter the record's identifier by appending a suffix."""
    record.id += "_and_a_suffix"
    return record


# SFF is a binary format, so both files are opened in binary mode.
with open("input.sff", "rb") as in_handle, open("output.sff", "wb") as out_handle:
    # Generator expression, so only one record is in memory at a time.
    records = (rename(rec) for rec in SeqIO.parse(in_handle, "sff"))
    # This will not write the Roche XML manifest!
    SeqIO.write(records, out_handle, "sff")
```

I can do something similar preserving the XML, but it requires going a little low level - not just using Biopython's SeqRecord based SeqIO system:


http://biopython.org/wiki/SeqIO

**jvhaarst** · 09-08-2009, 03:12 AM

Great!
This saves me (and probably others) a lot of time.
Adding the index and the manifest shouldn't be that hard, should it?
The index is probably just a sorted list of IDs, each with an address?

**maubp** · 09-08-2009, 03:29 AM

Originally posted by jvhaarst:

Great!
This saves me (and probably others) a lot of time.
Adding the index and the manifest shouldn't be that hard, should it?
The index is probably just a sorted list of IDs, each with an address?

It's not hard, but I haven't settled on my API yet, and I'm still hoping for more details about the XML manifest format, and the index. The Roche index is an alphabetically sorted list of the names, each storing the offset using base 255 (not 256), followed by a marker character (byte 0xFF, decimal 255).
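The point of base 255 is that no digit can ever equal 0xFF, which leaves that byte free as an unambiguous end-of-entry marker. A minimal sketch of the encoding is shown below; the function names are hypothetical, and the digit count (four) and the most-significant-digit-first order are assumptions rather than documented facts.

```python
def encode_offset(offset, digits=4):
    """Encode a file offset as base 255 digits, most significant first.

    Every digit is in the range 0..254, so 0xFF never appears and can
    safely be used as the marker that follows each index entry.
    """
    if not 0 <= offset < 255 ** digits:
        raise ValueError("Offset does not fit in %d base 255 digits" % digits)
    out = bytearray()
    for _ in range(digits):
        offset, digit = divmod(offset, 255)
        out.append(digit)  # least significant digit first, reversed below
    return bytes(reversed(out))


def decode_offset(data):
    """Decode base 255 digits (most significant first) back to an integer."""
    value = 0
    for byte in data:
        if byte == 0xFF:
            raise ValueError("0xFF is the marker byte, not a valid digit")
        value = value * 255 + byte
    return value
```

An index entry would then be the read name, the encoded offset, and the 0xFF marker, with whatever padding bytes the real format places between them.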

The short script above (using the current version of the Biopython branch referred to) will write a Roche style index with a dummy manifest. I would expect this to work as is when SFF support is merged into the main Biopython trunk.

I could share an example which first extracts the original XML manifest, and saves that to the output file (along with the selected records and their new names and offsets). However, right now that requires calling "private" methods in my code, and such a script will probably go out of date shortly. If you are doing this as a one off, this might be fine, but I don't want to circulate an example which I expect to break soon (as I work on the Biopython SFF support).

Note that one of the things recorded in the XML manifest is the <accession_prefix>, i.e. what all the read names are expected to start with. If you edit the SFF read names, but not this bit of the manifest, it may confuse the Roche tools. As the XML manifest is (to my knowledge) undocumented, the only safe option is not to write it, or to make the user/programmer calling Biopython decide this themselves.
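A simple sanity check before reusing an old manifest would be to confirm that every renamed read still starts with the recorded prefix. The sketch below assumes the manifest has already been decoded to a text string and contains a single `<accession_prefix>` element; the function names are hypothetical, and since the manifest is undocumented a plain regular expression is used rather than relying on any particular XML structure.

```python
import re


def accession_prefix(manifest_xml):
    """Return the text of the first <accession_prefix> element, or None."""
    match = re.search(
        r"<accession_prefix>\s*(.*?)\s*</accession_prefix>",
        manifest_xml,
        re.DOTALL,
    )
    return match.group(1) if match else None


def names_match_manifest(names, manifest_xml):
    """Check that every read name starts with the manifest's prefix.

    If the manifest has no <accession_prefix>, there is nothing to
    check and True is returned.
    """
    prefix = accession_prefix(manifest_xml)
    if prefix is None:
        return True
    return all(name.startswith(prefix) for name in names)
```

Appending a suffix to each name, as in the renaming script earlier in the thread, would pass this check, while changing the start of the names would not.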

**jvhaarst** · 09-08-2009, 06:03 AM

For now, I think I will first run a test with the changed reads.
I myself wouldn't change the start of the reads, because that would make it harder to see which run produced a read. This means that the accession_prefix can stay as it is.

For the future a version which can reuse the old manifest would be great.

**ketil** · 11-17-2010, 06:26 AM

The read names in 454 encode, among other things, the date and time, so collisions should be very unlikely. That said, I wrote a small utility to just add a serial number to all reads in a set of SFF files (also ignoring the index, manifest, etc.) - it's available as part of the flower package (http://blog.malde.org/index.php/flower).
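The serial-number idea can be sketched in a few lines of Python. This is not how the flower utility is implemented; the function name, the underscore separator, and the zero-padded width are all made up for illustration.

```python
from itertools import count


def add_serial_numbers(names, start=1, width=7):
    """Yield each read name with a zero-padded serial number appended.

    Passing the names from several SFF files as one iterable (for
    example via itertools.chain) keeps the numbering unique across
    the whole set.
    """
    counter = count(start)
    for name in names:
        yield "%s_%0*d" % (name, width, next(counter))
```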


Renaming reads within SFF files
