**drio** · 10-22-2009, 07:06 PM

Originally posted by Eugeni

Hi, kmcarr
Thanks for your help. The script worked very well and generated the FASTQ file in Sanger format, although its output includes this message:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Do you know what is happening, and whether it is important?
Thanks a lot

Some of the quality values have extra spaces, depending on the number of digits. We just have to make sure there is exactly one space between them:

--- fastaQual2fastaq.pl.orig 2009-10-22 22:05:24.000000000 -0500
+++ fastaQual2fastaq.pl 2009-10-22 22:04:54.000000000 -0500
@@ -33,6 +33,7 @@
chomp $qrecord;
my ($qdef, @qualLines) = split /\n/, $qrecord;
my $qualString = join ' ', @qualLines;
+ $qualString =~ s/\s+/ /g;
my @quals = split / /, $qualString;
print FASTQ "@","$qdef\n";
print FASTQ "$seqs{$qdef}\n";
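
For comparison, here is a minimal Python sketch of the same idea. Splitting on a single space turns two adjacent spaces into an empty field, which is what triggers the "isn't numeric" warning; splitting on any run of whitespace avoids it. The function name `quals_to_sanger` and the sample line are made up for illustration and are not taken from the Perl script.

```python
def quals_to_sanger(qual_line):
    """Convert a line of space-separated Phred scores to a Sanger FASTQ quality string."""
    # str.split() with no argument splits on any run of whitespace and
    # drops empty fields, so doubled or leading spaces cannot yield "".
    scores = [int(q) for q in qual_line.split()]
    # Sanger FASTQ stores each score as the ASCII character with code score + 33.
    return "".join(chr(score + 33) for score in scores)


line = "40 40  9 12  7 30"  # single-digit scores padded with an extra space
print(line.split(" "))      # splitting on exactly one space leaves empty strings
print(quals_to_sanger(line))
```

The Perl patch above gets the same effect by collapsing whitespace before the split.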

**kmcarr** · 10-22-2009, 07:18 PM

Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.

Attached Files

fastaQual2fastq.pl (926 Bytes, 1303 views)

**maubp** · 10-23-2009, 01:13 AM

Seeing as the thread has shifted from SFF to FASTQ, to the easier task of FASTA+QUAL to FASTQ, here is a Biopython solution which will work on Biopython 1.51 or later:

Code:

from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
# Pair each FASTA record with its QUAL record, then write Sanger FASTQ
handle = open("temp.fastq", "w") #w=write
records = PairedFastaQualIterator(open("example.fasta"), open("example.qual"))
count = SeqIO.write(records, handle, "fastq")
handle.close()
print("Converted %i records" % count)

This example will be included in the next edition of the Biopython Tutorial. Adding simple command line parsing using sys.argv is left as an exercise for the reader.
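
One possible way to do that exercise is sketched below. The argument order (FASTA, QUAL, output FASTQ) and the function names `parse_args` and `main` are assumptions for illustration only. Passing a filename rather than a handle to `SeqIO.write` needs a newer Biopython than 1.51.

```python
import sys


def parse_args(argv):
    """Return (fasta_path, qual_path, fastq_path) from a sys.argv-style list."""
    if len(argv) != 4:
        raise SystemExit("Usage: %s in.fasta in.qual out.fastq" % argv[0])
    return argv[1], argv[2], argv[3]


def main(argv):
    fasta_path, qual_path, fastq_path = parse_args(argv)
    # Imported here so the argument check works even without Biopython installed.
    from Bio import SeqIO
    from Bio.SeqIO.QualityIO import PairedFastaQualIterator

    with open(fasta_path) as fasta, open(qual_path) as qual:
        records = PairedFastaQualIterator(fasta, qual)
        count = SeqIO.write(records, fastq_path, "fastq")
    print("Converted %i records" % count)


# Show how the arguments are unpacked; a real script would call main(sys.argv).
print(parse_args(["fastaqual2fastq.py", "example.fasta", "example.qual", "temp.fastq"]))
```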

A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.

Peter

**idas** · 03-26-2010, 02:22 PM

sff2fastq

To Whomever May Be Interested:

I have recently released a program called 'sff2fastq' on GitHub that does a direct SFF to FASTQ format conversion. 'sff2fastq' is implemented in C and should compile on *NIX-type operating systems (Linux, BSD-type, & Mac OS X).

The output is in Sanger FASTQ format.

The source code & compilation instructions are available via the following GitHub URL:

http://github.com/indraniel/sff2fastq

If the git version control software is not available on your system please visit the following link for installation instructions:

http://help.github.com/git-installation-redirect

Any feedback about the program would be appreciated. Bug reports are very much welcome, although I can't guarantee when they will be addressed.

Sincerely,
Indraniel Das

The Genome Center at Washington University

**maubp** · 03-27-2010, 06:33 AM

Originally posted by maubp

A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.

This will be in Biopython 1.54, due out shortly (probably April 2010), and can be tested now if you install the latest Biopython from the repository. A simple Biopython script for SFF to FASTQ would be just:

Code:

from Bio import SeqIO
SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")

Or:

Code:

from Bio import SeqIO
SeqIO.convert("example.sff", "sff-trim", "trimmed.fastq", "fastq")

Note that this does not handle paired-end SFF files, which require the reads to be analysed to look for the linker sequence. You can use sff_extract for that.

**maubp** · 03-27-2010, 06:37 AM

Originally posted by idas

I have recently released a program called 'sff2fastq' ... Any feedback about the program would be appreciated. Bug reports are very much welcome, although I can't guarantee when they will be addressed.

It might be useful to omit the optional repetition of the read names on the plus lines in the FASTQ output. Most tools should cope with this, and it does significantly reduce the file size.
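
As a rough illustration of what dropping the repeated names involves, here is a small Python sketch that post-processes an existing FASTQ file. It assumes every record is exactly four lines (no wrapped sequence or quality lines), and the function name `strip_plus_line_names` is made up for this example. It works by line position, not by looking for a leading "+", because a quality line can itself begin with "+".

```python
def strip_plus_line_names(lines):
    """Yield FASTQ lines with the repeated read name removed from each '+' line."""
    # Assumes four lines per record, so the '+' line is always the third one.
    for index, line in enumerate(lines):
        if index % 4 == 2:
            yield "+\n"
        else:
            yield line


record = ["@read1 length=4\n", "ACGT\n", "+read1 length=4\n", "IIII\n"]
print("".join(strip_plus_line_names(record)), end="")
```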

**nt2010** · 10-22-2010, 10:15 AM

I need to convert a bunch of SFF files to FASTQ, so I did a quick experiment to compare sff2fastq and sff_extract:
∘ picked a random SFF file from my data set: size 2.2G, 662933 reads (after conversion)
∘ sff_extract took > 270 sec and output FASTA and QUAL in separate files, with quality values as numbers rather than ASCII
∘ sff2fastq took 50 sec
∘ sff2fastq outputs trimmed reads by default. There is an option to output untrimmed reads. Trimmed reads are about half the length of untrimmed reads.
∘ sff_extract outputs untrimmed reads by default, which exactly match the untrimmed output of sff2fastq.

I think I'm going to use sff2fastq. A question to its author: what are the criteria to trim reads? Thanks.

**BaCh** · 10-22-2010, 11:24 AM

Originally posted by nt2010

∘ sff_extract took > 270 sec and output FASTA and QUAL in separate files, with quality values as numbers rather than ASCII
∘ sff2fastq took 50 sec

sff_extract defaults to FASTA + QUAL. To get FASTQ just add "-Q" to the command line.

sff2fastq is in C, so a 5 to 1 ratio in runtime is not too bad. Also, be careful with paired-end reads if you have them: sff_extract has a pipeline to get them out for you as one would expect them, whereas sequences from sff2fastq you will need to post-process (i.e. split at the right place) yourself.
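
To give an idea of what "split at the right place" means, here is a deliberately simplified Python sketch. It only splits a read and its quality string around the first exact occurrence of a linker. The linker string below is a made-up placeholder and not a real 454 linker, and the function name `split_at_linker` is hypothetical. Real tools would also have to cope with sequencing errors in the linker and with partial linkers, which this does not attempt.

```python
def split_at_linker(seq, qual, linker):
    """Split a read and its quality string around the first exact linker match.

    Returns ((left_seq, left_qual), (right_seq, right_qual)), or None when the
    linker is absent and the read should be treated as unpaired.
    """
    start = seq.find(linker)
    if start == -1:
        return None
    end = start + len(linker)
    return (seq[:start], qual[:start]), (seq[end:], qual[end:])


# LINKER is a made-up placeholder, not a real 454 linker sequence.
LINKER = "GGTTCCAA"
seq = "ACGTAC" + LINKER + "TTGACA"
qual = "I" * len(seq)
print(split_at_linker(seq, qual, LINKER))
```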

Originally posted by nt2010

A question to its author: what are the criteria to trim reads? Thanks.

I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

B.

**idas** · 10-25-2010, 09:02 PM

Apologies about the delayed response.

Originally posted by BaCh

I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

B.

Yes, the above is correct. sff2fastq uses the trim information embedded within the SFF file itself to display the reads.

sff2fastq is designed to have functionality similar to the 454 tools (like sffinfo) produced by 454/Roche. sffinfo outputs trimmed reads by default.

The '-n' option of sff2fastq (similar to sffinfo) bypasses the trim information encoded within the SFF file and just displays the full raw read data directly.
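
As a sketch of how embedded trim information can be applied, the Python below combines the quality and adapter clip points of a read into one slice. It assumes the four clip fields are 1-based inclusive positions with 0 meaning "not set", which is how the SFF read header is usually described. The function name `trimmed_region` and the sample read are made up, and this is not taken from the sff2fastq source.

```python
def trimmed_region(seq_len, clip_qual_left, clip_qual_right,
                   clip_adapter_left, clip_adapter_right):
    """Return a 0-based, end-exclusive (start, end) slice for the trimmed read.

    Assumes 1-based inclusive clip positions, where 0 means "not set".
    """
    # Leftmost kept base: the larger of the two left clips, at least position 1.
    first = max(1, clip_qual_left, clip_adapter_left)
    # Rightmost kept base: the smaller of the right clips, ignoring unset (0) values.
    last = min(clip_qual_right or seq_len, clip_adapter_right or seq_len, seq_len)
    return first - 1, last


bases = "tcagACGTACGTACGTnnnn"
start, end = trimmed_region(len(bases), 5, 16, 0, 0)
print(bases[start:end])  # trimmed read
print(bases)             # full read, as an untrimmed option would show it
```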

For more information about the original trimming information encoded within the SFF file, please look at the Data Analysis Software Manual produced by 454. One version of it is available at the following link:

http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf

Some trimming occurs in the signal processing step of the GS Run Processor application that performs the original base calling from the raw images acquired from the 454 instrument. It trims read ends for low quality and primer sequence (see sections 3.2 and 3.2.2 in the above manual for the details about this process).

The format of the trim information encoded within the SFF file is also described in section 13.3.8.2 of the above manual.

Does this clarify your question about sff2fastq?

**SES** · 10-26-2010, 09:10 AM

Originally posted by kmcarr

Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.

The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.

Sorry for only just seeing this, but the cln2qual script that comes with SeqClean should trim the QUAL file using the report and take care of that problem.

**nt2010** · 10-28-2010, 12:00 PM

Thanks BaCh and idas for your answers. All clear.

I'm not sure if I should continue here or start another thread. My question is that some of the trimmed reads output by the converter(s) can still be very long, with low quality at the end (Phred ~ 10). Should I trim them further, or is it acceptable to keep them, since 454 works differently from Illumina?

**ketil** · 11-17-2010, 06:07 AM

Thanks for the benchmarks! What machine was used for this? I've written a program (flower - http://blog.malde.org/index.php/flower) to extract various information from SFF files, including Fasta and (Illumina or Sanger style) FastQ. It takes about 20 seconds to convert a 2.1G SFF to FastQ, but this is on a beefy server (Xeon 3.4GHz), so it's probably not directly comparable. Nice to see that we're in the same league, at least.

**prisnirath** · 05-11-2011, 07:36 AM

Thanks... the script worked for me with a few minor alterations.

**maasha** · 01-18-2012, 03:12 AM

Using Biopieces you can do:

Code:

read_sff -i data.sff | write_fastq -o data.fq -x

or

Code:

read_sff -i data.sff | write_454 -o data.fna -q data.fna.qual -x

or both in one go:

Code:

read_sff -i data.sff | write_fastq -o data.fq | write_454 -o data.fna -q data.fna.qual -x

**lplough81** · 02-02-2012, 11:11 AM

Error on Fastq convert

Hi,
I tried the FASTQ convert function in Biopython:

from Bio import SeqIO
SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")

(I used my own SFF file, though)

and I received this error:

File "/usr/lib/pymodules/python2.7/Bio/SeqIO/SffIO.py", line 258, in _sff_file_header
raise ValueError("Empty file.")
ValueError: Empty file.

Does this mean that there is a blank line in the SFF file? Any thoughts?

Thanks,
Louis
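
SFF is a binary format, so a blank line is an unlikely cause. The message more plausibly means that no bytes could be read from the file at all, though that is an assumption about the parser and not something confirmed in this thread. A quick Python sketch for checking the file before conversion follows; the function name `describe_sff` is made up, and the check relies on SFF files starting with the four bytes ".sff".

```python
import os


def describe_sff(path):
    """Return a short diagnosis of whether path looks like a non-empty SFF file."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as handle:
        magic = handle.read(4)
    # SFF files are expected to begin with the four bytes ".sff".
    return "looks like SFF" if magic == b".sff" else "not SFF"


print(describe_sff("example.sff"))
```

If this reports "empty" or "missing", the problem is with the input path or a failed download or copy, not with Biopython.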
