FASTQ sequence converter

SES replied

10-26-2010, 09:10 AM
Originally posted by kmcarr View Post

Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.

The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.

Sorry for only just seeing this, but the cln2qual script that comes with SeqClean should trim the QUAL file using the cleaning report, which takes care of that problem.
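If the FASTA and QUAL files still end up in different orders, one workaround is to pair records by read name instead of by position. Below is a minimal Python sketch of that idea. The function names are hypothetical, it assumes the read ID is the first token of each header line, and it holds everything in memory, which may not suit very large data sets.

```python
def parse_records(text):
    """Parse FASTA- or QUAL-style text into {read_id: [body lines]}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the read ID is the first whitespace-delimited token.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return records


def fasta_qual_to_fastq(fasta_text, qual_text):
    """Pair records by read ID, so the two inputs may be in different orders."""
    seqs = {rid: "".join(lines) for rid, lines in parse_records(fasta_text).items()}
    quals = {rid: " ".join(lines).split() for rid, lines in parse_records(qual_text).items()}
    out = []
    for rid, seq in seqs.items():
        scores = [int(q) for q in quals[rid]]
        if len(scores) != len(seq):
            raise ValueError("length mismatch for read %s" % rid)
        # Sanger FASTQ: quality character = Phred score + 33.
        encoded = "".join(chr(q + 33) for q in scores)
        out.append("@%s\n%s\n+\n%s\n" % (rid, seq, encoded))
    return "".join(out)


# The QUAL entries are deliberately in a different order from the FASTA entries.
fasta = ">r1\nACGT\n>r2\nGG\n"
qual = ">r2\n40 40\n>r1\n30 30 20 20\n"
fastq = fasta_qual_to_fastq(fasta, qual)
```

The output follows the order of the FASTA input, and a read that is missing from the QUAL input raises a KeyError rather than being silently skipped.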
idas replied

10-25-2010, 09:02 PM
Apologies about the delayed response.

Originally posted by BaCh View Post

I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

B.

Yes, the above is correct. sff2fastq uses the trim information embedded within the SFF file itself to output the reads.

sff2fastq is designed to have functionality similar to the 454 tools (like sffinfo) that are produced by 454/Roche. sffinfo outputs trimmed reads by default.

The '-n' option of sff2fastq (similar to sffinfo) bypasses the trim information encoded within the SFF file and just outputs the full raw read data directly.

For more about the original trimming information encoded within the SFF file, please see the Data Analysis Software Manual produced by 454. One version of it is available at the following link:

http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf

Some trimming occurs in the signal processing step of the GS Run Processor application that performs the original base calling from the raw images acquired from the 454 instrument. It trims read ends for low quality and primer sequence (see sections 3.2 and 3.2.2 in the above manual for the details about this process).

The format of the trim information that is encoded within the SFF file is also described in section 13.3.8.2 of the above manual.
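As an illustration of how that trim information is usually applied: as I understand the SFF format description, each read header carries four 1-based clip values (quality left/right and adapter left/right), with 0 meaning "not set". The Python sketch below works under that assumption. It is not code taken from sff2fastq or sffinfo, and the function names are hypothetical.

```python
def trim_bounds(n_bases, clip_qual_left, clip_qual_right,
                clip_adapter_left, clip_adapter_right):
    """Return (first, last) 1-based inclusive positions of the trimmed read.

    Assumption: a clip value of 0 means "not set".
    """
    first = max(1, clip_qual_left, clip_adapter_left)
    last = min(clip_qual_right or n_bases, clip_adapter_right or n_bases)
    return first, last


def apply_trim(bases, first, last):
    # Convert 1-based inclusive positions to a Python slice.
    return bases[first - 1:last]


# Made-up read: 4-base key, 8 good bases, 2 low-quality bases at the end.
bases = "tcagACGTACGTnn"
first, last = trim_bounds(len(bases), 5, 12, 0, 0)
trimmed = apply_trim(bases, first, last)
```

With all four clip values set to 0 the bounds cover the whole read, which corresponds to the untrimmed output.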

Does this clarify your question about sff2fastq?
BaCh replied

10-22-2010, 11:24 AM
Originally posted by nt2010 View Post

∘ sff_extract took > 270sec, output fasta and qual in separate files, quals in number not ASCII
∘ sff2fastq took 50 sec

sff_extract defaults to FASTA + QUAL. To get FASTQ just add "-Q" to the command line.

sff2fastq is in C, so a 5 to 1 ratio in runtime is not too bad. Also, be careful with paired-end reads if you have them: sff_extract has a pipeline to get them out for you as one would expect them, whereas sequences from sff2fastq will need to be post-processed (i.e. split at the right place) by you.
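To give a rough idea of what "split at the right place" means, here is a minimal Python sketch that cuts a read at the first exact occurrence of a linker. The linker string is a placeholder and NOT a real 454 linker, the function name is hypothetical, and real paired-end handling (as in sff_extract) has to tolerate sequencing errors in the linker, which exact matching does not.

```python
def split_at_linker(seq, qual, linker):
    """Split one read into two mates around the first exact linker match.

    Returns None when the linker is absent, so the caller can treat the
    read as unpaired.
    """
    pos = seq.find(linker)
    if pos == -1:
        return None
    end = pos + len(linker)
    left = (seq[:pos], qual[:pos])
    right = (seq[end:], qual[end:])
    return left, right


LINKER = "GATTACA"  # placeholder only, not a real 454 linker sequence
read = "ACGTAC" + LINKER + "TTGGCC"
quals = "I" * len(read)
mates = split_at_linker(read, quals, LINKER)
```

The quality string is sliced with the same coordinates as the sequence so that the two stay in step.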

Originally posted by nt2010 View Post

A question to its author: what are the criteria to trim reads? Thanks.

I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

B.
nt2010 replied

10-22-2010, 10:15 AM
I need to convert a bunch of SFFs to FASTQ. I did a quick experiment to compare sff2fastq and sff_extract:
∘ picked a random sff file from my data set: size 2.2G, 662933 reads (after conversion)
∘ sff_extract took > 270sec, output fasta and qual in separate files, quals in number not ASCII
∘ sff2fastq took 50 sec
∘ sff2fastq outputs trimmed reads by default; there is an option to output untrimmed reads. Trimmed reads are about half the length of untrimmed reads.
∘ sff_extract outputs untrimmed reads by default, which exactly match the untrimmed output of sff2fastq.

I think I'm going to use sff2fastq. A question to its author: what are the criteria to trim reads? Thanks.
Question to sff2
maubp replied

03-27-2010, 06:37 AM
Originally posted by idas View Post

I have recently released a program called 'sff2fastq' ... Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.

It might be useful to omit the optional repetition of the read names on the plus lines in the FASTQ output. Most tools should cope with this, and it does significantly reduce the file size.
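For existing files, the repeated names can also be dropped afterwards. The Python sketch below is a minimal illustration that assumes strict four-line FASTQ records (no wrapped sequence or quality lines); the function name is hypothetical.

```python
def strip_plus_names(fastq_lines):
    """Yield FASTQ lines with the optional read name removed from '+' lines.

    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 2:
            # Third line of each record: keep only the '+' marker.
            yield "+\n"
        else:
            yield line


record = ["@read1\n", "ACGT\n", "+read1\n", "IIII\n"]
slim = list(strip_plus_names(record))
saved = sum(map(len, record)) - sum(map(len, slim))
```

The plus line is found by position rather than by its first character, because a quality line can itself begin with '+' or '@'.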
maubp replied

03-27-2010, 06:33 AM
Originally posted by maubp View Post

A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.

This will be in Biopython 1.54, due out shortly (probably April 2010), and can be tested now if you install the latest Biopython from the repository. A simple Biopython script for SFF to FASTQ would be just:

Code:

from Bio import SeqIO
SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")

Or:

Code:

from Bio import SeqIO
SeqIO.convert("example.sff", "sff-trim", "trimmed.fastq", "fastq")

Note this does not handle paired-end SFF files, which require the reads to be analysed to look for the linker sequence. You can use sff_extract for that.
idas replied

03-26-2010, 02:22 PM
sff2fastq

To Whomever May Be Interested:

I have recently released a program called 'sff2fastq' onto github that does a direct SFF to FASTQ format conversion. 'sff2fastq' is implemented in the C language and should compile on *NIX type operating systems (Linux, BSD-type, & Mac OS X).

The FASTQ output produced is of the Sanger FASTQ format.
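For readers unfamiliar with the Sanger FASTQ convention: each quality character is the Phred score plus 33, written as a single ASCII character. A minimal Python sketch follows; the function names are made up for illustration and are not part of sff2fastq.

```python
def phred_to_sanger(scores):
    """Encode integer Phred scores as a Sanger FASTQ quality string."""
    return "".join(chr(q + 33) for q in scores)


def sanger_to_phred(quality):
    """Decode a Sanger FASTQ quality string back to integer Phred scores."""
    return [ord(c) - 33 for c in quality]


encoded = phred_to_sanger([0, 20, 40])
decoded = sanger_to_phred(encoded)
```

Here a score of 0 maps to '!', 20 to '5' and 40 to 'I'.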

The source code & compilation instructions are available via the following github url:

http://github.com/indraniel/sff2fastq

If the git version control software is not available on your system please visit the following link for installation instructions:

http://help.github.com/git-installation-redirect

Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.

Sincerely,
Indraniel Das

The Genome Center at Washington University
maubp replied

10-23-2009, 01:13 AM
Seeing as the thread has shifted from SFF to FASTQ, to the easier task of FASTA+QUAL to FASTQ, here is a Biopython solution which will work on Biopython 1.51 or later:

Code:

from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator

handle = open("temp.fastq", "w")  # w = write
records = PairedFastaQualIterator(open("example.fasta"), open("example.qual"))
count = SeqIO.write(records, handle, "fastq")
handle.close()
print("Converted %i records" % count)

This example will be included in the next edition of the Biopython Tutorial. Adding simple command line parsing using sys.argv is left as an exercise for the reader.

A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.

Peter
kmcarr replied

10-22-2009, 07:18 PM
Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.
Attached Files

fastaQual2fastq.pl (926 Bytes, 1308 views)
Last edited by kmcarr; 10-22-2009, 07:22 PM.
drio replied

10-22-2009, 07:06 PM
Originally posted by Eugeni View Post

Hi, kmcarr
Thanks for your help. The script worked very well and generated the FASTQ file in the Sanger format, although the script gives this message on stdout:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Do you know what is happening, and whether it is important?
Thanks a lot

Some of the quality values are separated by extra spaces, depending on the number of digits. We just have to make sure there is exactly one space between them:

--- fastaQual2fastaq.pl.orig 2009-10-22 22:05:24.000000000 -0500
+++ fastaQual2fastaq.pl 2009-10-22 22:04:54.000000000 -0500
@@ -33,6 +33,7 @@
chomp $qrecord;
my ($qdef, @qualLines) = split /\n/, $qrecord;
my $qualString = join ' ', @qualLines;
+ $qualString =~ s/\s+/ /g;
my @quals = split / /, $qualString;
print FASTQ "@","$qdef\n";
print FASTQ "$seqs{$qdef}\n";
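The effect of the extra spaces can be reproduced in a few lines of Python. This is only an illustration of the problem and of the whitespace-collapsing idea in the patch above, using a made-up quality line:

```python
import re

qual_line = "40 40  9 35"  # two spaces before the single-digit score

# Splitting on exactly one space yields an empty field,
# which is what triggers the "isn't numeric" warning.
naive = qual_line.split(" ")

# Collapsing whitespace runs first, as the patch does, avoids it.
fixed = re.sub(r"\s+", " ", qual_line).split(" ")
```

The empty string in the naive result is the value Perl complains about when it is used in the addition.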
maubp replied

10-08-2009, 06:31 AM
Just a guess, but you could check your line endings (DOS/Windows versus Unix).
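One quick way to check is to count the two kinds of line ending in the raw bytes of the file. A minimal Python sketch, with a hypothetical function name and made-up sample data:

```python
def count_line_endings(data):
    """Count CRLF (DOS/Windows) and bare LF (Unix) endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf}


# Made-up QUAL snippet with mixed line endings.
sample = b">read1\r\n40 40 35\r\n>read2\n30 30\n"
counts = count_line_endings(sample)
```

For a real file, the bytes would come from something like open("example.qual", "rb").read(); a non-zero CRLF count points to DOS/Windows line endings.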
Eugeni replied

10-08-2009, 06:11 AM
Originally posted by kmcarr View Post

Does the warning only appear once? How many entries are in your FASTA/QUAL files?

The warning appears for every sequence; I have 380185 FASTA/QUAL entries.
kmcarr replied

10-08-2009, 04:17 AM
Originally posted by Eugeni View Post

Hi, kmcarr
Thanks for your help. The script worked very well and generated the FASTQ file in the Sanger format, although the script gives this message on stdout:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Do you know what is happening, and whether it is important?
Thanks a lot

Does the warning only appear once? How many entries are in your FASTA/QUAL files?
Eugeni replied

10-08-2009, 12:37 AM
Originally posted by kmcarr View Post

Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

Code:

#!/usr/bin/perl

use warnings;
use strict;
use File::Basename;

my $inFasta = $ARGV[0];
my $baseName = basename($inFasta, qw/.fasta .fna/);
my $inQual = $baseName . ".qual";
my $outFastq = $baseName . ".fastq";

my %seqs;

$/ = ">";

open (FASTA, "<$inFasta");
my $junk = (<FASTA>);

while (my $frecord = <FASTA>) {
    chomp $frecord;
    my ($fdef, @seqLines) = split /\n/, $frecord;
    my $seq = join '', @seqLines;
    $seqs{$fdef} = $seq;
}

close FASTA;

open (QUAL, "<$inQual");
$junk = <QUAL>;
open (FASTQ, ">$outFastq");

while (my $qrecord = <QUAL>) {
    chomp $qrecord;
    my ($qdef, @qualLines) = split /\n/, $qrecord;
    my $qualString = join ' ', @qualLines;
    my @quals = split / /, $qualString;
    print FASTQ "@","$qdef\n";
    print FASTQ "$seqs{$qdef}\n";
    print FASTQ "+\n";
    foreach my $qual (@quals) {
        print FASTQ chr($qual + 33);
    }
    print FASTQ "\n";
}

close QUAL;
close FASTQ;

Usage notes:

- To run the program, just pass it the name of the FASTA sequence file, e.g.

Code:

%> fastaQual2fastq.pl foo.fasta

(assuming you saved the above code with the name 'fastaQual2fastq.pl')

- The fasta filename must end in either .fasta or .fna

- The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".

Hi, kmcarr
Thanks for your help. The script worked very well and generated the FASTQ file in the Sanger format, although the script gives this message on stdout:
Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
Do you know what is happening, and whether it is important?
Thanks a lot
kmcarr replied

10-07-2009, 09:03 AM
Originally posted by maubp View Post

Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...

The pipeline script (seqclean) is written in Perl so you could download it from the link above and check it out.

Last edited by kmcarr; 10-07-2009, 09:27 AM. Reason: Removed message text after discovering the cln2qual is perl, not binary.
