FASTQ sequence converter


  • SES
    replied
    Originally posted by kmcarr View Post
    Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.

    The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.
    Sorry for only just seeing this, but the cln2qual script that comes with SeqClean should trim the QUAL file using the report and take care of that problem.
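
    If the FASTA and QUAL entries do end up in different orders, matching them by identifier avoids relying on order at all. A minimal sketch, assuming plain FASTA-style files and that the first word of each header line is the read identifier; the helper names `read_records` and `pair_by_id` are made up for illustration. It keeps every quality record in memory, so it may not suit very large files.

```python
import io

def read_records(handle):
    """Yield (identifier, list_of_body_lines) for FASTA-style '>' records."""
    name, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # Assumption: the first word of the header is the identifier
            name, lines = line[1:].split()[0], []
        elif line and name is not None:
            lines.append(line)
    if name is not None:
        yield name, lines

def pair_by_id(fasta_handle, qual_handle):
    """Yield (id, sequence, qualities) in FASTA order, whatever the QUAL order."""
    quals = {}
    for name, lines in read_records(qual_handle):
        quals[name] = [int(q) for q in " ".join(lines).split()]
    for name, lines in read_records(fasta_handle):
        yield name, "".join(lines), quals[name]

# Tiny in-memory example: the QUAL entries are in a different order
fasta = io.StringIO(">r1\nACGT\n>r2\nGG\n")
qual = io.StringIO(">r2\n30 31\n>r1\n10 11 12 13\n")
paired = list(pair_by_id(fasta, qual))
```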


  • idas
    replied
    Apologies about the delayed response.

    Originally posted by BaCh View Post
    I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

    B.
    Yes, the above is correct. sff2fastq uses the trim information embedded within the SFF file itself to display the reads.

    sff2fastq is designed to have similar functionality to the 454 tools (like sffinfo) that are produced by 454/Roche. sffinfo outputs trimmed reads by default.

    The '-n' option of sff2fastq (similar to sffinfo) bypasses the trim information encoded within the SFF file and just displays the full raw read data directly.

    To view more information about the original trimming information encoded within the SFF file, please look at the Data Analysis Software Manual produced by 454. One version of it is available at the following link:

    http://sequence.otago.ac.nz/download...are_Manual.pdf

    Some trimming occurs in the signal processing step of the GS Run Processor application that performs the original base calling from the raw images acquired from the 454 instrument. It trims read ends for low quality and primer sequence (see sections 3.2 and 3.2.2 in the above manual for the details about this process).

    The format of the trim information that is encoded within the SFF file is described in section 13.3.8.2 of the above manual as well.
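
    As a rough illustration of how such clip values turn into a trimmed read, here is a minimal Python sketch. It assumes the four per-read clip fields are named clip_qual_left, clip_qual_right, clip_adapter_left and clip_adapter_right, that they are 1-based and inclusive, and that 0 means "not set"; please check these assumptions against the manual section above. The function name `trimmed_slice` is made up for this example.

```python
def trimmed_slice(num_bases, clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right):
    """Return a Python slice covering the trimmed part of a read.

    Assumption: clip values are 1-based and inclusive, and 0 means
    "no clip set" for that field.
    """
    # The trimmed read starts at the right-most of the two left clips
    first = max(1, clip_qual_left, clip_adapter_left)
    # ... and ends at the left-most of the two right clips that are set
    last = min(clip_qual_right or num_bases, clip_adapter_right or num_bases)
    return slice(first - 1, last)

bases = "ACGTACGTAC"
trimmed = bases[trimmed_slice(len(bases), 3, 8, 0, 0)]
untrimmed = bases[trimmed_slice(len(bases), 0, 0, 0, 0)]
```

    The same slice can be applied to the quality values, so bases and qualities stay in step.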

    Does this clarify your question about sff2fastq?


  • BaCh
    replied
    Originally posted by nt2010 View Post
    ∘ sff_extract took > 270sec, output fasta and qual in separate files, quals in number not ASCII
    ∘ sff2fastq took 50 sec
    sff_extract defaults to FASTA + QUAL. To get FASTQ just add "-Q" to the command line.

    sff2fastq is written in C, so a 5-to-1 ratio in runtime is not too bad. Also, be careful with paired-end reads if you have them: sff_extract has a pipeline that extracts them for you as you would expect, whereas sequences from sff2fastq you will need to post-process yourself (i.e. split at the right place).
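
    To give an idea of what "split at the right place" means, here is a minimal Python sketch. The linker sequence below is a placeholder invented for illustration (use the linker for your library kit), and an exact string match is a simplification: real linker detection typically needs approximate matching. The function name `split_at_linker` is also made up.

```python
# Hypothetical linker, for illustration only
LINKER = "TTTTGGGG"

def split_at_linker(seq, quals, linker=LINKER):
    """Split one read into two mates around the first exact linker match.

    Returns a list of (sequence, qualities) parts: two if the linker is
    found, otherwise the read unchanged as a single part.
    """
    pos = seq.find(linker)
    if pos == -1:
        return [(seq, quals)]
    end = pos + len(linker)
    # Drop the linker itself and keep what lies on either side of it
    return [(seq[:pos], quals[:pos]), (seq[end:], quals[end:])]

parts = split_at_linker("ACGTTTTTGGGGCCA", list(range(15)))
```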

    Originally posted by nt2010 View Post
    A question to its author: what are the criteria to trim reads? Thanks.
    I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

    B.


  • nt2010
    replied
    I need to convert a bunch of SFFs to FASTQ. I did a quick experiment to compare sff2fastq and sff_extract:
    ∘ picked a random sff file from my data set: size 2.2G, 662933 reads (after conversion)
    ∘ sff_extract took > 270sec, output fasta and qual in separate files, quals in number not ASCII
    ∘ sff2fastq took 50 sec
    ∘ sff2fastq output trimmed reads by default. There is an option to output untrimmed reads. Trimmed reads were about half the length of untrimmed reads.
    ∘ sff_extract output untrimmed reads by default, which match exactly the untrimmed output of sff2fastq.

    I think I'm going to use sff2fastq. A question to its author: what are the criteria to trim reads? Thanks.

  • maubp
    replied
    Originally posted by idas View Post
    I have recently released a program called 'sff2fastq' ... Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.
    It might be useful to omit the optional repetition of the read names on the plus lines in the FASTQ output. Most tools should cope with this, and it does significantly reduce the file size.
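
    For files that already have the names repeated, something along these lines would strip them afterwards. This is only a sketch: it assumes four-line FASTQ records (no line-wrapped sequence or quality), and the function name `strip_plus_names` is made up for illustration.

```python
def strip_plus_names(lines):
    """Yield FASTQ lines with each '+' separator line reduced to a bare '+'.

    Assumes four-line records, so the separator is always the third line
    of a record. Quality lines may start with '+' themselves, so position,
    not content, is used to identify the separator.
    """
    for i, line in enumerate(lines):
        if i % 4 == 2:
            yield "+\n"
        else:
            yield line

record = ["@read1\n", "ACGT\n", "+read1\n", "+III\n"]
slim = list(strip_plus_names(record))
```

    The saving is one copy of the read name per record, which adds up when names are long relative to the reads.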


  • maubp
    replied
    Originally posted by maubp View Post
    A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.
    This will be in Biopython 1.54, due out shortly (probably April 2010), and can be tested now if you install the latest Biopython from the repository. A simple Biopython script for SFF to FASTQ would be just:
    Code:
    from Bio import SeqIO
    SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")
    Or:
    Code:
    from Bio import SeqIO
    SeqIO.convert("example.sff", "sff-trim", "trimmed.fastq", "fastq")
    Note this does not handle paired-end SFF files, which require the reads to be analysed to look for the linker sequence. You can use sff_extract for that.


  • idas
    replied
    sff2fastq

    To Whomever May Be Interested:

    I have recently released a program called 'sff2fastq' onto github that does a direct SFF to FASTQ format conversion. 'sff2fastq' is implemented in the C language and should compile on *NIX type operating systems (Linux, BSD-type, & Mac OS X).

    The FASTQ output produced is of the Sanger FASTQ format.
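
    For anyone unfamiliar with the Sanger variant: each quality character is the Phred score plus 33, written as a single ASCII character. A minimal Python sketch of the encoding and decoding (the function names are made up for illustration):

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores Phred scores as chr(score + 33)

def phred_to_sanger(quals):
    """Encode a list of integer Phred scores as a Sanger FASTQ quality string."""
    return "".join(chr(q + SANGER_OFFSET) for q in quals)

def sanger_to_phred(text):
    """Decode a Sanger FASTQ quality string back into integer Phred scores."""
    return [ord(c) - SANGER_OFFSET for c in text]

encoded = phred_to_sanger([0, 20, 30, 40])
decoded = sanger_to_phred(encoded)
```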

    The source code & compilation instructions are available via the following github url:

    http://github.com/indraniel/sff2fastq

    If the git version control software is not available on your system please visit the following link for installation instructions:

    http://help.github.com/git-installation-redirect

    Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.

    Sincerely,
    Indraniel Das

    The Genome Center at Washington University


  • maubp
    replied
    Seeing as the thread has shifted from SFF to FASTQ, to the easier task of FASTA+QUAL to FASTQ, here is a Biopython solution which will work on Biopython 1.51 or later:

    Code:
    from Bio import SeqIO
    from Bio.SeqIO.QualityIO import PairedFastaQualIterator
    # Pair each FASTA record with its QUAL record and write the pairs as FASTQ
    with open("example.fasta") as fasta, open("example.qual") as qual:
        records = PairedFastaQualIterator(fasta, qual)
        with open("temp.fastq", "w") as handle:
            count = SeqIO.write(records, handle, "fastq")
    print("Converted %i records" % count)
    This example will be included in the next edition of the Biopython Tutorial. Adding simple command line parsing using sys.argv is left as an exercise for the reader.

    A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous.

    Peter


  • kmcarr
    replied
    Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

    Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.
    Last edited by kmcarr; 10-22-2009, 07:22 PM.


  • drio
    replied
    Originally posted by Eugeni View Post
    Hi, kmcarr
    Thanks for your help. The script worked very well and generated the FASTQ file in Sanger format, although the script prints this message:
    Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
    Do you know what is happening, and whether it is important?
    Thanks a lot
    Some of the quality values are separated by extra spaces, depending on the number of digits. We just have to make sure there is exactly one space between them:

    --- fastaQual2fastaq.pl.orig 2009-10-22 22:05:24.000000000 -0500
    +++ fastaQual2fastaq.pl 2009-10-22 22:04:54.000000000 -0500
    @@ -33,6 +33,7 @@
    chomp $qrecord;
    my ($qdef, @qualLines) = split /\n/, $qrecord;
    my $qualString = join ' ', @qualLines;
    + $qualString =~ s/\s+/ /g;
    my @quals = split / /, $qualString;
    print FASTQ "@","$qdef\n";
    print FASTQ "$seqs{$qdef}\n";
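
    The same effect is easy to see in Python, for anyone who does not read Perl. This is only an illustration of the idea behind the patch, not a port of the script; the function name `parse_quals` is made up. Splitting on a single space leaves empty strings wherever two spaces meet, whereas splitting on runs of whitespace does not.

```python
def parse_quals(qual_lines):
    """Join the lines of one QUAL record and return the scores as integers.

    str.split() with no argument splits on runs of whitespace, which has
    the same effect as collapsing repeated spaces before splitting.
    """
    joined = " ".join(qual_lines)
    return [int(q) for q in joined.split()]

# Splitting on exactly one space leaves an empty field between the two spaces
naive = "40  5 12".split(" ")
clean = parse_quals(["40  5 12", " 7 33"])
```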


  • maubp
    replied
    Just a guess, but you could check your line endings (DOS/Windows versus Unix).
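
    One quick way to check, sketched in Python (the function name `count_line_endings` is made up for illustration): open the file in binary mode so nothing is translated, then count how many lines end in CR LF and how many in a bare LF.

```python
def count_line_endings(path):
    """Return (crlf_count, lf_only_count) for the file at the given path."""
    crlf = lf_only = 0
    # Binary mode, so Python does not translate the line endings for us
    with open(path, "rb") as handle:
        for line in handle:
            if line.endswith(b"\r\n"):
                crlf += 1
            elif line.endswith(b"\n"):
                lf_only += 1
    return crlf, lf_only
```

    A non-zero first number on a Unix system would suggest the file came from DOS/Windows.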


  • Eugeni
    replied
    Originally posted by kmcarr View Post
    Does the warning only appear once? How many entries are in your FASTA/QUAL files?
    The warning appears for every sequence; I have 380185 FASTA/QUAL entries.


  • kmcarr
    replied
    Originally posted by Eugeni View Post
    Hi, kmcarr
    Thanks for your help. The script worked very well and generated the FASTQ file in Sanger format, although the script prints this message:
    Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
    Do you know what is happening, and whether it is important?
    Thanks a lot
    Does the warning only appear once? How many entries are in your FASTA/QUAL files?


  • Eugeni
    replied
    Originally posted by kmcarr View Post
    Here is a perl script to convert FASTA + QUAL files to FASTQ. You would need to first generate the FASTA and QUAL files from the SFF file using a tool like sffinfo from Roche or sff_extract.

    Code:
    #!/usr/bin/perl
    
    use warnings;
    use strict;
    use File::Basename;
    
    my $inFasta = $ARGV[0];
    my $baseName = basename($inFasta, qw/.fasta .fna/);
    my $inQual = $baseName . ".qual";
    my $outFastq = $baseName . ".fastq";
    
    my %seqs;
    
    $/ = ">";
    
    open (FASTA, "<$inFasta") or die "Cannot open $inFasta: $!";
    my $junk = (<FASTA>);
    
    while (my $frecord = <FASTA>) {
    	chomp $frecord;
    	my ($fdef, @seqLines) = split /\n/, $frecord;
    	my $seq = join '', @seqLines;
    	$seqs{$fdef} = $seq;
    }
    
    close FASTA;
    
    open (QUAL, "<$inQual") or die "Cannot open $inQual: $!";
    $junk = <QUAL>;
    open (FASTQ, ">$outFastq") or die "Cannot open $outFastq: $!";
    
    while (my $qrecord = <QUAL>) {
    	chomp $qrecord;
    	my ($qdef, @qualLines) = split /\n/, $qrecord;
    	my $qualString = join ' ', @qualLines;
    	my @quals = split / /, $qualString;
    	print FASTQ "@","$qdef\n";
    	print FASTQ "$seqs{$qdef}\n";
    	print FASTQ "+\n";
    	foreach my $qual (@quals) {
    		print FASTQ chr($qual + 33);
    	}
    	print FASTQ "\n";
    }
    
    close QUAL;
    close FASTQ;
    Usage notes:

    - To run the program, just pass it the name of the FASTA sequence file, e.g.

    Code:
    %> fastaQual2fastq.pl foo.fasta
    (assuming you saved the above code with the name 'fastaQual2fastq.pl')

    - The fasta filename must end in either .fasta or .fna

    - The quality filename must have the same basename as the fasta file and end with .qual. For example, if your sequence file is "foo.fna" then the quality file must be named "foo.qual".
    Hi, kmcarr
    Thanks for your help. The script worked very well and generated the FASTQ file in Sanger format, although the script prints this message:
    Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
    Do you know what is happening, and whether it is important?
    Thanks a lot


  • kmcarr
    replied
    Originally posted by maubp View Post
    Interesting - I wonder why they do that, and if it would be easy to fix their pipeline...
    The pipeline script (seqclean) is written in Perl so you could download it from the link above and check it out.
    Last edited by kmcarr; 10-07-2009, 09:27 AM. Reason: Removed message text after discovering the cln2qual is perl, not binary.
