FASTQ sequence converter

  • #16
    Originally posted by Eugeni View Post
    Hi, kmcarr
    Thanks for your help. The script worked very well and generated the FASTQ file in Sanger format, although its output includes this message:
    Argument "" isn't numeric in addition (+) at fastaQual2fastq.pl line 41, <QUAL> chunk 380185.
    Do you know what is happening, and whether it matters?
    Thanks a lot
    Some of the quality values are padded with extra spaces, depending on the number of digits. We just have to make sure there is exactly one space between them:

    --- fastaQual2fastaq.pl.orig 2009-10-22 22:05:24.000000000 -0500
    +++ fastaQual2fastaq.pl 2009-10-22 22:04:54.000000000 -0500
    @@ -33,6 +33,7 @@
    chomp $qrecord;
    my ($qdef, @qualLines) = split /\n/, $qrecord;
    my $qualString = join ' ', @qualLines;
    + $qualString =~ s/\s+/ /g;
    my @quals = split / /, $qualString;
    print FASTQ "@","$qdef\n";
    print FASTQ "$seqs{$qdef}\n";
    -drd
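For anyone following along in Python rather than Perl, here is a minimal sketch of the same idea. It is not a port of kmcarr's script; the function name `qual_to_sanger` is made up for illustration, and it assumes the scores are plain Phred values to be written with the Sanger offset of 33. Splitting on any run of whitespace avoids the empty tokens that caused the warning above.

```python
def qual_to_sanger(qual_lines):
    """Turn the lines of one QUAL record into a Sanger FASTQ quality string."""
    # str.split() with no argument splits on any run of whitespace and
    # never yields empty strings, so doubled spaces are harmless.
    scores = [int(tok) for tok in " ".join(qual_lines).split()]
    # Sanger FASTQ stores each Phred score as the character chr(score + 33).
    return "".join(chr(score + 33) for score in scores)


# Scores padded to a fixed width, similar to the files that triggered the warning
example = ["40 40  9  2", "30 30 12  0"]
print(qual_to_sanger(example))
```

The same function works whether the scores are separated by one space, several spaces, or tabs.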



    • #17
      Nice catch drio, thanks. One of those really subtle things you don't catch until you work with a different set of files.

      Eugeni, sorry I didn't get back to you on this; got really crushed at work. I have uploaded a modified version of the script incorporating drio's fix.
      Attached Files
      Last edited by kmcarr; 10-22-2009, 07:22 PM.



      • #18
        Seeing as the thread has shifted from SFF to FASTQ, to the easier task of FASTA+QUAL to FASTQ, here is a Biopython solution which will work on Biopython 1.51 or later:

        Code:
        from Bio import SeqIO
        from Bio.SeqIO.QualityIO import PairedFastaQualIterator
        # Pair each FASTA record with its QUAL record and write Sanger FASTQ
        with open("example.fasta") as fasta, open("example.qual") as qual:
            records = PairedFastaQualIterator(fasta, qual)
            with open("temp.fastq", "w") as handle:  # w = write
                count = SeqIO.write(records, handle, "fastq")
        print("Converted %i records" % count)
        This example will be included in the next edition of the Biopython Tutorial. Adding simple command line parsing using sys.argv is left as an exercise for the reader
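For readers who would rather not do the exercise, one possible shape for the command line handling is sketched below. The function names (`parse_args`, `fasta_qual_to_fastq`, `main`) and the argument order are assumptions made for illustration, not part of Biopython; only `SeqIO.write` and `PairedFastaQualIterator` come from the library.

```python
import sys


def parse_args(argv):
    """Return (fasta, qual, fastq) filenames from an argv-style list."""
    if len(argv) != 4:
        raise SystemExit("Usage: %s in.fasta in.qual out.fastq" % argv[0])
    return argv[1], argv[2], argv[3]


def fasta_qual_to_fastq(fasta_name, qual_name, fastq_name):
    """Write a Sanger FASTQ file from paired FASTA and QUAL files."""
    # Imported here so the argument handling can be used without Biopython
    from Bio import SeqIO
    from Bio.SeqIO.QualityIO import PairedFastaQualIterator

    with open(fasta_name) as fasta, open(qual_name) as qual:
        records = PairedFastaQualIterator(fasta, qual)
        with open(fastq_name, "w") as out:
            return SeqIO.write(records, out, "fastq")


def main(argv):
    fasta_name, qual_name, fastq_name = parse_args(argv)
    count = fasta_qual_to_fastq(fasta_name, qual_name, fastq_name)
    print("Converted %i records" % count)


# When saved as a script, finish the file with:  main(sys.argv)
```

Saved under a name of your choosing and ended with a call to `main(sys.argv)`, it would be run with three filenames: the FASTA input, the QUAL input, and the FASTQ output.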

        A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous

        Peter



        • #19
          sff2fastq

          To whomever may be interested:

          I have recently released a program called 'sff2fastq' on github that does a direct SFF to FASTQ format conversion. 'sff2fastq' is implemented in the C language and should compile on *NIX type operating systems (Linux, BSD-type, & Mac OS X).

          The FASTQ output produced is of the Sanger FASTQ format.

          The source code & compilation instructions are available via the following github url:

          http://github.com/indraniel/sff2fastq

          If the git version control software is not available on your system please visit the following link for installation instructions:

          http://help.github.com/git-installation-redirect

          Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.

          Sincerely,
          Indraniel Das

          The Genome Center at Washington University



          • #20
            Originally posted by maubp View Post
            A future version of Biopython should also let you go directly from SFF to FASTQ (or FASTA, or QUAL, or ...) which will be much simpler. This code is already written and can be tested by the adventurous
            This will be in Biopython 1.54, due out shortly (probably April 2010), and can be tested now if you install the latest Biopython from the repository. A simple Biopython script for SFF to FASTQ would be just:
            Code:
            from Bio import SeqIO
            SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")
            Or:
            Code:
            from Bio import SeqIO
            SeqIO.convert("example.sff", "sff-trim", "trimmed.fastq", "fastq")
            Note that this does not handle paired-end SFF files, which require the reads to be analysed to look for the linker sequence. You can use sff_extract for that.



            • #21
              Originally posted by idas View Post
              I have recently released a program called 'sff2fastq' ... Any feedback about the program would be appreciated. Bug reports are very much welcomed, although I can't guarantee when they will be addressed.
              It might be useful to omit the optional repetition of the read names on the plus lines in the FASTQ output. Most tools should cope with this, and it does significantly reduce the file size.
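Until a converter offers this itself, the repeated names can be stripped afterwards. The sketch below is only an illustration: the function name `strip_plus_names` is made up, and it assumes strict four-line FASTQ records with no line wrapping. It goes by line position rather than by looking for a leading "+", because a quality line can legitimately start with "+" or "@".

```python
def strip_plus_names(lines):
    """Yield FASTQ lines with the read name removed from each '+' line.

    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    for number, line in enumerate(lines):
        # In a four-line record the third line (index 2) is the plus line.
        if number % 4 == 2:
            if not line.startswith("+"):
                raise ValueError("line %d is not a '+' line" % (number + 1))
            yield "+\n"
        else:
            yield line


record = ["@read1\n", "ACGT\n", "+read1\n", "IIII\n"]
print("".join(strip_plus_names(record)), end="")
```

Because it is a generator over lines, it can be fed an open file handle directly and will not load the whole file into memory.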



              • #22
                I need to convert a bunch of SFF files to FASTQ. I did a quick experiment to compare sff2fastq and sff_extract:
                ∘ picked a random SFF file from my data set: size 2.2G, 662933 reads (after conversion)
                ∘ sff_extract took > 270 sec and wrote FASTA and QUAL to separate files, with quality values as numbers, not ASCII
                ∘ sff2fastq took 50 sec
                ∘ sff2fastq outputs trimmed reads by default. There is an option to output untrimmed reads. Trimmed reads are about half the length of untrimmed reads.
                ∘ sff_extract outputs untrimmed reads by default, which exactly match the untrimmed output of sff2fastq.

                I think I'm going to use sff2fastq. A question to its author: what are the criteria used to trim reads? Thanks.



                • #23
                  Originally posted by nt2010 View Post
                  ∘ sff_extract took > 270sec, output fasta and qual in separate files, quals in number not ASCII
                  ∘ sff2fastq took 50 sec
                  sff_extract defaults to FASTA + QUAL. To get FASTQ just add "-Q" to the command line.

                  sff2fastq is in C, so a 5 to 1 ratio in runtime is not too bad. Also, be careful with paired-end reads if you have them: sff_extract has a pipeline that extracts them as one would expect, whereas sequences from sff2fastq you will need to post-process (i.e. split at the right place) yourself.

                  Originally posted by nt2010 View Post
                  A question to its author: what are the criteria to trim reads? Thanks.
                  I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

                  B.



                  • #24
                    Apologies about the delayed response.

                    Originally posted by BaCh View Post
                    I would expect sff2fastq to work exactly like sff_extract: by using the trim information in the reads within the SFF. But then again I might be totally wrong.

                    B.
                    Yes, the above is correct. sff2fastq uses the trim information embedded within the SFF file itself to display the reads.

                    sff2fastq is designed to have similar functionality to the 454 tools (like sffinfo) produced by 454/Roche. sffinfo outputs trimmed reads by default.

                    The '-n' option of sff2fastq (similar to sffinfo) bypasses the trim information encoded within the SFF file and just displays the full raw read data directly.

                    To find out more about the original trimming information encoded within the SFF file, please look at the Data Analysis Software Manual produced by 454. One version of it is available at the following link:

                    http://sequence.otago.ac.nz/download...are_Manual.pdf

                    Some trimming occurs in the signal processing step of the GS Run Processor application that performs the original base calling from the raw images acquired from the 454 instrument. It trims read ends for low quality and primer sequence (see sections 3.2 and 3.2.2 in the above manual for the details about this process).

                    The format of the trim information encoded within the SFF file is also described in the above manual, in section 13.3.8.2.
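As a rough illustration of how such trim information can be applied, here is a minimal Python sketch. It is not taken from the sff2fastq source. It assumes the four clip fields as I understand them from the SFF format description (clip_qual_left, clip_qual_right, clip_adapter_left, clip_adapter_right), each a 1-based position where 0 means "not set"; please check the manual before relying on it. The function name `trimmed_span` is made up.

```python
def trimmed_span(number_of_bases, clip_qual_left, clip_qual_right,
                 clip_adapter_left, clip_adapter_right):
    """Return (start, end) as a 0-based, end-exclusive slice of the read.

    The four clip values are assumed to be 1-based positions as stored in
    the SFF read header; 0 means "no clip point set".
    """
    # The first kept base is the larger of the two left clip points
    first = max(1, clip_qual_left, clip_adapter_left)
    # The last kept base is the smaller of the two right clip points,
    # treating an unset (0) value as the end of the read
    last = min(clip_qual_right or number_of_bases,
               clip_adapter_right or number_of_bases)
    # Convert 1-based inclusive [first, last] to a Python slice
    return first - 1, last


bases = "tcagACGTACGTACGTnnnn"
start, end = trimmed_span(len(bases), 5, 16, 0, 0)
print(bases[start:end])
```

With all four clip values set to 0 the slice covers the whole read, which corresponds to the untrimmed output.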

                    Does this clarify your question about sff2fastq?



                    • #25
                      Originally posted by kmcarr View Post
                      Yes, tis true that the output from sffinfo or sff_extract will have the FASTA and QUAL file entries in the same order. If you can always count on that then by all means design your script around that.

                      The sequences were run through the SeqClean cleaning & trimming pipeline first (http://compbio.dfci.harvard.edu/tgi/software/). The final, cleaned FASTA and QUAL files are not matched in terms of order.
                      Sorry for only just seeing this, but the cln2qual script that comes with SeqClean should trim the qual file using the report and take care of that problem.
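If the FASTA and QUAL files still end up in different orders, the entries can be matched by read ID instead of by position. The following is a minimal sketch under several assumptions: both files fit in memory, every header line has an ID as its first word, and Python 3.7 or later is used so that dictionaries keep insertion order. The function names `read_records` and `pair_by_id` are made up for illustration.

```python
def read_records(lines):
    """Parse FASTA-style lines into {read_id: [body lines]}."""
    records = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The ID is the first word after '>'
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None and line:
            records[current].append(line)
    return records


def pair_by_id(fasta_lines, qual_lines):
    """Yield (id, sequence, scores) in FASTA order, matching QUAL by ID."""
    quals = read_records(qual_lines)
    for read_id, seq_lines in read_records(fasta_lines).items():
        if read_id not in quals:
            continue  # read has no quality entry; skipped in this sketch
        scores = [int(tok) for tok in " ".join(quals[read_id]).split()]
        yield read_id, "".join(seq_lines), scores
```

Note that this only fixes the ordering. It does not check that the number of scores matches the trimmed sequence length, which is the part cln2qual is meant to handle.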



                      • #26
                        Thanks BaCh and idas for your answers. All clear.

                        I'm not sure if I should continue here or start another thread. My question is that some of the trimmed reads output by the converter(s) can still be very long, with low quality at the end (Phred ~ 10). Should I trim them further, or is it acceptable to keep them, since 454 works differently from Illumina?
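Leaving aside whether further trimming is advisable for 454 data, the mechanics are simple. Below is a minimal sketch that drops trailing bases below a Phred threshold from a Sanger-encoded read. The function name `trim_low_quality_tail` and the default threshold of 20 are arbitrary choices for illustration, not a recommendation.

```python
def trim_low_quality_tail(seq, qual, min_phred=20, offset=33):
    """Drop trailing bases whose Phred score is below min_phred.

    seq and qual are the sequence and quality lines of one FASTQ record;
    offset is 33 for Sanger-style FASTQ.
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Phred 40 and '+' encodes Phred 10 with the Sanger offset
print(trim_low_quality_tail("ACGTAC", "IIII++"))
```

This only removes a contiguous low-quality tail. A single good base near the end stops the trimming, so a windowed approach may suit noisy reads better.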



                        • #27
                          Thanks for the benchmarks! What machine was used for this? I've written a program (flower - http://blog.malde.org/index.php/flower) to extract various information from SFF files, including Fasta and (Illumina or Sanger style) FastQ. It takes about 20 seconds to convert a 2.1G SFF to FastQ, but this is on a beefy server (Xeon 3.4GHz), so it's probably not directly comparable. Nice to see that we're in the same league, at least.



                          • #28
                            Thanks... the script worked for me with a few minor alterations.



                            • #29
                              Using Biopieces you can do:

                              Code:
                              read_sff -i data.sff | write_fastq -o data.fq -x
                              or

                              Code:
                              read_sff -i data.sff | write_454 -o data.fna -q data.fna.qual -x
                              or both in one go:

                              Code:
                              read_sff -i data.sff | write_fastq -o data.fq | write_454 -o data.fna -q data.fna.qual -x



                              • #30
                                Error on Fastq convert

                                Hi,
                                I tried the FASTQ conversion in Biopython:

                                from Bio import SeqIO
                                SeqIO.convert("example.sff", "sff", "untrimmed.fastq", "fastq")

                                (I used my sff file though)

                                and I received this error:

                                File "/usr/lib/pymodules/python2.7/Bio/SeqIO/SffIO.py", line 258, in _sff_file_header
                                raise ValueError("Empty file.")
                                ValueError: Empty file.

                                Does this mean that there is an open line in the sff file? Any thoughts?

                                Thanks,
                                Louis
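The traceback alone does not say what went wrong, but the message suggests that nothing could be read from the start of the file, so a quick sanity check on the input may help narrow it down. The sketch below is an assumption-laden illustration: the function name `describe_sff` is made up, and it relies on SFF files beginning with the four bytes ".sff".

```python
import os


def describe_sff(path):
    """Return a short diagnosis of whether path looks like an SFF file."""
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as handle:  # SFF is binary, so read it as bytes
        magic = handle.read(4)
    # SFF files are expected to start with the four bytes ".sff"
    return "looks like SFF" if magic == b".sff" else "not SFF"
```

If this reports "empty", the file was probably truncated or never written, for example by a failed download or copy. If it reports "not SFF", the file may be compressed or in another format.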
