extracting reads from a large fastq file

  • extracting reads from a large fastq file

    Hi all,

    So, I'd like to extract the reads that produced BLAST hits meeting a certain threshold. It appears the NCBI BLAST output makes it rather easy to get the sequence identifiers I need, but the output files do not include the actual sequences in an easy-to-use format.

    Any suggestions? I've been searching around for a tool to do this but haven't found anything yet.
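    To be concrete, what I'm hoping not to hand-roll every time looks roughly like the sketch below (function names are hypothetical, and it assumes plain unwrapped four-line FASTQ records, with the wanted IDs taken from the query column of tabular BLAST output):

```python
def filter_fastq(fastq_lines, wanted_ids):
    """Yield four-line FASTQ records whose read ID is in wanted_ids.

    Assumes unwrapped FASTQ (exactly four lines per record); the ID is
    the header token before the first space, without the leading '@'.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            read_id = record[0][1:].split()[0]
            if read_id in wanted_ids:
                yield record
            record = []

# Toy example: keep only read2. In practice the ID set would come from
# the first column of tabular BLAST output, filtered by score/e-value.
reads = [
    "@read1 len=4", "ACGT", "+", "IIII",
    "@read2 len=4", "TTTT", "+", "IIII",
]
for rec in filter_fastq(reads, {"read2"}):
    print("\n".join(rec))
```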

  • #2
    The mira assembler:

    http://sourceforge.net/apps/mediawik...itle=Main_Page

    comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

    J



    • #3
      Thanks, I'll give that a try.



      • #4
        That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.



        • #5
          I'm sure Simon Anders will get around to answering with this same answer, but HTSeq has a FASTQ reader that simplifies this process (assuming you're comfortable with Python).

          Check out:
          http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html



          • #6
            Originally posted by themerlin View Post
            The mira assembler:

            http://sourceforge.net/apps/mediawik...itle=Main_Page

            comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

            J
            For those of you using Galaxy, I have two Python scripts available on the Galaxy Tool Shed, http://toolshed.g2.bx.psu.edu/

            seq_filter_by_id - Filter sequences by ID
            seq_select_by_id - Select sequences by ID

            They both work with FASTA, FASTQ and SFF files. The first makes a single pass through the sequence file and extracts reads with a requested ID, in the order they appear in the sequence file. The second is slower because it indexes the file first, then extracts the requested IDs in the order of the ID file.
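            The index-then-extract approach behind seq_select_by_id can be sketched in plain Python (this is not the actual tool's code; the names are hypothetical and it assumes unwrapped four-line FASTQ):

```python
import io

def build_fastq_index(handle):
    """Map read ID -> byte offset of its record (four-line FASTQ assumed)."""
    index = {}
    offset = handle.tell()
    while True:
        header = handle.readline()
        if not header:
            break
        index[header[1:].split()[0]] = offset
        for _ in range(3):          # skip sequence, '+', and quality lines
            handle.readline()
        offset = handle.tell()
    return index

def fetch_record(handle, index, read_id):
    """Random-access one record by ID, in whatever order the caller asks."""
    handle.seek(index[read_id])
    return [handle.readline().rstrip("\n") for _ in range(4)]

data = io.StringIO("@a\nAC\n+\nII\n@b\nGT\n+\nII\n")
idx = build_fastq_index(data)
print(fetch_record(data, idx, "b"))   # -> ['@b', 'GT', '+', 'II']
```

            The index keeps only offsets in memory, so the sequences themselves are never loaded until requested.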



            • #7
              Originally posted by Wallysb01 View Post
              That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.
              You have 1.4 billion reads in one file? And you used this with BLAST?

              (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

              Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.



              • #8
                Originally posted by SES View Post
                You have 1.4 billion reads in one file? And you used this with BLAST?
                Well, it was actually the fastest thing to do. Assembly is going to take a while.


                Thanks for the other suggestions though.



                • #9
                  Originally posted by SES View Post
                  You have 1.4 billion reads in one file? And you used this with BLAST?

                  (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

                  Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.
                  I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                  You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later, which uses an SQLite database to index sequence files and gives random access to the sequences by ID.

                  Biopython also offers similar functionality with Bio.SeqIO.index(...), but with an in-memory index, which would need a machine with a lot of RAM for 1.4 billion reads!
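                  The reason the SQLite variant scales is that the ID-to-offset table itself lives on disk rather than in RAM. A minimal sketch of that idea with only the stdlib sqlite3 module (not Biopython's actual implementation; the names are hypothetical and four-line FASTQ is assumed):

```python
import io
import sqlite3

def index_to_sqlite(handle, db_path=":memory:"):
    """Store read-ID -> byte-offset pairs in an SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, pos INTEGER)")
    pos = handle.tell()
    while True:
        header = handle.readline()
        if not header:
            break
        con.execute("INSERT INTO offsets VALUES (?, ?)",
                    (header[1:].split()[0], pos))
        for _ in range(3):          # skip sequence, '+', and quality lines
            handle.readline()
        pos = handle.tell()
    con.commit()
    return con

def get_record(handle, con, read_id):
    """Look up one record's offset in SQLite and seek to it."""
    (pos,) = con.execute("SELECT pos FROM offsets WHERE id = ?",
                         (read_id,)).fetchone()
    handle.seek(pos)
    return [handle.readline().rstrip("\n") for _ in range(4)]

data = io.StringIO("@a\nAC\n+\nII\n@b\nGT\n+\nII\n")
con = index_to_sqlite(data)
print(get_record(data, con, "b"))   # -> ['@b', 'GT', '+', 'II']
```

                  With a real file you would pass a path for db_path so the index is built once and reused across runs.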



                  • #10
                    If the -o T parameter was used when the BLAST database was created with formatdb, then the old stalwart fastacmd can be used to pull the desired reads from the database.

                    That said, cdbfasta works great with both fastq and fasta files.

                    --
                    Phillip



                    • #11
                      Originally posted by pmiguel View Post
                      If the -o T parameter was used during the creation of the blast database using formatdb then the old stalwart, fastacmd can be used to pull the desired read from the database.
                      Yes, if he wants FASTA files.

                      But the original question did say FASTQ not FASTA (but it is easy to type the wrong name...)



                      • #12
                        Originally posted by maubp View Post
                        I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                        You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later which uses an SQLite database to index sequence files and give random access to the sequences by ID.
                        The module I used is Bio::Index::Fastq, and I don't think it is designed for NGS data, based on my own experience and some conversations I have read on the BioPerl listserv. The Biopython module you describe sounds functionally similar to Bio::DB::Fasta in BioPerl, but I don't think there is such a module with FASTQ support. Am I wrong? It would be great if such a thing existed in BioPerl, because unfortunately I'm not a Python coder.



                        • #13
                          Originally posted by SES View Post
                          The module I used is Bio::Index::Fastq and I don't think it is designed for NGS data based on my own experience and some conversations I have read on the BioPerl listserv.
                          I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:
                          http://lists.open-bio.org/pipermail/...ry/034530.html

                          The Biopython module you describe sounds functionally similar to Bio::DB::Fasta in BioPerl, but I don't think there is such a module with FASTQ support. Am I wrong? It would be great if such a thing existed in BioPerl, because unfortunately I'm not a Python coder.
                          According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.



                          • #14
                            Originally posted by maubp View Post
                            I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:
                            http://lists.open-bio.org/pipermail/...ry/034530.html


                            According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.
                            Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.



                            • #15
                              Originally posted by SES View Post
                              Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.
                              Sign up to the BioPerl mailing list and talk to them - it may encourage someone to work on it, especially if you offer to help test it.
