Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Wallysb01
    Senior Member
    • Feb 2011
    • 286

    extracting reads from a large fastq file

    Hi all,

    So, I'd like to extract the reads that came out of a blast result that met a certain threshold. It appears the NCBI blast output make it rather easy to get the sequence identifiers I need, but the out put files do not include the actual sequences in a very easy to use format.

    Any suggestions? I've been searching around for a tool to do this but haven't found anything yet.
  • themerlin
    Member
    • Feb 2010
    • 51

    #2
    The mira assembler:

    Download MIRA for free. MIRA V5 is available only on GitHub! The V4 version released here on SourceForge stay up as some automated release fetching packages rely on V4.


    comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

    J

    Comment

    • Wallysb01
      Senior Member
      • Feb 2011
      • 286

      #3
      Thanks, I'll give that a try.

      Comment

      • Wallysb01
        Senior Member
        • Feb 2011
        • 286

        #4
        That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.

        Comment

        • jbrwn
          Member
          • Mar 2011
          • 37

          #5
          i'm sure Simon Anders will get around to answering with this same answer, but HTSeq has a fastq reader that simplifies this process (assuming you're comfortable with python).

          check out:
          Code:
          http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by themerlin View Post
            The mira assembler:

            Download MIRA for free. MIRA V5 is available only on GitHub! The V4 version released here on SourceForge stay up as some automated release fetching packages rely on V4.


            comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

            J
            For those of you using Galaxy, I have two python scripts to do available on the Galaxy Tool Shed, http://toolshed.g2.bx.psu.edu/

            seq_filter_by_id - Filter sequences by ID
            seq_select_by_id - Select sequences by ID

            They both work with FASTA, FASTQ and SFF files. The first makes a single pass though the sequence files, and extracts reads with a requested ID in the order found in the sequence file. The second is slower because it indexes the file first then extracts the requested IDs in the order of the ID file.

            Comment

            • SES
              Senior Member
              • Mar 2010
              • 275

              #7
              Originally posted by Wallysb01 View Post
              That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.
              You have 1.4 billion reads in one file? And you used this with BLAST?

              (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

              Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.

              Comment

              • Wallysb01
                Senior Member
                • Feb 2011
                • 286

                #8
                Originally posted by SES View Post
                You have 1.4 billion reads in one file? And you used this with BLAST?
                Well, it was actually the fastest thing to do. Assembly is going to take a while.


                Thanks for the other suggestions though.

                Comment

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  Originally posted by SES View Post
                  You have 1.4 billion reads in one file? And you used this with BLAST?

                  (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

                  Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.
                  I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                  You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later which uses an SQLite database to index sequence files and give random access to the sequences by ID.

                  Biopython also offers similar functionality with Bio.SeqIO.index(...) but with an in memory index, which would need a fairly big RAM machine with 1.4 billion reads!

                  Comment

                  • pmiguel
                    Senior Member
                    • Aug 2008
                    • 2328

                    #10
                    If the -o T parameter was used during the creation of the blast database using formatdb then the old stalwart, fastacmd can be used to pull the desired read from the database.

                    That said, cdbfasta works great with both fastq and fasta files.

                    --
                    Phillip

                    Comment

                    • maubp
                      Peter (Biopython etc)
                      • Jul 2009
                      • 1544

                      #11
                      Originally posted by pmiguel View Post
                      If the -o T parameter was used during the creation of the blast database using formatdb then the old stalwart, fastacmd can be used to pull the desired read from the database.
                      Yes, if he wants FASTA files.

                      But the original question did say FASTQ not FASTA (but it is easy to type the wrong name...)

                      Comment

                      • SES
                        Senior Member
                        • Mar 2010
                        • 275

                        #12
                        Originally posted by maubp View Post
                        I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                        You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later which uses an SQLite database to index sequence files and give random access to the sequences by ID.
                        The module I used is Bio::Index::Fastq and I don't think it is designed for NGS data based on my own experience and some conversations I have read on the BioPerl listserv. The Biopython module you describe sounds like it is functionally similar to Bio::[D]B::Fasta (ignore brackets) in BioPerl but I don't think there is such a module with Fastq support. Am I wrong? That would be great if such a thing existed in BioPerl because I'm not a Python coder unfortunately.

                        Comment

                        • maubp
                          Peter (Biopython etc)
                          • Jul 2009
                          • 1544

                          #13
                          Originally posted by SES View Post
                          The module I used is Bio::Index::Fastq and I don't think it is designed for NGS data based on my own experience and some conversations I have read on the BioPerl listserv.
                          I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:


                          The Biopython module you describe sounds like it is functionally similar to Bio::[D]B::Fasta (ignore brackets) in BioPerl but I don't think there is such a module with Fastq support. Am I wrong? That would be great if such a thing existed in BioPerl because I'm not a Python coder unfortunately
                          According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.

                          Comment

                          • SES
                            Senior Member
                            • Mar 2010
                            • 275

                            #14
                            Originally posted by maubp View Post
                            I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:



                            According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.
                            Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.

                            Comment

                            • maubp
                              Peter (Biopython etc)
                              • Jul 2009
                              • 1544

                              #15
                              Originally posted by SES View Post
                              Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.
                              Sign up to the BioPerl mailing list and talk to them - it may encourage someone to work on it, especially if you offer to help test it

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Pathogen Surveillance with Advanced Genomic Tools
                                by seqadmin




                                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                                Today, 11:48 AM
                              • seqadmin
                                New Genomics Tools and Methods Shared at AGBT 2025
                                by seqadmin


                                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                                The Headliner
                                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                                03-03-2025, 01:39 PM
                              • seqadmin
                                Investigating the Gut Microbiome Through Diet and Spatial Biology
                                by seqadmin




                                The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                                02-24-2025, 06:31 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-20-2025, 05:03 AM
                              0 responses
                              26 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-19-2025, 07:27 AM
                              0 responses
                              33 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-18-2025, 12:50 PM
                              0 responses
                              25 views
                              0 reactions
                              Last Post seqadmin  
                              Started by seqadmin, 03-03-2025, 01:15 PM
                              0 responses
                              190 views
                              0 reactions
                              Last Post seqadmin  
                              Working...