extracting reads from a large fastq file

  • extracting reads from a large fastq file

    Hi all,

    So, I'd like to extract the reads that produced BLAST hits meeting a certain threshold. It appears the NCBI BLAST output makes it rather easy to get the sequence identifiers I need, but the output files do not include the actual sequences in an easy-to-use format.

    Any suggestions? I've been searching around for a tool to do this but haven't found anything yet.
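    To be concrete, what I'm hoping not to hand-roll every time looks roughly like the sketch below (function names are hypothetical, and it assumes plain unwrapped four-line FASTQ records, with the wanted IDs taken from the query column of tabular BLAST output):

```python
def filter_fastq(fastq_lines, wanted_ids):
    """Yield four-line FASTQ records whose read ID is in wanted_ids.

    Assumes unwrapped FASTQ (exactly four lines per record); the ID is
    the header token before the first space, without the leading '@'.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            read_id = record[0][1:].split()[0]
            if read_id in wanted_ids:
                yield record
            record = []

# Toy example: keep only read2. In practice the ID set would come from
# the first column of tabular BLAST output, filtered by score/e-value.
reads = [
    "@read1 len=4", "ACGT", "+", "IIII",
    "@read2 len=4", "TTTT", "+", "IIII",
]
for rec in filter_fastq(reads, {"read2"}):
    print("\n".join(rec))
```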

  • #2
    The mira assembler:

    http://sourceforge.net/apps/mediawik...itle=Main_Page

    comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

    J



    • #3
      Thanks, I'll give that a try.



      • #4
        That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.



        • #5
          I'm sure Simon Anders will get around to answering with this same answer, but HTSeq has a FASTQ reader that simplifies this process (assuming you're comfortable with Python).

          Check out:
          http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html



          • #6
            Originally posted by themerlin View Post
            The mira assembler:

            http://sourceforge.net/apps/mediawik...itle=Main_Page

            comes with a script called fastqselect.tcl. The script takes a name file and pulls out your sequences of interest. I have used biopython scripts to do the same thing, but this fastqselect tool is much faster.

            J
            For those of you using Galaxy, I have two Python scripts available on the Galaxy Tool Shed, http://toolshed.g2.bx.psu.edu/

            seq_filter_by_id - Filter sequences by ID
            seq_select_by_id - Select sequences by ID

            They both work with FASTA, FASTQ and SFF files. The first makes a single pass through the sequence file and extracts reads with a requested ID, in the order they appear in the sequence file. The second is slower because it indexes the file first, then extracts the requested IDs in the order of the ID file.
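            The index-then-extract approach behind seq_select_by_id can be sketched in plain Python (this is not the actual tool's code; the names are hypothetical and it assumes unwrapped four-line FASTQ):

```python
import io

def build_fastq_index(handle):
    """Map read ID -> byte offset of its record (four-line FASTQ assumed)."""
    index = {}
    offset = handle.tell()
    while True:
        header = handle.readline()
        if not header:
            break
        index[header[1:].split()[0]] = offset
        for _ in range(3):          # skip sequence, '+', and quality lines
            handle.readline()
        offset = handle.tell()
    return index

def fetch_record(handle, index, read_id):
    """Random-access one record by ID, in whatever order the caller asks."""
    handle.seek(index[read_id])
    return [handle.readline().rstrip("\n") for _ in range(4)]

data = io.StringIO("@a\nAC\n+\nII\n@b\nGT\n+\nII\n")
idx = build_fastq_index(data)
print(fetch_record(data, idx, "b"))   # -> ['@b', 'GT', '+', 'II']
```

            The index keeps only offsets in memory, so the sequences themselves are never loaded until requested.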



            • #7
              Originally posted by Wallysb01 View Post
              That worked great. And covered some 1.4 billion reads in just a half hour or so. Thanks again.
              You have 1.4 billion reads in one file? And you used this with BLAST?

              (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

              Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.



              • #8
                Originally posted by SES View Post
                You have 1.4 billion reads in one file? And you used this with BLAST?
                Well, it was actually the fastest thing to do. Assembly is going to take a while.


                Thanks for the other suggestions though.



                • #9
                  Originally posted by SES View Post
                  You have 1.4 billion reads in one file? And you used this with BLAST?

                  (I don't know what you are trying to do but splitting the data, if possible, will speed up any procedure)

                  Anyway, these all sound like great solutions, but I would like to point out that cdbfasta has the -Q option to index fastq files and cdbyank can be used to pull the requested ID or IDs from a list. I have not used these other tools but I have tried BioPerl's Fastq indexing method and SeqIO module for pulling Fastq entries and it became clear to me that these were just not practical solutions for the size of modern NGS sequence files. cdbfasta will probably be the fastest solution for pulling reads, but like any indexing method, you have to create the index. I don't know what is best for your application but it looks like you have some options.
                  I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                  You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later, which uses an SQLite database to index sequence files and gives random access to the sequences by ID.

                  Biopython also offers similar functionality with Bio.SeqIO.index(...), but with an in-memory index, which would need a machine with a lot of RAM for 1.4 billion reads!
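                  The reason the SQLite variant scales is that the ID-to-offset table itself lives on disk rather than in RAM. A minimal sketch of that idea with only the stdlib sqlite3 module (not Biopython's actual implementation; the names are hypothetical and four-line FASTQ is assumed):

```python
import io
import sqlite3

def index_to_sqlite(handle, db_path=":memory:"):
    """Store read-ID -> byte-offset pairs in an SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, pos INTEGER)")
    pos = handle.tell()
    while True:
        header = handle.readline()
        if not header:
            break
        con.execute("INSERT INTO offsets VALUES (?, ?)",
                    (header[1:].split()[0], pos))
        for _ in range(3):          # skip sequence, '+', and quality lines
            handle.readline()
        pos = handle.tell()
    con.commit()
    return con

def get_record(handle, con, read_id):
    """Look up one record's offset in SQLite and seek to it."""
    (pos,) = con.execute("SELECT pos FROM offsets WHERE id = ?",
                         (read_id,)).fetchone()
    handle.seek(pos)
    return [handle.readline().rstrip("\n") for _ in range(4)]

data = io.StringIO("@a\nAC\n+\nII\n@b\nGT\n+\nII\n")
con = index_to_sqlite(data)
print(get_record(data, con, "b"))   # -> ['@b', 'GT', '+', 'II']
```

                  With a real file you would pass a path for db_path so the index is built once and reused across runs.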



                  • #10
                    If the -o T parameter was used when the BLAST database was created with formatdb, then the old stalwart fastacmd can be used to pull the desired reads from the database.

                    That said, cdbfasta works great with both fastq and fasta files.

                    --
                    Phillip



                    • #11
                      Originally posted by pmiguel View Post
                      If the -o T parameter was used during the creation of the blast database using formatdb then the old stalwart, fastacmd can be used to pull the desired read from the database.
                      Yes, if he wants FASTA files.

                      But the original question did say FASTQ not FASTA (but it is easy to type the wrong name...)



                      • #12
                        Originally posted by maubp View Post
                        I think BioPerl offers more than one back end for indexing sequences... not sure which is fastest for this size of data.

                        You could also try the Bio.SeqIO.index_db(...) function in Biopython 1.57 or later which uses an SQLite database to index sequence files and give random access to the sequences by ID.
                        The module I used is Bio::Index::Fastq, and I don't think it is designed for NGS data, based on my own experience and some conversations I have read on the BioPerl listserv. The Biopython module you describe sounds functionally similar to Bio::DB::Fasta in BioPerl, but I don't think there is such a module with FASTQ support. Am I wrong? It would be great if such a thing existed in BioPerl, because unfortunately I'm not a Python coder.



                        • #13
                          Originally posted by SES View Post
                          The module I used is Bio::Index::Fastq and I don't think it is designed for NGS data based on my own experience and some conversations I have read on the BioPerl listserv.
                          I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:
                          http://lists.open-bio.org/pipermail/...ry/034530.html

                          The Biopython module you describe sounds functionally similar to Bio::DB::Fasta in BioPerl, but I don't think there is such a module with FASTQ support. Am I wrong? It would be great if such a thing existed in BioPerl, because unfortunately I'm not a Python coder.
                          According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.



                          • #14
                            Originally posted by maubp View Post
                            I have read that Bio::Index::Fastq uses DB_File, a Perl interface to the Berkeley DB (BDB), and that this does not scale well with NGS data:
                            http://lists.open-bio.org/pipermail/...ry/034530.html


                            According to the email I linked to above from Chris Fields, BioPerl folk have discussed an SQLite back end using the Perl interface AnyDBM_File - but I don't know that anything has been done yet.
                            Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.



                            • #15
                              Originally posted by SES View Post
                              Thanks, that is informative and consistent with my experience. Though it would be nice to have this functionality in BioPerl, there are numerous good solutions out there. I don't know how difficult this would be to implement but it does seem difficult to justify spending time on given all the solutions listed here.
                              Sign up to the BioPerl mailing list and talk to them - it may encourage someone to work on it, especially if you offer to help test it.
