Seqanswers Leaderboard Ad

**jgibbons1** · 01-06-2010, 09:30 AM

I think I've figured out a roundabout way of solving this problem using grep commands.

grep headers form my fasta file:

>READ_123
>READ_456
>READ_789
>READ_246

use find and replace in VI editor to create new file containing grep commands:

grep "READ_123" TOTAL_READS.fastq" -m 2 -A 1 > MyReads.fastq
grep "READ_456" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq
grep "READ_789" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq
grep "READ_246" TOTAL_READS.fastq" -m 2 -A 1 >> MyReads.fastq

-m will take the first 2 occurrences of the search item (ie. READ_123) because fastq files have one header for the base quality score and one header for the sequence

-A 1 will take one line directly beneath the search item

The end product is the fastq file of MyReads

@READ_123
NGGATTTTTCACTCTACATCCAATANNNNNNNNNNC
+READ_123
DMWXYXYZYY[[YYXY[BBBBBBBBBBBBBBBBBBB
@READ_456
NCCTCTCCTCAAACATCTTCTCCGANNNNNNNNNNT
+READ_456
DLUURWUYYXWSSXBBBBBBBBBBBBBBBBBBBBBB
@READ_789
NGCAGTTCTTGTTCCGCGGATCGTGNNNNNNNNNNT
+READ_789
DHQNKNWWWSNBBBBBBBBBBBBBBBBBBBBBBBBB
@READ_246
NTGCTTTTCATCTTTCGATCACTCTNNNNNNNNNNT
+READ_246
DOPTXXXXXYYYYYYBBBBBBBBBBBBBBBBBBBBB

**maubp** · 01-06-2010, 09:33 AM

So you have an 18 million read FASTQ file, and a 6 million read FASTA file, and you want to make a 6 million read FASTQ file containing all those reads in the FASTA file?

Personally I'd do this in Biopython, but given the number of records I'd be cautious about indexing an 18 million record file in memory. Instead I'd use Bio.SeqIO.parse to loop over all the FASTA entries and record their identifiers in memory (I'd use a Python set here). Keeping just 6 million short strings in memory shouldn't be a problem. Then loop over the FASTQ file, saving only those records with a desired identifier.

Something like this should work in Biopython 1.51 or later:

Code:

from Bio import SeqIO

#This is a memory efficient list comprehension,
#only the record identifiers are all kept in memory
wanted_ids = set(record.id for record in \
                 SeqIO.parse(open("selected.fasta"), "fasta"))
print "%i unique identifiers in the FASTA file" % len(wanted_ids)

#This is a memory efficient list comprehension,
#only one record is in memory at a time
wanted = (record for record in \
          SeqIO.parse(open("everything.fastq"), "fastq") \
          if record.id in wanted_ids)

handle = open("selected.fastq", "w") #output file
count = SeqIO.write(wanted, handle, "fastq")
handle.close()

print "%i records saved to FASTQ file" % count

This can be made faster if need be, using the ideas outlined here:

OBF » Working with FASTQ files in Biopython when speed matters » Working with FASTQ files in Biopython when speed matters

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Open Bioinformatics Foundation Homepage

**jgibbons1** · 01-06-2010, 09:35 AM

Thanks so much for the insight!

I'll give this a shot.

**maubp** · 01-06-2010, 09:37 AM

Originally posted by jgibbons1 View Post

I think I've figured out a roundabout way of solving this problem using grep commands.

That's very inventive, although it means creating a 6 million line grep script

[It also makes some assumptions about the FASTQ files, like no line wrapping and the inclusion of the optional second identifier on the plus lines - but it would get the job done]

But seriously - learning a scripting language would make this sort of thing much easier (e.g. Python, Perl, ...)

**jgibbons1** · 01-06-2010, 09:43 AM

Originally posted by maubp View Post

But seriously - learning a scripting language would make this sort of thing much easier (e.g. Python, Perl, ...)

I couldn't agree with you more...learning some perl is on my 2010 to do list!

**jgibbons1** · 01-08-2010, 08:03 AM

In the rare case that someone is trying to do the same thing as me...here is a handy perl script that takes a fastq file and a headers file and outputs a fastq file from the headers

#!/usr/local/bin/perl
use strict;

####### Usage: perl parsefastq.pl fastq_file headers_file > output_file

open(FILE, $ARGV[0]) or die "Fastaq file not found";

my @fastqarray = <FILE>;

close FILE;

open(FILE2, $ARGV[1]) or die "Headers file not found";

my @headers = <FILE2>;

close FILE2;

my @backfastqarray = @fastqarray;
my $size = @fastqarray;
my %fastqhash=();
my @fastqstringarray;
my @greparray;

for(my $i=0;$i<$size;$i++) {

my @fastqstringa = splice(@fastqarray,0,4);
push(@fastqstringarray,[@fastqstringa]);

}

@greparray = grep(/@/, @backfastqarray);

my $size2 = @greparray;

for(my $i=0;$i<$size2;$i++) {
my $key = $greparray[$i];
$fastqhash{$key} = "@{$fastqstringarray[$i]}";

}

foreach my $element(@headers) {

print $fastqhash{$element};

}

**gpertea** · 01-08-2010, 11:50 AM

Just FYI, related to this topic, there is another FASTA indexing tool that also supports FASTQ file indexing and querying of random records based on read ID. It's the cdbfasta/cdbyank suite that can be found on this page:
http://compbio.dfci.harvard.edu/tgi/software/
(direct link to the source package: ftp://occams.dfci.harvard.edu/pub/bi...dbfasta.tar.gz - don't use the cdbfasta binaries on that site, they are very old and the -Q option wasn't implemented)

cdbfasta is the indexing program, typically used for FASTA files but with the -Q option it can index FASTQ as well.
Example usage for fastq files would be:

Code:

cdbfasta -Q reads.fq

.. and this will create the index file reads.fq.cidx.

cdbyank is the query program (like xdget) for the index file previously created with cdbfasta, and it can take a list of read IDs at stdin, e.g.:

Code:

cdbyank reads.fq.cidx < selreads.txt > selreads.fq

.. and selreads.fq will then have the FASTQ records whose names were given in the selreads.txt file.

Topics	Statistics	Last Post
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, Today, 07:17 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 20 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM

Seqanswers Leaderboard Ad

Announcement

fastq xdget? Mosaik issue

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News