
**kmcarr** · 03-23-2011, 12:58 PM

You are re-opening the output file every time you read a new entry from your database file (the outermost while loop), which basically reinitializes (truncates) the file on each pass. You need to move the open/close statements for the output file outside your main program loop.

Also, your approach is not the most efficient. You are scanning through your entire FASTA file for each record you want to save. A better approach is to read in your search list and store it in a hash. Then go through your FASTA file just once, comparing each record's ID to your list (hash). If the ID is in your hash, write the sequence to your output file. If the ID is not in your list, move on to the next record.
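A minimal sketch of the single-pass approach described above, written in Python rather than the Perl of the original script (which is not shown in this thread). A Python set plays the role of the hash. The function name `filter_fasta`, the file arguments, and the rule that the record ID is the first whitespace-delimited token after `>` are all assumptions made for illustration.

```python
def filter_fasta(ids_path, fasta_path, out_path):
    """Copy records whose ID is listed in ids_path; return how many were kept."""
    # Load the search list once; a set gives constant-time membership tests.
    with open(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}

    kept = 0
    keep = False
    # The output file is opened once, outside the loop, so it is never truncated
    # in the middle of the run.
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                # Assumption: the ID is the first token on the definition line.
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in wanted
                if keep:
                    kept += 1
            # Sequence lines inherit the decision made at their definition line,
            # so records of any length are copied in full.
            if keep:
                out.write(line)
    return kept
```

Because the decision is made once per definition line and then applied to every following line, it does not matter how many lines of sequence each record has.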

**chalex** · 03-23-2011, 01:40 PM

So, uh, wouldn't you just do

Code:

grep -F -f file1 -A 1 file2

where file1 is your search terms and file2 is the full FASTA file?

See the man page for GNU grep for more info.

**kmcarr** · 03-23-2011, 04:01 PM

Originally posted by chalex:

So, uh, wouldn't you just do

Code:

grep -F -f file1 -A 1 file2

where file1 is your search terms and file2 is the full FASTA file?

See the man page for GNU grep for more info.

Your method only copies the first line of sequence after the definition line. There is no way to tell in advance how many lines of sequence follow a given definition line, so you need some way to examine the lines following your matched definition.

**greigite** · 03-23-2011, 06:50 PM

Originally posted by kmcarr:

Your method only copies the first line of sequence after the definition line. There is no way to tell in advance how many lines of sequence follow a given definition line, so you need some way to examine the lines following your matched definition.

Funny, I was just working on this problem today. I quality filtered some Illumina PE data (reads 1 and 2 in separate files) and after that the two files no longer contained exactly matched reads. I needed to produce new files for each read containing only those reads with a partner in the other quality-filtered file. Eventually I figured out that grep does work for this if you use it to find the line number of the read ID and then pull the 4-line record that starts there:

Code:

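     # Line number of the first line matching the read ID (fixed-string match).
     my $lineone = `grep -n -m 1 -F '$readID' $read_file_name`;
     my ($linenum) = $lineone =~ /^([0-9]+):/;
     # A FASTQ record is 4 lines: the ID line plus the 3 lines after it.
     $linenum = $linenum + 3;
     my @printone = `head -n $linenum $read_file_name | tail -n 4`;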
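The grep call above rescans the read file once per read ID. A rough sketch of the same pairing task done with two passes per file, in Python rather than Perl, is below. It assumes strict 4-line FASTQ records and mate names of the form `@ID/1` and `@ID/2`; the function names `read_key`, `fastq_records`, and `keep_paired` are hypothetical.

```python
def read_key(header):
    """Return the part of a FASTQ ID line shared by both mates."""
    # Assumption: mates are named '@ID/1' and '@ID/2'.
    name = header.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def fastq_records(path):
    """Yield (key, record_text) for each 4-line FASTQ record in path."""
    with open(path) as handle:
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0].strip():
                break  # end of file (or a trailing blank line)
            yield read_key(lines[0]), "".join(lines)


def keep_paired(in1, in2, out1, out2):
    """Write only reads whose mate survives in the other file; return pair count."""
    # First pass: collect the keys present in each file and intersect them.
    shared = {k for k, _ in fastq_records(in1)} & {k for k, _ in fastq_records(in2)}
    # Second pass: copy records whose key is in both files, preserving order.
    for src, dst in ((in1, out1), (in2, out2)):
        with open(dst, "w") as out:
            for key, record in fastq_records(src):
                if key in shared:
                    out.write(record)
    return len(shared)
```

The trade-off is memory: the set of read keys has to fit in RAM, whereas the grep approach needs almost none but reads the file again for every ID.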

**tomc** · 03-24-2011, 12:27 AM

tools do exist

Is there a reason not to index the sequence FASTA file with whatever BLAST-like tool you are used to, and then feed the set of identifiers to the command that extracts FASTA records from the indexed sequence? This is especially worthwhile if you may want to partition out other subsets of sequences later. Something like ...

classic ncbi blast

Code:

formatdb  -p F -o T -i sequence.fasta -n blastdb
fastacmd  -d blastdb  -i accession.list  -o isolated.sequences

wublast

Code:

xdformat -n -I -o blastdb sequence.fasta  
xdget -n -f  blastdb  accession.list > isolated.sequences

new blast+ (I have not used it yet)

Code:

makeblastdb -in sequence.fasta -dbtype nucl  -hash_index  -parse_seqids  -out blastdb
blastdbcmd  -entry_batch accession.list -db blastdb -dbtype nucl

Nor have I been using the newer tools (bowtie, bwa, ...) long enough to know if they have a similar mode, but it would not surprise me.
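To illustrate the index-then-extract pattern itself, independent of any BLAST tool, here is a small Python sketch. It is not how formatdb, xdformat, or makeblastdb work internally; it only records the byte offset of each definition line and seeks to it on request. The names `build_index` and `fetch` are hypothetical, and the ID is assumed to be the first token after `>`.

```python
def build_index(fasta_path):
    """Map each record ID to the byte offset of its definition line."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumption: the ID is the first token on the definition line.
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def fetch(fasta_path, index, seq_id):
    """Return one complete record by seeking straight to its offset."""
    with open(fasta_path, "rb") as handle:
        handle.seek(index[seq_id])
        record = [handle.readline()]  # the definition line itself
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            record.append(line)
    return b"".join(record).decode()
```

Once the index is built, any number of different ID lists can be extracted without rescanning the whole file, which is the point about partitioning out other subsets.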

**Seqasaurus** · 03-28-2011, 02:38 PM

pyfasta http://pypi.python.org/pypi/pyfasta/

Or flatten the FASTA so that each sequence occupies only a single line, then use grep?
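A possible way to do the flattening, sketched in Python. The function name `flatten_fasta` and its arguments are made up for illustration, and the sketch assumes the file starts with a definition line.

```python
def flatten_fasta(in_path, out_path):
    """Rewrite a FASTA file with each sequence on one line; return record count."""
    count = 0
    chunks = []
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if line.startswith(">"):
                # Flush the previous record's sequence before starting a new one.
                if count:
                    dst.write("".join(chunks) + "\n")
                dst.write(line + "\n")
                chunks = []
                count += 1
            elif line and count:
                chunks.append(line)
        # Flush the sequence of the final record.
        if count:
            dst.write("".join(chunks) + "\n")
    return count
```

After this, every record is exactly two lines, so the `grep -F -f file1 -A 1 file2` command from earlier in the thread returns whole sequences. Note that GNU grep prints `--` separator lines between non-adjacent groups of matches when `-A` is used, so those would need to be filtered out of the result.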


Help with FASTA parsing code.
