**mastal** · 11-11-2013, 08:45 AM

Originally posted by pony2001mx

Code:

	my $j=0;
	while ($j<$#list){
		if ($seq[$i] eq $list[$j]){
			print "$seq[$i]\t$seq[$i+1]";
		}
		$j++;
	}

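$#list is the index of the last element of @list, so the condition $j<$#list stops one short and never tests the last entry. In your second while loop, you should have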

Code:

      while($j <= $#list){

Alternatively, you could use a foreach loop to go through the @list array.
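A minimal sketch of the foreach version. It assumes @seq alternates name, sequence, name, sequence, as the quoted code suggests; the sample data and the sub name matches_for are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: @seq alternates name, sequence, name, sequence...
my @seq  = ('geneA', 'ACGT', 'geneB', 'GGCC', 'geneC', 'TTAA');
my @list = ('geneA', 'geneC');

# Return "name\tsequence" for every name in @$seq that is also in @$list.
sub matches_for {
    my ($seq, $list) = @_;
    my @out;
    for (my $i = 0; $i < $#{$seq}; $i += 2) {
        foreach my $name (@{$list}) {    # visits every element, including the last
            if ($seq->[$i] eq $name) {
                push @out, "$seq->[$i]\t$seq->[$i+1]";
            }
        }
    }
    return @out;
}

print "$_\n" for matches_for(\@seq, \@list);
```

With foreach there is no index to get wrong, so the off-by-one cannot happen.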

**swbarnes2** · 11-11-2013, 09:45 AM

This isn't quite the right way to do this. It's way too brittle.

If you have @list, do

Code:

my %hash = map { $_ => 1 } @list;

Or, I think slurping a whole file at one go can be dicey if the file is huge, so it's a bit safer to do

Code:

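my %hash = ();
while (<LIST>) {
     chomp;            # drop the trailing newline so keys match the sequence names
     $hash{$_} = 1;    # value stays 1 even if a name is listed twice
}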

Those will both make a hash where each key is a gene name, and all the values are one.

Then go through the sequences file and ask what $hash{$seq_name} is. If it's 1, you want that sequence, because it's on your list. If it doesn't exist, that seq name was not on your list.
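The two steps above could be sketched roughly as follows. The sub names load_ids and filter_fasta and the file names in the usage comment are made up for illustration, and the sketch assumes the ID is the first word after ">" on each header line.

```perl
use strict;
use warnings;

# Read one ID per line from a filehandle into a lookup hash.
sub load_ids {
    my ($fh) = @_;
    my %ids;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;               # tolerate Windows line endings
        next if $line eq '';            # skip blank lines
        $ids{$line} = 1;
    }
    return \%ids;
}

# Copy FASTA records whose ID (first word after ">") is a key of %$ids.
sub filter_fasta {
    my ($in, $out, $ids) = @_;
    my $keep = 0;
    while (my $line = <$in>) {
        if ($line =~ /^>(\S+)/) {
            $keep = exists $ids->{$1};  # decide once per record
        }
        print {$out} $line if $keep;    # header and all of its sequence lines
    }
    return;
}

# Usage with hypothetical file names:
#   open my $list, '<', 'listFile.txt' or die $!;
#   open my $fa,   '<', 'file.fasta'   or die $!;
#   filter_fasta($fa, \*STDOUT, load_ids($list));
```

Because the keep/skip decision is made at each header line and then applied to every following line, sequences wrapped over several lines are handled, and only one line of the FASTA file is in memory at a time.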

**kmcarr** · 11-11-2013, 10:22 AM

Here is a script I put together some time ago that follows swbarnes2's logic. It reads the list file and stores the IDs in a hash. It then parses the FASTA file one record at a time and checks whether the ID of the current record is in the list. It operates in two modes, include (i) or exclude (e), and defaults to include. In include mode it outputs only those records whose IDs are in your list file; in exclude mode it outputs only those records whose IDs are not in your list file.

Usage:

Code:

% subSetFasta_Simple.pl <-f|--fasta sequenceFile.fasta> <-l|--list listFile.txt> [-m|--mode <i|e>]

subSetFasta_Simple.pl takes three arguments:
     -f|--fasta the name of the FASTA input file
     -l|--list the name of the list file. One ID per line, do not include ">"
     -m|--mode either 'i' or 'e' (default 'i')
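For readers who cannot open the attachment, here is a rough sketch of how the option handling and the include/exclude test could look. This is not the attached script; the sub names parse_args and wanted are hypothetical, and only the option names come from the usage text above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse -f/--fasta, -l/--list and -m/--mode from an argument list.
sub parse_args {
    my @args = @_;
    my %opt = (mode => 'i');                 # default mode: include
    GetOptionsFromArray(\@args,
        'fasta|f=s' => \$opt{fasta},
        'list|l=s'  => \$opt{list},
        'mode|m=s'  => \$opt{mode},
    ) or die "bad options\n";
    die "mode must be 'i' or 'e'\n" unless $opt{mode} =~ /^[ie]$/;
    return \%opt;
}

# True if a record with this ID should be printed in the given mode.
sub wanted {
    my ($id, $ids, $mode) = @_;
    my $listed = exists $ids->{$id};
    return $mode eq 'e' ? !$listed : $listed;
}

# Usage: my $opt = parse_args(@ARGV);
```

Exclude mode is just the negation of the same hash lookup, so both modes cost one lookup per record.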

Attached Files

subSetFasta_Simple.pl (1.8 KB, 110 views)

**rhinoceros** · 11-11-2013, 10:50 AM

or just

Code:

grep -A 1 -f listFile.txt file.fasta > list.fasta

Doesn't work if there are linebreaks in the sequences. Also, ridiculously slow.

**gringer** · 11-11-2013, 08:10 PM

Sometime in the distant past, I installed meme, which has a fasta-fetch program with the following syntax:

Code:

        fasta-fetch <fasta> [-f <file> | [<id>]+] [-c] [-s <suf>] 
                [-off <off>] [-len <len>]

                <fasta>         name of FASTA sequence file
                [-f <file>]     file containing list of sequence identifiers
                [<id>]          sequence identifier
                [-c]            check that first word in fasta id matches <id>
                [-s <suf>]      put each sequence in a file named "after" the
                                sequence identifier with ".<suf>" appended; 
                                pipes in file names are changed to underscores
                [-off <off>]    print starting at position <off>; default: 1
                [-len <len>]    print up to <len> characters for each seq; 
                                default: print entire sequence

        Note: Assumes and index file has been made using fasta-make-index.
        Sequence identifiers must be same as made by fasta-make-index.

        Reads sequence identifiers from the command line, from
        a file and from standard input, in that order.

        Fetch sequences from a FASTA sequence file and 
        write to standard output.

In response to your Perl code, it would be better to process your files line by line and use hashes for lookup, because then your code is more generalisable in the future (e.g. for extracting raw reads out of a 100GB FASTA file from an NGS sequencer):

Code:

my %sequenceIDs = ();
while(<handle>){
  if(/^>(\S+)/){   # ID = first word after ">", with or without a description
    $sequenceIDs{$1} = 1;
  }
}

instead of the "try to read everything into memory at once" approach:

Code:

@sequenceIDs = <handle>;

... in other words, what swbarnes said.

**pony2001mx** · 11-12-2013, 06:35 PM

THANKS a lot for your suggestions, which are really helpful!

Perl script to extract sequences by name list
