Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • pony2001mx
    Member
    • Aug 2013
    • 32

    perl script to exact sequences by name list

    Dear All,
    i am a perl beginner. Recently i wrote a script to extract multiple sequences from a fasta file. The following is an example.

    FASTA FILE:
    >gene1
    cgcggccccccccccccccccccc
    >gene2
    ggggggggggggggggggcgcgcg
    >gene3
    tagctacgatagcgtcagatgcaacg
    >gene4
    attttttttttttttttttttttttttttaaaaaaa

    LIST FILE:
    >gene4
    >gene3

    MY SCRIPT:
    #!/usr/bin/perl
    use strict;
    use warnings;

    open SEQ, "seq.txt";
    my @seq=<SEQ>;
    open LIST, "list.txt";
    my @list=<LIST>;

    my $i=0;
    while ($i<$#seq){
    my $j=0;
    while ($j<$#list){
    if ($seq[$i] eq $list[$j]){
    print "$seq[$i]\t$seq[$i+1]";
    }
    $j++;
    }
    $i +=2;
    }
    close SEQ;
    close LIST;

    THE OUTPUT RESULT:
    >gene4
    attttttttttttttttttttttttttttaaaaaaa

    Could anyone help to correct my script or make suggestions? Thank you very much for your help!!
  • mastal
    Senior Member
    • Mar 2009
    • 666

    #2
    Originally posted by pony2001mx View Post


    Code:
    	my $j=0;
    		while ($j<$#list){
    		if ($seq[$i] eq $list[$j]){
    		print "$seq[$i]\t$seq[$i+1]";
    		}
    		$j++;
    	}
    In your second while loop, you should have

    Code:
          while($j <= $#list){
    Alternatively, you could use a foreach loop to go through the @list array.

    Comment

    • swbarnes2
      Senior Member
      • May 2008
      • 910

      #3
      This isn't quite the right way to do this. It's way too brittle.

      If you have @list, do

      Code:
      my %hash = map {$_, 1} @list.
      Or, I think slurping a whole file at one go can be dicey if the file is huge, so it's a bit safer to do

      Code:
      my %hash = ();
      while (<LIST>) {
           $hash{$_}++;
      }
      Those will both make a hash where each key is a gene name, and all the values are one.

      Then go through the sequences file, ask what $hash{$seq_name} is. If it's 1, you want that sequence, because its on your list. If it doesn't exist, that seq name was not on your list.

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Here is a script I put together some time ago that follows swbarnes logic. It reads the list file and stores the IDs in a hash. It then parses through the FASTA file one record at a time and checks if the ID of the current record is in the list. It operates in two modes, include (i) or exclude (e) (defaults to include). In include mode it will output those IDs which are in your list file; in exclude mode it will only output those IDs not in your list file.

        Usage:

        Code:
        % subSetFasta_Simple.pl <-f|--fasta sequenceFile.fasta> <-l|--list listFile.txt> [-m|--mode <i|e>]
        
        subSetFasta_simple.pl takes three arguments:
             -f|--fasta the name of the FASTA input file
             -l|--list the name of the list file. One ID per line, do not include ">"
             -m|--mode either 'i' or 'e' (default 'i')
        Attached Files

        Comment

        • rhinoceros
          Senior Member
          • Apr 2013
          • 372

          #5
          or just

          Code:
          grep -A 1 -f listFile.txt file.fasta > list.fasta
          Doesn't work if there are linebreaks in the sequences. Also, ridiculously slow..
          savetherhino.org

          Comment

          • gringer
            David Eccles (gringer)
            • May 2011
            • 845

            #6
            Sometime in the distant past, I installed meme, which has a fasta-fetch program with the following syntax:

            Code:
                    fasta-fetch <fasta> [-f <file> | [<id>]+] [-c] [-s <suf>] 
                            [-off <off>] [-len <len>]
            
                            <fasta>         name of FASTA sequence file
                            [-f <file>]     file containing list of sequence identifiers
                            [<id>]          sequence identifier
                            [-c]            check that first word in fasta id matches <id>
                            [-s <suf>]      put each sequence in a file named "after" the
                                            sequence identifier with ".<suf>" appended; 
                                            pipes in file names are changed to underscores
                            [-off <off>]    print starting at position <off>; default: 1
                            [-len <len>]    print up to <len> characters for each seq; 
                                            default: print entire sequence
            
                    Note: Assumes and index file has been made using fasta-make-index.
                    Sequence identifiers must be same as made by fasta-make-index.
            
                    Reads sequence identifiers from the command line, from
                    a file and from standard input, in that order.
            
                    Fetch sequences from a FASTA sequence file and 
                    write to standard output.
            In response to your perl code, it would be better to process your files line-by-line, and use hashes for lookup, because then your code is more generalisable in the future (e.g. for extracting raw reads out of a 100GB FASTA file from a NGS sequencer):

            Code:
            my %sequenceIDs = ();
            while(<handle>){
              if(/>(.*?) /){
                $sequenceIDs{$1} = 1;
              }
            }
            instead of the "try to read everything into memory at once" approach:
            Code:
            @sequenceIDs = <handle>;
            ... in other words, what swbarnes said.
            Last edited by gringer; 11-11-2013, 08:17 PM.

            Comment

            • pony2001mx
              Member
              • Aug 2013
              • 32

              #7
              THANKS a lot for your suggestions, which are really helpful!

              Comment

              Latest Articles

              Collapse

              • SEQadmin2
                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                by SEQadmin2


                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                ...
                Yesterday, 10:05 AM
              • SEQadmin2
                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                by SEQadmin2


                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                Introduction

                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                05-22-2026, 06:42 AM
              • SEQadmin2
                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                by SEQadmin2

                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                05-06-2026, 09:04 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Yesterday, 12:03 PM
              0 responses
              19 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, Yesterday, 11:40 AM
              0 responses
              14 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-28-2026, 11:40 AM
              0 responses
              29 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 05-26-2026, 10:12 AM
              0 responses
              31 views
              0 reactions
              Last Post SEQadmin2  
              Working...