Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bigmac3000lbs
    Junior Member
    • Mar 2011
    • 1

    Help with FASTA parsing code.

    Hi guys!

    I'm working on a project where I am reading in values from a text file and then using them as search terms in a FASTA file. Ideally every time the program finds encounters the search term in the header section of the FASTA file, it should copy the header section of the FASTA file, along with the associated sequence into an output file. I've included examples of the kinds of data I am working with, along with the code I'm currently working on so you can get a better sense of what I am talking about.

    Currently the problem I've been having with my code is that when I run it, it only outputs the results of the search conducted with the final term in the text file. It's that the program ran through the entire text file without performing any searches on the file to be searched, before stopping on the last item in the text file and performing a search. I should warn you that I just started programming in perl recently, so it is quite possible that my code has a few glaring errors that I've missed.

    #Data from the text file I am using for search terms. Note these sequence ID names represent a subset of the sequences found in the FASTA file I am searching through. The actual text file is much larger, containing 1000s of terms.
    BF01013B2E03.f1
    BF01010B1A11.f1
    BF01028B1E03.f1
    BF01029A2C12.f1
    BF01028B2D11.f1

    #Data from input FASTA file. I am searching through this file using th +e items from the text file shown above. The actual input FASTA file contains 1000s more items
    > BF01013B2E03.f1 735 1 735 ABI CAATCCAAGAACATTTTGAAGAAAAAATCTCTTAAAAAAAAGAAATCAAAACAAGTGATCAAAAATGAAATGAATGGTCA
    > BF01010B1A11.f1 782 1 782 ABI AACGGACNANNCGGCAACCAGGAGGCCTTCCAAGCTGAACTGGGAGAGTGGATCAAGAAG
    > BF01028B2D11.f1 674 1 674 ABI CCAGCACNNNNTNAGATATTAGCCTAGCCTCTATGTCGTATTTGTATTTCNNCTAGTTTTTCATCCGACTTTTTTGGATC

    #output file note the output file has only one sequence in it. This is clearly not what I want. Something is wrong, I just can't put my finger on it. > BF01028B2D11.f1 674 1 674 ABI CCAGCACNNNNTNAGATATTAGCCTAGCCTCTATGTCGTATTTGTATTTCNNCTAGTTTTTCATCCGACTTTTTTGGATC #perl program
    <code>
    #!/usr/bin/perl
    use strict;
    use File::Basename;
    my $database;
    my $data = shift(@ARGV);
    my $input_file = shift(@ARGV);
    my $infile;
    my $match; my $output_file;
    my $ESTs_W_SNPs;
    $output_file = "ESTsWsnpsANDgoodCoverage.fasta";
    open ($database,'<', $data) or die "Cannot open $data\n";
    while ($match = <$database>)
    {
    chomp $match;
    open (IN, $input_file) || die "Can't open input file. Please provide a valid input filename.\n";
    open ($ESTs_W_SNPs, '>', $output_file) or die "Cannot write to $output_file\n";
    my ($seq, $prevhead) = (0, "", '');
    while(<IN>)
    {
    my $line = $_;
    $line =~ s/\r\n/\n/;
    chomp $line;
    $seq.= uc($line) if(eof(IN));
    if (/\>(.*)/ || eof(IN))
    {
    my $head=$1;
    printf $ESTs_W_SNPs ">$prevhead\n$seq\n" if($prevhead =~ /$ma +tch/); $prevhead = $head; $seq='';
    }
    else
    {
    $seq.=$line;
    }
    }
    close (IN);
    close ($ESTs_W_SNPs);
    }
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    You are re-opening the output file every time you grab a new input from your database file (the outermost while loop) which basically reinitializes the file. You need to move the open/close statements for the output file outside your main program loop.

    Also, you approach is not the most efficient. You are scanning through your entire FASTA file for each record you want to save. A better approach is to read in your search list and store it in a hash. Then go through your FASTA file just once, comparing each record to your list (hash). If the ID is in your hash then write the sequence to your output file. If the ID is not in your list then move on to the next record.

    Comment

    • chalex
      Junior Member
      • Mar 2011
      • 1

      #3
      So, uh, wouldn't you just do
      Code:
      grep -F -f file1 -A 1 file2
      where file1 is your search terms and file2 is the full FASTA file?

      See the man page for GNU grep for more info.

      Comment

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        Originally posted by chalex View Post
        So, uh, wouldn't you just do
        Code:
        grep -F -f file1 -A 1 file2
        where file1 is your search terms and file2 is the full FASTA file?

        See the man page for GNU grep for more info.
        You're method only copies the first line of sequence after the definition line. There is no way to tell how many lines of sequence are follow a given definition line so you need some way to examine the lines following your matched definition.

        Comment

        • greigite
          Senior Member
          • Mar 2009
          • 145

          #5
          Originally posted by kmcarr View Post
          You're method only copies the first line of sequence after the definition line. There is no way to tell how many lines of sequence are follow a given definition line so you need some way to examine the lines following your matched definition.
          Funny, I was just working on this problem today. I quality filtered some Illumina PE data (reads 1 and 2 in separate files) and after that they didn't have exactly matched reads in them anymore. I needed to produce new files for each read containing only those reads with a partner in the other quality filtered file. Eventually I figured out that grep does work for this if you tell it to pull only the 4 lines after the read ID match:

          Code:
               my $lineone = `grep -n $readID $read_file_name]`;
               my ($linenum) = $lineone =~/([0-9]{1,}):.*/;
               $linenum = $linenum +3;
               my @printone = `head -$linenum $read_file_name | tail -4`;

          Comment

          • tomc
            Member
            • Feb 2011
            • 29

            #6
            tools do exist

            Is there a reason not to use the pattern of,
            indexing the sequence fasta file with whatever blast like tool
            you are use to, then feeding the set of identifiers to the command
            that extracts the fasta records from the indexed sequence?
            especially if you may want to partition out other subset of sequence.
            something like ...

            classic ncbi blast
            Code:
            formatdb  -p F -o T -i sequence.fasta -n blastdb
            fastacmd  -d blastdb  -i accession.list  -o isolated.sequences
            wublast
            Code:
            xdformat -n -I -o blastdb sequence.fasta  
            xdget -n -f  blastdb  accession.list > isolated.sequences
            new blast+ (have never used yet)
            Code:
            makeblastdb -in sequence.fasta -dbtype nucl  -hash_index  - parse_seqids  -out blastdb
            blastdbcmd  -entry_batch accession.list -db blastdb -dbtype nucl
            nor I have been using the newer tools, bowtie, bwa ...
            long enough to to know if they have a similar mode
            but it would not surprise me.

            Comment

            • Seqasaurus
              Member
              • Sep 2010
              • 24

              #7
              pyfasta http://pypi.python.org/pypi/pyfasta/

              or flatten the fasta so that each sequence only occupies a single line then use grep?
              Last edited by Seqasaurus; 03-28-2011, 02:39 PM. Reason: link for pyfasta

              Comment

              Latest Articles

              Collapse

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Today, 06:09 AM
              0 responses
              15 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-09-2026, 11:58 AM
              0 responses
              34 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-05-2026, 10:09 AM
              0 responses
              39 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-04-2026, 08:59 AM
              0 responses
              47 views
              0 reactions
              Last Post SEQadmin2  
              Working...