Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • newbie2this
    Junior Member
    • Nov 2012
    • 4

    Parsing multi fasta sequence file using Perl

    Hello,
    I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

    For example (this would the current fasta file)
    >gi|546265522| SOX6
    acgtcaatccag
    cgattagcaaat
    gtcctgattttgg

    >gi|678457845| CMYC
    gttaccatgcgatg
    caatttgggacacc

    I want (notice the ">" is removed:
    gi|546265522| SOX6
    Seq length: 36
    gi|678457845| CMYC
    Seq length: 28

    I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
    >gi|546265522| SOX6
    121212

    >gi|678457845| CMYC
    1414

    How do I solve this problem??
    Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
    gi|678457845| CMYC (tab) seq length (tab) AT/GC content

    Here is my script

    #!/usr/bin/perl -w

    print "file: \n";
    $in = <STDIN>;
    chomp $in;

    print "output file: \n";
    $out = <STDIN>;
    chop $out;

    unless ( open(IN, $in) ) {
    die ("cant input file $in\n");}

    unless ( open(OUT, ">$out") ) {
    die("cant open output file $out\n");}

    my $line = <IN>;
    print OUT $line;

    while ($line = <IN>)
    {
    chomp $line;
    if ($line=~/^>(.+)/) {
    print OUT "\n",$line,"\n"; }
    else { $line =~ s/^\s*(.*)\s*$/$1/;

    $a=($line=~tr/aA//);
    $c=($line=~tr/cC//);
    $g=($line=~tr/gG//);
    $t=($line=~tr/tT//);
    $n=($line=~tr/nN//);
    $x=($line=~tr/xX//);

    $length = $a + $c + $g + $t + $n + $x;
    print OUT $length; }
    }

    print OUT "\n";
  • vivek_
    PhD Student
    • Jul 2012
    • 164

    #2
    Try using

    use strict;
    use warnings;
    in your perl scripts, its a good practice and eliminates a lot of mistakes in your code.

    One possible issue I can see with this script is that the variables you use for counting need to be declared outside the while loop if you want to use them to count on multi-line fasta files. Also its safer to initialize variables to say 0 when you declare them.

    Also try getting familiar with Bio perl modules, its much easier to parse sequence format files with them.
    Last edited by vivek_; 11-06-2012, 08:03 AM.

    Comment

    • jesus garcia
      Junior Member
      • Nov 2011
      • 7

      #3
      Originally posted by newbie2this View Post
      Hello,
      I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

      For example (this would the current fasta file)
      >gi|546265522| SOX6
      acgtcaatccag
      cgattagcaaat
      gtcctgattttgg

      >gi|678457845| CMYC
      gttaccatgcgatg
      caatttgggacacc

      I want (notice the ">" is removed:
      gi|546265522| SOX6
      Seq length: 36
      gi|678457845| CMYC
      Seq length: 28

      I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
      >gi|546265522| SOX6
      121212

      >gi|678457845| CMYC
      1414



      How do I solve this problem??
      Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
      gi|678457845| CMYC (tab) seq length (tab) AT/GC content

      Here is my script

      #!/usr/bin/perl -w

      print "file: \n";
      $in = <STDIN>;
      chomp $in;

      print "output file: \n";
      $out = <STDIN>;
      chop $out;

      unless ( open(IN, $in) ) {
      die ("cant input file $in\n");}

      unless ( open(OUT, ">$out") ) {
      die("cant open output file $out\n");}

      my $line = <IN>;
      print OUT $line;

      while ($line = <IN>)
      {
      chomp $line;
      if ($line=~/^>(.+)/) {
      print OUT "\n",$line,"\n"; }
      else { $line =~ s/^\s*(.*)\s*$/$1/;

      $a=($line=~tr/aA//);
      $c=($line=~tr/cC//);
      $g=($line=~tr/gG//);
      $t=($line=~tr/tT//);
      $n=($line=~tr/nN//);
      $x=($line=~tr/xX//);

      $length = $a + $c + $g + $t + $n + $x;
      print OUT $length; }
      }

      print OUT "\n";
      >gi|546265522| SOX6
      121212


      this numbers came from the length of each line but printed in the same line with any space then the number is like "121212" you should read "12-12-12" and if you add in the line marked in red the "-" you obtain this. A way to get the total length of every contig, you have to sum the length of the lines that form each contig and when you find another contig put the length equal to 0 and sum the lines of each contig. If you need more help I modify the code but I think you can modify it.

      Comment

      • newbie2this
        Junior Member
        • Nov 2012
        • 4

        #4
        thanks bunches for the help but I still wasn`t able to get my script to run the way i want it to.

        Comment

        • kmcarr
          Senior Member
          • May 2008
          • 1181

          #5
          n2t,

          One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

          Here is a heavily commented bit of perl which will accomplish what you asked.

          Code:
          #!/usr/bin/perl
          
          use strict;
          use warnings;
          
          my $inFile = shift; 
          open (IN, "$inFile");
          
          # By default Perl pulls in chunks of text up to a newline (\n) character; newline is
          # the default Input Record Separator. You can change the Input Record Separator by
          # using the special variable "$/". When dealing with FASTA files I normally change the
          # Input Record Separator to ">" which allows your script to take in a full, multiline
          # FASTA record at once.
          
          $/ = ">";
          
          # At each input your script will now read text up to and including the first ">" it encounters.
          # This means you have to deal with the first ">" at the begining of the file as a special case.
          
          my $junk = <IN>; # Discard the ">" at the begining of the file
          
          # Now read through your input file one sequence record at a time. Each input record will be a
          # multiline FASTA entry.
          
          while ( my $record = <IN> ) {
          	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
          	
          # 	Now split up your record into its definition line and sequence lines using split at each newline.
          # 	The definition will be stored in a scalar variable and each sequence line as an
          # 	element of an array.
          	
          	my ($defLine, @seqLines) = split /\n/, $record;
          	
          #	Join the individual sequence lines into one single sequence and store in a scalar variable.
          	
          	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
          	
          	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
          	
          	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline
          
          }

          Comment

          • krobison
            Senior Member
            • Nov 2007
            • 734

            #6
            You'll do yourself a huge favor if you resist reinventing the wheel (not that I always follow this advice!).

            In particular, take a look at Bio::SeqIO -- this will let you read & write many sequence formats, including FASTA. Your particular program is roughly (taking the input file from the command line and outputting to STDOUT)

            Code:
            #!/usr/bin/perl
            use strict;
            use Bio::SeqIO;
            my $reader=new Bio::SeqIO(-format=>'fasta',-file=>shift);
            while (my $seqRec=$reader->next_seq)
            {
                print join("\t",$seqRec->id,$seqRec->length),"\n";
            }
            Of course, what you were attempting will get you up to speed on standard Perl, but you'll go far learning the higher-level idioms that BioPerl and other libraries give you, and still have plenty to do with the basics.

            Comment

            • newbie2this
              Junior Member
              • Nov 2012
              • 4

              #7
              Originally posted by kmcarr View Post
              n2t,

              One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

              Here is a heavily commented bit of perl which will accomplish what you asked.

              Code:
              #!/usr/bin/perl
              
              use strict;
              use warnings;
              
              my $inFile = shift; 
              open (IN, "$inFile");
              
              # By default Perl pulls in chunks of text up to a newline (\n) character; newline is
              # the default Input Record Separator. You can change the Input Record Separator by
              # using the special variable "$/". When dealing with FASTA files I normally change the
              # Input Record Separator to ">" which allows your script to take in a full, multiline
              # FASTA record at once.
              
              $/ = ">";
              
              # At each input your script will now read text up to and including the first ">" it encounters.
              # This means you have to deal with the first ">" at the begining of the file as a special case.
              
              my $junk = <IN>; # Discard the ">" at the begining of the file
              
              # Now read through your input file one sequence record at a time. Each input record will be a
              # multiline FASTA entry.
              
              while ( my $record = <IN> ) {
              	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
              	
              # 	Now split up your record into its definition line and sequence lines using split at each newline.
              # 	The definition will be stored in a scalar variable and each sequence line as an
              # 	element of an array.
              	
              	my ($defLine, @seqLines) = split /\n/, $record;
              	
              #	Join the individual sequence lines into one single sequence and store in a scalar variable.
              	
              	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
              	
              	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
              	
              	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline
              
              }
              with some minor changea and addition to suit my objective, this worked like a charm. Thanks for all the help.

              Comment

              • ebioman
                Member
                • Aug 2013
                • 41

                #8
                Thanks,
                that script was very useful and simplified my attempt a lot.
                From a perl script of view, a lot of things can be simplified.
                I was only interested in the e.g. length (for statistics):

                while(<>){
                chomp;
                @split = split /\n/;
                push(@size_list, length $split[1]);

                }

                shift @size_list;

                Still, I guess lot of ways to improve and do it another way

                Comment

                • SES
                  Senior Member
                  • Mar 2010
                  • 275

                  #9
                  Originally posted by ebioman View Post
                  Thanks,
                  that script was very useful and simplified my attempt a lot.
                  From a perl script of view, a lot of things can be simplified.
                  I was only interested in the e.g. length (for statistics):

                  while(<>){
                  chomp;
                  @split = split /\n/;
                  push(@size_list, length $split[1]);

                  }

                  shift @size_list;

                  Still, I guess lot of ways to improve and do it another way
                  You really should follow the best practices and use modern pragmas. That means putting
                  Code:
                  use strict; 
                  use warnings;
                  and the top of your script and use lexical variables to limit their scope and avoid collisions. Writing in a very minimal style might mean a little less typing, but it will also mean that your script will fail silently, which won't save you any time in the long run.

                  Comment

                  • ebioman
                    Member
                    • Aug 2013
                    • 41

                    #10
                    My fault, I actually only copied a snippet, since my program is way larger.
                    But you are right:

                    Code:
                    #!/usr/bin/perl
                    use warnings;
                    use diagnostics;
                    use strict;
                    
                    $/ = ">";
                    my (@size_list,@split);
                    
                    while(<>){
                         chomp;
                         @split = split /\n/;
                         push(@size_list, length $split[1]);
                       
                    }
                    
                    shift @size_list;
                    Thx

                    Comment

                    Latest Articles

                    Collapse

                    • GATTACAT
                      Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by GATTACAT
                      Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                      07-01-2026, 11:43 AM
                    • SEQadmin2
                      Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                      by SEQadmin2


                      I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                      Here are nine questions we think about, in roughly the order they matter, before...
                      06-18-2026, 07:11 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by SEQadmin2, 07-02-2026, 11:08 AM
                    0 responses
                    10 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-30-2026, 05:37 AM
                    0 responses
                    13 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-26-2026, 11:10 AM
                    0 responses
                    20 views
                    0 reactions
                    Last Post SEQadmin2  
                    Started by SEQadmin2, 06-17-2026, 06:09 AM
                    0 responses
                    54 views
                    0 reactions
                    Last Post SEQadmin2  
                    Working...