Unconfigured Ad

**vivek_** · 11-06-2012, 07:52 AM

Try using

use strict;
use warnings;

in your perl scripts, its a good practice and eliminates a lot of mistakes in your code.

One possible issue I can see with this script is that the variables you use for counting need to be declared outside the while loop if you want to use them to count on multi-line fasta files. Also its safer to initialize variables to say 0 when you declare them.

Also try getting familiar with Bio perl modules, its much easier to parse sequence format files with them.

**jesus garcia** · 11-06-2012, 08:58 AM

Originally posted by newbie2this View Post

Hello,
I am very new to Perl scripting and scripting in general. I am trying to extract information from a multi fasta file to an output file. I have constructed a script but it isn't giving me the output that I want.

For example (this would the current fasta file)
>gi|546265522| SOX6
acgtcaatccag
cgattagcaaat
gtcctgattttgg

>gi|678457845| CMYC
gttaccatgcgatg
caatttgggacacc

I want (notice the ">" is removed:
gi|546265522| SOX6
Seq length: 36
gi|678457845| CMYC
Seq length: 28

I was able to put the lines of the sequences into one string but calculating the length isn't working. I removed any spaces within the new single string and attempted $length = length($line); but that doesn't work. The current method in my script also gives the same output like:::
>gi|546265522| SOX6
121212

>gi|678457845| CMYC
1414

How do I solve this problem??
Ultimately i want some like this but I am taking it in steps so that I actually understand what I'm doing and why.
gi|678457845| CMYC (tab) seq length (tab) AT/GC content

Here is my script

#!/usr/bin/perl -w

print "file: \n";
$in = <STDIN>;
chomp $in;

print "output file: \n";
$out = <STDIN>;
chop $out;

unless ( open(IN, $in) ) {
die ("cant input file $in\n");}

unless ( open(OUT, ">$out") ) {
die("cant open output file $out\n");}

my $line = <IN>;
print OUT $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~/^>(.+)/) {
print OUT "\n",$line,"\n"; }
else { $line =~ s/^\s*(.*)\s*$/$1/;

$a=($line=~tr/aA//);
$c=($line=~tr/cC//);
$g=($line=~tr/gG//);
$t=($line=~tr/tT//);
$n=($line=~tr/nN//);
$x=($line=~tr/xX//);

$length = $a + $c + $g + $t + $n + $x;
print OUT $length; }
}

print OUT "\n";

>gi|546265522| SOX6
121212

this numbers came from the length of each line but printed in the same line with any space then the number is like "121212" you should read "12-12-12" and if you add in the line marked in red the "-" you obtain this. A way to get the total length of every contig, you have to sum the length of the lines that form each contig and when you find another contig put the length equal to 0 and sum the lines of each contig. If you need more help I modify the code but I think you can modify it.

**newbie2this** · 11-06-2012, 01:42 PM

thanks bunches for the help but I still wasn`t able to get my script to run the way i want it to.

**kmcarr** · 11-06-2012, 02:01 PM

n2t,

One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

Here is a heavily commented bit of perl which will accomplish what you asked.

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift; 
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
	
# 	Now split up your record into its definition line and sequence lines using split at each newline.
# 	The definition will be stored in a scalar variable and each sequence line as an
# 	element of an array.
	
	my ($defLine, @seqLines) = split /\n/, $record;
	
#	Join the individual sequence lines into one single sequence and store in a scalar variable.
	
	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
	
	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
	
	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline

}

**krobison** · 11-06-2012, 06:31 PM

You'll do yourself a huge favor if you resist reinventing the wheel (not that I always follow this advice!).

In particular, take a look at Bio::SeqIO -- this will let you read & write many sequence formats, including FASTA. Your particular program is roughly (taking the input file from the command line and outputting to STDOUT)

Code:

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
my $reader=new Bio::SeqIO(-format=>'fasta',-file=>shift);
while (my $seqRec=$reader->next_seq)
{
    print join("\t",$seqRec->id,$seqRec->length),"\n";
}

Of course, what you were attempting will get you up to speed on standard Perl, but you'll go far learning the higher-level idioms that BioPerl and other libraries give you, and still have plenty to do with the basics.

**newbie2this** · 11-07-2012, 05:45 AM

Originally posted by kmcarr View Post

n2t,

One thing I almost always do when dealing with FASTA files in perl is to change the input record separator from the default newline to the more useful (for FASTA records anyway) ">". This allows me to deal with the full FASTA record, definition line and sequence as a single initial chunk of data instead of line by line. Additionally perl has a "length" function; you don't have to count and add characters.

Here is a heavily commented bit of perl which will accomplish what you asked.

Code:

#!/usr/bin/perl

use strict;
use warnings;

my $inFile = shift; 
open (IN, "$inFile");

# By default Perl pulls in chunks of text up to a newline (\n) character; newline is
# the default Input Record Separator. You can change the Input Record Separator by
# using the special variable "$/". When dealing with FASTA files I normally change the
# Input Record Separator to ">" which allows your script to take in a full, multiline
# FASTA record at once.

$/ = ">";

# At each input your script will now read text up to and including the first ">" it encounters.
# This means you have to deal with the first ">" at the begining of the file as a special case.

my $junk = <IN>; # Discard the ">" at the begining of the file

# Now read through your input file one sequence record at a time. Each input record will be a
# multiline FASTA entry.

while ( my $record = <IN> ) {
	chomp $record; # Remove the ">" from the end of $record, and realize that the ">" is already gone from the begining of the record
	
# 	Now split up your record into its definition line and sequence lines using split at each newline.
# 	The definition will be stored in a scalar variable and each sequence line as an
# 	element of an array.
	
	my ($defLine, @seqLines) = split /\n/, $record;
	
#	Join the individual sequence lines into one single sequence and store in a scalar variable.
	
	my $sequence = join('',@seqLines); # Concatenates all elements of the @seqLines array into a single string.
	
	print "$defLine\n"; # Print your definition; remember the ">" has already been removed. Remember to print a newline.
	
	print "Seq Length: ", length($sequence), "\n"; # Print the sequence length and a newline

}

with some minor changea and addition to suit my objective, this worked like a charm. Thanks for all the help.

**ebioman** · 09-11-2013, 03:29 AM

Thanks,
that script was very useful and simplified my attempt a lot.
From a perl script of view, a lot of things can be simplified.
I was only interested in the e.g. length (for statistics):

while(<>){
chomp;
@split = split /\n/;
push(@size_list, length $split[1]);

}

shift @size_list;

Still, I guess lot of ways to improve and do it another way

**SES** · 09-11-2013, 04:29 AM

Originally posted by ebioman View Post

Thanks,
that script was very useful and simplified my attempt a lot.
From a perl script of view, a lot of things can be simplified.
I was only interested in the e.g. length (for statistics):

while(<>){
chomp;
@split = split /\n/;
push(@size_list, length $split[1]);

}

shift @size_list;

Still, I guess lot of ways to improve and do it another way

You really should follow the best practices and use modern pragmas. That means putting

Code:

use strict; 
use warnings;

and the top of your script and use lexical variables to limit their scope and avoid collisions. Writing in a very minimal style might mean a little less typing, but it will also mean that your script will fail silently, which won't save you any time in the long run.

**ebioman** · 09-11-2013, 04:48 AM

My fault, I actually only copied a snippet, since my program is way larger.
But you are right:

Code:

#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;

$/ = ">";
my (@size_list,@split);

while(<>){
     chomp;
     @split = split /\n/;
     push(@size_list, length $split[1]);
   
}

shift @size_list;

Thx

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 10 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 13 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Parsing multi fasta sequence file using Perl

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News