Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Parsing Smith-Waterman Alignment EMBOSS

    Hi all,
    I'm using the Smith-Waterman algorithm of EMBOSS to automaticaly do pairwise sequence alignments.
    Afterwards I want to filter them by an identity and length thereshold. To do that I would like to parse the output. All infos are contained in the default output format which looks like
    #=======================================
    #
    # Aligned_sequences: 2
    # 1: 3end_LTR_oligo
    # 2: Cap_S1.1
    # Matrix: EDNAFULL
    # Gap_penalty: 15.0
    # Extend_penalty: 1.5
    #
    # Length: 9
    # Identity: 8/9 (88.9%)
    # Similarity: 8/9 (88.9%)
    # Gaps: 0/9 ( 0.0%)
    # Score: 36.0
    #
    #
    #=======================================

    3end_LTR_olig 19 GTTCGCGTT 27
    |||||.|||
    Cap_S1.1 2 GTTCGTGTT 10


    #=======================================
    I tried to parse the file with perl, but there is a newline shift somewhere when the alignment ends with a "." I think.
    Did anyone allready wrote a script to parse the EMBOSS water output into a tabular format for "multiple pairwise alignments"? Or can anyone tell me how to solve this readin newline problem? My parsing code is the following:
    Code:
    #! /usr/bin/perl -w
    use strict;
    
    #Author: Ulrike Löber ([email protected])
    #This script should parse default output of EMBOSS's Waterman algorithm
    
    #INFILE is the water outfile (srspair format)
    open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found";
    	my @data	=	<INFILE>;
    close INFILE;
    my $numberoflines=@data;
    my @programm			=	split(/ +/,$data[1]);
    my @rundate				=	split(/ +/,$data[2]);
    my @gapopenpenalty		=	split(/ +/,$data[8]);
    my @gapextensionpenalty	=	split(/ +/,$data[9]);
    
    # main informations are print out in terminal, e.g. programm name, date of the run and parameters
    print "programm: $programm[2]\n";
    print "rundate: $rundate[2] $rundate[3] $rundate[4] $rundate[5] $rundate[6]\n";
    print "gap open penalty: $gapopenpenalty[2]\ngap extension penalty: $gapextensionpenalty[2]\n";
    
    #parsed information are written in a file with the same name as the infile with an additional ".tabular"
    open (OUTFILE,">$ARGV[0].tabular");
    # build a table header
    	print OUTFILE "id1\tid2\talignment length\tidentity\tsimilarity\tgaps\tscore\tstart1\tend1\tseq1\tstart2\tend2\tseq2\n";
    	for (my $i = 0; $i <= 1220; $i++) {
    		my $nextalign=$i*23;
    		my @id1				=	split(/ +/,$data[17+$nextalign]);	#splitting stings by whitespaces
    		my @id2				=	split(/ +/,$data[18+$nextalign]);
    		my @length			=	split(/ +/,$data[23+$nextalign]);
    		my @identity		=	split(/\(/,$data[24+$nextalign]);
    		my @similarityscore	=	split(/\(/,$data[25+$nextalign]);
    		my @gaps			=	split(/\(/,$data[26+$nextalign]);	#whitespace problems if there is an one digit percentage an additional space is in the line
    		my @alignmentscore	=	split(/ +/,$data[27+$nextalign]);	#extraction by "(" instead of whitespace
    		my @seq1			=	split(/ +/,$data[32+$nextalign]);
    		my $startseq1		=	$seq1[1];							#simple renaming to make the "prints" more readable
    		my $endseq1			=	$seq1[3];
    		my @seq2			=	split(/ +/,$data[34+$nextalign]);
    		my $startseq2		=	$seq2[1];
    		my $endseq2			=	$seq2[3];
    
    		$identity[1]=~s/\%\)//g;	#the percentage sign and the brace must be removed
    		$similarityscore[1]=~s/\%\)//g;;
    		$gaps[1]=~s/\%\)//g;
    # chomp every scalar to remove leading white spaces from the strings
    		chomp $id1[2];
    		chomp $id2[2];
    		chomp $length[2];
    		chomp $identity[1];
    		chomp $similarityscore[1];
    		chomp $gaps[1];
    		chomp $alignmentscore[2];
    		chomp $startseq1;
    		chomp $endseq1;
    		chomp $seq1[2];
    		chomp $startseq2;
    		chomp $endseq2;
    		chomp $seq2[2];
    # gives selected information for each pairwise alignment
    		print OUTFILE "$id1[2]\t$id2[2]\t$length[2]\t$identity[1]\t$similarityscore[1]\t$gaps[1]\t$alignmentscore[2]\t$startseq1\t$endseq1\t$seq1[2]\t$startseq2\t$endseq2\t$seq2[2]\n";
    }
    close OUTFILE;
    print "OUTPUT written in $ARGV[0].tabular\n";
    What is not good practice at all: I have to make it a bit more flexible, for the detection of first alignment line, because that depends on the number of input options (e.g. adding an option results in a shift of the alignments (one extra line)).
    Thank you for your reply!

  • #2
    Hi,
    you can change your alignment format output :

    emboss/bin/water seq1 seq2 -gapopen 10.0 -gapextend 0.5 -aformat score -outfile aln.tab

    Output will show only the names of the sequences, the length of the alignment and the score.

    for additional output options check: http://emboss.sourceforge.net/docs/t...gnFormats.html

    Ilia
    Last edited by zhidkov.ilia; 11-05-2013, 04:01 AM.

    Comment


    • #3
      Holy specifics, batman! This looks nothing like the Perl I'm used to. You've got the following lines:
      Code:
      open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found";
      	my @data	=	<INFILE>;
      close INFILE;
      # process @data
      Which has an assumption that the entire file can fit into memory. Okay, that's probably true for your SW files, but it's not true more generally. You don't seem to be using any of the nice features of Perl, and treating it mostly like an array-storage utility. What I'm more used to is something like this:
      Code:
      my %variables = ();
      open (INFILE, "$ARGV[0]") or die "file $ARGV[0] not found";
      while(<INFILE>){
        # process current line
        if(/pattern1(.*)pattern2/){
          $variables{pattern1} = $1;
        }
        if(/endPattern/){
          # process %variables and print to OUTFILE
        }
      }
      close INFILE;
      There are plenty of optimisations and generalisations that can be done with your code [e.g. using join("\t",(varia,bles)) instead of print("varia\tbles")], but you are going to save yourself a whole lot of pain by going back to your tablet and rewriting this to be a line-wise pattern-based process.

      Comment


      • #4
        Thank you for your fast reply.
        @zhidkov.ilia: thank you, but I already know the outformats. What I need to keep are the identities and the start and end points of the alignments.
        @gringer: anyway the readin is memory efficient or not (thank you for the hint), how can I get rid of the newline problem??

        Comment


        • #5
          Originally posted by uloeber View Post
          how can I get rid of the newline problem?
          You have constants all through your code where they don't need to be (e.g. 1220, as a very obvious one) -- that's the main cause of this specific problem. Your code should be changed considerably, and while I could suggest a change that could fix this specific problem, it would only be kicking the can further down the road.

          Comment


          • #6
            Ah, yes excuse me, this was just for test reasons in the original code its $linenumber devided by 23.

            Comment


            • #7
              Originally posted by gringer View Post
              Your code should be changed considerably, and while I could suggest a change that could fix this specific problem, it would only be kicking the can further down the road.
              Okay, I'll try saying that in a different way. Your code looks like it was written by someone who is trying out Perl as their first programming language and has only studied arrays and the very basics of regular expressions. There are other ways to do what you've done with considerably less frustration; please read up on good style for perl code.

              You can find a tutorial about how to write perl code here:
              Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at: A Beginner's Introduction to Perl 5.10 A Beginner's Introduction to Files and Strings with Perl 5.10 A Beginner's Introduction to Regular...


              If you want a bit less hand-holding, look at the examples in perl documentation, for example here.

              Comment


              • #8
                Solved

                I solved it, maybe it's helpful for people with the same problem.
                Code:
                #! /usr/bin/perl 
                use strict;
                
                # this script is written by Ulrike Löber (contact [email protected])
                # function: parsing "multiple" EMBOSS pairwise (Smith-Waterman) sequence alignments
                # output is a tabular summary of the results containing information of identity, start and end points,...
                # this is missing in default outputs, complete information not contained in tabular output by EMBOSS
                
                #start with read in of the data
                
                open (INFILE, "$ARGV[0]") or die "no such file $ARGV[0]\n";
                
                #initialize variables
                my $bool=0;
                my $check_odd_even;
                my @splitstring;
                my $id1;
                my $id2;
                my $length;
                my $identity;
                my $similarity;
                my $gaps;
                my $score;
                my $start1;
                my $end1;
                my $seq1;
                my $start2;
                my $end2;
                my $seq2;
                my $pattern_id1;
                my $pattern_id2;
                
                # initialize outfile header 
                my $outfile="$ARGV[0].tab";
                open (OUTFILE, ">$outfile") or die "error creating file $outfile !\n";
                print OUTFILE "id1\tid2\tlength\tidentity\tsimilarity\tgaps\tscore\tstart1\tseq1\tend1\tstart2\tseq2\tend2\n";
                 while (defined(my $line=<INFILE>)) {
                	chomp $line;
                	$line=~s/\R//g;	#remove linebreaks
                # first information to extract are the "method" information
                	if ($line	=~	m/\#{10,}/g){	#there should be only two lines with at least 10 # in it, before and after the description at the beginning of the file
                		$bool	=	$bool+1;
                		$check_odd_even=$bool%2;
                	}
                	if	($check_odd_even  	==	1){
                #		print "$line\n";
                	}
                # following the alignment information, written in the new outfile
                	elsif	($check_odd_even	==	0){
                		if($line	=~	m/\#\ 1\:/g){			#if line structure is like "# 1: seqname" print out third element of the line 
                			@splitstring	=	split(/ +/,$line);	#split sting by whitespaces
                			$id1		=	$splitstring[2];	
                			print OUTFILE "$id1\t";
                			$pattern_id1=substr($id1,0,13);			# attention! alignments: seqids are croped to 13 signs! 
                		}	
                		elsif($line	=~	m/\#\ 2\:/g){			#if line structure is like "# 2: seqname" print out third element of the line
                			@splitstring	=	split(/ +/,$line);	
                			$id2		=	$splitstring[2];
                			print OUTFILE "$id2\t";
                			$pattern_id2=substr($id2,0,13);			# attention! alignments: seqids are croped to 13 signs! 
                		}
                		elsif($line=~m/\#\ Length\:/ig){			#if line structure is like "# Length: 24" print out third element of the line
                			@splitstring	=	split(/ +/,$line);
                			$length		=	$splitstring[2];
                			print OUTFILE "$length\t";
                		}
                		elsif($line=~m/\#\ Identity\:/ig){			#if line structure is like "# Identity: 9/10 (90.0%)" 
                			@splitstring=split(/ +/,$line);			#split sting by whitespaces
                			$identity	=	$splitstring[3];	#fourth element is the identity in percentage
                			$identity	=~	s/[\(,\),\%]//g;	#cut off the braces and %-sign
                			print OUTFILE "$identity\t";
                		}
                		elsif($line=~m/\#\ Similarity\:/ig){
                			@splitstring=split(/ +/,$line);
                			$similarity	=	$splitstring[3];
                			$similarity	=~	s/[\(,\),\%]//g;
                			print OUTFILE "$similarity\t";
                		}
                		elsif($line=~m/\#\ Gaps\:/ig){
                			@splitstring=split(/ +/,$line);
                			$gaps	=	$splitstring[3];
                			$gaps	=~	s/[\(,\),\%]//g;
                			print OUTFILE "$gaps\t";
                		}
                		elsif($line=~m/\#\ Score\:/ig){
                			@splitstring	=	split(/ +/,$line);
                			$score		=	$splitstring[2];
                			print OUTFILE "$score\t";
                		}
                		
                		
                		elsif($line=~m/^$pattern_id1\ /g){
                			@splitstring	=	split(/ +/,$line);
                			$start1		=	$splitstring[1];
                			$seq1		=	$splitstring[2];
                			$end1		=	$splitstring[3];
                			print OUTFILE "$start1\t$seq1\t$end1\t";
                		}
                		
                		
                		elsif($line=~m/^$pattern_id2\ /g){
                			@splitstring	=	split(/ +/,$line);
                			$start2		=	$splitstring[1];
                			$seq2		=	$splitstring[2];
                			$end2		=	$splitstring[3];
                			print OUTFILE "$start2\t$seq2\t$end2\t";
                		}
                		elsif($line=~m/Aligned_sequences/g){			#insert linebreak, when next alignment starts
                			print OUTFILE "\n";
                		}
                	}
                }
                close OUTFILE;
                print "\n Now parsed to tabular format and written in $outfile\n ";
                Last edited by uloeber; 11-20-2013, 09:43 AM.

                Comment


                • #9
                  Much better! Thanks for sharing your solution.

                  Comment


                  • #10
                    Originally posted by gringer View Post
                    You can find a tutorial about how to write perl code here:
                    Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at: A Beginner's Introduction to Perl 5.10 A Beginner's Introduction to Files and Strings with Perl 5.10 A Beginner's Introduction to Regular...

                    That article was published more than 13 years ago. The single best piece of advice I could give for selecting a Perl guide is to look at the publication date. Unfortunately, there are a lot of how-to guides on the internet that are really out of touch with modern Perl.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    68 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X