**GenoMax** · 05-30-2014, 08:35 AM

You could do a reciprocal blat and select those that do not show a hit against the other set over the full length, if "unique" is exactly what you want.

If parts of the proteins are going to be common (domains) then you may have to do some additional parsing/work.

CD-HIT may also work: http://weizhong-lab.ucsd.edu/cd-hit/

**gringer** · 05-31-2014, 04:47 AM

Code:

cat file1 file2 | sort | uniq -u

will give you the lines that occur exactly once across both files, and it's about as fast as you can get. But you probably don't want to do that, because a naive sort won't preserve other information such as sequence names.

**JamieHeather** · 05-31-2014, 05:58 AM

Woa, do you want proteins that only appear once in a given file, or proteins that only appear in one file or the other?

Both are easily doable in bash. If it's the first one, then gringer's approach with uniq -u will work.

If you want the latter, you'll need to do a bit more work. First, sort both files and make sure each contains only one entry per protein (which can also be done with uniq), then use comm with the -3 flag: this outputs two columns, the first holding lines unique to the first file and the second holding lines unique to the second file (-3 suppresses the lines common to both).
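The two readings of "unique" can be sketched in Python as follows. This is only an illustration on made-up one-line entries; `once_overall` and `only_in_one_file` are hypothetical helper names, and like the shell versions they compare whole lines, so they say nothing about FASTA headers.

```python
from collections import Counter


def once_overall(lines1, lines2):
    """Lines occurring exactly once across both inputs (like sort | uniq -u)."""
    counts = Counter(lines1)
    counts.update(lines2)
    return sorted(line for line, n in counts.items() if n == 1)


def only_in_one_file(lines1, lines2):
    """Lines found in only one input (like comm -3 on sorted, de-duplicated files)."""
    set1, set2 = set(lines1), set(lines2)
    return sorted(set1 - set2), sorted(set2 - set1)


# Toy data: "BBB" is repeated within file1, "CCC" is shared by both files.
file1 = ["AAA", "BBB", "BBB", "CCC"]
file2 = ["CCC", "DDD"]

print(once_overall(file1, file2))
print(only_in_one_file(file1, file2))
```

Note how the entry repeated within one file ("BBB") is dropped by the first reading but reported as specific to file1 by the second.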

**woa** · 06-01-2014, 08:14 PM

Thanks for your answers.
I have two sequence sets derived from two databases with different FASTA headers, and I wish to find the sequences unique to each of these databases. I also wish to retain the FASTA header information for the unique sequences.
I can write a Perl hash-based script for such a comparison, but I believe that would be quite slow.
Any other options?

**JamieHeather** · 06-02-2014, 03:19 AM

Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

**woa** · 06-03-2014, 08:00 AM

Thanks for your reply.

I'll consider only exact matches, and hence I'll consider ProteinA and ProteinB unique to their corresponding databases.

I plan to write something like this in Perl:

use List::Util qw(first);

if( !defined ( first {$_ eq $seqA} @sorted_seqB ) ){.....}

# $seqA is a sequence in DatabaseA and $_ is one of the sequences of DatabaseB

To speed this up I might try the MCE module:

use MCE::Grep;

if(! mce_grep {$_ eq $seqA} @sorted_seqB){....}

**JamieHeather** · 06-03-2014, 01:36 PM

I'm not very fluent in Perl, but I think that might do the trick.

For posterity, my approach in Python would just be to make a defaultdict for each database, with the sequence strings as keys (and probably the SeqIO FASTA record or the FASTA ID for the value), then do set(db1).difference(db2), and the reverse, to find the sequences unique to each database.
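A minimal sketch of that idea, under a few assumptions: the FASTA text is parsed by a small hand-rolled generator (`read_fasta`, a hypothetical helper standing in for SeqIO), the two databases are represented by toy in-memory strings, and each sequence maps to the list of headers that carry it so that duplicate entries are not lost.

```python
from collections import defaultdict
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def index_by_sequence(handle):
    """Map each sequence string to the list of headers that carry it."""
    index = defaultdict(list)
    for header, seq in read_fasta(handle):
        index[seq].append(header)
    return index


# Toy data standing in for the two database files (headers are made up).
db1 = index_by_sequence(io.StringIO(">a1\nPRTEINEIN\n>a3\nPRTEIN\nTHREE\n"))
db2 = index_by_sequence(io.StringIO(">b1\nPRTEINEIN\n>b2\nPRTEINNI\n"))

# Key views support set operations, so "-" gives the sequences found on one side only.
only_db1 = {seq: db1[seq] for seq in db1.keys() - db2.keys()}
only_db2 = {seq: db2[seq] for seq in db2.keys() - db1.keys()}

for label, unique in (("db1", only_db1), ("db2", only_db2)):
    for seq, headers in unique.items():
        print(label, headers, seq)
```

Because the comparison is on exact sequence strings, a fragment of a protein counts as a different entry from the full-length protein.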

**gringer** · 06-03-2014, 04:05 PM

I just cooked up some rough code. It needs to store sequence and ID in memory for all sequences in the first file, but should be otherwise fairly space/time efficient. Here's an overview of how it does it:

Read all lines from file1, storing each sequence and its ID in a hash (keyed by sequence)
Read all lines from file2, printing each non-matching sequence/ID and deleting any matching sequence/ID from the hash
Print the remaining (non-matching) sequence/ID pairs from file1

Code:

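#!/usr/bin/perl
use warnings;
use strict;

# Three-argument open avoids surprises with unusual file names.
open(my $f1, "<", "file1.fasta") or die("Cannot open file1.fasta: $!");
open(my $f2, "<", "file2.fasta") or die("Cannot open file2.fasta: $!");

my %seenSequences = ();   # sequence => ID, for file1 sequences not (yet) matched in file2
my %sharedSequences = (); # sequences already found in both files
my $sequence = "";
my $seqID = "";

# Pass 1: store every file1 record, keyed by sequence.
# If a sequence occurs more than once in file1, only the last ID is kept.
while(<$f1>){
  chomp;
  if(/^>(.*)$/){
    $seenSequences{$sequence} = $seqID if $seqID ne "";
    $sequence = "";
    $seqID = $1;
  } else {
    $sequence .= $_;
  }
}
$seenSequences{$sequence} = $seqID if $seqID ne "";
close($f1);

# Print a file2 record unless its sequence also occurs in file1.
# Matched sequences are deleted from %seenSequences and remembered in
# %sharedSequences, so a repeat in file2 is not misreported as unique.
sub checkFile2Record {
  my ($id, $seq) = @_;
  return if $id eq "";
  if(exists($seenSequences{$seq})){
    delete($seenSequences{$seq});
    $sharedSequences{$seq} = 1;
  } elsif(!exists($sharedSequences{$seq})){
    printf(">%s [2]\n%s\n", $id, $seq);
  }
}

# Pass 2: stream file2 one record at a time.
$sequence = "";
$seqID = "";
while(<$f2>){
  chomp;
  if(/^>(.*)$/){
    checkFile2Record($seqID, $sequence);
    $seqID = $1;
    $sequence = "";
  } else {
    $sequence .= $_;
  }
}
checkFile2Record($seqID, $sequence);
close($f2);

# Whatever is left in the hash was never matched, so it is unique to file1.
while(my ($seq, $id) = each(%seenSequences)){
  printf(">%s [1]\n%s\n", $id, $seq);
}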

Here's an example run:

Code:

./141964.pl > out.fasta && head *.fasta
==> file1.fasta <==
>1
PRTEINEIN
>3
PRTEINTHREE

==> file2.fasta <==
>1
PRTEINEIN
>2
PRTEINNI

==> out.fasta <==
>2 [2]
PRTEINNI
>3 [1]
PRTEINTHREE

Tool for finding unique sets of sequences
