I have two large sets of proteins (on the order of 100K each) and want to find the proteins that are unique to each set.
Is there any tool for doing it fast?
Thanks
cat file1 file2 | sort | uniq -u
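Note that the one-liner above treats every line as a record, so it fits files with one sequence per line and no repeats within a file, and it does not say which set each unique line came from. If that matters, `comm` on sorted copies is one option. This is only a sketch: the file names `set1.txt` and `set2.txt` and their contents are made up for illustration.

```shell
export LC_ALL=C   # byte-wise collation so sort and comm agree on ordering

# Hypothetical inputs: one sequence per line
printf 'PRTEINEIN\nPRTEINTHREE\n' > set1.txt
printf 'PRTEINEIN\nPRTEINNI\n' > set2.txt

# comm expects sorted input; sort -u also collapses repeats within a file
sort -u set1.txt > set1.sorted
sort -u set2.txt > set2.sorted

# -23 keeps lines found only in the first file, -13 only in the second
comm -23 set1.sorted set2.sorted > only_in_set1.txt
comm -13 set1.sorted set2.sorted > only_in_set2.txt
```

For multi-line FASTA records the sequences would have to be put on a single line each before this applies.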
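#!/usr/bin/perl
use warnings;
use strict;

# Report the sequences that occur in only one of two FASTA files.
# Records are compared by sequence, not by header.
open(my $f1, "<", "file1.fasta") or die("Cannot open file1.fasta: $!");
open(my $f2, "<", "file2.fasta") or die("Cannot open file2.fasta: $!");

# sequence => ID for file 1 records that have not been matched in file 2
my %seenSequences = ();
my $sequence = "";
my $seqID = "";

# Pass 1: store every record of file 1, joining wrapped sequence lines
while(<$f1>){
    chomp;
    if(/^>(.*)$/){
        $seenSequences{$sequence} = $seqID if $seqID ne "";
        $sequence = "";
        $seqID = $1;
    } else {
        $sequence .= $_;
    }
}
# Store the last record of file 1
$seenSequences{$sequence} = $seqID if $seqID ne "";
close($f1);

$sequence = "";
$seqID = "";

# Pass 2: print file 2 records absent from file 1; drop shared ones from the hash
while(<$f2>){
    chomp;
    if(/^>(.*)$/){
        if(($seqID ne "") && !exists($seenSequences{$sequence})){
            printf(">%s [2]\n%s\n", $seqID, $sequence);
        } else {
            delete($seenSequences{$sequence});
        }
        $seqID = $1;
        $sequence = "";
    } else {
        $sequence .= $_;
    }
}
# Handle the last record of file 2
if(($seqID ne "") && !exists($seenSequences{$sequence})){
    printf(">%s [2]\n%s\n", $seqID, $sequence);
} else {
    delete($seenSequences{$sequence});
}
close($f2);

# Whatever is left was seen only in file 1 (hash order is arbitrary)
while(my ($seq, $id) = each(%seenSequences)){
    printf(">%s [1]\n%s\n", $id, $seq);
}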
./141964.pl > out.fasta && head *.fasta
==> file1.fasta <==
>1
PRTEINEIN
>3
PRTEINTHREE

==> file2.fasta <==
>1
PRTEINEIN
>2
PRTEINNI

==> out.fasta <==
>2 [2]
PRTEINNI
>3 [1]
PRTEINTHREE
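For anyone who would rather stay in the shell, the same comparison by sequence can probably be done by flattening each FASTA record to one `sequence<TAB>id` line and letting `join` report the unpaired records. This is a rough sketch and not a replacement for the script: the `demo1.fasta`/`demo2.fasta` files and the `linearize` helper are made up here, and it assumes headers contain no tab characters and no record has an empty sequence.

```shell
export LC_ALL=C   # byte-wise collation so sort and join agree on ordering
tab="$(printf '\t')"

# Hypothetical inputs modelled on the example run; record 3 is wrapped over two lines
printf '>1\nPRTEINEIN\n>3\nPRTEIN\nTHREE\n' > demo1.fasta
printf '>1\nPRTEINEIN\n>2\nPRTEINNI\n' > demo2.fasta

# Print one "sequence<TAB>id" line per record, sorted and de-duplicated by sequence
linearize() {
    awk '/^>/ { if (id != "") print seq "\t" id; id = substr($0, 2); seq = ""; next }
         { seq = seq $0 }
         END { if (id != "") print seq "\t" id }' "$1" |
    sort -t "$tab" -k1,1 -u
}

linearize demo1.fasta > demo1.tsv
linearize demo2.fasta > demo2.tsv

# -v 1 / -v 2 print the records whose sequence has no match in the other file
join -t "$tab" -v 1 demo1.tsv demo2.tsv > unique_to_1.tsv
join -t "$tab" -v 2 demo1.tsv demo2.tsv > unique_to_2.tsv
```

Each output file holds the sequence in the first column and the header text in the second, so it can be turned back into FASTA with another short `awk` call if needed.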