Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • woa
    Member
    • Mar 2011
    • 13

    Tool for finding unique sets of sequences

    I've two sets of large number of proteins( in the order 100K) , and wish to find out unique proteins belonging to each set.

    Is there any tool for doing it fast?

    Thanks
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    You could do a reciprocal blat and select those that do not show a hit against the other set over the full length, if "unique" is exactly what you want.

    If parts of the proteins are going to be common (domains) then you may have to do some additional parsing/work.

    CD-HIT may also work: http://weizhong-lab.ucsd.edu/cd-hit/
    Last edited by GenoMax; 05-30-2014, 08:38 AM.

    Comment

    • gringer
      David Eccles (gringer)
      • May 2011
      • 845

      #3
      Code:
      cat file1 file2 | sort | uniq -u
      will give you exactly unique lines from a file, and it's about as fast as you can get. But you probably don't want to do that because a naive sort won't preserve other information like sequence names, etc..

      Comment

      • JamieHeather
        @jamimmunology
        • Nov 2012
        • 96

        #4
        Woa, do you want proteins that only appear once in a given file, or proteins that only appear in one file or the other?

        Both are easily doable in bash. If it's the first one, then gringer's approach with uniq -u will work.

        If you want the latter, you'll need to do a bit more work. First, you need to sort both files and (after ensuring that they consist only of one entry for a given protein, which can also be done with uniq) then use comm with the -3 flag: this will output two columns, with lines unique to the first and second file (with shared lines suppressed by -3).

        Comment

        • woa
          Member
          • Mar 2011
          • 13

          #5
          Thanks for your answers.
          I've two sequences sets derived from two databases with different fasta headers, and I wish to find unique sequences belonging to each of these databases.I wish to retain the Fasta header information for the unique sequences as well.
          I can write a perl hash based script for such a comparison but that'll be quite slow I beleive.
          Any other options?

          Comment

          • JamieHeather
            @jamimmunology
            • Nov 2012
            • 96

            #6
            Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

            For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

            Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

            Comment

            • woa
              Member
              • Mar 2011
              • 13

              #7
              Originally posted by JamieHeather View Post
              Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

              For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

              Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.
              Thanks for your reply .

              I'll consider only exact matches and hence I'll consider ProteinA and ProteinB unique to their corresponding databases.

              and plan to write something like this in perl:

              use List::Util qw(first);

              if( !defined ( first {$_ eq $seqA} @sorted_seqB ) ){.....}

              #$seqA is a sequence in DatabaseA and $_ is one of the sequences of DatbaseB

              To speed up I might try the MCE module

              use MCE::Grep;

              if(! mce_grep {$_ eq $seqA} @sorted_seqB){....}
              Last edited by woa; 06-03-2014, 08:13 AM.

              Comment

              • JamieHeather
                @jamimmunology
                • Nov 2012
                • 96

                #8
                I'm not very fluent in Perl, but I think that might do the trick.

                For posterity, my approach in python would just be to make a defaultdict for each database, with the sequence strings as keys (and probably the SeqIO fasta or fasta ID instance for the value), then take do db1.keys().difference(db2.keys()) to find those sequences unique to each database.

                Comment

                • gringer
                  David Eccles (gringer)
                  • May 2011
                  • 845

                  #9
                  I just cooked up some rough code. It needs to store sequence and ID in memory for all sequences in the first file, but should be otherwise fairly space/time efficient. Here's an overview of how it does it:
                  1. Read all lines from file1, store sequence, id as hash (hash to sequence)
                  2. Read all lines from file2, print out non-matching sequence/id and delete any matching sequence/id
                  3. Print remaining non-matching sequence/id from file1


                  Code:
                  #!/usr/bin/perl
                  use warnings;
                  use strict;
                  
                  open(my $f1, "< file1.fasta") or die("Cannot open file1");
                  open(my $f2, "< file2.fasta") or die("Cannot open file2");
                  
                  my %seenSequences = ();
                  my $sequence = "";
                  my $seqID = "";
                  
                  while(<$f1>){
                    chomp;
                    if(/^>(.*)$/){
                      $seenSequences{$sequence} = $seqID if $seqID ne "";
                      $sequence = "";
                      $seqID = $1;
                    } else {
                      $sequence .= $_;
                    }
                  }
                  $seenSequences{$sequence} = $seqID if $seqID ne "";
                  close($f1);
                  
                  $sequence = "";
                  $seqID = "";
                  while(<$f2>){
                    chomp;
                    if(/^>(.*)$/){
                      if(($seqID ne "") && !exists($seenSequences{$sequence})){
                        printf(">%s [2]\n%s\n", $seqID, $sequence);
                      } else {
                        delete($seenSequences{$sequence});
                      }
                      $seqID = $1;
                      $sequence = "";
                    } else {
                      $sequence .= $_;
                    }
                  }
                  if(($seqID ne "") && !exists($seenSequences{$sequence})){
                    printf(">%s [2]\n%s\n", $seqID, $sequence);
                  } else {
                    delete($seenSequences{$sequence});
                  }
                  close($f1);
                  
                  while(my ($seq, $id) = each(%seenSequences)){
                    printf(">%s [1]\n%s\n", $id, $seq);
                  }
                  Here's an example run:
                  Code:
                   ./141964.pl > out.fasta && head *.fasta
                  ==> file1.fasta <==                                                                                                                                                                                                               
                  >1                                                                                                                                                                                                                                
                  PRTEINEIN                                                                                                                                                                                                                         
                  >3
                  PRTEINTHREE
                  
                  ==> file2.fasta <==
                  >1
                  PRTEINEIN
                  >2
                  PRTEINNI
                  
                  ==> out.fasta <==
                  >2 [2]
                  PRTEINNI
                  >3 [1]
                  PRTEINTHREE
                  Last edited by gringer; 06-03-2014, 04:08 PM.

                  Comment

                  Latest Articles

                  Collapse

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, 06-05-2026, 10:09 AM
                  0 responses
                  14 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-04-2026, 08:59 AM
                  0 responses
                  24 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 12:03 PM
                  0 responses
                  31 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 11:40 AM
                  0 responses
                  23 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...