Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tool for finding unique sets of sequences

    I've two sets of large number of proteins( in the order 100K) , and wish to find out unique proteins belonging to each set.

    Is there any tool for doing it fast?

    Thanks

  • #2
    You could do a reciprocal blat and select those that do not show a hit against the other set over the full length, if "unique" is exactly what you want.

    If parts of the proteins are going to be common (domains) then you may have to do some additional parsing/work.

    CD-HIT may also work: http://weizhong-lab.ucsd.edu/cd-hit/
    Last edited by GenoMax; 05-30-2014, 08:38 AM.

    Comment


    • #3
      Code:
      cat file1 file2 | sort | uniq -u
      will give you exactly unique lines from a file, and it's about as fast as you can get. But you probably don't want to do that because a naive sort won't preserve other information like sequence names, etc..

      Comment


      • #4
        Woa, do you want proteins that only appear once in a given file, or proteins that only appear in one file or the other?

        Both are easily doable in bash. If it's the first one, then gringer's approach with uniq -u will work.

        If you want the latter, you'll need to do a bit more work. First, you need to sort both files and (after ensuring that they consist only of one entry for a given protein, which can also be done with uniq) then use comm with the -3 flag: this will output two columns, with lines unique to the first and second file (with shared lines suppressed by -3).

        Comment


        • #5
          Thanks for your answers.
          I've two sequences sets derived from two databases with different fasta headers, and I wish to find unique sequences belonging to each of these databases.I wish to retain the Fasta header information for the unique sequences as well.
          I can write a perl hash based script for such a comparison but that'll be quite slow I beleive.
          Any other options?

          Comment


          • #6
            Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

            For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

            Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.

            Comment


            • #7
              Originally posted by JamieHeather View Post
              Do you expect the actual fasta sequences to be exact matches, but with different headers, across the two databases?

              For instance if ProteinA was in database 1, and there was a fasta entry of a section of the same protein in database 2, would you count that as being in both databases or is that two unique entries?

              Either way you'll probably have to write your own script, but if you are indeed looking for unique proteins you're probably in for a little bit more work.
              Thanks for your reply .

              I'll consider only exact matches and hence I'll consider ProteinA and ProteinB unique to their corresponding databases.

              and plan to write something like this in perl:

              use List::Util qw(first);

              if( !defined ( first {$_ eq $seqA} @sorted_seqB ) ){.....}

              #$seqA is a sequence in DatabaseA and $_ is one of the sequences of DatbaseB

              To speed up I might try the MCE module

              use MCE::Grep;

              if(! mce_grep {$_ eq $seqA} @sorted_seqB){....}
              Last edited by woa; 06-03-2014, 08:13 AM.

              Comment


              • #8
                I'm not very fluent in Perl, but I think that might do the trick.

                For posterity, my approach in python would just be to make a defaultdict for each database, with the sequence strings as keys (and probably the SeqIO fasta or fasta ID instance for the value), then take do db1.keys().difference(db2.keys()) to find those sequences unique to each database.

                Comment


                • #9
                  I just cooked up some rough code. It needs to store sequence and ID in memory for all sequences in the first file, but should be otherwise fairly space/time efficient. Here's an overview of how it does it:
                  1. Read all lines from file1, store sequence, id as hash (hash to sequence)
                  2. Read all lines from file2, print out non-matching sequence/id and delete any matching sequence/id
                  3. Print remaining non-matching sequence/id from file1


                  Code:
                  #!/usr/bin/perl
                  use warnings;
                  use strict;
                  
                  open(my $f1, "< file1.fasta") or die("Cannot open file1");
                  open(my $f2, "< file2.fasta") or die("Cannot open file2");
                  
                  my %seenSequences = ();
                  my $sequence = "";
                  my $seqID = "";
                  
                  while(<$f1>){
                    chomp;
                    if(/^>(.*)$/){
                      $seenSequences{$sequence} = $seqID if $seqID ne "";
                      $sequence = "";
                      $seqID = $1;
                    } else {
                      $sequence .= $_;
                    }
                  }
                  $seenSequences{$sequence} = $seqID if $seqID ne "";
                  close($f1);
                  
                  $sequence = "";
                  $seqID = "";
                  while(<$f2>){
                    chomp;
                    if(/^>(.*)$/){
                      if(($seqID ne "") && !exists($seenSequences{$sequence})){
                        printf(">%s [2]\n%s\n", $seqID, $sequence);
                      } else {
                        delete($seenSequences{$sequence});
                      }
                      $seqID = $1;
                      $sequence = "";
                    } else {
                      $sequence .= $_;
                    }
                  }
                  if(($seqID ne "") && !exists($seenSequences{$sequence})){
                    printf(">%s [2]\n%s\n", $seqID, $sequence);
                  } else {
                    delete($seenSequences{$sequence});
                  }
                  close($f1);
                  
                  while(my ($seq, $id) = each(%seenSequences)){
                    printf(">%s [1]\n%s\n", $id, $seq);
                  }
                  Here's an example run:
                  Code:
                   ./141964.pl > out.fasta && head *.fasta
                  ==> file1.fasta <==                                                                                                                                                                                                               
                  >1                                                                                                                                                                                                                                
                  PRTEINEIN                                                                                                                                                                                                                         
                  >3
                  PRTEINTHREE
                  
                  ==> file2.fasta <==
                  >1
                  PRTEINEIN
                  >2
                  PRTEINNI
                  
                  ==> out.fasta <==
                  >2 [2]
                  PRTEINNI
                  >3 [1]
                  PRTEINTHREE
                  Last edited by gringer; 06-03-2014, 04:08 PM.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Understanding Genetic Influence on Infectious Disease
                    by seqadmin




                    During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                    Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                    09-09-2024, 10:59 AM
                  • seqadmin
                    Addressing Off-Target Effects in CRISPR Technologies
                    by seqadmin






                    The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                    08-27-2024, 04:44 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Today, 06:25 AM
                  0 responses
                  13 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 01:02 PM
                  0 responses
                  12 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 09-18-2024, 06:39 AM
                  0 responses
                  14 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 09-11-2024, 02:44 PM
                  0 responses
                  14 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X