Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • gsgs
    Senior Member
    • Oct 2009
    • 139

    comparing large sets of sequences

    comparing large sets of sequences

    suppose you have 2 large sequences or sets of sequences that you want to compare
    for matching entries.
    E.g. you sequenced some ancient bone and want to check for bacterial contamination

    For simplicity assume you have 2 sets of 1000 nucleotide sequences of length 1000 ,
    1GB each set that you want to compare against each other, find the best pairs of matching
    sequences or subsequences.
    Sounds like a standard problem, doesn't it ?

    How is it done ? What is the best, fastest method ?
  • xied75
    Senior Member
    • Feb 2012
    • 129

    #2
    1000 of 1000bp is 1MB?

    When you say match you mean string match?
    When you say subsequence match you mean like substring?
    Or you mean the best aligned pair? (Smith-Waterman)

    Comment

    • gsgs
      Senior Member
      • Oct 2009
      • 139

      #3
      yes, 1000*1000 bp = 1MB as fasta file, sorry I miscalculated.
      So, lets say 1000000 sequences of 1000 bp

      stringmatch or whatever is suitable to find genetical relatives

      yes, I meant substring instead of subsequence

      Smith Waterman would be too slow ?!

      Comment

      • xied75
        Senior Member
        • Feb 2012
        • 129

        #4
        Full string exact match is easy to tell, how do you want compare substring then?

        Comment

        • gsgs
          Senior Member
          • Oct 2009
          • 139

          #5
          OK, this came up in another thread (well, 2 threads) before and I thought to myself
          that the methods being used are just ineffective and there is a better way.

          I build a binary table of used substrings of length 15 (well, length 16 or 17 if both files are 1GB ?) in file 1
          and then look it up for each new read nucleotide (and thus 15-substring) in file 2
          This is almost as fast as reading the two files from HD into memory.

          But just checking for 15-substring-matches is not enough, too short, too many random matches.
          So I check for 30-substrings each of whose 16 15-substrings are marked in the binary table.

          I found that this worked very well in practice and was wondering how the method is called and
          implementations or papers about it, but couldn't find any.
          Instead I found lots of info about blast and other methods which apparently are much slower and more complicated and lots of efforts put into it.

          Comment

          • xied75
            Senior Member
            • Feb 2012
            • 129

            #6
            What you are doing is 'Hash Join' in RDBMS. It's like

            Code:
            select * from t1 inner join t2 on t1.column1 = t2.column1
            Then the DB engine will build a hash out of t2, then using t1 rows to probe this hash.

            So in the end you want a report saying two sets have xxxxx rows in common, xxxx for set A only, xxxx for set B only, and draw a Venn diagram?

            This is EXACT match. If Inexact match allowed, i.e. a defined number of mismatch, gap open, etc. then this turns into classic alignment problem.

            Comment

            • gsgs
              Senior Member
              • Oct 2009
              • 139

              #7
              --------------------------------------------------------------------------------

              > What you are doing is 'Hash Join' in RDBMS.

              Thanks.

              I give the number (and %) of matching 15-substrings, number of matching 30-15-substrings
              (substrings of length 30 each of whose 15 substrings are marked in the table)
              with an option to print the matching 30-15-substrings, the record number
              if it's a fasta-file, and the entry-position number in the record

              You could allow for gaps, so e.g. only 15 or 14 of the 16 substrings are required to be marked
              or mark the double number of strings from file 1 (gap somewhere) or such.
              But I don't feel that this would improve things a lot.

              30,15 are variable, depending on filesize and available memory, memory-cache size



              -------------------------------------------------

              RDBMS:




              Last edited by gsgs; 01-07-2013, 09:50 AM.

              Comment

              • xied75
                Senior Member
                • Feb 2012
                • 129

                #8
                Ok, you are using a Moving Window of size 15 to build the hash. That means you can't have mismatch or gap within this 15b (it's the key to look up from hash table), you can have gap or mismatch between multiple hits, but not within any 15b hit.

                This will take too much memory once the set is large.

                This is almost the scheme of first gen aligner, including ELAND, MAQ, and SOAPv1.

                If you want a fast inexact solution, I'll take one set into FASTA, call BWA INDEX on it; then take another set into FASTQ, call BWA ALN. (Many people do this.)

                Comment

                • gsgs
                  Senior Member
                  • Oct 2009
                  • 139

                  #9
                  if n is the number of nucleotides in the smaller one of the 2 files to be compared
                  then the memory requirement is ~4*n bits or n/2 bytes.
                  I tried it with the 15-substring table (4^15 bits = 134MB) on human chromosomes of length
                  > 200MB and it seemed to work well.

                  fake hits are usually substrings with repeats or high content with one or 2 nucleotides only,
                  i.e. high content of nucleotide T. Now I need a database with frequent such "fake" hits
                  so I can exclude them ...does it exist ?
                  No big problem, if also real hits are excluded, the sequences are long and there will be other hits
                  if there is real common ancestry.

                  searching for the mentioned software, I found:


                  so many programs ...
                  I still don't understand why anyone would use a different method, at least as a first step
                  to reduce the size of the file of possible candidates.
                  What can be faster than basically just the time required to load the data into memory ? (O(n))
                  Fast memory caching could become a problem with GB-sets, but I didn't see this yet.

                  ----------------------------------------------------------------------------------------------

                  e.g. comparing human chromosome 1: (30,15)

                  224999690 15-substrings were read from 1 sequences from file f:\hg18\chr01
                  these gave 136840909 (=60.82%) different markings in the table

                  chimpanzee
                  217189828 15-substrings were read from file f:\chimp\chr01
                  192230629 (=88.51%) of these were marked in the table
                  155119260 (=71.42%)matching 30-15-substrings were found

                  gorilla
                  212549001 15-substrings were read from file f:\gorill\chr01
                  186188324 (=87.60%) of these were marked in the table
                  143706656 (=67.61%)matching 30-15-substrings were found

                  macaca mulatta
                  219576101 15-substrings were read from file f:\macmul\chr01
                  139018325 (=63.31%) of these were marked in the table
                  54682566 (=24.90%)matching 30-15-substrings were found

                  human chromosome 2 (unrelated)
                  237709794 15-substrings were read from file f:\hg18\chr02
                  123413391 (=51.92%) of these were marked in the table
                  31633817 (=13.31%)matching 30-15-substrings were found
                  (repititions,unusual strings, etc.)

                  Comment

                  • Jeremy
                    Senior Member
                    • Nov 2009
                    • 190

                    #10
                    Surely blast or CD-HIT would be easier than trying to code your own algorithm?

                    Comment

                    • gsgs
                      Senior Member
                      • Oct 2009
                      • 139

                      #11
                      yes, but why are they all doing it the wrong way
                      despite so much research and papers ??

                      Comment

                      • bioBob
                        Member
                        • Mar 2011
                        • 72

                        #12
                        If you are going to use large but exact matches, why not use blast and up the word size? This goes pretty fast at say 19.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM
                        • SEQadmin2
                          Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                          by SEQadmin2


                          With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                          Introduction

                          Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                          05-22-2026, 06:42 AM
                        • SEQadmin2
                          Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                          by SEQadmin2

                          Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                          Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                          05-06-2026, 09:04 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, 06-02-2026, 12:03 PM
                        0 responses
                        19 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 11:40 AM
                        0 responses
                        14 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-28-2026, 11:40 AM
                        0 responses
                        29 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 05-26-2026, 10:12 AM
                        0 responses
                        31 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...