
  • Serious question for an algorithms person

    I'm working on a string processing algorithm for an application unrelated to biology, but I think it might be useful for processing genome data. It's been surprisingly hard to find out! Hoping some bio-computing sophisticate can shed light on this.

    Problem the algorithm solves now: searching huge strings to find matches to corrupted targets. You have a target sequence of anywhere from, say, half a kilobyte up to megabytes (e.g., a sequence of base pairs or any other character data). You want to find partial matches to the target in strings of any length (e.g., multi-gigabyte). Caveat: this algorithm does not deal well with short targets (say, under a couple of hundred characters), but it's fast at finding longer strings that are significantly corrupted or changed.

    One reason it seems like it might be useful: a linear-time pre-processing step results in meta-data that is tiny compared to the input. Depending on parameters, maybe 1/50 to 1/1000 of the original size. Once the pre-processing is done, you only deal with the metadata for searches, so you can search against hundreds of gigabytes of genomes in memory, even on a small platform.
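
    To make the compression claim concrete, here is the back-of-envelope arithmetic using the ratios stated above (the 100 GB corpus size is purely illustrative):

```python
# Back-of-envelope arithmetic for the metadata-size claim.
# The 100 GB corpus is illustrative; the 1/50 and 1/1000
# ratios are the figures quoted in the text.
corpus_gb = 100.0
for ratio in (1 / 50, 1 / 1000):
    meta_gb = corpus_gb * ratio
    print(f"1:{round(1 / ratio)} compression -> {meta_gb:.2f} GB of metadata")
```

    At either end of the range, the metadata for a large genome collection fits comfortably in RAM, which is the point being made.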

    This is not an alignment algorithm per se--it only finds the location where a substring that will give a good alignment exists. (More precisely, it finds substrings whose edit distance from the target is less than some parameter you give it.)

    My question is several-fold:
    • Am I correct that finding the location of the alignment cheaply is half the battle? (for some definition of "half".)
    • If this is true, how FAST does finding the align-able substring have to be to be useful? I believe that most procedures for finding these substrings are significantly worse than linear--this might be a false assumption.
    • A big-O answer would be useful. Answers in terms of clock time might also help. E.g., do existing algorithms take seconds, minutes, or hours to find matches in a gigabyte?

    So in other words, my question is: would a function like the following have applications, and if so, how fast would it need to be in order to be useful? I'm making no assumptions about the alphabet (it could be {G,A,T,C} or any other set of characters).
    • You give it a target string, plus a parameter that indicates the minimum quality you require of matches.
    • The minimum match quality is expressed as an "edit" distance from the target.
    • You also have to give it the names of the genomes you want searched.
    • You get back a list of indexes into the genomes where the matches occur.
    • Of course this assumes you've already pre-processed the genomes in question. If not, naturally, you need to supply them the first time.
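
    To pin down that contract, here is a brute-force reference implementation in Python. To be clear, this is not the algorithm being described -- just a naive executable specification of the inputs and outputs, with illustrative names:

```python
def levenshtein(a, b):
    """Classic O(len(a)*len(b)) edit-distance DP, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def find_matches(target, genomes, max_dist):
    """Return (genome_name, offset) pairs where a target-length window
    is within max_dist edits of the target. Brute force: real tools
    preprocess/index first rather than scanning every window."""
    hits = []
    w = len(target)
    for name, seq in genomes.items():
        for i in range(len(seq) - w + 1):
            if levenshtein(target, seq[i:i + w]) <= max_dist:
                hits.append((name, i))
    return hits
```

    For example, `find_matches("GATTACA", {"toy": "TTTGATTACATTT"}, 1)` returns `[("toy", 3)]`.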

    The algorithmic complexity is hard to state simply because there are several parameters, but in a nutshell: computing the metadata is more or less linear (you can compute it about as fast as you can read the input). An actual search is linear in the size of the metadata, but varies significantly with target size.

    A crude demo on one processing thread searches the 100 Mbp C. elegans genome for five-kbp sequences in about 3 seconds, returning the location of any subsequence within approximately a given edit distance of the target. Extrapolating, the same search against the human genome takes about 10 minutes. In practice, you can use as many hardware cores as you want, so an N-core server runs about N times as fast.

    If you got this far, you're patient indeed! Can you offer any insight, even if it's only to say "don't waste your time---it's a solved problem!"

  • #2
    One 5Kb sequence to the human genome in 10 minutes? Try "bwa-sw" or "bowtie2" as a benchmark.


    • #3
      I don't think finding the general location is considered that much of a problem, especially for longer searches. BLAST, for example, just uses a hash, where you can pick the word size, plus some heuristics about how many hash hits fall within a certain distance of each other. There are also algorithms based on the Burrows-Wheeler transform -- basically reverse-engineering data compression algorithms to do search -- which are very popular these days (bwa, bowtie, ...). I think what most people in bioinformatics are more interested in is accuracy: if you could show that your algorithm did better at finding the right sequence in the presence of large amounts of non-homologous sequence or error, that would be interesting. For longer sequences, though, the dynamic programming algorithms (or their approximations) for computing the actual alignments usually dominate the complexity.
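
      The hash-based seeding idea mentioned above can be sketched in a few lines. This toy (word size `k` as a parameter; not BLAST's actual implementation) indexes every k-mer of the reference and reports exact k-mer hits for a query; real tools then cluster nearby hits and extend them into alignments:

```python
from collections import defaultdict

def build_word_index(reference, k):
    """Hash every k-mer ('word') of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_hits(query, index, k):
    """(query_offset, reference_offset) pairs for exact k-mer matches."""
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]
```

      For instance, with `index = build_word_index("ACGTACGT", 4)`, the call `seed_hits("ACGT", index, 4)` yields `[(0, 0), (0, 4)]`: the query word occurs at reference offsets 0 and 4.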


      • #4
        A little more detail

        Thanks for the tips. I'll try to scare up those benchmarks you refer to!

        Regarding accuracy, to clarify: the algorithm produces very compressed metadata from which you can estimate the Levenshtein edit distance of the target from each region of the search data.

        Thus, you can choose to exclude essentially all spurious matches by choosing a small edit distance, or allow a large edit distance to find very distant resemblances.

        Actually, saying it's like the Levenshtein distance (LD) isn't quite accurate. If the match had large blocks shuffled around with respect to their order in the target, the algorithm gives you an estimate of the LD of the match with the blocks in the best order.

        One other clarification---the 10 minutes is for a single laptop thread. If you ran it on a 16-core server, the number would be 0.625 minutes. And so on. But on to those benchmarks!!!


        • #5
          I would hypothesize that 10 minutes on a single thread is a few orders of magnitude slower than the BWT (FM-index) based tools.


          • #6
            Thanks for the information.

            Thanks a lot for the information. That's pretty amazing---on the order of a second or better to search three gigs of data for a near match to a string?

            As I understand it, BWA is for shorter targets and this is for long targets, but that definitely gives me a ballpark idea that my gizmo is probably not going to be an advance on the state of the art!

            I'd be really curious about how the same is done for longer sequences (half a K to a meg), if anyone knows.


            • #7
              BWA contains two algorithms, one for short sequences (less than 100bp) and one for longer sequences (100bp to 1Mb or more). Try "bwa bwasw" and see this paper: http://bioinformatics.oxfordjournals.../26/5/589.long


              • #8
                You should be aware of other tools for long sequence matching. E.g., MUMmer is designed for genome-scale comparisons.


                • #9
                  Thanks so much for this. This information is difficult to excavate for an outsider. I'll take a look at these sources. The original purpose was near-duplicate detection in web queries, social media, etc., and I thought some of the techniques might translate.

                  FYI, in case there's any cross-fertilization possible (to put it in bio terms!): I am specifically working on how to clean up the results of ultra-high-speed duplicate detection for Web crawlers. In the abstract, the task sounds a lot like identifying alignment candidates. Hence this discussion, which several people have kindly given some time to.

                  A Web crawler gets a web page, then has to find out whether anything very much like it has been seen before among hundreds of billions of other pages (this takes under 100 ms, but with a lot of errors, and the results are very coarse). My algorithm allows a second, highly accurate estimate of the match quality of the candidate duplicates. It would take millions of times too long to fetch the pages and compare them directly, but you can estimate the Levenshtein distance of pairs of pages fairly accurately, using only the metadata, in microseconds, with the metadata being only about 0.005 the size of the original.
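
                  For comparison, the standard sketch-based approach to web near-duplicate detection also compares pages through tiny fingerprints. This is MinHash-style bottom-k sketching over character shingles -- a well-known technique, not the edit-distance metadata scheme described above:

```python
import hashlib

def sketch(text, k=8, size=64):
    """Bottom-k sketch: hash every k-character shingle of the text
    and keep the `size` smallest hash values as the fingerprint."""
    hashes = {int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16)
              for i in range(len(text) - k + 1)}
    return sorted(hashes)[:size]

def jaccard_estimate(s1, s2, size=64):
    """Estimate the shingle-set Jaccard similarity of two documents
    from their bottom-k sketches alone."""
    union_bottom = sorted(set(s1) | set(s2))[:size]
    shared = len(set(union_bottom) & set(s1) & set(s2))
    return shared / len(union_bottom)
```

                  Identical pages give an estimate of 1.0 and unrelated pages give roughly 0.0, all without ever comparing the full texts -- the same "compare only the metadata" idea, though it estimates set overlap rather than edit distance.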

                  So I was casting around for bio applications for this by applying the same technique to a sequential search of a genome. But it sounds like the existing techniques might be too good already for it to be worth pursuing. If anyone has any other ideas for the strategy, do let me know!

                  Thanks again


                  • #10
                    The literature can be dense, especially if you don't understand the basics of text compression. Most of us are just glad that we can use the programs!


                    • #11
                      The article cited by nilshomer (http://bioinformatics.oxfordjournals.../26/5/589.long) is GREAT! It answered exactly what I was asking. Moreover, it answers it encouragingly, because it reveals the sweet spot for this admittedly eccentric approach. Punch line at the end.

                      It's not a perfect comparison, but if I read the tables correctly, the algorithm appears to be about as much faster than BLAT as it is slower than BWA-SW for appropriate targets. So in its current form, it's slower than the fastest algorithm by a factor of maybe 5.

                      Happily, the performance numbers I gave assumed no bound on how bad the matches one is seeking can be. If you're only interested in matches within a bounded edit distance, it's many times faster, which would put it in the same general ballpark as BWA-SW. (And it's just a crude Java implementation on a slower platform, so it might be pretty close even without bounding the match quality.)

                      The possible sweet spot I see is that, as I surmised, BWA-SW takes significantly more memory than the genome itself, while this algorithm takes only a small fraction of the genome's size.

                      This means that if you wanted to search, say, hundreds of genomes, it would have a crushing advantage in IO. BWA-SW could realistically fit only about one, possibly two, human genomes in memory, so you'd have to do a tremendous amount of searching to amortize the IO cost of loading each genome.

                      The algorithm I'm talking about can fit the meta-data for about 200 to 400 genomes into the same amount of memory that BWA-SW uses for one. That means that if you ever have to do searches against hundreds of genomes, BWA-SW would be at an enormous disadvantage.

                      As a non-biologist, I don't know if searching many genomes is a real problem. But a quick look suggests that the algorithm could be faster than existing techniques in any case where you were doing:
                      • Few searches per session (because of the load time for the BWA-SW data).
                      • Searching against multiple genomes.
                      • Working with constrained hardware. E.g., you could easily allow searches against 100 genomes on a laptop rather than a jumbo server, or, say, 1000 genomes on a machine big enough to hold a few terabytes of disk.

                      Something to think about, anyway.

                      Thanks for all the advice.


                      • #12
                        Actually, BWA-SW uses most of its memory for indexing, not searching, and it is possible to build the indexes in pieces and then merge them on standard-memory machines, which is useful in assembly algorithms. But I am not sure people really have a need for searching alone, so that hasn't been developed yet.


                        • #13
                          Mapping, for example, is against one human genome, not hundreds, so there may not be such an advantage. Also, disk IO is not a big problem in mapping; it's really the random memory access that can be a killer. So if you could reduce the size of one genome such that searching it fits in the L1/L2/L3 cache, there would be a huge speedup in generating approximate locations.

                          Another approach for mapping 100s of genomes is to store only the differences between the genomes, assuming they are 99.9% similar (which they are). Having bounds on edit distance is tricky: suppose there are N edits in an L-base read, and there is a hit with N edits, but the true location has N+1 edits. If we search only up to N edits, then we have no clue about the quality of the hits (were they truly unique, or was the next best really close in terms of the number of edits?). Specificity is key here.


                          • #14
                            There may be a need for this kind of thing in metagenomics (e.g., sequencing environmental samples), where you sometimes want to map to multiple genomes (even 100s of genomes) at once.


                            • #15
                              This might be useful in metagenomics (as kopi-o mentioned) if it provided accurate output in the standard SAM alignment format.

                              Certainly I have problems fitting enough microbial genomes into the 4 GB reference sequence limitation of Bowtie and BWA, given that more than 6-7 GB are now publicly available.

