Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hello from a wordgame programmer...

    Hello, everyone.

    I was recently introduced to the area of genomic assembly by one of the bioinformatic faculty at the university where I work in IT. I have a lot of experience in handling strings of data from 30-odd years of writing word games such as Scrabble, and I'm also the tech guy who supports our local MPI cluster. It occurred to me that this gene assembly area and my wordgame/spelling checker&corrector hobby had a lot in common, so I decided to see if there was anything I could bring to the field from my previous experience.

    I think I've been successful and am currently working on a proof of concept program to test my ideas. The part of the assembly problem which interests me is the stage where you take all the reads from your sequencer, and identify the overlaps between each read. My academic colleagues are using velveth to do this part of the problem, and it appears to be quite time consuming for them. I offered to write something for them that would do this task a lot more quickly :-)

    I should have my proof of concept code ready for release in a few days and I'm looking for people who've worked in this area to have a look at it and tell me if indeed I've found something new that's worth exploring, or if I've made some newbie mistake and gone down a blind alley that everyone who looks at this problem has tried and discarded :-)

    If my algorithm turns out to be new & useful in this area, I'ld like to hack out a quick academic paper on it and then release the code under an open source license such as BSD.

    To give you a ball-park estimate of what I believe the performance of this algorithm will be, I'm looking at something like finding the overlaps between about 30M reads of length 37, from a 100Mbp genome, in about 15 minutes.

    The algorithm is linear in the number of reads and scales up linearly by adding more processors+ram for larger datasets. (My code uses MPI and OMP.)

    I have to confess that having only tackled this problem for a couple of weeks and trying deliberately to _not_ read too much of the literature so as not to be too influenced by current practise, I don't know the jargon well enough yet to discuss this fluently with experts and I'm not even sure what the proper name is for the task that I'm working on! Is overlap detection called alignment perhaps, or is that something to do with the later phase where the overlaps are used to create a graph from which the genome is extracted?

    Anyway, I have enjoyed the heck out of working on this stuff over the last 10 days or so and look forward to getting to know you folks working in this field; and I'm hoping I can get some peer review from y'all to evaluate the algorithms and code that I'm working on.

    best regards,

    Graham Toal <[email protected]>
    (maintainer of the wordgame-programmers group software archives)
    A Scotsman living in the south of Texas...

  • #2
    Hum. No responses after 3 days. Too bad since having new blood can help out the field. Unfortunately I am not in position to peer review new algorithms.

    Overlap detection is not alignment. Alignment would be taking a set of longer sequences (potentially chromsome/genome size) and placing the reads onto the longer sequences.

    Overlap detection could lead into assembly. This is a field that could always use performance gains. Older algorithms used straight-up overlaps to put reads together into longer "contigs". These days most algorithms use De Bruijn graphs which, of course, depend on overlaps as well.

    Reading up would be good. My favorite assembler is ABySS -- although there are many more! -- which like your algorithm is MPI-capable.

    Assembly By Short Sequences - a de novo, parallel, paired-end sequence assembler

    Comment


    • #3
      Graham, you should use a more specific/descriptive subject to catch a broader audience ..

      Comment


      • #4
        an alternative to k-mers and hashing

        Originally posted by westerman View Post
        Overlap detection is not alignment. Alignment would be taking a set of longer sequences (potentially chromsome/genome size) and placing the reads onto the longer sequences.

        Overlap detection could lead into assembly. This is a field that could always use performance gains. Older algorithms used straight-up overlaps to put reads together into longer "contigs". These days most algorithms use De Bruijn graphs which, of course, depend on overlaps as well.http://www.bcgsc.ca/platform/bioinfo/software/abyss
        I spent the weekend programming after that post, and finished implementing enough to get some output that I could take and show my bio guy that I wasn't blowing smoke :-) After a long chat and explaining the algorithm in detail, it turned out that what I've implemented is enough to generate AFG files which can then be fed into other programs to do the full assembly. I thought what I was doing was the equivalent of velveth but he says what I've done is closer to the combination of velveth+velvetg, although it does some things in different ways. (I'll be doing the AFG generation next weekend to confirm this) The most significant difference is that I build the overlaps using graphs so the low-level information from here (eg about match quality etc) could potentially feed directly into the other graph algorithms that assemble contigs into a genome; in fact it might be that we no longer need the distinction between reads -> contigs -> genome and can go directly from reads -> genome in one step.

        He did however explain some more real-life factors that I hadn't been exposed to yet such as the reverse-complement half of the DNA being tossed in there among the forward data, which has to be identified and paired up with its complementary reads. Fortunately that turns out to be trivial but I do want to get it into the program before I give it to anyone.

        Also, I can build contigs from my data at almost no extra overhead, so I want to implement that too. I realised that currently my algorithm would only let me extend any arbitrary read into a contig in one direction, and although that is trivially solved by creating two tables - one with the data forwards and one with it entered reversed - I think that I can cheat and extend to the left by using the reverse-complements alone. It'll mean building contigs using only half the possible reads, but it does mean that we can avoid using twice as much RAM which means not requiring twice as many compute nodes. Or if the nodes are available, I can use them and get higher confidence/better coverage.

        Anyway, the interesting part to my colleague was not that this has the potential to be faster than the assembler he uses, but that having understood the data structure, he thinks it may be able to build better contigs, because it effectively finds all overlaps of any read from 2 character overlaps all the way up to (in this case) 70 character overlaps. (See below for an example, sorry the web upload has lost the indentation I used to align them)

        So... progress so far...

        My test file has 30 million reads (although almost 50% of them are exact duplicates of others in the file) of 70 characters each. My code requires 21Gb of RAM for the data structure to hold this much, which it gets by using two compute nodes with 16Gb each. A larger problem would require proportionately more RAM, so 60M reads for example would need 42Gb and could be handled by 3 compute nodes.

        Although I haven't coded it yet, the overlap construction part of the algorithm (which is by far the most time-consuming part) is embarrassingly parallel, so to run in half the elapsed time, we would need to use double the compute nodes (ie 4 for 30M reads, or 6 for 60M reads)

        The timings so far... on my current sample data, it does the equivalent of generating hashes from k=2 to k=70 at a rate of 3 seconds per million reads; and it generates overlaps (which is effectively the same as identifying contigs) at a rate of 2 mins 15 secs per million reads.

        All the speed so far is from the algorithm alone; I'm running with debugging on (gcc -g), a whole bunch of assert tests even in the tight loops, no clever coding tricks to speed up any of the critical areas, and most significantly, no use of multithreading which I think will be possible to give a 16X speedup on the overlap detection/contig construction. The low-level coding improvements may give 25-50% speedup but I'd rather keep the code unoptimised if it makes it easier for others to follow.

        Not having done any assembly myself yet with existing tools, I'm not sure what speeds they offer, but anecdotally my colleague tells me this is much faster that what he's getting from velvet. If someone can give me a ballpark for other tools with a dataset as described above I'ld like to hear some numbers. (Also how long does it take to assemble an AFG from this size of data into a complete genome, roughly?)

        regards,

        G
        ---------
        TTTTTCTTTGGTGGATTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTG (read #1516994)
        TTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTGAC (read #17528886)
        TTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTAC (read #59162)
        TTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACT (read #691458)
        TGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTT (read #12184869)
        GGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTT (read #17278)
        GAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTA (read #305732)
        AATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTAT (read #2416179)
        ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTAGA (read #16807380)
        ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATC (read #3814502)
        ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTCTC (read #17252750)
        ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATANTTATTTCTTACTTTATC (read #10287936)
        TTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCA (read #6954515)
        TGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAA (read #816328)
        GCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAAC (read #1168578)
        GCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTNTTTCTTACTTTATCAAC (read #10264133)
        CTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACA (read #13713309)
        TGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAG (read #1505460)
        TGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATNTCTTACTTTATCAACAG (read #10223881)
        GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTCATTTCTTACTTTATCAACAGA (read #3099940)
        GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGA (read #3073778)
        GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAGATCG (read #27956618)
        TGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAA (read #1210900)
        GACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAG (read #1329742)
        ACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGT (read #6743611)
        CATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTA (read #5061770)
        ATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTAT (read #5007087)
        TATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGGATA (read #8827920)
        TATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATA (read #2809538)
        ATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATAT (read #389919)
        TGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATT (read #14480427)
        TGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTCTATT (read #12982529)
        GACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTA (read #3138242)
        GACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTA (read #14877580)
        ACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAG (read #12020945)
        CCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGC (read #13695180)
        CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCT (read #13722286)
        CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCT (read #4324279)
        CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAGGTATATTAGCT (read #4666969)
        TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTA (read #2145231)
        TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTCTATTAGCTA (read #21480287)
        TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTGTATTAGCTA (read #1811176)
        TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTCTCAACAGAAAGTATATTAGCTA (read #17298137)
        CACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAA (read #8463131)
        ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAAT (read #3589150)
        ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGCATATTAGCTAAT (read #20421354)
        ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAAT (read #7673469)
        CCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATT (read #1635798)
        CCTCATTATTATTTTTAAGCATATTGANAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATT (read #18054538)
        CTCATTATTATTTTTAAGCATATTGATAGTAATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #20300714)
        CTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGGATATTAGCTAATTT (read #2622966)
        CTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #2461430)
        CTCATTATTATTTTTAAGCATATTGATCGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #18192530)
        TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTC (read #202594)
        TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTGGCTAATTTC (read #23418052)
        TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAANGTATATTAGCTAATTTC (read #10334661)
        CATTATTATTTTTAAGCATATTCATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #333705)
        CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAATTTCA (read #27264003)
        CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGCATATTAGCTAATTTCA (read #20250868)
        CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #382113)
        CATTATTATTTTTAAGCATATTGTTAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #17609808)
        ATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAG (read #576855)
        ATTATTATTTTTAAGCATATTGATAGTTATTTCTTNCTTTATCAACAGAAAGTATNTTAGCTAATTTCAG (read #19450474)
        ATTATTATTTTTAAGCATATTGATAGTTTTTTCTTACTTTTCCACCGGAAAGAATATTAGCAATTTCCAC (read #23338554)
        TTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGC (read #1317965)
        TATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCA (read #11247386)
        ATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAATTTCAGCAT (read #28409990)
        ATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #440090)
        ATTATTTTTAAGCATATTGATAGTTATNTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #17962400)
        ATTATTTTTAAGCATATTTATAGTTTTTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #20731892)
        TTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATT (read #3496968)
        TTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCGGCATT (read #20515500)
        TATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTC (read #2018111)
        ATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCT (read #3657004)
        TTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTCCAGCATTCTA (read #28637251)
        TTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTA (read #221482)
        TTTTAAGCATATTGATAGTTAGTTCTTACTTTATAAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #17492863)
        TTTTAAGCATATTGATAGTTATTTCTTACTTTATAAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #25965162)
        TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATAAGCTAATTTCAGCATTCTAA (read #17420611)
        TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #395195)
        TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTGGCTAATTTCAGCATTCTAA (read #8213977)
        TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGNATATTAGCTAATTTCAGCATTCTAA (read #17072456)
        TTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAGATCGGA (read #20154326)
        TTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTAAC (read #1463143)
        TTTAAGCATATTTTTATTGAGGAAAATGAAAAAGGAAAGCTACCTCGAAATATTTTTGAAGAATGTGGTT (read #14400254)


        (Eagle-eyes readers may notice that the last character of the target string hasn't been used in the matching. It's a small 'fence post' bug that I'm currently fixing. Ignore it, it's not relevant to the algorithm performance)

        Comment


        • #5
          Do you incorporate errors in the reads and genetic variation in the samples? Dealing with overlaps in the presence of non-perfect data may pose challenges, depending on what you do. Sounds interesting, though!
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

          Comment


          • #6
            Originally posted by SNPsaurus View Post
            Do you incorporate errors in the reads and genetic variation in the samples? Dealing with overlaps in the presence of non-perfect data may pose challenges, depending on what you do. Sounds interesting, though!
            Not yet, but I know what needs to be done. The matching procedure is just like in a spelling checker, or like playing a Scrabble hand when holding blanks - it doesn't affect the speed of the algorithm noticeably at all. You can specify how many wrong-letter errors are allowed in an overlap comparison. I don't know if it's useful in this problem area to be able to do insertions or deletions of letters, but the data structure makes it easy if it's needed. I don't believe this is the case with hashed k-mers, so if matches with a small number of errors are helpful, this data structure should find more matches than a hash-based solution would find.

            So a letter in a read can be tagged with its confidence (from the fastq quality data, or 0 if the letter was an 'N'). This gives us what is called a "weighted graph" and although graph algorithms are not my field, it's my understanding that finding the least-cost path through a weighted graph is a solved problem using something like Dijkstra's algorithm (that I expect to pass to someone else to implement ;-) )

            (Also -and this is going back some ways to my student days - if I remember rightly, Warshall's algorithm will perform a transitive closure on this graph and locate all cycles, which I think is going to be helpful information)

            At a very minimum, simply extending a read linearly (no lookahead or backtracking) with a fair degree of confidence should be cheap when the fastq quality data is available. That may hit dead-ends at times but for people searching for interesting loci, it might be more useful to get them something short (say a 4K contig) very quickly than it is to get a complete genome sequence with a bit more work.

            Comment


            • #7
              Hi Graham,
              Nice to see new hackers getting into the field... reminds me of a boggle coding competition back in 1998. Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.

              best,
              -mark

              Comment


              • #8
                Originally posted by mchaisso View Post
                Hi Graham,
                Nice to see new hackers getting into the field... reminds me of a boggle coding competition back in 1998. Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.

                best,
                -mark
                I remember that - Mactech wasn't it? Or are you remembering the POTM (programmer of the month) competition? :-)

                Thanks for the lead - excellent! - this is exactly why I came here. You may have just saved me from reinventing another wheel. This is a perfect fit with what I've developed - it takes the low-level overlap data and does with it the things that I was hoping I would be able to do. If Myers has concentrated on the assembly side and doesn't have a fast overlap detector, then potentionally our two algorithms could work together in the one program.

                Of course it may be that he's already written the same overlap detection algorithm that I have, which is fine - it's been fun if nothing else!

                Really appreciate the lead, you hit the nail on the head.

                Graham

                Comment


                • #9
                  Originally posted by mchaisso View Post
                  Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.
                  from that paper:

                  "The overlap computation is the most time-consuming step and amounts to a large sequence comparison between the concatenation of all the reads against itself. Numerous heuristic and filtration algorithms have been developed that offer good performance--roughly O(N^2/M) expected time, where M is the available memory of the machine. In particular we used a recently introduced filter based on _q_ grams (Rasmussen et al., 2005) so that all desired overlaps are guaranteed to be found"

                  Everything else that needs to be done subsequent to overlap detection, his paper seems to spell out an ideal way to do it, and it's exactly in line with what I thought would be possible, of going directly from reads to genomes without the intermediate stage of k-mers or contigs, using the one data structure throughout.

                  The problem quoted above is the one part of this domain that I'm sure I've solved, definitively. My code calculates these overlaps in O(N) time and is independent of M. Is is however proportional to L (the length of a read) but sub-linearly. I haven't analysed the algorithm rigorously but I think there's a log_4(L) in there somewhere :-) Observationally from testing, L=71 is only barely slower than L=37.

                  I really don't want to reinvent the wheel, I'ld just like to supply a well-greased axle to anyone who can use it. There's going to be a mountainous learning curve ahead of me if I have to develop the part that I know how to do into a full usable program with all the bells and whistles, and I'ld much rather have someone else incorporate it into a proven program that understands all the fiddly stuff relevant to the domain that real life programs always need. If there are any gene assembly program authors here I'ld like to start a conversation...

                  regards

                  Graham

                  Comment


                  • #10
                    Hi Graham, welcome to the party :-)

                    A couple of other overlap assemblers that use string graphs that you might want to check out are fermi from Heng Li (http://bioinformatics.oxfordjournals...ent/28/14/1838), which attempts to preserve variants among overlapping reads, and Hapsembler (http://compbio.cs.toronto.edu/hapsembler/), which attempts to collapse variants. They both have extensive error-detecting steps which might be worth a look.

                    Comment


                    • #11
                      Progress report

                      Originally posted by gtoal View Post
                      I really don't want to reinvent the wheel, I'ld just like to supply a well-greased axle to anyone who can use it. There's going to be a mountainous learning curve ahead of me if I have to develop the part that I know how to do into a full usable program with all the bells and whistles, and I'ld much rather have someone else incorporate it into a proven program that understands all the fiddly stuff relevant to the domain that real life programs always need. If there are any gene assembly program authors here I'ld like to start a conversation...

                      regards

                      Graham
                      Hello again... a year or so later, I thought I would check in to report that I did write the software I was talking about... it worked pretty well and allowed the researcher I'm working with to find a couple of sequences (confirmed with lab tests) that he was looking for, which the other assemblers he had been using weren't able to put together.

                      It's been some months since I worked on it (I stopped once he had located his sequences) but I'm revisiting it now to put a professional polish on it and release it on sourceforge (I'd estimate 3 - 4 months to be on the safe side).

                      Although I was particularly pleased with the fact that it was really fast and that it scaled linearly with added MPI cores, what my biologist colleague liked about it was not the performance but that I wrote an interface which allowed him to enter a 'seed' sequence, and I would perform assembly from that seed in both directions as far as could be done. This was lightweight code that could run on a desktop or even a handheld, once the raw reads had been indexed by a larger system for fast retrieval.

                      I did find one other assembler that uses a trie data structure similar to mine - edena. It's pretty good software but has the limitation that it doesn't scale past a single processor. All the other packages I've looked at use k-mer hashes and frankly after a year of working in this area, I can now say with confidence that it is the wrong data structure to use for this problem. Tries are a much better fit.

                      I have two observations to pass on. 1) the level of coverage and oversampling that these systems perform makes it unnecessary to do any fuzzy matching/error correction while assembling. I got really good results by only using exact overlap matches. And 2) 5 to 10 years from now, systems with terabytes of ram will be available to ordinary researchers, and once they are large enough to encompass the data for the largest genome in existence, we can drop the huge level of extra complexity that is needed to handle those genomes using parallel systems like MPI clusters. (I tried my code on the big memory machine at the TACC and it handled all the assemblies I could throw at it on a single machine, as opposed to needing half a dozen of the more ordinary cluster systems for the same size of problem. If I could rely on having a machine like that being available, the code size would reduce by 80% and what's left would be trivial and very maintainable.)

                      I'll check in again when the software is ready for use. Currently it has a console based interface which I'm assured won't fly with the target audience, and it needs to be packaged with a simple binary installer for people who are not going to compile code themselves. (And I need to compile it under Windows; it should be portable code, but I've only tried it under Unix so far)

                      I see that there's a new open source project has been started up here. Is it getting any traction and should I do anything in my project specifically to interoperate with it?

                      My code has a simple interface. You run one program that takes a .fastq file as a parameter, and it creates an internal file that indexes the reads n the fastq file. Then the application program takes the name of the original fastq file plus a DNA string on the command line, and outputs a .afg file that can be viewed in Tablet. Is there anything more I need to do?

                      best regards,

                      Graham

                      Comment


                      • #12
                        Originally posted by gtoal View Post
                        My code has a simple interface. You run one program that takes a .fastq file as a parameter, and it creates an internal file that indexes the reads n the fastq file. Then the application program takes the name of the original fastq file plus a DNA string on the command line, and outputs a .afg file that can be viewed in Tablet. Is there anything more I need to do?

                        best regards,

                        Graham
                        Perhaps include a different output format? I suspect that most people will want to convert to FastA/Q/G for further analysis.

                        The program sounds interesting. Looking forward to it.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM
                        • seqadmin
                          Techniques and Challenges in Conservation Genomics
                          by seqadmin



                          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                          Avian Conservation
                          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                          03-08-2024, 10:41 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Yesterday, 06:37 PM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, Yesterday, 06:07 PM
                        0 responses
                        9 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-22-2024, 10:03 AM
                        0 responses
                        49 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 03-21-2024, 07:32 AM
                        0 responses
                        67 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X