Seqanswers Leaderboard Ad

**westerman** · 05-14-2013, 09:15 AM

Hum. No responses after 3 days. Too bad since having new blood can help out the field. Unfortunately I am not in position to peer review new algorithms.

Overlap detection is not alignment. Alignment would be taking a set of longer sequences (potentially chromsome/genome size) and placing the reads onto the longer sequences.

Overlap detection could lead into assembly. This is a field that could always use performance gains. Older algorithms used straight-up overlaps to put reads together into longer "contigs". These days most algorithms use De Bruijn graphs which, of course, depend on overlaps as well.

Reading up would be good. My favorite assembler is ABySS -- although there are many more! -- which like your algorithm is MPI-capable.

ABySS | Genome Sciences Centre

http://www.bcgsc.ca/platform/bioinfo/software/abyss

Assembly By Short Sequences - a de novo, parallel, paired-end sequence assembler

**sklages** · 05-14-2013, 12:25 PM

Graham, you should use a more specific/descriptive subject to catch a broader audience ..

**gtoal** · 05-14-2013, 04:11 PM

an alternative to k-mers and hashing

Originally posted by westerman View Post

Overlap detection is not alignment. Alignment would be taking a set of longer sequences (potentially chromsome/genome size) and placing the reads onto the longer sequences.

Overlap detection could lead into assembly. This is a field that could always use performance gains. Older algorithms used straight-up overlaps to put reads together into longer "contigs". These days most algorithms use De Bruijn graphs which, of course, depend on overlaps as well.http://www.bcgsc.ca/platform/bioinfo/software/abyss

I spent the weekend programming after that post, and finished implementing enough to get some output that I could take and show my bio guy that I wasn't blowing smoke :-) After a long chat and explaining the algorithm in detail, it turned out that what I've implemented is enough to generate AFG files which can then be fed into other programs to do the full assembly. I thought what I was doing was the equivalent of velveth but he says what I've done is closer to the combination of velveth+velvetg, although it does some things in different ways. (I'll be doing the AFG generation next weekend to confirm this) The most significant difference is that I build the overlaps using graphs so the low-level information from here (eg about match quality etc) could potentially feed directly into the other graph algorithms that assemble contigs into a genome; in fact it might be that we no longer need the distinction between reads -> contigs -> genome and can go directly from reads -> genome in one step.

He did however explain some more real-life factors that I hadn't been exposed to yet such as the reverse-complement half of the DNA being tossed in there among the forward data, which has to be identified and paired up with its complementary reads. Fortunately that turns out to be trivial but I do want to get it into the program before I give it to anyone.

Also, I can build contigs from my data at almost no extra overhead, so I want to implement that too. I realised that currently my algorithm would only let me extend any arbitrary read into a contig in one direction, and although that is trivially solved by creating two tables - one with the data forwards and one with it entered reversed - I think that I can cheat and extend to the left by using the reverse-complements alone. It'll mean building contigs using only half the possible reads, but it does mean that we can avoid using twice as much RAM which means not requiring twice as many compute nodes. Or if the nodes are available, I can use them and get higher confidence/better coverage.

Anyway, the interesting part to my colleague was not that this has the potential to be faster than the assembler he uses, but that having understood the data structure, he thinks it may be able to build better contigs, because it effectively finds all overlaps of any read from 2 character overlaps all the way up to (in this case) 70 character overlaps. (See below for an example, sorry the web upload has lost the indentation I used to align them)

So... progress so far...

My test file has 30 million reads (although almost 50% of them are exact duplicates of others in the file) of 70 characters each. My code requires 21Gb of RAM for the data structure to hold this much, which it gets by using two compute nodes with 16Gb each. A larger problem would require proportionately more RAM, so 60M reads for example would need 42Gb and could be handled by 3 compute nodes.

Although I haven't coded it yet, the overlap construction part of the algorithm (which is by far the most time-consuming part) is embarrassingly parallel, so to run in half the elapsed time, we would need to use double the compute nodes (ie 4 for 30M reads, or 6 for 60M reads)

The timings so far... on my current sample data, it does the equivalent of generating hashes from k=2 to k=70 at a rate of 3 seconds per million reads; and it generates overlaps (which is effectively the same as identifying contigs) at a rate of 2 mins 15 secs per million reads.

All the speed so far is from the algorithm alone; I'm running with debugging on (gcc -g), a whole bunch of assert tests even in the tight loops, no clever coding tricks to speed up any of the critical areas, and most significantly, no use of multithreading which I think will be possible to give a 16X speedup on the overlap detection/contig construction. The low-level coding improvements may give 25-50% speedup but I'd rather keep the code unoptimised if it makes it easier for others to follow.

Not having done any assembly myself yet with existing tools, I'm not sure what speeds they offer, but anecdotally my colleague tells me this is much faster that what he's getting from velvet. If someone can give me a ballpark for other tools with a dataset as described above I'ld like to hear some numbers. (Also how long does it take to assemble an AFG from this size of data into a complete genome, roughly?)

regards,

G
---------
TTTTTCTTTGGTGGATTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTG (read #1516994)
TTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTGAC (read #17528886)
TTTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTAC (read #59162)
TTGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACT (read #691458)
TGGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTT (read #12184869)
GGAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTT (read #17278)
GAATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTA (read #305732)
AATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTAT (read #2416179)
ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTAGA (read #16807380)
ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATC (read #3814502)
ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTCTC (read #17252750)
ATTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATANTTATTTCTTACTTTATC (read #10287936)
TTGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCA (read #6954515)
TGCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAA (read #816328)
GCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAAC (read #1168578)
GCTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTNTTTCTTACTTTATCAAC (read #10264133)
CTGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACA (read #13713309)
TGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAG (read #1505460)
TGTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATNTCTTACTTTATCAACAG (read #10223881)
GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTCATTTCTTACTTTATCAACAGA (read #3099940)
GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGA (read #3073778)
GTTGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAGATCG (read #27956618)
TGACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAA (read #1210900)
GACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAG (read #1329742)
ACATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGT (read #6743611)
CATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTA (read #5061770)
ATATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTAT (read #5007087)
TATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGGATA (read #8827920)
TATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATA (read #2809538)
ATGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATAT (read #389919)
TGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATT (read #14480427)
TGACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTCTATT (read #12982529)
GACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTA (read #3138242)
GACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTA (read #14877580)
ACCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAG (read #12020945)
CCTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGC (read #13695180)
CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCT (read #13722286)
CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCT (read #4324279)
CTCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAGGTATATTAGCT (read #4666969)
TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTA (read #2145231)
TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTCTATTAGCTA (read #21480287)
TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTGTATTAGCTA (read #1811176)
TCACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTCTCAACAGAAAGTATATTAGCTA (read #17298137)
CACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAA (read #8463131)
ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAAT (read #3589150)
ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGCATATTAGCTAAT (read #20421354)
ACCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAAT (read #7673469)
CCTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATT (read #1635798)
CCTCATTATTATTTTTAAGCATATTGANAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATT (read #18054538)
CTCATTATTATTTTTAAGCATATTGATAGTAATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #20300714)
CTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGGATATTAGCTAATTT (read #2622966)
CTCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #2461430)
CTCATTATTATTTTTAAGCATATTGATCGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTT (read #18192530)
TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTC (read #202594)
TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTGGCTAATTTC (read #23418052)
TCATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAANGTATATTAGCTAATTTC (read #10334661)
CATTATTATTTTTAAGCATATTCATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #333705)
CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAATTTCA (read #27264003)
CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGCATATTAGCTAATTTCA (read #20250868)
CATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #382113)
CATTATTATTTTTAAGCATATTGTTAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCA (read #17609808)
ATTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAG (read #576855)
ATTATTATTTTTAAGCATATTGATAGTTATTTCTTNCTTTATCAACAGAAAGTATNTTAGCTAATTTCAG (read #19450474)
ATTATTATTTTTAAGCATATTGATAGTTTTTTCTTACTTTTCCACCGGAAAGAATATTAGCAATTTCCAC (read #23338554)
TTATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGC (read #1317965)
TATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCA (read #11247386)
ATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGAATATTAGCTAATTTCAGCAT (read #28409990)
ATTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #440090)
ATTATTTTTAAGCATATTGATAGTTATNTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #17962400)
ATTATTTTTAAGCATATTTATAGTTTTTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAT (read #20731892)
TTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATT (read #3496968)
TTATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCGGCATT (read #20515500)
TATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTC (read #2018111)
ATTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCT (read #3657004)
TTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTCCAGCATTCTA (read #28637251)
TTTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTA (read #221482)
TTTTAAGCATATTGATAGTTAGTTCTTACTTTATAAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #17492863)
TTTTAAGCATATTGATAGTTATTTCTTACTTTATAAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #25965162)
TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATAAGCTAATTTCAGCATTCTAA (read #17420611)
TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTAA (read #395195)
TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTGGCTAATTTCAGCATTCTAA (read #8213977)
TTTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGNATATTAGCTAATTTCAGCATTCTAA (read #17072456)
TTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCAGATCGGA (read #20154326)
TTTAAGCATATTGATAGTTATTTCTTACTTTATCAACAGAAAGTATATTAGCTAATTTCAGCATTCTAAC (read #1463143)
TTTAAGCATATTTTTATTGAGGAAAATGAAAAAGGAAAGCTACCTCGAAATATTTTTGAAGAATGTGGTT (read #14400254)

(Eagle-eyes readers may notice that the last character of the target string hasn't been used in the matching. It's a small 'fence post' bug that I'm currently fixing. Ignore it, it's not relevant to the algorithm performance)

**SNPsaurus** · 05-14-2013, 06:47 PM

Do you incorporate errors in the reads and genetic variation in the samples? Dealing with overlaps in the presence of non-perfect data may pose challenges, depending on what you do. Sounds interesting, though!

**gtoal** · 05-14-2013, 07:09 PM

Originally posted by SNPsaurus View Post

Do you incorporate errors in the reads and genetic variation in the samples? Dealing with overlaps in the presence of non-perfect data may pose challenges, depending on what you do. Sounds interesting, though!

Not yet, but I know what needs to be done. The matching procedure is just like in a spelling checker, or like playing a Scrabble hand when holding blanks - it doesn't affect the speed of the algorithm noticeably at all. You can specify how many wrong-letter errors are allowed in an overlap comparison. I don't know if it's useful in this problem area to be able to do insertions or deletions of letters, but the data structure makes it easy if it's needed. I don't believe this is the case with hashed k-mers, so if matches with a small number of errors are helpful, this data structure should find more matches than a hash-based solution would find.

So a letter in a read can be tagged with its confidence (from the fastq quality data, or 0 if the letter was an 'N'). This gives us what is called a "weighted graph" and although graph algorithms are not my field, it's my understanding that finding the least-cost path through a weighted graph is a solved problem using something like Dijkstra's algorithm (that I expect to pass to someone else to implement ;-) )

(Also -and this is going back some ways to my student days - if I remember rightly, Warshall's algorithm will perform a transitive closure on this graph and locate all cycles, which I think is going to be helpful information)

At a very minimum, simply extending a read linearly (no lookahead or backtracking) with a fair degree of confidence should be cheap when the fastq quality data is available. That may hit dead-ends at times but for people searching for interesting loci, it might be more useful to get them something short (say a 4K contig) very quickly than it is to get a complete genome sequence with a bit more work.

**mchaisso** · 05-15-2013, 05:57 AM

Hi Graham,
Nice to see new hackers getting into the field... reminds me of a boggle coding competition back in 1998. Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.

best,
-mark

**gtoal** · 05-15-2013, 06:44 AM

Originally posted by mchaisso View Post

Hi Graham,
Nice to see new hackers getting into the field... reminds me of a boggle coding competition back in 1998. Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.

best,
-mark

I remember that - Mactech wasn't it? Or are you remembering the POTM (programmer of the month) competition? :-)

Thanks for the lead - excellent! - this is exactly why I came here. You may have just saved me from reinventing another wheel. This is a perfect fit with what I've developed - it takes the low-level overlap data and does with it the things that I was hoping I would be able to do. If Myers has concentrated on the assembly side and doesn't have a fast overlap detector, then potentionally our two algorithms could work together in the one program.

Of course it may be that he's already written the same overlap detection algorithm that I have, which is fine - it's been fun if nothing else!

Really appreciate the lead, you hit the nail on the head.

Graham

**gtoal** · 05-15-2013, 08:51 AM

Originally posted by mchaisso View Post

Since you are finding all overlaps and not just fixed length k-mers, you may want to read up on the string graph formulation of fragment assembly by Gene Myers: http://bioinformatics.oxfordjournals.../ii79.abstract . This differs from the de Bruijn based methods such as Velvet that use fixed length k-mers.

from that paper:

"The overlap computation is the most time-consuming step and amounts to a large sequence comparison between the concatenation of all the reads against itself. Numerous heuristic and filtration algorithms have been developed that offer good performance--roughly O(N^2/M) expected time, where M is the available memory of the machine. In particular we used a recently introduced filter based on _q_ grams (Rasmussen et al., 2005) so that all desired overlaps are guaranteed to be found"

Everything else that needs to be done subsequent to overlap detection, his paper seems to spell out an ideal way to do it, and it's exactly in line with what I thought would be possible, of going directly from reads to genomes without the intermediate stage of k-mers or contigs, using the one data structure throughout.

The problem quoted above is the one part of this domain that I'm sure I've solved, definitively. My code calculates these overlaps in O(N) time and is independent of M. Is is however proportional to L (the length of a read) but sub-linearly. I haven't analysed the algorithm rigorously but I think there's a log_4(L) in there somewhere :-) Observationally from testing, L=71 is only barely slower than L=37.

I really don't want to reinvent the wheel, I'ld just like to supply a well-greased axle to anyone who can use it. There's going to be a mountainous learning curve ahead of me if I have to develop the part that I know how to do into a full usable program with all the bells and whistles, and I'ld much rather have someone else incorporate it into a proven program that understands all the fiddly stuff relevant to the domain that real life programs always need. If there are any gene assembly program authors here I'ld like to start a conversation...

regards

Graham

**dgscofield** · 05-23-2013, 02:30 AM

Hi Graham, welcome to the party :-)

A couple of other overlap assemblers that use string graphs that you might want to check out are fermi from Heng Li (http://bioinformatics.oxfordjournals...ent/28/14/1838), which attempts to preserve variants among overlapping reads, and Hapsembler (http://compbio.cs.toronto.edu/hapsembler/), which attempts to collapse variants. They both have extensive error-detecting steps which might be worth a look.

**gtoal** · 04-18-2015, 08:05 PM

Progress report

Originally posted by gtoal View Post

I really don't want to reinvent the wheel, I'ld just like to supply a well-greased axle to anyone who can use it. There's going to be a mountainous learning curve ahead of me if I have to develop the part that I know how to do into a full usable program with all the bells and whistles, and I'ld much rather have someone else incorporate it into a proven program that understands all the fiddly stuff relevant to the domain that real life programs always need. If there are any gene assembly program authors here I'ld like to start a conversation...

regards

Graham

Hello again... a year or so later, I thought I would check in to report that I did write the software I was talking about... it worked pretty well and allowed the researcher I'm working with to find a couple of sequences (confirmed with lab tests) that he was looking for, which the other assemblers he had been using weren't able to put together.

It's been some months since I worked on it (I stopped once he had located his sequences) but I'm revisiting it now to put a professional polish on it and release it on sourceforge (I'd estimate 3 - 4 months to be on the safe side).

Although I was particularly pleased with the fact that it was really fast and that it scaled linearly with added MPI cores, what my biologist colleague liked about it was not the performance but that I wrote an interface which allowed him to enter a 'seed' sequence, and I would perform assembly from that seed in both directions as far as could be done. This was lightweight code that could run on a desktop or even a handheld, once the raw reads had been indexed by a larger system for fast retrieval.

I did find one other assembler that uses a trie data structure similar to mine - edena. It's pretty good software but has the limitation that it doesn't scale past a single processor. All the other packages I've looked at use k-mer hashes and frankly after a year of working in this area, I can now say with confidence that it is the wrong data structure to use for this problem. Tries are a much better fit.

I have two observations to pass on. 1) the level of coverage and oversampling that these systems perform makes it unnecessary to do any fuzzy matching/error correction while assembling. I got really good results by only using exact overlap matches. And 2) 5 to 10 years from now, systems with terabytes of ram will be available to ordinary researchers, and once they are large enough to encompass the data for the largest genome in existence, we can drop the huge level of extra complexity that is needed to handle those genomes using parallel systems like MPI clusters. (I tried my code on the big memory machine at the TACC and it handled all the assemblies I could throw at it on a single machine, as opposed to needing half a dozen of the more ordinary cluster systems for the same size of problem. If I could rely on having a machine like that being available, the code size would reduce by 80% and what's left would be trivial and very maintainable.)

I'll check in again when the software is ready for use. Currently it has a console based interface which I'm assured won't fly with the target audience, and it needs to be packaged with a simple binary installer for people who are not going to compile code themselves. (And I need to compile it under Windows; it should be portable code, but I've only tried it under Unix so far)

I see that there's a new open source project has been started up here. Is it getting any traction and should I do anything in my project specifically to interoperate with it?

My code has a simple interface. You run one program that takes a .fastq file as a parameter, and it creates an internal file that indexes the reads n the fastq file. Then the application program takes the name of the original fastq file plus a DNA string on the command line, and outputs a .afg file that can be viewed in Tablet. Is there anything more I need to do?

best regards,

Graham

**westerman** · 04-20-2015, 08:15 AM

Originally posted by gtoal View Post

My code has a simple interface. You run one program that takes a .fastq file as a parameter, and it creates an internal file that indexes the reads n the fastq file. Then the application program takes the name of the original fastq file plus a DNA string on the command line, and outputs a .afg file that can be viewed in Tablet. Is there anything more I need to do?

best regards,

Graham

Perhaps include a different output format? I suspect that most people will want to convert to FastA/Q/G for further analysis.

The program sounds interesting. Looking forward to it.

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Hello from a wordgame programmer...

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News