Unconfigured Ad

**BAMseek** · 11-28-2012, 11:16 AM

I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. You could do Smith-Waterman on all possible read pairs, but this would be quite expensive. If you instead only compare pairs that have some k-mer in common, then you would cut down on the number of comparisons. Maybe limit Smith Waterman to just pairs that have a common k-mer. Other algorithms like BLAST use a similar heuristic to narrow down the search space (two similar sequences will most likely have short exact matches or k-mers in common).

As for building up the consensus sequence, I would think building up a De Bruijn graph would be one way to go, although seems a bit much for an undergrad assignment. Maybe it would suffice to try to find the longest chain of reads, where each high-quality smith-waterman match between two reads is a link in that chain. Said another way, maybe you could store it as a graph, were each node is a read, and an edge signifies a high-scoring smith waterman alignment. Then try to find the longest path through that graph.

Hope that is of some help. Tough question but an interesting one. De Novo assembly is still a hot topic of research, so don't feel too discouraged. Good luck.

Justin

**bioinfo_student** · 11-28-2012, 12:08 PM

Originally posted by BAMseek View Post

I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. You could do Smith-Waterman on all possible read pairs, but this would be quite expensive. If you instead only compare pairs that have some k-mer in common, then you would cut down on the number of comparisons. Maybe limit Smith Waterman to just pairs that have a common k-mer. Other algorithms like BLAST use a similar heuristic to narrow down the search space (two similar sequences will most likely have short exact matches or k-mers in common).

As for building up the consensus sequence, I would think building up a De Bruijn graph would be one way to go, although seems a bit much for an undergrad assignment. Maybe it would suffice to try to find the longest chain of reads, where each high-quality smith-waterman match between two reads is a link in that chain. Said another way, maybe you could store it as a graph, were each node is a read, and an edge signifies a high-scoring smith waterman alignment. Then try to find the longest path through that graph.

Hope that is of some help. Tough question but an interesting one. De Novo assembly is still a hot topic of research, so don't feel too discouraged. Good luck.

Justin

Where do you store the local alignments that you make on sequences with common k-mers? I was initially using a nXn matrix where n was the index of each sequence.

That's the problem I'm having right now. I'm just lost on where to put common sequences and how to find them. Basically, I have...

K-mer 1 -> [Seqa, Seqb, ...Seq n]
K-mer 2 -> [Seqc Seqd]
....
K-mer n -> [Seqe]

I can't figure out an efficient way to traverse the k-mers, and save the local alignments between sequences within the k-mer hash.

**BAMseek** · 11-28-2012, 12:38 PM

A couple possible ways to traverse the k-mer table
1. Go through each k-mer, and do an all-vs-all comparison of sequences within that k-mer address (example: at k-mer2, compare seqC and seqD). Might want to check if you already compared two sequences before doing it again.
2. For each sequence, generate all possible k-mers. Look up those k-mers in the k-mer table and get back the set of all sequences that have a k-mer in common with the given sequence. Then do Smith-Waterman between that sequence and each sequence in the set, again checking if that comparison has already been done.

The NxN matrix seems fine (if it's too big, you might have to do a sparse matrix - a hash of hashes). Maybe you would only want to store whether an edge exists or not. Instead of storing the Smith-Waterman alignments, maybe you can find the longest path, then recompute the Smith-Watermans along that path. That will add more time, but save on storing the alignments.

These are just some ideas, not necessarily saying it's the way to go but hopefully helps you out some.

**bioinfo_student** · 11-28-2012, 12:43 PM

Yup, at the moment, I was going to do an all vs all comparison within each k-mer. So, how would you recommend saving pairs that were already compared?

I used a matrix, initialized it to all zeroes and put in the actual local alignment whenever I performed a comparison. For some reason, Python would not even put zeroes in the matrix when it was large (32000 by 32000).

**BAMseek** · 11-28-2012, 01:05 PM

I was thinking you could just save the scores. Then when you figure out the path through the graph, you can recompute those alignments. Maybe use -1 as the initialization so you can tell if the score has been computed or not (since smith-waterman will give you score of at least 0).

Yeah, in general, the NxN matrix won't work because you would need lots of storage space. I wonder if you are running out of memory. I'm sure there is a sparse matrix library in Python that you could use.

Very large matrices using Python and NumPy

http://stackoverflow.com/questions/1053928/python-numpy-very-large-matrices

NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much lar...

**scottyler89** · 11-28-2012, 01:22 PM

This is just an off the cut idea, but I think it would probably work.

You could make two hashed libraries, one in each direction Dict1={K-mer1:[seqa, seqb...]...} and Dict2={Seqa:[K-mer1, K-mer2...]...}. This could be done in pre-analysis, then create a new hashed library, with the Seq IDs as keys, and for each Key, you could go through Dict2, and append the Dict1 entries for each Kmer in Dict2 for each sequence, ultimately generating a dictionary Dict3={SeqA:[Dict1[Dict2[SeqA][i]]}.

This would give you something that looks like this: Dict3={SeqA:[[SeqC,SeqG,SeqT],[SeqC,SeqT,SeqZ]]}, where you would iterate over all of the Kmers in SeqA, and look for which other sequence had overlap on those kmer entries. You could then rank each sequence in terms of the degree of overlap, to get the most probable overlaping sequences. You would also vicariously have the index where each alignment starts and ends, based on the index of sublist in Dict3.

I'm sure at first glance this probably seems convoluted, but I'm pretty sure it would work.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Student Genome Assembly Question

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News