Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bioinfo_student
    Junior Member
    • Nov 2012
    • 3

    Student Genome Assembly Question

    Hey everybody,

    I'm taking a bio-informatics course as an undergrad and having a little trouble with building an assembler.

    So, the project is to build a whole genome assembly given a set of thousands of reads. What I've done so far is write the Smith Waterman code in Python and am now trying to start with the assembly.

    The basic way I can see to start is going through all of utthe reads and build k-mer hashes. So, I did that and now I'm kind of stuck.

    I undestand the basic idea is now to go through the k-mer hash and find which reads have the same k-mer's. But, I have no idea how to use this information to determine when to use a Smith Waterman alignment. Also, where do I store these Smith Waterman alignments to later use them to build a whole genome?

    I would appreciate any direction, this is killing me.

    Thanks again!
  • BAMseek
    Senior Member
    • Apr 2011
    • 124

    #2
    I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. You could do Smith-Waterman on all possible read pairs, but this would be quite expensive. If you instead only compare pairs that have some k-mer in common, then you would cut down on the number of comparisons. Maybe limit Smith Waterman to just pairs that have a common k-mer. Other algorithms like BLAST use a similar heuristic to narrow down the search space (two similar sequences will most likely have short exact matches or k-mers in common).

    As for building up the consensus sequence, I would think building up a De Bruijn graph would be one way to go, although seems a bit much for an undergrad assignment. Maybe it would suffice to try to find the longest chain of reads, where each high-quality smith-waterman match between two reads is a link in that chain. Said another way, maybe you could store it as a graph, were each node is a read, and an edge signifies a high-scoring smith waterman alignment. Then try to find the longest path through that graph.

    Hope that is of some help. Tough question but an interesting one. De Novo assembly is still a hot topic of research, so don't feel too discouraged. Good luck.

    Justin

    Comment

    • bioinfo_student
      Junior Member
      • Nov 2012
      • 3

      #3
      Originally posted by BAMseek View Post
      I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. You could do Smith-Waterman on all possible read pairs, but this would be quite expensive. If you instead only compare pairs that have some k-mer in common, then you would cut down on the number of comparisons. Maybe limit Smith Waterman to just pairs that have a common k-mer. Other algorithms like BLAST use a similar heuristic to narrow down the search space (two similar sequences will most likely have short exact matches or k-mers in common).

      As for building up the consensus sequence, I would think building up a De Bruijn graph would be one way to go, although seems a bit much for an undergrad assignment. Maybe it would suffice to try to find the longest chain of reads, where each high-quality smith-waterman match between two reads is a link in that chain. Said another way, maybe you could store it as a graph, were each node is a read, and an edge signifies a high-scoring smith waterman alignment. Then try to find the longest path through that graph.

      Hope that is of some help. Tough question but an interesting one. De Novo assembly is still a hot topic of research, so don't feel too discouraged. Good luck.

      Justin

      Where do you store the local alignments that you make on sequences with common k-mers? I was initially using a nXn matrix where n was the index of each sequence.

      That's the problem I'm having right now. I'm just lost on where to put common sequences and how to find them. Basically, I have...


      K-mer 1 -> [Seqa, Seqb, ...Seq n]
      K-mer 2 -> [Seqc Seqd]
      ....
      K-mer n -> [Seqe]

      I can't figure out an efficient way to traverse the k-mers, and save the local alignments between sequences within the k-mer hash.

      Comment

      • BAMseek
        Senior Member
        • Apr 2011
        • 124

        #4
        A couple possible ways to traverse the k-mer table
        1. Go through each k-mer, and do an all-vs-all comparison of sequences within that k-mer address (example: at k-mer2, compare seqC and seqD). Might want to check if you already compared two sequences before doing it again.
        2. For each sequence, generate all possible k-mers. Look up those k-mers in the k-mer table and get back the set of all sequences that have a k-mer in common with the given sequence. Then do Smith-Waterman between that sequence and each sequence in the set, again checking if that comparison has already been done.

        The NxN matrix seems fine (if it's too big, you might have to do a sparse matrix - a hash of hashes). Maybe you would only want to store whether an edge exists or not. Instead of storing the Smith-Waterman alignments, maybe you can find the longest path, then recompute the Smith-Watermans along that path. That will add more time, but save on storing the alignments.

        These are just some ideas, not necessarily saying it's the way to go but hopefully helps you out some.

        Comment

        • bioinfo_student
          Junior Member
          • Nov 2012
          • 3

          #5
          Yup, at the moment, I was going to do an all vs all comparison within each k-mer. So, how would you recommend saving pairs that were already compared?

          I used a matrix, initialized it to all zeroes and put in the actual local alignment whenever I performed a comparison. For some reason, Python would not even put zeroes in the matrix when it was large (32000 by 32000).

          Comment

          • BAMseek
            Senior Member
            • Apr 2011
            • 124

            #6
            I was thinking you could just save the scores. Then when you figure out the path through the graph, you can recompute those alignments. Maybe use -1 as the initialization so you can tell if the score has been computed or not (since smith-waterman will give you score of at least 0).

            Yeah, in general, the NxN matrix won't work because you would need lots of storage space. I wonder if you are running out of memory. I'm sure there is a sparse matrix library in Python that you could use.

            NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much lar...

            Comment

            • scottyler89
              Junior Member
              • Nov 2012
              • 6

              #7
              This is just an off the cut idea, but I think it would probably work.

              You could make two hashed libraries, one in each direction Dict1={K-mer1:[seqa, seqb...]...} and Dict2={Seqa:[K-mer1, K-mer2...]...}. This could be done in pre-analysis, then create a new hashed library, with the Seq IDs as keys, and for each Key, you could go through Dict2, and append the Dict1 entries for each Kmer in Dict2 for each sequence, ultimately generating a dictionary Dict3={SeqA:[Dict1[Dict2[SeqA][i]]}.

              This would give you something that looks like this: Dict3={SeqA:[[SeqC,SeqG,SeqT],[SeqC,SeqT,SeqZ]]}, where you would iterate over all of the Kmers in SeqA, and look for which other sequence had overlap on those kmer entries. You could then rank each sequence in terms of the degree of overlap, to get the most probable overlaping sequences. You would also vicariously have the index where each alignment starts and ends, based on the index of sublist in Dict3.

              I'm sure at first glance this probably seems convoluted, but I'm pretty sure it would work.

              Comment

              Latest Articles

              Collapse

              • GATTACAT
                Reply to Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by GATTACAT
                Love this - good data definitely starts from good input, and poor input can only give relatively poor data. I particularly like the mention of Nanodrop/absorbance based methods for quantification. It's such a toss up if you'll get an accurate reading or what amounts to a randomly generated number, and a lot of library/sequencing related issues can be traced back to poor quant.
                07-01-2026, 11:43 AM
              • SEQadmin2
                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                by SEQadmin2


                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                Here are nine questions we think about, in roughly the order they matter, before...
                06-18-2026, 07:11 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Yesterday, 11:08 AM
              0 responses
              6 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-30-2026, 05:37 AM
              0 responses
              11 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-26-2026, 11:10 AM
              0 responses
              19 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-17-2026, 06:09 AM
              0 responses
              53 views
              0 reactions
              Last Post SEQadmin2  
              Working...