  • Student Genome Assembly Question

    Hey everybody,

    I'm taking a bioinformatics course as an undergrad and I'm having a little trouble building an assembler.

    So, the project is to build a whole genome assembly given a set of thousands of reads. What I've done so far is write the Smith Waterman code in Python and am now trying to start with the assembly.

    The basic way I can see to start is to go through all of the reads and build a k-mer hash. So, I did that and now I'm kind of stuck.

    I understand the basic idea is now to go through the k-mer hash and find which reads share the same k-mers. But I have no idea how to use this information to determine when to run a Smith-Waterman alignment. Also, where do I store these Smith-Waterman alignments to later use them to build a whole genome?

    I would appreciate any direction, this is killing me.

    Thanks again!

  • #2
    I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. You could do Smith-Waterman on all possible read pairs, but that would be quite expensive. If you instead limit Smith-Waterman to just pairs that have some k-mer in common, you cut down the number of comparisons a lot. Other algorithms like BLAST use a similar heuristic to narrow the search space: two similar sequences will most likely have short exact matches, i.e. k-mers, in common.
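    A minimal Python sketch of that filtering idea (the reads and the choice of k = 4 are invented for illustration): index every read by its k-mers, then only emit pairs that share at least one.

```python
from collections import defaultdict
from itertools import combinations

def kmer_index(reads, k):
    """Map each k-mer to the set of read names containing it."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_pairs(index):
    """Collect each pair of reads sharing at least one k-mer, deduplicated."""
    pairs = set()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))  # a pair seen under several k-mers is stored once
    return pairs

reads = {"r1": "ACGTACGT", "r2": "GTACGTTT", "r3": "TTTTTTTT"}
pairs = candidate_pairs(kmer_index(reads, 4))
# r1 and r2 share several 4-mers; r3 shares none, so it pairs with nobody
```

    Here only the ("r1", "r2") pair survives, so Smith-Waterman would run once instead of three times.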

    As for building up the consensus sequence, a De Bruijn graph would be one way to go, although that seems a bit much for an undergrad assignment. Maybe it would suffice to find the longest chain of reads, where each high-quality Smith-Waterman match between two reads is a link in that chain. Said another way, you could store it as a graph, where each node is a read and an edge signifies a high-scoring Smith-Waterman alignment, then try to find the longest path through that graph.
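    As a toy version of that graph idea (the read names and edges here are made up; a real edge would come from a high-scoring Smith-Waterman alignment), a depth-first search can find the longest simple path. Note that longest-path is NP-hard in general, so this only works on small graphs:

```python
def longest_path(graph, node, seen=None):
    """DFS for the longest simple path starting at `node`."""
    seen = (seen or set()) | {node}
    best = [node]
    for nxt in graph.get(node, []):
        if nxt not in seen:
            path = [node] + longest_path(graph, nxt, seen)
            if len(path) > len(best):
                best = path
    return best

# toy overlap graph: an edge means "these two reads aligned well"
graph = {"r1": ["r2"], "r2": ["r3"], "r3": []}
contig_order = longest_path(graph, "r1")
```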

    Hope that is of some help. Tough question, but an interesting one. De novo assembly is still a hot topic of research, so don't feel too discouraged. Good luck.

    Justin

    • #3
      Originally posted by BAMseek View Post
      I would think that the k-mer hashing would allow you to quickly find candidate reads to compare. [...]


      Where do you store the local alignments that you make on sequences with common k-mers? I was initially using an n x n matrix, where n is the number of sequences.

      That's the problem I'm having right now. I'm just lost on where to put common sequences and how to find them. Basically, I have...


      K-mer 1 -> [Seqa, Seqb, ..., Seqn]
      K-mer 2 -> [Seqc, Seqd]
      ...
      K-mer n -> [Seqe]

      I can't figure out an efficient way to traverse the k-mers and save the local alignments between sequences within the k-mer hash.

      • #4
        A couple of possible ways to traverse the k-mer table:
        1. Go through each k-mer and do an all-vs-all comparison of the sequences at that k-mer's entry (example: at K-mer 2, compare Seqc and Seqd). You might want to check whether two sequences have already been compared before doing it again.
        2. For each sequence, generate all possible k-mers. Look up those k-mers in the k-mer table and get back the set of all sequences that have a k-mer in common with the given sequence. Then do Smith-Waterman between that sequence and each sequence in the set, again checking whether that comparison has already been done.
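        Option 2 might look something like this rough sketch, where `smith_waterman` stands in for the aligner the student has already written (replaced by a dummy scorer here just to make the example run; reads and k are invented):

```python
from collections import defaultdict

def compare_via_index(reads, k, smith_waterman):
    # build the k-mer table: k-mer -> set of reads containing it
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    done, scores = set(), {}
    for name, seq in reads.items():
        # union of all reads sharing any k-mer with this one
        partners = set()
        for i in range(len(seq) - k + 1):
            partners |= index[seq[i:i + k]]
        partners.discard(name)
        for other in partners:
            pair = tuple(sorted((name, other)))
            if pair not in done:  # skip comparisons already made
                done.add(pair)
                scores[pair] = smith_waterman(reads[pair[0]], reads[pair[1]])
    return scores

reads = {"r1": "ACGTACGT", "r2": "GTACGTTT", "r3": "TTTTTTTT"}
scores = compare_via_index(reads, 4, lambda a, b: 1)  # dummy scorer
```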

        The NxN matrix seems fine (if it's too big, you might have to do a sparse matrix - a hash of hashes). Maybe you would only want to store whether an edge exists or not. Instead of storing the Smith-Waterman alignments, maybe you can find the longest path, then recompute the Smith-Watermans along that path. That will add more time, but save on storing the alignments.
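        The sparse "hash of hashes" alternative could be as simple as this sketch (the score threshold of 20 and the read names are invented; a real cutoff depends on the scoring scheme):

```python
from collections import defaultdict

scores = defaultdict(dict)  # read -> {read: score}, only for kept edges

def record(a, b, score, threshold=20):
    """Store only high-scoring pairs, keyed in a canonical order."""
    if score >= threshold:
        i, j = sorted((a, b))
        scores[i][j] = score

record("r1", "r2", 37)   # kept: becomes an edge in the overlap graph
record("r1", "r3", 5)    # below threshold: takes up no space at all
```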

        These are just some ideas, not necessarily saying it's the way to go but hopefully helps you out some.

        • #5
          Yup, at the moment I was going to do an all-vs-all comparison within each k-mer. So, how would you recommend saving pairs that were already compared?

          I used a matrix, initialized it to all zeroes, and put in the actual local alignment whenever I performed a comparison. For some reason, Python would not even fill the matrix with zeroes when it was large (32,000 by 32,000).

          • #6
            I was thinking you could just save the scores. Then when you figure out the path through the graph, you can recompute those alignments. Maybe use -1 as the initialization so you can tell whether a score has been computed or not (since Smith-Waterman will give you a score of at least 0).

            Yeah, in general, the NxN matrix won't work because you would need lots of storage space. I wonder if you are running out of memory. I'm sure there is a sparse matrix library in Python that you could use.
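            A plain dict gives the same effect as a -1-initialized sparse matrix without allocating anything up front; a missing key plays the role of -1. A sketch with made-up read names:

```python
scores = {}  # (read, read) -> Smith-Waterman score, stored sparsely

def set_score(a, b, score):
    scores[tuple(sorted((a, b)))] = score

def get_score(a, b):
    # -1 signals "not yet computed"; safe because SW scores are >= 0
    return scores.get(tuple(sorted((a, b))), -1)

set_score("r5", "r2", 12)
```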

            NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much lar...

            • #7
              This is just an off-the-cuff idea, but I think it would probably work.

              You could make two hashed libraries, one in each direction: Dict1 = {K-mer1: [Seqa, Seqb, ...], ...} and Dict2 = {Seqa: [K-mer1, K-mer2, ...], ...}. This could be done in pre-analysis. Then create a third hashed library with the Seq IDs as keys: for each key, go through Dict2 and append the Dict1 entry for each k-mer of that sequence, ultimately generating a dictionary Dict3 = {SeqA: [Dict1[Dict2[SeqA][i]] for each i]}.

              This would give you something that looks like Dict3 = {SeqA: [[SeqC, SeqG, SeqT], [SeqC, SeqT, SeqZ]]}, where you iterate over all of the k-mers in SeqA and look for which other sequences overlap on those k-mer entries. You could then rank each sequence by its degree of overlap to get the most probable overlapping sequences. You would also implicitly have the index where each alignment starts and ends, based on the index of the sublist in Dict3.

              I'm sure at first glance this probably seems convoluted, but I'm pretty sure it would work.
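              A literal Python sketch of the three dictionaries above (the reads and k = 4 are invented; Dict3 keeps one sublist per k-mer of a read, so a sublist's position says where along the read the sharing happens):

```python
from collections import defaultdict

def build_dicts(reads, k):
    dict1 = defaultdict(list)  # k-mer -> reads containing it
    dict2 = {}                 # read  -> its k-mers, in order
    for name, seq in reads.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        dict2[name] = kmers
        for km in kmers:
            dict1[km].append(name)
    # read -> per-k-mer list of the *other* reads sharing that k-mer
    dict3 = {name: [[r for r in dict1[km] if r != name] for km in kmers]
             for name, kmers in dict2.items()}
    return dict1, dict2, dict3

d1, d2, d3 = build_dicts({"SeqA": "ACGTA", "SeqB": "CGTAC"}, 4)
# SeqA's second 4-mer (CGTA) is shared with SeqB; its first (ACGT) is not
```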
