  • New Short Read Aligner

    Hi,
    I've been working on a short read aligner and would like to find some beta testers. The suite includes single end and paired end read aligners.
    Some features are:
    • Gaps up to 7bp, affine gap penalties
    • Can handle ambiguous codes in ref sequence.
    • Quality based scoring
    • Adapter stripping for miRNA reads
    • No heuristics - reports the best alignment
    • Options for handling multiple alignments include none, random, or all alignments.
    • Alignment Quality scores
    • Can use fasta, fastq, solexa fastq, prb input formats
    • Paired end with full Needleman-Wunsch on both ends.
    • Paired end accepts a structural variation penalty and the best alignment may be two independent ends if score with SV penalty is better than the best pair that fits the fragment length distribution.
    • Supports variable read lengths
    • Includes optional soft masking of repeats.
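
As an illustration of two items in the list above (affine gap penalties and the structural variation penalty for pairs), here is a minimal Python sketch. The function names and all penalty values are hypothetical and are not taken from novoalign; scores are treated as costs, so lower is better. Only the 7bp gap limit comes from the post.

```python
MAX_GAP = 7  # gap limit quoted in the feature list


def affine_gap_cost(length, gap_open=40, gap_extend=6):
    """Cost of a single gap: a one-off opening charge plus a per-base charge.

    gap_open and gap_extend are illustrative values, not novoalign defaults.
    """
    if length <= 0:
        return 0
    if length > MAX_GAP:
        raise ValueError("gap longer than the supported maximum")
    return gap_open + gap_extend * length


def choose_pair(proper_pair_cost, end1_cost, end2_cost, sv_penalty=70):
    """Pick between the best proper pair and two independently aligned ends.

    proper_pair_cost is None when no pair fits the fragment length
    distribution. sv_penalty is a hypothetical value.
    """
    independent_cost = end1_cost + end2_cost + sv_penalty
    if proper_pair_cost is not None and proper_pair_cost <= independent_cost:
        return "proper_pair", proper_pair_cost
    return "independent_ends", independent_cost
```

With an affine scheme one 4bp gap is cheaper than four separate 1bp gaps, which is the usual reason for preferring it over a linear gap penalty.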


    If anyone is interested in getting a copy for testing you can contact me novoalign <at> gmail ....
    The beta version is for 64-bit Linux on x86-64.

    Cheers, Colin
    Last edited by sparks; 06-16-2008, 11:51 PM.

  • #2
    Will it be open source?

    Hi Colin.
    This sounds cool.
    Can you just confirm for us whether or not you plan to make this aligner open source?

    • #3
      Hi myrna,

      At the moment it's not open source but it will be free for open projects and non-profit organisations.
      I might make it open source if I had some funding.

      Colin
      Last edited by sparks; 06-16-2008, 07:13 AM.

      • #4
        Novoalign test

        I have done some testing of sparks’ program Novoalign.

        This program seems to be incredibly fast. It requires only about 6 GB of physical RAM for aligning to the human genome. Using simulated reads with no mismatches, the program gives the same results as SOAP; however, Novoalign is more than 100x faster (half a million reads took just over half a minute on a 3.6 GHz CPU).

        I have tested some real SOLiD reads translated to base space as well. Novoalign was again very fast relative to SOAP: 50,000 reads in 3 min. I used the trimming feature to help with alignment of reads that were mistranslated due to read errors. The uniquely mapped SOLiD reads from Novoalign and SOAP were 99.96% identical.

        I would like to know whether ELAND, which is supposed to be the fastest aligner, would beat Novoalign.
        Last edited by tree; 07-07-2008, 08:40 PM.

        • #5
          I've been using novoalign as well, and my bet is that ELAND should be faster than novoalign at default settings, because novoalign will spend a little more time looking for those extra mismatches and gaps. At a threshold of 60 novoalign should be as fast as ELAND or perhaps a bit faster. ELAND achieves better performance because it indexes reads and does a fast scan of the genome.
          Perhaps somebody would be willing to try it out. Take a few million paired-end/single-end reads and see how novoalign at threshold 60 would do in comparison to ELAND on the same server specification.

          • #6
            I have just tried novo*. Wonderful software. As before, I only tried it on human chrX. It is as fast as ELAND. I tend to believe novo* should be faster on the whole human genome, as indexing will be more efficient than on chrX.

            (Sorry, I was wrong previously and so have removed the paragraph. Quite amazing to me. And as I was wrong, novo* looks even superior.)

            I think it is very important for novo* to support multithreading; otherwise parallelization would be a big problem.

            Novopair does work for me and it improves overall alignment accuracy. However, novopair is overoptimistic about the alignment accuracy: the error rate of Q150 alignments is 0.05%. This error rate is good enough, but it would be better to improve this somewhat. This may be more of a theoretical concern.
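
To see why Q150 looks overoptimistic, here is a small worked conversion. It assumes the alignment qualities are Phred-scaled, i.e. Q = -10 * log10(probability the alignment is wrong), as maq's mapping qualities are; the thread does not state novopair's exact scale, so treat this as a sketch.

```python
import math


def phred_to_error(quality):
    """Error probability implied by a Phred-scaled quality."""
    return 10 ** (-quality / 10)


def error_to_phred(error_rate):
    """Phred-scaled quality implied by an observed error rate."""
    return -10 * math.log10(error_rate)


implied = phred_to_error(150)      # about 1e-15
observed = error_to_phred(0.0005)  # about 33
```

Under that assumption a Q150 alignment claims an error probability of roughly 1e-15, whereas an observed error rate of 0.05% corresponds to a quality of about 33.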

            In all, novo* is really a good set of programs. It is fast and integrates the advantages of most existing programs. I just hope the author could get funding and make it an open source project.

            PS: As far as I know, only SOLiD's own software and SHRiMP fully support color alignment. Maq does partially. Neither novo* nor SOAP supports color alignments. Note that it is not right to do SOLiD alignment in nucleotide space.
            Last edited by lh3; 07-15-2008, 02:41 AM.

            • #7
              see next...
              Last edited by zee; 07-15-2008, 02:24 AM.

              • #8
                Thanks for the comments, lh3. We're working on improving accuracy. Something to be aware of with novo is the alignment threshold, the "-t" parameter. Setting this very high, e.g. -140, for a single-end alignment will report more false positives (FP). It's always tricky working out the right default threshold. Setting it too high will escalate FP, and if it's too low, e.g. > -60, then you don't pick up enough.
                I think the author will be aware of these technicalities and this sort of feedback will help to improve the software. The foreseeable plans are to keep it open for just about everybody in the research community.

                • #9
                  Hi Li Heng,
                  Thanks for your kind comments.
                  Performance slows on larger genomes as more possible alignment locations are evaluated for each read. Additional memory helps here as it makes the index more specific and, while it can be run on an 8 GB RAM server (full human genome), a 16 GB or 32 GB server is going to be 4 or 5 times faster.
                  With regard to multithreading, the index is memory mapped and it's quite possible to run multiple copies of novoalign (same target genome) without any increase in memory. That said, multithreading wouldn't be too difficult as the search classes are all designed to handle it. I need to see if there is real demand.
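
The memory-mapping point can be illustrated with Python's standard `mmap` module. This is only a sketch of the general mechanism, not novoalign's code: `map_index` is a hypothetical helper, and the "index" here is a throwaway temporary file. When several processes map the same file read-only, the operating system can back all of the mappings with the same physical pages.

```python
import mmap
import os
import tempfile


def map_index(path):
    """Map a file read-only; the mapping stays valid after the file is closed."""
    with open(path, "rb") as handle:
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


def demo():
    # Stand-in for a real index file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGT" * 1024)
        path = tmp.name
    try:
        first = map_index(path)   # in practice: one mapping per aligner process
        second = map_index(path)
        same = first[:16] == second[:16]
        first.close()
        second.close()
        return same
    finally:
        os.remove(path)
```

In a real run each mapping would live in a separate aligner process; both are created in one process here only to keep the example self-contained.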
                  The quality calculation is similar in principle to maq: it is the Bayesian posterior probability that the alignment is wrong. Some factors are estimated, and one possible problem is that I rate the reference genome at 2 bits of entropy/base; this may be the cause of the high qualities.
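
A generic version of this kind of posterior calculation can be sketched as follows. It is not the author's formula: it assumes every candidate location has a Phred-scaled cost (lower is better), ignores prior terms such as the reference entropy mentioned above, and uses a hypothetical cap of 150.

```python
import math


def mapping_quality(costs, cap=150):
    """Phred-scaled posterior probability that the best alignment is wrong.

    costs: Phred-scaled costs of all candidate alignments for one read.
    """
    best = min(costs)
    # Likelihoods relative to the best candidate, so the best one is 1.0.
    relative = [10 ** (-(cost - best) / 10) for cost in costs]
    p_wrong = 1.0 - 1.0 / sum(relative)
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))
```

For example, two equally good candidates give a quality of 3, while a second candidate that is 30 units worse than the best gives a quality of 30. If some of the factors are overestimated, the resulting qualities come out too high, which is the effect discussed in this thread.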

                  I deliberately haven't done SOLiD as I'd like to do it properly or not at all. That said, if someone wants to try, I suggest converting the reference genome to colour space rather than the reads to nucleotide space.
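
For anyone who wants to try the reference conversion, here is a minimal sketch of the standard SOLiD two-base encoding, where each colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function name and the use of "." for transitions involving ambiguous bases are choices made for this example.

```python
_BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colour_space(sequence):
    """Encode a nucleotide sequence as one colour per adjacent base pair."""
    sequence = sequence.upper()
    colours = []
    for first, second in zip(sequence, sequence[1:]):
        if first in _BASE_CODE and second in _BASE_CODE:
            colours.append(str(_BASE_CODE[first] ^ _BASE_CODE[second]))
        else:
            colours.append(".")  # transition involving N or another ambiguity code
    return "".join(colours)
```

For instance, `to_colour_space("ATGGC")` gives `"3103"`. Converting the reference is the safer direction because a single miscalled colour in a read changes every downstream base when the read is decoded to nucleotides, whereas in colour space it stays a single mismatch.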

                  • #10
                    Just one more point, even though novoalign uses a k-mer index of the genome it is not a seeded alignment ala Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).
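
For readers unfamiliar with the term, a k-mer index of a genome can be sketched as below. This shows only a plain exact-match index and lookup; it does not attempt novoalign's iterative, gap-aware search, and the function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_starts(read, index, k):
    """Count, for each implied read start on the reference, how many k-mers agree."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return dict(votes)
```

A usage example: with `index = build_kmer_index("ACGTACGTTTGA", 4)`, the call `candidate_starts("CGTT", index, 4)` returns `{5: 1}`.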

                    • #11
                      see below...
                      Last edited by lh3; 07-15-2008, 05:11 AM.

                      • #12
                        Originally posted by sparks
                        Just one more point, even though novoalign uses a k-mer index of the genome it is not a seeded alignment ala Blast/Blat/Shrimp. It's an iterative alignment that will match the read against k-mers in the index using a combinatorial approach (with gaps).
                        Lately I can vaguely see how this could be done, but I am still keen to see the details if you publish the algorithm some day. Nice work!

                        • #13
                          Think blastp-type seeding with qualities replacing the BLOSUM matrix, and add gaps.
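
One way to picture qualities taking the place of a fixed substitution matrix is a per-base penalty derived from the Phred base quality. The model below is a common textbook formulation and an assumption on my part, not novoalign's actual scoring: a base with quality Q is wrong with probability e = 10^(-Q/10), and an error is equally likely to be any of the other three bases.

```python
import math


def base_penalty(read_base, ref_base, quality):
    """Phred-scaled penalty for aligning read_base (with given quality) to ref_base."""
    # Cap the error at 0.75: a base with no information is a uniform guess.
    error = min(10 ** (-quality / 10), 0.75)
    if read_base == ref_base:
        probability = 1.0 - error
    else:
        probability = error / 3.0
    return -10 * math.log10(probability)
```

With this model a mismatch at a Q30 base costs about 34.8, while a mismatch at a Q10 base costs about 14.8, so disagreements at low-quality positions are penalised less.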

                          • #14
                            I've been back and looked at our error rate on simulated reads and it's typically around 0.005% without selecting for quality. We've used maq simulate, modified to insert longer indels, and paf_utils (great tools), but we also had to modify the latter to allow a few extra bases of uncertainty in alignment location, as the novo aligners are much more likely to add a few gaps into an alignment than perhaps maq does.
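
The tolerance idea can be sketched as follows. This is not paf_utils itself, just a minimal stand-in: the record layout, the function name and the default tolerance of 5 bases are all assumptions made for this example.

```python
def error_rate(alignments, tolerance=5):
    """Fraction of simulated reads placed at the wrong location.

    alignments: iterable of (true_chrom, true_pos, aligned_chrom, aligned_pos).
    An alignment counts as correct if it is on the right chromosome and
    within `tolerance` bases of the simulated position.
    """
    total = 0
    wrong = 0
    for true_chrom, true_pos, chrom, pos in alignments:
        total += 1
        if chrom != true_chrom or abs(pos - true_pos) > tolerance:
            wrong += 1
    return wrong / total if total else 0.0
```

Allowing a few bases of slack matters because a gap opened near the start of a read shifts the reported start coordinate without the alignment being wrong.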

                            • #15
                              Hi all,
                              I've just put up an update to novoalign & novopaired. This update improves quality scores for novopaired and also fixes an illegal instruction fault reported by one user.
                              You can download it at www.novocraft.com
                              I've also changed the license terms so it's free for any non-profit even if you don't publish in open journals.
                              Colin
