  • Ray: a NEW MPI-based 100% parallel genome assembler

    Dear SeqAnswers:

    The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.

    Try it, and give us your comments, bugs, suggestions, and concerns on our mailing list.


    Ray-0.0.3: a NEW MPI-based parallel genome assembler


    ***
    The Ray Project Team

  • #2
    Originally posted by seb567
    The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.
    The README.txt is confusing in some of the sections. I hope you can help me clarify them.

    [Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

    [Q] Is "OpenAssembler" the same software as "Ray" ?

    [Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

    [Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

    [Q] Does "Ray" use the quality values in the FASTQ file for anything?

    [Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

    Thank you for your time,

    Torsten
    Last edited by Torst; 03-09-2010, 10:46 PM. Reason: Added two more questions.



    • #3
      Does it support colorspace?
      http://kevin-gattaca.blogspot.com/



      • #4
        Ray -- questions & answers!

        [Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

        That is right. Ray does not create paired-end reads from an SFF file.

        [Q] Is "OpenAssembler" the same software as "Ray" ?

        No, but Ray is a parallel implementation of the OpenAssembler algorithm. The paper describing OpenAssembler is still under review (submitted on 15 October 2009...), and one of its weaknesses is that it is not parallel, thus not scalable. So, I started coding Ray on 2010-01-21, and I decided to put it on the web to get feedback.

        [Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

        When an error occurs, it should occur randomly. The 454 homopolymer errors are not random: they occur more often in homopolymer stretches. In the OpenAssembler paper (under review since 15 October 2009) we show, however, that Illumina's error incorporation is random, and that 454+Illumina also has random error incorporation. The take-home message is that randomly incorporated errors are easy to detect and fix, whereas reproducible errors are defective-by-design.

        Illumina errors are distributed over the whole read, with more observed errors at the end. 454 errors are mostly related to homopolymers; for instance, you will observe both ATCTAGCAAAAATACGCAT and ATCTAGCAAAAAATACGCAT with the same abundance (notice the length of the run of A's).
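To make the homopolymer point concrete, here is a minimal Python sketch (not part of Ray; the function names are made up for illustration) that run-length encodes the two example reads above. It shows that they differ only in the length of one homopolymer run, which is why this kind of error does not look like a random mismatch.

```python
from itertools import groupby


def run_length_encode(seq):
    """Return a list of (base, run_length) pairs for a DNA string."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def homopolymer_compress(seq):
    """Collapse every homopolymer run to a single base."""
    return "".join(base for base, _ in groupby(seq))


read_a = "ATCTAGCAAAAATACGCAT"   # run of 5 A's
read_b = "ATCTAGCAAAAAATACGCAT"  # run of 6 A's

# Both reads become identical once homopolymer runs are collapsed.
same_after_compression = homopolymer_compress(read_a) == homopolymer_compress(read_b)

# The only run that differs between the two reads is the A run.
differing_runs = [
    (a, b)
    for a, b in zip(run_length_encode(read_a), run_length_encode(read_b))
    if a != b
]
print(same_after_compression, differing_runs)
```

The pairwise `zip` comparison only works here because both reads have the same number of runs; it is an illustration, not a general alignment method.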

        [Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

        You should know the true values before running Ray. For instance, the SRA001125 dataset contains paired-end Illumina reads for E. coli K12 MG1655. Usually, if you have paired-end data, you should know the geometry (fragmentLength+deviation) of your reads.

        An example of that:

        [boiseb01@ls30 SRA001125]$ echo "LoadPairedEndReads 200xSRR001665_1.fastq 200xSRR001665_2.fastq 215 20
        LoadPairedEndReads 200xSRR001666_1.fastq 200xSRR001666_2.fastq 215 20" > input
        [boiseb01@ls30 SRA001125]$ /home/boiseb01/software/ompi-1.4.1-gcc/bin/mpirun -np 31 /home/boiseb01/Ray/trunk/Ray ./input |tee Log
        [boiseb01@ls30 SRA001125]$ ls -l Contigs.fasta
        -rw-rw-r--. 1 boiseb01 boiseb01 4710363 2010-03-09 17:01 Contigs.fasta
        [boiseb01@ls30 SRA001125]$ grep '>' Contigs.fasta |wc -l
        224

        As such, we get 224 contigs of at least 100 nt for this small bug.

        If you provide paired-end reads, you need to provide accurate values for <fragmentLength> and <fragmentLengthStandardDeviation>.
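If you have several libraries, the input file can be generated instead of typed by hand. The following Python sketch only reproduces the line layout visible in the transcript above (`LoadPairedEndReads <left> <right> <fragmentLength> <fragmentLengthStandardDeviation>`); the helper names are hypothetical, and other Ray versions may expect a different format.

```python
def paired_end_line(left_file, right_file, fragment_length, std_deviation):
    """Build one LoadPairedEndReads line in the layout shown in the transcript."""
    return f"LoadPairedEndReads {left_file} {right_file} {fragment_length} {std_deviation}"


def build_input(libraries):
    """Join one line per library; each library is (left, right, length, deviation)."""
    return "\n".join(paired_end_line(*lib) for lib in libraries) + "\n"


# The two SRA001125 libraries from the example, with the same geometry (215, 20).
libraries = [
    ("200xSRR001665_1.fastq", "200xSRR001665_2.fastq", 215, 20),
    ("200xSRR001666_1.fastq", "200xSRR001666_2.fastq", 215, 20),
]
input_text = build_input(libraries)
print(input_text, end="")
```

Writing `input_text` to a file named `input` would give the same content as the `echo` command in the transcript.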

        [Q] Does "Ray" use the quality values in the FASTQ file for anything?

        No, Ray auto-calibrates itself using abundance of k-mers.
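As a rough illustration of why k-mer abundance can stand in for quality values, here is a small Python sketch that counts k-mers and keeps those seen at least a given number of times. This is not Ray's calibration procedure; the fixed `min_abundance` threshold and the function names are assumptions made for the example, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_abundance):
    """Keep k-mers seen at least min_abundance times (hypothetical fixed cutoff)."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


# Four error-free copies of a toy read plus one copy with a single mismatch.
reads = ["ACGTAC"] * 4 + ["ACGTTC"]
counts = count_kmers(reads, k=4)
trusted = solid_kmers(counts, min_abundance=2)
print(sorted(counts.items()), sorted(trusted))
```

K-mers that overlap the random mismatch appear only once, while genuine k-mers are seen in every copy, so the abundance alone separates them in this toy case.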

        [Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

        Try it! I don't know.

        My benchmarks so far include:

        * SRA001125 paired (E. coli K12 MG1655, Illumina data)
        * S. pneumoniae R6 50-nt reads, 50 X
        * S. pneumoniae R6 50-nt reads, 50 X, 1% random mismatches
        * E. coli K12 MG1655, 400-nt reads, 50 X
        * Human chromosome 1, 50-nt reads
        * Pseudomonas aeruginosa, 50-nt reads, 50 X

        [Q] Does it support colorspace?

        Currently, only fasta, fastq, and SFF.

        As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

        My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.



        I hope it helps!


        ***
        The Ray Project Team



        • #5
          Have you had a chance to compare it to the existing assemblers? both for accuracy and time, and also memory?
          --
          bioinfosm



          • #6
            Ray is based on the OpenAssembler algorithm, and we did compare OpenAssembler with ABySS, EULER, Newbler, and Velvet for our submitted paper on OpenAssembler (submitted on 15 October 2009).

            We compared on SRA001125, an Illumina paired-end reads dataset, and SRA003611, a mix of 454 and Illumina reads, as well as many simulated datasets (with and without randomly incorporated errors). The conclusion was that only OpenAssembler can mix technologies, and that OpenAssembler is the best on Illumina paired and non-paired data.
            EULER was the worst (the very worst!) in my benchmarks. Velvet was very good, and Newbler was the best on 454 (only Newbler worked with 454 in my benchmarks).
            Because OpenAssembler auto-learns from the data instead of trying to figure out the statistics like in Velvet (they created VelvetOptimizer to alleviate that shortcoming!), we think the usability is better.

            In principle, if errors are incorporated randomly (Illumina and SOLiD) and if the coverage is rather uniform (any technology, I think), then we have strong theoretical support to say that the conservative approach of OpenAssembler prevents misassemblies (chimeric contigs), although we did observe some mismatches. This theoretical support is provided by a set of rules, heuristics, and some invariants.

            Tired of waiting for the reviewing process, I decided to start Ray and release its source code as soon as possible!

            Accuracy:

            OpenAssembler does not produce chimeric contigs, but produces some mismatches when errors are present in reads (28 mismatches for SRA001125!). On SRA001125, Ray produces no chimeric contigs, and a few mismatches.

            Time:

            For SRA001125, with all the reads from NCBI SRA, it takes about 30 minutes on 31 MPI processes. Human chromosome 1, with 50-nt reads at 50 X takes about 2 hours on 400 MPI processes (Itanium, Infiniband).

            Memory:

            Ray is gentle with the memory usage. It uses SplayTree (who uses them anyway??). In a splay tree, the keys accessed often are near the root whereas keys accessed a few times will be in the leaves.
            Ray distributes everything across MPI processes: reads, paired-end linkages, vertices, arcs, seeds, extensions, fusions, finished fusions. To communicate, Ray uses about 90 message types, so Ray instances like to communicate!
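One common way to spread vertices over processes is to hash each k-mer and take the result modulo the number of ranks, so that any process can compute which rank owns a given k-mer. The Python sketch below illustrates that idea only; whether Ray assigns owners this way is an assumption, and `owner_rank` and `partition` are hypothetical names. No MPI library is used here.

```python
import zlib


def owner_rank(kmer, n_ranks):
    """Map a k-mer to a rank with a deterministic hash (CRC32 chosen for the example)."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


def partition(kmers, n_ranks):
    """Group k-mers into one bucket per rank, as if each bucket were sent to its owner."""
    buckets = [[] for _ in range(n_ranks)]
    for kmer in kmers:
        buckets[owner_rank(kmer, n_ranks)].append(kmer)
    return buckets


kmers = ["ACGT", "CGTA", "GTAC", "TACG", "ACGG", "CGGA"]
buckets = partition(kmers, n_ranks=31)
print([(rank, bucket) for rank, bucket in enumerate(buckets) if bucket])
```

A deterministic hash is used instead of Python's built-in `hash`, which is randomized per process for strings and would give different owners on different ranks.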

            If you want to know about memory usage, check Vertex.h. The coverage is stored on a uint8_t, edges are stored on a uint8_t, and there are some linked lists too.
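To give a feel for what fits in two bytes per vertex, here is a Python sketch of a vertex with an 8-bit saturating coverage counter and an 8-bit edge mask. It does not reproduce Vertex.h: the bit layout (low 4 bits for outgoing A, C, G, T, high 4 bits for ingoing), the saturation at 255, and the class itself are assumptions made for illustration, and Python integers are of course not one byte wide.

```python
BASES = "ACGT"


class CompactVertex:
    """Hypothetical vertex: one byte of coverage and one byte of edge flags."""

    __slots__ = ("coverage", "edges")

    def __init__(self):
        self.coverage = 0  # kept within 0..255, like a uint8_t
        self.edges = 0     # low 4 bits: outgoing A,C,G,T; high 4 bits: ingoing

    def add_observation(self):
        """Increment coverage, saturating at 255 instead of wrapping around."""
        self.coverage = min(self.coverage + 1, 255)

    def add_outgoing(self, base):
        self.edges |= 1 << BASES.index(base)

    def add_ingoing(self, base):
        self.edges |= 1 << (4 + BASES.index(base))

    def outgoing(self):
        return [b for i, b in enumerate(BASES) if self.edges & (1 << i)]

    def ingoing(self):
        return [b for i, b in enumerate(BASES) if self.edges & (1 << (4 + i))]


vertex = CompactVertex()
for _ in range(300):
    vertex.add_observation()
vertex.add_outgoing("C")
vertex.add_ingoing("T")
print(vertex.coverage, bin(vertex.edges), vertex.outgoing(), vertex.ingoing())
```

Eight bits are enough for the edges because a k-mer has at most four possible successors and four possible predecessors.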

            In Ray, there is no tip cutting, and no bubble popping, which makes it a very different approach in comparison with Velvet/soapdenovo/ABySS/EULER.

            But remember, a genome assembler is like an interpreter (python/perl/ruby), and its execution depends on the program (the reads) you give it, so you can't really summarise things that much.

            Enjoy Ray!

            ***
            The Ray Project Team



            • #7
              Originally posted by seb567
              [Q] Does it support colorspace?

              Currently, only fasta, fastq, and SFF.

              As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

              My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
              SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.



              • #8
                Originally posted by nilshomer
                SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.
                AGREED
                AFAIK, they do not have their own assembler but rely on conversion scripts to feed into Velvet, which we all know is very memory hungry.
                http://kevin-gattaca.blogspot.com/



                • #9
                  Thanks seb567 for the detailed response.

                  One point I missed earlier is the approximate coverage Ray recommends for Solexa data. I know Velvet recommends ~40x and is not efficient with more or less coverage.

                  Also, I heard MIRA is an assembler that combines data from various read technologies.
                  --
                  bioinfosm



                  • #10
                    Is there any limit to the number of Illumina reads Ray can handle? We were going to try it on a 200 Mb worm that's repetitive and has high heterozygosity. What do you think, too big?



                    • #11
                      @bioinfosm The SRA001125 dataset has about 109 X coverage. I think something between 30 and 100 is adequate for Illumina data.
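For planning a run, the expected coverage is simply the number of reads times the read length divided by the genome length. The Python sketch below uses made-up round numbers (they are not measurements from any dataset in this thread) and hypothetical function names.

```python
import math


def expected_coverage(n_reads, read_length, genome_length):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical example: 1,000,000 reads of 50 nt on a 1 Mb genome.
example_coverage = expected_coverage(1_000_000, 50, 1_000_000)

# Hypothetical example: 30 X on a 200 Mb genome with 100-nt reads.
example_reads = reads_needed(30, 100, 200_000_000)
print(example_coverage, example_reads)
```

This is an average only; repeats and uneven coverage mean that some regions will be well below the computed depth.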

                      @Mizzou55 You will need paired-end reads. What is your read length? Fragment length? You can handle as much data as fits in the available distributed memory. Please note that you need Open-MPI, not MPICH2 or MVAPICH, because Ray crashes with these libraries whereas it does not with Open-MPI. Ray MPI processes always send small messages, and Open-MPI always sends small messages eagerly, but MPICH2-based MPI implementations apparently lack that behavior. As for the high heterozygosity, Ray does not support that right now, because Ray currently sees it as non-random error incorporation. I am currently working on color-space for the upcoming release, version 0.0.4, but heterozygosity is the next feature I will add.


                      Thanks!

                      ***
                      The Ray Project Team



                      • #12
                        For the worm genome we have 100 bp reads and two paired-end insert sizes: 300 and 400. We will have 30-40X. Assuming the heterozygosity issue is resolved, would you anticipate better results than SOAP or ABySS with this data input?



                        • #13
                          @Mizzou55: I don't know, honestly, if your data are better assembled with a specific tool.
                          Last edited by seb567; 03-22-2010, 06:33 PM. Reason: spelling



                          • #15
                            Nice, SOLiD support is in already.
                            But darn, on CentOS 5.4 Open MPI is version 1.3.2, so I had compile errors. Still messing around with it.
                            http://kevin-gattaca.blogspot.com/
