  • bioinfosm
    replied
    Thanks seb567 for the detailed response.

    One point I missed earlier is the approximate coverage that Ray recommends for Solexa data. I know Velvet recommends ~40x and is not efficient with more or less coverage than that.

    Also, I heard that MIRA is an assembler that combines data from various read technologies.


  • KevinLam
    replied
    Originally posted by nilshomer
    SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.
    AGREED
    AFAIK, they do not have their own assembler but rely on conversion scripts to feed into Velvet, which we all know is very memory hungry.


  • nilshomer
    replied
    Originally posted by seb567
    [Q] Does it support colorspace?

    Currently, only fasta, fastq, and SFF.

    As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

    My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
    SOLiD will require some thought as errors, variants and combinations thereof manifest differently with respect to each other. Hopefully you will embrace SOLiD data as many groups/labs are clamoring for an easy and powerful assembler for SOLiD data.


  • seb567
    replied
    Ray is based on the OpenAssembler algorithm, and we did compare OpenAssembler with ABySS, EULER, Newbler, and Velvet for our submitted paper on OpenAssembler (submitted on 15 October 2009).

    We compared on SRA001125, an Illumina paired-end reads dataset, and SRA003611, a mix of 454 and Illumina reads, as well as many simulated datasets (with and without randomly incorporated errors). The conclusion was that only OpenAssembler can mix technologies, and that OpenAssembler is the best on Illumina paired and non-paired data.
    EULER was the worst (the very worst!) in my benchmarks. Velvet was very good, and Newbler was the best on 454 (only Newbler worked with 454 in my benchmarks)
    Because OpenAssembler auto-learns from the data instead of trying to figure out the statistics like in Velvet (they created VelvetOptimizer to alleviate that shortcoming!), we think the usability is better.

    In principle, if errors are incorporated randomly (Illumina and SOLiD) and the coverage is fairly uniform (any technology, I think), then we have strong theoretical support to say that the conservative approach of OpenAssembler rules out misassemblies (chimeric contigs), although we did observe some mismatches. This theoretical support is provided by a set of rules, heuristics, and some invariants.

    Tired of waiting for the reviewing process, I decided to start Ray and release its source code as soon as possible!

    Accuracy:

    OpenAssembler does not produce chimeric contigs, but produces some mismatches when errors are present in reads (28 mismatches for SRA001125!). On SRA001125, Ray produces no chimeric contigs, and a few mismatches.

    Time:

    For SRA001125, with all the reads from NCBI SRA, it takes about 30 minutes on 31 MPI processes. Human chromosome 1, with 50-nt reads at 50 X takes about 2 hours on 400 MPI processes (Itanium, Infiniband).

    Memory:

    Ray is gentle with memory. It uses splay trees (who uses them anyway??). In a splay tree, keys that are accessed often stay near the root, whereas keys that are accessed only a few times end up near the leaves.
    Ray distributes everything across MPI processes: reads, paired-end linkages, vertices, arcs, seeds, extensions, fusions, and finished fusions. To communicate, Ray uses about 90 message types, so Ray instances like to communicate!

    If you want to know about memory usage, check Vertex.h. The coverage is stored in a uint8_t, the edges are stored in a uint8_t, and there are some linked lists too.
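
    As a rough illustration of why one byte each can be enough: a k-mer has at most 4 possible successors and 4 possible predecessors, so 8 bits can flag all of them, and a coverage counter can simply stop at 255. The Python sketch below emulates that idea. The bit layout (low four bits for outgoing edges, high four bits for ingoing edges) and the saturation at 255 are assumptions made for this example, not a description of what Vertex.h actually does.

```python
BASES = "ACGT"


class Vertex:
    """Toy vertex emulating two uint8_t fields (layout is an assumption)."""
    __slots__ = ("coverage", "edges")

    def __init__(self):
        self.coverage = 0   # emulated uint8_t, saturates at 255
        self.edges = 0      # emulated uint8_t, one bit per possible edge

    def add_coverage(self):
        if self.coverage < 255:
            self.coverage += 1

    def add_outgoing(self, base):
        self.edges |= 1 << BASES.index(base)          # bits 0..3

    def add_ingoing(self, base):
        self.edges |= 1 << (4 + BASES.index(base))    # bits 4..7

    def outgoing(self):
        return [b for i, b in enumerate(BASES) if (self.edges >> i) & 1]

    def ingoing(self):
        return [b for i, b in enumerate(BASES) if (self.edges >> (4 + i)) & 1]


v = Vertex()
v.add_outgoing("C")
v.add_outgoing("T")
v.add_ingoing("A")
for _ in range(300):
    v.add_coverage()

print(v.outgoing(), v.ingoing(), v.coverage)   # ['C', 'T'] ['A'] 255
```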

    In Ray, there is no tip cutting and no bubble popping, which makes it a very different approach compared with Velvet/SOAPdenovo/ABySS/EULER.

    But remember, a genome assembler is like an interpreter (python/perl/ruby), and its execution depends on the program (the reads) you give it, so you can't really summarise things that much.

    Enjoy Ray!

    ***
    The Ray Project Team


  • bioinfosm
    replied
    Have you had a chance to compare it to the existing assemblers? both for accuracy and time, and also memory?


  • seb567
    replied
    Ray -- questions & answers!

    [Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

    That is right. Ray does not create paired-end reads from SFF files.

    [Q] Is "OpenAssembler" the same software as "Ray" ?

    No, but Ray is a parallel implementation of the OpenAssembler algorithm. The paper describing OpenAssembler is still under review (submitted on 15 October 2009...), and one of its weaknesses is that it is not parallel, thus not scalable. So I started coding Ray (on 2010-01-21), and I decided to put it on the web to get feedback.

    [Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

    When an error occurs, it should occur randomly. The 454 homopolymer errors are not random: they occur more often in homopolymer stretches. In the OpenAssembler paper (under review since 15 October 2009) we show, however, that Illumina's error incorporation is random, and that 454+Illumina also has random error incorporation. The take-home message is that randomly incorporated errors are easy to detect and fix, whereas reproducible errors are defective-by-design.

    Illumina errors are spread across the whole read, with more errors observed toward the end. 454 errors are mostly related to homopolymers; for instance, you will observe both ATCTAGCAAAAATACGCAT and ATCTAGCAAAAAATACGCAT with the same abundance (notice the length of the run of As).
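
    A small Python sketch makes the difference between the two sequences above explicit. It is only an illustration written for this thread, not something Ray does: once homopolymer runs are collapsed, the two reads are identical, and they differ only in the length of one run.

```python
from itertools import groupby


def run_lengths(seq):
    """Return (base, run length) pairs, e.g. 'AAT' -> [('A', 2), ('T', 1)]."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def collapse(seq):
    """Collapse every homopolymer run to a single base."""
    return "".join(base for base, _ in groupby(seq))


a = "ATCTAGCAAAAATACGCAT"
b = "ATCTAGCAAAAAATACGCAT"

print(collapse(a) == collapse(b))             # True
print(max(n for _, n in run_lengths(a)))      # 5
print(max(n for _, n in run_lengths(b)))      # 6
```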

    [Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

    You should know the true values before running Ray. For instance, the SRA001125 dataset contains paired-end Illumina reads for E. coli K12 MG1655. Usually, if you have paired-end data, you should know the geometry (fragmentLength+deviation) of your reads.

    An example:

    [boiseb01@ls30 SRA001125]$ echo "LoadPairedEndReads 200xSRR001665_1.fastq 200xSRR001665_2.fastq 215 20
    LoadPairedEndReads 200xSRR001666_1.fastq 200xSRR001666_2.fastq 215 20" > input
    [boiseb01@ls30 SRA001125]$ /home/boiseb01/software/ompi-1.4.1-gcc/bin/mpirun -np 31 /home/boiseb01/Ray/trunk/Ray ./input |tee Log
    [boiseb01@ls30 SRA001125]$ ls -l Contigs.fasta
    -rw-rw-r--. 1 boiseb01 boiseb01 4710363 2010-03-09 17:01 Contigs.fasta
    [boiseb01@ls30 SRA001125]$ grep '>' Contigs.fasta |wc -l
    224

    As such, we get 224 contigs (each >= 100 nt) for this small bug.

    If you provide paired-end reads, you need to provide accurate values for <fragmentLength> and <fragmentLengthStandardDeviation>.

    [Q] Does "Ray" use the quality values in the FASTQ file for anything?

    No, Ray auto-calibrates itself using abundance of k-mers.
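
    How Ray turns k-mer abundance into a calibration is not spelled out in this thread, so the Python sketch below only illustrates the underlying signal and is not Ray's procedure. With randomly placed errors, k-mers that contain an error show up rarely, while genuine k-mers show up at roughly the coverage. The reads, the value of k, and the function names are made up for the example, and reverse complements are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Five error-free copies of a toy read plus one copy with a single mismatch.
reads = ["ACGTTGCA"] * 5 + ["ACGTAGCA"]
counts = kmer_counts(reads, k=4)
hist = abundance_histogram(counts)

print(sorted(hist.items()))   # [(1, 4), (5, 4), (6, 1)]
```

    The four k-mers seen once all come from the read with the mismatch; the rest sit at or near the coverage of 5 to 6.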

    [Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

    Try it! I don't know.

    My benchmarks so far include:

    * SRA001125 paired (E. coli K12 MG1655, Illumina data)
    * S. pneumoniae R6 50-nt reads, 50 X
    * S. pneumoniae R6 50-nt reads, 50 X, 1% random mismatches
    * E. coli K12 MG1655, 400-nt reads, 50 X
    * Human chromosome 1, 50-nt reads
    * Pseudomonas aeruginosa, 50-nt reads, 50 X

    [Q] Does it support colorspace?

    Currently, only fasta, fastq, and SFF.

    As I understand, there is a bijection between strings from {A,T,C,G} and reads from {0,1,2,3}, with each color corresponding to a nucleotide given the previous one. I have not looked into that yet, but I don't think the algorithm is going to change a lot with that taken into consideration.

    My guess is that one could do the assembly in color-space (the alphabet size is 4 too), and then convert the color-space contigs to nucleotide-space.
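
    For reference, here is a minimal Python sketch of that mapping, using the usual two-base encoding in which a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). Ray does not support this; the function names are made up for the example. Note that decoding needs the first base, and that a single wrong color changes every base after it.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Return (first base, color string) for a nucleotide string."""
    colors = [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]
    return seq[0], "".join(str(c) for c in colors)


def decode(first_base, colors):
    """Rebuild the nucleotide string from the first base and the colors."""
    bases = [first_base]
    for c in colors:
        bases.append(BASE[CODE[bases[-1]] ^ int(c)])
    return "".join(bases)


first, colors = encode("ATCTAGC")
print(first, colors)             # A 322323
print(decode(first, colors))     # ATCTAGC

# One wrong color (second position) alters all downstream bases.
print(decode("A", "312323"))     # ATGATCG
```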



    I hope it helps!


    ***
    The Ray Project Team


  • KevinLam
    replied
    Does it support colorspace?


  • Torst
    replied
    Originally posted by seb567
    The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.
    The README.txt is confusing in some of the sections. I hope you can help me clarify them.

    [Q] It says "if your sff file contains paired-end reads, you must first extract the information, and tell Ray to use them with LoadPairedEndReads". Do you mean we should extract as FASTA with sffinfo, remove the linker, and create a .fasta/.fastq file?

    [Q] Is "OpenAssembler" the same software as "Ray" ?

    [Q] "OpenAssembler assembles Illumina reads or 454 + Illumina reads, or any combination without non-random error incorporation.". Can you explain what you mean by "random error incorporation" ?

    [Q] How critical are the values of "<fragmentLength>" and "<fragmentLengthStandardDeviation>" to the assembly? Are they just starting points for estimating the true value?

    [Q] Does "Ray" use the quality values in the FASTQ file for anything?

    [Q] What does "Ray" do if I provide it with really long sequences, such as contigs from another assembly?

    Thank you for your time,

    Torsten
    Last edited by Torst; 03-09-2010, 10:46 PM. Reason: Added two more questions.


  • seb567
    started a topic Ray: a NEW MPI-based 100% parallel genome assembler

    Ray: a NEW MPI-based 100% parallel genome assembler

    Dear SeqAnswers:

    The Ray Project Team gives you a 100% parallel MPI-based assembler called Ray. Ray is NOW available at http://sourceforge.net/projects/denovoassembler/files/. It supports Illumina paired-end reads. It is 100% parallel, and it is a single executable (no pesky perl scripts!). The source code is licensed with the GPL-v3.

    Try it, and give us your comments, bugs, suggestions, and concerns on our mailing list.


    Ray-0.0.3: a NEW MPI-based parallel genome assembler


    ***
    The Ray Project Team
