
  • #16
    @KevinLam

    Indeed, I started the development for color space using these datasets:




    However, these data contain too many errors (in color space) to be assembled de novo (in color space), in my opinion. My estimation is that the error rate in color space ranges from 8% to 12% for these two datasets. That would explain the total lack of de novo assemblies performed so far with SOLiD technology.

    So, you are free to try Ray with csfasta files, but it is not 100% tested yet.

    Perhaps the last version of the SOLiD sequencer produces more reliable readouts, but that I don't know. And I am sure someone else is more aware of that than me on SeqAnswers.com.

    Thank you, happy assembly!

    ***
    The Ray Project Team



    • #17
      Dear Ray enthusiasts:


      Ray 0.0.5 is now available with these new features:

      * Ray now outputs assemblies in AMOS format (with -a),
      * Ray commands can be provided with a commands file (like in 0.0.3 and 0.0.4) as well as with command-line arguments, and
      * Ray removes non-A-T-C-G letters at both ends of reads.

      About Ray:

      Ray is software that performs parallel de novo genome assemblies of next-gen sequencing data using the Message Passing Interface (MPI). It uses an assembly engine called Parallel_Ray_Engine.

      Download Ray 0.0.5: https://sourceforge.net/projects/den...r.bz2/download

      Mailing list: https://lists.sourceforge.net/lists/...ssembler-users

      Statistics:

      Ray 0.0.3 downloads since 2010-03-09: 63
      Ray 0.0.4 downloads since 2010-03-22: 23
      SeqAnswers Thread Views since 2010-03-09: 767

      Test results (2010-03-28-3159-1): https://sourceforge.net/mailarchive/...ssembler-users



      • #18
        Colour space alignment

        Hi Kevin,
        I had a quick look at your code for colour space, and I think you need to skip the first colour as well as the leading primer base on each read, because the first colour is produced by the primer base plus the first base of the fragment. If you leave the first colour in, it will add an extra error to 3/4 of the reads.

        ColourSpaceLoader.cpp:63 t->copy(NULL,bufferForLine+2,readMyAllocator);// remove the leading T & first colour
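
To illustrate the point above, here is a minimal Python sketch (not Ray's code; `encode_colours` and `strip_primer_and_first_colour` are hypothetical helper names). It assumes the standard SOLiD two-base encoding, in which each colour is the XOR of the 2-bit codes A=0, C=1, G=2, T=3 of two adjacent bases, and that reads contain only the colours 0-3. It shows that dropping the primer base and the first colour leaves exactly the colours of the fragment itself.

```python
# 2-bit codes under which a SOLiD colour is the XOR of two adjacent bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colours(bases):
    """Return the colour string for the transitions between adjacent bases."""
    return "".join(
        str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(bases, bases[1:])
    )


def strip_primer_and_first_colour(read):
    """Drop the primer base and the primer-to-fragment colour (hypothetical helper)."""
    return read[2:]


read = "T0010100"      # primer base T followed by seven colours
fragment = "TTGGTTT"   # bases of the sequenced fragment for this read
trimmed = strip_primer_and_first_colour(read)
print(trimmed, encode_colours(fragment))
```

The first colour describes the primer-to-fragment transition, so it has no counterpart among the transitions inside the fragment.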

        Colin



        • #19
          Dear sparks,

          You are right. I changed +1 to +2 to skip the first color too.

          p.s.: I (Sébastien Boisvert) developed Ray.



          • #20
            Colour Space

            Hi Sebastien,
            My apologies for the name mix-up. We have two lanes of 50 bp PE reads from a bacterium to assemble in the next few weeks, so we'll give Ray a try. I'm thinking assembly in colour space isn't much different from assembly in nucleotide space, but after CS assembly we need to convert back to nucleotides. This could mean remembering the first colour of every read and its position in the contigs, as the first colour and primer base give a reference for conversion. Are you doing this?
            Thanks for giving us Ray. We'll let you know how it goes.
            Colin



            • #21
              Hi sparks,

              I am glad that Ray sparks interest.

              Ray is not ready for color space yet. Ray loads color-space reads, builds a distributed de Bruijn graph in color space, and computes paths in that graph. The algorithm is pretty much the same, except that in color space, the reverse-complement is simply the reverse (AA and TT have the same color). But I have not implemented the conversion back to nucleotides yet because I have not figured out which starting base to use for decoding color-encoded paths.
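
The reverse-complement property mentioned above can be checked with a short Python sketch. This is an illustration only, not Ray's implementation; it assumes the standard SOLiD two-base encoding (colour = XOR of the codes A=0, C=1, G=2, T=3), and the helper names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_colours(bases):
    """Return the colour string for the transitions between adjacent bases."""
    return "".join(
        str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(bases, bases[1:])
    )


def reverse_complement(bases):
    """Return the reverse complement of a nucleotide sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(bases))


sequence = "ATGGCAC"
forward = encode_colours(sequence)
backward = encode_colours(reverse_complement(sequence))
print(forward, backward)
```

Complementing a base flips both bits of its code, so the XOR of two adjacent bases is unchanged; only the order of the colours is reversed.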

              In particular, these questions remain unanswered regarding color space:

              Q1) If all color-space reads use T, does that mean the decoding is done with T?

              Q2) If some (color-space) reads start with T, while others use A, how do I sort things out?

              Q3) What is the mismatch error rate of the various versions of the SOLiD instrument?

              Thanks!

              ***
              Sébastien Boisvert
              The Ray Project Team



              • #22
                Hi Sébastien,
                For Q1 & Q2, the primer base and first colour define the first base of a read, so you need to keep this for every read along with where the read started in the contig. With some luck the contigs will be consistent with the first bases, so if you start the conversion at one read start then all the rest will match. I expect this might not work in practice (an error in the first colour), so maybe try a sliding window that selects the conversion that matches the most first bases.
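
One way to sketch this idea in Python is a majority vote over the whole contig, which is a simplification of the sliding window suggested above and not something Ray is known to do. It assumes the standard SOLiD encoding (colour = XOR of the codes A=0, C=1, G=2, T=3), an error-free contig, and a hypothetical input format of (position, first base) pairs for the reads.

```python
from collections import Counter

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def decode_contig(first_base, colours):
    """Decode contig colours into bases, starting from the contig's first base."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for colour in colours:
        code ^= int(colour)
        bases.append(CODE_BASE[code])
    return "".join(bases)


def choose_first_base(contig_colours, read_starts):
    """Pick the contig's first base by majority vote over the reads.

    read_starts holds (position, first_base) pairs: where a read begins in
    the contig and the first base implied by its primer base and first colour.
    """
    xor_before = [0]  # XOR of all colours preceding each base position
    for colour in contig_colours:
        xor_before.append(xor_before[-1] ^ int(colour))
    votes = Counter()
    for position, first_base in read_starts:
        implied = BASE_CODE[first_base] ^ xor_before[position]
        votes[CODE_BASE[implied]] += 1
    return votes.most_common(1)[0][0]


contig_colours = "1310131"
# The read at position 5 carries a wrong first base (error in its first colour).
read_starts = [(0, "A"), (2, "G"), (4, "T"), (5, "C")]
best = choose_first_base(contig_colours, read_starts)
print(best, decode_contig(best, contig_colours))
```

Each read votes for the contig start base that would make its own first base come out right, so a minority of reads with a wrong first colour is outvoted.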
                I haven't any experience of error rate yet.

                Colin



                • #23
                  You can also normalize the color reads to have the same starting adapter (say A). You convert the adapter and first color appropriately. You will then only need to store the first color.

                  Code:
                  original: T0010100
                  base: TTGGTTT
                  normalized: A3010100
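
The example above can be reproduced with a minimal Python sketch. It assumes the standard SOLiD two-base encoding (colour = XOR of the codes A=0, C=1, G=2, T=3) and reads containing only the colours 0-3; the function names are hypothetical and not taken from any existing tool.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def decode_read(read):
    """Decode a colour-space read (adapter base + colours) into fragment bases."""
    code = BASE_CODE[read[0]]
    bases = []
    for colour in read[1:]:
        code ^= int(colour)
        bases.append(CODE_BASE[code])
    return "".join(bases)


def normalize_read(read, adapter="A"):
    """Re-express a read so that it starts with the given adapter base."""
    first_base_code = BASE_CODE[read[0]] ^ int(read[1])
    new_first_colour = BASE_CODE[adapter] ^ first_base_code
    return adapter + str(new_first_colour) + read[2:]


original = "T0010100"
normalized = normalize_read(original)
print(decode_read(original), normalized, decode_read(normalized))
# prints: TTGGTTT A3010100 TTGGTTT
```

Only the adapter base and the first colour change; every other colour describes a transition inside the fragment and is left untouched.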



                  • #24
                    Nils Homer: Correct me if I am wrong, but decoding a color-space read into a nucleotide representation will corrupt the bases that follow if at least one color is erroneous.

                    Edit: as Sparks suggested, one can simply discard the starting base and the first color (in your example, T0010100 becomes 010100). But then, which base (A, T, C or G) should be used for decoding the paths produced by Ray's algorithm? Thanks a lot for your expertise with the SOLiD sequencing technology!
                    Last edited by seb567; 04-01-2010, 12:51 PM. Reason: added a point (indicated by 'Edit:')



                    • #25
                      Originally posted by seb567 View Post
                      Nils Homer: Correct me if I am wrong, but decoding a color-space read into a nucleotide representation will corrupt the bases that follow if at least one color is erroneous.
                      The normalization procedure above produces the read back in color space, so proper base-space decoding can happen later. Its real benefit is that all the reads end up with the same starting adapter. If you are worried about storing the first base and color for each color read, you can now normalize the color-space read and then only have to store the first color. Both the original and the normalized color-space read produce the same base sequence, and are therefore equivalent encodings.

                      You are right that in the final alignment or assembly, naively decoding the color-space read without identifying the sequencing errors will cause incorrect bases after the sequencing error. However, most color-space aligners do identify the sequencing errors as part of the alignment and in the final result, and in this case your assembler should too.
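
The effect of naive decoding can be seen in a small Python sketch. It assumes the standard SOLiD two-base encoding (colour = XOR of the codes A=0, C=1, G=2, T=3); `decode_read` is a hypothetical helper, and the misread colour is invented for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def decode_read(read):
    """Naively decode a colour-space read (adapter base + colours) into bases."""
    code = BASE_CODE[read[0]]
    bases = []
    for colour in read[1:]:
        code ^= int(colour)
        bases.append(CODE_BASE[code])
    return "".join(bases)


clean = "T0010100"
noisy = "T0030100"  # third colour misread as 3 instead of 1
clean_bases = decode_read(clean)
noisy_bases = decode_read(noisy)
mismatches = [i for i, (a, b) in enumerate(zip(clean_bases, noisy_bases)) if a != b]
print(clean_bases, noisy_bases, mismatches)
```

A single wrong colour shifts every base decoded after it, which is why the error has to be identified in color space before converting to bases.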



                      • #26
                        It's equivalent

                        primer base + colour = 1st base = "A" + normalised colour --- requires 2 bits storage per read



                        • #27
                          Ray 0.0.7 compares very favorably with available short-read paired assemblers

                          Dear appreciated SEQanswers community:

                          Parallel software for parallel sequencing technologies

                          Ray 0.0.7 -- software that performs parallel de novo genome assemblies of next-gen sequencing data using the Message Passing Interface (MPI) -- is now available for download.

                          Download Ray 0.0.7: http://sourceforge.net/projects/deno...r.bz2/download
                          Wiki page: http://sourceforge.net/apps/mediawik...itle=Main_Page
                          Do-it-yourself examples: http://sourceforge.net/apps/mediawik...rself_examples
                          Review changes: http://sourceforge.net/apps/mediawik...?title=Changes
                          Mailing list: http://lists.sourceforge.net/lists/l...ssembler-users

                          Fewer contigs with Roche/454 and Illumina reads

                          We are delighted to report to SEQanswers that Ray 0.0.7 with Roche/454 and Illumina reads systematically outperforms Newbler on Roche/454 reads on three public datasets. Specifically, Ray computes fewer contigs with fewer errors while covering most of the coverable genome.

                          Review numbers: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

                          de novo assembly with Illumina -- because outstanding quality and practical cost matter

                          Ray 0.0.7 also crushes the competition on Illumina unpaired and paired public datasets. Ray also outperforms on simulated data -- but these are not very useful outside assembler development.

                          Review comparisons: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

                          Scientific paper on its way

                          For those (numerous?) people looking for a Ray paper: I am working on my revised manuscript.

                          Conflicts of interest

                          None

                          Acknowledgments

                          This project is funded by the Canadian Institutes of Health Research (Institute of Genetics).

                          More information: http://sourceforge.net/apps/mediawik...cknowledgments



                          Thank you,

                          make this day an open assembly day!

                          -seb

                          ---
                          Mr. Sébastien Boisvert
                          on behalf of the Ray Project Team



                          • #28
                            Ray and genome size

                            Hi Seb
                            Your assembler seems really promising. I was wondering whether it can also work with plant and animal genomes, which are very long (gigabases) and contain very long repeats.

                            One of the strengths of SOAPdenovo and ABySS is their ability to assemble really complex genomes like the human one. If I'm not wrong, your benchmarks were made "only" on small genomes.

                            Thanks
                            Francesco



                            • #29
                              Larger genomes -- not yet but coming soon!



                              Dear Mr. Francesco Vezzi, and SEQanswers great community,

                              First, you are right to say that Ray is currently benchmarked openly and only on small genomes.

                              In my roadmap, I am waiting for a paper to get published to continue my effort on larger genomes (the publish or perish thing).
                              I will hopefully send my revised manuscript in the next few days, once I get OKs from my co-authors.

                              Next thing (after the paper thing) is to help decode larger genomes --
                              but it's hard to find the reads that go with a larger genome (and the reference).

                              You can't do much with just raw reads from an otherwise un-sequenced/assembled entity.
                              N50 is cool, but it is not a critical assessment metric, it is just a number everyone
                              blindly maximises.

                              The community reported that our benchmarks are only on small genomes
                              ( http://seqanswers.com/forums/showthr...8643#post18643 ).
                              We are currently working on the matter (larger genomes).
                              Ray can handle them if hardware requirements are met (InfiniBand,
                              memory, and processors), but it is not extensively tested on them and will probably need some adjustments.

                              Most assemblers (Velvet, EULER-SR, amongst others) sacrifice sequence quality for N50, at least that is what I understand from my open benchmarks.

                              In the early days of short-read assemblers, greedy approaches were at the crux of their
                              behaviour -- greed is locally good, but can be globally bad (SSAKE, VCAKE, and SHARCGS). They were evaluated with little more than the N50 measurement.




                              If you ask "What's N50 anyway?":

                              "The N50 size is computed by sorting all contigs from largest to smallest and by
                              determining the minimum set of contigs whose sizes total 50% of the entire genome.
                              The N50 size is the [one of the] smallest contig in that set."

                              Source: Bioinformatics 2005 http://dx.doi.org/doi:10.1093/bioinformatics/bti769
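
The quoted definition can be written as a short Python sketch. The function name `n50` and the optional `genome_size` parameter are assumptions made for illustration; when no genome size is given, the sketch falls back to the total length of the contigs, which is how N50 is commonly reported.

```python
def n50(contig_lengths, genome_size=None):
    """Return the smallest contig in the minimal set covering half the total.

    If genome_size is None, the total is the sum of the contig lengths.
    Returns None when the contigs cover less than half of genome_size.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    # Walk the contigs from largest to smallest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None


lengths = [30, 80, 20, 70, 40, 50]
print(n50(lengths), n50(lengths, genome_size=400))
# prints: 70 50
```

The metric depends only on contig lengths, so it says nothing about whether the contigs are correct.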

                              You might want to read the (very short) paper above to get acquainted with misassemblies.







                              Not to get off-topic, but the greed thing is general.

                              Greed is locally good but globally [VERY] bad -- here are three examples with references:


                              (1)

                              Research funding is good for academic careers, think-tanks, (locally good) but apparently not good enough for healthcare patients (globally bad).

                              Too fundamental, not enough translational, they say.

                              ==> http://www.nature.com/news/2010/1005....2010.243.html
                              ==> http://www.newsweek.com/id/238078


                              (2)

                              Finance powerhouses make money (greed is locally good for them: they can buy food, cars, houses, and lobbies), but wreck the world economy (globally VERY bad).

                              ==> http://news.bbc.co.uk/2/hi/business/8625931.stm
                              ==> http://money.cnn.com/2010/04/16/news...ldman.fortune/


                              (3)

                              Drilling for oil is financially sustainable (locally good for energy and economy), but [VERY] bad for almost everything else when disasters show up.

                              ==> http://www.cbc.ca/world/story/2010/0...oil-spill.html
                              ==> http://www.reuters.com/article/idUSTRE64D69K20100514





                              So, as the title says: larger genomes -- not yet but coming [VERY] soon!


                              Thanks and cheers!

                              ************
                              Mr. Sébastien M. Boisvert, first-year PhD student, http://boisvert.info/
                              The Ray Project Team, http://denovoassembler.sf.net/



                              • #30
                                Quick questions: Does Ray support Illumina 1.6+ FASTQ sequences (the ones with trailing B's: http://seqanswers.com/forums/showthr...ght=fastq+wiki)? Does Ray have the capability to trim low-quality bases, or should I pre-process my reads beforehand? Should I convert my libraries to Phred/Sanger scores? And last but not least, can I run Bambus with Ray's output? Sorry for so many questions, and thank you for any information. BRGDS
