  • seb567
    Senior Member
    • Jul 2008
    • 260

    #16
    @KevinLam

    Indeed, I started the development for color space using these datasets:




    However, these data contain too many errors (in color space) to be assembled de novo (in color space), in my opinion. My estimation is that the error rate in color space ranges from 8% to 12% for these two datasets. That would explain the total lack of de novo assemblies performed so far with SOLiD technology.

    So, you are free to try Ray with csfasta files, but it is not 100% tested yet.

    Perhaps the latest version of the SOLiD sequencer produces more reliable reads, but I don't know. I am sure someone else on SeqAnswers.com is better informed about that than I am.

    Thank you, happy assembly!

    ***
    The Ray Project Team


    • seb567
      Senior Member
      • Jul 2008
      • 260

      #17
      Dear Ray enthusiasts:


      Ray 0.0.5 is now available with these new features:

      * Ray now outputs assemblies in AMOS format (with -a),
      * Ray commands can be provided with a commands file (like in 0.0.3 and 0.0.4) as well as with command-line arguments, and
      * Ray removes non-A-T-C-G letters at both ends of reads.
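The end-trimming feature in the list above can be pictured with a minimal sketch. This is not Ray's actual code: `trimRead` and `isNucleotide` are hypothetical names, and the sketch assumes upper-case reads.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from Ray): true for the four unambiguous nucleotides.
static bool isNucleotide(char c) {
    return c == 'A' || c == 'T' || c == 'C' || c == 'G';
}

// Drop every leading and trailing character that is not A, T, C or G.
// Characters inside the read are left untouched.
std::string trimRead(const std::string& read) {
    std::size_t begin = 0;
    while (begin < read.size() && !isNucleotide(read[begin])) ++begin;
    std::size_t end = read.size();
    while (end > begin && !isNucleotide(read[end - 1])) --end;
    return read.substr(begin, end - begin);
}
```

With this sketch, `trimRead("NNACGTNAN")` gives `ACGTNA`: only the ends are cleaned, and a read made entirely of N's becomes empty.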

      About Ray:

      Ray is computer-controlled software that performs parallel de novo genome assemblies of next-gen sequencing data using the message passing interface (MPI). It uses an assembly engine called Parallel_Ray_Engine.

      Download Ray 0.0.5: https://sourceforge.net/projects/den...r.bz2/download

      Mailing list: https://lists.sourceforge.net/lists/...ssembler-users

      Statistics:

      Ray 0.0.3 downloads since 2010-03-09: 63
      Ray 0.0.4 downloads since 2010-03-22: 23
      SeqAnswers Thread Views since 2010-03-09: 767

      Test results (2010-03-28-3159-1): https://sourceforge.net/mailarchive/...ssembler-users


      • sparks
        Senior Member
        • Mar 2008
        • 126

        #18
        Colour space Alignment

        Hi Kevin,
        I had a quick look at your code for colour space, and I think you need to skip the first colour as well as the leading primer base on each read, because the first colour is produced by the primer base plus the first base of the fragment. If you leave the first colour in, it will add an extra error to 3/4 of the reads.

        ColourSpaceLoader.cpp:63 t->copy(NULL,bufferForLine+2,readMyAllocator);// remove the leading T & first colour
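As a stand-alone illustration of the same idea (a sketch only, not Ray's loader; `fragmentColours` is a hypothetical name and the input is assumed to be one csfasta sequence line):

```cpp
#include <string>

// Illustrative helper (not Ray's actual loader): given one csfasta sequence
// line such as "T0010100", return only the colours that depend on the fragment.
// Position 0 is the primer base and position 1 is the colour of the
// primer-base/first-base transition, so both are skipped.
std::string fragmentColours(const std::string& csfastaLine) {
    if (csfastaLine.size() <= 2) return std::string();
    return csfastaLine.substr(2);
}
```

For the line `T0010100` this keeps `010100`.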

        Colin


        • seb567
          Senior Member
          • Jul 2008
          • 260

          #19
          Dear sparks,

          You are right. I changed +1 to +2 to skip the first color too.

          p.s.: I (Sébastien Boisvert) developed Ray.


          • sparks
            Senior Member
            • Mar 2008
            • 126

            #20
            Colour Space

            Hi Sebastien,
            My apologies for the name mix-up. We have two lanes of 50bp PE from a bacterium to assemble in the next few weeks, so we'll give Ray a try. I'm thinking assembly in colour space isn't much different from assembly in nucleotide space, but after CS assembly we need to convert back to nucleotides. This could mean remembering the first colour of all the reads and their positions in the contigs, as the first colour and primer base give a reference for conversion. Are you doing this?
            Thanks for giving us Ray. We'll let you know how it goes.
            Colin


            • seb567
              Senior Member
              • Jul 2008
              • 260

              #21
              Hi sparks,

              I am glad that Ray sparks interest.

              Ray is not ready yet for color space. Ray loads color-space reads, builds a distributed de Bruijn graph in color space, and computes paths in that graph. The algorithm is pretty much the same, except that in color space the reverse-complement is simply the reverse (AA and TT have the same color). But I have not implemented the conversion back to nucleotides yet, because I have not figured out which starting base to use for decoding color-encoded paths.
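The reverse-complement property can be checked with a small sketch. It assumes the usual 2-bit coding (A=0, C=1, G=2, T=3), under which a colour is the XOR of two adjacent base codes. The function names are illustrative and are not taken from Ray.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map a base to its 2-bit code: A=0, C=1, G=2, T=3.
int baseCode(char base) {
    const std::string alphabet = "ACGT";
    const std::size_t code = alphabet.find(base);
    assert(code != std::string::npos);  // this sketch handles A, C, G, T only
    return static_cast<int>(code);
}

// Encode a base sequence as colours: one colour per pair of adjacent bases.
std::string toColours(const std::string& bases) {
    std::string colours;
    for (std::size_t i = 1; i < bases.size(); ++i)
        colours += static_cast<char>('0' + (baseCode(bases[i - 1]) ^ baseCode(bases[i])));
    return colours;
}

// Reverse the sequence, then complement each base (A<->T, C<->G).
std::string reverseComplement(const std::string& bases) {
    std::string rc(bases.rbegin(), bases.rend());
    for (char& c : rc) c = "TGCA"[baseCode(c)];
    return rc;
}
```

Complementing flips both bits of every base code, so the XOR of each adjacent pair is unchanged and only the order of the colours is reversed. For example, `ACGGTA` encodes to `13013` and its reverse complement `TACCGT` encodes to `31031`.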

              In particular, these questions remain unanswered regarding color space:

              Q1) If all color-space reads use T, does that mean the decoding is done with T?

              Q2) If some (color-space) reads start with T, while others use A, how do I sort things out?

              Q3) What is the mismatch error rate of the various versions of the SOLiD instrument?

              Thanks!

              ***
              Sébastien Boisvert
              The Ray Project Team


              • sparks
                Senior Member
                • Mar 2008
                • 126

                #22
                Hi Sébastien,
                For Q1 and Q2: the primer base and first colour define the first base of the read, so you need to keep this for every read along with where the read started in the contig. With some luck the contigs would be consistent with the first bases, so if you start the colour conversion at one read start then all the rest will match. I expect this might not work in practice (an error in the first colour), so maybe try a sliding window that selects the conversion that matches the most first bases.
                I haven't any experience of error rate yet.
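One way to picture the suggestion, simplified to a single vote over the whole contig rather than a sliding window: every read start implies one starting base for the contig, and the most frequent one wins. This is a sketch under stated assumptions, not an existing Ray feature. `ReadAnchor` and `chooseStartBase` are hypothetical names, and colours are assumed to follow the usual XOR coding with A=0, C=1, G=2, T=3.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical record: where a read starts in the decoded contig, and the
// first base implied by that read's primer base plus first colour.
struct ReadAnchor {
    std::size_t position;
    char firstBase;
};

// Return the contig starting base supported by the most read anchors.
// colours[i] is the colour between contig bases i and i+1.
char chooseStartBase(const std::string& colours, const std::vector<ReadAnchor>& anchors) {
    const std::string bases = "ACGT";
    // prefix[i] = XOR of colours[0..i), i.e. the offset between base 0 and base i.
    std::vector<std::size_t> prefix(colours.size() + 1, 0);
    for (std::size_t i = 0; i < colours.size(); ++i) {
        assert(colours[i] >= '0' && colours[i] <= '3');
        prefix[i + 1] = prefix[i] ^ static_cast<std::size_t>(colours[i] - '0');
    }
    std::array<std::size_t, 4> votes{};  // zero-initialised tally for A, C, G, T
    for (const ReadAnchor& anchor : anchors) {
        const std::size_t code = bases.find(anchor.firstBase);
        if (anchor.position >= prefix.size() || code == std::string::npos) continue;
        ++votes[code ^ prefix[anchor.position]];  // starting base this anchor implies
    }
    // Ties resolve to the first base in "ACGT" order.
    const auto best = std::max_element(votes.begin(), votes.end());
    return bases[static_cast<std::size_t>(std::distance(votes.begin(), best))];
}
```

With the colour path `0010100` and anchors (0, T), (3, G) and an erroneous (5, A), two anchors imply T and one implies A, so T is chosen. A windowed variant would apply the same vote to sub-ranges of the contig.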

                Colin


                • nilshomer
                  Nils Homer
                  • Nov 2008
                  • 1283

                  #23
                  You can also normalize the color reads to have the same starting adapter (say A). You convert the adapter and first color appropriately. You will then only need to store the first color.

                  Code:
                  original: T0010100
                  base: TTGGTTT
                  normalized: A3010100
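The normalization in the example above can be sketched as follows, assuming the usual 2-bit coding (A=0, C=1, G=2, T=3) in which a colour is the XOR of two adjacent base codes. `normalizeAdapter` is an illustrative name, not a function from any existing tool.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Re-express a colour-space read so that it starts with `newAdapter`.
// Only the adapter and the first colour change; the decoded bases stay the same.
std::string normalizeAdapter(const std::string& read, char newAdapter) {
    const std::string bases = "ACGT";
    assert(read.size() >= 2);
    const std::size_t oldCode = bases.find(read[0]);
    const std::size_t newCode = bases.find(newAdapter);
    assert(oldCode != std::string::npos && newCode != std::string::npos);
    assert(read[1] >= '0' && read[1] <= '3');
    // First real base of the fragment, as defined by old adapter + first colour.
    const std::size_t firstBase = oldCode ^ static_cast<std::size_t>(read[1] - '0');
    std::string normalized = read;
    normalized[0] = newAdapter;
    normalized[1] = static_cast<char>('0' + (newCode ^ firstBase));
    return normalized;
}
```

Calling `normalizeAdapter("T0010100", 'A')` gives `A3010100`, which matches the example, and normalizing back to `T` restores the original read.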


                  • seb567
                    Senior Member
                    • Jul 2008
                    • 260

                    #24
                    Nils Homer: Correct me if I am wrong, but decoding the color-space read in a nucleotide representation will impede the meaning of the bits if at least one color is erroneous.

                    Edit: as Sparks suggested, one can simply discard the starting base and the first color (in your example T0010100 becomes 010100). But then, which base (A, T, C or G) should be used for decoding paths produced by Ray's algorithm? Thanks a lot for your expertise with the SOLiD sequencing technology!
                    Last edited by seb567; 04-01-2010, 12:51 PM. Reason: added a point (indicated by 'Edit:')


                    • nilshomer
                      Nils Homer
                      • Nov 2008
                      • 1283

                      #25
                      Originally posted by seb567
                      Nils Homer: Correct me if I am wrong, but decoding the color-space read in a nucleotide representation will impede the meaning of the bits if at least one color is erroneous.
                      The normalization procedure above produces the read back in color space, so proper base space decoding can happen later. But what it really does that is useful is to make all the reads have the same starting adapter. If you are worried about storing the first base and color for each color read, now you can normalize the color space read and then only have to store the first color. Both original and normalized color space read produce the same base sequence, and therefore are equivalent encodings.

                      You are right that in the final alignment or assembly, naively decoding the color space read without identifying the sequencing errors will cause incorrect bases after the sequencing error. However, most color space aligners do, and in this case your assembler should, identify the sequencing errors as part of the alignment and in the final result.
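The effect of naive decoding can be seen in a short sketch, again assuming the 2-bit coding A=0, C=1, G=2, T=3 with colours as XORs of adjacent base codes. `decodeColours` and `mismatches` are illustrative names. The output includes the starting base itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode colours into bases, starting from a known base.
std::string decodeColours(char startBase, const std::string& colours) {
    const std::string bases = "ACGT";
    std::size_t code = bases.find(startBase);
    assert(code != std::string::npos);
    std::string decoded(1, startBase);
    for (char colour : colours) {
        assert(colour >= '0' && colour <= '3');
        code ^= static_cast<std::size_t>(colour - '0');
        decoded += bases[code];
    }
    return decoded;
}

// Count positions at which two equally long sequences differ.
std::size_t mismatches(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++count;
    return count;
}
```

Decoding `0010100` from T gives `TTTGGTTT`. If the third colour is misread as 0, decoding `0000100` gives `TTTTTGGG`: the first three bases agree and all five bases after the error are wrong, which is why the error has to be identified in colour space before converting.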


                      • sparks
                        Senior Member
                        • Mar 2008
                        • 126

                        #26
                        It's equivalent

                        primer base + colour = 1st base = "A" + normalised colour --- requires 2 bits storage per read


                        • seb567
                          Senior Member
                          • Jul 2008
                          • 260

                          #27
                          Ray 0.0.7 compares very favorably with available short-read paired assemblers

                          Dear appreciated SEQanswers community:

                          Parallel software for parallel sequencing technologies

                          Ray 0.0.7 -- computer-controlled software that performs parallel de novo genome assemblies of next-gen sequencing data using the message passing interface (MPI) -- is now available for download.

                          Download Ray 0.0.7: http://sourceforge.net/projects/deno...r.bz2/download
                          Wiki page: http://sourceforge.net/apps/mediawik...itle=Main_Page
                          Do-it-yourself examples: http://sourceforge.net/apps/mediawik...rself_examples
                          Review changes: http://sourceforge.net/apps/mediawik...?title=Changes
                          Mailing list: http://lists.sourceforge.net/lists/l...ssembler-users

                          Fewer contigs with Roche/454 and Illumina reads

                          We are delighted to report to SEQanswers that Ray 0.0.7 with Roche/454 and Illumina reads systematically outperforms Newbler on Roche/454 reads on three public datasets. Specifically, Ray computes fewer contigs with fewer errors while covering most of the coverable genome.

                          Review numbers: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

                          de novo assembly with Illumina -- because outstanding quality and practical cost matter

                          Ray 0.0.7 also crushes the competition on Illumina unpaired and paired public datasets. Ray also outperforms on simulated data -- but these are not very useful outside assembler development.

                          Review comparisons: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

                          Scientific paper on its way

                          For those (numerous?) people looking for a Ray paper: I am working on my revised manuscript.

                          Conflicts of interest

                          None

                          Acknowledgments

                          This project is funded by the Canadian Institutes of Health Research (Institute of Genetics).

                          More information: http://sourceforge.net/apps/mediawik...cknowledgments



                          Thank you,

                          make this day an open assembly day!

                          -seb

                          ---
                          Mr. Sébastien Boisvert
                          on behalf of the Ray Project Team


                          • francesco.vezzi
                            Member
                            • Jan 2009
                            • 50

                            #28
                            Ray and genome size

                            Hi Seb
                            Your assembler seems really promising. I was wondering whether it is also able to work with plant and animal genomes, which have the problem of being really long (gigabases) and of having really long repeats.

                            One of the strengths of SOAPdenovo and ABySS is their ability to assemble really complex genomes like the human one. If I'm not wrong, your benchmarks were run "only" on small genomes.

                            Thanks
                            Francesco


                            • seb567
                              Senior Member
                              • Jul 2008
                              • 260

                              #29
                              Larger genomes -- not yet but coming soon!



                              Dear Mr. Francesco Vezzi, and SEQanswers great community,

                              First, you are right to say that Ray is currently benchmarked openly and only on small genomes.

                              In my roadmap, I am waiting for a paper to get published to continue my effort on larger genomes (the publish or perish thing).
                              I will send my revised form hopefully in the next days when I get OKs from co-authors.

                              The next thing (after the paper thing) is to help decode larger genomes --
                              but it's hard to find the reads that go with a larger genome (and the reference).

                              You can't do much with just raw reads from an otherwise un-sequenced/assembled entity.
                              N50 is cool, but it is not a critical assessment metric, it is just a number everyone
                              blindly maximises.

                              The community reported that our benchmarks are only on small genomes
                              ( http://seqanswers.com/forums/showthr...8643#post18643 ).
                              We are currently working on the matter (larger genomes).
                              Ray can handle them if hardware requirements are met (InfiniBand,
                              memory, and processors), but it is not extensively tested and they probably need accommodation.

                              Most assemblers (Velvet, EULER-SR, amongst others) sacrifice sequence quality for N50; at least that is what I understand from my open benchmarks.

                              In the early ages and stages of short-reads assemblers, greedy approaches were at the crux of their
                              behaviours -- greed is locally good, but can be globally bad (SSAKE, VCAKE, and SHARGCS). They were evaluated with mostly nothing but N50 measurement.




                              If you ask "What's N50 anyway?":

                              "The N50 size is computed by sorting all contigs from largest to smallest and by
                              determining the minimum set of contigs whose sizes total 50% of the entire genome.
                              The N50 size is the [one of the] smallest contig in that set."

                              Source: Bioinformatics 2005 http://dx.doi.org/doi:10.1093/bioinformatics/bti769
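The quoted definition translates almost directly into code. This is a minimal sketch: `n50` is an illustrative name, and the genome size is taken as a parameter because the definition refers to 50% of the entire genome rather than 50% of the assembly.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// N50 as in the quoted definition: sort contigs from largest to smallest and
// return the length of the contig at which the running total first reaches
// 50% of genomeSize. Returns 0 if the contigs never reach that total.
std::uint64_t n50(std::vector<std::uint64_t> contigLengths, std::uint64_t genomeSize) {
    std::sort(contigLengths.begin(), contigLengths.end(), std::greater<std::uint64_t>());
    std::uint64_t total = 0;
    for (std::uint64_t length : contigLengths) {
        total += length;
        if (2 * total >= genomeSize) return length;
    }
    return 0;
}
```

For contigs of 80, 70, 50, 40, 30 and 20 and a genome size of 290, the running total reaches 150 at the second contig, so the N50 is 70. Nothing in this computation checks whether the contigs are correct, which is why the number says nothing about misassemblies.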

                              You might want to read the (very short) paper above to get acquainted with misassemblies.







                              Not to get off-topic, but the greed thing is general.

                              Greed is locally good but globally [VERY] bad -- here are three examples with references:


                              (1)

                              Research funding is good for academic careers, think-tanks, (locally good) but apparently not good enough for healthcare patients (globally bad).

                              Too fundamental, not enough translational, they say.

                              ==> http://www.nature.com/news/2010/1005....2010.243.html
                              ==> http://www.newsweek.com/id/238078


                              (2)

                              Finance powerhouse makes money (greed is locally good for them, they can buy food, cars, houses, and lobbies), but wrecks the world economy (globally VERY bad).

                              ==> http://news.bbc.co.uk/2/hi/business/8625931.stm
                              ==> http://money.cnn.com/2010/04/16/news...ldman.fortune/


                              (3)

                              Drilling for oil is financially sustainable (locally good for energy and economy), but [VERY] bad for almost everything else when disasters show up.

                              ==> http://www.cbc.ca/world/story/2010/0...oil-spill.html
                              ==> http://www.reuters.com/article/idUSTRE64D69K20100514





                              So as the title goes by: larger genomes -- not yet but coming [VERY] soon!


                              Thanks and cheers!

                              ************
                              Mr. Sébastien M. Boisvert, first-year PhD student, http://boisvert.info/
                              The Ray Project Team, http://denovoassembler.sf.net/


                              • DeNovoG
                                Junior Member
                                • May 2010
                                • 7

                                #30
                                Quick questions: Does Ray support Illumina 1.6+ fastq sequences (the ones with trailing B's: http://seqanswers.com/forums/showthr...ght=fastq+wiki)? Does Ray have the capability to trim low-quality bases, or should I pre-process my reads beforehand? Should I convert my libraries to Phred/Sanger scores? And last but not least, can I run Bambus with Ray's output? Sorry for so many questions, and thank you for any information. BRGDS
