**seb567** · 03-23-2010, 03:38 AM

@KevinLam

Indeed, I started the development for color space using these datasets:

http://solidsoftwaretools.com/gf/project/dh10bfrag/

http://solidsoftwaretools.com/gf/project/ecoli2x50/

However, these data contain too many errors (in color space) to be assembled de novo (in color space), in my opinion. My estimation is that the error rate in color space ranges from 8% to 12% for these two datasets. That would explain the total lack of de novo assemblies performed so far with SOLiD technology.

So, you are free to try Ray with csfasta files, but it is not 100% tested yet.

Perhaps the latest version of the SOLiD sequencer produces more reliable reads, but I don't know. I am sure someone else on SeqAnswers.com knows more about that than I do.

Thank you, happy assembly!

***
The Ray Project Team

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net/

**seb567** · 03-29-2010, 06:41 AM

Dear Ray enthusiasts:

Ray 0.0.5 is now available with these new features:

* Ray now outputs assemblies in AMOS format (with -a),
* Ray commands can be provided with a commands file (like in 0.0.3 and 0.0.4) as well as with command-line arguments, and
* Ray removes non-A-T-C-G letters at both ends of reads.

About Ray:

Ray is computer-controlled software that performs parallel de novo genome assemblies of next-gen sequencing data using the Message Passing Interface (MPI). It uses an assembly engine called Parallel_Ray_Engine.

Download Ray 0.0.5: https://sourceforge.net/projects/den...r.bz2/download

Mailing list: https://lists.sourceforge.net/lists/...ssembler-users

Statistics:

Ray 0.0.3 downloads since 2010-03-09: 63
Ray 0.0.4 downloads since 2010-03-22: 23
SeqAnswers Thread Views since 2010-03-09: 767

Test results (2010-03-28-3159-1): https://sourceforge.net/mailarchive/...ssembler-users

**sparks** · 03-31-2010, 11:19 PM

Colour space Alignment

Hi Kevin,
I had a quick look at your code for colour space. I think you need to skip the first colour as well as the leading primer base on each read, because the first colour is produced by the primer base plus the first base of the fragment. If you leave the first colour in, it will add an extra error to roughly three out of four reads.

ColourSpaceLoader.cpp:63 t->copy(NULL,bufferForLine+2,readMyAllocator);// remove the leading T & first colour
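
To make the suggestion concrete, here is a minimal sketch in Python of the trimming step. It is not Ray's code: the function name is hypothetical, and it assumes each csfasta read line is one primer base followed by colour digits.

```python
def strip_primer_and_first_colour(csfasta_read: str) -> str:
    """Drop the primer base and the first colour from one csfasta read line.

    Assumed layout (hypothetical helper, not Ray's loader): a single primer
    letter such as 'T', followed by colour digits 0-3.
    """
    if len(csfasta_read) < 2 or not csfasta_read[0].isalpha():
        raise ValueError("expected a primer base followed by colours")
    # Index 0 is the primer base, index 1 is the colour that spans
    # primer -> first fragment base; both are skipped.
    return csfasta_read[2:]


trimmed = strip_primer_and_first_colour("T0010100")  # "010100"
```

The slice at offset 2 plays the same role as `bufferForLine+2` in the line quoted above.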

Colin

**seb567** · 04-01-2010, 04:51 AM

Dear sparks,

You are right. I changed +1 to +2 to skip the first color too.

p.s.: I (Sébastien Boisvert) developed Ray.

**sparks** · 04-01-2010, 06:44 AM

Colour Space

Hi Sebastien,
My apologies for the name mix-up. We have two lanes of 50 bp PE reads from a bacterium to assemble in the next few weeks, so we'll give Ray a try. I think assembly in colour space isn't much different from assembly in nucleotide space, but after CS assembly we need to convert back to nucleotides. This could mean remembering the first colour of every read and its position in the contigs, since the first colour and the primer base give a reference for conversion. Are you doing this?
Thanks for giving us Ray. We'll let you know how it goes.
Colin

**seb567** · 04-01-2010, 07:19 AM

Hi sparks,

I am glad that Ray sparks interest.

Ray is not ready for color space yet. Ray loads color-space reads, builds a distributed de Bruijn graph in color space, and computes paths in that graph. The algorithm is pretty much the same, except that in color space, the reverse-complement is simply the reverse (AA and TT have the same color). But I have not implemented the conversion back to nucleotides yet because I have not figured out which starting base to use for decoding color-encoded paths.
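
The reverse-complement property can be checked with a small Python sketch. It assumes the usual two-bit colour code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases); the function names are illustrative and are not taken from Ray.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_colours(bases: str) -> str:
    """Encode a nucleotide string as colours, one per adjacent base pair."""
    return "".join(
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(bases, bases[1:])
    )


def reverse_complement(bases: str) -> str:
    """Reverse-complement in nucleotide space."""
    return "".join(COMPLEMENT[b] for b in reversed(bases))


sequence = "TTGGTTT"
forward = encode_colours(sequence)                       # "010100"
revcomp = encode_colours(reverse_complement(sequence))   # "001010"
# Complementing both bases of a pair leaves their colour unchanged,
# so only the order of the colours flips.
assert revcomp == forward[::-1]
```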

In particular, these questions remain unanswered regarding color space:

Q1) If all color-space reads use T, does that mean the decoding is done with T?

Q2) If some (color-space) reads start with T, while others use A, how do I sort things out?

Q3) What is the mismatch error rate of the various versions of the SOLiD sequencer?

Thanks!

***
Sébastien Boisvert
The Ray Project Team

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net/

**sparks** · 04-01-2010, 08:02 AM

Hi Sébastien,
For Q1&2, the primer base and first colour define the first base of the read, so you need to keep this for every read along with where the read started in the contig. With some luck the contigs would be consistent with the first bases, so if you start the code conversion at one read start then all the rest will match. I expect this might not work in practice (an error in the first colour), so maybe try a sliding window that selects the conversion that matches the most first bases.
I haven't any experience of error rate yet.
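
A rough Python sketch of the "pick the conversion that matches the most first bases" idea follows. It is a simplification: it takes one global vote over the whole contig rather than a sliding window, and the names and the `(position, first_base)` anchor format are assumptions made for illustration only.

```python
BITS_TO_BASE = "ACGT"
BASE_TO_BITS = {base: bits for bits, base in enumerate(BITS_TO_BASE)}


def decode(start_base: str, colours: str) -> str:
    """Decode a colour string into bases, given the base it starts from."""
    bases = [start_base]
    for colour in colours:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ int(colour)])
    return "".join(bases)


def pick_decoding(contig_colours: str, read_anchors):
    """Try all four starting bases and keep the decoding that agrees with
    the most read first bases.

    read_anchors: iterable of (position_in_contig, first_base_of_read),
    where first_base_of_read comes from the primer base + first colour.
    Returns (decoded_contig, number_of_agreeing_reads).
    """
    best_votes, best_candidate = -1, None
    for start_base in BITS_TO_BASE:
        candidate = decode(start_base, contig_colours)
        votes = sum(1 for pos, base in read_anchors if candidate[pos] == base)
        if votes > best_votes:
            best_votes, best_candidate = votes, candidate
    return best_candidate, best_votes


# Three anchors agree with a T start; the fourth carries a first-colour error.
anchors = [(0, "T"), (2, "G"), (4, "T"), (5, "A")]
contig, votes = pick_decoding("010100", anchors)  # ("TTGGTTT", 3)
```

A windowed version would run the same vote over sub-ranges of the contig, which would also flag places where the colour path itself contains an error.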

Colin

**nilshomer** · 04-01-2010, 09:35 AM

You can also normalize the color reads to have the same starting adapter (say A). You convert the adapter and first color appropriately. You will then only need to store the first color.

Code:

original: T0010100
base: TTGGTTT
normalized: A3010100
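
The normalization can be sketched in Python as below, assuming the usual two-bit colour code (A=0, C=1, G=2, T=3, colour = XOR of adjacent bases). Function names are made up for the illustration.

```python
BITS_TO_BASE = "ACGT"
BASE_TO_BITS = {base: bits for bits, base in enumerate(BITS_TO_BASE)}


def decode_read(read: str) -> str:
    """Decode adapter + colours into the base sequence (adapter excluded)."""
    previous, bases = read[0], []
    for colour in read[1:]:
        previous = BITS_TO_BASE[BASE_TO_BITS[previous] ^ int(colour)]
        bases.append(previous)
    return "".join(bases)


def normalise_adapter(read: str, new_adapter: str = "A") -> str:
    """Rewrite the adapter and first colour so the read starts with
    new_adapter while still decoding to the same bases."""
    first_base = BITS_TO_BASE[BASE_TO_BITS[read[0]] ^ int(read[1])]
    new_first_colour = BASE_TO_BITS[new_adapter] ^ BASE_TO_BITS[first_base]
    return new_adapter + str(new_first_colour) + read[2:]


original = "T0010100"
normalised = normalise_adapter(original)            # "A3010100"
assert decode_read(original) == decode_read(normalised) == "TTGGTTT"
```

Once every read starts with the same adapter, only the first colour (2 bits) has to be kept per read to recover its first base.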

**seb567** · 04-01-2010, 12:46 PM

Nils Homer: Correct me if I am wrong, but decoding the color-space read in a nucleotide representation will impede the meaning of the bits if at least one color is erroneous.

Edit: as Sparks suggested, one can simply discard the starting base and the first color (in your example T0010100 becomes 010100). But then, which base (A, T, C, or G) should be used for decoding paths produced by Ray's algorithm? Thanks a lot for your expertise with the SOLiD sequencing technology!

**nilshomer** · 04-01-2010, 01:16 PM

Originally posted by seb567 View Post

Nils Homer: Correct me if I am wrong, but decoding the color-space read in a nucleotide representation will impede the meaning of the bits if at least one color is erroneous.

The normalization procedure above produces the read back in color space, so proper base space decoding can happen later. Its real benefit is that all the reads end up with the same starting adapter. If you are worried about storing the first base and color for each color read, you can now normalize the color space read and then only have to store the first color. Both the original and the normalized color space read produce the same base sequence, and are therefore equivalent encodings.

You are right that in the final alignment or assembly, naively decoding the color space read without identifying the sequencing errors will cause incorrect bases after the sequencing error. However, most color space aligners do, and in this case your assembler should, identify the sequencing errors as part of the alignment and in the final result.
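
The error propagation is easy to see in a short Python sketch (standard two-bit colour code assumed; the helper name is illustrative). A single wrong colour shifts every base decoded after it.

```python
BITS_TO_BASE = "ACGT"
BASE_TO_BITS = {base: bits for bits, base in enumerate(BITS_TO_BASE)}


def decode_read(read: str) -> str:
    """Naively decode adapter + colours into bases (adapter excluded)."""
    previous, bases = read[0], []
    for colour in read[1:]:
        previous = BITS_TO_BASE[BASE_TO_BITS[previous] ^ int(colour)]
        bases.append(previous)
    return "".join(bases)


clean = decode_read("T0010100")       # "TTGGTTT"
# Same read with the third colour misread as 2 instead of 1.
corrupted = decode_read("T0020100")   # "TTCCAAA"

# Every base from the erroneous colour onward differs.
mismatches = [i for i, (a, b) in enumerate(zip(clean, corrupted)) if a != b]
assert mismatches == [2, 3, 4, 5, 6]
```

This is why the colour error has to be detected and corrected in colour space before the final conversion to bases.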

**sparks** · 04-01-2010, 04:41 PM

It's equivalent

primer base + colour = 1st base = "A" + normalised colour --- requires 2 bits storage per read

**seb567** · 04-26-2010, 06:33 AM

Ray 0.0.7 compares very favorably with available short-read paired assemblers

Dear appreciated SEQanswers community:

Parallel software for parallel sequencing technologies

Ray 0.0.7 -- computer-controlled software that performs parallel de novo genome assemblies of next-gen sequencing data using the Message Passing Interface (MPI) -- is now available for download.

Download Ray 0.0.7: http://sourceforge.net/projects/deno...r.bz2/download
Wiki page: http://sourceforge.net/apps/mediawik...itle=Main_Page
Do-it-yourself examples: http://sourceforge.net/apps/mediawik...rself_examples
Review changes: http://sourceforge.net/apps/mediawik...?title=Changes
Mailing list: http://lists.sourceforge.net/lists/l...ssembler-users

Fewer contigs with Roche/454 and Illumina reads

We are delighted to report to SEQanswers that Ray 0.0.7 with Roche/454 and Illumina reads systematically outperforms Newbler on Roche/454 reads on three public datasets. Specifically, Ray computes fewer contigs with fewer errors while covering most of the coverable genome.

Review numbers: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

de novo assembly with Illumina -- because outstanding quality and practical cost matter

Ray 0.0.7 also crushes the competition on Illumina unpaired and paired public datasets. Ray also outperforms on simulated data -- but these are not very useful outside assembler development.

Review comparisons: http://sourceforge.net/apps/mediawik..._for_Ray_0.0.7

Scientific paper on its way

For those (numerous?) people looking for a Ray paper: I am working on my revised manuscript.

Conflicts of interest

None

Acknowledgments

This project is funded by the Canadian Institutes of Health Research (Institute of Genetics).

More information: http://sourceforge.net/apps/mediawik...cknowledgments

Thank you,

make this day an open assembly day!

-seb

---
Mr. Sébastien Boisvert
on the behalf of the Ray Project Team

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net/

**francesco.vezzi** · 05-16-2010, 07:07 AM

Ray and genome size

Hi Seb
your assembler seems really promising. I was wondering whether it can also work with plant and animal genomes, which are very long (gigabases) and contain very long repeats.

One of the strengths of SOAPdenovo and ABySS is their ability to assemble really complex genomes like the human one. If I'm not wrong, your benchmarks are made "only" on small genomes.

Thanks
Francesco

**seb567** · 05-17-2010, 07:09 AM

Larger genomes -- not yet but coming soon!

Dear Mr. Francesco Vezzi, and SEQanswers great community,

First, you are right to say that Ray is currently benchmarked openly and only on small genomes.

In my roadmap, I am waiting for a paper to get published to continue my effort on larger genomes (the publish or perish thing).
I will send my revised form hopefully in the next days when I get OKs from co-authors.

The next thing (after the paper) is to help decode larger genomes --
but it's hard to find the reads that go with a larger genome (and the reference).

You can't do much with just raw reads from an otherwise un-sequenced/assembled entity.
N50 is cool, but it is not a critical assessment metric, it is just a number everyone
blindly maximises.

The community reported that our benchmarks are only on small genomes
( http://seqanswers.com/forums/showthr...8643#post18643 ).
We are currently working on the matter (larger genomes).
Ray can handle them if hardware requirements are met (InfiniBand,
memory, and processors), but it is not extensively tested and they probably need accommodation.

Most assemblers (Velvet, EULER-SR, amongst others) sacrifice sequence quality for N50, at least that is what I understand from my open benchmarks.

In the early ages and stages of short-read assemblers, greedy approaches were at the crux of their
behaviour -- greed is locally good, but can be globally bad (SSAKE, VCAKE, and SHARCGS). They were evaluated with mostly nothing but the N50 measurement.

If you ask "What's N50 anyway?":

"The N50 size is computed by sorting all contigs from largest to smallest and by
determining the minimum set of contigs whose sizes total 50% of the entire genome.
The N50 size is the [one of the] smallest contig in that set."

Source: Bioinformatics 2005 http://dx.doi.org/doi:10.1093/bioinformatics/bti769
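
For readers who prefer code, here is a minimal Python sketch of that definition. The function name and the fallback to the total contig length when no genome size is given are my assumptions, not part of the quoted paper.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 contig size following the quoted definition.

    Contigs are taken from largest to smallest until their sizes total at
    least 50% of the genome; the last contig taken is the N50.
    If genome_size is None, the sum of contig lengths is used instead
    (an assumption; many tools report N50 this way).
    Returns None when the contigs cover less than half of the genome.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None


lengths = [80, 70, 50, 40, 30, 20]
n50(lengths)                   # 70  (80 + 70 = 150 >= 145)
n50(lengths, genome_size=400)  # 50  (80 + 70 + 50 = 200 >= 200)
```

Note that the number says nothing about whether the contigs are correct, which is the point made above.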

You might want to read the (very short) paper above to get acquainted with misassemblies.

Not to get off-topic, but the greed thing is general.

Greed is locally good but globally [VERY] bad -- here are three examples with references:

(1)

Research funding is good for academic careers, think-tanks, (locally good) but apparently not good enough for healthcare patients (globally bad).

Too fundamental, not enough translational, they say.

==> http://www.nature.com/news/2010/1005....2010.243.html
==> http://www.newsweek.com/id/238078

(2)

Finance powerhouse makes money (greed is locally good for them, they can buy food, cars, houses, and lobbies), but wrecks the world economy (globally VERY bad).

==> http://news.bbc.co.uk/2/hi/business/8625931.stm
==> http://money.cnn.com/2010/04/16/news...ldman.fortune/

(3)

Drilling for oil is financially sustainable (locally good for energy and economy), but [VERY] bad for almost everything else when disasters show up.

==> http://www.cbc.ca/world/story/2010/0...oil-spill.html
==> http://www.reuters.com/article/idUSTRE64D69K20100514

So as the title goes by: larger genomes -- not yet but coming [VERY] soon!

Thanks and cheers!

************
Mr. Sébastien M. Boisvert, first-year PhD student, http://boisvert.info/
The Ray Project Team, http://denovoassembler.sf.net/

**DeNovoG** · 05-18-2010, 01:42 PM

Quick questions: Does Ray support Illumina 1.6+ FASTQ sequences (the ones with trailing B's: http://seqanswers.com/forums/showthr...ght=fastq+wiki)? Does Ray have the capability to trim low-quality bases, or should I pre-process my reads beforehand? Should I convert my libraries to Phred/Sanger scores? And last but not least, can I run Bambus on Ray's output? Sorry for so many questions, and thank you for any information. BRGDS
