Seqanswers Leaderboard Ad

**seb567** · 02-02-2011, 12:09 PM

Originally posted by corthay View Post

Hi seb567,

I have two questions.

1) Can Ray handle sanger reads such as WGS, Fosmid end ?

Good question. Particularly relevant I would say !

I am currently modifying Ray heuristics to produce assemblies using paired sequences separated by very large physical distances.

I simulated 4 libraries from E. coli K-12 MG1655 genome sequences. These reads include 0.5% mismatch errors. ANd I used wgsim available in samtools.

Fragment standard deviation is 10% of the fragment average length.

These libraries are

200 +/- 20
1000 +/- 100
4000 +/- 400
10000 +/- 1000

Code:

/software/samtools-0.1.7/wgsim -e 0.005 -d 200 -s 20 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Streptococcus-pneumoniae-R6.fasta 200-17_1.fastq 200-17_2.fastq 

/software/samtools-0.1.7/wgsim -e 0.005 -d 1000 -s 100 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 1000_1.fastq 1000_2.fastq 

/software/samtools-0.1.7/wgsim -e 0.005 -d 4000 -s 400 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 4000_1.fastq 4000_2.fastq 

/software/samtools-0.1.7/wgsim -e 0.005 -d 10000 -s 1000 -N 4000000 -1 30 -2 30 -r 0 -R 0 -X 0 Ecoli-k12-mg1655.fasta 10000_1.fastq 10000_2.fastq

Then, using Ray:

Code:

mpirun  -np 31 ~/Ray/trunk/code/Ray \
-p 200_1.fastq 200_2.fastq -p 1000_1.fastq 1000_2.fastq -p 4000_1.fastq 4000_2.fastq -p 10000_1.fastq 10000_2.fastq | tee log

Without scaffolding, Ray generates 65 contigs each having at least 500 nucleotides. There is a total of 4625792 assembled bases, all are A, T, C or G. The mean size is 71166, the N50 is 121756 and the largest is 371980.

This is without any scaffolding.

I then used MUMmer to validate contigs: 0 misassembly, 0 mismatch error and 0 small indel.

Code:

        %  & numberOfContigs & bases & meanSize  & n50  & max   & coverage   & misassembled & mismatches & indels
 Ray & 65 &  & 71166 & 121756 &  371980 &  0.9965 & 0 & 0 & 0 \\

99.65% of the genome is covered by Ray contigs.

Running time/31 processors:

Code:

[1,0]<stdout>: Beginning of computation: 2 seconds
[1,0]<stdout>: Distribution of sequence reads: 52 seconds
[1,0]<stdout>: Distribution of vertices & edges: 2 minutes, 56 seconds
[1,0]<stdout>: Calculation of coverage distribution: 0 seconds
[1,0]<stdout>: Indexing of sequence reads: 15 seconds
[1,0]<stdout>: Computation of seeds: 27 seconds
[1,0]<stdout>: Computation of library sizes: 4 minutes, 45 seconds
[1,0]<stdout>: Extension of seeds: 4 minutes, 15 seconds
[1,0]<stdout>: Computation of fusions: 37 seconds
[1,0]<stdout>: Collection of fusions: 0 seconds
[1,0]<stdout>: Completion of the assembly: 14 minutes, 9 seconds

The contig lengths:

388
641
1038
2164
2714
3438
4735
5845
7199
7939
9557
10364
11883
16323
16937
20213
24049
25788
26056
26433
27014
29449
30205
31159
33620
34875
36084
37445
38010
38550
39509
43086
43623
43841
46455
47970
58692
59083
59441
59889
64822
66096
66767
66783
67838
70713
82799
85321
94029
95691
97407
100961
102906
103167
121756
121858
138099
156439
161412
163121
169912
177402
213535
284796
318866
371980

Let me answer your question now. For seed extension, an algorithmic process that generates contigs, available versions of Ray will not accomodate very large fragment lengths, 400 at most I think.

It is because for larger distances, the right sequence in a pair most likely starts on a repeated k-mer. As such, constraints on the observed distances most be further enforced -- paired reads not satisfying the constraints must be set free in order to allow them to be used in the adequate time.

Ray 1.2.3 'cRAYfish' is set to be released soon. For this release, I want to give a running time on 512 processors for a human genome. But, for the time being, I am currently waiting for some compute time on a compute cluster.

1.2.3 fully utilizes large distances provided by jumping libraries (fosmid ends).

I also plan to incorporate a scaffolder, but that is not my priority now.

My priority is a memory utilization reducer that is utilized during the process of building the distributed de Bruijn graph.

Originally posted by corthay View Post

2) Does Ray do scaffolding ? The results did not contain N even though
I used paired end as following.

Not yet, but it is a planned feature.

Note however that Ray make full use of paired reads to traverse repeated elements in the genome. And apparently it does that very well.

Meanwhile, I suggest you use SSPACE.

SSPACE: a new stand-alone scaffolding tool for small and large genomes - SEQanswers

http://seqanswers.com/forums/showthread.php?t=8350

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

Bioinformatics paper http://bioinformatics.oxfordjournals...s.btq683.short

Originally posted by corthay View Post

The header line of fasta file is modified by SOAPdenovo correction tool.
Is this problem ?

Thanks,
Corthay

no

Ray does not read the header in the fasta and fastq files containing sequences.

**seb567** · 02-02-2011, 12:21 PM

Originally posted by gstitan View Post

Hey,

I have uploaded the last version of Ray to can obtain an amos output format. It ran during ten hours but I obtain errors as:
MPI_Isend(145): MPI_Isend(buf=0x2b196b013ad0, count=1, MPI_UNSIGNED_LONG_LONG, dest=23612, tag=89, MPI_COMM_WORLD, request=0x7fff25927d24) failed
MPI_Isend(95).: Invalid rank has value 23612 but must be nonnegative and less than 16

Do you konw this problem (and have you a solution)?

Yes, there is bug in Ray 1.2.2 concerning the numbering of reads in the output of an AMOS file.

In Ray 1.2.2, I introduced the following feature:

Parallel reading of sequence files

But I forgot to modify the module that writes AMOS files.

In Ray, each sequence has a unique identifier, encoded in a unsigned integer of 64 bits.

Before Ray 1.2.2, the rank responsible for the sequence x was computed with x%NumberOfRanks. And the index of that sequence on the rank was computed with x/NumberOfRanks.

In Ray 1.2.2 and later, this changed.

Given a sequence numbered x, a number of ranks N and a number of sequences M, the rank of x is x/(M/N). If the computed rank exceeds the number of ranks minus 1, then it is decremented because the last rank will have a little more sequences.

The index of a sequence on a rank is computed by subtracting from x the total number of sequences stored by ranks having a lower rank number.

This issue is fixed in 1.2.3, which is set to be released in the next weeks, I guess.

Originally posted by gstitan View Post

Best regards

PS: on the same data with a default output format, it was ok...!

OK

**corthay** · 02-02-2011, 06:52 PM

Hi seb567,

Thank you for the detailed answer.
I am looking forward to the new version

Also, it would make me happy if Ray can handle more large kmer
though it might not be important for Ray.

Thanks
Corthay

**seb567** · 02-05-2011, 11:04 AM

Ray 1.2.3: new heuristics for large distances (jumping libraries)

Hello,

Ray 1.2.3 is now available.

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sourceforge.net/

One of the new features is the capability of assembling genomes with
larger distances.

See this post for an example:

Ray: a NEW MPI-based 100% parallel genome assembler - SEQanswers

http://seqanswers.com/forums/showpost.php?p=34108&postcount=76

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

Other changes:

1.2.3 'cRAYfish'
2011-02-05

Instruction manual
Fixed a bug leading to a segmentation fault when providing an invalid file or no file at all. This bug was introduced in Ray 1.2.2.
Fixed a bug leading to a hang when providing invalid arguments.
Added reporting of memory usage after loading sequences and after distributing vertices.
Removed read simulators because samtools include one called 'wgsim'.
Changed the paired-end heuristics to accomodate larger distances.
Removed RepeatedVertexWatchdog.
Changed the RepeatedThreshold from 255 to 2*peakCoverage (or 255 if it is higher than 255)
Now both versions of read pairs are utilized.
Paired reads are utilized if the distance falls in mean-3 standard deviation;mean+3 standard deviation. (3 instead of 1 as described in the paper).
If a pair of reads does not respect the constraints, the right read is removed from the used read set.
Fusions are computed more cautiously.
Added reporting of memory usage for sequence reads and for vertices. And for a few other places too.
Improved memory usage for the extensions. Now calling destructors instead of clear() because the latter does not free memory.
Messages are grouped in the computation of seeds, using VirtualCommunicator.
Added a Doxygen configuration file.
The paths in the distributed graph are now numbered with an integer of 64 bits. This fixes a segmentation fault occuring with large genomes -- when identifiers overflow on 32 bits. (need to verify this)
The partition on sequence reads is outputted.
Instead of having the Rank 0 (master) to request the contigs of each every other rank, now each rank appends its fusions to the contigs file and also to the AMOS file.
Fixed a bug in the numbering of reads in the AMOS file.
Fixed a memory usage problem in the memory consumption reducer
Improved the running time for the extension of seeds by removing an inner loop.

-seb

**lcollado** · 02-10-2011, 08:49 AM

Hello,

I just began reading this whole thread (and skipped some of the longer posts :P) and I would like to know what is the current status with CS data. In some of the older posts you mention a few problems Ray has with CS data. Does it work correctly now?

Thank you and greetings
Leo

**seb567** · 02-23-2011, 08:41 AM

Originally posted by lcollado View Post

What is the current status with color space data ?

It is mostly not tested, I would say.

**seb567** · 03-22-2011, 06:13 AM

Ray 1.3.0 and de novo assembly of Illumina CEO genome in 11.5 h

Dear all,

Ray 1.3.0 is now available online.

http://sourceforge.net/projects/denovoassembler/files/Ray-1.3.0.tar.bz2

The most important change is the correction of a major bug that caused
parallel infinite loop on the human genome.

This, along concepts incorporated in Ray 1.2.4, allowed Ray to assemble
the genome of Illumina's CEO in 11.5 hours using 512 compute cores (see
below for the link).

What's new?

1.3.0

2011-03-22

* Vertices with less than 1 of coverage are ignored during the
computation of seeds and during the computation of extensions.
* Computation of library outer distances relies on the virtual
communicator.
* Expiry positions are used to toss away reads that are out-of-range
* When only one choice is given during the extension and some reads
are in-range, then the sole choice is picked up.
* Fixed a bug for empty reads.
* A read is not added in the active set if it is marked on a
repeated vertex and its mate was not encountered yet.
* Grouped messages in the extension of seeds.
* Reads marked on repeated vertices are cached during the extension.
* Paths are cached in the computation of fusions.
* Fixed an infinite loop in the extension of seeds.
* When fetching read markers for a vertex, send a list of mates to
meet if the vertex is repeated in order to reduce the communication.
* Updated the Instruction Manual
* Added a version of the logo without text.

I fixed a bug that caused an infinite loop. Now Ray can assemble large
genomes. See my blog post for more detail about that.

de novo assembly of Illumina CEO genome in 11.5 h

http://dskernel.blogspot.com/2011/03/de-novo-assembly-of-illumina-ceo-genome.html

THE initial aims of our group regarding de novo assembly of genomes were: 1. To assemble genomes using mixes of sequencing technologies si...

Version 1.2.4 of Ray incorporated also new concepts that I will present
at RECOMB-Seq 2011.

The talk is available online:

http://boisvert.info/dropbox/recomb-seq-2011-talk.pdf

Sébastien Boisvert

**ttnguyen** · 03-22-2011, 06:31 AM

Cool, I was looking forward for a MPI-based assembler like this. Will have a try now. Thanks a lot!

**Pedro** · 03-24-2011, 03:43 AM

Hi Sebastien,

thank you very much for this tool. I've been using several assemblers to process solexa and roche data on 3 bacterial genomes and your tool is one of the best performers. In fact, with my solexa datasets is the best

. Anyway, I wonder if you could include more output formats in a next release. More precisely, I really wanted to have FASTQ output files.
Indeed, this would not be needed if there was an (almost) universal sequence/assembly format conversion tool. There are several around but are often very targeted.

cheers

**lletourn** · 04-21-2011, 12:43 PM

Hi seb567, glad to have met you in Quebec at the Rendez-vous. I'm playing with Ray on some of my "troublesome" species and I'm having very good results.

Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.

Also, with illumina that now has high 3' quality long reads, maybe bigger kmer can be justified.

Thanks

**ssully** · 04-21-2011, 02:45 PM

seb, Ray looks like a great assembly tool but I have a question -- could I expect Ray to do a better job than Newbler if the input reads are 454 only (single and paired end)? Or is the advantage mainly seen when mixed platform reads are input (e.g. 454 and Illumina)?

**Adjuvant** · 04-26-2011, 05:47 AM

Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.

I would also like to put in a request for longer kmers. Assembling paired illumina reads of 100 bp (forward) and 83 bp (reverse, quality-trimmed) and scaffolding with SSPACE, I get 537 scaffolds with an N50 of 19163 when I use the default kmer setting, but 254 scaffolds with an N50 of 52083 when I use -k 31. Velvet Optimiser using the same data picks a kmer length of 67. I can only guess how awesome an assembly Ray could give me with the ability to use longer kmers...

**seb567** · 04-26-2011, 09:36 AM

Originally posted by ttnguyen View Post

Cool, I was looking forward for a MPI-based assembler like this. Will have a try now. Thanks a lot!

I hope you enjoy Ray.

Originally posted by Pedro View Post

Hi Sebastien,

thank you very much for this tool. I've been using several assemblers to process solexa and roche data on 3 bacterial genomes and your tool is one of the best performers. In fact, with my solexa datasets is the best

. Anyway, I wonder if you could include more output formats in a next release. More precisely, I really wanted to have FASTQ output files.
Indeed, this would not be needed if there was an (almost) universal sequence/assembly format conversion tool. There are several around but are often very targeted.

cheers

Ray is very good when errors are randomly occuring. 454's homopolymers are not random.

I guess you would like the nucleotides to be accompanied by quality values in the output. Is that so ?

Originally posted by lletourn View Post

Hi seb567, glad to have met you in Quebec at the Rendez-vous. I'm playing with Ray on some of my "troublesome" species and I'm having very good results.

Like corthay, I was wondering if you had any plans to support bigger kmers than 31 either through a compiler switch or something else.

One could argue that past a point (kmer > 100) an overlap based assembler might do a very good job, but where speed is concerned graph based assemblers are hard to beat.

Also, with illumina that now has high 3' quality long reads, maybe bigger kmer can be justified.

Thanks

I agree with you. I will check how easy (or hard) it is for me to change the upper bound for k-mer length in Ray.

Originally posted by ssully View Post

seb, Ray looks like a great assembly tool but I have a question -- could I expect Ray to do a better job than Newbler if the input reads are 454 only (single and paired end)? Or is the advantage mainly seen when mixed platform reads are input (e.g. 454 and Illumina)?

With only 454 reads, Ray will not produce a very good assembly because the errors are not randomly occuring in reads. Ray will go through some of these pesky homopolymer bubbles in the graph.

However, should you mix 454 with another sequencing technology with random errors, you will obtain a better that one with only 454 alone.

Originally posted by Adjuvant View Post

I would also like to put in a request for longer kmers. Assembling paired illumina reads of 100 bp (forward) and 83 bp (reverse, quality-trimmed) and scaffolding with SSPACE, I get 537 scaffolds with an N50 of 19163 when I use the default kmer setting, but 254 scaffolds with an N50 of 52083 when I use -k 31. Velvet Optimiser using the same data picks a kmer length of 67. I can only guess how awesome an assembly Ray could give me with the ability to use longer kmers...

I understand, I will see how easy (or hard) it is to change the maximum k-mer length in Ray.

Thank you for your interest in Ray !

p.s.: I am presently optimising Ray by studying its communicational behavior.

http://i.imgur.com/g9NFV.png

x axis: time; blue: number of iterations; red: number of sent messages; green: number of received messages; data are collected 10 times per second

I am also in the process of devising a better way to compute seeds -- with are relatively similar to the concept of unipaths observed in classic assemblers.

**lletourn** · 04-26-2011, 10:33 AM

Hi seq567 do you think it's a wonderful or terrible idea to try Ray on RNASeq data since coverage is not uniform?

You don't do bubble bursting, but when choosing paths could non-uniform coverage create mis-assemblies?

(I could read all the source, but I thought it would be faster this way :-) )

**seb567** · 04-26-2011, 11:29 AM

Originally posted by lletourn View Post

Hi seq567 do you think it's a wonderful or terrible idea to try Ray on RNASeq data since coverage is not uniform?

Worth a try although I believe the coverage distribution may have more than one peak because, as you said, non-uniformity is present.

Originally posted by lletourn View Post

You don't do bubble bursting, but when choosing paths could non-uniform coverage create mis-assemblies?

Choosing the next vertex to visit is not performed with the coverage. Instead, pairs of sequences are utilised for that purpose.

Bubbles are not bursted, but when Ray encounters one (in general, an heterozygous site), it will attempt to pass through it by selecting one child.

In my opinion, bubble bursting and bubble merging are bad approaches because you either lose information (bursting) or create misassemblies that are otherwise not so hard to avoid (merging).

Originally posted by lletourn View Post

(I could read all the source, but I thought it would be faster this way :-) )

However, the bits related to message passing may interfere with your reading.

Things are easier to grasp in natural languages than in computer languages !

***
Sébastien Boisvert

http://boisvert.info

http://twitter.com/sebhtml

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net

http://genome.ulaval.ca/corbeillab

Topics	Statistics	Last Post
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, Today, 12:17 PM	0 responses 7 views 0 likes	Last Post by seqadmin Today, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, Yesterday, 10:49 AM	0 responses 18 views 0 likes	Last Post by seqadmin Yesterday, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News