**figure002** · 06-14-2011, 12:33 AM

Originally posted by seb567:

Regardless, I guess it is correct to consider SFF files as containers, just like FASTA or FASTQ files.

Therefore, Ray will no longer try to match the key sequence. Instead, it will *simply* load all sequences in the SFF file and trim them using the clipping values therein.

See http://github.com/sebhtml/ray/commit/15826e290f1

Thanks Sébastien! I'll give version 1.6.0 a try.
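
For readers unfamiliar with SFF clipping, here is a minimal Python sketch of how the four per-read clip values in an SFF read header are usually combined. It follows the rule in the SFF format description (1-based inclusive positions, 0 meaning "not set"). It is not taken from Ray's source code, and the function name `sff_trim` is made up for illustration.

```python
def sff_trim(bases, clip_qual_left, clip_qual_right,
             clip_adapter_left, clip_adapter_right):
    """Return the part of a read kept after applying SFF clip values.

    Positions are 1-based and inclusive; a value of 0 means "not set".
    """
    n = len(bases)
    first = max(1, clip_qual_left, clip_adapter_left)
    # "x or n" replaces an unset (0) right clip with the read length.
    last = min(clip_qual_right or n, clip_adapter_right or n)
    return bases[first - 1:last]


# Hypothetical read: 4-base key (lowercase), the insert, then adapter.
read = "tcagACGTACGTACGTctgagact"
print(sff_trim(read, 5, 16, 0, 0))  # ACGTACGTACGT
```

With this rule the key sequence never has to be matched explicitly, because the left clip already points past it.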

**seb567** · 06-15-2011, 07:38 AM

Originally posted by kail:

seb567,

This is the first time I have assembled a genome, so I thought my data set was big because it has MANY sequences. Anyway...

How long will the assembly take if I have the following two sets?

Paired-Ends (500 +- 50)
47.803.856 pairs

Mate-pair (2200 +- 200)
42.599.342 pairs

PS: I'm using Ray 1.3.0

You should update to v1.6.0!

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net

You should try Ray on your MANY sequences!

**figure002** · 06-16-2011, 12:16 AM

I just finished a run with Ray 1.6.0 on about 70 GB of trimmed 454 read data (singles + pairs) in .sff format. This time it finished without any errors and in just 4.5 hours (on 16 cores). I'm surprised that it finished this fast, but the resulting contig file was much smaller than expected (just 3.3 MB). I didn't explicitly specify the insert sizes (I let Ray estimate those; I'm not sure how reliable that is), so maybe I should do that next time.

Edit: Apparently I set the k-mer size too low (17). I did another run with k-mer size set to 31 which resulted in many more contigs (235MB).

**seb567** · 06-20-2011, 10:27 AM

Originally posted by figure002:

I just finished a run with Ray 1.6.0 on about 70 GB of trimmed 454 read data (singles + pairs) in .sff format. This time it finished without any errors and in just 4.5 hours (on 16 cores). I'm surprised that it finished this fast, but the resulting contig file was much smaller than expected (just 3.3 MB). I didn't explicitly specify the insert sizes (I let Ray estimate those; I'm not sure how reliable that is), so maybe I should do that next time.

Edit: Apparently I set the k-mer size too low (17). I did another run with k-mer size set to 31 which resulted in many more contigs (235MB).

To use 454 mate-pairs in an SFF file, you must extract them and provide Ray with the 2 resulting fastq files.

Ray only supports 454 shotgun (single) reads.

**gringer** · 06-21-2011, 12:32 AM

I'm going to have a go at tackling the colour space problem. It might take me a while to get up to speed, because it's been a few years since I last worked on a large C++ project -- my head is currently geared towards the cotton-wool world of Java.

If, as appears, the reverse complement of a colour space sequence is its reverse, then it may be more space-efficient to store as colourspace, something like:

<first base><colour-space sequence><reverse-complement first base>

My guess at what needs to happen:
* store colour space reads as <first base>[0-3]+<rc first base>
* assemble by matching colour space [i.e. ignore first reads]
  - don't convert to base space, because misreads in colour space cause the remainder of the read [when converted to base space] to be junk
  - this would help greatly if reads were stored by Ray as colour space as above, but I expect that would be quite a disruptive change
* when reporting assembly, convert to base space
  - only convert / report where the starting base in a segment is known, or can be inferred

There is also an equivalent csfastq format, which might be nice to implement.

My git fork (which happens to be my first attempt at a git fork, so apologies for badness) has started off with modifying the colour space decoder. I've replaced char* use in that file with strings, which may mean it's ended up more broken if constant strings are being thrown around.
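
The reverse-complement property mentioned above can be checked with a short Python sketch. It assumes the standard SOLiD two-base encoding, in which the colour between two adjacent bases is the XOR of their 2-bit codes (A=0, C=1, G=2, T=3). The helper names are illustrative only and are not part of Ray.

```python
# 2-bit codes; the colour between two adjacent bases is the XOR of their codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_colours(seq):
    """Encode a base-space string as a list of colours (0-3), first base omitted."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def stored_form(seq):
    """<first base><colour-space sequence><reverse-complement first base>."""
    colours = "".join(str(c) for c in to_colours(seq))
    return seq[0] + colours + COMPLEMENT[seq[-1]]


seq = "ATGGCATTC"
forward = to_colours(seq)
backward = to_colours(reverse_complement(seq))
print(forward)                     # [3, 1, 0, 3, 1, 3, 0, 2]
print(backward == forward[::-1])   # True
print(stored_form(seq))            # A31031302G
print(stored_form(reverse_complement(seq)))  # G20313013A
```

Complementing a base flips both bits of its code, so the XOR of two neighbours is unchanged; only the order of the colours reverses. In this sketch the proposed stored form of the reverse strand is therefore just the stored form read backwards.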

**flxlex** · 06-21-2011, 04:04 AM

Originally posted by seb567:

To use 454 mate-pairs in an SFF file, you must extract them and provide Ray with the 2 resulting fastq files.

Ray only supports 454 shotgun (single) reads.

In order for Ray to recognize the pair halves belonging together, do they need to conform to the fastq read ID convention?

forward read: @READ_ID/1
reverse read: @READ_ID/2

In a single fastq file, or in two files?

One way to do this would be to:
1) run Newbler on the paired-read SFF files with the -tr option; this results in the reads being split and output (in a single file) as

>READ_ID_left
sequence
>READ_ID_right
sequence

2) split the fasta/qual files according to _left and _right, if needed (small script, I guess)
3) change the _left to /1 and _right to /2 (e.g. using sed)
4) convert fasta+qual to fastq using your favorite tool

Correct?

**seb567** · 06-21-2011, 10:42 AM

Originally posted by flxlex:

In order for Ray to recognize the pair halves belonging together, do they need to conform to the fastq read ID convention?

forward read: @READ_ID/1
reverse read: @READ_ID/2

In a single fastq file, or in two files?

One way to do this would be to:
1) run Newbler on the paired-read SFF files with the -tr option; this results in the reads being split and output (in a single file) as

>READ_ID_left
sequence
>READ_ID_right
sequence

2) split the fasta/qual files according to _left and _right, if needed (small script, I guess)
3) change the _left to /1 and _right to /2 (e.g. using sed)
4) convert fasta+qual to fastq using your favorite tool

Correct?

In Ray, your paired sequences can be in two files (-p file1.fastq file2.fastq) or in one file (-i file.fastq).

For -p, files must contain the same number of sequences.

Example: file1 contains seq1/1, seq2/1 and file2 contains seq1/2, seq2/2.

For -i, the file must contain seq1/1, seq1/2, seq2/1, seq2/2, and so on (interleaved).

The names of your sequences are irrelevant to Ray.
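
The two layouts described above can be converted into each other. Below is a minimal Python sketch that merges two paired files (the -p layout) into one interleaved sequence of lines (the -i layout). It assumes plain 4-line FASTQ records with no wrapped sequence lines; the function names are illustrative and are not part of Ray.

```python
from itertools import islice


def fastq_records(lines):
    """Yield 4-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(lines1, lines2):
    """Merge two paired files (-p layout) into one interleaved list (-i layout)."""
    left = list(fastq_records(lines1))
    right = list(fastq_records(lines2))
    if len(left) != len(right):
        raise ValueError("both files must contain the same number of sequences")
    out = []
    for a, b in zip(left, right):
        out.extend(a)
        out.extend(b)
    return out


file1 = ["@seq1/1", "ACGT", "+", "IIII", "@seq2/1", "GGCC", "+", "IIII"]
file2 = ["@seq1/2", "TTGA", "+", "IIII", "@seq2/2", "CATG", "+", "IIII"]
print(interleave(file1, file2)[::4])
# ['@seq1/1', '@seq1/2', '@seq2/1', '@seq2/2']
```

Pairing here relies purely on record order, which matches the statement that sequence names do not matter.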

**seb567** · 06-22-2011, 10:55 AM

Originally posted by gringer:

I'm going to have a go at tackling the colour space problem. It might take me a while to get up to speed, because it's been a few years since I last worked on a large C++ project -- my head is currently geared towards the cotton-wool world of Java.

If, as appears, the reverse complement of a colour space sequence is its reverse, then it may be more space-efficient to store as colourspace, something like:

<first base><colour-space sequence><reverse-complement first base>

My guess at what needs to happen:
* store colour space reads as <first base>[0-3]+<rc first base>
* assemble by matching colour space [i.e. ignore first reads]
  - don't convert to base space, because misreads in colour space cause the remainder of the read [when converted to base space] to be junk
  - this would help greatly if reads were stored by Ray as colour space as above, but I expect that would be quite a disruptive change
* when reporting assembly, convert to base space
  - only convert / report where the starting base in a segment is known, or can be inferred

There is also an equivalent csfastq format, which might be nice to implement.

My git fork (which happens to be my first attempt at a git fork, so apologies for badness) has started off with modifying the colour space decoder. I've replaced char* use in that file with strings, which may mean it's ended up more broken if constant strings are being thrown around.

Presently, Ray can assemble color-space reads to produce color-space contigs.

https://github.com/sebhtml/ray

The only color-space data I have is ecoli50x50

http://solidsoftwaretools.com/gf/project/ecoli50x50/

An example of contig:

>contig-0 1037 nucleotides
233001233101131131320333012130033132102200122123303132133111
110110230111011310130100101130301003233013332033103101113122
132330033101100120331321333012101320332012311312322312213100
103202013101300233010123103313122103220031030113313031013000
211210100010110330322020130113033330021130101220112301210213
231002121313320013010032021330312020213230100300222033012233
002312023232010122302100111333023230330230331331331232333312
333030023220232210010010211202321100230332102220301232330132
330103313203212301203312313122213110121301330012333012312322
022102313303231031200221210310230331020012332033002101310310
011100130101300312331000002131331230210122001030033230123330
103320133330330123302300332001131201221200233300321233021330
112211213321122101302022122110132111012023121113231201211201
321131122031112030000030100200133301230332133231220111331230
321221210222112030103031123321223120130321222311101103031330
101213203201011330110300201300211231223100202013210123012023
130003102300211303301003231330001201100013231032231222021223
03112102222232320

With these data and a k-mer length of 21, 93.1947% of k-mers occur only once, presumably owing to an enormous error rate.
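
To make the statistic concrete, here is a minimal Python sketch that computes the fraction of distinct k-mers seen exactly once in a set of reads. It is a simplification: it counts k-mers on the given strand only and does not reproduce how Ray counts them. The function name is illustrative.

```python
from collections import Counter


def singleton_fraction(reads, k):
    """Fraction of distinct k-mers that occur exactly once across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy colour-space reads: two identical reads and one with a single miscall.
reads = ["0123012301", "0123012301", "0123312301"]
print(singleton_fraction(reads, 5))
```

A single miscalled colour creates up to k new k-mers that appear nowhere else, which is why a high error rate pushes this fraction up so quickly.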

I (still) don't see a way to convert these colored strings back into biological (nucleotide) space.

To do the conversion, you need the first letter. But the first letter is not known. Furthermore, a contig does not necessarily start right at the beginning of a read.
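
The dependence on the first letter can be illustrated with a short Python sketch. It assumes the standard SOLiD two-base encoding (the colour is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3); the `decode` helper is illustrative and not Ray code.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def decode(first_base, colours):
    """Convert a colour string to base space, given the base before the first colour."""
    seq = [first_base]
    for c in colours:
        seq.append(BASES[BASES.index(seq[-1]) ^ int(c)])
    return "".join(seq)


colours = "23300123"  # first 8 colours of contig-0 above
for first in BASES:
    print(first, decode(first, colours))
```

Each of the four possible first bases gives a different, equally valid-looking nucleotide string, so the colours alone cannot settle which one is real. The same mechanism explains why one wrong colour changes every base decoded after it.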

If you want to discuss the code, please use the mailing list instead.

Sébastien

**arelouse** · 06-28-2011, 01:39 AM

Hi, a simple question: how do you achieve parallelism in Ray?

I took a quick look at this thread and also read the paper and the slides describing Ray, but came away with almost nothing.
Sorry if I missed some points! It would be appreciated if you could explain the basic ideas behind Ray.
For example, it's easy (at least from the description) to understand ABySS' distribution strategy.

**gringer** · 06-30-2011, 03:30 AM

Just as a heads-up about memory consumption in upcoming Ray releases, I've just finished a Ray transcriptome run on my desktop computer (using 10 processor cores). This was done using a bleeding-edge git version of Ray (post Kmer academy). Here are some statistics:

Input files:

2 paired-end Illumina files, each 7.6 GB
1 454 file, 2.1 MB
1 SOLiD colour-space file, converted to base-space, 2.7 GB

These input files were masked and filtered to eliminate sequences < Q20 (so Ray got no 'unknown' bases, which would be converted to 'A'). I presume this is why the 454 input file was so small, about 1/20 of its original size.

Total memory consumption was about 21 GB [my desktop computer has 24 GB], which was similar to memory consumption using the paired-end files alone. I presume this is because the consumption is based on the number of unique k-mers, rather than the input sequence length.

Elapsed time for each step, Thu Jun 30 12:44:37 2011

Sequence partitioning: 15 minutes, 53 seconds
K-mer counting: 34 minutes, 34 seconds
Coverage distribution analysis: 13 seconds
Graph construction: 1 hours, 5 minutes, 28 seconds
Edge purge: 3 minutes, 48 seconds
Selection of optimal read markers: 46 minutes, 56 seconds
Detection of assembly seeds: 4 minutes, 18 seconds
Estimation of outer distances for paired reads: 6 minutes, 1 seconds
Bidirectional extension of seeds: 11 minutes, 42 seconds
Merging of redundant contigs: 52 seconds
Generation of contigs: 4 seconds
Scaffolding of contigs: 2 minutes, 44 seconds
Total: 3 hours, 12 minutes, 33 seconds

Rank 8: assembler memory usage: 2215932 KiB
Rank 0: assembler memory usage: 2027504 KiB
Rank 2: assembler memory usage: 2031596 KiB
Rank 6: assembler memory usage: 2027500 KiB
Rank 4: assembler memory usage: 2105324 KiB
Rank 7: assembler memory usage: 2084844 KiB
Rank 3: assembler memory usage: 2035692 KiB
Rank 9: assembler memory usage: 2187264 KiB
Rank 1: assembler memory usage: 2027500 KiB
Rank 5: assembler memory usage: 2125804 KiB
Number of contigs: 48078
Total length of contigs: 9535608
Number of contigs >= 500 nt: 1746
Total length of contigs >= 500 nt: 1208359
Number of scaffolds: 47705
Total length of scaffolds: 9564818
Number of scaffolds >= 500 nt: 1894
Total length of scaffolds >= 500: 1390923
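
As a cross-check of the "about 21 GB" figure, the per-rank values in the log above can be summed. The following Python sketch parses lines of that form with a regular expression; the pattern is simply fitted to the log lines shown here and is not a documented Ray output format.

```python
import re

LOG = """\
Rank 8: assembler memory usage: 2215932 KiB
Rank 0: assembler memory usage: 2027504 KiB
Rank 2: assembler memory usage: 2031596 KiB
Rank 6: assembler memory usage: 2027500 KiB
Rank 4: assembler memory usage: 2105324 KiB
Rank 7: assembler memory usage: 2084844 KiB
Rank 3: assembler memory usage: 2035692 KiB
Rank 9: assembler memory usage: 2187264 KiB
Rank 1: assembler memory usage: 2027500 KiB
Rank 5: assembler memory usage: 2125804 KiB
"""


def total_kib(log_text):
    """Sum the per-rank memory figures (in KiB) found in the log text."""
    matches = re.findall(r"assembler memory usage: (\d+) KiB", log_text)
    return sum(int(m) for m in matches)


total = total_kib(LOG)
print(total)                         # 20868960
print(round(total / 1024**2, 1))     # 19.9 (GiB)
print(round(total * 1024 / 1e9, 1))  # 21.4 (GB)
```

The ten ranks add up to roughly 19.9 GiB, or 21.4 GB in decimal units, which agrees with the total quoted in the post.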

**seb567** · 07-03-2011, 06:45 AM

Originally posted by arelouse:

Hi, a simple question: how do you achieve parallelism in Ray?

I took a quick look at this thread and also read the paper and the slides describing Ray, but came away with almost nothing.
Sorry if I missed some points! It would be appreciated if you could explain the basic ideas behind Ray.
For example, it's easy (at least from the description) to understand ABySS' distribution strategy.

Read my blog:

More on virtual communication with the message-passing interface.

http://dskernel.blogspot.com/2011/06/more-on-virtual-communication-with.html

The message-passing interface (MPI) is a standard that allows numerous computers to communicate in order to achieve a large-scale peer-to-p...

Also, a silly story:

The Virtual Communicator

http://dskernel.blogspot.com/2011/01/virtual-communicator.html

IT WAS a wintry day of January, in a coldly-tempered land. On this island lived peculiar citizens whose main everyday whereabouts involved p...

**habm** · 07-03-2011, 04:54 PM

Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads (innies) at under 300 bp, and one of outward-facing reads (outies) nearer the desired insert size, e.g. 3000 or 5000 bp.
How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
Thanks.

PS, a run without any insert sizes specified (i.e. Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 457
StandardDeviation: 441
DetectionFailure: Yes

LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 302
StandardDeviation: 218
DetectionFailure: Yes

LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 260
StandardDeviation: 213
DetectionFailure: Yes

Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?

**Wallysb01** · 07-12-2011, 12:09 PM

So, I'm interested to try Ray, as I have access to a cluster with ungodly numbers of cores but very real limitations in RAM that make other programs difficult to run.

Anyway, I have 2 Illumina lanes with 104bp PE reads, totaling about 250M PE reads, from a vertebrate with a genome size of roughly 2Gbp. Do you have any suggestions on how many cores I should try using and for how long?

I was also thinking of trying a fairly large Kmer first, around maybe ~65. Any suggestions on that?

**seb567** · 07-12-2011, 12:47 PM

Originally posted by habm:

Our longer-insert Illumina mate-pair libraries have significant duplication contamination, i.e. two size peaks: one of inward-facing false PE reads (innies) at under 300 bp, and one of outward-facing reads (outies) nearer the desired insert size, e.g. 3000 or 5000 bp.
How should the MP library size mean and SD be specified to allow Ray to deal with this, please?
Thanks.

PS, a run without any insert sizes specified (i.e. Automatic DetectionType) suggests that Ray found the innies OK, but not the useful outies:
LibraryNumber: 1 (nominally 3kbp, really more like 2200bp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 457
StandardDeviation: 441
DetectionFailure: Yes

LibraryNumber: 2 (nominally 6 kbp, really more like 4-5 kbp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 302
StandardDeviation: 218
DetectionFailure: Yes

LibraryNumber: 3 (nominally 8 kbp, really 6.3 kbp)
InputFormat: TwoFiles,Paired
AverageOuterDistance: 260
StandardDeviation: 213
DetectionFailure: Yes

Another way of expressing this is to ask whether Ray can disambiguate PE and MP reads, and if so, what input information is needed?

Presently, Ray cannot disambiguate two paired libraries with different outer distances that are pooled in the same files.

Do you also observe two peaks in these files?

PREFIX.Library0.txt
PREFIX.Library1.txt
PREFIX.Library2.txt
PREFIX.Library3.txt

(Replace PREFIX with the value you gave to the -o switch.)

Example of such a file for MiSeq data:

cat ecoli-MiSeq.Library0.txt

52 1
53 1
56 1
58 1
61 1
62 2
63 2
64 2
65 1
66 2
67 2
68 2
69 1
71 1
72 1
73 2
74 1
75 1
76 2
77 2
78 1
79 3
80 3
82 3
83 3
84 2
85 3
86 2
87 6
89 8
90 1
91 7
92 5
93 5
94 8
95 4
96 6
97 7
98 6
99 10
100 5
101 4
102 10
103 4
104 5
105 3
106 12
107 5
108 10
109 8
110 5
111 10
112 14
113 13
114 11
115 11
116 8
117 10
118 11
119 14
120 23
121 12
122 16
123 14
124 17
125 20
126 18
127 20
128 24
129 21
130 19
131 20
132 14
133 28
134 24
135 34
136 31
137 24
138 29
139 25
140 25
141 42
142 34
143 32
144 36
145 34
146 40
147 39
148 38
149 50
150 33
151 50
152 141
153 141
154 154
155 150
156 142
157 153
158 146
159 173
160 163
161 176
162 147
163 147
164 148
165 140
166 162
167 159
168 136
169 139
170 132
171 149
172 149
173 159
174 165
175 151
176 168
177 167
178 144
179 149
180 153
181 151
182 145
183 150
184 148
185 143
186 159
187 148
188 128
189 140
190 144
191 139
192 156
193 127
194 106
195 149
196 117
197 118
198 130
199 134
200 134
201 139
202 149
203 156
204 138
205 140
206 139
207 146
208 164
209 173
210 153
211 153
212 167
213 167
214 136
215 143
216 187
217 159
218 202
219 157
220 173
221 210
222 187
223 204
224 229
225 230
226 240
227 255
228 261
229 298
230 292
231 313
232 383
233 408
234 465
235 500
236 567
237 660
238 710
239 774
240 958
241 1074
242 1191
243 1296
244 1494
245 1590
246 1924
247 2021
248 2269
249 2456
250 2668
251 2966
252 3229
253 3439
254 3777
255 3940
256 4350
257 4597
258 5087
259 5409
260 5743
261 6285
262 6936
263 7585
264 8592
265 9747
266 11368
267 13517
268 16384
269 20030
270 24454
271 29734
272 35766
273 42817
274 49781
275 57815
276 65042
277 72012
278 78801
279 84473
280 90123
281 93640
282 97052
283 100139
284 101610
285 103303
286 103863
287 104660
288 104885
289 104595
290 104248
291 104087
292 104276
293 103179
294 102572
295 102231
296 101174
297 100269
298 100211
299 99244
300 98984
301 97880
302 97111
303 95995
304 94988
305 94205
306 92844
307 91975
308 91365
309 89238
310 89011
311 87040
312 85224
313 84300
314 82654
315 80892
316 79329
317 77002
318 74527
319 71500
320 68193
321 64387
322 60527
323 55998
324 51459
325 46082
326 41498
327 36553
328 32011
329 27418
330 23535
331 19623
332 16185
333 13669
334 11028
335 9012
336 7569
337 6332
338 5236
339 4454
340 3676
341 3071
342 2564
343 2167
344 1875
345 1538
346 1278
347 935
348 766
349 591
350 412
351 305
352 253
353 151
354 119
355 76
356 51
357 37
358 32
359 22
360 24
361 10
362 7
363 5
364 4
365 3
367 3
369 2
370 1
373 1
378 1
390 1
398 1
399 1
414 1
417 1
421 1
431 1
497 1
501 1
516 1
528 1
668 1
1200 1
1230 1
1373 1
1478 1
1886 1
2028 1
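
A file like the one above can be inspected for one or two modes with a few lines of Python. The sketch below assumes each line holds two whitespace-separated integers (observed outer distance, then number of pairs), as in the listing. The peak test is deliberately naive, the function names are illustrative, and the toy histogram is made up rather than taken from any real library.

```python
import math


def parse_histogram(text):
    """Parse 'distance count' lines into a list of (distance, count) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def mean_and_sd(pairs):
    """Count-weighted mean and population standard deviation of the distances."""
    total = sum(count for _, count in pairs)
    mean = sum(d * count for d, count in pairs) / total
    variance = sum(count * (d - mean) ** 2 for d, count in pairs) / total
    return mean, math.sqrt(variance)


def local_peaks(pairs, min_count):
    """Distances whose count is >= min_count and higher than both neighbours."""
    peaks = []
    for i in range(1, len(pairs) - 1):
        count = pairs[i][1]
        if count >= min_count and pairs[i - 1][1] < count > pairs[i + 1][1]:
            peaks.append(pairs[i][0])
    return peaks


# Made-up toy histogram with two modes (innies near 300, outies near 3000).
toy = "298 10\n299 40\n300 90\n301 35\n302 5\n2999 20\n3000 60\n3001 25\n"
pairs = parse_histogram(toy)
print(local_peaks(pairs, min_count=50))  # [300, 3000]
print(mean_and_sd(pairs))
```

On real, noisy histograms the counts would need smoothing or binning before a neighbour comparison like this is meaningful. Note also that a single mean and SD computed over a bimodal library describes neither peak well.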

**seb567** · 07-12-2011, 01:10 PM

Originally posted by Wallysb01:

So, I'm interested to try Ray, as I have access to a cluster with ungodly numbers of cores but very real limitations in RAM that make other programs difficult to run.

Anyway, I have 2 Illumina lanes with 104bp PE reads, totaling about 250M PE reads, from a vertebrate with a genome size of roughly 2Gbp. Do you have any suggestions on how many cores I should try using and for how long?

I was also thinking of trying a fairly large Kmer first, around maybe ~65. Any suggestions on that?

Well, for low memory usage, you definitely want to use Ray v1.6.1 (on its way; the current release candidate, Ray v1.6.1-rc3, is available at https://github.com/sebhtml/ray/zipball/v1.6.1-rc3 ).

See http://sourceforge.net/mailarchive/m...sg_id=27781099 for more details.

In your message, you don't report how much memory your compute cores have access to.

Ray is a peer-to-peer program; that is, you can launch it on 2048 compute cores if you want.

But you should first do a run with k=31 as a quality-control check.

You'll get something like this:

cat parrot-BGI-Assemblathon2-k31-20110711.CoverageDistributionAnalysis.txt

k-mer length: 31
Lowest coverage observed: 1
MinimumCoverage: 31
PeakCoverage: 133
RepeatCoverage: 235
Number of k-mers with at least MinimumCoverage: 2462747440 k-mers
Estimated genome length: 1231373720 nucleotides
Percentage of vertices with coverage 1: 82.8132 %
DistributionFile: parrot-BGI-Assemblathon2-k31-20110711.CoverageDistribution.txt
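
In this report the estimated genome length is exactly half the number of k-mers with at least MinimumCoverage. One plausible reading, which is an assumption and not stated in the thread, is that each k-mer is counted on both strands. The Python sketch below parses the relevant lines and checks the ratio; the parsing is fitted to the lines shown here and is not a documented format.

```python
REPORT = """\
k-mer length: 31
MinimumCoverage: 31
PeakCoverage: 133
Number of k-mers with at least MinimumCoverage: 2462747440 k-mers
Estimated genome length: 1231373720 nucleotides
"""


def parse_report(text):
    """Split 'key: value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields


fields = parse_report(REPORT)
kmers = int(fields["Number of k-mers with at least MinimumCoverage"].split()[0])
genome = int(fields["Estimated genome length"].split()[0])
print(kmers // 2 == genome)  # True
```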

In Ray, k-mers from 15 to 31 are stored in one 64-bit integer.

K-mers from 33 to 63 are stored in 2 64-bit integers.

K-mers from 65 to 95 are stored in 3 64-bit integers.
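
These ranges are what you get if each nucleotide takes 2 bits, which is an assumption here but is consistent with 31 nucleotides fitting in one 64-bit integer. A small Python sketch of the arithmetic, with an illustrative function name:

```python
def words_per_kmer(k, bits_per_base=2, word_bits=64):
    """Number of 64-bit integers needed to hold one k-mer of length k."""
    return (k * bits_per_base + word_bits - 1) // word_bits  # ceiling division


print([words_per_kmer(k) for k in (31, 33, 63, 65, 95)])  # [1, 2, 2, 3, 3]
```

Going from k=31 to k=33 therefore doubles the storage for the k-mer itself, which matters when memory per core is the limiting factor.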

Example for the memory usage with Illumina TruSeq 3 chemistry

Ray v1.6.1-rc3 compiled with FORCE_PACKING=y MAXKMERLENGTH=32

(FORCE_PACKING=y causes bus errors on some architectures such as UltraSparc and Itanium)

k=31

2 386 063 326 Illumina TruSeq 3 sequences, length is 90 or 151

data for the Parrot dataset of Assemblathon 2

Data generated by the BGI.

Running time:

[1,0]<stdout>: Sequence partitioning: 2 hours, 30 minutes, 21 seconds
[1,0]<stdout>: K-mer counting: 2 hours, 33 minutes, 44 seconds
[1,0]<stdout>: Coverage distribution analysis: 3 minutes, 51 seconds
[1,0]<stdout>: Graph construction: 1 hours, 36 minutes, 47 seconds
[1,0]<stdout>: Edge purge: 48 minutes, 20 seconds
[1,0]<stdout>: Selection of optimal read markers: 1 hours, 5 minutes, 30 seconds
[1,0]<stdout>: Detection of assembly seeds: 12 minutes, 15 seconds
[1,0]<stdout>: Estimation of outer distances for paired reads: 4 minutes, 51 seconds
[1,0]<stdout>: Bidirectional extension of seeds: 2 hours, 11 minutes, 46 seconds
[1,0]<stdout>: Merging of redundant contigs: 13 minutes, 31 seconds
[1,0]<stdout>: Generation of contigs: 1 minutes, 24 seconds
[1,0]<stdout>: Scaffolding of contigs: 34 minutes, 46 seconds
[1,0]<stdout>: Total: 11 hours, 57 minutes, 30 seconds

Peak memory usage:

~800 GiB, distributed on 512 compute cores uniformly by Ray's peer-to-peer scheme.

Each compute core utilises on average ~ 1.5 GiB maximum.
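
The per-core figure follows directly from the two numbers above. The Python sketch below repeats the division and adds a hypothetical helper, `cores_needed`, for the reverse question of how many cores a given total would need at a fixed memory budget per core. It assumes memory is spread uniformly across cores, and the 2.0 GiB budget is an arbitrary example value, not a figure from the thread.

```python
import math

peak_total_gib = 800  # approximate peak reported above
cores = 512
print(peak_total_gib / cores)  # 1.5625


def cores_needed(total_gib, gib_per_core):
    """Smallest core count whose combined memory covers total_gib."""
    return math.ceil(total_gib / gib_per_core)


print(cores_needed(800, 2.0))  # 400
```

The total for a different dataset is not known in advance, so this is only useful once a smaller test run has given an idea of the memory footprint.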

Measured network latency is ~150 microseconds; this figure includes software overheads.

head parrot-BGI-Assemblathon2-k31-20110711.NetworkTest.txt
# average latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
# Message passing interface rank Name Latency in microseconds
0 r104-n7 153
1 r104-n7 156
2 r104-n7 155
3 r104-n7 155
4 r104-n7 154
5 r104-n7 155
6 r104-n7 155
7 r104-n7 155
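
To summarise a NetworkTest file like this one, the latency column can be averaged. The Python sketch below assumes three whitespace-separated fields per data line (rank, node name, latency) and skips the `#` comment lines; this layout is inferred from the excerpt above.

```python
NETWORK_TEST = """\
# Message passing interface rank Name Latency in microseconds
0 r104-n7 153
1 r104-n7 156
2 r104-n7 155
3 r104-n7 155
4 r104-n7 154
5 r104-n7 155
6 r104-n7 155
7 r104-n7 155
"""


def latencies(text):
    """Return the latency column (microseconds) as a list of integers."""
    values = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        _rank, _name, latency = line.split()
        values.append(int(latency))
    return values


values = latencies(NETWORK_TEST)
print(sum(values) / len(values))  # 154.75
print(max(values))                # 156
```

Ranks with a latency far above the average would be worth a closer look before starting a long run.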

What is the interconnect between your compute cores?

Sébastien
