**seb567** · 11-09-2010, 06:01 AM

I think I found the reason behind all the hanging.

I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).

I believe it is more likely an MPI rank being flooded by messages and unable to respond than anything else.

I am currently testing regularization of message sending in the extension of seeds: ensuring that a particular number of microseconds elapses between messages.

If that fails, Ray will simply do the extension of seeds one MPI rank after another.

In the other steps of the algorithm (distribution of vertices, for example), the messages are observed to be uniformly distributed.

With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

Thank you.

**PHSchi** · 11-19-2010, 04:26 AM

memory consumption and other issues

Hi everyone!

We are trying to run Ray on an IntelMPI based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. killed the node it was running on. As we can't access its run stats I have some questions:
Given a Ray run is started with ~ 202,136,864 Illumina pe reads (100bp on average) what would be the expected peak memory requirement? Anybody any estimates from their own experience?
And what does everybody see in terms of runtime for their Ray assemblies? We are testing on one node with 8 cores at the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.
Any help with this would be highly appreciated!
Btw. anybody out there actually running Ray with IntelMPI?

Regards,

Philipp

**mrawlins** · 11-23-2010, 09:31 AM

Ray has worked great for our work with Illumina and 454 reads, but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.
Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:

#Coverage NumberOfVertices
255 2
1431655765 16859178292550

which seems wrong to me.
Ray was compiled using OpenMPI 1.5 and gcc 4.5.1. I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue. If anybody has experienced this sort of thing and/or has a recommendation on how to fix it that would be great.

**pallo** · 11-24-2010, 12:29 AM

@seb567: You are right, trying to run on a bigger set of nodes (20x8cores) as well as a smaller set of larger memory cores made no difference - the jobs hang within 24hrs of startup and do so using 100% cpu load until killed. If I can provide any further debug info, let me know...

@PHSchi: I'm using openMPI 1.4.2, but for approx. 500M 100bp paired Illumina reads, random checking on nodes suggests that the total memory usage was 400-600GB. That's just a finger-in-the-wind estimate, and the jobs got stuck, so I can't say for sure. The target genome is estimated at around 1.3Gbases. So a very linear guess is that you need at least half of this, that is 200-300GB.
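The linear guess can be written out as a small calculation. This is only a sketch: it assumes memory scales in proportion to the number of reads, which is a rough simplification, and the variable names are made up for illustration.

```python
# Proportional scaling of the observed memory range to a smaller read set.
# Assumption: memory grows linearly with read count (a rough guess only).
observed_reads = 500_000_000      # approximate read count of the larger run
observed_mem_gb = (400, 600)      # rough range seen when checking nodes
target_reads = 202_136_864        # read count quoted by PHSchi

scale = target_reads / observed_reads
estimate_gb = tuple(round(m * scale) for m in observed_mem_gb)

print(f"scale factor: {scale:.2f}")
print(f"linear estimate: {estimate_gb[0]}-{estimate_gb[1]} GB")
```

Strict proportionality lands a bit below the rounded 200-300GB figure, so the "at least half" guess errs on the safe side.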

cheers
pallo

**seb567** · 11-25-2010, 07:48 AM

Ray 1.0.0 is compliant with the standard MPI 2.2 !

Warning: long post ahead.

Statements:

1. Ray 0.1.0 and before were not 100 % compatible with the standard MPI 2.2. Thus, Ray hung sometimes.

2. Ray 1.0.0 is compliant with the standard MPI 2.2.

3. Ray 1.0.0 __SHOULD__ not hang.

4. Ray 1.0.0 is released.

Now, let me answer your questions.

@seb567 (self) 11-09-2010, 06:01 AM

I think I found the reason behind all the hanging.

I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).

I believe it is more likely an MPI rank being flooded by messages and unable to respond than anything else.

As George Bosilca puts it:

No message is eager if there is congestion. 64K is eager for TCP only if the kernel buffer has enough room to hold the 64k. For SM it only works if there are ready buffers. In fact, eager is an optimization of the MPI library, not something the users should be aware of, or base their application on this particular behavior.

On the MPI 2.2 there is a specific paragraph that advice the users not to do it.

http://www.open-mpi.org/community/lists/devel/2010/11/8702.php

I am currently testing regularization of message sending in the extension of seeds: ensuring that a particular number of microseconds elapses between messages.

That failed.

If that fails, Ray will simply do the extension of seeds one MPI rank after another.

That was not fast, and failed with MPICH2.

The ultimate solution was to read the standard MPI 2.2.

http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf

Warning: 647 pages, very technical.

In the other steps of the algorithm (distribution of vertices, for example), the messages are observed to be uniformly distributed.

But still, MPI_Send can block !

Note that MPI_Send was replaced with MPI_Isend in Ray 1.0.0.

With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.

Thank you.

Ray 1.0.0 is compliant with MPI 2.2 and should not hang.

@PHSchi 11-19-2010, 04:26 AM

Hi everyone!

We are trying to run Ray on an IntelMPI based Linux (RH) cluster. Recently one of my test jobs crashed, i.e. killed the node it was running on.

IntelMPI is based, I believe, on MPICH2. Thus, Ray 1.0.0 will work fine, but not previous versions.

As we can't access its run stats I have some questions:

If you start your jobs with qsub (Oracle/Sun Grid Engine), try to modify and run qhost.py, which is readily available in scripts/ from the Ray 1.0.0 distribution. The script uses 'qhost -j -xml>dump.xml' and then parses the XML file.

Given a Ray run is started with ~ 202,136,864 Illumina pe reads (100bp on average) what would be the expected peak memory requirement? Anybody any estimates from their own experience?

Memory usage depends mainly on the genome size and error rates.

And what does everybody see in terms of runtime for their Ray assemblies?

End-users working with bacterial data are satisfied.

I don't know for others.

We are testing on one node with 8 cores at the moment, as earlier tests with multiple nodes crashed and took other running procs to hell with them.

8 cores sound low for ~ 202,136,864 Illumina pe reads.

Any help with this would be highly appreciated!

I hope Ray 1.0.0 works for you!

Btw. anybody out there actually running Ray with IntelMPI?

Regards,

Philipp

As I wrote, IntelMPI is based on MPICH2.
see http://www.mcs.anl.gov/research/proj...x.php?s=collab

With Ray 1.0.0, IntelMPI should work fine. And that should be true with g++ and icc.

@mrawlins 11-23-2010, 09:31 AM

Ray has worked great for our work with Illumina and 454 reads,

Yes, mixing technologies compensates for 454 homopolymer errors and Illumina's shorter read length.

but is giving us some trouble in our tests on SOLiD data. Our test set is from the NCBI SRA (run SRR035444, submission SRA009283), which can be downloaded either from NCBI or from the SRA at EBI.

I added a ticket, but my tests with public datasets from solidsoftwaretools indicated that the error rate of this technology does not allow a de novo assembly with Ray.

For instance, with k=21, you probably want the error (substitution) rate to be below 1/21. Otherwise almost every k-mer will contain an error, and thus be unique !

1 / 21 = 0.0476190476 = 4.76 %

If I remember well, error rates for these datasets were above that (~12 % or so, I think).
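To make the 1/k rule of thumb concrete, here is a small sketch of the fraction of k-mers expected to be error-free. It assumes independent substitution errors at a uniform per-base rate, which is a simplification, and the function name is only illustrative.

```python
def error_free_fraction(error_rate: float, k: int) -> float:
    """Probability that a k-mer contains no substitution error,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - error_rate) ** k


k = 21
for rate in (0.01, 1 / k, 0.12):
    fraction = error_free_fraction(rate, k)
    print(f"error rate {rate:.4f}: {fraction:.1%} of {k}-mers error-free")
```

At a rate of 1/k, a k-mer carries one error on average, and under this model the error-free fraction drops quickly as the rate rises above that.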

http://solidsoftwaretools.com/

Datasets are:

SOLiD™4 System E.Coli DH10B Fragment Data Set
SOLiD™ System E.Coli DH10B 50X50 Mate-Pair Data Set

Ray 0.1.0 gets just past the coverage distribution and hangs. The -TheCoverageDistribution.tab file reads:
#Coverage NumberOfVertices
255 2
1431655765 16859178292550

I think that does not mean anything. 1431655765 is just not possible because the maximum value is 255.

Can you try again with Ray 1.0.0 and post/send me the results ?

which seems wrong to me.

You are not alone.

Ray was compiled using OpenMPI 1.5 and gcc 4.5.1.

You are better off with Open-MPI 1.4.3 or MPICH2 1.3.1 or any other super-stable releases. Open-MPI 1.5 is a beta 'feature release'.

I get this same error using anywhere between 7 and 48 nodes, and it doesn't seem to be a memory issue.

I would bet on an error rate above 1/k. Try

mpirun -np 40 Ray -k 15 -p dataLEFT.fastq.bz2 dataRIGHT.fastq.gz

Supposing that your genome/transcriptome size is far below 1 073 741 824.

Code:

4^15 =                    1 073 741 824
4^21 =              4 398 046 511 104
4^32 = 18 446 744 073 709 551 616
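The figures above are the number of distinct DNA k-mers, 4^k. A quick way to reproduce them for any k (the function name is just for illustration):

```python
def kmer_space(k: int) -> int:
    """Number of distinct DNA k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


for k in (15, 21, 32):
    print(f"4^{k} = {kmer_space(k):,}")
```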

If anybody has experienced this sort of thing and/or has a recommendation on how to fix it that would be great.

Well, again, my tests on the datasets from http://solidsoftwaretools.com/ indicated that the error rate of the SOLiD technology is not friendly with de novo assembly with Ray.

Let us hope that 'Exact Call Chemistry' will fix that.

http://www3.appliedbiosystems.com/cms/groups/global_marketing_group/documents/generaldocuments/cms_088755.pdf

http://www.news-medical.net/news/20101027/Addition-of-ECC-probe-to-SOLiD-Systems-chemistry-achieves-greater-

@pallo Yesterday, 12:29 AM

@seb567: You are right, trying to run on a bigger set of nodes (20x8cores) as well as a smaller set of larger memory cores made no difference - the jobs hang within 24hrs of startup and do so using 100% cpu load until killed. If I can provide any further debug info, let me know...

Can you try with Ray 1.0.0 as it is compliant with the standard MPI 2.2 ?

I replaced MPI_Send with MPI_Isend, and I carefully added some sort of busy-waiting before sending additional messages. Note that I say 'some sort' because an MPI rank can still receive MPI messages while waiting.

Also, I removed calls to MPI_Iprobe, and I replaced them with a ring of 128 bins of MPI requests that are MPI_Recv_init'ed & MPI_Start'ed at the start of computation.

Credit for this idea goes to George Bosilca (University of Tennessee & MPI/Open-MPI researcher/scientist).

http://www.open-mpi.org/community/lists/devel/2010/11/8710.php
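The ring of pre-posted receives can be pictured with a toy model. This is a pure-Python simulation with no MPI calls; it is not Ray's code, and the class and method names are invented for illustration. In the real thing each slot would be a persistent request created with MPI_Recv_init and armed with MPI_Start, and actual MPI matching semantics are richer than this.

```python
class ReceiveRing:
    """Toy model of a fixed ring of pre-posted receive slots (no real MPI)."""

    def __init__(self, size: int = 128):
        self.slots = [None] * size  # None means "armed and waiting"
        self.head = 0               # next slot the application polls
        self.tail = 0               # next slot an incoming message fills
        self.filled = 0

    def deliver(self, message) -> bool:
        """Simulate a message landing in the next armed slot.
        Returns False when every slot is occupied."""
        if self.filled == len(self.slots):
            return False
        self.slots[self.tail] = message
        self.tail = (self.tail + 1) % len(self.slots)
        self.filled += 1
        return True

    def poll(self):
        """Check only the head slot; consume and re-arm it if filled."""
        if self.filled == 0:
            return None
        message = self.slots[self.head]
        self.slots[self.head] = None  # re-arm (MPI_Start in real code)
        self.head = (self.head + 1) % len(self.slots)
        self.filled -= 1
        return message


ring = ReceiveRing(size=4)
for i in range(6):
    if not ring.deliver(f"msg{i}"):
        print(f"msg{i} must wait: all slots busy")
print(ring.poll(), ring.poll())
```

The point of the design is that the receiver only ever tests one known slot instead of probing for arbitrary messages, and slots are reused in a fixed order.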

@PHSchi: I'm using openMPI 1.4.2, but for approx. 500M 100bp paired Illumina reads, random checking on nodes suggests that the total memory usage was 400-600GB. That's just a finger-in-the-wind estimate, and the jobs got stuck, so I can't say for sure. The target genome is estimated at around 1.3Gbases.

Given the genome size and the presence of errors, I must agree with your estimate.

If each MPI rank provides you with 3 gigabytes of memory, then you need around 200 MPI ranks.

Code:

600 / 3 = 200
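The same arithmetic as a reusable sketch, rounding up so a partial rank still counts as a whole one. The 3 GB per rank figure is the one assumed above; the function name is illustrative.

```python
import math


def ranks_needed(total_memory_gb: float, memory_per_rank_gb: float) -> int:
    """Smallest number of MPI ranks whose combined memory covers the total."""
    return math.ceil(total_memory_gb / memory_per_rank_gb)


for total in (400, 600):
    print(f"{total} GB at 3 GB per rank: {ranks_needed(total, 3)} ranks")
```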

Contrary to ABySS, which uses google-sparsehash to store data on disk (at least that was true the last time I checked), Ray stores everything in memory.

http://code.google.com/p/google-sparsehash/

So a very linear guess is that you need at least half of this, that is 200-300GB

cheers
pallo

For sure you can't buy that if you work in a laboratory.

However, in the United States of America, the National Center for Computational Sciences provides resources to scientists.

http://www.nccs.gov/

In Canada, Compute Canada/Calcul Canada (we speak French and English !) provides compute resources to scientists.

https://computecanada.org/

Acknowledgment for Ray 1.0.0

Élénie Godzaridis (Institut de biologie intégrative et des systèmes de l'Université Laval) for suggesting using End of transmission to pack sequences & suggesting using enum for constants.

George Bosilca (University of Tennessee) for MPI_Recv_init/MPI_Start and for pointing out that MPI_Send can block even below the eager threshold.

Jeff Squyres (Cisco) for pointing out that MPI_Send to self is not safe and that MPI_Request_free on an active request is evil.

Eugene Loh (Oracle) for the correct eager threshold (4000 bytes, not 4096 bytes).

René Paradis (Centre de recherche du CHUL) for giving me a good-old Sun Blade 100 (SPARC V9, TI UltraSparc IIe (Hummingbird)) & for maintaining my testing boxes.

Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting that Ray should load interleaved files and GZIP-compressed files.

Frédéric Lefebvre (CLUMEQ - Université Laval) for installing software on the mighty colosse. http://www.top500.org/system/10195

The Canadian Institutes of Health Research for my scholarship.

ChangeLog for Ray 1.0.0

v. 1.0.0

r4038 | 2010-11-25

* Made a lot of changes to make Ray compliant with the standard MPI 2.2
* Added master and slave modes.
* Added an array of master methods (pointers): selecting the master method
with the master mode is done in O(1).
* Added an array of slave methods (pointers): selecting the slave method
with the slave mode is done in O(1).
* Added an array of message handlers (pointers): selecting the message handler method
with the message tag is done in O(1).
* Replaced MPI_Send by MPI_Isend. Thanks to Open-MPI developers for their
support and explanation on the eagerness of Open-MPI: George Bosilca (University of Tennessee), Jeff Squyres (Cisco), Eugene Loh (Oracle)
* Moved some code for the extension of seeds.
* Grouped messages for library updates.
* Added support for paired-end interleaved sequencing reads (-i option)
Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
* Moved detectDistances & updateDistances in their own C++ file.
* Updated the Wiki.
* Decided that the next release was 1.0.0.
* Added support for .fasta.gz and .fastq.gz files, using libz (GZIP).
Thanks to Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash University, AUSTRALIA) for suggesting the feature !
* Tested with k=17: it uses less memory, but is less precise.
* Fixed a memory allocation bug when the code runs on 512 cores and more.
* Added configure script using automake & autoconf.
Note that if that fails, read the INSTALL file !
* Moved the code that loads fasta files in FastaLoader.
* Moved the code that loads fastq files in FastqLoader.
* Regulated the communication in the MPI 'tribe'.
* Added an assertion to verify the message buffer length before sending it.
* Modified bits so that if a message is more than 4096 bytes, it is split into
chunks.
* Used a sentinel to remove two messages, coupled with TAG_REQUEST_READS.
* Stress-tested with MPICH2.
* Implemented a ring allocator for inboxes and outboxes.
* Changed flushing so that all use <flush> & <flushAll> in BufferedData.
* Changed the maximum message size from 4096 to 4000 to send messages eagerly
more often (if it happens). Thanks to Open-MPI developers for their support and explanation on the eagerness of Open-MPI: Eugene Loh (Oracle), George Bosilca (University of Tennessee), Jeff Squyres (Cisco).
* Changed the way sequencing reads are indexed: before the master was
reloading (again !) files to do so, now no files are loaded and every MPI rank participates in the task.
* Modified the way sequences are distributed. These are now appended to fill the buffer, and
the sentinel called 'End of transmission' is used. Thanks to Élénie Godzaridis for pointing out that '\0' is not a valid sentinel for strings !
* Optimized the flushing in BufferedData: flush is now destination-specific.
O(1) instead of O(n) where n is the number of MPI ranks.
* Optimized the extension: paired information is appended in the buffer in
which the sequence itself is.
* Added support for .fasta.bz2 & .fastq.bz2. This needs LIBBZ2 (-lbz2)
* Added instructions in the INSTALL file for manually compiling the source in
case the configure script gets tricky (cat INSTALL).
* Added a received messages file. This is pretty useless unless you want to
see if the received messages are uniform !
* Added bits to write the fragment length distribution of each library.
* Changed the definition of MPI tags: they are now defined with an enum.
Thanks to Élénie Godzaridis for the suggestion.
* Changed the definition of slave modes: they are now defined with an enum.
Thanks to Élénie Godzaridis for the suggestion.
* Changed the definition of master modes: they are now defined with an enum.
Thanks to Élénie Godzaridis for the suggestion.
* Optimized finishFusions: complexity changed from O(N*M) to O(N log M).
* Designed a beautiful logo with Inkscape.
* Added a script for regression tests.
* Changed bits so that a paired read is not updated if it does not need it
* Changed the meaning of the -o parameter: it is now a prefix.
* Added examples with MPICH2, Open-MPI, and Open-MPI/SunGridEngine.
* Changed DEBUG for ASSERT as it activates assertions.
* Updated the citation in the standard output.
* Corrected the interleave-fastq python script.
* Changed the license file from LICENSE to COPYING.
* Removed the trimming of reads if they are not read from a file.
* Increased the verbosity of the extension step.
* Added gnuplot scripts.
* Changed the file name for changes: from NEWS to ChangeLog.
* Optimized the MPI layer: replaced MPI_Iprobe by MPI_Recv_init+MPI_Start.
see MessagesHandler.cpp ! (Thanks to George Bosilca (University of Tennessee) for the suggestion !)
* Compiled and tested on architecture SPARC V9 (sparc64).
* Compiled and tested on architecture Intel Itanium (ia64).
* Compiled and tested on architecture Intel64 (x86_64).
* Compiled and tested on architecture AMD64 (x86_64).
* Compiled and tested on Intel architecture (x86/ia32).
* Evaluated regression tests.
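The changelog's combination of enum-defined tags and an array of handlers gives O(1) dispatch: the tag value is used directly as an index. Ray itself is C++, so the following Python is only an illustration of the idea. TAG_REQUEST_READS is named in the changelog; the other tag names and all handler functions are hypothetical.

```python
from enum import IntEnum


class Tag(IntEnum):
    # TAG_REQUEST_READS appears in the changelog; the rest are hypothetical.
    TAG_REQUEST_READS = 0
    TAG_HYPOTHETICAL_VERTEX = 1
    TAG_HYPOTHETICAL_GOODBYE = 2


def handle_request_reads(payload):
    return f"reads requested: {payload}"


def handle_vertex(payload):
    return f"vertex received: {payload}"


def handle_goodbye(payload):
    return "goodbye"


# Array of handlers indexed by tag value: lookup is one list index, O(1),
# instead of a chain of if/else comparisons on the tag.
HANDLERS = [None] * len(Tag)
HANDLERS[Tag.TAG_REQUEST_READS] = handle_request_reads
HANDLERS[Tag.TAG_HYPOTHETICAL_VERTEX] = handle_vertex
HANDLERS[Tag.TAG_HYPOTHETICAL_GOODBYE] = handle_goodbye


def dispatch(tag: Tag, payload):
    """Call the handler registered for this message tag."""
    return HANDLERS[tag](payload)


print(dispatch(Tag.TAG_REQUEST_READS, 7))
```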

**jfpombert** · 11-25-2010, 04:58 PM

Ray 1.0.0 doesn't load reads

Hi Seb,

I compiled Ray version 1.0.0 today (openMPI 1.4.3, gcc 4.5.1, Fedora 14) and when I run the new executable it stops at loading the first of the paired end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset it runs fine. Tried compiling it twice but got the same result.

Jean-Francois Pombert

2x Xeon E5506
96G RAM
Intel Server Board S5520HC
Linux kernel 2.6.35.6-48

**seb567** · 11-25-2010, 05:39 PM

Ray 1.0.0 doesn't load reads
Hi Seb,

I compiled Ray version 1.0.0 today (openMPI 1.4.3, gcc 4.5.1, Fedora 14) and when I run the new executable it stops at loading the first of the paired end Solexa reads (Rank 0 loads nameoffile) and exits. When I use the previous Ray version (0.1.0) with the same command line on the same dataset it runs fine. Tried compiling it twice but got the same result.

Jean-Francois Pombert

2x Xeon E5506
96G RAM
Intel Server Board S5520HC
Linux kernel 2.6.35.6-48

Can you provide more details (by email if you wish) ?

The module for loading sequences from files has not changed much, but the distribution of sequences has.

However, I have not seen that glitch.

**jfpombert** · 11-25-2010, 06:04 PM

Here is the console log. I'll look again at the compilation. I might have goofed somehow.

Thx

JF

****************************************
[David@bigdaddy Ray]$ mpirun -np 8 Ray -p 100420_s_7_1_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt -s FQH37LX05.sff -s FQH37LX06.sff -s FTX7HMM01.sff -s FU6LJ3H01.sff -s FWZEL0L06.sff -o test.txt
Bienvenue !

Rank 0: Ray 1.0.0
Rank 0: compiled with Open-MPI 1.4.3
Rank 0 reports the elapsed time, Thu Nov 25 17:48:37 2010
---> Step: Beginning of computation
Elapsed time: 1 seconds
Since beginning: 1 seconds

**************************************************
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see "COPYING" for details.
**************************************************

Ray Copyright (C) 2010 Sébastien Boisvert, Jacques Corbeil, François Laviolette
Centre de recherche en infectiologie de l'Université Laval
Project funded by the Canadian Institutes of Health Research (Doctoral award 200902CGM-204212-172830 to S.B.)

Ray -- Parallel genome assemblies for parallel DNA sequencing

http://denovoassembler.sf.net/

Reference to cite:

Sébastien Boisvert, François Laviolette & Jacques Corbeil.
Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies.
Journal of Computational Biology (Mary Ann Liebert, Inc. publishers, New York, U.S.A.).
November 2010, Volume 17, Issue 11, Pages 1519-1533.
doi:10.1089/cmb.2009.0238

http://dx.doi.org/doi:10.1089/cmb.2009.0238

Rank 0 welcomes you to the MPI_COMM_WORLD
Rank 0 is running as UNIX process 18016 on bigdaddy
Rank 2 is running as UNIX process 18018 on bigdaddy
Rank 3 is running as UNIX process 18019 on bigdaddy
Rank 5 is running as UNIX process 18021 on bigdaddy
Rank 1 is running as UNIX process 18017 on bigdaddy
Rank 4 is running as UNIX process 18020 on bigdaddy
Rank 7 is running as UNIX process 18023 on bigdaddy
Rank 0: I am the master among 8 ranks in the MPI_COMM_WORLD.

Ray command:

Ray \
-p \
100420_s_7_1_seq_GKD-1.txt \
100420_s_7_2_seq_GKD-1.txt \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

-p (paired-end sequences)
Left sequences: 100420_s_7_1_seq_GKD-1.txt
Right sequences: 100420_s_7_2_seq_GKD-1.txt
Average length: auto
Standard deviation: auto

-s (single sequences)
Sequences: FQH37LX05.sff

-s (single sequences)
Sequences: FQH37LX06.sff

-s (single sequences)
Sequences: FTX7HMM01.sff

-s (single sequences)
Sequences: FU6LJ3H01.sff

-s (single sequences)
Sequences: FWZEL0L06.sff

k-mer size: 21
--> Number of k-mers of size 21: 4398046511104
*** Note: A lower k-mer size bounds the memory usage. ***

Rank 0 is loading 100420_s_7_1_seq_GKD-1.txt
Rank 6 is running as UNIX process 18022 on bigdaddy
[David@bigdaddy Ray]$
********************************************************

**seb567** · 11-25-2010, 06:34 PM

Dear Jean-Francois Pombert,

Thank you for your timely answer.

In Ray 0.1.0 and before, fasta and fastq were detected using the first line in the file.

In Ray 1.0.0, I solely use the file extension to select the appropriate loader.

Ray \
-p \
100420_s_7_1_seq_GKD-1.txt \
100420_s_7_2_seq_GKD-1.txt \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

So, Ray does not know what to do with .txt files and just stops.

Usage:

Supported sequences file format:
.fasta
.fasta.gz
.fasta.bz2
.fastq
.fastq.gz
.fastq.bz2
.sff (paired reads must be extracted manually)

Parameters:

Single-end reads
-s <sequencesFile>

Paired-end reads:
-p <leftSequencesFile> <rightSequencesFile> [ <fragmentLength> <standardDeviation> ]

Paired-end reads:
-i <interleavedFile> [ <fragmentLength> <standardDeviation> ]

Output (default: Ray-Contigs.fasta)
-o <outputFile>

AMOS output
-a

k-mer size (default: 21)
-k <kmerSize>

I will add a specific message to alert the user about the extension.
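For illustration, extension-based selection with a warning for unknown extensions could look roughly like this. It is a sketch in Python, not Ray's C++ code; the function name and the warning text are made up, and only the list of extensions comes from the usage text above.

```python
# Extensions taken from the supported-formats list in the usage text.
SUPPORTED_EXTENSIONS = (
    ".fasta.bz2", ".fastq.bz2",
    ".fasta.gz", ".fastq.gz",
    ".fasta", ".fastq", ".sff",
)


def select_loader(filename: str):
    """Return the matching supported extension, or None if there is none."""
    for ext in SUPPORTED_EXTENSIONS:
        if filename.endswith(ext):
            return ext
    return None


for name in (
    "100420_s_7_1_seq_GKD-1.txt",
    "100420_s_7_1_seq_GKD-1.txt.fastq",
    "FQH37LX05.sff",
):
    ext = select_loader(name)
    if ext is None:
        print(f"warning: {name}: unrecognized extension, file not loaded")
    else:
        print(f"{name}: would use the loader for {ext}")
```

Because the check looks only at the file name, a FASTQ file named .txt is rejected even though its content is fine.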

Thank you for your interest in Ray !

**seb567** · 11-25-2010, 06:44 PM

@jfpombert

I forgot to provide a fix.

quick fix:

ln -s 100420_s_7_1_seq_GKD-1.txt 100420_s_7_1_seq_GKD-1.txt.fastq

ln -s 100420_s_7_2_seq_GKD-1.txt 100420_s_7_2_seq_GKD-1.txt.fastq

mpirun -np 8 \
Ray \
-p \
100420_s_7_1_seq_GKD-1.txt.fastq \
100420_s_7_2_seq_GKD-1.txt.fastq \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

or using bzip2, you will save precious space:

bzip2<100420_s_7_1_seq_GKD-1.txt>100420_s_7_1_seq_GKD-1.txt.fastq.bz2

bzip2<100420_s_7_2_seq_GKD-1.txt>100420_s_7_2_seq_GKD-1.txt.fastq.bz2

mpirun -np 8 \
Ray \
-p \
100420_s_7_1_seq_GKD-1.txt.fastq.bz2 \
100420_s_7_2_seq_GKD-1.txt.fastq.bz2 \
-s \
FQH37LX05.sff \
-s \
FQH37LX06.sff \
-s \
FTX7HMM01.sff \
-s \
FU6LJ3H01.sff \
-s \
FWZEL0L06.sff \
-o \
test.txt

Thank you for providing a detailed report of what you did.

**jfpombert** · 11-25-2010, 06:48 PM

Ok, great, i'll just change the extensions.

Many thanks!

JF

**caddymob** · 11-29-2010, 10:30 AM

processes aborted

Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead.. getting a 550 error, no such file..

So then I went ahead and tried to assemble my own genome of a very homozygous (>96%) mammalian genome sequenced on Illumina with paired 105bp reads. Ray is failing and I do not understand why.

Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:

Code:

use intel-openmpi-1.4.2
use Ray-0.1.0

mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs

And the output I get:

Code:

Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7
Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)
.
.
.
Rank 243 is running as UNIX process 4231 on s26-4.local (MPI version 2.0)
Rank 0: I am the master among 256 ranks in the MPI_COMM_WORLD.

Rank 0: Ray 0.1.0 is running
Rank 0: operating system is Linux (during compilation)

LoadPairedEndReads
 Left sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq
 Right sequences: /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_2.fq
 Average length: auto
 Standard deviation: auto

k-mer size: 21
 --> Number of k-mers of size 21: 4398046511104
  *** Note: A lower k-mer size bounds the memory usage. ***


Rank 0 loads /scratch/jcorneveaux/LUNDE_ASSEMBLE/Lunde_1.fq.
Rank 0 has 140174250 sequences to distribute.
Rank 0 distributes sequences, 1/140174250
mpirun noticed that job rank 1 with PID 4194 on node s28-2 exited on signal 15 (Terminated). 
254 additional processes aborted (not shown)
1 process killed (possibly by Open MPI)Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7
Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)

Is there something wrong with my configuration?

**seb567** · 11-29-2010, 10:49 AM

Really excited to try out Ray! I first tried to grab the example datasets, but the links are dead.. getting a 550 error, no such file..

NCBI moved their infrastructure from .fastq to .sra files.

My favorite toy dataset is SRA001125, Illumina data of E. coli K-12 MG1655.

Search SRA001125 and you'll find it.

So then I went ahead and tried to assemble my own genome of a very homozygous (>96%) mammalian genome sequenced on Illumina with paired 105bp reads. Ray is failing and I do not understand why.

Many reasons can explain that.

Ray ran for just over 2 hours on 256 cores before dying. Here are my commands:

use intel-openmpi-1.4.2
use Ray-0.1.0

You use Ray 0.1.0 ! Try Ray 1.0.0, I assure you it has many fixes included.

v. 1.0.0 is the release with the most changes to date.

http://sourceforge.net/apps/mediawiki/denovoassembler/index.php?title=ChangeLog#v._1.0.0

mpirun -np 256 Ray -p $wd\Lunde_1.fq $wd\Lunde_2.fq -o Lunde-contigs

Do you have access to an SMP machine with 256 processor cores ?!

If so, I envy you.

And the output I get:

Code:

Rank 0 welcomes you to the MPI_COMM_WORLD.
Rank 0: website -> http://denovoassembler.sf.net/
Rank 0: using Open-MPI 1.2.7

So basically, you use a bad mix of software: intel-openmpi-1.4.2 with Ray compiled against Open-MPI 1.2.7.

This will surely fail !

Rank 0 is running as UNIX process 4193 on s28-2.local (MPI version 2.0)

The latest standard is MPI 2.2, from 2009. MPICH2 and Open-MPI 1.4.3 comply with MPI 2.2.

Ray works with MPI 2.0 too, I guess.

Rank 0: Ray 0.1.0 is running

As I said, 0.1.0 is defunct. Embrace the new 1.0.0.

The next release is coming soon.

Ray for large genomes is on its way !

My last test on human chromosome 1 (the largest) with one library of
length 200 and another of length 400 shows great success:

Rank 0: 69173 contigs/205904915 nucleotides

Rank 0 reports the elapsed time, Sun Nov 28 20:38:22 2010
---> Step: Collection of fusions
Elapsed time: 1 minutes, 16 seconds
Since beginning: 8 hours, 22 minutes, 4 seconds

Elapsed time for each step, Sun Nov 28 20:38:22 2010

Beginning of computation: 3 seconds
Distribution of sequence reads: 25 minutes, 3 seconds
Distribution of vertices: 1 minutes, 16 seconds
Calculation of coverage distribution: 1 seconds
Distribution of edges: 1 minutes, 30 seconds
Indexing of sequence reads: 2 seconds
Computation of seeds: 10 minutes, 39 seconds
Computation of library sizes: 4 minutes, 51 seconds
Extension of seeds: 7 hours, 33 minutes, 36 seconds
Computation of fusions: 3 minutes, 47 seconds
Collection of fusions: 1 minutes, 16 seconds
Completion of the assembly: 8 hours, 22 minutes, 4 seconds

Rank 0 wrote r4068-human.CoverageDistribution.txt
Rank 0 wrote r4068-human.Library0.txt
Rank 0 wrote r4068-human.Library1.txt
Rank 0 wrote r4068-human.fasta
Rank 0 wrote r4068-human.ReceivedMessages.txt

Is there something wrong with my configuration?

Your configuration is erroneous in two independent ways.

1. You are using Ray 0.1.0, not Ray 1.0.0.

2. You are running an executable compiled against Open-MPI 1.2.7 with, I believe, Open-MPI 1.4.2.

Thank you for your interest in Ray !

"The Ray of light is coming to life, and the Ray of darkness is fading away."

-Seb

**caddymob** · 11-29-2010, 11:31 AM

Many thanks seb567!

I had to have my IT department compile and install Ray on our cluster and did not notice they used the old version of Ray. Thanks for pointing this out. I have requested the 1.0.0 version to be installed, along with questions about the MPI version available.

Any idea when the new version will be available for mammalian genomes? Looking forward to it!

I will keep you posted on my progress once I get the new version up and running. Thanks again!

**seb567** · 11-29-2010, 11:42 AM

I had to have my IT department compile and install Ray on our cluster and did not notice they used the old version of Ray. Thanks for pointing this out. I have requested the 1.0.0 version to be installed, along with questions about the MPI version available.

I think you are fine with Open-MPI 1.4.2 compiled with Intel Compiler (use intel-openmpi-1.4.2).

Any idea when the new version will be available for mammalian genomes? Looking forward to it!

Before Friday, for sure.

I will keep you posted on my progress once I get the new version up and running.

Thank you for your updates !
