
**seb567** · 05-25-2010, 05:22 AM

Hi,

Question 1

Does Ray support Illumina 1.6+ fastq format (with trailing B's)?

Answer

No, but it should work as long as the B's are at the end of the quality string rather than at the beginning.

Question 2

Does Ray have the capability for trimming low-quality bases or should I pre-process my reads beforehand?

Answer

Ray does not trim sequences, but random errors are not a problem.
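
For readers who prefer to strip the trailing B's themselves before assembly, here is a minimal Python sketch. It assumes Illumina 1.5+ style FASTQ with 4-line records, where a run of `B` characters at the end of the quality string flags a low-quality read segment. The function names and the `min_length` parameter are hypothetical and are not part of Ray.

```python
def trim_trailing_b(seq, qual):
    """Remove the trailing run of 'B' quality characters and the matching bases."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    keep = len(qual.rstrip("B"))  # number of positions before the trailing B run
    return seq[:keep], qual[:keep]


def trim_fastq(lines, min_length=1):
    """Yield trimmed (header, seq, plus, qual) records from 4-line FASTQ input.

    min_length is a hypothetical cutoff: reads shorter than this after
    trimming are skipped.
    """
    for header, seq, plus, qual in zip(*[iter(lines)] * 4):
        seq, qual = trim_trailing_b(seq.rstrip("\n"), qual.rstrip("\n"))
        if len(seq) >= min_length:
            yield header.rstrip("\n"), seq, plus.rstrip("\n"), qual
```

Note that skipping reads in one file of a paired library would put the two files out of step, so for paired data both mates would need to be dropped together.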

Question 3

Should I convert my libraries to Phred/Sanger scores?

Answer

No: Ray does not use quality scores.

Question 4

Can I run Bambus with Ray's output?

Answer

Ray outputs a FASTA file. I have never used Bambus, so it depends on what input Bambus needs.

-Sebhtml

**ldong** · 06-04-2010, 03:14 PM

OpenMPI 1.2.8?

Hi Sebastien,
I followed the instructions on the wiki page and launched Ray about 20 hours ago. The processes are still running. Is this normal? When should it finish? I believe we have enough computing power. My only concern is that we have OpenMPI 1.2.8 instead of 1.3.4 or 1.4.1. Does Ray work with 1.2.8? Any suggestions would be appreciated. Thank you very much for your great work. Best, ldong

== Do-it-yourself examples ==

=== E. coli K-12 MG1655 with Illumina paired-end reads & amos output ===

* Reads
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_1.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...665_2.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_1.fastq.gz
ftp://ftp.ncbi.nlm.nih.gov/sra/stati...666_2.fastq.gz

* Create a file named Commands.Ray (for example with vi) with the following content
LoadPairedEndReads SRR001665_1.fastq SRR001665_2.fastq 215 20
LoadPairedEndReads SRR001666_1.fastq SRR001666_2.fastq 215 20
OutputAmosFile

* Command line
mpirun -np 32 Ray Commands.Ray
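
The downloads above are gzip-compressed, while the command file refers to plain .fastq files, so a decompression step sits in between. A minimal Python sketch of that step and of writing the command file follows. The helper names are hypothetical, and it assumes that the two numbers after the file names are the mean fragment length and its standard deviation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst=None):
    """Decompress src (e.g. reads_1.fastq.gz) to dst (default: src without .gz)."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def write_commands(libraries, path="Commands.Ray", amos=True):
    """Write one LoadPairedEndReads line per (left, right, mean, sd) tuple.

    mean and sd are assumed to be the fragment length and its standard
    deviation; amos=True appends the OutputAmosFile command.
    """
    lines = [f"LoadPairedEndReads {left} {right} {mean} {sd}"
             for left, right, mean, sd in libraries]
    if amos:
        lines.append("OutputAmosFile")
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


# Example (not executed here):
# write_commands([("SRR001665_1.fastq", "SRR001665_2.fastq", 215, 20),
#                 ("SRR001666_1.fastq", "SRR001666_2.fastq", 215, 20)])
```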

**seb567** · 06-07-2010, 07:28 AM

Re:

Hi,

@ldong

What is the connectivity? (Infiniband's OK)

Which step does Ray reach before stalling?

Do dots keep being printed?

If the answer's no, then it might be the spin-lock bug in Open-MPI.

https://svn.open-mpi.org/trac/ompi/ticket/2043 (should be fixed in Milestone 1.4.3.)

Anyway, 1.2.8's very old. Current release's 1.4.2!

I also need to optimize communication for "Computing seeds" and later steps in Ray.

Recommendation: upgrade to 1.4.1 or 1.4.2

Cordially,

Sébastien

**seb567** · 06-07-2010, 07:33 AM

@ldong What is your 'computing power'?

**ldong** · 06-08-2010, 08:36 AM

Hi, Seb,

Thank you very much for your suggestions. Not sure if we can upgrade openmpi. I will check with system administrator.

We have a few nodes with 16 CPUs and 64 GB of memory each that I can use to test Ray. Here is what I found:

When I run Ray on two nodes with 32 processes, one process on the first node of the host list always climbs slowly to 25 GB of memory and is then killed by the system. The other processes never reach 1 GB.

It seems that 25 GB is a system limit. I will ask our administrator. What do you think? Best, ldong

**talioto** · 09-16-2010, 01:56 AM

hang? during "Extending seeds"

I compiled Ray with Open MPI 1.4.2, gcc version 4.1.2 20080704 (Red Hat 4.1.2-44), x86_64 architecture, and ran it with "mpirun -mca btl ^sm". The data is 3 simulated Illumina libraries comprising 52x coverage of a 225MB chromosome: 40x 500bp PE 95nt reads (inward facing), 8x 5kb mate paired 36nt reads (outward facing), 4x 10kb mate paired 36nt reads (outward facing).

Using 128 cores (16 8-core nodes), it runs fine up until the "Extending seeds" step. After a while the printing of the dots seems to slow down to glacial speeds. I've let it sit for several days with no progress. Is this an Open MPI problem, do you think? Any ideas on getting around this problem?
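
To give a sense of the size of this dataset, the number of reads per library can be estimated from coverage, genome size and read length. The Python sketch below is only that arithmetic. It assumes "225MB" means 225 million bases, and the function name is hypothetical.

```python
def reads_for_coverage(genome_size, coverage, read_length):
    """Number of reads needed so that coverage * genome_size bases are sequenced."""
    return coverage * genome_size / read_length


GENOME_SIZE = 225_000_000  # assumption: "225MB" read as 225 million bases

# (library description, coverage, read length in nt), taken from the post above
LIBRARIES = [
    ("500 bp paired-end", 40, 95),
    ("5 kb mate-pair", 8, 36),
    ("10 kb mate-pair", 4, 36),
]

for name, coverage, read_length in LIBRARIES:
    n = reads_for_coverage(GENOME_SIZE, coverage, read_length)
    print(f"{name}: about {n / 1e6:.1f} million reads")
```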

**baihezimu** · 10-11-2010, 07:28 AM

Can Ray handle paired-end reads that are stored in the same file?

**seb567** · 10-20-2010, 07:55 AM

Replies

@talioto What is the interconnection?

@baihezimu No.
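
A possible workaround is to split the single file into two files before running Ray. The Python sketch below assumes the file is interleaved FASTQ (mate 1 and mate 2 alternate, each record exactly 4 lines). The function name is hypothetical and not part of Ray.

```python
def split_interleaved(fin, left_out, right_out):
    """Copy alternating 4-line FASTQ records from fin to left_out and right_out.

    Returns the number of pairs written. Raises ValueError if the number of
    records is odd, which would mean a mate is missing.
    """
    outputs = (left_out, right_out)
    count = 0
    for count, record in enumerate(zip(*[iter(fin)] * 4), start=1):
        outputs[(count - 1) % 2].writelines(record)
    if count % 2:
        raise ValueError("odd number of records: input is not interleaved pairs")
    return count // 2


# Example (not executed here):
# with open("pairs.fastq") as fin, open("left.fastq", "w") as l, open("right.fastq", "w") as r:
#     split_interleaved(fin, l, r)
```

The two resulting files can then be given to LoadPairedEndReads as in the E. coli example.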

**seb567** · 10-20-2010, 07:55 AM

Ray paper is finally available

Sébastien Boisvert, François Laviolette, Jacques Corbeil.
Ray: Simultaneous Assembly of Reads from a Mix of High-Throughput Sequencing Technologies
Journal of Computational Biology
Not available yet; ahead of print.
doi:10.1089/cmb.2009.0238

**omaha420** · 10-27-2010, 08:49 PM

Congrats on e-publication

Congratulations on Ray's epub!

I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.

As the publication is now complete, could you please provide some more details with respect to your processing of Human Chromosome 1 (under the limitations section of the project website).

Specifically, could you provide the output text for that run so that I could better ascertain:
a. the run time for that assembly on your hardware
b. the version of the Open-MPI Library used in that assembly

I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.

As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.

Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.

**seb567** · 11-03-2010, 07:11 PM

Response to 'Congrats on e-publication'

> Congratulations on Ray's epub!

Thanks !

> I learned of your work about 2 weeks ago, familiarized myself with the documentation you've provided, and successfully completed the E. coli assembly with sample data.

Now that is reproducible research !

> As the publication is now complete, could you please provide some more details with respect to your processing of Human Chromosome 1 (under the limitations section of the project website).

Well, I used Open-MPI 1.3.4 with shared memory disabled. Ray 0.0.7 was
utilized.

I simulated reads of length 50 at a depth of 50 for the human chromosome
1 (the largest). To do so, I used the simtools provided with Ray. To get
them, type 'make simtools'.

The wiki is misleading on this, because I actually used an MPI-enabled
Infiniband-connected computer. I'll correct that shortly. Precisely, 384
cores were used.

The Sun Grid Engine script follows.

[12@colosse2 0.0.7-run]$ cat Human-chr1-ompi-1.3.4-gcc.sh
#!/bin/bash
#$ -N Ray
#$ -P nne-790-aa
#$ -l h_rt=24:00:00
#$ -pe mpi 384
module load compilers/gcc/4.4.2 mpi/openmpi/1.3.4_gcc
/software/MPI/openmpi-1.3.4_gcc/bin/mpirun /home/12/Ray/tags/0.0.7/Ray /home/12/nne-790-aa/colosse.clumeq.ca/qsub/Ray-input.txt

If you ask why Open-MPI 1.3.4, it is because all other versions have
shared memory enabled on the said computer, and that Open-MPI 1.4.3 is
not available yet to users of the said computer.

The content of the command file:

[12@colosse2 0.0.7-run]$ cat Ray-input.txt
LoadSingleEndReads /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta

> Specifically, could you provide the output text for that run so that I could better ascertain:
> a. the run time for that assembly on your hardware
> b. the version of the Open-MPI Library used in that assembly

[12@colosse2 0.0.7-run]$ cat Ray.o876984
**************************************************
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see "gpl-3.0.txt" for details.
**************************************************

Ray Copyright (C) 2010  Sébastien Boisvert, Jacques Corbeil, François Laviolette
http://denovoassembler.sf.net/

AssemblyEngine: Ray 0.0.7
NumberOfRanks: 384
MPILibrary: Open-MPI 1.3.4
OperatingSystem: Linux

LoadSingleEndReads
Sequences: /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta

Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
Distributing sequences
Counting vertices
Loading /home/12/nne-790-aa/50xhs_ref_GRCh37_chr1.fa_fragments.fasta
Indexing sequences
Connecting vertices
MinimumCoverage: 5
PeakCoverage: 30
Computing seeds
Extending seeds
Computing fusions
Finishing fusions
Collecting fusions

Writing Ray-Contigs.fasta
140101 contigs/175230944 nucleotides
Elapsed time: 0 d 13 h 48 min 28 s
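
For anyone budgeting time on shared hardware, the figures in this output can be turned into core-hours and a mean contig length. The Python sketch below only does that arithmetic on the numbers printed above. The function name is hypothetical, and the regular expression assumes the "Elapsed time" line keeps the format shown.

```python
import re


def elapsed_seconds(text):
    """Parse a line such as 'Elapsed time: 0 d 13 h 48 min 28 s' into seconds."""
    match = re.search(r"(\d+) d (\d+) h (\d+) min (\d+) s", text)
    if match is None:
        raise ValueError("no elapsed time found")
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


RANKS = 384                # NumberOfRanks from the output
CONTIGS = 140101           # from "140101 contigs/175230944 nucleotides"
NUCLEOTIDES = 175230944

seconds = elapsed_seconds("Elapsed time: 0 d 13 h 48 min 28 s")
core_hours = RANKS * seconds / 3600
mean_contig_length = NUCLEOTIDES / CONTIGS

print(f"wall time: {seconds} s")
print(f"core-hours: {core_hours:.0f}")
print(f"mean contig length: {mean_contig_length:.0f} nt")
```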

> I'd like to use Ray for assembly of sequencing reads for eukaryotes... and would like to know what, if any, potential problems to anticipate.

I have not myself extensively used Ray on eukaryotic sequence reads, so I am not really aware of potential pitfalls.

> As Open-MPI 1.5 has now been released, I'd like to know if the shared memory problem is still a concern when performing analyses of larger datasets. I believe that this has been fixed in versions > 1.4.1, but would like to know for certain if it is a problem with Ray before spending hours of analysis time on shared hardware.

You had better use Open-MPI 1.4.3, as it is a super stable release whereas Open-MPI 1.5 is a feature release. I only have access to Open-MPI 1.3.4 with shared memory disabled and Open-MPI 1.4.1 with defaults.

I should gain access to Open-MPI 1.4.3 with defaults in the next days/weeks.

> Thanks for both your work and time in addressing these questions-- your efforts are very much appreciated.

Thank you also for bringing up these questions.

Sébastien

**seb567** · 11-03-2010, 07:58 PM

Ray 0.1.0 is out

Dear de novo assembly enthusiasts:

Following the publication and some work over the last months, Ray 0.1.0
is now available incorporating (some) features requested as well as
improvements on speed (Extending seeds).

There is a full list of changes, based on the NEWS file.

v. 0.1.0
2010-11-03

* Moved some code from Machine.cpp to new files. (Ticket #116)
* Improved the speed of the extension of seeds by reducing the number of messages sent. (Tickets #164 & #490)
Thanks to all the people who reported this on the list !
* Ray is now verbose ! (Ticket #167)
Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
University, AUSTRALIA)
* The k-mer size can now be changed. Minimum value is 15 & maximum value is 32. (Tickets #169 & #483)
Feature requested by Dr. Torsten Seemann (Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
University, AUSTRALIA)
* Ray should work now on architectures requiring alignments of address on 8 bytes such as Itanium. (Ticket #446)
Bug reported by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
* Added reference to the paper in stdout. (Ticket #479)
* The coverage distribution is now always written. (Ticket #480)
* The code for extracting edges is now in a separate file (Ticket #486)
* Messages for paired reads are now grouped with messages for querying sequences in the extension of seeds. (Tickets #487 & #495)
* Messages for sequence reads are now done only once, when the read is initially discovered. (Ticket #488)
* Messages with tag TAG_HAS_PAIRED_READ are grouped with messages to get sequence reads. (Ticket #491)
* Added TimePrinter to print the elapsed time at each step. (Ticket #494)
* All generated files (AMOS, Contigs, and coverage distribution) are named following the -o parameter. (Ticket #426)
Feature requested by Jordi Camps Puchades (Centre Nacional d'Anàlisi Genòmica/CNAG)
* Print an exception if requested memory exceeds CHUNK_SIZE. That should never happen. (r3690)
* Print an exception if the system runs out of memory.
* Ray informs you on the number of k-mers for a k-mer size. (r3691)
* Unique IDs of sequence reads are now unsigned 64-bits integers. (r3710)
* The code is now in code/, scripts are now in scripts/. Examples are in scripts/examples/. (r3712)
* The compilation is more verbose. (r3714)
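
The changelog does not say why the maximum k-mer size is 32. One plausible reason, and this is an assumption rather than something stated above, is that with 2 bits per nucleotide a 32-mer fits exactly in one 64-bit integer. A minimal Python sketch of such a packing, with hypothetical function names and not taken from Ray's code:

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

MIN_K, MAX_K = 15, 32  # limits quoted in the changelog above


def pack_kmer(kmer):
    """Encode a k-mer as an integer using 2 bits per base (first base highest)."""
    if not MIN_K <= len(kmer) <= MAX_K:
        raise ValueError(f"k must be between {MIN_K} and {MAX_K}")
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value


def unpack_kmer(value, k):
    """Decode an integer produced by pack_kmer back into a k-mer of length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

With this encoding the largest 32-mer, 32 T's, maps to 2**64 - 1, the largest unsigned 64-bit value.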

Download it:

http://sourceforge.net/projects/denovoassembler/files/Ray-0.1.0.tar.bz2/download

I will update the wiki shortly with improved running times for the E.
coli dataset as well as in-depth simulation of paired reads on
chromosome 1 (with errors).

Thank you !

**pallo** · 11-08-2010, 12:44 AM

First of all: thanks for providing Ray. I am reading the paper and it sounds very promising.

I am testing version 0.1.0 (Open MPI 1.4.2, compiled with Intel 11.1) on 8 x 8-core/48 GB nodes. The data are 12 lanes of Illumina PE reads and two 454 runs from a bird species we are sequencing. For the first 14 hours Ray wrote tons of messages to ray.out, but it has been quiet for the past 36 hours while still keeping a 100% load on the nodes and using about 5 GB of memory for each job.

Is this silence to be expected, or is it a manifestation of the "spin-lock" bug mentioned above? Is there any way of checking that Ray is still running OK?
Cheers
Pallo

EDIT: OK, looking closer at the spin-lock bug reports, it only seems to affect GCC, so I'll try to be patient.

**seb567** · 11-08-2010, 06:56 AM

Where does it hang?

I think the bug 2043 was addressed in Open-MPI 1.4.3.

I don't know if ICC can produce the same problem though.

Yes, there is a way if you can log on to the worker nodes.

First, get the PIDs of the Ray processes:

ps aux|grep Ray

Then attach a gdb instance to one of them:

gdb -p <pid of a Ray instance>

Finally, print a backtrace in gdb:

bt

You will see which code is currently being executed.
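
The same steps can be scripted so that every Ray process on a node is sampled in one go. The Python sketch below is one way to do it under a few assumptions: the executable is named `Ray`, `ps` accepts `-eo pid,comm` (as procps on Linux does), and you are allowed to attach gdb to the processes. The function names are hypothetical.

```python
import subprocess


def ray_pids(ps_output, name="Ray"):
    """Return the PIDs in `ps -eo pid,comm` output whose command equals name."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == name:
            pids.append(int(fields[0]))
    return pids


def gdb_command(pid):
    """Build a non-interactive gdb call that attaches, prints a backtrace, exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def backtraces(name="Ray"):
    """Yield (pid, backtrace text) for each matching process on this node."""
    ps = subprocess.run(["ps", "-eo", "pid,comm"],
                        capture_output=True, text=True, check=True)
    for pid in ray_pids(ps.stdout, name):
        result = subprocess.run(gdb_command(pid), capture_output=True, text=True)
        yield pid, result.stdout


# Example (not executed here):
# for pid, trace in backtraces():
#     print(pid, trace, sep="\n")
```

Taking two samples a few minutes apart and comparing the backtraces gives a rough idea of whether a rank is making progress or sitting in the same call.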

What is your interconnection?

Infiniband or Gigabit Ethernet?

**pallo** · 11-08-2010, 11:09 PM

Hi,

The job had to be killed for other reasons, but here are the last lines of ray.out:

$ tail testrun/ray.out
Rank 51 stores an extension, 1354 vertices.
Rank 51 starts on a seed, length=106
Rank 43 starts on a seed, length=444
Rank 4 stores an extension, 1166 vertices.
Rank 4 starts on a seed, length=142
Rank 59 stores an extension, 152 vertices.
Rank 59 starts on a seed, length=211
Rank 6 stores an extension, 1095 vertices.
Rank 6 starts on a seed, length=89
Rank 0 starts on a seed, length=1175
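
Lines like these can also be used to check progress without attaching a debugger. The Python sketch below counts stored extensions per rank and lists the seed each rank last started on. It assumes the two message formats shown in this tail, and the function name is hypothetical.

```python
import re
from collections import Counter

START = re.compile(r"Rank (\d+) starts on a seed, length=(\d+)")
STORE = re.compile(r"Rank (\d+) stores an extension, (\d+) vertices\.")


def summarize(lines):
    """Return (stored, in_progress) from Ray's verbose extension messages.

    stored maps rank -> number of extensions stored so far.
    in_progress maps rank -> length of the seed it started on and has not
    yet reported as stored.
    """
    stored = Counter()
    in_progress = {}
    for line in lines:
        match = STORE.search(line)
        if match:
            rank = int(match.group(1))
            stored[rank] += 1
            in_progress.pop(rank, None)
            continue
        match = START.search(line)
        if match:
            in_progress[int(match.group(1))] = int(match.group(2))
    return stored, in_progress


# Example (not executed here):
# with open("testrun/ray.out") as handle:
#     stored, in_progress = summarize(handle)
```

Running it twice some minutes apart and comparing the totals shows whether any rank is still storing extensions.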

The interconnections are Infiniband.

I'm rerunning the job on a bigger set of nodes; I'll post the progress.

cheers
Pallo
