I think I found the reason behind all the hanging.
I myself experienced the hanging with shared memory disabled, using 384 cores (Xeon).
I believe the most likely cause is an MPI rank being flooded with messages and becoming unable to respond.
I am currently testing regularization of message sending in the extension of seeds, that is, enforcing a fixed number of microseconds between consecutive messages.
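To make the idea concrete, here is a minimal sketch of pacing sends with a minimum gap. It is not Ray's actual code (Ray uses MPI directly); the names SendThrottle, send_all and min_gap_us are hypothetical, and the send callback is only a stand-in for whatever call really sends a message.

```python
import time


class SendThrottle:
    """Enforce a minimum gap between consecutive sends (hypothetical helper)."""

    def __init__(self, min_gap_us, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap_us / 1_000_000.0  # microseconds -> seconds
        self.clock = clock  # injectable so the pacing can be tested
        self.sleep = sleep
        self.last_send = None  # time of the previous send, None before the first

    def wait_turn(self):
        """Block until at least min_gap has elapsed since the previous send."""
        now = self.clock()
        if self.last_send is not None:
            remaining = self.min_gap - (now - self.last_send)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_send = now


def send_all(messages, send, throttle):
    """Send every message through `send`, pacing them with `throttle`."""
    for message in messages:
        throttle.wait_turn()
        send(message)  # assumption: stands in for the real send call
```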
If that fails, Ray will simply do the extension of seeds one MPI rank after another.
In the other steps of the algorithm (distribution of vertices, for example), the messages are sent uniformly, as far as I have observed.
With the detailed information you provided, I can safely say that running on more cores won't change anything with Ray 0.1.0 and below.
Thank you.