I made a lot of people angry with Ray today....
Hi seb567,
So, I had Ray 1.2.0 installed on our cluster, compiled the same way, with the Intel compiler. The job starts to run, then crashes... Apparently I took down 32 compute nodes today, according to the angry email I received from IT...
They sent me the following information about the job. I am told the problem is the huge amount of swap space I was using: nearly 1TB!
Here is the tail of my outfile:
So, now IT is really angry with me, I'm kinda proud of myself, but I still have a genome to try to assemble. What do you suggest?
Code:
Req[0]  TaskCount: 256  Partition: anon
Utilized Resources Per Task:  PROCS: 120.18  MEM: 2596M  SWAP: 881G
Avg Util Resources Per Task:  PROCS: 120.18
Max Util Resources Per Task:  PROCS: 237.11  MEM: 2596M  SWAP: 881G
Average Utilized Memory: 1641.30 MB
Average Utilized Procs: 48473.59
NodeSet=ONEOF:FEATURE:awesometown
NodeAccess: SINGLEJOB
NodeCount: 32
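For a rough sense of scale, the figures in that report can be turned into a per-node estimate. This is only a sketch under two assumptions that the report itself does not confirm: that the 256 tasks were spread evenly over the 32 nodes, and that the MEM value is resident memory per task. The function name `per_node_footprint` is made up for illustration.

```python
def per_node_footprint(task_count, node_count, mem_per_task_mb):
    """Estimate tasks and resident memory per node.

    Assumes tasks are spread evenly across nodes (an assumption,
    not something the scheduler report states).
    """
    tasks_per_node = task_count / node_count
    mem_per_node_mb = tasks_per_node * mem_per_task_mb
    return tasks_per_node, mem_per_node_mb


# Values copied from the job report: TaskCount, NodeCount, MEM per task.
tasks, mem_mb = per_node_footprint(256, 32, 2596)
print(f"{tasks:.0f} tasks per node, about {mem_mb / 1024:.1f} GB resident per node")
```

Comparing that estimate with the physical RAM on each node may help show whether the job was oversubscribed before it started swapping.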
Code:
Rank 230 is adding ingoing edges (reverse complement) 300001/1095111
[[18226,1],230][btl_openib_component.c:3224:handle_wc] from s54-5.local to: s54-15 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 246144128 opcode 0 vendor error 129 qp_idx 2
[s56-6.local:00534] 33 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
[s56-6.local:00534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[[18226,1],232][btl_openib_component.c:3224:handle_wc] from s54-4.local to: s55-8 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 13189504 opcode 32767 vendor error 244 qp_idx 0
[[18226,1],173][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s55-6 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 16690688 opcode 32767 vendor error 244 qp_idx 0
[[18226,1],237][btl_openib_component.c:3224:handle_wc] from s54-4.local to: s54-12 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 121666816 opcode 32767 vendor error 129 qp_idx 2
[s56-6.local:00534] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],234][btl_openib_component.c:3224:handle_wc] from s54-4.local to: s56-3 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 13992832 opcode 32767 vendor error 244 qp_idx 0
[[18226,1],175][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s55-13 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 113500032 opcode 32767 vendor error 129 qp_idx 2
[s56-6.local:00534] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],172][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s55-13 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 117999232 opcode 32767 vendor error 129 qp_idx 2
[[18226,1],170][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s55-13 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 59806336 opcode 1 vendor error 129 qp_idx 2
[s56-6.local:00534] 2 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],238][btl_openib_component.c:3224:handle_wc] from s54-4.local to: s55-16 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 17854848 opcode 128 vendor error 244 qp_idx 0
[[18226,1],215][btl_openib_component.c:3224:handle_wc] from s54-7.local to: s54-12 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 115086208 opcode 0 vendor error 129 qp_idx 2
[s56-6.local:00534] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],197][btl_openib_component.c:3224:handle_wc] from s54-9.local to: s54-12 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 102530048 opcode 32767 vendor error 129 qp_idx 2
[s56-6.local:00534] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],199][btl_openib_component.c:3224:handle_wc] from s54-9.local to: s54-12 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 61565696 opcode 32767 vendor error 129 qp_idx 2
[s56-6.local:00534] 1 more process has sent help message help-mpi-btl-openib.txt / pp retry exceeded
[[18226,1],171][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s54-15 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 9502080 opcode 128 vendor error 244 qp_idx 0
[[18226,1],196][btl_openib_component.c:3224:handle_wc] from s54-9.local to: s54-2 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 13076352 opcode 128 vendor error 244 qp_idx 0
[[18226,1],169][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s55-8 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 21503744 opcode 128 vendor error 244 qp_idx 0
[[18226,1],174][btl_openib_component.c:3224:handle_wc] from s54-12.local to: s56-6 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 11039872 opcode 128 vendor error 244 qp_idx 0
[[18226,1],194][btl_openib_component.c:3224:handle_wc] from s54-9.local to: s55-8 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 12326400 opcode 32767 vendor error 244 qp_idx 0
=>> PBS: job killed: node 26 (s54-7) requested job terminate, 'EOF' (code 1099) - internal or network failure attempting to communicate with sister MOM's
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
15 total processes killed (some possibly by mpirun during cleanup)
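Output like this is easier to read once the openib errors are tallied by status and by host, since a node that shows up on many failing connections is a natural place to start looking. The following is a minimal sketch, not a tool that ships with Ray or Open MPI: the regular expression is written against the record layout visible in this output only, and the names `RECORD` and `tally_openib_errors` are made up for illustration.

```python
import re
from collections import Counter

# One openib error record, as laid out in the output above.
# \s+ after "with" tolerates a record that was wrapped onto the next line.
RECORD = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<queue>LP|HP) CQ with\s+"
    r"status (?P<status>.+?) status number (?P<code>\d+)"
)


def tally_openib_errors(log_text):
    """Count error records by status string and by host (source or destination)."""
    by_status = Counter()
    by_host = Counter()
    for match in RECORD.finditer(log_text):
        by_status[match.group("status")] += 1
        # Strip the ".local" suffix so both endpoint spellings count as one host.
        by_host[match.group("src").split(".")[0]] += 1
        by_host[match.group("dst").split(".")[0]] += 1
    return by_status, by_host
```

Passing the whole outfile as one string and printing `by_host.most_common(5)` lists the hosts involved in the most failing connections, which could then be checked against the nodes IT reported as down.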