OpenMPI custom fault tolerance for loosely coupled parallel processes
I run computations on the Amazon EC2 platform, using multiple machines
connected through OpenMPI. To reduce the cost of the computation, I use
spot instances, which are automatically shut down when the price of a
machine rises above a preset maximum:
http://aws.amazon.com/ec2/spot-instances/ . A weird behaviour occurs: when
a machine is shut down, the other processes in the MPI communicator
continue to run. I think the network interfaces are silenced before the
process has time to tell the other processes that it has received a kill
signal.
I have read in multiple posts that MPI provides few high-level facilities
for fault tolerance. On the other hand, the structure of my program is
very simple: slave processes query a master process for permission to
execute a portion of code. The master process only keeps track of the
number of queries it has replied to, and tells the slaves to stop when an
upper limit is reached. There is no coupling between the slaves.
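To make the protocol concrete, the master's logic boils down to something
like the sketch below. It is plain Python with no MPI calls, and the names
(Master, request_permission) are made up for this illustration; in the real
program each call would correspond to one receive/reply exchange with a
slave.

```python
class Master:
    """Counts granted requests and refuses once the limit is reached."""

    def __init__(self, limit):
        self.limit = limit    # upper limit on the number of grants
        self.granted = 0      # number of queries answered with "go"

    def request_permission(self, slave_rank):
        """Return True if the slave may run one more portion, False to stop."""
        # slave_rank is unused here: the master does not distinguish slaves.
        if self.granted >= self.limit:
            return False
        self.granted += 1
        return True


# Two slaves (ranks 1 and 2) asking in turn, with a limit of three grants.
master = Master(limit=3)
replies = [master.request_permission(rank) for rank in (1, 2, 1, 2)]
print(replies)  # [True, True, True, False]
```

Note that this counter does not remember which slave received which grant,
which is why a silently dead slave currently goes unnoticed.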
I would like to be able to detect when a process has silently died as
described above. In that case I would reassign the work it was doing to a
slave that is still alive. Is there a simple way to check whether a
process has died? I have thought of using threads and sockets to do that
independently of the MPI layer, but that seems cumbersome. I also thought
of maintaining on the master process (which is launched on a non-spot
instance) a list of the time of last communication with each process, and
specifying a timeout, but that would not guarantee that a slave process is
actually dead. There is also the problem that the "barrier" and "finalize"
functions will not see all the processes, and may hang.
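The timeout idea could look roughly like the following sketch. It is plain
Python with no MPI calls; TimeoutTracker, its methods, the work unit labels
and the injectable clock are all hypothetical names chosen for this example.
It only marks a slave as suspect and requeues its work; it cannot prove
that the slave is dead.

```python
import time


class TimeoutTracker:
    """Tracks last contact per slave and the work unit each one holds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # rank -> time of last message
        self.in_progress = {}   # rank -> work unit currently assigned
        self.pending = []       # work units waiting to be reassigned

    def heard_from(self, rank):
        self.last_seen[rank] = self.clock()

    def assign(self, rank, unit):
        self.in_progress[rank] = unit
        self.heard_from(rank)

    def finished(self, rank):
        self.in_progress.pop(rank, None)
        self.heard_from(rank)

    def reap(self):
        """Requeue the work of slaves silent for longer than the timeout."""
        now = self.clock()
        suspects = [r for r, t in self.last_seen.items()
                    if now - t > self.timeout]
        for rank in suspects:
            del self.last_seen[rank]
            unit = self.in_progress.pop(rank, None)
            if unit is not None:
                self.pending.append(unit)
        return suspects


# Simulated run with a fake clock instead of real waiting.
fake_now = [0.0]
tracker = TimeoutTracker(timeout=10.0, clock=lambda: fake_now[0])
tracker.assign(1, "unit-A")
tracker.assign(2, "unit-B")
fake_now[0] = 8.0
tracker.finished(2)          # slave 2 reports back at t=8
fake_now[0] = 15.0
suspects = tracker.reap()    # slave 1 has been silent for 15 > 10
print(suspects, tracker.pending)  # [1] ['unit-A']
```

A slave that is merely slow could still deliver its result after its work
has been requeued, so this only works if running a portion twice is
harmless, which I assume is the case here since the slaves are not coupled.
It also does nothing about the "barrier" and "finalize" problem.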
My question is therefore: what kind of solution would you implement to
detect that processes have silently died? And how would you modify the
rest of the code to work with a reduced number of processes?