Handling Failure

Batcht has been designed to deal with two types of failures: head node failure and compute node failure. Handling these failure modes requires minimal effort from the user or client software, though additional effort may be required to minimize the effects of a failure.

Compute Node Failure

Detection

Each compute node sends a “heartbeat” to its head node every few seconds. A compute node is considered to have failed if the head node hasn’t received a heartbeat for some number of seconds; this interval is the compute node’s timeout setting.
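
As an illustration, the detection logic amounts to tracking the time of each node’s last heartbeat and flagging any node whose heartbeat is older than its timeout. The sketch below is a simplified Python model, not the actual Batcht implementation; the setting name and value are assumptions.

    import time

    # Illustration of the timeout check described above, not the actual
    # Batcht implementation; the setting name and value are assumptions.
    COMPUTE_NODE_TIMEOUT = 30  # seconds

    last_heartbeat = {}  # node id -> time of last heartbeat received

    def record_heartbeat(node_id):
        """Called on the head node whenever a heartbeat arrives."""
        last_heartbeat[node_id] = time.monotonic()

    def failed_nodes():
        """Return ids of nodes whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [node_id for node_id, seen in last_heartbeat.items()
                if now - seen > COMPUTE_NODE_TIMEOUT]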

Causes

Because detection depends on the heartbeat mechanism described above, a failure can be caused by anything that prevents heartbeat messages from being received: a hardware failure, a software failure, or a problem with the private network.

Mitigation

For long-running jobs, you may want to design your software to store partial results as it runs in order to speed up recovery after the job is restarted on a new compute node.
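
One common pattern is to write a small checkpoint file periodically and resume from it when the job restarts. The sketch below is illustrative only; the checkpoint path, format, and process_item function are application-specific assumptions, not part of Batcht.

    import json
    import os

    # Illustrative checkpointing pattern; the file name, format, and
    # process_item() are application-specific assumptions, not part of Batcht.
    CHECKPOINT_PATH = "checkpoint.json"

    def process_item(item):
        return item * 2  # placeholder for your job's real work

    def load_checkpoint():
        """Resume from a previous partial result if one exists."""
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH) as f:
                return json.load(f)
        return {"next_item": 0, "results": []}

    def save_checkpoint(state):
        """Write partial results atomically so a crash cannot corrupt them."""
        tmp = CHECKPOINT_PATH + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CHECKPOINT_PATH)

    def run_job(items):
        state = load_checkpoint()
        for i in range(state["next_item"], len(items)):
            state["results"].append(process_item(items[i]))
            state["next_item"] = i + 1
            save_checkpoint(state)  # or checkpoint every N items to reduce overhead
        return state["results"]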

Recovery

Compute node failures are automatically detected and recovered from if you have set a compute node timeout greater than zero. In this case the failed compute node will be destroyed, and a new compute node will be spawned to replace it if necessary. If the node was running a job, the job will be put back into the job queue.
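
The following is a simplified Python model of those recovery steps; the class and function names are hypothetical placeholders, not the Batcht API.

    from queue import Queue

    # Illustrative model of the recovery steps described above; none of these
    # names correspond to the actual Batcht implementation.
    class ComputeNode:
        def __init__(self, node_id, running_job=None):
            self.node_id = node_id
            self.running_job = running_job

    def recover_from_failure(failed_node, timeout, job_queue, spawn_replacement):
        """Destroy the failed node, replace it if needed, and requeue its job."""
        if timeout <= 0:
            return None  # automatic recovery is disabled
        # The failed node is destroyed; in this model that just means dropping it.
        replacement = spawn_replacement()  # spawn a new node if one is needed
        if failed_node.running_job is not None:
            job_queue.put(failed_node.running_job)  # the job goes back in the queue
        return replacement

    # Example: requeue the job of a node that stopped heartbeating.
    q = Queue()
    node = ComputeNode("node-7", running_job="job-42")
    recover_from_failure(node, timeout=30, job_queue=q,
                         spawn_replacement=lambda: ComputeNode("node-8"))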

Head Node Failure

Detection

Head nodes send a heartbeat message to a master server a few times a minute. A head node is considered to have failed if the master server hasn’t received a heartbeat for some number of seconds. This is the head node’s timeout setting.

In contrast to compute node heartbeats, the head node’s heartbeat messages are sent over the internet to an external server, so they may be somewhat less reliable than the heartbeats sent over the private network.

Causes

The causes for a head node failure are the same as those for a compute node failure.

Mitigation

Head node failures are more serious than compute node failures because all currently running jobs will be cancelled. However, the same mitigation techniques described for compute node failures apply here.

Recovery

Recovering from a head node failure requires action on the client’s part. Because the head node will be destroyed and replaced, each client will need to re-initialize with the master server to get the new head node address and certificate.

An appropriate response to a suspected head node failure is to sleep for a minute or two, then initialize with the master server. If the initialization fails, retry it after a timeout.
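
A minimal sketch of that retry loop is shown below, assuming a hypothetical initialize_with_master callable that raises an exception on failure; the real client interface may differ.

    import time

    # Hypothetical client-side retry loop for re-initializing after a suspected
    # head node failure. initialize_with_master is a placeholder for whatever
    # call your client uses to contact the master server; it is assumed to
    # raise an exception on failure and to return the new head node address
    # and certificate on success.
    def reinitialize(initialize_with_master, initial_delay=90, retry_delay=60):
        time.sleep(initial_delay)  # give the master server time to replace the head node
        while True:
            try:
                return initialize_with_master()
            except Exception:
                time.sleep(retry_delay)  # retry after a timeout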

See the command line interface documentation for more details.