A batch compute cluster consists of a head node, a job queue, a shared file system, and a number of compute nodes.
The head node coordinates the cluster: it accepts jobs submitted for processing, maintains the job queue, and hands queued jobs to compute nodes.
The job queue keeps track of jobs submitted to the cluster for processing. Jobs are assigned to compute nodes based on each job's priority and submission time.
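The document does not specify the queue's exact ordering policy, but scheduling by priority with submission time as a tie-breaker can be sketched with a priority heap. The function names and the numeric priority convention below are illustrative assumptions, not part of the cluster's actual interface:

```python
import heapq
import itertools

counter = itertools.count()  # monotonically increasing submission stamp

def submit(queue, priority, job_name):
    # Lower number = higher priority (an assumed convention);
    # the counter breaks ties in submission order.
    heapq.heappush(queue, (priority, next(counter), job_name))

def next_job(queue):
    # Pop the highest-priority, earliest-submitted job.
    priority, seq, job_name = heapq.heappop(queue)
    return job_name

queue = []
submit(queue, 5, "render")
submit(queue, 1, "urgent-fix")
submit(queue, 5, "backup")

order = [next_job(queue) for _ in range(3)]
print(order)  # ['urgent-fix', 'render', 'backup']
```

Jobs with equal priority come out in submission order, which matches the queue behavior described above.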
The shared file system resides on network-attached storage so that data survives a head node failure. The job queue, job data, and any shared data are stored on this device. The file system is exported to the compute nodes over NFSv4 on a private network.
Each compute node has its own processor, memory, and file system.
Compute nodes acquire jobs from the head node and process them. Each compute node has full access to the shared file system.
A job consists of a directory containing a script, run.sh, that runs the job, and any number of additional files. When a job is run on a compute node, the script run.sh is invoked with the job's directory as its working directory. Any files created in the job's directory may be retrieved by the user who submitted the job.
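The job layout above can be illustrated with a short sketch: build a job directory containing a run.sh, execute the script with that directory as its working directory (as a compute node would), and observe that output files land in the job directory. Everything here except the name run.sh is a hypothetical example, and a POSIX shell is assumed to be available:

```python
import os
import stat
import subprocess
import tempfile

# Hypothetical job directory; a real job would be placed on the shared file system.
jobdir = tempfile.mkdtemp(prefix="job-")
script = os.path.join(jobdir, "run.sh")
with open(script, "w") as f:
    f.write("#!/bin/sh\n")
    f.write("echo done > output.txt\n")  # relative path: lands in the job directory
os.chmod(script, stat.S_IRWXU)  # make the script executable

# Run the script in its own directory, as the compute node does.
subprocess.run(["./run.sh"], cwd=jobdir, check=True)

# output.txt now sits alongside run.sh, ready for retrieval.
print(sorted(os.listdir(jobdir)))  # ['output.txt', 'run.sh']
```

Because run.sh writes with a relative path, its output stays inside the job's directory, which is what makes the results retrievable afterward.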