Batch Overview


Components

A batch compute cluster consists of a head node, a job queue, a shared file system, and a number of compute nodes.

Head Node

The head node coordinates the cluster. It performs three main functions:

  1. Managing the job queue: Allowing users to submit and manage jobs, and allowing compute nodes to acquire jobs for processing.
  2. Managing shared storage: Giving the compute nodes access to the shared file system.
  3. Managing compute nodes: Monitoring the health of compute nodes, and adding or removing compute nodes as the job queue grows or shrinks.
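
The compute-node management duty can be pictured as two small decisions: flagging nodes that have stopped checking in, and sizing the pool to the queue. This is a minimal Python sketch, not the cluster's actual policy; the heartbeat mechanism, the 60-second timeout, the jobs-per-node ratio, and the node limits are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    last_heartbeat: float  # time (seconds) of the node's last check-in; hypothetical mechanism


def unhealthy_nodes(nodes, now, timeout=60.0):
    """Return names of nodes that have not sent a heartbeat within `timeout` seconds."""
    return [n.name for n in nodes if now - n.last_heartbeat > timeout]


def target_node_count(queued_jobs, jobs_per_node=4, min_nodes=1, max_nodes=32):
    """Enough nodes for the queued work, clamped to [min_nodes, max_nodes]."""
    wanted = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))


def scaling_delta(queued_jobs, healthy_nodes, **limits):
    """Positive: nodes to add. Negative: nodes to remove. Zero: no change."""
    return target_node_count(queued_jobs, **limits) - healthy_nodes
```

For example, with 10 queued jobs and one healthy node, scaling_delta(10, healthy_nodes=1) asks for 2 more nodes, while an empty queue shrinks the pool back toward min_nodes.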

Job Queue

The job queue keeps track of jobs submitted to the cluster for processing. Jobs are assigned to compute nodes depending on their priority and submission time.
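
One way to order jobs by priority and submission time is a binary heap. The sketch below assumes (the text does not say) that a larger priority number runs first and that jobs of equal priority run in submission order; the JobQueue class and its methods are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Hypothetical queue: highest priority first, then earliest submission."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # breaks ties between identical timestamps

    def submit(self, job_id, priority, submitted_at):
        # heapq is a min-heap, so priority is negated to pop the highest priority first.
        heapq.heappush(self._heap, (-priority, submitted_at, next(self._seq), job_id))

    def acquire(self):
        """Hand the next job to a compute node, or return None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)
```

With jobs a (priority 1), b (priority 5, submitted at time 20), and c (priority 5, submitted at time 15), acquire() returns c, then b, then a.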

Shared File System

The shared file system resides on network-attached storage so that its data survives a head node failure. The job queue, job data, and any other shared data are stored there. The file system is made available to the compute nodes over NFSv4 on a private network.
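
A compute node might confirm that the shared file system is mounted over NFSv4 before accepting jobs. This Python sketch parses the Linux /proc/mounts format; the startup check itself and the mount point /shared used below are assumptions, not part of the described setup.

```python
def is_nfs4_mount(mounts_text, mount_point):
    """Return True if mount_point is listed in mounts_text (/proc/mounts format) as NFSv4."""
    result = False
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, fstype, options, dump, pass
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        fstype, options = fields[2], fields[3].split(",")
        # A later entry for the same mount point shadows earlier ones, so keep the last match.
        result = fstype == "nfs4" or (
            fstype == "nfs" and any(o.startswith("vers=4") for o in options)
        )
    return result
```

On a compute node it would be given the contents of /proc/mounts and the hypothetical mount point "/shared".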

Compute Nodes

Each compute node has its own processor, memory, and local file system.

Compute nodes acquire jobs from the head node and process them. Each compute node has full access to the shared file system.
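
The acquire-and-process cycle of a compute node can be pictured as a polling loop. In this Python sketch, acquire stands in for whatever request the head node actually serves (hypothetical), and run processes one job and returns its exit status; the polling interval and idle limit are illustrative.

```python
import time
from typing import Callable, Optional


def worker_loop(acquire: Callable[[], Optional[str]],
                run: Callable[[str], int],
                idle_wait: float = 5.0,
                max_idle_polls: Optional[int] = None) -> list:
    """Repeatedly acquire a job from the head node and process it.

    acquire: hypothetical request for the next job's directory on the shared
             file system; returns None when the queue is empty.
    run:     processes one job directory and returns its exit status.
    Stops after max_idle_polls consecutive empty polls (None means never stop).
    Returns (job_dir, exit_status) pairs for the jobs processed.
    """
    results = []
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        job_dir = acquire()
        if job_dir is None:
            idle += 1
            time.sleep(idle_wait)  # back off before polling the head node again
            continue
        idle = 0
        results.append((job_dir, run(job_dir)))
    return results
```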


Jobs

A job is a directory containing a script named run.sh, which runs the job, along with any number of additional files.
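
A submission-time check that a directory matches this layout might look like the following. job_files is a hypothetical helper, not part of the cluster's documented interface.

```python
from pathlib import Path


def job_files(job_dir):
    """Return the names of a job's additional entries, raising if the layout is invalid.

    A valid job is a directory that contains run.sh; every other entry is
    treated as one of the job's additional files.
    """
    job_dir = Path(job_dir)
    if not job_dir.is_dir():
        raise NotADirectoryError(f"{job_dir} is not a directory")
    if not (job_dir / "run.sh").is_file():
        raise FileNotFoundError(f"{job_dir} has no run.sh")
    return sorted(p.name for p in job_dir.iterdir() if p.name != "run.sh")
```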

When a job runs on a compute node, its run.sh script is executed with the job’s directory as the working directory. Any files the job creates in that directory can be retrieved by the user.
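
Running a job as described amounts to invoking run.sh with the job's directory as the working directory. This minimal Python sketch assumes the script is run with sh and that its standard output and error are saved to stdout.txt and stderr.txt in the job directory (both file names are hypothetical), so the user can retrieve them along with the job's other files.

```python
import subprocess
from pathlib import Path


def run_job(job_dir):
    """Run run.sh with the job directory as the working directory; return its exit status.

    Assumption: stdout/stderr are captured to stdout.txt and stderr.txt inside
    the job directory so they are retrievable like any other output file.
    """
    job_dir = Path(job_dir)
    with open(job_dir / "stdout.txt", "wb") as out, open(job_dir / "stderr.txt", "wb") as err:
        completed = subprocess.run(["sh", "run.sh"], cwd=job_dir, stdout=out, stderr=err)
    return completed.returncode
```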