In this project, I built a small Linux cluster to practice system design, deployment, automation, validation, recovery, and general operations in a small-scale HPC environment. I built a controller node (head/login) that manages cluster services and a worker compute node that executes scheduled workloads. My goal was to practice systems engineering workflows used in HPC: deterministic networking, repeatable configuration, scheduler-driven execution, authentication, validation, troubleshooting, and recovery.
I created two nodes: the head node orchestrates, and the compute node executes. The idea is that compute nodes should be as consistent and reproducible as possible, and changes should be applied via automation rather than manual edits.
Clusters rely on internal networking where nodes can reach each other reliably, and hostnames resolve consistently. I set up a private host-only cluster network for node-to-node traffic and a NAT interface for package installs.
Each VM uses two adapters: a host-only adapter on the private cluster network and a NAT adapter for outbound internet access. I assigned static IPs on the host-only interface with nmcli:
Head node:
sudo nmcli con mod hostonly \
ipv4.method manual \
ipv4.addresses 192.168.56.10/24
sudo nmcli con up hostonly
Compute node:
sudo nmcli con mod hostonly \
ipv4.method manual \
ipv4.addresses 192.168.56.11/24
sudo nmcli con up hostonly
I set hostnames and /etc/hosts so that both nodes can resolve each other consistently.
sudo hostnamectl set-hostname head # on head node
sudo hostnamectl set-hostname cone # on compute node (c-one)
Edited /etc/hosts on both nodes:
192.168.56.10 head
192.168.56.11 cone
ping -c 3 cone # from head node
ping -c 3 head # from compute node
I installed baseline updates, essential utilities, and time synchronization. Distributed systems depend on synchronized clocks; clock drift, for example, will break MUNGE authentication.
sudo dnf -y update
sudo dnf -y install vim git curl wget chrony
sudo systemctl enable --now chronyd
chronyc tracking
Ansible requires a control point that can reach all nodes. I configured passwordless SSH from the head node to the compute node so that automation can run unattended and repeatably in my controller/worker setup.
ssh-keygen -t ed25519 # generate key on head
ssh-copy-id youruser@cone # copy to compute node
ssh cone # test passwordless login
Pinning the connection details in ~/.ssh/config on the head node:
Host cone
HostName cone
User youruser
Ansible turns node configuration into versionable, repeatable infrastructure code. The head node acts as the controller, applying changes to the compute node(s) consistently. My inventory:
[controllers]
head ansible_host=192.168.56.10 ansible_connection=local
[compute]
cone ansible_host=192.168.56.11
[all:vars]
ansible_user=youruser
ansible_become=true
Enabling passwordless sudo for the automation user (edited safely via visudo, which validates syntax before saving):
sudo visudo
youruser ALL=(ALL) NOPASSWD: ALL
Now, instead of installing common packages manually on each node, I can define a baseline configuration and push it via `ansible-playbook`. This ensures both nodes have consistent tooling (Python, Chrony, etc.), and because the tasks are idempotent, re-running the playbook is safe.
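As a sketch, a baseline playbook along these lines keeps both nodes consistent (the file name, inventory path, and exact package list here are my assumptions, inferred from the tooling installed earlier):

```yaml
# baseline.yml — illustrative sketch, not the exact playbook used in the project
- hosts: all
  become: true
  tasks:
    - name: Install baseline packages
      ansible.builtin.dnf:
        name: [python3, chrony, vim, git, curl, wget]
        state: present

    - name: Ensure time sync is running
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true
```

Re-running `ansible-playbook -i inventory.ini baseline.yml` converges both nodes to the same state without side effects.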
Slurm provides scheduling: users request resources, and the scheduler decides where and when jobs run. MUNGE provides authentication between Slurm services. I used Ansible to install required packages across the cluster, then configure MUNGE so the Slurm controller and daemons can authenticate securely.
Enabling the CRB and EPEL repositories, then installing the Slurm packages (the slurmctld controller daemon and the slurmd compute daemon) and munge.
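A sketch of those install steps as Ansible tasks (repository and package names vary by distribution; these are assumptions for a RHEL-family system, not the project's exact playbook):

```yaml
# install_slurm.yml — illustrative sketch
- hosts: all
  become: true
  tasks:
    - name: Enable the CRB repository
      ansible.builtin.command: dnf config-manager --set-enabled crb
      changed_when: false

    - name: Install EPEL
      ansible.builtin.dnf:
        name: epel-release
        state: present

    - name: Install Slurm and MUNGE
      ansible.builtin.dnf:
        name: [slurm, slurm-slurmctld, slurm-slurmd, munge]
        state: present
```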
MUNGE uses a shared key (/etc/munge/munge.key) across the cluster. This key must be byte-identical on every node, owned by the munge user, and readable by no one else.
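To illustrate the size and permission requirements, here is a minimal sketch that generates a key into the working directory (the `./munge.key` path is illustrative; on the real nodes the file lives at /etc/munge/munge.key, owned munge:munge, and is distributed to every node before the daemons start):

```shell
# Sketch: create a 1 KiB random key and lock down its permissions.
dd if=/dev/urandom of=./munge.key bs=1 count=1024 2>/dev/null
chmod 400 ./munge.key
stat -c '%a %s' ./munge.key    # prints "400 1024": mode 0400, 1024 bytes
```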
Slurm is configured via slurm.conf, which defines cluster name, controller, authentication method, nodes, and partitions. Configuration lives in source control and is deployed to nodes via Ansible (push_slurm_conf.yml), not hand-edited on each node. Once services are started, I verify that the controller sees the compute node and can schedule jobs to it.
sinfo
srun hostname
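For reference, the slurm.conf behind these checks looked roughly like this (the node and controller hostnames match the cluster; the cluster name, CPU count, and partition name are illustrative placeholders):

```
# slurm.conf — abbreviated sketch
ClusterName=minicluster
SlurmctldHost=head
AuthType=auth/munge

NodeName=cone CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=cone Default=YES MaxTime=INFINITE State=UP
```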
I increased the compute node’s CPU allocation, updated slurm.conf, and validated that Slurm could schedule multi-task jobs. During this change, after stepping away from the project overnight, I came back to a downed node. Using logs and time-sync tools, I traced the problem to clock drift: MUNGE credentials were being rejected, so Slurm marked the node unavailable. I corrected the system time, restarted services, and revalidated scheduling.
Diagnosing the downed node:
sudo journalctl -u slurmd -xe --no-pager | tail -n 40
timedatectl
chronyc tracking
date
Logs indicated:
Munge decode failed: expired credential
MUNGE credentials are time-sensitive. If node clocks drift far enough apart, credentials generated on one node can be rejected on another as “expired.”
Solution:
Re-sync time on both nodes:
sudo systemctl restart chronyd
sudo chronyc makestep
Restart MUNGE:
sudo systemctl restart munge
Restart Slurm daemons:
sudo systemctl restart slurmd # on compute node
sudo systemctl restart slurmctld # on head
Alternatively, restart everything with an Ansible playbook.
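Such a playbook might look like this (the file name and play layout are my assumptions; it mirrors the manual restart order above, compute daemons before the controller):

```yaml
# restart_slurm.yml — illustrative sketch
- hosts: compute
  become: true
  tasks:
    - name: Restart munge and slurmd on compute nodes
      ansible.builtin.service:
        name: "{{ item }}"
        state: restarted
      loop: [munge, slurmd]

- hosts: controllers
  become: true
  tasks:
    - name: Restart slurmctld on the head node
      ansible.builtin.service:
        name: slurmctld
        state: restarted
```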
In this project, I’m using MPI as a health check. Running MPI under Slurm proves that the scheduler can allocate resources and launch a parallel workload reliably.
I use ansible-playbook to install OpenMPI on all nodes, compile the validation binary on the controller, distribute it to all nodes, then run it through srun.
Finally, I simulated a node failure by stopping the compute daemon, observed the scheduler's state changes and job failure behavior, then restored services and re-ran validation.
On compute node, stop slurmd:
sudo systemctl stop slurmd
On head node, observe the impact:
sinfo
scontrol show node cone
srun --chdir=/tmp -n 2 /opt/mpi/mpi-validate/mpi_validate
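To make this observation step repeatable, a tiny helper can scan output in the shape of `sinfo -h -o '%n %t'` (node name and compact state per line) and flag anything unhealthy. The helper and its list of "healthy" states are my own sketch, not part of Slurm:

```shell
# Sketch: flag nodes whose sinfo state is not idle/alloc/mix.
check_nodes() {
  while read -r node state; do
    case "$state" in
      idle|alloc|mix) : ;;                      # healthy: nothing to report
      *) echo "ATTENTION: $node is $state" ;;   # down, drain, unk, etc.
    esac
  done
}

# Example with canned input (a live run would pipe sinfo into check_nodes):
printf 'head idle\ncone down*\n' | check_nodes
# → ATTENTION: cone is down*
```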
Bring the node back up and re-validate MPI:
sudo systemctl start slurmd # on cone
sinfo # on head
scontrol show node cone
srun --chdir=/tmp -n 2 /opt/mpi/mpi-validate/mpi_validate