HPC Health Observations

An open reference for where we begin investigating HPC environments

About this guide

This is an open, living reference describing how the Concertim team begins investigating HPC health observations. It's shared so customers and peers can compare our practices against their own — not as a prescription of what anyone must do. Each entry captures an observation, what it can mean for the service, and where we'd start looking into it.

Why no priority or severity labels? Because what's urgent depends entirely on the specific customer, site, and the assets an observation touches. A filesystem fill level that counts as "critical" on one cluster may be routine on another. We've deliberately left severity out of this guide to avoid implying a fixed ranking; we recommend triaging each observation against your own environment.

Topics

Observations are grouped into three topics: Vitals (essential health indicators), Security, and Performance. Use the topic pills to focus on an area; the scope pills and search narrow the results further. You can browse the grid or switch to a focused walkthrough, where the arrow keys move between entries.

Some entries are still being written and are badged Draft — they're shown so the gaps are visible, not hidden.

Prefer a printable copy for offline use? Download the companion PDF.