Prometheus Alertmanager

Prometheus Alertmanager handles alerts generated by Prometheus and routes notifications to tools such as email, Slack, Microsoft Teams, or PagerDuty.

When your Prometheus server is configured to send alerts to Alertmanager, you can use MetricsHub alert rules to detect hardware, storage, and system issues.

Note
  • These alert rules are distinct from the internal alerts generated by MetricsHub and emitted as OpenTelemetry logs.
  • The alert rules described on this page are evaluated by Prometheus. When an alert fires, Prometheus sends it to Alertmanager, which is responsible for routing and notifications.
  • To view detailed alert descriptions and annotations, you must use the full Prometheus Alertmanager interface (typically available on port 9093). The lightweight Prometheus web UI does not display this additional alert information.
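
For reference, the link between Prometheus and Alertmanager is declared in the alerting section of prometheus.yml. The fragment below is a minimal sketch, assuming Alertmanager runs on the same host as Prometheus and listens on its default port 9093; adjust the target to match your environment.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Assumption: Alertmanager runs locally on its default port.
            - localhost:9093
```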

Available Alert Rules

The following rule sets are provided with MetricsHub:

Alert Rules: MetricsHub
When to Use: Always
Alerts Triggered When:
  • A host cannot be reached
  • A connector has failed
  • A protocol has failed
  • The MetricsHub Agent is not sending metrics

Alert Rules: Hardware
When to Use: When hardware monitoring is performed
Alerts Triggered When:
  • Battery charge is critically or abnormally low
  • Devices report high error rates (e.g. CPU, memory, disks, network)
  • Fan speed is too low
  • LUN has too few or no available paths
  • Network card error ratio is high
  • Physical disk endurance is low
  • Power supply usage is abnormally high
  • Temperature or voltage is out of range
  • A hardware device is missing, predicted to fail, degraded, or failing

Alert Rules: Storage
When to Use: When storage monitoring is performed
Alerts Triggered When:
  • Storage pools are close to saturation, oversubscribed, growing abnormally, or projected to run out of capacity
  • Volumes experience abnormal latency, or become stalled, degraded, or failed
  • Volumes are orphaned or mapped without consumer identity information
  • A volume monopolizes pool activity and impacts neighboring workloads
  • Controllers, storage networks, or physical disks experience abnormal latency or instability
  • Storage consumers become unexpectedly idle, generate abnormal I/O spikes, grow unusually fast, or are significantly over-provisioned

Alert Rules: System
When to Use: When system monitoring is performed
Alerts Triggered When:
  • CPU usage, file system utilization, memory usage, or bandwidth usage is abnormally high
  • Too many network errors are detected
  • The page fault rate stays high over an extended period of time

Install Alert Rules

To activate the alert rules:

  1. Copy the required configuration files into your Prometheus configuration directory:

    • config/metricshub-rules.yaml
    • config/metricshub-hardware-rules.yaml
    • config/metricshub-storage-rules.yaml
    • config/metricshub-system-rules.yaml
  2. Declare them in the prometheus.yml file:

    rule_files:
    - metricshub-rules.yaml
    - metricshub-hardware-rules.yaml
    - metricshub-storage-rules.yaml
    - metricshub-system-rules.yaml
  3. Restart your Prometheus server to take the new rules into account.
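
As an alternative to listing each file in step 2, Prometheus accepts glob patterns in rule_files. The sketch below assumes the four files keep the names listed in step 1 and sit in the same directory as prometheus.yml.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Matches metricshub-rules.yaml, metricshub-hardware-rules.yaml,
  # metricshub-storage-rules.yaml, and metricshub-system-rules.yaml.
  - "metricshub-*.yaml"
```

Note that the pattern also picks up any other file in that directory whose name starts with metricshub-, so the explicit list remains the safer choice if you only want some of the rule sets.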

Understanding Alert Rule Thresholds

MetricsHub alert rules use two types of thresholds:

  • Static thresholds: fixed values that apply to all devices. Example: battery charge below 30%.

  • Dynamic thresholds: device-specific threshold metrics exposed directly by the monitored hardware. Example: warning and critical temperature limits provided by the device itself.

Dynamic thresholds allow MetricsHub to adapt alerts automatically to different hardware vendors, models, and configurations.

The following examples illustrate how static and dynamic thresholds are implemented in Prometheus alert rules.

Static Threshold Example

For the hw_battery_charge_ratio metric:

  • a warning alert is triggered when the battery charge is at or below 0.5 (50%)
  • a critical alert is triggered when the battery charge is below 0.3 (30%)

Because Prometheus rules are evaluated independently, both alerts fire when the charge falls below 30%.

    - name: MetricsHub-Hardware-Battery-Charge
      rules:
        - alert: MetricsHub-Hardware-Battery-Charge-Warning
          expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 50
          for: 5m
          labels:
            severity: warning

        - alert: MetricsHub-Hardware-Battery-Charge-Critical
          expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 < 30
          for: 5m
          labels:
            severity: critical
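
If receiving both notifications for the same battery is undesirable, one option is to let Alertmanager mute the warning while the critical alert is firing. The fragment below is a sketch of an Alertmanager inhibition rule, not part of the MetricsHub rule files. It assumes both alerts carry an instance label that identifies the same monitored resource; replace it with whichever labels your alerts actually share.

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  - source_matchers:
      # While this alert is firing...
      - 'alertname = "MetricsHub-Hardware-Battery-Charge-Critical"'
    target_matchers:
      # ...notifications for this alert are suppressed.
      - 'alertname = "MetricsHub-Hardware-Battery-Charge-Warning"'
    # Assumption: "instance" is present on both alerts. The inhibition only
    # applies when the listed labels have equal values on both alerts.
    equal:
      - instance
```

The alert rules themselves are unchanged: both alerts still fire in Prometheus, and only the warning notification is held back.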

Dynamic Threshold Example

For the hw_temperature_celsius metric:

  • a warning alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.degraded"}
  • a critical alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.critical"}

    - name: Temperature
      rules:
        - alert: Temperature-High-Warning
          expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
          labels:
            severity: warning

        - alert: Temperature-High-Critical
          expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
          labels:
            severity: critical
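
To see the dynamic comparison at work, the rule can be exercised with a Prometheus rule unit test. The sketch below makes several assumptions: the file names are hypothetical, temperature-rules.yaml is assumed to contain the Temperature group from this example under a top-level groups: key, and the instance and sensor labels are illustrative stand-ins for whatever labels the real metrics carry. Because of ignoring(limit_type), the two series are matched on all their other labels, so those labels must be identical on both sides.

```yaml
# temperature-rules-test.yaml (hypothetical file name)
rule_files:
  # Assumption: contains the Temperature group from this example
  # under a top-level "groups:" key.
  - temperature-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Measured temperature: constant 85 degrees Celsius.
      - series: 'hw_temperature_celsius{instance="host-a", sensor="cpu"}'
        values: '85+0x10'
      # Warning limit reported by the device: constant 80 degrees Celsius.
      - series: 'hw_temperature_limit_celsius{instance="host-a", sensor="cpu", limit_type="high.degraded"}'
        values: '80+0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: Temperature-High-Warning
        exp_alerts:
          # The alert keeps the labels of the left-hand series
          # and adds the severity label set by the rule.
          - exp_labels:
              severity: warning
              instance: host-a
              sensor: cpu
```

The test can be run with promtool test rules temperature-rules-test.yaml. No high.critical limit series is defined here, so only the warning rule has something to compare against.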

Customizing Alert Rules

All MetricsHub alert rules can be customized.

You can:

  • Adjust thresholds
  • Modify alert durations (for:)
  • Add or remove labels
  • Customize annotations and descriptions
  • Enable or disable specific alerts
  • Integrate additional routing labels for Alertmanager

After modifying a rule file, restart Prometheus to reload the updated configuration.
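
As an illustration of several of these customizations combined, the sketch below reworks the battery warning rule from the static threshold example. The 40% threshold, the 10-minute duration, the team label, and the annotation texts are all illustrative choices, and the instance label is assumed to be present on the metric.

```yaml
- name: MetricsHub-Hardware-Battery-Charge
  rules:
    - alert: MetricsHub-Hardware-Battery-Charge-Warning
      # Threshold lowered from 50% to 40% (illustrative value).
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 40
      # Condition must hold for 10 minutes instead of 5 before the alert fires.
      for: 10m
      labels:
        severity: warning
        # Hypothetical routing label, to be matched by an Alertmanager route.
        team: hardware
      annotations:
        # Assumption: the metric carries an "instance" label.
        summary: "Low battery charge on {{ $labels.instance }}"
        # $value is the ratio returned by the expression (0 to 1).
        description: "Battery charge is {{ $value | humanizePercentage }}."
```

To disable an alert, remove its entry from the rules list or comment it out. An Alertmanager route with a matcher on team = "hardware" could then direct these notifications to a dedicated receiver.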