Prometheus Alertmanager

Prometheus Alertmanager handles alerts generated by Prometheus and routes notifications to tools such as email, Slack, Microsoft Teams, or PagerDuty.

When your Prometheus server is configured to send alerts to Alertmanager, you can use MetricsHub alert rules to detect hardware, storage, and system issues.

Note:

These alert rules are distinct from the internal alerts generated by MetricsHub and emitted as OpenTelemetry logs.
The alert rules described on this page are evaluated by Prometheus. When an alert fires, Prometheus sends it to Alertmanager, which is responsible for routing and notifications.
To view detailed alert descriptions and annotations, you must use the full Prometheus Alertmanager interface (typically available on port 9093). The lightweight Prometheus web UI does not display this additional alert information.
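
For reference, the link between Prometheus and Alertmanager is usually declared in the alerting section of prometheus.yml. The excerpt below is a minimal sketch. It assumes Alertmanager runs on the same host as Prometheus and listens on port 9093; the localhost:9093 target is a placeholder to adapt to your environment.

```yaml
# prometheus.yml (excerpt)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Assumption: Alertmanager is reachable locally on port 9093.
            - localhost:9093
```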

Available Alert Rules

The following rule sets are provided with MetricsHub:

Alert Rules	When to Use	Alerts Triggered When
MetricsHub	Always	A host cannot be reached; a connector has failed; a protocol has failed; the MetricsHub Agent is not sending metrics.
Hardware	When hardware monitoring is performed	Battery charge is critically or abnormally low; devices report high error rates (e.g. CPU, memory, disks, network); fan speed is too low; a LUN has too few or no available paths; network card error ratio is high; physical disk endurance is low; power supply usage is abnormally high; temperature or voltage is out of range; a hardware device is missing, predicted to fail, degraded, or failing.
Storage	When storage monitoring is performed	Storage pools are close to saturation, oversubscribed, growing abnormally, or projected to run out of capacity; volumes experience abnormal latency, become stalled, degraded, or failed; volumes are orphaned or mapped without consumer identity information; a volume monopolizes pool activity and impacts neighboring workloads; controllers, storage networks, or physical disks experience abnormal latency or instability; storage consumers become unexpectedly idle, generate abnormal I/O spikes, grow unusually fast, or are significantly over-provisioned.
System	When system monitoring is performed	CPU usage, file system utilization, memory usage, or bandwidth usage is abnormally high; too many network errors are detected; a high page fault rate occurs over an extended period of time.

Install Alert Rules

To activate the alert rules:

Copy the required configuration files into your Prometheus configuration directory:
- config/metricshub-rules.yaml
- config/metricshub-hardware-rules.yaml
- config/metricshub-storage-rules.yaml
- config/metricshub-system-rules.yaml

Declare them in the prometheus.yml file:

rule_files:
  - metricshub-rules.yaml
  - metricshub-hardware-rules.yaml
  - metricshub-storage-rules.yaml
  - metricshub-system-rules.yaml

Restart your Prometheus server to take the new rules into account.
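
As an alternative to listing each file, rule_files also accepts glob patterns. The sketch below assumes that the four rule files sit in the same directory as prometheus.yml and that no unrelated file in that directory matches the pattern.

```yaml
# prometheus.yml (excerpt)
rule_files:
  # Matches metricshub-rules.yaml, metricshub-hardware-rules.yaml,
  # metricshub-storage-rules.yaml, and metricshub-system-rules.yaml.
  - metricshub-*.yaml
```

Running promtool check config prometheus.yml before the restart validates both the configuration and the rule files it references.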

Understanding Alert Rule Thresholds

MetricsHub alert rules use two types of thresholds:

Static thresholds: use fixed values that apply to all devices. Example: battery charge below 30%.
Dynamic thresholds: use device-specific threshold metrics exposed directly by the monitored hardware. Example: warning and critical temperature limits provided by the device itself.

Dynamic thresholds allow MetricsHub to adapt alerts automatically to different hardware vendors, models, and configurations.

The following examples illustrate how static and dynamic thresholds are implemented in Prometheus alert rules.

Static Threshold Example

For the hw_battery_charge_ratio metric:

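a warning alert is triggered when the battery charge is at or below 0.5 (50%)
a critical alert is triggered when the battery charge is below 0.3 (30%)

Each condition must hold for 5 minutes (for: 5m) before the alert fires, and the >= 0 clause excludes negative readings. Because Prometheus rules are evaluated independently, both alerts fire when the charge falls below 30%.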

- name: MetricsHub-Hardware-Battery-Charge
  rules:
    - alert: MetricsHub-Hardware-Battery-Charge-Warning
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 50
      for: 5m
      labels:
        severity: warning

    - alert: MetricsHub-Hardware-Battery-Charge-Critical
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 < 30
      for: 5m
      labels:
        severity: critical
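
Since both alerts fire below 30%, you may prefer to silence the warning notification while the critical alert is active. One possible approach is an Alertmanager inhibition rule, sketched below. The alert names come from the rule above; the labels listed under equal (instance and id) are assumptions and must be replaced with labels that actually identify the same battery on both alerts in your environment.

```yaml
# alertmanager.yml (excerpt)
inhibit_rules:
  # While the critical battery alert fires, mute the matching warning alert.
  - source_matchers:
      - alertname="MetricsHub-Hardware-Battery-Charge-Critical"
    target_matchers:
      - alertname="MetricsHub-Hardware-Battery-Charge-Warning"
    # Assumption: these labels exist on both alerts and identify the device.
    # Labels missing from both alerts are treated as equal by Alertmanager.
    equal:
      - instance
      - id
```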

Dynamic Threshold Example

For the hw_temperature_celsius metric:

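a warning alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.degraded"}
a critical alert is triggered when the temperature reaches or exceeds the value of hw_temperature_limit_celsius{limit_type="high.critical"}

The limit series carry a limit_type label that the temperature series do not have, so the rules use ignoring(limit_type) to match each temperature with its own limit on the remaining labels.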

- name: Temperature
  rules:
    - alert: Temperature-High-Warning
      expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.degraded"}
      labels:
        severity: warning

    - alert: Temperature-High-Critical
      expr: hw_temperature_celsius >= ignoring(limit_type) hw_temperature_limit_celsius{limit_type="high.critical"}
      labels:
        severity: critical
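
To see how the dynamic comparison behaves, the rule group above can be exercised with a promtool unit test. The following is a sketch under stated assumptions: temperature-rules.yaml is a hypothetical file containing only the Temperature group shown above, wrapped in a top-level groups: key, and the id="cpu0" label is an illustrative stand-in for whatever labels identify a sensor in your data.

```yaml
# temperature-rules-test.yaml (hypothetical file name)
rule_files:
  - temperature-rules.yaml   # hypothetical: contains only the Temperature group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Temperature rises from 70 to 85 at the 2-minute mark.
      - series: 'hw_temperature_celsius{id="cpu0"}'
        values: '70 70 85'
      # Limits reported by the device itself.
      - series: 'hw_temperature_limit_celsius{id="cpu0", limit_type="high.degraded"}'
        values: '80 80 80'
      - series: 'hw_temperature_limit_celsius{id="cpu0", limit_type="high.critical"}'
        values: '95 95 95'
    alert_rule_test:
      # 85 >= 80: the warning alert is expected.
      - eval_time: 2m
        alertname: Temperature-High-Warning
        exp_alerts:
          - exp_labels:
              severity: warning
              id: cpu0
      # 85 < 95: no critical alert is expected.
      - eval_time: 2m
        alertname: Temperature-High-Critical
        exp_alerts: []
```

Such a file is run with promtool test rules temperature-rules-test.yaml. A different device reporting other limits would be evaluated against its own values without any change to the rule.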

Customizing Alert Rules

All MetricsHub alert rules can be customized.

You can:

Adjust thresholds
Modify alert durations (for:)
Add or remove labels
Customize annotations and descriptions
Enable or disable specific alerts
Integrate additional routing labels for Alertmanager
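
As an illustration, the battery warning rule could be adapted as sketched below. The 40% threshold, the 15-minute duration, the team label and its value, and the annotation texts are all illustrative assumptions, not values shipped with MetricsHub.

```yaml
- name: MetricsHub-Hardware-Battery-Charge
  rules:
    - alert: MetricsHub-Hardware-Battery-Charge-Warning
      # Threshold lowered from 50% to 40% (illustrative value).
      expr: hw_battery_charge_ratio >= 0 AND hw_battery_charge_ratio * 100 <= 40
      # Wait longer before firing (illustrative value).
      for: 15m
      labels:
        severity: warning
        team: datacenter-ops   # hypothetical routing label
      annotations:
        summary: "Low battery charge"
        # $value is the ratio returned by the left-hand side of the expression.
        description: "Battery charge is {{ $value | humanizePercentage }}."
```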

After modifying a rule file, restart Prometheus to reload the updated configuration.
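
On the Alertmanager side, the labels set in the rules drive routing. The excerpt below is a minimal sketch that sends alerts labeled severity: critical to a dedicated receiver. The receiver names are hypothetical, and each receiver still needs its own integration settings (for example email_configs, slack_configs, or pagerduty_configs).

```yaml
# alertmanager.yml (excerpt)
route:
  # Fallback receiver for every alert not matched below (hypothetical name).
  receiver: default-notifications
  routes:
    - matchers:
        - severity="critical"
      receiver: on-call-pager   # hypothetical name

receivers:
  - name: default-notifications
  - name: on-call-pager
```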
