I’m using node_exporter to generate host metrics for several of the nodes in my lab. Today I was reworking one of my thermal graphs, with the goal of getting good historical temps of my Pis and my Ubuntu-based homebuilt NAS into a single readable graph. node_exporter has two relevant time series:
- node_thermal_zone_temp, which was exported on all of the Raspberries Pi
- node_hwmon_temp_celsius, which was exported by the NAS and the Raspberries Pi 4. The rPi3 did not export this metric.
I liked node_hwmon_temp_celsius a lot, and opted to spend some time getting it to fit as well as I could. Querying it returns an instant vector, and with my config it returned the following:
node_hwmon_temp_celsius{chip="0000:00:01_1_0000:01:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp1"} 29.85
node_hwmon_temp_celsius{chip="0000:00:01_1_0000:01:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp2"} 29.85
node_hwmon_temp_celsius{chip="0000:00:01_1_0000:01:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp3"} 32.85
node_hwmon_temp_celsius{chip="0000:20:00_0_0000:21:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp1"} 52.85
node_hwmon_temp_celsius{chip="0000:20:00_0_0000:21:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp2"} 52.85
node_hwmon_temp_celsius{chip="0000:20:00_0_0000:21:00_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp3"} 58.85
node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp1"} 37.75
node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp2"} 37.75
node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter", sensor="temp3"} 27
node_hwmon_temp_celsius{chip="thermal_thermal_zone0", class="raspberry pi", environment="cluster", hostname="cluster1", instance="10.0.1.201:9100", job="node-exporter", sensor="temp0"} 37.485
node_hwmon_temp_celsius{chip="thermal_thermal_zone0", class="raspberry pi", environment="cluster", hostname="cluster1", instance="10.0.1.201:9100", job="node-exporter", sensor="temp1"} 37.972
node_hwmon_temp_celsius{chip="thermal_thermal_zone0", class="raspberry pi", environment="cluster", hostname="cluster2", instance="10.0.1.252:9100", job="node-exporter", sensor="temp0"} 32.128
node_hwmon_temp_celsius{chip="thermal_thermal_zone0", class="raspberry pi", environment="cluster", hostname="cluster2", instance="10.0.1.252:9100", job="node-exporter", sensor="temp1"} 32.128
The class, environment, and hostname labels are added when scraped.
The chip label looked interesting, but it appears to be an identifier as opposed to a name, and I’m terrible at mentally mapping hard-to-read identifiers to something meaningful. Digging around a little more, I found node_hwmon_chip_names, which when queried returned:
node_hwmon_chip_names{chip="0000:00:01_1_0000:01:00_0", chip_name="nvme", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="0000:20:00_0_0000:21:00_0", chip_name="nvme", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="pci0000:00_0000:00:18_3", chip_name="k10temp", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="platform_rpi_poe_fan_0", chip_name="rpipoefan", class="raspberry pi", environment="cluster", hostname="cluster0", instance="10.0.1.42:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="platform_rpi_poe_fan_0", chip_name="rpipoefan", class="raspberry pi", environment="cluster", hostname="cluster1", instance="10.0.1.201:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="platform_rpi_poe_fan_0", chip_name="rpipoefan", class="raspberry pi", environment="cluster", hostname="cluster2", instance="10.0.1.252:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="power_supply_hidpp_battery_0", chip_name="hidpp_battery_0", class="nas server", environment="storage", hostname="20-size", instance="10.0.1.217:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="soc:firmware_raspberrypi_hwmon", chip_name="rpi_volt", class="raspberry pi", environment="cluster", hostname="cluster0", instance="10.0.1.42:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="soc:firmware_raspberrypi_hwmon", chip_name="rpi_volt", class="raspberry pi", environment="cluster", hostname="cluster1", instance="10.0.1.201:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="soc:firmware_raspberrypi_hwmon", chip_name="rpi_volt", class="raspberry pi", environment="cluster", hostname="cluster2", instance="10.0.1.252:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0", chip_name="cpu_thermal", class="raspberry pi", environment="cluster", hostname="cluster1", instance="10.0.1.201:9100", job="node-exporter"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0", chip_name="cpu_thermal", class="raspberry pi", environment="cluster", hostname="cluster2", instance="10.0.1.252:9100", job="node-exporter"} 1
You might notice that the chip label matches in both vectors, which made me think I could cross-reference one against the other. This was way more hack-y than I expected.
Prometheus only allows for label joining by using the group_right and group_left modifiers, which are very poorly documented. Fortunately, I came across these two posts by Brian Brazil, which got me started. This answer on Stack Overflow helped me get the rest of the way there.
I’ll start with my working query and work backwards.
avg (node_hwmon_temp_celsius) by (chip,type,hostname,instance,class,environment,job) * ignoring(chip_name) group_left(chip_name) avg (node_hwmon_chip_names) by (chip,chip_name,hostname,instance,class,environment,job)
We’ll break the query above into two sides separated by the operator:
- the Left side:
avg (node_hwmon_temp_celsius) by (chip,type,hostname,instance,class,environment,job)
- the Right side:
avg (node_hwmon_chip_names) by (chip,chip_name,hostname,instance,class,environment,job)
- the Operator:
* ignoring(chip_name) group_left(chip_name)
Let’s go through each.
The left side averages the values of every series that shares the same chip on the same host. In this case, the output above showed that some chips had multiple series, distinguished by sensor values temp1…tempN. I don’t really care about those, so I averaged them. Averaging a group that contains only one series just returns that series’ value, so that’s a good solution.
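If the goal is only to collapse the per-sensor series, one alternative sketch is to aggregate away the sensor label rather than listing every label to keep. This assumes sensor is the only label that differs between the series for a given chip, which is what the output above suggests for my hosts.

```promql
# Average across sensors; every label other than `sensor` is kept as-is.
avg without (sensor) (node_hwmon_temp_celsius)
```

The upside of this form is that the label list does not have to be kept in sync with whatever labels get added at scrape time.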
The right side returns several series whose labels map chips to chip_names, along with the other requisite labels. The value of each of these series is 1, effectively saying “this chip exists.”
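The join only works if each chip on each host maps to exactly one chip_name. A quick way to check that assumption is a query like the sketch below. It should return nothing on a healthy setup; any series it does return would be a chip with more than one name, which I would expect to make the joined query fail with a many-to-many matching error.

```promql
# Count name series per chip per host; anything above 1 is a duplicate.
count by (instance, chip) (node_hwmon_chip_names) > 1
```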
The operator is where it gets both interesting and hacky.
- Arithmetic operators between two instant vectors use vector matching: they pair up series with identical label sets and perform the operation on their values. I used * (multiplication) because the right-side value is always 1, so multiplying leaves my left-side values unchanged.
- The ignoring() keyword allows us to list labels to be ignored when looking for identical label sets. In this case I told the arithmetic operator to use ignoring(chip_name) because that label only exists on the right side.
- We can use the grouping modifiers (group_left() and group_right()) to match many-to-one or one-to-many. They also take a list of labels from the “one” side, which are passed along with the results of the operation. Since I used group_left(chip_name), it returned chip_name in the list of labels after matching.
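The same match can also be written with on() instead of ignoring(), naming the labels that must agree rather than the ones to skip. This is a sketch, not the query I ran: it assumes job, instance, and chip together identify a chip uniquely, which holds for the output shown earlier.

```promql
# Left side: one averaged temperature per chip per host.
avg without (sensor) (node_hwmon_temp_celsius)
  # Match only on these labels, then copy chip_name over from the right side.
  * on (job, instance, chip) group_left (chip_name)
# Right side: value is always 1, so the multiplication leaves temperatures unchanged.
node_hwmon_chip_names
```

With on(), the right side no longer needs its own aggregation, because labels outside the on() list are not considered when matching. The result keeps the left side's labels plus chip_name.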
Here’s what makes this hacky: as far as I can tell, this is the only way to take matching labels and use them in reference to one another.
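The query returns the following. The environment label is missing because this output was captured while that label was misspelled in both by() clauses; aggregation silently drops labels that don’t exist on the input series.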
{chip="0000:00:01_1_0000:01:00_0",chip_name="nvme",class="nas server",hostname="20-size",instance="10.0.1.217:9100",job="node-exporter"} 28.85
{chip="0000:20:00_0_0000:21:00_0",chip_name="nvme",class="nas server",hostname="20-size",instance="10.0.1.217:9100",job="node-exporter"} 54.85
{chip="pci0000:00_0000:00:18_3",chip_name="k10temp",class="nas server",hostname="20-size",instance="10.0.1.217:9100",job="node-exporter"} 30.166666666666668
{chip="thermal_thermal_zone0",chip_name="cpu_thermal",class="raspberry pi",hostname="cluster1",instance="10.0.1.201:9100",job="node-exporter"} 36.998000000000005
{chip="thermal_thermal_zone0",chip_name="cpu_thermal",class="raspberry pi",hostname="cluster2",instance="10.0.1.252:9100",job="node-exporter"} 32.128
Pretty sweet.
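Because a series only appears in the result if it matches something on the right side, filtering the right side also filters the graph. As a sketch, the query below keeps only what I assume are the CPU sensors in this lab (cpu_thermal on the Pis and k10temp on the NAS); adjust the regex for other hardware.

```promql
# Assumption: cpu_thermal and k10temp are the CPU temperature chips here.
avg without (sensor) (node_hwmon_temp_celsius)
  * on (job, instance, chip) group_left (chip_name)
node_hwmon_chip_names{chip_name=~"cpu_thermal|k10temp"}
```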