NVIDIA GPU monitoring with Netdata

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the nvidia-smi CLI tool.

Requirements and Notes:

  • You must have the nvidia-smi tool installed and your NVIDIA GPU(s) must support it. Support is mostly limited to newer, higher-end models (the professional range and cards commonly used for AI/ML and crypto workloads); see NVIDIA's nvidia-smi documentation for details.

  • You must enable this plugin, as it is disabled by default due to minor performance issues; see the example after this list.

  • On some systems, when the GPU is idle, the nvidia-smi tool unloads, which adds latency again when it is next queried. If you are running GPUs under constant workload, this is unlikely to be an issue.

  • Currently the nvidia-smi tool is queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See the discussion here: https://github.com/netdata/netdata/pull/4357

  • Contributions are welcome.

  • Make sure the netdata user can execute /usr/bin/nvidia-smi (or wherever your nvidia-smi binary is located); see the check after this list.

  • If the nvidia-smi process is not killed after a Netdata restart, you need to turn off loop_mode.

  • poll_seconds is an integer that sets how often, in seconds, the tool is polled.
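
To enable the collector, turn the module on in python.d.conf, which is managed with the same edit-config helper used below. A minimal sketch, assuming your config directory is /etc/netdata:

cd /etc/netdata   # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d.conf
# then set the module entry to: nvidia_smi: yes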
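
To check that the netdata user can execute the binary, you can run nvidia-smi as that user; nvidia-smi -x -q queries the full GPU state as XML and works as a quick smoke test. This assumes your binary lives at /usr/bin/nvidia-smi and that sudo is available:

sudo -u netdata /usr/bin/nvidia-smi -x -q   # should print XML output, not a permission error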

It produces:

  1. Per GPU

    • GPU utilization
    • memory allocation
    • memory utilization
    • fan speed
    • power usage
    • temperature
    • clock speed
    • PCI bandwidth

Configuration

Edit the python.d/nvidia_smi.conf configuration file using edit-config from your agent's config directory, which is typically /etc/netdata.

cd /etc/netdata   # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf

Sample:

loop_mode    : yes
poll_seconds : 1
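
After saving your changes, restart the agent so the new settings take effect. On systemd-based systems this is typically:

sudo systemctl restart netdata   # use the restart method appropriate to your installation if not using systemd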
