2.4 KiB
NVIDIA GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, pcie bandwidth utilization, temperature, etc.) using nvidia-smi
cli tool.
Requirements and Notes:
-
You must have the
nvidia-smi
tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about nvidia_smi. -
You must enable this plugin as its disabled by default due to minor performance issues.
-
On some systems when the GPU is idle the
nvidia-smi
tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue. -
Currently the
nvidia-smi
tool is being queried via cli. Updating the plugin to use the nvidia c/c++ API directly should resolve this issue. See discussion here: https://github.com/netdata/netdata/pull/4357 -
Contributions are welcome.
-
Make sure
netdata
user can execute/usr/bin/nvidia-smi
or wherever your binary is. -
If
nvidia-smi
process is not killed after netdata restart you need to offloop_mode
. -
poll_seconds
is how often in seconds the tool is polled for as an integer.
It produces:
-
Per GPU
- GPU utilization
- memory allocation
- memory utilization
- fan speed
- power usage
- temperature
- clock speed
- PCI bandwidth
Configuration
Edit the python.d/nvidia_smi.conf
configuration file using edit-config
from the your agent's config
directory, which is typically at /etc/netdata
.
cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf
Sample:
loop_mode : yes
poll_seconds : 1