Real-time NVIDIA GPU performance monitoring with Telegraf + InfluxDB + Grafana

High rise (Zee) 2021-11-25 18:02:19

What is a GPU?

A graphics processing unit (GPU), also known as a display core, visual processor, or display chip, is a microprocessor specialized for image computation on PCs, workstations, game consoles, and some mobile devices (such as tablets and smartphones). It converts the display information the computer system needs and drives the display, supplying the line-scan signal that controls what the monitor shows. It is an important component connecting the monitor to the PC motherboard, and one of the key devices for "human-computer interaction". As an important part of the host computer, the graphics card is responsible for outputting graphics. It matters a great deal to professional graphic designers, and it is also widely used in deep learning.

Background knowledge

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on the NVIDIA Management Library (NVML) and designed to help manage and monitor NVIDIA GPU devices. It lets administrators query the state of GPU devices and, with the appropriate privileges, modify that state. It targets Tesla, GRID, Quadro, and Titan X products, with limited support for other NVIDIA GPUs. nvidia-smi ships with the NVIDIA GPU display driver on Linux and on 64-bit Windows Server 2008 R2 and Windows 7. It can report query results as XML or as human-readable plain text, either to standard output or to a file.
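To make the XML report format concrete, here is a minimal Python sketch that reads a few values from a report of the kind produced by `nvidia-smi -q -x`. The embedded sample is hand-written and heavily abbreviated, its values are made up, and the function name `summarize_gpus` is hypothetical; element names can differ between driver versions, so treat this as an illustration and not a complete parser.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for `nvidia-smi -q -x` output.
# Real reports contain many more elements; the values here are made up.
SAMPLE_XML = """\
<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>GeForce GTX 750</product_name>
    <fan_speed>33 %</fan_speed>
    <temperature>
      <gpu_temp>41 C</gpu_temp>
    </temperature>
    <utilization>
      <gpu_util>7 %</gpu_util>
    </utilization>
  </gpu>
</nvidia_smi_log>
"""


def summarize_gpus(xml_text):
    """Return one dict per <gpu> element with a few commonly used readings."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.iter("gpu"):
        gpus.append({
            "bus_id": gpu.get("id"),
            "name": gpu.findtext("product_name"),
            "fan_speed": gpu.findtext("fan_speed"),
            "temperature": gpu.findtext("temperature/gpu_temp"),
            "gpu_util": gpu.findtext("utilization/gpu_util"),
        })
    return gpus


if __name__ == "__main__":
    for gpu in summarize_gpus(SAMPLE_XML):
        print(gpu)
```

On a machine with an NVIDIA driver installed, the same function could be fed the captured output of `nvidia-smi -q -x` in place of the sample string.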

Example nvidia-smi output:

[Figure: example nvidia-smi output]

How to use nvidia-smi on Windows

nvidia-smi is installed together with the NVIDIA graphics driver, so nvidia-smi.exe can be found in the driver's default installation directory, C:\Program Files\NVIDIA Corporation\NVSMI. Drag the file into a CMD window and run it to display information about the GPU, as shown in the figure below:

[Figure: nvidia-smi output in a CMD window]

The figure above shows the information for an NVIDIA GeForce GTX 750. The fields are explained below.

The headers in the upper box of the table map one-to-one onto the values in the four boxes below it:

  • GPU: GPU index;
  • Name: GPU model;
  • Fan: fan speed, as a percentage from 0 to 100%;
  • Temp: temperature, in degrees Celsius;
  • Perf: performance state, from P0 to P12. P0 is the maximum-performance state and P12 the minimum-performance state, so a GPU under heavy load typically runs at or near P0, while an idle GPU drops to a higher-numbered state;
  • Pwr:Usage/Cap: current power draw and power cap;
  • Memory-Usage: video memory usage;
  • Bus-Id: the GPU's bus address, in the form domain:bus:device.function;
  • Disp.A: Display Active, whether a display is initialized on the GPU;
  • Volatile GPU-Util: volatile GPU utilization (GPU load);
  • Uncorr. ECC: Error Correcting Code, i.e. error checking and correction;
  • Compute M.: compute mode;
  • The Processes section at the bottom shows the video memory used by each process on the GPU.
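The same readings can be requested in a machine-friendly form with `nvidia-smi --query-gpu=... --format=csv`. The sketch below is one possible way to collect and parse them in Python; the function names `parse_query_output` and `query_gpus` are hypothetical, the sample line is made up, and `query_gpus` only works on a machine with an NVIDIA driver and nvidia-smi on the PATH (or with the full path to nvidia-smi.exe passed as `binary`).

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`; they mirror the columns above.
QUERY_FIELDS = [
    "index", "name", "fan.speed", "temperature.gpu", "pstate",
    "power.draw", "memory.used", "memory.total", "utilization.gpu",
]


def parse_query_output(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (value.strip() for value in row)))
        for row in reader
        if row
    ]


def query_gpus(binary="nvidia-smi"):
    """Run nvidia-smi and return the parsed readings (needs an NVIDIA driver)."""
    cmd = [
        binary,
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=5
    )
    return parse_query_output(result.stdout)


if __name__ == "__main__":
    # Made-up sample line so the parser can be tried without a GPU.
    sample = "0, GeForce GTX 750, 33, 41, P8, 1.50, 312, 2048, 7\n"
    print(parse_query_output(sample))
```

All values are kept as strings here because nvidia-smi may report unsupported readings as `[N/A]`; converting to numbers is left to the caller.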

Monitoring NVIDIA GPUs with Telegraf + InfluxDB + Grafana

Telegraf provides an nvidia_smi input plugin that collects GPU performance data.

GitHub address:

Plugin configuration

[[inputs.nvidia_smi]]
  ## Optional: path to nvidia-smi binary, defaults to $PATH via exec.LookPath
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"

  ## Optional: timeout for GPU polling
  timeout = "5s"

Collected metrics


  • tags
      • name (GPU type, for example GeForce GTX 1070 Ti)
      • compute_mode (compute mode of the GPU, for example Default)
      • index (index of the port the GPU is connected to on the motherboard, for example 1)
      • pstate (performance state of the GPU, for example P0)
      • uuid (unique identifier of the GPU, for example GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
  • fields
      • fan_speed (integer, percent)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • power_draw (float, W)
      • temperature_gpu (integer, ℃)
      • utilization_gpu (integer, percent)
      • utilization_memory (integer, percent)
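To show how these tags and fields end up as a single data point, here is a small Python sketch that formats one reading as InfluxDB line protocol. It is an illustration only, not Telegraf's own serializer: the helper names are hypothetical, the values are made up, and it assumes the measurement is named `nvidia_smi` with underscore-separated field names such as `temperature_gpu` and `power_draw`.

```python
def escape_tag(value):
    """Escape commas, equals signs and spaces, which are special in tag values."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def format_field(value):
    """Integers carry an 'i' suffix in line protocol; floats are written as-is."""
    if isinstance(value, int):
        return f"{value}i"
    return str(float(value))


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build 'measurement,tag=value,... field=value,... [timestamp]'."""
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={format_field(v)}" for k, v in sorted(fields.items())
    )
    line = f"{measurement},{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # Made-up reading for illustration.
    tags = {"index": "0", "name": "GeForce GTX 1070 Ti", "pstate": "P0"}
    fields = {"temperature_gpu": 41, "power_draw": 12.5}
    print(to_line_protocol("nvidia_smi", tags, fields))
```

Tags are indexed strings that identify which GPU a point belongs to, while fields carry the numeric readings, which is why the GPU name is escaped as a tag and the temperature is written as an integer field.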

Sample of the collected data:

[Figure: sample of the collected data]

Grafana dashboard:

[Figure: Grafana dashboard]

Please include a link to the original article when reprinting. Thanks.