
GPU Management is designed to track and manage all GPU resources in Kubernetes clusters in real time. It not only displays the current status, but also supports integrated management of multiple clusters, provides information customized for each user role, and includes analytical tools to monitor and evaluate GPU usage trends over time.

Main Functions

  • Cluster-level GPU Monitoring
    • View GPU resources across multiple clusters at a glance
    • Compare GPU utilization by cluster at a glance
    • Immediately identify clusters that are short on resources and clusters with free capacity
  • Multi-Instance GPU (MIG) support
    • Divide a single GPU into multiple instances for flexible utilization
    • Collect independent metrics for each MIG instance
    • Monitor resource contention between instances within physical GPUs
  • Real-time metrics visualization
    • Update GPU status every 15 seconds
    • Distinguish status with color codes (Normal/Warning/Critical)
    • Analyze trends with time-series charts
  • Role-based view
    • System Administrators: Overall infrastructure health
    • Project Managers: GPU pool usage of their team
    • Developers: Detailed metrics for their own workloads

Key Metrics

GPU Utilization

GPU utilization is one of the most frequently monitored metrics, but it is also one of the easiest to misinterpret. This metric indicates the percentage of time that the GPU's compute cores are actively performing calculations. For instance, if GPU utilization is 80%, it means the GPU was performing computations for 80% of the measured period and was idle for the remaining 20%.

  • Normal range:
    • Allocated GPU: Above 50% (ideally 75–85%)
    • Unallocated GPU: Below 5% (ideally 0%)
  • Problem diagnosis:
    • Low utilization (20% or less): Possible data loading bottleneck, small batch size, or CPU bottleneck
    • Abnormal activity (Unallocated GPU > 20%): Possibly unfinished processes or security issues
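
The ranges above can be sketched as a small classifier. This is a minimal illustration, not the product's actual logic: the function name `diagnose_gpu_utilization` is hypothetical, and the labels returned for values that fall between the listed ranges (for example 20–50% on an allocated GPU) are assumptions.

```python
def diagnose_gpu_utilization(utilization_pct: float, allocated: bool) -> str:
    """Classify a GPU utilization reading using the ranges listed above."""
    if not 0.0 <= utilization_pct <= 100.0:
        raise ValueError("utilization_pct must be between 0 and 100")

    if allocated:
        if utilization_pct <= 20.0:
            return "low: check data loading, batch size, or CPU bottleneck"
        if utilization_pct <= 50.0:
            # Assumption: the text does not name this band explicitly.
            return "below normal range"
        if 75.0 <= utilization_pct <= 85.0:
            return "ideal"
        return "normal"

    # Unallocated GPUs are expected to be close to idle.
    if utilization_pct > 20.0:
        return "abnormal: check for unfinished processes or security issues"
    if utilization_pct < 5.0:
        return "normal"
    # Assumption: the text does not name the 5-20% band for unallocated GPUs.
    return "above normal range"
```

For example, `diagnose_gpu_utilization(80, allocated=True)` falls in the ideal band, while the same reading on an unallocated GPU is flagged as abnormal.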

Memory Utilization

Memory utilization indicates the proportion of time GPU memory is actively performing read and write operations. This is a different concept from memory usage. Memory usage shows how much memory has been allocated, whereas memory utilization reflects how actively that memory is being used.

| GPU utilization | Memory utilization | Interpretation |
| --- | --- | --- |
| High | Low | Normal (compute-intensive) |
| Low | High | Memory bandwidth bottleneck |
| Low | Low | Data loading/CPU bottleneck |

Caution for memory usage:
  • Over 95%: Risk of out-of-memory
  • Continuous increase: Possible memory leak
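
The interpretation table and the memory usage cautions can be expressed as two small helpers. This is a sketch under stated assumptions: the text does not define where "High" begins, so the 50% cutoff is a placeholder, and "continuous increase" is approximated as at least three strictly increasing samples. All names are hypothetical.

```python
from typing import List, Sequence

# Assumption: the text does not define the boundary between "High" and "Low".
HIGH_THRESHOLD_PCT = 50.0


def interpret_utilization(gpu_util_pct: float, mem_util_pct: float) -> str:
    """Look up the GPU/memory utilization combination in the table above."""
    gpu_high = gpu_util_pct >= HIGH_THRESHOLD_PCT
    mem_high = mem_util_pct >= HIGH_THRESHOLD_PCT
    if gpu_high and not mem_high:
        return "normal (compute-intensive)"
    if not gpu_high and mem_high:
        return "memory bandwidth bottleneck"
    if not gpu_high and not mem_high:
        return "data loading/CPU bottleneck"
    return "not covered by the table"  # High/High has no row in the table


def memory_usage_warnings(usage_samples_pct: Sequence[float]) -> List[str]:
    """Check memory usage samples (oldest first) against the two cautions."""
    warnings = []
    if usage_samples_pct and usage_samples_pct[-1] > 95.0:
        warnings.append("risk of out-of-memory")
    # Assumption: "continuous increase" means every sample exceeds the previous one.
    pairs = zip(usage_samples_pct, usage_samples_pct[1:])
    if len(usage_samples_pct) >= 3 and all(b > a for a, b in pairs):
        warnings.append("possible memory leak")
    return warnings
```

Note that the first helper takes memory utilization (activity) while the second takes memory usage (allocation), matching the distinction drawn above.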

Temperature

Temperature is the most direct indicator of GPU health. GPUs consume significant power for high performance, generating considerable heat as a result. If this heat is not properly controlled, it can not only cause performance degradation but also damage hardware. The normal operating temperature range for most data center GPUs is between 50°C and 80°C.

Temperature standards:

| Temperature | Status | Recommended action |
| --- | --- | --- |
| < 70°C | Normal | - |
| 70–80°C | Good | - |
| 80–85°C | Caution | Check cooling |
| 85–90°C | Warning | Start SW throttling |
| ≥ 90°C | Critical | HW throttling, immediate action required |

Causes of temperature rise:
  • Cooling fan failure or dust accumulation
  • Rising server room environmental temperature
  • Thermal interference between GPUs within nodes (simultaneous heavy load)
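
The temperature standards above map directly onto a threshold lookup. The sketch below is illustrative only: `temperature_status` is a hypothetical name, and because adjacent ranges in the table share an endpoint, it assumes each lower bound is inclusive (80°C counts as Caution, 85°C as Warning).

```python
from typing import Tuple


def temperature_status(temp_c: float) -> Tuple[str, str]:
    """Return (status, recommended action) for a GPU temperature in Celsius."""
    # Assumption: the lower bound of each range is inclusive.
    if temp_c < 70.0:
        return ("Normal", "-")
    if temp_c < 80.0:
        return ("Good", "-")
    if temp_c < 85.0:
        return ("Caution", "Check cooling")
    if temp_c < 90.0:
        return ("Warning", "Start SW throttling")
    return ("Critical", "HW throttling, immediate action required")
```

A reading of 87°C, for instance, returns the Warning row, and any reading at or above 90°C returns Critical.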

Power Usage

Power usage shows the amount of power (in watts) that the GPU is currently drawing. Each GPU model has a rated power limit; for example, the default power limit for the NVIDIA A100 (SXM variant) is 400 W, and administrators can adjust this limit via software.

  • Purposes of power monitoring:
    • Performance limitation detection: If power usage exceeds 95% of the power limit, the GPU may be power-limited.
    • Cost control: Identify GPUs that consume a lot of power but run inefficiently.
    • Hardware issue detection: Detect abnormal power usage patterns.
  • Analysis points:
    • Low GPU utilization + high power consumption = Inefficiency (memory transfer overhead)
    • High GPU utilization + high power consumption = Normal (GPU is working at full capacity)
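
The power checks above can be sketched as follows. Only the 95% power-limit ratio comes from the text; the cutoffs for "high" utilization and "high" power consumption are not defined there, so the constants below are placeholder assumptions, and both function names are hypothetical.

```python
# Assumptions: the text does not define "high" for either metric.
HIGH_UTILIZATION_PCT = 50.0
HIGH_POWER_RATIO = 0.80


def is_power_limited(power_w: float, power_limit_w: float) -> bool:
    """True when power usage exceeds 95% of the power limit."""
    if power_limit_w <= 0:
        raise ValueError("power_limit_w must be positive")
    return power_w / power_limit_w > 0.95


def analyze_power(gpu_util_pct: float, power_w: float, power_limit_w: float) -> str:
    """Apply the two analysis points listed above."""
    if power_limit_w <= 0:
        raise ValueError("power_limit_w must be positive")
    high_power = power_w / power_limit_w >= HIGH_POWER_RATIO
    high_util = gpu_util_pct >= HIGH_UTILIZATION_PCT
    if high_power and not high_util:
        return "inefficient (memory transfer overhead)"
    if high_power and high_util:
        return "normal (GPU is working at full capacity)"
    return "not covered by the analysis points"
```

With a 400 W limit, a GPU drawing 390 W is flagged as possibly power-limited, and a GPU drawing 350 W at 20% utilization is reported as inefficient.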

Clock Throttling

The GPU operates based on its clock speed, similar to a CPU. The higher the clock speed, the more computations it can perform per second. However, the GPU does not always run at its maximum clock speed. To manage power consumption and temperature, the GPU dynamically adjusts its clock—this process is known as clock throttling.

Clock throttling can occur for various reasons, each with a different meaning. It is important to understand the reasons for clock throttling as recorded by the system.

| Throttle Type | Meaning | Severity | Recommended Action |
| --- | --- | --- | --- |
| GPU Idle | Idle state | Normal | None |
| SW Power Cap | Admin-set power limit | Normal | Intended limitation |
| SW Thermal | Temperature > 85°C | Caution | Review cooling improvement |
| HW Thermal | Temperature > 90°C | Critical | Immediate cooling check |
| HW Power Brake | Instantaneous power overload | Warning | Check power supply |

It is important to monitor the frequency of throttling events. Occasional throttling is normal, but if it occurs frequently or continuously, the underlying cause should be addressed. The system tracks the number of clock throttle events, so if this number increases rapidly, further investigation may be needed.
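
One way to act on the throttle table is to look up each active throttle type and surface the most severe one, and to watch the cumulative event counter for sudden jumps. This is a hedged sketch: the severity ordering (Normal < Caution < Warning < Critical), the function names, and the `max_increase_per_interval` parameter are assumptions, not part of the system described here.

```python
from typing import Optional, Sequence, Tuple

# Severity and action per throttle type, copied from the table above.
THROTTLE_TABLE = {
    "GPU Idle": ("Normal", "None"),
    "SW Power Cap": ("Normal", "Intended limitation"),
    "SW Thermal": ("Caution", "Review cooling improvement"),
    "HW Thermal": ("Critical", "Immediate cooling check"),
    "HW Power Brake": ("Warning", "Check power supply"),
}

# Assumption: severities are ordered from least to most severe like this.
SEVERITY_ORDER = ["Normal", "Caution", "Warning", "Critical"]


def worst_throttle(active: Sequence[str]) -> Optional[Tuple[str, str, str]]:
    """Return (type, severity, action) for the most severe active throttle type."""
    if not active:
        return None
    worst = max(active, key=lambda t: SEVERITY_ORDER.index(THROTTLE_TABLE[t][0]))
    severity, action = THROTTLE_TABLE[worst]
    return (worst, severity, action)


def throttle_events_rising(event_counts: Sequence[int],
                           max_increase_per_interval: int) -> bool:
    """True if the cumulative event counter jumps by more than the allowed amount."""
    return any(b - a > max_increase_per_interval
               for a, b in zip(event_counts, event_counts[1:]))
```

A suitable value for `max_increase_per_interval` depends on the sampling interval and the workload, so it is left to the caller.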

SM Activity

SM stands for Streaming Multiprocessor, which is the core computational unit within the GPU. A single GPU may have dozens to hundreds of SMs, and each SM consists of many cores. SM Activity indicates how actively these SMs are performing computations.

SM Activity provides more granular insight than overall GPU Utilization. While GPU Utilization shows the percentage of time the GPU is active, SM Activity indicates how much of the GPU's compute capacity is actually being used during operation. For example, a GPU Utilization of 80% does not necessarily mean every SM is operating at 80% load; some SMs may be running at 100% while others are idle.

Recommended Range:
  • For allocated GPUs: Above 50% (ideally above 75%).
  • Low SM Activity → Lack of parallelism → Increase batch size or apply model parallelization
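
The difference between the two metrics can be illustrated with a toy model. The sketch below assumes a simplified sampling scheme in which each sample records, per SM, whether that SM was busy; real drivers and monitoring agents compute these values differently, and the function names are hypothetical.

```python
from typing import Sequence


def gpu_utilization(samples: Sequence[Sequence[bool]]) -> float:
    """Percentage of samples in which at least one SM was busy."""
    busy_samples = sum(1 for sample in samples if any(sample))
    return 100.0 * busy_samples / len(samples)


def sm_activity(samples: Sequence[Sequence[bool]]) -> float:
    """Percentage of (sample, SM) slots in which the SM was busy."""
    busy_slots = sum(sum(sample) for sample in samples)
    total_slots = len(samples) * len(samples[0])
    return 100.0 * busy_slots / total_slots


# Toy GPU with 4 SMs: in 4 of 5 samples only half of the SMs are busy,
# and in the last sample the GPU is idle.
samples = [[True, True, False, False]] * 4 + [[False, False, False, False]]
```

In this toy data the GPU counts as active in 4 of 5 samples, yet only 8 of the 20 SM slots are busy, so the GPU Utilization figure looks healthy while SM Activity reveals that half of the SMs sit idle even when the GPU is working.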

ECC Errors

ECC stands for Error Correcting Code, a technology that automatically detects and corrects bit errors in GPU memory. Most data center-class GPUs are equipped with ECC memory to ensure data integrity. ECC is essential for workloads where accuracy is critical, such as AI training or scientific computing.

There are two types of ECC errors. Single Bit Error (SBE) occurs when only a single bit is incorrect; ECC can automatically correct this. Occasional SBEs are normal and can occur naturally due to cosmic rays or electrical noise. Double Bit Error (DBE) occurs when two or more bits are incorrect; ECC cannot correct this. DBEs can lead to data corruption or system failures.

Monitoring ECC errors is important for tracking the health of memory hardware. If the aggregate ECC error count exceeds 1,000, it is recommended to consider replacing the GPU’s memory. A sudden increase in the ECC error rate is a strong indication of deteriorating memory.

If even a single DBE occurs, it is classified as a critical event. The affected GPU should be immediately removed from workloads and inspected. This is especially important during critical training jobs, as a DBE can make results unreliable.

Error Types
  • Single Bit Error (SBE): Automatically correctable; occasional occurrences are normal
  • Double Bit Error (DBE): Not correctable; may cause data corruption
Action Criteria
  • Aggregate ECC errors ≥ 1,000: Consider GPU memory replacement
  • At least one DBE: Critical status; immediately stop jobs and inspect the GPU
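
The action criteria can be sketched as a single check. This is an illustration under assumptions: `ecc_action` is a hypothetical name, and the "aggregate ECC error count" is taken to be the sum of SBE and DBE counts, which the text does not state explicitly.

```python
def ecc_action(sbe_count: int, dbe_count: int) -> str:
    """Map ECC error counters to the action criteria listed above."""
    if sbe_count < 0 or dbe_count < 0:
        raise ValueError("error counts must be non-negative")
    # A single DBE is already critical, so check it first.
    if dbe_count >= 1:
        return "critical: immediately stop jobs and inspect the GPU"
    # Assumption: the aggregate count is SBE + DBE.
    if sbe_count + dbe_count >= 1000:
        return "consider GPU memory replacement"
    return "no action required"
```

The DBE check comes first so that a GPU with both a DBE and a high aggregate count is reported at the more severe level.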

GPU Health Evaluation System

Once users understand the meaning of each individual metric, the next step is to evaluate the overall health of a GPU by combining these metrics. This system does more than present raw numbers: it analyzes multiple metrics and classifies each GPU into one of six levels, allowing both administrators and users to assess the GPU's status at a glance.

Status Level

Each GPU is evaluated and categorized into one of six status levels. These statuses are color-coded and displayed on the dashboard, allowing administrators and users to quickly scan dozens of GPUs and immediately spot any issues.

| Status | Color | Meaning | Example |
| --- | --- | --- | --- |
| Excellent | Green | Optimal performance | Utilization 90%, Temp 65°C |
| Good | Green | Normal operation | Utilization 60%, Temp 75°C |
| Bad | Yellow | Suboptimal performance | Utilization 30%, needs tuning |
| Poor | Orange | Severe inefficiency | Utilization 10%, immediate check |
| Warning | Orange | Warning state | Temp 87°C, throttling detected |
| Critical | Red | Immediate action required | Temp 92°C, XID error |

  • Excellent: This is the best state, where everything is operating ideally. The assigned GPU is performing tasks with high efficiency, the temperature is low, and all metrics are within optimal ranges. This state is displayed in green, and users can work with peace of mind. For administrators, having most of the cluster in the Excellent state indicates healthy infrastructure.
  • Good: This is a normal and healthy operating state. All metrics are within acceptable ranges, though not as optimized as Excellent. For example, a GPU utilization of 60% meets recommended levels but doesn't reach the ideal 75%. This state is also displayed in green; there are no issues, but there is room for improvement.
  • Bad: This state indicates insufficient performance. The GPU is allocated but not being used efficiently; for example, utilization is only 30% or memory usage is minimal. This state is displayed in yellow and signals users to optimize their workloads. Costs are being wasted, but there is no immediate risk.
  • Poor: This is a very low-performance state, even more inefficient than Bad. It occurs when GPU utilization is below 10% or the GPU is allocated but doing almost no work. This state is displayed in orange and requires immediate investigation. Possible causes include a bug in the code, a failed job that still holds the GPU, or a developer who forgot to release the GPU.