GPU Monitoring (Project Level)

The GPU Monitoring screen lets users monitor the status and usage of the GPU Pools assigned to a project.

The screen is divided into two sections:

  • Section A: Profile Distribution
  • Section B: GPU Pool Card

Profile Distribution

  • Purpose: Displays the total number of profiles across all GPU Pools assigned to the project, categorized by profile type and status.
  • Color indicates the profile status.
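The aggregation behind this chart can be pictured as a count of profiles grouped by type and status. The following is only a minimal sketch: the record layout, the pool names, and the profile type names (`profile-small`, `profile-large`) are hypothetical, and the product's actual data model may differ.

```python
from collections import Counter

# Hypothetical records: (pool name, profile type, profile status),
# one entry per profile across all GPU Pools assigned to a project.
profiles = [
    ("pool-a", "profile-small", "In Use"),
    ("pool-a", "profile-small", "Available"),
    ("pool-b", "profile-small", "In Use"),
    ("pool-b", "profile-large", "Available"),
]


def profile_distribution(records):
    """Count profiles by (profile type, status), ignoring which pool they belong to."""
    return Counter((ptype, status) for _pool, ptype, status in records)


distribution = profile_distribution(profiles)
for (ptype, status), count in sorted(distribution.items()):
    print(f"{ptype:15} {status:10} {count}")
```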

GPU Pool Card

Each card corresponds to one GPU Pool and displays the following information:

  • Title: GPU Pool Name
  • Usage status:
    • Total: Total number of GPUs allocated to the pool
    • In Use: Number of GPUs in use
  • Utilization: Percentage of GPUs in use
  • Allocation: Number of profiles allocated, displayed in the format “Number of profiles in use/Total number of profiles in the pool”.
    • Progress bar: Green indicates in-use profiles; gray indicates available profiles.
  • Tag: GPU profile specification assigned to the pool.
  • Created at: Displays the pool creation date in YYYY-MM-DD format.
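As an illustration of how the card values relate to one another, the sketch below derives the Utilization, Allocation, and Created at fields from raw counts. The `GpuPool` class and its field names are assumptions made for this example and are not part of the product's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class GpuPool:
    # Hypothetical fields mirroring the values shown on a GPU Pool card.
    name: str
    total_gpus: int
    gpus_in_use: int
    total_profiles: int
    profiles_in_use: int
    created_at: date


def utilization_percent(pool: GpuPool) -> float:
    """Percentage of GPUs in use; an empty pool is reported as 0%."""
    if pool.total_gpus == 0:
        return 0.0
    return 100.0 * pool.gpus_in_use / pool.total_gpus


def allocation_label(pool: GpuPool) -> str:
    """Format as 'profiles in use/total profiles in the pool'."""
    return f"{pool.profiles_in_use}/{pool.total_profiles}"


def created_label(pool: GpuPool) -> str:
    """Format the creation date as YYYY-MM-DD."""
    return pool.created_at.isoformat()


pool = GpuPool("demo-pool", total_gpus=8, gpus_in_use=2,
               total_profiles=16, profiles_in_use=4,
               created_at=date(2024, 1, 5))
print(utilization_percent(pool), allocation_label(pool), created_label(pool))
```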

When a user clicks a GPU Pool card, the system navigates to the detailed information screen of the corresponding GPU Pool. This screen includes two main tabs (Allocation Tab and Monitoring Tab) for understanding both the configuration and actual usage status of the pool.

Allocation Tab: Resource Configuration Overview

The Allocation tab shows how the GPU Pool is physically configured. It visualizes which GPUs, with which profiles, are allocated to this Pool, on which nodes, and in which clusters.

General Status:

  • In Use: The number of profiles allocated to a workload.
  • Ready: The number of profiles not allocated to any workload.

Node-Profile (Resource Distribution)

  • Purpose: Displays the number of GPUs allocated and unallocated for each profile specification.
  • Status:
    • In Use (green): Profiles that are currently in use
    • Available (gray): Profiles that are available for use

Node-GPU (Physical GPU Distribution)

  • Purpose: Provides a detailed display of GPU placement on each node.
  • Status:
    • In Use (Pool) (green): Profiles that are currently in use within this pool.
    • In Use (External) (green): Profiles that are currently in use in another pool.
    • Idle (white): Available profiles
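The three Node-GPU statuses can be read as a simple classification based on which pool, if any, is using a profile. The sketch below assumes each profile records the name of the pool using it (or `None` when unused); that representation and the function name are hypothetical.

```python
from typing import Optional


def node_gpu_status(used_by_pool: Optional[str], this_pool: str) -> str:
    """Classify a profile as seen from the detail screen of `this_pool`.

    used_by_pool: name of the pool whose workload uses the profile,
                  or None if no workload uses it (hypothetical representation).
    """
    if used_by_pool is None:
        return "Idle"
    if used_by_pool == this_pool:
        return "In Use (Pool)"
    return "In Use (External)"


# Example: three profiles on one node, viewed from "pool-a".
for owner in ("pool-a", "pool-b", None):
    print(owner, "->", node_gpu_status(owner, "pool-a"))
```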

Monitoring Tab: Actual Usage

  • Purpose: Displays the Pods that are actually using the GPUs in this pool. Because it tracks resource usage at the Pod level, this is the most important tab for GPU Pool management.
  • Displayed Information:
    • Pod: Pod name
    • Profile: GPU profile used by the Pod.
    • GPU: UUID of the GPU used by the Pod.
      • Fire icon: Indicates whether Hardware Thermal Throttling is enabled or disabled.
      • Lightning icon: Indicates whether the GPU Power Limit (power consumption limit) is enabled or disabled.
      • Thermometer icon: Indicates the status of the Software Thermal Limit.
    • Status: Status of the profile.
    • SM Activity: Percentage of actual computing units (SMs) inside the GPU that are active.
    • SM Occupy: The ratio of active warps on an SM to the maximum supported warps, indicating how efficiently CUDA kernels utilize GPU resources.
    • SM Clock: The current speed of the SMs compared to their maximum speed.
    • Tensor: Indicates whether Tensor Cores are currently being utilized.
    • Memory Usage: GPU memory used as a percentage of the total available memory.
    • Memory Clock: Clock speed of the GPU memory, which determines how fast data is transferred between the GPU core and VRAM.
    • Temperature: GPU temperature.
    • Power: GPU power consumption.
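Several of these columns are ratios, and a small worked example may make the definitions concrete. The sketch below follows the descriptions above (SM Occupy as active warps over maximum supported warps, SM Clock as current speed over maximum speed, Memory Usage as used over total memory). The function name and sample numbers are illustrative assumptions, not values or calculations taken from the product.

```python
def ratio_percent(value: float, maximum: float) -> float:
    """Return value as a percentage of maximum, or 0.0 when maximum is not positive."""
    if maximum <= 0:
        return 0.0
    return 100.0 * value / maximum


# Illustrative sample readings for one GPU profile (made-up numbers).
active_warps, max_warps = 32, 64        # per SM
sm_clock_mhz, sm_max_clock_mhz = 1200, 1500
mem_used_gib, mem_total_gib = 20, 80

print("SM Occupy   :", ratio_percent(active_warps, max_warps), "%")
print("SM Clock    :", ratio_percent(sm_clock_mhz, sm_max_clock_mhz), "%")
print("Memory Usage:", ratio_percent(mem_used_gib, mem_total_gib), "%")
```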