
The node view displays all GPU nodes within the cluster and shows the physical layout and individual status of each GPU installed on each node in detail. This is essential for diagnosing hardware-level issues and identifying performance differences or thermal problems among GPUs within a node.

Access Node View Screen

From the Cluster List, clicking on a cluster card navigates to the Node View of the corresponding cluster.

The screen supports two view modes (Grid View and Table View), allowing users to view and manage data in different ways.

Table View

①~③ Summary of GPU Resource

④ GPU and MIG Instance Metrics

⑤ Detailed Specifications of GPUs and MIG Instances

Summary of GPU Resource

Resource Summary

  • Purpose: Display aggregated statistics of Nodes, GPUs, and MIGs in the system, along with their corresponding statuses.
  • Displayed information:
    • Node: Number of nodes in the cluster.
    • GPU: Number of GPUs in the cluster.
    • MIG: Number of MIG instances in the cluster.
    • Error/Throttling/Caution: Statistics on the number of errors, throttling events, and warnings.

Profile Distribution

  • Purpose: Display the number of each MIG profile specification and Full GPUs in the cluster, categorized by profile specification and status.
  • Color: Colors vary depending on the metric selected in section ④.
  • Statistical Categories: 1G, 2G, 3G, 4G, 7G, Full
  • Displayed information:
    • Center Value: Total number of the corresponding profile within the cluster.
    • Colored Markers: Display the color corresponding to each status along with the number of profiles in that status.
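The counts behind this chart can be sketched as a simple tally by profile category and status. This is only an illustrative sketch: the record layout, the sample data, and the function name `profile_distribution` are assumptions, not part of the product.

```python
from collections import Counter, defaultdict

# Hypothetical records: (profile category, status) for each profile in a cluster.
profiles = [
    ("1G", "Good"),
    ("1G", "Critical"),
    ("3G", "Excellent"),
    ("Full", "Good"),
    ("Full", "Unknown"),
    ("1G", "Good"),
]


def profile_distribution(records):
    """Return {category: {"total": n, "by_status": Counter}}.

    "total" corresponds to the center value of a chart; "by_status"
    corresponds to the colored markers and their counts.
    """
    dist = defaultdict(lambda: {"total": 0, "by_status": Counter()})
    for category, status in records:
        dist[category]["total"] += 1
        dist[category]["by_status"][status] += 1
    return dict(dist)


distribution = profile_distribution(profiles)
```

Here `distribution["1G"]["total"]` plays the role of the center value for the 1G category, and `distribution["1G"]["by_status"]` holds the per-status counts shown next to the colored markers.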

Profile Status Visualization

  • Purpose: Display the status of each GPU in a node, giving admins an overall view that makes status monitoring easier.
  • This section consists of multiple blocks, each containing five columns. Each column represents a node.
  • The tiles under a column represent physical GPUs, and the color of each tile indicates its status:
    • Gray: Unknown/Not allocated
    • Blue: Excellent, Good
    • Yellow: Bad, Poor
    • Red: Warning, Critical

⇒ In Full GPU mode, the status reflects the state of the GPU instance itself.

In MIG mode, the status is determined by the worst status among the GPU’s MIG instances.
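The tile rule above can be sketched as follows. The status names and colors come from the list above; the severity ordering (including where Unknown ranks, and Poor ranking above Bad) and the function names are assumptions made for illustration only.

```python
# Assumed severity order, from least to most severe. The source does not
# state this order explicitly; it is inferred from the color grouping.
SEVERITY = ["Unknown", "Excellent", "Good", "Bad", "Poor", "Warning", "Critical"]
RANK = {status: i for i, status in enumerate(SEVERITY)}

# Color mapping as listed in the documentation.
COLOR = {
    "Unknown": "Gray",
    "Excellent": "Blue",
    "Good": "Blue",
    "Bad": "Yellow",
    "Poor": "Yellow",
    "Warning": "Red",
    "Critical": "Red",
}


def gpu_tile_status(gpu_status, mig_statuses=None):
    """Return the status shown on a GPU tile.

    Full GPU mode (no MIG instances): the GPU's own status.
    MIG mode: the worst status among the GPU's MIG instances.
    """
    if not mig_statuses:
        return gpu_status
    return max(mig_statuses, key=RANK.__getitem__)


def tile_color(status):
    """Map a status to its tile color."""
    return COLOR[status]
```

For example, a GPU in MIG mode whose instances report Good, Bad, and Critical would be drawn as a red tile under this sketch, because Critical is the worst of the three.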

GPU and MIG Instance Metrics

  • Purpose: Visually display GPU and MIG Instance metrics on each node for the selected metric. Through this view, users can identify how many GPUs each node has, how many MIG Instances are configured on each GPU, and their corresponding metrics.
  • When the user hovers over a GPU or a MIG Instance, its summary information is displayed:
    • Node Name: Name of the node in the cluster, used to identify which node the GPU or MIG Instance belongs to.
    • GPU Name: Name of the GPU.
    • Profile Name: Name of the MIG profile.
    • Status Reason: Reason explaining the current status of the profile.
  • When the user clicks a GPU or MIG Instance, a panel opens on the right side, displaying detailed information about it, including changes in its Performance and Status over time:
    • Header: Displays the Cluster name, Node name, and GPU name in the format Cluster name/Node name/GPU name.
    • General information:
      • Profile: Name of profile
      • Status
      • Status Reason
      • Workload
    • Performance Trends:
      • SM Activity: The activation ratio of physical compute units (SMs) inside the GPU (%)
      • SM Occupancy: The ratio of active warps on an SM to the maximum supported warps, indicating how efficiently CUDA kernels utilize GPU resources (%)
      • SM Clock: The current speed of the SMs (MHz)
      • Tensor Active: Indicates whether the Tensor Cores are active.
      • Memory performance:
        • Memory Usage: Percentage of memory currently in use (%)
        • Memory Bandwidth: The read/write speed of the GPU memory (%)
    • Status Trend:
      • Temperature & Power
        • Temperature: Temperature of the GPU or MIG Instance (°C)
        • Power: Power consumption (W)
      • ECC Error (GPU):
        • ECC Volatile: Number of volatile ECC errors
        • ECC Aggregate: Total number of accumulated ECC errors

Detailed Specifications of GPUs and MIG Instances

  • Purpose: Displays detailed metrics of each GPU or MIG Instance.
  • Displayed Information:
    • Node: Name of the node where the GPU or MIG Instance is located.
    • Device: Device ID
    • Profile: Name of profile.
    • Status: Current status of profile.
      • Fire icon: Indicates whether Hardware Thermal Throttling is enabled or disabled.
      • Lightning icon: Indicates whether the GPU Power Limit (power consumption limit) is enabled or disabled.
      • Thermometer icon: Indicates the status of the Software Thermal Limit
    • Assigned:
      • Displays the pod to which the profile is assigned, shown in the format pod namespace (blue)/pod name.
      • The icon next to it is highlighted if the profile has ECC errors (including volatile ECC errors and aggregate ECC errors).
    • SM Act: SM Activity – The activation ratio of physical compute units (SMs) inside the GPU.
    • SM Occ: SM Occupancy – The ratio of active warps on an SM to the maximum supported warps, indicating how efficiently CUDA kernels utilize GPU resources.
    • SM Clk: SM Clock – The current speed of the SMs compared to their maximum speed.
    • Tensor: Indicates whether Tensor Cores are currently being utilized.
    • Mem Use: Memory Usage – GPU memory in use as a percentage of total available memory.
    • Mem Clk: Memory Clock – Determines how fast data is transferred between the GPU core and VRAM.
    • Temp: Temperature – GPU temperature.
    • Power: Power – GPU power consumption.

Filter

Allows users to filter Sections ④ and ⑤ with different criteria:

  • Profile Status: The available status filter values depend on the tab selected in Section ④ (e.g., Critical/Warning, Bad/Poor, Good/Excellent, Unknown).
  • Assigned: Determines whether the profile has been assigned to a pod or not.

⇒ Users can combine profile status and assignment conditions when filtering.
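Combining the two conditions can be sketched as a filter in which a row must satisfy every condition that is set. The `ProfileRow` type, its fields, the sample rows, and `filter_rows` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileRow:
    node: str
    profile: str
    status: str
    assigned_pod: Optional[str]  # "namespace/pod-name", or None if unassigned


def filter_rows(rows, statuses=None, assigned=None):
    """Keep rows that match both conditions; None means "do not filter"."""
    result = []
    for row in rows:
        if statuses is not None and row.status not in statuses:
            continue
        if assigned is not None and (row.assigned_pod is not None) != assigned:
            continue
        result.append(row)
    return result


# Hypothetical sample data.
rows = [
    ProfileRow("node-1", "1G", "Critical", "team-a/train-0"),
    ProfileRow("node-1", "3G", "Good", None),
    ProfileRow("node-2", "Full", "Warning", None),
    ProfileRow("node-2", "Full", "Good", "team-b/infer-1"),
]

# Profiles in Critical/Warning status that are assigned to a pod.
hot_and_assigned = filter_rows(rows, statuses={"Critical", "Warning"}, assigned=True)
```

In this sketch, leaving either argument as `None` disables that condition, which mirrors applying only one of the two filters in the UI.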

Grid View

①~③ Summary of GPU Resource

④ GPU and MIG Instance Metrics

Summary of GPU Resource

Resource Summary

Same as Table View

Profile Availability

  • Purpose: Display the percentage of available profiles, categorized by profile specification.
  • Color:
    • Green: ≤ 60%
    • Orange: > 60%
  • Statistical Categories: 1G, 2G, 3G, 4G, 7G, Full
  • Displayed information:
    • Center Value: Percentage of available profiles.
      • Percentage of available profiles = (number of available profiles ÷ total number of profiles) × 100%
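The formula above can be worked through with a small sketch. The sample counts and the function name `availability_percent` are assumptions, as is returning 0 when a category has no profiles; the source does not say how an empty category is displayed.

```python
def availability_percent(available, total):
    """Percentage of available profiles: (available / total) * 100.

    Returns 0.0 when total is 0 (an assumption, to avoid division by zero).
    """
    if total == 0:
        return 0.0
    return available / total * 100


# Hypothetical counts per profile category: (available, total).
counts = {
    "1G": (3, 4),
    "3G": (1, 2),
    "Full": (0, 0),
}

availability = {
    category: availability_percent(available, total)
    for category, (available, total) in counts.items()
}
```

With these sample counts, the 1G category has 3 of 4 profiles available, which gives 75%.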

Profile Distribution

  • Purpose: Display the number of each MIG profile specification, categorized by profile specification.
  • Color: Colors vary depending on the metric selected in section ④.
  • Statistical Categories: 1G, 2G, 3G, 4G, 7G, Full
  • Displayed information:
    • Each bar displays data divided into multiple-colored segments, where each color represents a different profile status.
    • The total number of profiles is shown on the right side of each bar, allowing admins to easily observe the proportion of each status within the total.

GPU and MIG Instance Metrics

Same as Table View

Filter

Allows users to filter Section ④ with different criteria:

  • Profile Status: The available status filter values depend on the tab selected in Section ④ (e.g., Critical/Warning, Bad/Poor, Good/Excellent, Unknown).
  • Assigned: Determines whether the profile has been assigned to a pod or not.

⇒ Users can combine profile status and assignment conditions when filtering.