Monitoring Architecture

Architecture Overview

ZCP collects metrics from multiple Data Plane Clusters with the Prometheus Agent and stores them centrally in Cortex on the Control Plane Cluster. Grafana provides built-in consolidated dashboards for cluster-wide and project-specific workloads, and the dashboards can be customized as needed.

Prometheus runs on each Data Plane Cluster, collects metrics there, and forwards them to Cortex on the Control Plane Cluster through the remote_write API.
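
As a rough illustration of the forwarding step, the sketch below builds the kind of remote_write fragment a Prometheus instance on a Data Plane Cluster might carry. The gateway URL and tenant ID are hypothetical placeholders. The X-Scope-OrgID header is the usual Cortex tenant header, but the source does not say how ZCP identifies tenants, so treat it as an assumption.

```python
import json


def build_remote_write(gateway_url: str, tenant_id: str) -> dict:
    """Return a Prometheus config fragment that forwards samples to Cortex."""
    return {
        "remote_write": [
            {
                # /api/v1/push is the Cortex remote-write endpoint.
                "url": gateway_url.rstrip("/") + "/api/v1/push",
                # Assumption: the tenant is identified by X-Scope-OrgID.
                "headers": {"X-Scope-OrgID": tenant_id},
            }
        ]
    }


# Hypothetical values, for illustration only.
fragment = build_remote_write("http://cortex-gateway.example.internal/", "cluster-a")
# JSON is also valid YAML, so YAML tooling can read this output.
print(json.dumps(fragment, indent=2))
```

Keeping the URL and tenant as parameters means the same helper could serve every Data Plane Cluster, with only the tenant ID changing per cluster.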

In the Control Plane Cluster, a Cortex Gateway receives metric data from the Prometheus Agents running on Data Plane Clusters. The Cortex components behind it store metrics from all clusters and serve the queries behind the dashboards. The Cortex Ruler and Alertmanager components evaluate alerting rules against the collected metrics and notify operators when predefined thresholds are exceeded.
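
To make the threshold-based alerting concrete, here is a minimal sketch of a rule group in the Prometheus rule format, which is the format Cortex evaluates for alerts. The group name, alert name, expression, and 0.8 threshold are illustrative assumptions, not rules shipped with ZCP.

```python
import json


def cpu_alert_rule(threshold: float, hold: str = "5m") -> dict:
    """Build a rule group with one alert that fires above `threshold`."""
    # node_cpu_seconds_total is exposed by Prometheus Node Exporter.
    expr = (
        "1 - avg by (instance) "
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) > '
        f"{threshold}"
    )
    return {
        "name": "example-node-alerts",  # hypothetical group name
        "rules": [
            {
                "alert": "HighNodeCpu",  # hypothetical alert name
                "expr": expr,
                "for": hold,  # condition must hold this long before firing
                "labels": {"severity": "warning"},
                "annotations": {"summary": "CPU usage above threshold"},
            }
        ],
    }


group = cpu_alert_rule(0.8)
print(json.dumps(group, indent=2))
```

The `for` field keeps short spikes from paging operators: the alert fires only after the expression has stayed true for the whole hold period.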

The Monitoring Backend Component also runs in the Control Plane Cluster. It manages Prometheus Agents, provides multi-project support, and configures Grafana organizations and dashboards.
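
The following sketch shows what a per-project Grafana datasource pointing at Cortex might look like when created through the Grafana HTTP API. It only builds the request body. The project-to-tenant mapping, the gateway URL, and the naming scheme are assumptions, since the source does not describe how the Monitoring Backend does this.

```python
def cortex_datasource(project: str, gateway_url: str) -> dict:
    """Build a Grafana datasource body that queries Cortex as one tenant."""
    return {
        "name": f"cortex-{project}",  # hypothetical naming scheme
        "type": "prometheus",  # Cortex exposes a Prometheus-compatible API
        "access": "proxy",
        # Assumption: Cortex serves queries under the /prometheus prefix.
        "url": gateway_url.rstrip("/") + "/prometheus",
        # Grafana sends custom headers declared as httpHeaderName<N> /
        # httpHeaderValue<N> pairs; the value is stored as a secret.
        "jsonData": {"httpHeaderName1": "X-Scope-OrgID"},
        "secureJsonData": {"httpHeaderValue1": project},
    }


# Hypothetical project and gateway address.
body = cortex_datasource("project-a", "http://cortex-gateway.example.internal")
print(body["name"], body["url"])
```

A body like this could be posted to the Grafana `/api/datasources` endpoint within the target organization, so that each project's dashboards only see that project's metrics.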

Components and Roles

Control Plane Components

| Component | Roles |
| --- | --- |
| Grafana | Visualizes collected metrics |
| Monitoring Backend | Manages Grafana organizations, datasources, and dashboards; deploys Prometheus on Data Plane Clusters |
| Cortex Gateway | Handles multi-tenant authentication; routes requests to Cortex components |
| Distributor | Validates incoming metrics; tracks high availability (HA); performs load balancing |
| Ingester | Stores metrics in long-term storage |
| Querier | Executes PromQL queries |
| Query frontend (Optional) | Enhances read performance; handles queuing, splitting, and caching |
| Ruler (Optional) | Runs PromQL queries for alerts and recording rules |
| Alertmanager (Optional) | Adds multi-tenant support for alerting |
| Configs (Optional) | Stores settings for Ruler and Alertmanager |
| Compactor | Merges multiple blocks into a single optimized block; reduces storage costs; speeds up queries |
| Store Gateway | Handles block sharding and replication; caches query results |
| Index cache | Improves lookup performance |
| Chunks cache | Stores retrieved data chunks for faster access |
| Metadata cache | Caches metadata, such as tenant lists and block mappings |
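
The gateway's two roles listed above, tenant checking and request routing, can be sketched roughly as follows. The routing table is an assumption based on the standard Cortex HTTP paths, and the component names are illustrative. The actual ZCP gateway configuration is not given in the source.

```python
# Assumed mapping from URL path prefix to the Cortex component behind it.
ROUTES = {
    "/api/v1/push": "distributor",
    "/prometheus/api/v1/rules": "ruler",
    "/prometheus": "query-frontend",
    "/alertmanager": "alertmanager",
}


def route(path: str, tenant: str) -> str:
    """Pick a backend by longest matching prefix; reject missing tenants."""
    if not tenant:
        raise PermissionError("missing tenant ID")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no route for {path}")
    # The longest prefix wins, so rule requests reach the ruler rather
    # than the generic query path.
    return ROUTES[max(matches, key=len)]


print(route("/api/v1/push", "cluster-a"))
print(route("/prometheus/api/v1/query", "cluster-a"))
```

Longest-prefix matching is used here because the rule and query paths share the `/prometheus` prefix.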

Data Plane Components

| Component | Roles |
| --- | --- |
| Prometheus | Acts as a time-series database; collects and stores metrics; triggers alerts |
| Prometheus Node Exporter | Gathers node-level system metrics |
| Kube State Metrics | Collects Kubernetes API-related metrics |