Monitoring Architecture
Architecture Overview
ZCP gathers metrics from multiple Data Plane clusters using the Prometheus Agent and stores them centrally in Cortex within the Control Plane Cluster. It offers built-in consolidated dashboards for monitoring cluster-wide and project-specific workloads via Grafana, with the flexibility to customize dashboards as needed.
The system deploys Prometheus on each Data Plane Cluster to collect metrics, which are then transmitted to the Cortex store on the Control Plane Cluster using the remote_write API.
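As a rough illustration of the `remote_write` path described above, the following Python sketch builds a Prometheus configuration that forwards samples to the Cortex Gateway. The gateway hostname, the `cluster` external label, the scrape interval, and the helper names are hypothetical. Sending the tenant in an `X-Scope-OrgID` header directly from the agent is also an assumption: in ZCP the gateway may derive the tenant from credentials instead.

```python
import json


def build_agent_config(cluster_name, gateway_url, tenant_id):
    """Return a Prometheus config dict that forwards samples via remote_write."""
    return {
        "global": {
            "scrape_interval": "30s",  # assumed value, not taken from ZCP
            # Label every sample with its source cluster so the central
            # store can tell Data Plane clusters apart.
            "external_labels": {"cluster": cluster_name},
        },
        "remote_write": [
            {
                # /api/v1/push is the Cortex remote write endpoint.
                "url": gateway_url.rstrip("/") + "/api/v1/push",
                # Cortex identifies tenants by the X-Scope-OrgID header.
                "headers": {"X-Scope-OrgID": tenant_id},
            }
        ],
    }


def render_config(config):
    """Serialize the config as JSON, which is also valid YAML 1.2."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    cfg = build_agent_config(
        "dp-cluster-1",                            # hypothetical cluster name
        "http://cortex-gateway.example.internal",  # hypothetical gateway URL
        "project-a",                               # hypothetical tenant ID
    )
    print(render_config(cfg))
```

The rendered output could be saved as the Prometheus configuration file for one Data Plane Cluster; a real deployment would normally add authentication settings and scrape jobs.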
In the Control Plane Cluster, a Cortex Gateway is deployed to receive metric data from Prometheus Agents running on Data Plane Clusters. Cortex components store and serve metrics from multiple clusters for visualization. Additionally, the Cortex Ruler and Alertmanager components support alerting rules based on the collected metrics and notify operators when predefined thresholds are exceeded.
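To make the threshold-based alerting concrete, here is a minimal sketch of a Prometheus-style rule group expressed as a Python dictionary. The group name, alert name, threshold, and severity label are hypothetical, and the expression assumes that Node Exporter metrics are collected and that samples carry a `cluster` label.

```python
import json


def cpu_alert_rule_group(threshold_percent=80, duration="5m"):
    """Build a rule group that fires when node CPU usage stays above a threshold."""
    # CPU usage is derived from the idle counter exposed by Node Exporter.
    expr = (
        "100 * (1 - avg by (cluster, instance) "
        '(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > '
        + str(threshold_percent)
    )
    return {
        "name": "node-cpu",  # hypothetical group name
        "rules": [
            {
                "alert": "HighNodeCpuUsage",  # hypothetical alert name
                "expr": expr,
                # The condition must hold this long before the alert fires.
                "for": duration,
                "labels": {"severity": "warning"},
                "annotations": {
                    "summary": "CPU usage on {{ $labels.instance }} is above "
                    + str(threshold_percent)
                    + "%"
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(cpu_alert_rule_group(90, "10m"), indent=2))
```

The Ruler would evaluate an expression like this periodically, and Alertmanager would route the resulting alert to operators. How rule groups are uploaded per tenant depends on the deployment and is not shown here.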
The Monitoring Backend component also runs in the Control Plane Cluster. It manages Prometheus Agents, provides multi-project support, and configures Grafana organizations and dashboards.
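One way to picture the per-project Grafana setup is as a pair of request payloads, sketched below. This is not the Monitoring Backend's actual implementation: the function names, the datasource naming scheme, and the mapping of one project to one Cortex tenant are assumptions. The payload shapes follow the Grafana HTTP API for creating an organization (`POST /api/orgs`) and a datasource (`POST /api/datasources`).

```python
def grafana_org_payload(project):
    """Body for creating one Grafana organization per project."""
    return {"name": project}


def cortex_datasource_payload(project, query_url, tenant_id):
    """Body for a Prometheus-type datasource that queries Cortex for one tenant."""
    return {
        "name": "cortex-" + project,  # hypothetical naming scheme
        "type": "prometheus",         # Cortex exposes a Prometheus-compatible API
        "access": "proxy",            # Grafana's server issues the queries
        "url": query_url,
        # Grafana sends custom headers defined as name/value pairs; the value
        # is stored in the encrypted secureJsonData section.
        "jsonData": {"httpHeaderName1": "X-Scope-OrgID"},
        "secureJsonData": {"httpHeaderValue1": tenant_id},
    }


if __name__ == "__main__":
    print(grafana_org_payload("project-a"))
    print(
        cortex_datasource_payload(
            "project-a",
            "http://cortex-gateway.example.internal/prometheus",  # hypothetical URL
            "project-a",
        )
    )
```

Scoping each datasource to a single tenant header is what would keep one project's dashboards from reading another project's metrics.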
Components and Roles
Control Plane Components
Component | Roles |
---|---|
Grafana | - Visualizes collected metrics |
Monitoring Backend | - Manages Grafana organizations, datasources, and dashboards - Deploys Prometheus on Data Plane Clusters |
Cortex Gateway | - Handles multi-tenant authentication - Routes requests to Cortex components |
Distributor | - Validates incoming metrics - Deduplicates samples from high-availability (HA) Prometheus replicas - Distributes samples across Ingesters |
Ingester | - Stores metrics in long-term storage |
Querier | - Executes PromQL queries |
Query frontend (Optional) | - Enhances read performance - Handles queuing, splitting, and caching |
Ruler (Optional) | - Runs PromQL queries for alerts and recording rules |
Alertmanager (Optional) | - Adds multi-tenant support for alerting |
Configs (Optional) | - Stores settings for Ruler and Alertmanager |
Compactor | - Merges multiple blocks into a single optimized block - Reduces storage costs - Speeds up queries |
Store Gateway | - Queries blocks in long-term storage - Handles block sharding and replication - Caches query results |
Index cache | - Improves lookup performance |
Chunks cache | - Stores retrieved data chunks for faster access |
Metadata cache | - Caches metadata, such as tenant lists and block mappings |
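The Cortex Gateway's two roles in the table above, authentication and routing, can be sketched as a small lookup function. This is a toy model under stated assumptions: the token-to-tenant map, the function name, and the exact path prefixes are hypothetical (the prefixes are modeled on Cortex's HTTP API and are configurable in real deployments).

```python
# Path prefix to backend component, checked in order.
ROUTES = [
    ("/api/v1/push", "distributor"),    # remote_write ingestion
    ("/api/v1/rules", "ruler"),         # rule management
    ("/alertmanager", "alertmanager"),  # alert routing UI and API
    ("/prometheus", "query-frontend"),  # PromQL read path
]


def route_request(path, token, tenants_by_token):
    """Authenticate a request, then pick the Cortex component that serves it.

    Returns (status, component, headers_to_add).
    """
    tenant = tenants_by_token.get(token)
    if tenant is None:
        return 401, None, None  # unknown credentials
    for prefix, component in ROUTES:
        if path.startswith(prefix):
            # Downstream Cortex components read the tenant from this header.
            return 200, component, {"X-Scope-OrgID": tenant}
    return 404, None, None  # no component serves this path


if __name__ == "__main__":
    tokens = {"s3cret": "project-a"}  # hypothetical credential store
    print(route_request("/api/v1/push", "s3cret", tokens))
    print(route_request("/prometheus/api/v1/query", "s3cret", tokens))
```

Resolving the tenant at the gateway means the components behind it never need to handle client credentials; they only see the tenant header.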
Data Plane Components
Component | Roles |
---|---|
Prometheus | - Acts as a time-series database - Collects and stores metrics - Triggers alerts |
Prometheus Node Exporter | - Gathers node-level system metrics |
Kube State Metrics | - Exposes metrics about the state of Kubernetes objects (such as Deployments, Nodes, and Pods) read from the Kubernetes API |
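For orientation, the sketch below shows how Prometheus might be pointed at the two exporters listed above. It uses static targets for brevity, which is an assumption: in-cluster deployments typically rely on Kubernetes service discovery instead. The target hostnames are hypothetical, and the ports are the exporters' usual defaults (9100 for Node Exporter, 8080 for Kube State Metrics).

```python
import json


def build_scrape_configs(node_exporter_targets, ksm_targets):
    """Return scrape jobs for Node Exporter and Kube State Metrics."""
    return [
        {
            "job_name": "node-exporter",
            "static_configs": [{"targets": list(node_exporter_targets)}],
        },
        {
            "job_name": "kube-state-metrics",
            "static_configs": [{"targets": list(ksm_targets)}],
        },
    ]


if __name__ == "__main__":
    jobs = build_scrape_configs(
        ["node-1.example.internal:9100", "node-2.example.internal:9100"],
        ["kube-state-metrics.example.internal:8080"],
    )
    # This list would go under the scrape_configs key of the Prometheus config.
    print(json.dumps({"scrape_configs": jobs}, indent=2))
```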