🚀 Executive Summary
TL;DR: Organizations monitoring hybrid cloud (AWS, GCP) and on-premise infrastructure face a dilemma: extend a traditional tool like Centreon, or adopt the cloud-native Prometheus/AlertManager stack. The article compares both approaches, plus a hybrid model, to guide selection based on infrastructure dynamics and operational needs.
🎯 Key Takeaways
- Centreon, using API-based Plugin Packs, offers a unified dashboard and leverages existing IT skillsets for monitoring stable hybrid environments, but can face latency and scalability issues with ephemeral cloud resources.
- Prometheus and AlertManager are designed for dynamic, cloud-native, and containerized workloads, featuring powerful service discovery, PromQL for flexible querying, and an efficient time-series data model.
- A hybrid strategy, where Prometheus collects cloud-native metrics and forwards critical alerts via AlertManager webhooks to Centreon, allows leveraging Prometheus’s flexibility while maintaining Centreon’s mature centralized alert management.
Choosing between Centreon and Prometheus with AlertManager for cloud monitoring in AWS and GCP requires a deep dive into architecture, scalability, and integration. This guide compares both solutions, provides configuration examples, and outlines a hybrid approach to help you select the right toolset for your cloud and on-premise infrastructure.
The Challenge: Cloud Monitoring Crossroads
You’re managing a hybrid infrastructure with critical workloads on-premise and across multiple cloud providers like AWS and GCP. Your existing monitoring stack, perhaps built around a traditional tool like Centreon, is robust for your servers and network gear. However, as you scale in the cloud, you face a new set of challenges:
- Ephemeral Infrastructure: Cloud resources (VMs, containers, functions) are created and destroyed dynamically. Traditional host-based, static monitoring struggles to keep up.
- Service-Oriented Metrics: You need to monitor managed services like RDS, S3, BigQuery, and Pub/Sub, which don’t have an “agent” you can install. Monitoring is done via APIs (e.g., CloudWatch, Google Cloud Monitoring).
- Metric Volume and Cardinality: Cloud-native applications, especially those using microservices and containers, generate a massive volume of high-cardinality metrics (e.g., metrics per container ID).
- Tooling Mismatch: The question arises—do you extend your existing, trusted tool (Centreon) to the cloud, or adopt a cloud-native stack like Prometheus and AlertManager?
This decision impacts everything from team skillset requirements to the reliability of your alerting. Let’s explore three practical solutions to this common problem.
Solution 1: The Centreon-Centric Approach
For organizations with a significant investment in Centreon, extending it to monitor the cloud is a logical first step. This approach leverages Centreon’s powerful framework and connects it to cloud provider APIs, treating cloud services as just another set of resources to be monitored.
How It Works
Centreon integrates with cloud platforms primarily through its “Plugin Packs” and the underlying Nagios-style plugins. The workflow is typically:
- Connectors: You use specific monitoring plugins (like centreon-plugin-Cloud-Aws-Api or centreon-plugin-Cloud-Gcp-Api) that query the cloud provider’s monitoring API (e.g., AWS CloudWatch, Google Cloud Monitoring).
- Authentication: The Centreon poller is configured with secure credentials (e.g., an AWS IAM user with specific permissions or a GCP Service Account key) to authenticate against the API (see the example policy after this list).
- Service Checks: You define service checks in Centreon that execute these plugins. For example, a check for an AWS RDS instance would call the plugin, which in turn queries the CloudWatch API for metrics like CPUUtilization or FreeableMemory.
- State-Based Alerting: Centreon evaluates the returned metrics against predefined WARNING and CRITICAL thresholds and generates alerts based on state changes (OK, WARNING, CRITICAL, UNKNOWN).
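The IAM user (or role) used by the poller only needs read access to the monitoring APIs it queries. A minimal sketch of such a policy, assuming EC2/CloudWatch checks only; adjust the actions to the services you actually monitor:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}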
Example: Monitoring an AWS EC2 Instance’s CPU
First, you install the necessary AWS plugin on your Centreon poller. Then, within the Centreon UI, you would configure a new host and a service check. The underlying command might look something like this:
/usr/lib/centreon/plugins/centreon_aws_ec2_api.pl \
    --plugin=cloud::aws::ec2::plugin \
    --mode=cpu \
    --aws-secret-key='SECRET_KEY' \
    --aws-access-key='ACCESS_KEY' \
    --region='eu-west-1' \
    --dimension-name='InstanceId' \
    --dimension-value='i-0123456789abcdef0' \
    --warning-cpu-utilization='80' \
    --critical-cpu-utilization='95'
This command checks the CPU utilization for a specific EC2 instance (i-0123456789abcdef0) and will change state if the utilization exceeds 80% (Warning) or 95% (Critical).
Pros & Cons
- Pros:
  - Unified Dashboard: Provides a single pane of glass for both on-premise and cloud resources.
  - Existing Skillset: Your team can leverage their existing Centreon expertise.
  - Mature Alerting: Benefits from Centreon’s robust notification, escalation, and dependency logic.
- Cons:
  - API Polling Latency: Relies on periodic polling of cloud APIs, which can have delays (e.g., CloudWatch metrics can have a 1-5 minute lag).
  - Scalability Concerns: Can become cumbersome and slow if you are polling thousands of cloud resources, potentially hitting API rate limits.
  - Less Suited for Ephemeral Resources: Auto-discovery of resources is possible but often requires more complex configuration compared to cloud-native solutions.
Solution 2: The Prometheus & AlertManager Stack
This approach embraces the cloud-native ecosystem. Prometheus is a pull-based monitoring system designed for the dynamic, service-oriented world of containers and microservices, making it a natural fit for monitoring cloud environments.
How It Works
The Prometheus stack uses a different paradigm:
- Exporters & Service Discovery: Instead of agents, Prometheus “scrapes” metrics from HTTP endpoints. For cloud services, you use specialized exporters (e.g., stackdriver_exporter for GCP, cloudwatch_exporter for AWS) that query the cloud APIs and expose the metrics in a Prometheus-compatible format. Crucially, Prometheus has built-in service discovery for AWS (EC2) and GCP (GCE), automatically finding new instances to monitor (a minimal exporter config sketch follows this list).
- Time-Series Database (TSDB): Prometheus stores all data as time-series, which is highly efficient for the high volume of metrics from cloud applications.
- PromQL: You query and analyze data using the powerful Prometheus Query Language (PromQL), which allows for complex aggregations and calculations on the fly.
- AlertManager: Alerting rules are defined in Prometheus based on PromQL expressions. When an alert fires, it is sent to AlertManager, which handles deduplication, grouping, silencing, and routing of notifications to different receivers (Slack, PagerDuty, email, etc.).
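On the AWS side, cloudwatch_exporter is driven by a YAML file listing the CloudWatch namespaces, metrics, and dimensions to pull. A minimal sketch, assuming you only want average EC2 CPU utilization (the namespace, metric, and dimension names are standard CloudWatch identifiers; tune statistics and regions to your environment):

# cloudwatch_exporter.yml (sketch)
region: eu-west-1
metrics:
  # Pull average CPU utilization for every EC2 instance in the region.
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]

The exporter then exposes these metrics on its own HTTP endpoint, which you add to Prometheus as a regular scrape target.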
Example: Scraping GCP Metrics and Alerting on High CPU
Your prometheus.yml configuration would use service discovery to find and scrape metrics from all GCE instances in a project:
# prometheus.yml
scrape_configs:
  - job_name: 'gcp-gce-instances'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zone: 'europe-west1-b'
        port: 9100  # Assuming node_exporter is running on this port
    relabel_configs:
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
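The AWS counterpart uses ec2_sd_configs. A minimal sketch, assuming the Prometheus host has instance-profile or environment credentials and that node_exporter also listens on port 9100:

# prometheus.yml (AWS counterpart, sketch)
scrape_configs:
  - job_name: 'aws-ec2-instances'
    ec2_sd_configs:
      - region: eu-west-1
        port: 9100
    relabel_configs:
      # Use the EC2 instance ID as the instance label.
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance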
Next, you would define an alerting rule in a separate file (e.g., gce_alerts.yml):
# gce_alerts.yml
groups:
  - name: gce_instance_alerts
    rules:
      - alert: HighCpuUtilization
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU utilization on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has had a CPU utilization above 90% for the last 10 minutes."
This rule will fire if any instance’s CPU utilization (calculated from the node_exporter metric) remains above 90% for 10 minutes. AlertManager would then take over to route the notification.
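For completeness, Prometheus also needs to know where the rule file lives and where AlertManager is reachable. A minimal sketch (the AlertManager address below is an assumption; adjust it to your deployment):

# prometheus.yml (excerpt)
rule_files:
  - 'gce_alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.internal:9093']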
Pros & Cons
- Pros:
  - Cloud-Native Design: Built for dynamic, ephemeral environments with powerful service discovery.
  - Powerful Query Language: PromQL is extremely flexible for slicing and dicing metrics.
  - Vibrant Ecosystem: A huge number of official and community-built exporters, integrations, and dashboards (e.g., Grafana).
- Cons:
  - Steeper Learning Curve: Requires learning PromQL and a new operational model (pull vs. push/check).
  - Not a Complete Solution: Prometheus focuses on metrics. For logs (Loki) and traces (Tempo), you often need to add other components. Centreon offers a more all-in-one experience.
  - Long-Term Storage: Requires a separate solution like Thanos or Cortex for long-term, highly available metric storage (a remote_write sketch follows this list).
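If you later bolt on a long-term store that accepts the Prometheus remote_write protocol (Cortex, Mimir, or Thanos Receive, for example), the change on the Prometheus side is small. A minimal sketch with a placeholder URL:

# prometheus.yml (excerpt, sketch)
remote_write:
  - url: 'https://metrics-store.example.internal/api/v1/push'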
Head-to-Head Comparison: Centreon vs. Prometheus/AlertManager for Cloud
| Feature | Centreon | Prometheus & AlertManager |
| --- | --- | --- |
| Architecture | Centralized pollers executing checks (push/active check model). State-based (OK, WARN, CRIT). | Decentralized scrapers pulling metrics from endpoints. Stores data as time-series. |
| Cloud Integration | Via API-based plugins (Plugin Packs). Requires manual or semi-automated configuration of hosts/services. | Native service discovery for major cloud providers. Uses exporters to query cloud APIs (e.g., cloudwatch_exporter). |
| Dynamic Environments | Can be challenging. Relies on auto-discovery modules or API scripts to keep configuration in sync. | Excellent. Service discovery automatically detects and removes targets as they are created and destroyed. |
| Alerting | Mature and powerful. Features complex dependencies, acknowledgements, scheduled downtime, and escalation chains built-in. | Highly flexible rules via PromQL. AlertManager handles grouping, silencing, and routing but lacks Centreon’s deep dependency logic out-of-the-box. |
| Data Model | Stores performance data (RRDtool) and state. Less suited for high-cardinality metrics. | Time-series with labels. Optimized for high-volume, high-cardinality data from sources like containers. |
| Best For | Hybrid environments with a strong on-premise footprint. Teams invested in a traditional ITIL/NOC workflow. | Cloud-native, containerized, and microservice-based workloads. DevOps teams that value flexibility and integration. |
Solution 3: The Hybrid Approach – Best of Both Worlds?
You don’t always have to choose. A hybrid approach can be a powerful strategy, especially during a transition period or in complex environments where each tool plays to its strengths.
How It Works
The goal is to integrate the two systems. A common and effective pattern is to use Prometheus for what it does best (collecting cloud-native metrics) and feed critical alerts into Centreon to leverage its powerful notification engine.
- Prometheus scrapes metrics from cloud services and applications.
- Alerting rules are defined in Prometheus.
- When an alert fires, Prometheus sends it to AlertManager.
- AlertManager is configured with a webhook_config receiver that forwards the alert to a custom script or API endpoint on the Centreon side.
- This script then uses the Centreon API (or a passive check mechanism like NSCA/Gorgone) to create/update a service status within Centreon.
This way, your Network Operations Center (NOC) can still use Centreon as their single source of truth for alerts, while your DevOps teams can leverage the power and flexibility of Prometheus for cloud monitoring.
Example: Forwarding Prometheus Alerts to Centreon
In your alertmanager.yml, you would define a receiver that points to a webhook listener on your Centreon server:
# alertmanager.yml
route:
  receiver: 'centreon-webhook'

receivers:
  - name: 'centreon-webhook'
    webhook_configs:
      - url: 'http://your-centreon-server/path/to/webhook-listener.php'
        send_resolved: true
The webhook-listener.php script would be responsible for parsing the JSON payload from AlertManager and translating it into a passive check result for a corresponding service in Centreon. For example, it could extract the alert’s status (‘firing’ or ‘resolved’) and map it to a Centreon state (CRITICAL or OK).
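Purely as an illustration of that translation logic (not the PHP script itself), here is a minimal Python sketch that receives the AlertManager payload and writes a Nagios-style PROCESS_SERVICE_CHECK_RESULT passive result to Centreon Engine’s external command file. The command-file path, listener port, and label-to-host mapping are assumptions you would adapt to your installation:

# webhook_listener.py - illustrative sketch only; the article's
# webhook-listener.php would implement the same translation in PHP.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: Centreon Engine external command pipe; verify the path on your poller.
COMMAND_FILE = "/var/lib/centreon-engine/rw/centengine.cmd"

# Map AlertManager alert status to a Nagios/Centreon service state code.
STATE_MAP = {"firing": 2, "resolved": 0}  # 2 = CRITICAL, 0 = OK


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))

        for alert in payload.get("alerts", []):
            labels = alert.get("labels", {})
            host = labels.get("instance", "unknown-host")       # assumed mapping
            service = labels.get("alertname", "prometheus-alert")
            state = STATE_MAP.get(alert.get("status"), 3)        # 3 = UNKNOWN
            output = alert.get("annotations", {}).get("summary", "No summary")

            # Submit a passive check result via the external command interface.
            cmd = "[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{svc};{state};{out}\n".format(
                ts=int(time.time()), host=host, svc=service, state=state, out=output
            )
            with open(COMMAND_FILE, "w") as f:
                f.write(cmd)

        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # Assumption: port 9099 is free; front this with auth/TLS in production.
    HTTPServer(("0.0.0.0", 9099), AlertHandler).serve_forever()

For this to work, the host/service pairs must already exist in Centreon and be configured to accept passive check results.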
When to Use This Approach
- You have a mature Centreon deployment with complex on-call schedules, escalations, and reporting that you cannot easily replicate.
- Your DevOps teams need the flexibility of Prometheus and PromQL for monitoring dynamic cloud applications.
- You are in a multi-year transition from traditional infrastructure to the cloud and need a bridge between the two monitoring worlds.
Conclusion: Making the Right Choice
The choice between Centreon and Prometheus/AlertManager is not just about technology; it’s about matching the tool to your architecture, your team, and your operational model.
- Go with Centreon if your primary focus is on providing a unified view of a stable, hybrid infrastructure and you value its mature, all-in-one feature set for traditional IT operations.
- Choose Prometheus & AlertManager if your infrastructure is heavily cloud-native, containerized, and dynamic. This stack is built for the scale and ephemerality of modern cloud environments.
- Consider a Hybrid approach to leverage the strengths of both platforms, using Prometheus for cloud data collection and Centreon for centralized alert management and reporting. This offers a pragmatic path forward for complex organizations.
Ultimately, the best solution is one that provides clear, actionable insights into the health of your systems, regardless of where they run.
👉 Read the original article on TechResolve.blog
☕ Support my work
If this article helped you, you can buy me a coffee:
