2026-01-18 · Hünkar Döner

How to use AI to make Kubernetes monitoring smarter

Kubernetes · AI · DevOps · Prometheus · Monitoring · Automation · AWS EKS

A practical guide to building an intelligent alert handling system using Prometheus, n8n, and OpenAI

As a DevOps engineer, I'm sure you've experienced the pain of being woken up by Prometheus alerts at 3 AM. Every time an alert comes in, you have to get up, check pod status, dig through logs, troubleshoot the issue, and often find that it could have been resolved with a simple restart — but you've already spent 30 minutes investigating.

What frustrated me most was that many alerts follow predictable troubleshooting patterns. For pod health issues, we always check logs, resource usage, and configuration. We repeat the same steps every time. It's such a waste of time.

So I started wondering: Could AI help us with these repetitive tasks? Could it follow our expert troubleshooting logic to diagnose problems and provide initial recommendations?

After several months of experimentation and practice, I finally built this intelligent monitoring system. Today I want to share my experience with you.

The Problems I Was Facing

Before diving in, let me share some typical problems our team encountered:

Problem 1: Alert Overload

Our Kubernetes clusters generate hundreds of alerts daily, from pod restarts to node resource shortages. The ops team often gets overwhelmed by the alert flood, unable to prioritize which issues need immediate attention.

Problem 2: Standardized but Manual Troubleshooting

For a KubePodNotReady alert, our standard process is:

  1. Check pod status (kubectl get pods)
  2. Get detailed info (kubectl describe pod)
  3. Check logs (kubectl logs)
  4. Restart service if necessary

This workflow is mature, but having to execute it manually every time is inefficient.

Problem 3: Steep Learning Curve for New Team Members

New ops engineers often don't know where to start when facing alerts. Even with documentation, they still need guidance from experienced colleagues during actual troubleshooting.

My Solution Approach

Facing these problems, my idea was simple:

Could I teach AI the troubleshooting mindset of ops experts to do initial problem diagnosis?

Specifically:

  • Let AI understand alert content — Extract key information from Alertmanager JSON
  • Let AI know how to troubleshoot — Choose appropriate checking steps based on alert types
  • Let AI operate the cluster — Call kubectl commands through tools
  • Let AI provide professional advice — Offer diagnosis and solutions based on check results
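The four steps above can be sketched end to end. Everything here is illustrative — the function names, the strategy table, and the stub tool/LLM calls are my stand-ins for the real n8n workflow, MCP tool calls, and OpenAI API:

```python
import json

# Assumed names throughout -- a minimal sketch of the four steps,
# not the actual n8n workflow.

CHECK_STRATEGIES = {
    "KubePodNotReady": ["get pods", "describe pod", "logs"],
}
DEFAULT_STRATEGY = ["get pods"]

def run_kubectl_tool(cmd: str, cluster: str) -> str:
    # Stand-in: in the real system this is an MCP tool call over SSE/HTTP.
    return f"[{cluster}] kubectl {cmd} -> ok"

def ask_llm(alert: dict, results: list[str]) -> str:
    # Stand-in: in the real system this is an OpenAI chat completion.
    return f"Diagnosis for {alert['labels']['alertname']}: {len(results)} checks run"

def handle_alert(raw_payload: str) -> str:
    alert = json.loads(raw_payload)["alerts"][0]
    alert_type = alert["labels"]["alertname"]                      # 1. understand the alert
    cluster = alert["labels"]["cluster"]
    strategy = CHECK_STRATEGIES.get(alert_type, DEFAULT_STRATEGY)  # 2. choose checks
    results = [run_kubectl_tool(c, cluster) for c in strategy]     # 3. operate the cluster
    return ask_llm(alert, results)                                 # 4. produce advice
```

The real workflow adds caching, logging, and report formatting around this skeleton, but the control flow is the same.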

The Architecture I Built

After iterative adjustments, I finally chose this technology stack:

System Architecture

Kubernetes Cluster → Prometheus Server → Alertmanager
                                              ↓
n8n Workflow ← OpenAI API ← Custom MCP Server
     ↓              ↓
Redis Cache    Kibana Logs

Data Flow

Metrics → Alert Rules → Alert Routing → Workflow Orchestration → AI Analysis → Cluster Diagnosis → Automated Reports

Why Choose n8n?

Initially, I considered writing Python scripts directly, but I discovered n8n has several advantages:

  1. Open Source & Free: Can self-host without licensing fees
  2. Visual Orchestration: The entire processing flow is clear and easy to debug
  3. Rich Nodes: Webhook, HTTP requests, AI calls all have ready-made nodes
  4. Easy Extension: Adding new processing logic is just drag-and-drop
  5. Team Collaboration: Other colleagues can understand and modify workflows

Why Build Custom MCP Server?

At the time there was no ready-made Kubernetes MCP Server on the market, so I developed one myself. Main features include:

  • 17 K8s Tools: From basic kubectl get to advanced kubectl top
  • Security Control: Only allows safe queries and limited operations (like restarting Deployments)
  • Multi-cluster Support: Can manage multiple K8s clusters simultaneously, including AWS EKS
  • Dual Protocol Support: Supports both n8n's SSE and other clients' HTTP
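As a sketch of the security-control idea, a gate like the following could sit in front of every kubectl invocation. The allowed verbs reflect what the article describes (safe queries plus restarting Deployments); the function itself is my illustration of the approach, not the server's actual code:

```python
import shlex
import subprocess

# Whitelist sketch (assumed): read-only verbs, plus rollout restart
# as the one permitted mutation.
ALLOWED_VERBS = {"get", "describe", "logs", "top"}
ALLOWED_MUTATIONS = {("rollout", "restart")}  # e.g. restarting a Deployment

def is_allowed(command: str) -> bool:
    parts = shlex.split(command)
    if len(parts) < 2 or parts[0] != "kubectl":
        return False
    if parts[1] in ALLOWED_VERBS:
        return True
    return tuple(parts[1:3]) in ALLOWED_MUTATIONS

def run_kubectl(command: str) -> str:
    # Only whitelisted commands ever reach the cluster.
    if not is_allowed(command):
        raise PermissionError(f"blocked: {command}")
    out = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=30)
    return out.stdout
```

Anything destructive — delete, apply, scale to zero — simply never makes it past the gate, no matter what the model asks for.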

How Well Does It Work?

Let me demonstrate with a real case.

Case Study: Pod Long-term NotReady

  • Traditional Approach:

    1. Receive alert SMS or email
    2. Log into jump server
    3. Switch to corresponding cluster
    4. Manually execute kubectl commands for troubleshooting
    5. Analyze results, formulate solutions
    6. Execute fix operations
    7. Confirm problem resolution
  • AI Intelligent Processing:

When the system receives this alert JSON:

{
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "KubePodNotReady",
        "cluster": "prod-cluster-***",
        "namespace": "production",
        "pod": "app-service-***",
        "severity": "warning"
      },
      "annotations": {
        "description": "Pod has been in a non-ready state for more than 10 minutes"
      }
    }
  ]
}

The AI System Will Automatically:

Step 1: Extract Key Information

Alert Type: KubePodNotReady
Cluster: prod-cluster-***
Namespace: production
Pod Name: app-service-***
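The extraction step fits in a few lines. The field names come from the Alertmanager payload shown above; the function itself is an illustrative sketch:

```python
# Sketch of step 1: pull the must-extract fields (cluster, alertname)
# and the optional ones out of the Alertmanager webhook payload.

def extract_key_info(payload: dict) -> dict:
    alert = payload["alerts"][0]
    labels = alert["labels"]
    return {
        "alert_type": labels["alertname"],     # must extract
        "cluster": labels["cluster"],          # must extract
        "namespace": labels.get("namespace"),  # optional
        "pod": labels.get("pod"),              # optional
        "status": alert["status"],             # firing or resolved
    }
```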

Step 2: Choose Check Strategy

Based on alert type, AI knows this is a "Pod Status Problem" and follows preset checking procedures.

Step 3: Execute Check Commands

# 1. Check Pod status
kubectl get pods -n production --show-labels -o wide
# 2. If status is abnormal, get detailed info
kubectl describe pod app-service-*** -n production
# 3. If needed, check logs
kubectl logs app-service-*** -n production --tail=20

Step 4: AI Analysis and Report Generation

────────────────────────────────
🔧 K8s Alert Handler AI
🚨 KubePodNotReady - firing
Start Time: 2025-01-15T10:30:00Z
Cluster: prod-cluster-***
Component: production/app-service-***
Quick Diagnosis:
Pod in NotReady state, container restarted 3 times
Last restart time: 5 minutes ago
Error reason: Database connection timeout
Analysis Conclusion:
Application cannot connect to database, possibly due to database service issues or network problems
Recommended Actions:
1. Immediately check database service status
2. Check Service and Endpoint configuration
3. Consider restarting Pod or scaling database
4. Check network policies and firewall rules
────────────────────────────────

The entire process completes in 2–3 minutes and runs 24/7!

The AI Logic I Designed

The core of this system is AI's decision logic. I spent a lot of time fine-tuning the AI's system message to handle various alerts in our production environment.

Supported Alert Types

Currently, the system can intelligently handle 45 different Kubernetes alerts, covering almost all common problem scenarios:

Cluster Core Component Alerts (10)

  • AlertmanagerClusterCrashlooping, AlertmanagerClusterDown
  • AlertmanagerClusterFailedToSendAlerts, AlertmanagerConfigInconsistent
  • AlertmanagerFailedReload, KubeAPIDown
  • KubeAPIErrorBudgetBurn, PrometheusBadConfig
  • PrometheusTargetSyncFailure, PrometheusRuleFailures

Workload Related Alerts (12)

  • KubeDeploymentReplicasMismatch, KubeStatefulSetReplicasMismatch
  • KubePodCrashLooping, KubePodNotReady
  • KubeJobFailed, KubeHpaReplicasMismatch
  • KubeHpaMaxedOut, KubeHpaHighUtilization
  • KubeHpaFrequentScaling, KubeImagePullBackOff
  • KubeServiceEndpointsUnavailable, KubeConfigReloadFailed

Container and Resource Alerts (12)

  • KubeContainerOOMKilled
  • KubeContainerCPUNearLimit
  • KubeContainerMemoryNearLimit
  • KubeTooManyPendingPods
  • KubeCPUOvercommit
  • KubeMemoryOvercommit
  • KubeQuotaExceeded
  • KubeQuotaAlmostFull
  • CPUThrottlingHigh
  • NodeCPUHighUsage
  • NodeMemoryHighUtilization
  • NodeSystemSaturation

Node and Storage Alerts (10)

  • KubeletDown, KubeNodeNotReady
  • NodeFileDescriptorLimit, KubePersistentVolumeErrors
  • KubePersistentVolumeFillingUp, KubePersistentVolumeInodesFillingUp
  • KubeVolumeMountFailed, NodeFilesystemAlmostOutOfSpace
  • NodeFilesystemSpaceFillingUp, KubeStateMetricsDown

Certificate and Monitoring Alerts (5)

  • KubeClientCertificateExpiration
  • KubeletClientCertificateExpiration
  • KubeletServerCertificateExpiration
  • KubeStateMetricsListErrors
  • Watchdog

Each alert type has a corresponding check strategy, and the AI automatically selects the most appropriate troubleshooting process based on the alert type.

AI Decision Logic Design

1. Parameter Extraction Rules

Tell AI how to find key information from complex JSON:

  • Must Extract: cluster, alertType (most important, cannot be wrong)
  • Optional Extract: namespace, pod, container, deployment, etc.
  • Status Judgment: Whether status is firing or resolved

2. Problem Classification Logic

I categorized 45 alerts into several major classes, each with corresponding check strategies:

  • HPA Problems:
    • Check HPA status first, then Deployment
    • Maximum 2 kubectl calls
  • Pod Problems:
    • Check Pod list first, then describe specific Pod, check logs if necessary
    • Maximum 3 kubectl calls
  • Node Problems:
    • Check node list first, then describe specific node
    • Maximum 2 kubectl calls
  • Resource Problems:
    • Use kubectl top to check resource usage
    • Maximum 2 kubectl calls
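This classification can be expressed as a simple dispatch table. The class names and call budgets come from the list above; the command templates and alert-to-class mapping below are illustrative excerpts, not the full 45-alert table:

```python
# Each problem class maps to an ordered check plan and a kubectl call
# budget, matching the limits listed above. Templates are illustrative.

STRATEGIES = {
    "hpa":      {"max_calls": 2, "checks": ["get hpa -n {ns}", "describe deployment {deploy} -n {ns}"]},
    "pod":      {"max_calls": 3, "checks": ["get pods -n {ns}", "describe pod {pod} -n {ns}", "logs {pod} -n {ns} --tail=20"]},
    "node":     {"max_calls": 2, "checks": ["get nodes", "describe node {node}"]},
    "resource": {"max_calls": 2, "checks": ["top pods -n {ns}", "top nodes"]},
}

# Excerpt of the alert-to-class mapping (the real table covers all 45 alerts).
ALERT_CLASS = {
    "KubeHpaMaxedOut": "hpa",
    "KubePodNotReady": "pod",
    "KubePodCrashLooping": "pod",
    "KubeNodeNotReady": "node",
    "NodeCPUHighUsage": "resource",
}

def plan_for(alertname: str) -> dict:
    # Unknown alerts fall back to the generic pod workflow.
    return STRATEGIES[ALERT_CLASS.get(alertname, "pod")]
```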

3. Rate Limit Protection

This is important! I limit each alert to a maximum of 3 kubectl commands to prevent the AI from overwhelming the API.

  • Strategy:
    • First time: Combined query, get overview information at once
    • Second time: Targeted describe, deep dive into problems
    • Third time: Log viewing or other supplementary information

4. Smart Skip Logic

If the first check finds status normal, subsequent deep checks won't be executed, saving resources.
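The call budget and skip logic combine naturally into one loop. A sketch, with an assumed `run_check` helper and an illustrative "looks healthy" heuristic:

```python
# Stop as soon as the overview check looks healthy, and never exceed
# the per-alert budget. `run_check` stands in for a kubectl tool call.

MAX_KUBECTL_CALLS = 3

def diagnose(checks: list[str], run_check) -> list[str]:
    results = []
    for cmd in checks[:MAX_KUBECTL_CALLS]:   # hard cap per alert
        output = run_check(cmd)
        results.append(output)
        if "Running" in output and "0 restarts" in output:
            break                            # first check looks normal: skip deep dives
    return results
```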

Pitfalls I Encountered

Pitfall 1: AI Tends to "Overthink"

Without a rate limit, the AI initially executed dozens of commands, overwhelming the K8s API. I later added limits, forcing it to complete a diagnosis within a bounded number of calls.

Pitfall 2: Alert Name Confusion

The Alertmanager JSON has both alertname and summary fields, and the AI sometimes confused them. I specifically emphasized using alertname in the system message.

Pitfall 3: Cluster Authentication Issues

We have multiple GKE clusters and Amazon EKS clusters, each with different authentication methods. I eventually added auto-authentication logic to the MCP Server, so the AI doesn't need to handle these details.

Pitfall 4: Inadequate Error Handling

Initially I didn't account for kubectl command failures, and the workflow often got stuck on permission or network issues. I later added complete error handling and retry mechanisms.
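A retry wrapper of this kind might look like the following; the timeout and backoff values are illustrative, not the production settings:

```python
import subprocess
import time

# Sketch of the error-handling layer: a kubectl failure (permissions,
# network, timeout) is retried with exponential backoff instead of
# hanging the workflow, and a structured error string is returned so
# the AI report can mention the failure.

def run_with_retry(args: list[str], attempts: int = 3, base_delay: float = 1.0) -> str:
    last_error = None
    for i in range(attempts):
        try:
            out = subprocess.run(args, capture_output=True, text=True,
                                 timeout=30, check=True)
            return out.stdout
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            last_error = exc
            time.sleep(base_delay * 2 ** i)   # exponential backoff
    return f"ERROR: command failed after {attempts} attempts: {last_error}"
```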

Actual Performance Results

After 3 months of operation, the results are quite good:

Data Comparison

  • Alert Processing Time: Reduced from average 20 minutes to 3 minutes
  • Night Response: 24/7 automated processing, no more middle-of-night wake-ups
  • New Employee Training: New colleagues can directly reference AI diagnostic reports for learning
  • Processing Accuracy: 90% of common problems get correct diagnostic direction

This system has significantly improved our efficiency in handling container and orchestration issues. Whatever CI/CD and monitoring tools you use, integrating AI into your alert handling is a game changer.