2024-05-21 | DevOps

Build Centralized Alerting with CloudWatch, EventBridge, and CDK

Tags: AWS, CloudWatch, EventBridge, CDK, Lambda, DevOps

How to Build Centralized Alerting in AWS Organizations

Managing alerts in a multi-account environment governed by AWS Control Tower or Landing Zones can be an operational nightmare. Logging into each account individually to check alarms does not scale. In this post, we will build a centralized alerting system that aggregates alarms from your entire organization into a single "Observability Account" using Amazon CloudWatch, Amazon EventBridge, and AWS Lambda.

The Case for Centralization

If you manage multiple AWS accounts (e.g., Prod, Staging, Dev), decentralized monitoring leads to missed incidents and alert fatigue. A centralized structure provides:

  • Single Pane of Glass: Monitor all alarms from one location.
  • Standardization: Use common notification channels (Slack, Discord, Microsoft Teams) for all teams.
  • Automation: Trigger automated remediation actions centrally.

Architecture Overview

The system consists of three main components:

  1. Member Accounts: Where CloudWatch alarms reside. When an alarm changes state, CloudWatch emits an event to the account's default EventBridge bus, and a rule forwards it to the central bus.
  2. Management Account: Uses CloudFormation StackSets to deploy the necessary EventBridge rules to all member accounts.
  3. Observability Account: The central hub. A custom EventBus receives events, and an EventBridge Rule triggers a Lambda function to send notifications.

[Figure: AWS centralized alerting architecture diagram]
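To make the flow concrete, here is a rough sketch of the event that travels from a member account to the central bus, together with a check that mirrors the source and detail-type match used by the rules in this post. The field list is trimmed to what a notifier typically needs, and the account ID, alarm name, timestamps, and the helper name matchesForwardingRule are illustrative assumptions.

```typescript
// Trimmed shape of a "CloudWatch Alarm State Change" event.
// Only the fields used in this post are listed; real events carry more.
type AlarmState = 'OK' | 'ALARM' | 'INSUFFICIENT_DATA';

interface AlarmStateChangeEvent {
  source: string;
  'detail-type': string;
  account: string; // ID of the member account that owns the alarm
  region: string;
  time: string;
  resources: string[]; // contains the alarm ARN
  detail: {
    alarmName: string;
    state: { value: AlarmState; reason: string; timestamp: string };
    previousState?: { value: AlarmState; reason: string; timestamp: string };
  };
}

// Hypothetical helper: applies the same source/detail-type match as the EventBridge rules.
function matchesForwardingRule(event: { source?: string; 'detail-type'?: string }): boolean {
  return (
    event.source === 'aws.cloudwatch' &&
    event['detail-type'] === 'CloudWatch Alarm State Change'
  );
}

// Illustrative sample; every value below is a placeholder.
const sampleEvent: AlarmStateChangeEvent = {
  source: 'aws.cloudwatch',
  'detail-type': 'CloudWatch Alarm State Change',
  account: '111111111111',
  region: 'us-east-1',
  time: '2024-05-21T10:00:00Z',
  resources: ['arn:aws:cloudwatch:us-east-1:111111111111:alarm:TestAlarm'],
  detail: {
    alarmName: 'TestAlarm',
    state: { value: 'ALARM', reason: 'Threshold crossed', timestamp: '2024-05-21T10:00:00.000+0000' },
    previousState: { value: 'OK', reason: 'Within threshold', timestamp: '2024-05-21T09:55:00.000+0000' },
  },
};
```

Because the event keeps the originating account and region, the central Lambda function can tell where an alarm fired without any extra lookup.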

Implementation Guide (with CDK)

We will implement this using the AWS Cloud Development Kit (CDK) and TypeScript.

1. Setting Up the Observability Account

First, let's configure the central account that will receive the alarms. We'll create a custom EventBus and a Lambda function.

// lib/observability-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';

export class ObservabilityStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. Create Central EventBus
    const centralBus = new events.EventBus(this, 'CentralAlertingBus', {
      eventBusName: 'central-alerting-bus',
    });

    // 2. Allow other accounts to put events (Resource Policy)
    // Note: In production, restrict this to your organization, for example
    // with a condition on the aws:PrincipalOrgID key
    centralBus.addToResourcePolicy(new iam.PolicyStatement({
      sid: 'AllowAllAccounts',
      actions: ['events:PutEvents'],
      principals: [new iam.AnyPrincipal()],
      resources: [centralBus.eventBusArn],
    }));

    // 3. Lambda function for notifications
    const alertingLambda = new lambda.Function(this, 'AlertingLambda', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/alerting'),
      environment: {
        WEBHOOK_URL: 'https://discord.com/api/webhooks/...' // Your Webhook URL
      }
    });

    // 4. Rule on the EventBus
    new events.Rule(this, 'CatchAllAlarms', {
      eventBus: centralBus,
      eventPattern: {
        source: ['aws.cloudwatch'],
        detailType: ['CloudWatch Alarm State Change'],
      },
      targets: [new targets.LambdaFunction(alertingLambda)],
    });
  }
}
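The stack above points at lambda/alerting, but the handler itself is not shown. The following is a minimal sketch of what it could look like, assuming a Discord-style webhook that accepts a JSON body with a content field. The names AlarmEvent, formatAlarmMessage, and makeHandler are hypothetical, and the message layout is only one possible choice.

```typescript
// Minimal subset of the alarm event that this sketch reads.
interface AlarmEvent {
  account: string;
  region: string;
  detail: {
    alarmName: string;
    state: { value: string; reason: string };
    previousState?: { value: string };
  };
}

// Builds a short, human-readable message from the event.
function formatAlarmMessage(event: AlarmEvent): string {
  const { alarmName, state, previousState } = event.detail;
  const from = previousState?.value ?? 'UNKNOWN';
  return [
    `[${state.value}] ${alarmName}`,
    `Account: ${event.account} | Region: ${event.region}`,
    `Transition: ${from} -> ${state.value}`,
    `Reason: ${state.reason}`,
  ].join('\n');
}

// The HTTP call is injected so the formatting logic can be tested offline.
type PostFn = (url: string, body: string) => Promise<void>;

function makeHandler(webhookUrl: string, post: PostFn) {
  return async (event: AlarmEvent): Promise<void> => {
    if (!webhookUrl) {
      throw new Error('WEBHOOK_URL is not set');
    }
    // Assumption: the webhook expects { content: string }, as Discord does.
    await post(webhookUrl, JSON.stringify({ content: formatAlarmMessage(event) }));
  };
}

// In lambda/alerting/index.ts the wiring might look like this (Node.js 18+ has a global fetch):
//
// export const handler = makeHandler(process.env.WEBHOOK_URL ?? '', async (url, body) => {
//   const res = await fetch(url, {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body,
//   });
//   if (!res.ok) throw new Error(`Webhook returned ${res.status}`);
// });
```

Throwing on a failed webhook call lets Lambda's asynchronous retry behavior take over instead of silently dropping the alert. For Slack, the payload field would be text rather than content.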

2. Configuring Member Accounts (StackSet)

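CloudWatch publishes alarm state changes to the default EventBridge bus of each member account automatically. We need to add a rule that forwards these events to the central bus in the Observability Account. We can deploy this rule to all accounts using CfnStackSet from the Management Account. Replace the placeholder account ID (123456789012), Region, and OU ID below with your own values.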

// lib/management-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as cfn from 'aws-cdk-lib/aws-cloudformation';

export class ManagementStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // CloudFormation template to be deployed to all accounts
    const memberAccountTemplate = `
      Resources:
        ForwardToCentralRule:
          Type: AWS::Events::Rule
          Properties:
            Name: ForwardToCentralObservability
            EventPattern:
              source:
                - aws.cloudwatch
              detail-type:
                - "CloudWatch Alarm State Change"
            State: ENABLED
            Targets:
              - Arn: "arn:aws:events:us-east-1:123456789012:event-bus/central-alerting-bus"
                Id: "CentralBus"
                RoleArn: !GetAtt EventBridgeRole.Arn
        
        EventBridgeRole:
          Type: AWS::IAM::Role
          Properties:
            AssumeRolePolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: Allow
                  Principal:
                    Service: events.amazonaws.com
                  Action: sts:AssumeRole
            Policies:
              - PolicyName: PutEventsToCentral
                PolicyDocument:
                  Version: "2012-10-17"
                  Statement:
                    - Effect: Allow
                      Action: events:PutEvents
                      Resource: "arn:aws:events:us-east-1:123456789012:event-bus/central-alerting-bus"
    `;

    new cfn.CfnStackSet(this, 'CentralAlertingStackSet', {
      stackSetName: 'CentralAlerting-MemberAccounts',
      templateBody: memberAccountTemplate,
      permissionModel: 'SERVICE_MANAGED', // For AWS Organizations
      capabilities: ['CAPABILITY_IAM'], // Required because the template creates an IAM role
      autoDeployment: {
        enabled: true,
        retainStacksOnAccountRemoval: false,
      },
      stackInstancesGroup: [{
        regions: ['us-east-1'], // Region where your alarms are
        deploymentTargets: {
          organizationalUnitIds: ['ou-xxxx-yyyyyyy'], // Target OU ID
        },
      }],
    });
  }
}
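The template above hardcodes the central bus ARN in two places. One way to avoid that, sketched here under the assumption that you are in the standard aws partition, is to build the template as a JSON string from a single ARN value. CloudFormation accepts JSON templates, so the result can be passed as templateBody. The helper names centralBusArn and buildMemberTemplate are hypothetical.

```typescript
// Builds the ARN of the central bus from its parts (assumes the "aws" partition).
function centralBusArn(region: string, accountId: string, busName: string): string {
  return `arn:aws:events:${region}:${accountId}:event-bus/${busName}`;
}

// Returns a JSON CloudFormation template equivalent to the YAML template above,
// with the bus ARN injected instead of hardcoded.
function buildMemberTemplate(busArn: string): string {
  const template = {
    Resources: {
      EventBridgeRole: {
        Type: 'AWS::IAM::Role',
        Properties: {
          AssumeRolePolicyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Effect: 'Allow',
                Principal: { Service: 'events.amazonaws.com' },
                Action: 'sts:AssumeRole',
              },
            ],
          },
          Policies: [
            {
              PolicyName: 'PutEventsToCentral',
              PolicyDocument: {
                Version: '2012-10-17',
                Statement: [{ Effect: 'Allow', Action: 'events:PutEvents', Resource: busArn }],
              },
            },
          ],
        },
      },
      ForwardToCentralRule: {
        Type: 'AWS::Events::Rule',
        Properties: {
          Name: 'ForwardToCentralObservability',
          EventPattern: {
            source: ['aws.cloudwatch'],
            'detail-type': ['CloudWatch Alarm State Change'],
          },
          State: 'ENABLED',
          Targets: [
            {
              Arn: busArn,
              Id: 'CentralBus',
              RoleArn: { 'Fn::GetAtt': ['EventBridgeRole', 'Arn'] },
            },
          ],
        },
      },
    },
  };
  return JSON.stringify(template, null, 2);
}
```

In the stack, this could be used as templateBody: buildMemberTemplate(centralBusArn('us-east-1', '123456789012', 'central-alerting-bus')), keeping the account ID and Region in one place.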

Testing the Setup

To verify the system, create a temporary alarm in any member account.

  1. Create Alarm: Create a simple CPU alarm named TestAlarm in the CloudWatch console.
  2. Change State: Use the AWS CLI to manually trigger the alarm:
    aws cloudwatch set-alarm-state --alarm-name "TestAlarm" --state-value ALARM --state-reason "Testing centralized alerting"
    
  3. Verify: Check the logs of the Lambda function in the Observability Account or your notification channel (Discord/Slack). You should see the alert.

Advanced Configuration

You can enhance this setup to fit your specific needs:

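  • Dynamic Severity: Add tags like severity: critical to your alarms. The state change event itself does not include tags, so the Lambda function has to look them up (for example, by calling ListTagsForResource through a cross-account role) or rely on a convention in the alarm name or description. It can then route alerts to different channels (e.g., PagerDuty vs. Slack) accordingly.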
  • Dynamic Routing: Use SSM Parameter Store to map alarms to responsible teams, allowing the Lambda function to route notifications based on the alarm name or tags.
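As a rough illustration of name-based routing, the sketch below picks a destination by the longest matching alarm-name prefix. The routing table is hardcoded here and stands in for values you might load from SSM Parameter Store. The prefixes, team names, channel strings, and the helper resolveRoute are all hypothetical.

```typescript
interface AlertRoute {
  prefix: string; // alarm-name prefix owned by a team
  team: string;
  channel: string; // opaque destination identifier, interpreted by the notifier
}

// Used when no prefix matches.
const defaultRoute: AlertRoute = { prefix: '', team: 'platform', channel: 'slack:#alerts-unrouted' };

// Returns the route with the longest prefix that matches the alarm name.
function resolveRoute(
  alarmName: string,
  routes: AlertRoute[],
  fallback: AlertRoute = defaultRoute,
): AlertRoute {
  let best: AlertRoute | undefined;
  for (const route of routes) {
    const isLonger = best === undefined || route.prefix.length > best.prefix.length;
    if (alarmName.startsWith(route.prefix) && isLonger) {
      best = route;
    }
  }
  return best ?? fallback;
}

// Example table; in practice this could be parsed from a parameter value.
const exampleRoutes: AlertRoute[] = [
  { prefix: 'payments-', team: 'payments', channel: 'slack:#payments-alerts' },
  { prefix: 'payments-critical-', team: 'payments', channel: 'pagerduty:payments' },
];
```

Longest-prefix matching lets a team send most of its alarms to chat while escalating a more specific subset to paging, without changing the Lambda code when the table changes.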

This architecture ensures that as your organization grows, your monitoring infrastructure scales without becoming unmanageable.
