Building Observability to Increase Resiliency

<div class="toc">
    <h3>Table of Contents</h3>
    <ul>
        <li><a href="#diagnose-issues">Diagnose Issues</a></li>
        <li><a href="#uncover-hidden-issues">Uncover Hidden Issues</a></li>
        <li><a href="#prevent-future-issues">Prevent Future Issues</a></li>
        <li><a href="#step-by-step-guide">Step-by-Step: Detecting Client-Side Anomalies</a></li>
        <li><a href="#faqs">Frequently Asked Questions</a></li>
    </ul>
</div>

<p>Building observability into your systems is not just about collecting logs; it's about gaining the insights needed to make your applications resilient. In a recent re:Invent talk, David Yanacek emphasized the importance of observability in identifying and resolving issues swiftly.</p>

<h2 id="diagnose-issues">Diagnose Issues</h2>
<p>When systems fail, the first goal is diagnosis. Effectively diagnosing issues requires looking at four key areas:</p>
<ul>
    <li><strong>Bad Dependencies:</strong> Use Composite Alarms to monitor errors across your entire stack.</li>
    <li><strong>Bad Components:</strong> Use service maps (such as the one in AWS X-Ray) to visualize interactions and pinpoint faulty components.</li>
    <li><strong>Bad Deployments:</strong> Detect anomalies immediately after a release to enable quick rollbacks.</li>
    <li><strong>Traffic Spikes:</strong> Analyze detailed metrics to understand if a surge is legitimate traffic or a DDoS attack.</li>
</ul>
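<p>As a rough illustration of the composite-alarm idea, the Python sketch below builds the rule expression that CloudWatch composite alarms accept, where <code>ALARM("name")</code> terms are joined with <code>OR</code>. The alarm names and the helper <code>build_composite_rule</code> are hypothetical and only stand in for the alarms in your own stack.</p>

```python
from typing import Iterable


def build_composite_rule(alarm_names: Iterable[str]) -> str:
    """Return a rule expression that fires if ANY child alarm is in ALARM state."""
    names = list(alarm_names)
    if not names:
        raise ValueError("at least one child alarm is required")
    # Each term uses the composite-alarm function ALARM("<alarm name>").
    return " OR ".join(f'ALARM("{name}")' for name in names)


# Hypothetical per-layer alarms; replace with the alarms in your own stack.
rule = build_composite_rule([
    "api-gateway-5xx",
    "orders-lambda-errors",
    "payments-dependency-timeouts",
])
print(rule)
```

<p>The resulting string is the kind of value you would pass as <code>AlarmRule</code> when creating a composite alarm, for example through boto3's <code>put_composite_alarm</code>.</p>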

<h2 id="uncover-hidden-issues">Uncover Hidden Issues</h2>
<p>Some issues are silent killers. They don't trigger standard 5xx alarms but significantly impact user experience.</p>

<h3>RUM vs. Synthetic Monitoring</h3>
<p><strong>Real User Monitoring (RUM)</strong> measures the experience from the actual user's browser. However, if your traffic drops (e.g., during a DNS outage), RUM goes silent.</p>
<p><strong>Synthetic Monitoring</strong> uses automated scripts (Canaries) to simulate user interactions continuously, ensuring you have visibility even when real traffic is low.</p>
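<p>To make the idea of a canary concrete, here is a minimal Python sketch of a single synthetic probe. It is not the CloudWatch Synthetics API; the names <code>run_probe</code>, <code>http_get_status</code>, and <code>ProbeResult</code> are made up for illustration, and the URL is a placeholder. The fetch function is injectable so the probe logic can be exercised without network access.</p>

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status: int
    latency_ms: float


def http_get_status(url: str) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def run_probe(url: str, fetch: Callable[[str], int] = http_get_status) -> ProbeResult:
    """Run one synthetic check; any other exception counts as a failed probe."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception:
        status = 0  # no HTTP response at all (DNS failure, timeout, ...)
    latency_ms = (time.monotonic() - start) * 1000
    return ProbeResult(ok=200 <= status < 400, status=status, latency_ms=latency_ms)


def dns_outage(url: str) -> int:
    """Stand-in fetcher that simulates a DNS failure."""
    raise OSError("name resolution failed")


print(run_probe("https://example.com/health", fetch=dns_outage).ok)  # False
```

<p>Running a probe like this on a schedule keeps producing data points during a DNS outage, which is exactly when RUM would report nothing.</p>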

<h3>Client-Side vs. Server-Side Errors</h3>
<p>Consider a scenario where a deployment reduces an input field's character limit. This might cause a spike in client-side (4xx) errors, which server-side (5xx) alarms might miss. To catch this, you need to monitor the <em>rate of clients experiencing errors</em>, not just the raw error count.</p>
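<p>The difference between raw error count and the rate of affected clients can be sketched in a few lines of Python. The event shape (a <code>clientId</code> and a numeric <code>status</code> per request) and the function name <code>affected_client_rate</code> are assumptions made for this example.</p>

```python
from typing import Iterable, Mapping


def affected_client_rate(events: Iterable[Mapping[str, object]]) -> float:
    """Fraction of distinct clients that saw at least one 4xx response."""
    all_clients = set()
    affected = set()
    for event in events:
        client = event["clientId"]
        all_clients.add(client)
        if 400 <= int(event["status"]) < 500:
            affected.add(client)
    return len(affected) / len(all_clients) if all_clients else 0.0


# One noisy client: many errors, but only 1 of 4 clients is affected.
noisy = (
    [{"clientId": "a", "status": 400}] * 50
    + [{"clientId": c, "status": 200} for c in ("b", "c", "d")]
)
# A bad deployment: few errors each, but every client is affected.
systemic = [{"clientId": c, "status": 400} for c in ("a", "b", "c", "d")]

print(affected_client_rate(noisy))     # 0.25
print(affected_client_rate(systemic))  # 1.0
```

<p>The noisy case has far more errors in total, yet the systemic case is the one that signals a broken deployment, which is why the per-client rate is the more useful alarm input here.</p>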

<h2 id="prevent-future-issues">Prevent Future Issues</h2>
<p>Resiliency is about preparation.</p>
<ul>
    <li><strong>Elasticity:</strong> Measure utilization across CPU, Memory, and Thread Pools to drive Auto Scaling.</li>
    <li><strong>Game Days:</strong> Regularly simulate failures in production (in a controlled manner) to verify that your observability tools work as expected.</li>
</ul>
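<p>One way to read the elasticity point is that scaling should follow the most constrained resource, not only CPU. The Python sketch below applies a simple proportional rule under that assumption. It is not how AWS Auto Scaling is implemented; the function name, metric names, and the 50% target are illustrative.</p>

```python
import math
from typing import Mapping


def desired_capacity(current_instances: int,
                     utilization: Mapping[str, float],
                     target: float = 0.5) -> int:
    """Size the fleet so the most constrained resource lands at the target."""
    if not utilization:
        raise ValueError("no utilization metrics provided")
    bottleneck = max(utilization.values())
    # Proportional rule: scale by the ratio of observed to target utilization.
    return max(1, math.ceil(current_instances * bottleneck / target))


# CPU looks healthy, but the thread pool is the real bottleneck.
metrics = {"cpu": 0.35, "memory": 0.5, "thread_pool": 0.75}
print(desired_capacity(4, metrics))  # 6
```

<p>If only CPU were measured in this example, the fleet would look over-provisioned even though the thread pool is close to saturation.</p>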

<h2 id="step-by-step-guide">Step-by-Step: Detecting Client-Side Anomalies</h2>
<p>To detect whether a deployment has caused a surge in client-side errors, you can use CloudWatch Contributor Insights. The rule below ranks clients by the number of 4xx responses they receive, which shows whether the errors come from a handful of clients or from many.</p>

<p><strong>Scenario:</strong> You want to know if a specific client ID is generating a disproportionate number of 4xx errors.</p>

<pre><code>{
    "Schema": {
        "Name": "CloudWatchLogRule",
        "Version": 1
    },
    "LogGroupNames": ["/aws/lambda/my-app-logs"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.clientId"],
        "Filters": [
            {
                "Match": "$.status",
                "GreaterThan": 399
            },
            {
                "Match": "$.status",
                "LessThan": 500
            }
        ]
    },
    "AggregateOn": "Count"
}</code></pre>

<p>This Contributor Insights rule counts log events with 4xx status codes, grouped by <code>clientId</code>. If one <code>clientId</code> dominates the report, a single client is probably sending bad requests; if the errors are spread across many contributors, the issue is more likely systemic and affects everyone.</p>
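<p>The same distinction can be expressed as a small check over the per-client counts. This Python sketch assumes you already have the <code>clientId</code> of each 4xx event (for example, exported from the logs); the function name <code>classify_error_spread</code> and the 80% dominance threshold are arbitrary choices for illustration.</p>

```python
from collections import Counter
from typing import Iterable


def classify_error_spread(client_ids: Iterable[str], dominance: float = 0.8) -> str:
    """Label a batch of 4xx events by how concentrated they are in one client."""
    counts = Counter(client_ids)
    total = sum(counts.values())
    if total == 0:
        return "no-errors"
    top_client, top_count = counts.most_common(1)[0]
    # One client producing most of the errors points away from a systemic issue.
    if top_count / total >= dominance:
        return f"single-client:{top_client}"
    return "widespread"


print(classify_error_spread(["a"] * 9 + ["b"]))          # single-client:a
print(classify_error_spread(["a", "b", "c", "d", "e"]))  # widespread
```

<p>A "widespread" result right after a release is a reasonable trigger for investigating the deployment, while a "single-client" result usually calls for a look at that one client.</p>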

<h2 id="faqs">Frequently Asked Questions</h2>
<h3>What is the difference between Monitoring and Observability?</h3>
<p>Monitoring tells you <em>if</em> the system is healthy. Observability allows you to ask arbitrary questions to understand <em>why</em> it is not.</p>

<h3>Why do I need Synthetic Monitoring if I have RUM?</h3>
<p>Synthetic Monitoring provides a baseline of performance and availability even when there is no real user traffic, ensuring you catch issues during low-traffic periods.</p>

<h3>How can I detect "unknown unknowns"?</h3>
<p>By collecting high-cardinality data and using tools like CloudWatch Contributor Insights, you can slice and dice data to find anomalies you didn't explicitly set alarms for.</p>

<p>For more on building resilient systems on AWS, consider our <a href="/tech/aws-consultancy">AWS Consultancy services</a> or explore <a href="/tech/kubernetes-consultancy">Kubernetes solutions</a>. You can also learn more about our comprehensive cloud offerings on our <a href="/">homepage</a>.</p>

<p>Source: <a href="https://awsfundamentals.com/blog/building-observability-to-increase-resiliency" target="_blank" rel="noopener noreferrer">https://awsfundamentals.com/blog/building-observability-to-increase-resiliency</a></p>