Kubernetes Observability: What to Monitor and Why

The Kubernetes Monitoring Maze

Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.

After running K8s in production for four years, here's what actually matters.

The Three Layers

Kubernetes observability has three distinct layers, and you need different strategies for each:

Layer 1: Cluster Health (infrastructure)

Read more →
On-Call Wellness: Protecting Your Engineers from Burnout

The On-Call Burnout Epidemic

I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustainable.

We were spending $500K+ recruiting replacements for a problem that could have been fixed with $0 and better practices.

The Warning Signs

Before someone quits, they show these signals:

  1. Cynicism in post-mortems — "This will…
Read more →
Async LLM inference in CI: stop build workers blocking on slow jobs

TL;DR: Async inference through an AI gateway lets CI build workers submit a long LLM job, get an id back, and poll later, so a 30-second model call stops holding a worker hostage. Here's how I wired it with Bifrost.
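The submit-then-poll pattern described above can be sketched without any real gateway. This is a minimal illustration, not Bifrost's API: `FakeAsyncGateway`, `slow_summarise`, and `wait_for` are hypothetical names, and a thread pool stands in for the remote inference service.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class FakeAsyncGateway:
    """Hypothetical in-memory stand-in for an async inference endpoint."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, fn, *args):
        # Start the slow job in the background and return an id right away.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def poll(self, job_id):
        # Non-blocking status check; the result is attached once finished.
        future = self._jobs[job_id]
        if not future.done():
            return {"status": "pending"}
        return {"status": "done", "result": future.result()}


def slow_summarise(log_text):
    # Assumption: a short sleep stands in for the slow model call.
    time.sleep(0.2)
    return f"summary of {len(log_text)} chars"


def wait_for(gateway, job_id, interval=0.05, timeout=5.0):
    # Poll until the job is done or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = gateway.poll(job_id)
        if state["status"] == "done":
            return state["result"]
        time.sleep(interval)
    raise TimeoutError(job_id)
```

In a CI setting, the worker would call `submit`, store the id, and move on; a later step (or a different worker) would call `poll` or `wait_for` to collect the summary.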

Our build workers at Buildkite were each blocked for up to 35 seconds waiting on a single LLM call that summarised failed test output. With a few hundred concurrent builds…

Read more →
Code isn’t the only thing causing your production failures

Ryan sits down with Anish Agarwal, CEO and co-founder of Traversal, to chat about why AI coding agents have made writing code easier but running it safely in production harder, why production failures are really caused by interactions between systems and not just the code itself, and how teams can troubleshoot more effectively when traditional observability tools are not enough for agentic AI…

Read more →
Capacity Planning Without ML: The 80/20 Approach

There's a small industry of vendors that want to sell you machine learning capacity planning. For 95% of teams, you don't need it. You need a spreadsheet, an honest growth assumption, and a buffer.

Here's the practical version of capacity planning that catches most real problems.
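As a rough illustration of the spreadsheet arithmetic, the sketch below estimates how many months remain before usage reaches capacity. It assumes compound monthly growth and a fixed headroom buffer; the function name `months_until_full` and the 20% default buffer are hypothetical choices, not figures from the article.

```python
import math


def months_until_full(current, capacity, monthly_growth, buffer=0.2):
    """Months until usage reaches capacity minus the reserved buffer.

    Assumes compound growth: usage after n months = current * (1 + g) ** n.
    """
    usable = capacity * (1 - buffer)  # keep `buffer` of capacity as headroom
    if current >= usable:
        return 0  # already inside the buffer
    if monthly_growth <= 0:
        return math.inf  # flat or shrinking usage never runs out
    # Solve current * (1 + g) ** n >= usable for the smallest whole n.
    return math.ceil(math.log(usable / current) / math.log(1 + monthly_growth))
```

For example, 500 units of usage against 1000 units of capacity, growing 10% per month with a 20% buffer, gives `months_until_full(500, 1000, 0.10)`, which evaluates to 5.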

What you actually need to know

You need to answer three questions:

  1. When does the current setup run out?
Read more →
Automate creation of Amazon CloudWatch alarms

Recently I developed a new feature for this GitHub Action to automate the creation of Amazon CloudWatch alarms.
In the next steps, I will show you the settings you need to add to your project to automate the creation of CloudWatch alarms.

Prerequisites

1- Your project should use GitHub Actions

2- Your user must have permissions to create an OpenID Connect IDP, policies, and roles in your AWS…

Read more →
Semantic caching our flaky-test summariser: 58% fewer LLM calls

TL;DR: Our internal flaky-test summariser at Buildkite was firing ~40k LLM calls a day, and most were near-duplicates of failures we'd already explained. Switching on semantic caching in Bifrost cut live provider calls by 58% and dropped p50 latency on cache hits from ~900ms to about 40ms. It also kept the feature alive when our primary provider browned out for 11 minutes.

The feature…

Read more →
What 60+ Claude Code memory entries taught me about solo ops

I run a paid infrastructure service. Alone. No co-founder, no on-call rotation, no senior engineer to escalate to. My only collaborator is Claude Code, and after about a year, my persistent memory has grown to 60+ entries.

Those entries have become more valuable than any runbook I've written. They've also taught me — painfully — what makes memory architecture work and what makes it quietly…

Read more →
Chaos Engineering for Node.js Without the Infrastructure

Chaos engineering sounds expensive. Netflix built Chaos Monkey to randomly kill production servers. Google runs DiRT (Disaster Recovery Testing) across their entire infrastructure. Amazon does game days where they intentionally take down services.

You're building a Node.js API. You don't have a platform team. You don't have a chaos infrastructure. But you still need to know: **what happens when…

Read more →
Humanizing Artificial Intelligence in DevOps Documentation: Making Runbooks Easier to Create and Use

The Runbook That Lied to Me at 3am

The pager went off at 3:14am for a wedged OpenStack Neutron agent. I did what any tired engineer does: I opened the runbook. It told me to restart a service that had been renamed eighteen months earlier, pointed at a Grafana dashboard that 404'd, and assumed a network topology we'd migrated off of two quarters back. The runbook wasn't just unhelpful. It was…

Read more →
Fault-injecting our LLM provider to trust Bifrost fallbacks

TL;DR: We run an LLM-backed build-failure summariser at Buildkite. To stop a provider wobble from breaking it mid-deploy, I ran a game day that fault-injected OpenAI with 429s and 500s and watched whether Bifrost's fallback config actually rerouted. It did, but only after I fixed two things I'd set up wrong.

We've got a small service that reads failed CI jobs and writes a one-paragraph…

Read more →
Agent Handoff Contracts: The Missing Piece in Production Agent Systems

Most of the "we're adding AI to our ops platform" stories you'll read this year will skip the one part that actually determines whether the system works: the handoff between agents. Here's why it matters and what a good one looks like.

The problem

When you have one agent, handoff is a non-issue. The agent does its thing, returns output, done. When you have two, you start needing a format:…

Read more →
The SRE Mindset in API Architecture

I spent a good chunk of my career working in SRE. When the opportunity came up, I decided to move into an Architecture role: in some ways a change, and in many ways a wider remit to continue the kind of work I'd been doing in the SRE team.

How SRE reliability principles shape architecture

Core principles from my time in SRE have definitely shaped…

Read more →