
Staff Prep 28: Observability — Metrics, Logs, Traces & SLOs

April 4, 2026 · 10 min read · Part 06 / 06

Monitoring tells you something is wrong. Observability tells you why. That difference stops being academic the moment you are debugging a production incident at 2am. Most teams wire up a handful of Grafana dashboards, call it observability, and then spend three hours clicking around when something actually breaks. Real observability means your system emits enough telemetry that you can answer a question you had not thought to ask yet.

The three pillars

Metrics are numeric measurements aggregated over time. They answer "is something wrong right now": CPU at 95%, p99 latency at 2s, error rate at 3%. They are cheap to store and fast to query, and useless for explaining why.

Logs are discrete events with context. They tell you what actually happened for a specific request. Full story, expensive at scale. A service doing 10k RPS can easily generate millions of log lines a minute, and your bill will remind you.

Traces follow a single request through your system and tell you where its time went. A good trace spans multiple services and shows you the exact database query, the downstream call, or the JSON serialization step that cost you the SLO. This is the pillar most teams under-invest in, and the one I miss most when it's absent.

text

Typical incident flow:

1. Metric alert fires: "p99 latency > 500ms for 5 minutes"
2. Go to dashboard: latency spike started at 14:32
3. Check logs: errors mention "Connection pool exhausted"
4. Pull a trace from 14:32: DB query taking 800ms, normally 20ms
5. Check DB metrics: replica lag spiked → read queries hitting primary
6. Fix: failover replica, adjust connection pool

Metrics: what to measure

The USE method (for infrastructure): Utilisation, Saturation, Errors. The RED method (for services): Rate, Errors, Duration. Apply RED to every service boundary.

python

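import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, start_http_server

app = FastAPI()

# Serve /metrics on a side port for Prometheus to scrape (8001 is an arbitrary choice).
start_http_server(8001)

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # perf_counter is monotonic, so clock adjustments cannot skew the duration
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start

    # Label by route template (e.g. /users/{id}) rather than the raw path, so
    # label cardinality stays bounded. FastAPI stores the matched route in the
    # ASGI scope; requests that matched no route share one label value.
    route = request.scope.get("route")
    endpoint = getattr(route, "path", "unmatched")

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        status_code=response.status_code
    ).inc()

    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)

    return response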

Why percentiles beat averages

Average latency hides the tail. If 99% of requests take 10ms and 1% take 10 seconds, your average is around 110ms and your dashboard looks fine. Your p99 is 10 seconds and your worst customers are having a miserable time.

At Staff level you always look at p50, p95, p99 and p99.9. The gap between p95 and p99 tells you how bad the tail is. The gap between p99 and p99.9 tells you how bad it is for the unlucky few (and those unlucky few are usually your biggest customers, which is its own cruel joke).
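To make the arithmetic concrete, here is a minimal sketch using a made-up sample that matches the figures above: 99% of requests at 10ms and 1% at 10 seconds. The `percentile` helper is a hypothetical nearest-rank implementation, not what your metrics backend uses. With the slow share at exactly 1%, p99 itself lands on the edge between the two groups under a nearest-rank definition, so the sketch reads the tail at p99.9.

```python
import math
import statistics

# Hypothetical sample, in milliseconds:
# 990 requests at 10 ms and 10 requests at 10 s (1% of traffic).
latencies_ms = [10.0] * 990 + [10_000.0] * 10


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.fmean(latencies_ms)
print(f"mean  = {mean_ms:.1f} ms")
print(f"p50   = {percentile(latencies_ms, 50):.0f} ms")
print(f"p99.9 = {percentile(latencies_ms, 99.9):.0f} ms")
```

The mean lands near 110ms, the median stays at 10ms, and only the tail percentile exposes the 10-second requests.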

promql

# PromQL: p99 latency by endpoint
histogram_quantile(0.99,
  sum by (le, endpoint) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# See p50, p95, p99 side by side
histogram_quantile(0.50, rate(...[5m]))  # median
histogram_quantile(0.95, rate(...[5m]))  # p95
histogram_quantile(0.99, rate(...[5m]))  # p99
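It helps to know roughly what histogram_quantile does with those buckets. The sketch below is a simplified approximation for classic histograms, assuming observations are spread evenly inside each bucket and the lowest bucket starts at zero. The function name `estimate_quantile` and the bucket counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound_seconds, count) pairs.

    Assumes buckets are sorted by upper bound and the last one is +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Nothing to interpolate towards: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts over a 5-minute window.
buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)

# A tail that falls past the last finite bucket is capped at that bucket's bound.
capped = estimate_quantile(0.999, [(1.0, 990), (math.inf, 1000)])
```

Two practical consequences: the estimate is only as precise as your bucket boundaries, and any latency beyond the largest finite bucket is reported as that bucket's bound, so choose buckets that bracket your SLO threshold.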

Distributed tracing with OpenTelemetry

OpenTelemetry is the vendor-neutral standard. Instrument once, then send to Jaeger, Tempo, Datadog, or Honeycomb by changing exporter or collector configuration, not your instrumentation code.

python

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import StatusCode

# Setup: batch spans and ship them to the collector over OTLP/gRPC
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Instrument a function (`db` is the application's async database client, defined elsewhere)
async def get_user(user_id: int):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)
        user = await db.fetch_one(
            "SELECT * FROM users WHERE id = :id", {"id": user_id}
        )
        if not user:
            # Mark the span as failed so it stands out in the tracing UI
            span.set_status(StatusCode.ERROR, "User not found")
        return user

With HTTP client and server instrumentation in place, the trace context propagates across HTTP calls via the W3C traceparent header. Every service in the chain adds its spans to the same trace, and you see the full waterfall in your tracing UI.
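In practice the OpenTelemetry propagators read and write this header for you. Purely to show what travels on the wire, here is a stdlib-only sketch of the version 00 traceparent format: version, 32-hex trace ID, 16-hex parent span ID, and flags. The helper names `parse_traceparent` and `child_traceparent` are hypothetical, and the parsing is deliberately stricter and simpler than a real propagator.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Simplification: only version 00 is accepted; all-zero IDs are invalid.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The trace ID is the part that stays constant across every hop, which is why all spans end up in the same waterfall.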

SLIs, SLOs and error budgets

This is the Staff engineer vocabulary for reliability conversations with product and leadership.

SLI (Service Level Indicator): a metric that measures your service quality. Example: "the proportion of requests completed in under 200ms."

SLO (Service Level Objective): a target for your SLI. Example: "99.5% of requests complete in under 200ms, measured over a rolling 28-day window."

Error Budget: how much you can fail before violating the SLO. At 99.5% over 28 days you have 0.5% of requests to burn, which is roughly 3.4 hours of full downtime (0.5% of 672 hours) or a proportional mix of partial failures.
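As a rough illustration of how the SLI and SLO definitions above fit together, the sketch below computes a latency SLI from raw request durations and compares it with a 99.5% target. The function name `latency_sli` and the sample data are made up; a real SLI would normally be computed from your metrics backend rather than a list in memory.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """Proportion of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


SLO_TARGET = 0.995  # "99.5% of requests complete in under 200ms"

# Hypothetical window: 9,960 fast requests and 40 slow ones.
window = [50.0] * 9_960 + [450.0] * 40

sli = latency_sli(window)
meets_slo = sli >= SLO_TARGET
print(f"SLI = {sli:.2%}, SLO met: {meets_slo}")
```

The SLI is just a ratio of good events to total events; the SLO is the line you draw on that ratio.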

text

Error budget = (1 - SLO target) * time period

For 99.5% SLO over 30 days:
Budget = 0.5% * 30 days * 24h * 60min = 216 minutes

If you burn 100 minutes this week → 116 minutes left.
If you are on track to exhaust budget → freeze releases, fix reliability.
If budget is healthy → ship features aggressively.

Error budgets turn "reliability vs features" from a political argument into a data-driven conversation. If the team wants to ship 3 risky features this sprint, the question becomes: does our error budget support this risk?
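The budget arithmetic is simple enough to script. Below is a minimal sketch, assuming a time-based budget and the 99.5% over 30 days example from above; the `ErrorBudget` class and its `burn_rate` method are hypothetical names, and the burn rate here is just "budget spent versus budget you would expect to have spent by now".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float  # e.g. 0.995 for 99.5%
    window_days: int   # length of the SLO window

    @property
    def total_minutes(self) -> float:
        """Error budget = (1 - SLO target) * time period, in minutes."""
        return (1 - self.slo_target) * self.window_days * 24 * 60

    def remaining_minutes(self, burned_minutes: float) -> float:
        return self.total_minutes - burned_minutes

    def burn_rate(self, burned_minutes: float, elapsed_days: float) -> float:
        """1.0 means on pace to spend exactly the whole budget by the end of the window."""
        if elapsed_days <= 0:
            raise ValueError("elapsed_days must be positive")
        expected_so_far = self.total_minutes * elapsed_days / self.window_days
        return burned_minutes / expected_so_far


budget = ErrorBudget(slo_target=0.995, window_days=30)
remaining = budget.remaining_minutes(burned_minutes=100)
rate = budget.burn_rate(burned_minutes=100, elapsed_days=7)
print(f"total={budget.total_minutes:.0f} min, remaining={remaining:.0f} min, burn rate={rate:.2f}x")
```

Burning 100 minutes in the first week is roughly twice the sustainable pace, which is the kind of number that makes the "can we ship three risky features" conversation short.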

Structured logging

Logs written only for humans to read are useless at scale. Structured logs are JSON, queryable by your log aggregator (Loki, Elasticsearch, CloudWatch Insights). Any engineer who has tried to grep their way through a 20GB log file during an incident ends up converted.

python

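import structlog
from opentelemetry import trace

# Emit JSON lines; structlog's default output is a human-oriented console format.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# user_id, reason and request come from the surrounding login handler.

# Bad: unstructured
print(f"User {user_id} failed to login: {reason}")

# Good: structured
span_context = trace.get_current_span().get_span_context()
log.warning(
    "login_failed",
    user_id=user_id,
    reason=reason,
    ip_address=request.client.host,
    # trace_id is an int; log it as the 32-char hex string tracing UIs display
    trace_id=format(span_context.trace_id, "032x")
)

# Now you can query: SELECT * WHERE reason = 'invalid_password' AND timestamp > now() - 1h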

Always include the trace ID in your logs. It's what lets you jump from a metric alert to the relevant log lines to the specific trace in seconds instead of minutes.
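To show why the trace ID matters, here is a stdlib-only sketch of the correlation step: JSON log lines carrying a trace_id field, then a lookup of every line for one trace. The `JsonFormatter` class, the `fields` convention passed through `extra`, the `lines_for_trace` helper, and the IDs are all made up for illustration; your log aggregator does the equivalent with a query.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("structured_logging_sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

slow_trace = "4bf92f3577b34da6a3ce929d0e0e4736"
other_trace = "0af7651916cd43dd8448eb211c80319c"
log.warning("login_failed", extra={"trace_id": slow_trace, "fields": {"reason": "invalid_password"}})
log.info("login_ok", extra={"trace_id": other_trace, "fields": {"user_id": 42}})


def lines_for_trace(raw_logs, trace_id):
    """Return the parsed log entries that belong to one trace."""
    entries = (json.loads(line) for line in raw_logs.splitlines())
    return [entry for entry in entries if entry["trace_id"] == trace_id]


matches = lines_for_trace(stream.getvalue(), slow_trace)
```

Because every line is a JSON object with the same key, the filter is a field comparison instead of a regex over free text.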

Quiz: test your understanding

Before moving on, answer these in your head (or out loud):

1. Your p50 latency is 20ms. Your p99 is 4 seconds. Your average is 60ms. A PM asks "is our API fast?" What is your answer, and what does this distribution tell you about the source of the problem?
2. You have an SLO of 99.9% availability over 30 days. An incident took the service down for 90 minutes. How much error budget is remaining? Should you freeze deployments?
3. A distributed trace shows your API handler takes 800ms total. The DB query takes 10ms. The downstream payment service takes 750ms. What should you investigate first? What information in the trace would help you narrow it down?
4. Your team logs every request body for debugging purposes. The service processes 50k requests per minute, each with a 2KB body. Estimate the log volume per day in GB. What would you do differently?
5. Explain the difference between a metric, a log, and a trace using a single concrete example: a user clicks "checkout" on an e-commerce site and the payment fails.

That's the end of the 28-day Staff Engineer Prep curriculum. Pick the topic where your quiz answers felt shakiest and read that post again. Then go interview.
