Staff Prep 28: Observability — Metrics, Logs, Traces & SLOs
Monitoring tells you something is wrong. Observability tells you why. The difference matters when you are debugging a production incident at 2 AM. Most teams set up dashboards and call it observability — then spend three hours clicking through Grafana when things break. Real observability means your system is instrumented well enough that you can answer any question about its internal state using the telemetry it emits. Here is how to build that.
The three pillars
Metrics are numeric measurements aggregated over time. They answer: is something wrong right now? CPU at 95%, p99 latency at 2s, error rate at 3%. Cheap to store, fast to query, terrible at explaining why.
Logs are discrete events with context. They answer: what happened for this specific request? Logs give you the full story but are expensive at scale. A service doing 10k RPS generates millions of log lines per minute.
Traces are the path of a single request through your system. They answer: where did this request spend its time? A trace spans multiple services, showing you exactly which database query, which downstream call, which serialization step was slow.
Typical incident flow:
1. Metric alert fires: "p99 latency > 500ms for 5 minutes"
2. Go to dashboard: latency spike started at 14:32
3. Check logs: errors mention "Connection pool exhausted"
4. Pull a trace from 14:32: DB query taking 800ms, normally 20ms
5. Check DB metrics: replica lag spiked → read queries hitting primary
6. Fix: failover replica, adjust connection pool
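The alert in step 1 could be expressed as a Prometheus alerting rule. A sketch, assuming a `http_request_duration_seconds` histogram is being scraped (the rule name and labels are illustrative):

```yaml
groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m          # must hold for 5 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 500ms for 5 minutes"
```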
Metrics: what to measure
The USE method (for infrastructure): Utilisation, Saturation, Errors. The RED method (for services): Rate, Errors, Duration. Apply RED to every service boundary.
```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, start_http_server

app = FastAPI()

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Expose /metrics on a separate port for Prometheus to scrape
start_http_server(9090)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    return response
```
Why percentiles beat averages
Average latency hides tail latency. If 99% of requests take 10ms and 1% take 10 seconds, your average might be 110ms — looks fine. Your p99 is 10 seconds — your worst customers are suffering.
At Staff level, you always look at p50, p95, p99, and p99.9. The gap between p95 and p99 tells you how bad your tail is. The gap between p99 and p99.9 tells you how bad your worst users have it.
```promql
# p99 latency by endpoint
histogram_quantile(0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)

# See p50, p95, p99 side by side
histogram_quantile(0.50, rate(...[5m]))  # median
histogram_quantile(0.95, rate(...[5m]))  # p95
histogram_quantile(0.99, rate(...[5m]))  # p99
```
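The 10ms/10s example above can be checked numerically. A minimal sketch using a simplified index-based percentile (no interpolation):

```python
# 99% of requests take 10ms, 1% take 10 seconds (numbers from the example above)
latencies_ms = [10.0] * 990 + [10_000.0] * 10

def percentile(samples, p):
    """Simplified percentile: the value below which a fraction p of samples fall."""
    s = sorted(samples)
    return s[min(int(p * len(s)), len(s) - 1)]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"avg = {avg:.1f} ms")                         # ~109.9 ms: looks healthy
print(f"p50 = {percentile(latencies_ms, 0.50)} ms")  # 10.0 ms
print(f"p99 = {percentile(latencies_ms, 0.99)} ms")  # 10000.0 ms: the tail
```

The average sits comfortably near the median while 1% of users wait ten seconds, which is exactly why dashboards plot percentiles, not means.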
Distributed tracing with OpenTelemetry
OpenTelemetry is the vendor-neutral standard. Instrument once, send to Jaeger, Tempo, Datadog, or Honeycomb without code changes.
```python
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Instrument a function
async def get_user(user_id: int):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)
        user = await db.fetch_one(
            "SELECT * FROM users WHERE id = :id", {"id": user_id}
        )
        if not user:
            span.set_status(StatusCode.ERROR, "User not found")
        return user
```
The trace propagates automatically across HTTP calls via the traceparent header. Every service in the chain adds its spans to the same trace. You see the full waterfall in your tracing UI.
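The traceparent header follows the W3C Trace Context format: four dash-separated fields, `version-trace_id-parent_id-flags`, with a 32-hex-char trace ID and a 16-hex-char parent span ID. A minimal parser sketch (the helper name is illustrative; real services let the OpenTelemetry propagator handle this):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,    # shared by every span in the trace
        "parent_id": parent_id,  # the caller's span ID
        "sampled": int(flags, 16) & 0x01 == 1,
    }

# Example header; the IDs are the W3C spec's own example values
parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["trace_id"], parsed["sampled"])
```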
SLIs, SLOs and error budgets
This is the Staff engineer vocabulary for reliability conversations with product and leadership.
SLI (Service Level Indicator): a metric that measures your service quality. Example: "the proportion of requests completed in under 200ms."
SLO (Service Level Objective): a target for your SLI. Example: "99.5% of requests complete in under 200ms, measured over a rolling 28-day window."
Error Budget: how much you can fail before violating the SLO. At 99.5% SLO over 28 days, you have 0.5% of requests as your budget — about 3.4 hours of complete downtime, or a proportional mix of partial failures.
Error budget = (1 - SLO target) * time period
For a 99.5% SLO over 28 days:
Budget = 0.5% * 28 days * 24h * 60min ≈ 202 minutes
If you burn 100 minutes this week → ~102 minutes left.
If you are on track to exhaust the budget → freeze releases, fix reliability.
If the budget is healthy → ship features aggressively.
Error budgets turn "reliability vs features" from a political argument into a data-driven conversation. If the team wants to ship 3 risky features this sprint, the question becomes: does our error budget support this risk?
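The budget arithmetic is simple enough to keep in a helper. A sketch, assuming the 99.5% SLO over a 28-day window from the example above (function names are illustrative):

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime-equivalent minutes in the SLO window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, window_days: int, burned_minutes: float) -> float:
    """Minutes of budget left after subtracting what incidents have burned."""
    return error_budget_minutes(slo, window_days) - burned_minutes

budget = error_budget_minutes(0.995, 28)
print(f"total budget: {budget:.1f} min")  # 201.6 min, roughly 3.4 hours
print(f"after a 90-min incident: {budget_remaining(0.995, 28, 90):.1f} min left")
```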
Structured logging
Logs written to be read by humans are useless at scale. Structured logs are JSON — queryable by your log aggregator (Loki, Elasticsearch, CloudWatch Insights).
```python
import structlog
from opentelemetry import trace

log = structlog.get_logger()

# Bad: unstructured
print(f"User {user_id} failed to login: {reason}")

# Good: structured
ctx = trace.get_current_span().get_span_context()
log.warning(
    "login_failed",
    user_id=user_id,
    reason=reason,
    ip_address=request.client.host,
    trace_id=format(ctx.trace_id, "032x"),  # hex form, as shown in the tracing UI
)

# Now you can query: SELECT * WHERE reason = 'invalid_password' AND timestamp > now() - 1h
```
Always include the trace ID in your logs. This lets you jump from a metric alert → relevant log lines → the specific trace, in seconds.
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Your p50 latency is 20ms. Your p99 is 4 seconds. Your average is 60ms. A PM asks "is our API fast?" What is your answer, and what does this distribution tell you about the source of the problem?
- You have an SLO of 99.9% availability over 30 days. An incident took the service down for 90 minutes. How much error budget is remaining? Should you freeze deployments?
- A distributed trace shows your API handler takes 800ms total. The DB query takes 10ms. The downstream payment service takes 750ms. What should you investigate first? What information in the trace would help you narrow it down?
- Your team logs every request body for debugging purposes. The service processes 50k requests per minute, each with a 2KB body. Estimate the log volume per day in GB. What would you do differently?
- Explain the difference between a metric, a log, and a trace using a single concrete example: a user clicks "checkout" on an e-commerce site and the payment fails.
You have completed the 28-day Staff Engineer Prep curriculum. Review the topics where your quiz answers were weakest and revisit those posts. Good luck with the interview.