SLI

A Service Level Indicator (SLI) is a measurement of how well a service is performing on a given dimension, such as availability, latency, throughput, or correctness. SLIs are the raw observables; SLOs are targets set on top of them.
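
As a minimal sketch of how two such indicators might be computed from raw request records, the Python below derives an availability SLI and a latency SLI. The Request record, the sample window, and the 300 ms threshold are assumptions chosen for illustration; they are not part of any particular monitoring system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    ok: bool           # True if the request succeeded
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Ratio of successful requests to total requests in the window."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Proportion of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


# Illustrative window of four requests (made-up values).
window = [
    Request(ok=True, latency_ms=120.0),
    Request(ok=True, latency_ms=280.0),
    Request(ok=False, latency_ms=950.0),
    Request(ok=True, latency_ms=310.0),
]

print(availability_sli(window))                 # 3 of 4 succeeded
print(latency_sli(window, threshold_ms=300.0))  # 2 of 4 under 300 ms
```

Both functions return a plain ratio between 0 and 1, which keeps the indicator separate from any target later set on top of it.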

Common SLI types

  • Availability: ratio of successful requests to total requests over a window.
  • Latency: proportion of requests served below a threshold (for…

SLO

A Service Level Objective (SLO) is an internal target for a service's reliability, expressed as a percentage over a window (for example, 99.9% of requests succeed within 300 ms over a rolling 28-day window). SLOs anchor the practice of Site Reliability Engineering by making reliability concrete, measurable, and tradable against feature velocity.
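
One common way to make that trade concrete is an error budget: the share of requests that may miss the objective before it is breached. The sketch below assumes a request-based SLO like the example above; the function name and the traffic figures are hypothetical.

```python
def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Summarise error-budget use for a request-based SLO.

    slo_target is a fraction (0.999 for 99.9%). A "failed" request is
    one that missed the objective, e.g. it errored or was too slow.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    # Requests allowed to miss the objective in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = (failed_requests / allowed_failures
                if allowed_failures else float("inf"))
    return {
        "allowed_failures": allowed_failures,
        "consumed": consumed,  # 1.0 means the budget is fully spent
        "slo_met": failed_requests <= allowed_failures,
    }


# Illustrative 28-day window: one million requests, 250 of them bad.
report = error_budget_report(0.999, 1_000_000, 250)
print(round(report["allowed_failures"]))  # about 1000 requests
print(round(report["consumed"], 2))       # about a quarter of the budget
print(report["slo_met"])
```

A team might slow feature releases as `consumed` approaches 1.0 and speed them up when plenty of budget remains; the exact policy is a local decision, not something this sketch prescribes.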

Related terms in the SLI/SLO/SLA family

  • SLI…

Grafana

Grafana is an open-source visualisation and dashboarding platform for time-series and observability data. It connects to many data sources (Prometheus, Loki, Tempo, Elasticsearch, MySQL, BigQuery, CloudWatch, Datadog) and renders queries as panels: graphs, tables, heatmaps, single-stat tiles, and more.

Core capabilities

  • Dashboards. Compositions of panels with shared time ranges and variable…

Prometheus

Prometheus is an open-source time-series database and monitoring system that scrapes metrics from instrumented applications over HTTP, stores them locally, and exposes a query language (PromQL) for dashboards and alerts. It is widely regarded as the de facto standard for cloud-native infrastructure monitoring.
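
To illustrate the scrape model, here is a minimal sketch of an application serving one counter in the Prometheus text exposition format, using only the Python standard library. The metric name, label values, counts, and port are assumptions for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter: (method, status) -> request count.
REQUEST_COUNTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(counts: dict) -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type",
                         "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape would receive, without starting the server.
print(render_metrics(REQUEST_COUNTS), end="")
```

Calling `serve()` and pointing a scrape job at the chosen port would let Prometheus pull these samples on its own schedule; the application never pushes them.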

How it works

Applications expose a /metrics endpoint in the Prometheus exposition format. A…

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that standardises how applications are instrumented to produce telemetry: traces, metrics, and logs. It defines an open specification, language SDKs, a vendor-neutral collector, and a wire protocol (OTLP), so applications can be instrumented once and the data sent to any compatible backend.

What it covers

  • API and SDK. Per-language libraries for creating spans, recording…

Distributed tracing

Distributed tracing captures the path of a single request as it travels through multiple services, recording the timing, parent-child relationships, and metadata of each step. A trace gives operators a flame-graph view of where latency comes from and how services depend on each other.
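
The parent-child structure can be sketched with a small data model. The Span fields, service names, and timings below are hypothetical and far simpler than what a real tracing system records; the point is only to show how parent links turn a flat list of spans into a tree.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One timed step of a request (illustrative fields only)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def render_trace(spans: list[Span]) -> list[str]:
    """Return one indented line per span, children under their parent."""
    children: dict = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        siblings = sorted(children.get(parent_id, []),
                          key=lambda s: s.start_ms)
        for span in siblings:
            lines.append(f"{'  ' * depth}{span.name} {span.duration_ms:.0f} ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up trace: a gateway calls auth, then orders, which queries a database.
spans = [
    Span("t1", "a", None, "gateway", 0.0, 180.0),
    Span("t1", "b", "a", "auth", 5.0, 25.0),
    Span("t1", "c", "a", "orders", 30.0, 170.0),
    Span("t1", "d", "c", "db", 40.0, 150.0),
]
print("\n".join(render_trace(spans)))
```

In this made-up trace the database call accounts for most of the time spent in the orders service, which is the kind of observation a flame-graph view is meant to surface.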

Core concepts

  • Trace. The full record of one request across all services it touches, identified by a…

Logging

Logging is the practice of emitting timestamped records of events from running software so operators can reconstruct what happened. Modern systems use structured logging, where each record is a set of named fields, typically serialised as a JSON object (level, timestamp, service, request_id, message, plus arbitrary attributes), making logs queryable rather than just human-readable.
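
A minimal sketch of structured logging with Python's standard logging module follows. The JsonFormatter class, the make_logger helper, the service name, and the request ID are assumptions for illustration; many teams use a dedicated library instead.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # Set by callers through the `extra` argument, if present.
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def make_logger(service: str, stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the record can be inspected afterwards.
buffer = io.StringIO()
log = make_logger("checkout", buffer)
log.info("payment authorised", extra={"request_id": "req-42"})
print(buffer.getvalue(), end="")
```

Because each line is a self-contained JSON object, a log store can filter on fields such as `request_id` without parsing free-form text.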

Log levels

  • TRACE / DEBUG:…