
OpenTelemetry

OpenTelemetry (OTel) is the CNCF standard for vendor-neutral observability instrumentation. It provides a unified API, SDK, and Collector for generating, processing, and exporting logs, metrics, and traces from any application to any backend.

Overview

OpenTelemetry (OTel) emerged in 2019 from the merger of two competing projects: OpenTracing (a vendor-neutral tracing API hosted by the CNCF) and OpenCensus (Google's instrumentation library for traces and metrics). The merger resolved the fragmentation that had plagued the observability ecosystem, where library authors had to choose between two incompatible instrumentation APIs. Today, OpenTelemetry is the second-most-active CNCF project after Kubernetes and the de facto standard for application instrumentation.

OTel's architecture has three layers. The API layer defines the interfaces for creating spans, recording metrics, and emitting logs. Library authors instrument against this API, which has minimal dependencies and negligible overhead when no SDK is configured. The SDK layer implements the API with configurable samplers, processors, and exporters. Application operators configure the SDK to, for example, sample 5% of traces, export metrics every 15 seconds, and send everything via OTLP. The Collector layer is a standalone process (deployed as a sidecar, DaemonSet, or gateway) that receives telemetry from SDKs, applies transformations (adding attributes, filtering sensitive data, sampling), and exports to one or more backends.
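
To make the "sample 5% of traces" setting concrete, here is a minimal sketch of the idea behind ratio-based head sampling: the decision is derived from the trace ID, so every service that sees the same trace reaches the same verdict. The function name and the choice of the low 64 bits are assumptions for illustration, not the SDK's own code.

```python
import random

TRACE_ID_BITS = 64  # assumption: decide on the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << TRACE_ID_BITS))
    low_bits = trace_id & ((1 << TRACE_ID_BITS) - 1)
    return low_bits < bound


# Simulate 100,000 random trace IDs with a fixed seed and count how many survive.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.05) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, no coordination between services is needed to keep or drop a trace consistently.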

Auto-instrumentation is OTel's killer feature. For Java, a single -javaagent flag instruments HTTP clients (Apache, OkHttp), web frameworks (Spring, JAX-RS), data access (JDBC, Hibernate), messaging (Kafka, RabbitMQ), and gRPC calls without any code changes. Python, Node.js, and .NET have similar zero-code packages; Go, as a compiled language, takes a different route based on eBPF. This means a team can go from zero observability to full traces across their microservice fleet in hours, not weeks.
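
The mechanism behind zero-code instrumentation in a dynamic language can be sketched as wrapping a library function at startup so that every call records a span. The sketch below uses only the standard library; `recorded_spans`, `instrument`, and `fake_http` are hypothetical stand-ins, not OTel APIs.

```python
import functools
import time
import types

recorded_spans = []  # stand-in for an exporter; a real SDK would batch and export


def instrument(module, func_name, span_name):
    """Replace module.func_name with a wrapper that records a timed span."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return original(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            # Runs on success and failure, so every call produces one span.
            recorded_spans.append({
                "name": span_name,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })

    setattr(module, func_name, wrapper)


# Hypothetical "library" the application calls; it knows nothing about tracing.
fake_http = types.SimpleNamespace(get=lambda url: f"200 OK from {url}")

instrument(fake_http, "get", "HTTP GET")
response = fake_http.get("http://inventory/items")
```

The application code that calls `fake_http.get` is unchanged, which is the property the auto-instrumentation packages rely on.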

The OTel Collector is the Swiss Army knife of telemetry pipelines. It can receive data in OTLP, Jaeger, Zipkin, Prometheus, or StatsD format. It can process data (add Kubernetes metadata, tail-sample traces, drop sensitive attributes). And it can export to 50+ backends. Running a Collector decouples instrumentation from backend choice -- you can switch from Jaeger to Tempo or from Prometheus to Datadog by changing the Collector config, not your application code.
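
The decoupling described above shows up in the shape of the Collector configuration: components are declared once and then wired into pipelines. A Collector config is normally a YAML file; the sketch below mirrors its top-level keys as a Python dict so the wiring can be checked and changed programmatically. The hostnames and the helper functions are assumptions for illustration.

```python
import copy

collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {"otlp/jaeger": {"endpoint": "jaeger.example.internal:4317"}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp/jaeger"],
            }
        }
    },
}


def undefined_components(config):
    """Return (pipeline, section, name) for each reference to an undeclared component."""
    missing = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for name in spec.get(section, []):
                if name not in config.get(section, {}):
                    missing.append((pipeline, section, name))
    return missing


def switch_exporter(config, pipeline, new_name, new_settings):
    """Return a copy of config whose pipeline exports only to new_name."""
    updated = copy.deepcopy(config)
    updated["exporters"][new_name] = new_settings
    updated["service"]["pipelines"][pipeline]["exporters"] = [new_name]
    return updated


# Moving traces from Jaeger to Tempo touches only the Collector config.
switched = switch_exporter(
    collector_config, "traces", "otlp/tempo",
    {"endpoint": "tempo.example.internal:4317"},
)
```

Nothing in the receivers section changes during the switch, so applications keep sending OTLP to the same Collector address.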

OTLP (OpenTelemetry Protocol) is the wire format that ties everything together. It is a gRPC/HTTP protocol optimized for batched telemetry: spans, metrics, and logs share a common resource model (service.name, service.version, deployment.environment) and are encoded in Protobuf for efficiency. OTLP is natively supported by all major observability backends.
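
The shared resource model is easiest to see in a payload. Besides Protobuf, OTLP over HTTP also defines a JSON encoding, and the sketch below builds a single-span trace request in that shape: resource attributes sit once at the top, and spans are grouped beneath them. The scope name, span name, and helper functions are made up for illustration, and only string attributes are handled.

```python
import json
import secrets
import time


def kv(key, value):
    """Encode one attribute as an OTLP key/value pair with a string value."""
    return {"key": key, "value": {"stringValue": str(value)}}


def otlp_trace_payload(resource_attrs, span_name, start_ns, end_ns):
    """Build a one-span OTLP/JSON trace payload as a plain dict."""
    return {
        "resourceSpans": [{
            # Resource attributes are stated once and apply to every span below.
            "resource": {"attributes": [kv(k, v) for k, v in resource_attrs.items()]},
            "scopeSpans": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 random bytes, hex-encoded
                    "spanId": secrets.token_hex(8),    # 8 random bytes, hex-encoded
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


now = time.time_ns()
payload = otlp_trace_payload(
    {"service.name": "orders-api", "service.version": "1.4.2",
     "deployment.environment": "staging"},
    "GET /orders", now, now + 25_000_000,
)
print(json.dumps(payload, indent=2))
```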

Key Points
  1. OTel unifies traces, metrics, and logs under a single instrumentation API. Instrument once, export to any backend. This eliminates vendor lock-in at the instrumentation layer, which is the hardest layer to change.
  2. Auto-instrumentation (the Java agent, Python's opentelemetry-instrument wrapper, Node.js --require) instruments HTTP, DB, messaging, and gRPC calls without code changes. As a rule of thumb, it provides about 80% of tracing value with almost no development effort.
  3. Manual instrumentation adds business-logic spans and custom attributes. Use @WithSpan annotations (Java) or the Tracer API to create spans for operations like 'validate_coupon' or 'calculate_shipping'.
  4. The OTel Collector decouples telemetry routing from application code. Deploy it as a sidecar or gateway to receive, process (batch, filter, sample, enrich), and export telemetry. Changing backends means changing Collector config, not app code.
  5. OTLP (OpenTelemetry Protocol) is the standard wire format. It uses Protobuf over gRPC or HTTP and is supported natively by Datadog, Grafana (Tempo/Mimir/Loki), New Relic, Honeycomb, and 50+ other backends.
  6. Resource attributes (service.name, service.version, k8s.pod.name) are attached to all telemetry, enabling filtering by service, version, or pod across all three signal types.
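
Resource attributes are commonly supplied at deploy time through the OTEL_RESOURCE_ATTRIBUTES environment variable, which takes a comma-separated list of key=value pairs with percent-encoded values. A minimal sketch of a formatter follows; the function name and the sample values are assumptions.

```python
from urllib.parse import quote


def resource_attributes_env(attrs: dict) -> str:
    """Format attributes as the comma-separated key=value list the variable expects."""
    # Percent-encode values so commas or equals signs inside them cannot break parsing.
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in attrs.items())


env_value = resource_attributes_env({
    "service.version": "1.4.2",
    "deployment.environment": "staging",
    "k8s.pod.name": "orders-api-7c9f",
})
print(f"OTEL_RESOURCE_ATTRIBUTES={env_value}")
```

The service name itself is usually set separately through OTEL_SERVICE_NAME.
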
Simple Example

Adding OTel to a Python FastAPI Service

Install opentelemetry-distro and opentelemetry-exporter-otlp (the distro pulls in the opentelemetry-instrumentation tooling). Run 'opentelemetry-bootstrap -a install' to detect the libraries in your environment and install the matching instrumentation packages, here for FastAPI, httpx, SQLAlchemy, and Redis. Set environment variables: OTEL_SERVICE_NAME=orders-api, OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (4317 is the conventional OTLP/gRPC port; OTLP/HTTP conventionally uses 4318). Start the app with 'opentelemetry-instrument python main.py'. Every incoming HTTP request now generates a trace with spans for FastAPI routing, SQLAlchemy queries, and Redis calls -- zero code changes. For business logic, obtain a tracer with trace.get_tracer(__name__) and decorate custom functions with @tracer.start_as_current_span('validate_order').
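
A business-logic span might look like the sketch below. It uses the context-manager form of start_as_current_span rather than the decorator, and it falls back to a no-op when the OpenTelemetry API package is not installed so the snippet runs anywhere. The tracer name, the `business_span` helper, the `order.valid` attribute, and the validation rule are all assumptions for illustration.

```python
from contextlib import nullcontext

try:
    from opentelemetry import trace
    _tracer = trace.get_tracer("orders-api.business")  # hypothetical tracer name
except ImportError:
    _tracer = None  # OTel API not installed: run without spans


def business_span(name):
    """Return a span context manager, or a no-op one when OTel is unavailable."""
    if _tracer is None:
        return nullcontext()
    return _tracer.start_as_current_span(name)


def validate_order(order: dict) -> bool:
    """Toy validation rule: an order needs at least one item and a positive total."""
    with business_span("validate_order") as span:
        ok = bool(order.get("items")) and order.get("total", 0) > 0
        if span is not None:
            # Custom attribute so the backend can filter on validation outcome.
            span.set_attribute("order.valid", ok)
        return ok


print(validate_order({"items": ["sku-1"], "total": 19.99}))
```

When the service is started through opentelemetry-instrument, this span nests under the FastAPI request span automatically because both share the current trace context.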

Real-World Examples

eBay

eBay migrated from a proprietary tracing system to OpenTelemetry across 3,000+ microservices. The migration used a phased approach: first deploying OTel Collectors as sidecars that accepted both the legacy format and OTLP, then gradually switching services to OTel SDKs. Full migration took 12 months. Post-migration, eBay reduced observability vendor costs by 30% by routing telemetry through the Collector's processing pipeline (tail sampling, attribute filtering).

Skyscanner

Skyscanner uses OpenTelemetry auto-instrumentation for all Java and Python services, with manual instrumentation for critical business paths (flight search, price calculation). Their OTel Collector runs as a Kubernetes DaemonSet that enriches spans with pod metadata, applies tail-based sampling (keep errors and slow requests), and exports to Grafana Tempo. The standardized instrumentation enabled them to build a fleet-wide dependency map automatically.

GitHub

GitHub adopted OpenTelemetry for their Ruby on Rails monolith and surrounding Go microservices. They contributed the Ruby auto-instrumentation package back to the OTel project. By using the Collector as a telemetry gateway, they can A/B test observability backends (evaluating Honeycomb vs. Datadog) by duplicating telemetry to both backends without changing any application code.

Trade-Offs
Aspect | Description
Vendor-Neutral vs. Vendor-Optimized | OTel provides portability but may not expose vendor-specific features (Datadog's profiling integration, Honeycomb's BubbleUp). Some teams use OTel for base instrumentation and add vendor-specific agents for advanced features.
Auto-Instrumentation vs. Manual Control | Auto-instrumentation is zero-effort but generates framework-level spans that may be noisy. Manual instrumentation adds business context but requires developer effort and ongoing maintenance as code changes.
Collector Sidecar vs. Gateway | Sidecar deployment (one Collector per pod) provides isolation but consumes resources per pod. Gateway deployment (shared Collector pool) is efficient but creates a centralized bottleneck. Most teams use a DaemonSet (one per node) as a compromise.
SDK Maturity Across Languages | Java and Go OTel SDKs are stable and battle-tested. Python and Node.js are GA but less mature. Ruby, PHP, and Rust are still evolving. Teams with polyglot stacks may find inconsistent instrumentation quality across languages.
Case Study

Zalando Migrates 500 Services to OpenTelemetry in 6 Months

Scenario

Zalando, Europe's largest online fashion platform, ran a proprietary tracing library that required manual integration in each service. With 500+ microservices and 200+ developers, keeping the library updated was a full-time job.

Solution

They migrated to OpenTelemetry Java auto-instrumentation, deploying the OTel Java agent as a default JVM argument in their Kubernetes base image. 80% of services were instrumented automatically with no code changes. For the remaining 20% (custom protocols, native code), they added manual instrumentation. The Collector runs as a DaemonSet with tail-based sampling.

Outcome

500 services instrumented in 6 months (vs. 3 years for the previous library), 40% reduction in mean time to resolve incidents, and the ability to switch trace backends by changing a Collector config file.

Common Mistakes
  • ⚠ Exporting telemetry directly from the SDK to the backend: Without a Collector, you cannot process telemetry before export (no tail sampling, no attribute filtering, no batching optimization), and switching backends requires redeploying every service. Always deploy an OTel Collector between your SDKs and your backend -- even a minimal Collector (receive OTLP, export OTLP) gives you a control plane for future processing and routing changes.
  • ⚠ Not setting resource attributes: Traces arrive at the backend without service.name, deployment.environment, or k8s.pod.name, so you cannot filter by service or correlate with Kubernetes events. Set OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES at deploy time, and in Kubernetes use the OTel Operator or Collector's k8sattributes processor to auto-inject pod and node metadata.
  • ⚠ Over-instrumenting with too many custom spans: Creating a span for every function call produces traces with 100+ spans per request, making the trace waterfall unreadable while export volume explodes and overhead impacts latency. Instrument at service boundaries (HTTP, gRPC, DB, cache) and key business operations -- a typical trace should have 5-15 spans per service.
  • ⚠ Ignoring the Collector's processing pipeline: Raw telemetry is exported without filtering, so PII (user emails in span attributes), high-cardinality labels, and debug-level logs flow to the backend, increasing cost and creating compliance risk. Use Collector processors: attributes to delete or hash sensitive fields, filter to drop unwanted telemetry such as debug-level logs, tail_sampling to keep only useful traces, transform to normalize attribute names, and batch to optimize export efficiency.
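
The last mistake is easier to avoid once the processing logic is concrete. The sketch below models, in plain Python, the two decisions a Collector pipeline would make on a completed trace: whether to keep it (errors and slow requests always, a small deterministic baseline otherwise) and which attributes to strip before export. The thresholds, the deny-list, and the span dict layout are assumptions; in practice this logic lives in Collector processor configuration, not application code.

```python
SENSITIVE_KEYS = {"user.email", "http.request.header.authorization"}  # hypothetical deny-list


def redact(span: dict) -> dict:
    """Return a copy of the span with sensitive attributes removed."""
    attrs = {k: v for k, v in span["attributes"].items() if k not in SENSITIVE_KEYS}
    return {**span, "attributes": attrs}


def tail_sample(spans: list, slow_ms: float = 500.0, baseline_ratio: float = 0.05):
    """Decide on a complete trace: return the reason to keep it, or None to drop it."""
    if any(s["status"] == "error" for s in spans):
        return "error"
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms >= slow_ms:
        return "slow"
    # Deterministic baseline: the same trace ID always gets the same decision.
    bucket = int(spans[0]["trace_id"], 16) % 10_000
    return "baseline" if bucket < baseline_ratio * 10_000 else None


def process(spans: list) -> list:
    """Apply sampling first, then redaction, as a pipeline would before export."""
    return [redact(s) for s in spans] if tail_sample(spans) else []


failed_request = [{
    "trace_id": "0" * 28 + "ffff", "status": "error", "start_ms": 0, "end_ms": 40,
    "attributes": {"user.email": "a@example.com", "http.route": "/orders"},
}]
exported = process(failed_request)
```

Unlike head sampling in the SDK, this decision needs the whole trace, which is why it belongs in the Collector.
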