ADR-0008: Centralized Telemetry Gateway and TraceParent Bridging
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-03-16 |
| Author | Subhash Bhushan |
| Epic | 6.1 OpenTelemetry Integration (#742) |
Context
The optional dependency problem
OpenTelemetry is an optional dependency in Protean (pip install
protean[telemetry]). Instrumentation code -- span creation, metric
recording, context propagation -- is woven into core modules like
CommandProcessor, UnitOfWork, Repository, Engine, and
OutboxProcessor. Without a clear boundary, every instrumented file
would need its own try: import opentelemetry ... except ImportError
guard and conditional logic, scattering OTEL awareness across dozens
of files.
During the Epic 6.1 implementation (8 PRs over multiple weeks), code
review caught at least four instances where contributors directly
imported opentelemetry in instrumented modules (mixins.py,
command_processor.py, observatory/metrics.py, fastapi/telemetry.py).
Each was corrected by adding a helper to the central module. This
demonstrated that the boundary needs to be explicit and well-documented,
not just a convention.
The distributed tracing problem
Protean is an asynchronous, message-driven framework. A single user action can span multiple processing steps:
HTTP POST /orders
-> CommandProcessor.process(PlaceOrder)
-> CommandHandler raises OrderPlaced event
-> UnitOfWork commits, writes to outbox
-> OutboxProcessor publishes to event store
-> Engine picks up event, dispatches to EventHandler
-> EventHandler dispatches ReduceInventory command
-> ...
Each arrow may cross a process boundary (via event store or broker).
For a distributed trace to survive this journey, the trace context must
be explicitly carried in message metadata and extracted/injected at each
boundary. Protean already had a TraceParent value object in
MessageHeaders using W3C format -- but it was not connected to OTEL's
context propagation. The challenge was bridging these two systems
bidirectionally without coupling core domain code to OTEL imports.
Decision
Part 1: Centralized telemetry gateway
All OpenTelemetry interaction is funneled through a single module:
src/protean/utils/telemetry.py. The rest of the codebase never
imports opentelemetry directly.
The gateway exposes a fixed public API:
| Function | Purpose |
|---|---|
init_telemetry(domain) |
One-time provider setup |
shutdown_telemetry(domain) |
Flush and tear down |
get_tracer(domain) |
Domain-scoped tracer (or no-op) |
get_meter(domain) |
Domain-scoped meter (or no-op) |
get_tracer_provider(domain) |
Raw provider for advanced use |
get_meter_provider(domain) |
Raw provider for advanced use |
get_domain_metrics(domain) |
Cached DomainMetrics instance |
get_prometheus_text(domain) |
Prometheus exposition format |
set_span_error(span, exc) |
Record exception + set ERROR status |
extract_context_from_traceparent(tp) |
TraceParent VO -> OTEL Context |
inject_traceparent_from_context() |
Current OTEL span -> TraceParent VO |
create_observation(gauge, value, attrs) |
Observable gauge callback helper |
instrument_fastapi_app(app, domain) |
FastAPI auto-instrumentation |
When OTEL packages are not installed or telemetry is disabled, every
function returns a lightweight no-op: _NoOpTracer, _NoOpSpan,
_NoOpMeter, _NoOpCounter, _NoOpHistogram, _NoOpObservableGauge.
These implement the minimal interface that instrumentation code depends
on (e.g., start_as_current_span() returns a context manager yielding
_NoOpSpan). Instrumentation code calls tracer.start_as_current_span()
unconditionally and gets either a real span or a no-op -- no conditional
guards anywhere.
Part 2: Span conventions
All spans follow the naming pattern protean.<subsystem>.<operation>:
| Span Name | Subsystem | Set By |
|---|---|---|
protean.command.process |
Command processing | CommandProcessor |
protean.command.enrich |
Command enrichment | CommandProcessor |
protean.handler.execute |
Handler dispatch | HandlerMixin |
protean.query.dispatch |
Query dispatch | QueryProcessor |
protean.uow.commit |
Transaction commit | UnitOfWork |
protean.repository.add |
Aggregate persistence | BaseRepository |
protean.repository.get |
Aggregate retrieval | BaseRepository |
protean.event_store.append |
Event store write | BaseEventStore |
protean.engine.handle_message |
Server message handling | Engine |
protean.outbox.process |
Outbox batch | OutboxProcessor |
protean.outbox.publish |
Outbox single publish | OutboxProcessor |
All spans are created with record_exception=False,
set_status_on_exception=False. Errors are recorded explicitly via
set_span_error(span, exc) in exception handlers. This gives precise
control over what gets recorded and keeps the StatusCode import
centralized in the gateway module.
Part 3: Bidirectional TraceParent bridging
Distributed traces flow across Protean's message boundaries through three cooperative injection/extraction points:
Injection (outgoing messages):
-
Command enrichment (
CommandProcessor.enrich()) -- callsinject_traceparent_from_context()to capture the current OTEL span context as aTraceParentvalue object in the command'sMessageHeaders.traceparent. -
Event raising (
BaseAggregate.raise_()) -- preserves an existingtraceparentfrom the event's metadata if present (e.g., when the event was constructed with explicit headers). Falls back toinject_traceparent_from_context()to capture the current span context. This ensures events raised during synchronous command handler execution inherit the handler's span context.
Extraction (incoming messages):
- Command processing (
CommandProcessor.process()) and engine message handling (Engine.handle_message()) -- callextract_context_from_traceparent()to convert the incomingTraceParentvalue object back to an OTELContext, passed as thecontext=parameter tostart_as_current_span(). This makes the new span a child of the incoming trace.
The bridge uses OTEL's standard propagate.inject() and
propagate.extract() with W3C traceparent format, matching the format
that Protean's TraceParent value object already uses. When OTEL is not
installed, both helpers return None and the traceparent fields are
simply not populated -- no trace, no overhead, no errors.
A supporting fix was required: BaseCommand._build_metadata() previously
discarded headers that contained only a traceparent (it required type
or id to be present). This was corrected to preserve headers with any
meaningful content.
Part 4: Metrics
A fixed set of metric instruments is defined in the DomainMetrics class,
created lazily per domain via get_domain_metrics(domain):
Counters: protean.command.processed, protean.handler.invocations,
protean.uow.commits, protean.outbox.published, protean.outbox.failed
Histograms: protean.command.duration, protean.handler.duration,
protean.uow.events_per_commit, protean.outbox.latency
Dimensional attributes (e.g., command_type, handler_name, status)
are passed at recording time, not at instrument creation time.
The Observatory's /metrics endpoint converges with OTEL: when OTEL is
installed, infrastructure metrics (outbox depth, broker health,
subscription lag) are registered as ObservableGauge callbacks on the
OTEL MeterProvider, and get_prometheus_text() serves the combined
output. When OTEL is not installed, the original hand-rolled Prometheus
exposition is preserved unchanged.
Consequences
Positive:
- Instrumentation code is clean and unconditional. A typical callsite is
four lines: get tracer, open span, set attributes, handle error. No
imports from
opentelemetry, no availability checks. - The no-op pattern establishes a reusable approach for future optional integrations in Protean. Any subsystem that depends on an optional package can follow the same gateway + no-op model.
- Full distributed traces flow end-to-end: HTTP request -> command -> handler -> event -> outbox -> downstream handler, all linked by W3C traceparent propagation.
- Zero overhead when telemetry is disabled: no-ops are trivial objects with empty methods; OTEL packages are never imported.
- The
/metricsendpoint works identically whether OTEL is installed or not -- no behavioral change for existing Observatory users.
Negative:
- The gateway module is a single point of evolution. Every new OTEL
feature (e.g., span links, baggage) requires adding a helper to
telemetry.pyrather than using OTEL directly. This is by design but adds friction. - The "never import opentelemetry elsewhere" rule is enforced by code review, not by tooling. A lint rule (e.g., in 3.5 Architecture Fitness Functions) could automate this in the future.
- Trace continuity depends on all three injection/extraction points working correctly. If a new message pathway is added without traceparent bridging, the trace will break at that boundary.
- The async command path records
status="ok"and duration at enqueue time, even though execution happens later and may fail. This is a known trade-off -- the metric reflects "accepted for processing", not "completed successfully".
Alternatives Considered
Conditional guards at each callsite. The straightforward approach:
wrap every OTEL call in if otel_available:. Rejected because it
scatters the optional dependency boundary across dozens of files, makes
instrumentation code harder to read, and is error-prone (easy to forget
the guard in a new callsite).
Decorator-based automatic instrumentation. Automatically wrap
decorated domain elements (e.g., @domain.command_handler) with OTEL
spans. Rejected because it provides no control over span attributes,
error recording, or context propagation. The explicit approach gives
each subsystem control over what gets traced and how.
Single trace system (replace Observatory with OTEL). Replace the Redis-backed Observatory entirely with OTEL + an in-process backend. Rejected because Observatory provides zero-config, real-time SSE-based tracing for local development. OTEL requires infrastructure (a collector or backend). The two serve different audiences and coexist at low cost.
Jaeger/Zipkin-specific integration (instead of OTEL). Rejected because OTEL is the industry convergence point. Jaeger and Zipkin both accept OTLP. Choosing OTEL avoids vendor lock-in and gives users freedom to pick their backend.
Update — Log↔Trace Correlation (Epic 6.6, sub-issue #917)
Epic 6.6 (Logging Overhaul) extended this story with automatic log↔trace
correlation. When telemetry.enabled=True, Domain.configure_logging()
installs a stdlib OTelTraceContextFilter on the root logger and appends a
protean_otel_processor to the structlog pipeline. Both read the active
span via opentelemetry.trace.get_current_span() and inject trace_id
(32-char hex), span_id (16-char hex), and trace_flags (int) into
every log record. Outside a valid span, or when the opentelemetry extra
is not installed, the fields default to empty strings / 0 — no lazy
import fires when telemetry is disabled.
Epic 6.6 also closed a gap in the Epic 6.5 correlation extraction: the
existing ProteanCorrelationFilter and protean_correlation_processor
now fall back to g.correlation_id and g.causation_id when no
g.message_in_context is in scope. These two attributes are the
documented extension point for HTTP middleware, CLI commands, and
background jobs that need log records tagged before a domain message
exists. Message-based extraction (from g.message_in_context.metadata.domain)
still takes precedence when available.
Together, the two additions complete the end-to-end OOTB promise Epic 6.5 made for logs: every log record emitted under an active domain context now carries both the correlation identifiers and the trace identifiers needed to jump between log aggregators and APM tools without code changes.
References
- OpenTelemetry Python SDK
- W3C Trace Context specification
- Epic 6.1: OpenTelemetry Integration (#742)
- Epic 6.6: Logging Overhaul (#912) — log↔trace correlation (#917)
- ADR-0007: Domain-Scoped OpenTelemetry Providers (companion decision)