Harden the server for production
This guide walks through the operational steps of taking a Protean domain to production: raising connection pools, exposing Kubernetes health probes, enabling DLQ maintenance, picking subscription profiles, emitting OpenTelemetry metrics, and shutting down gracefully. For the full catalogue of options, defaults, metric names, and profile values, see the Server Hardening reference.
Raise connection pool limits
Out-of-the-box SQLAlchemy defaults (pool_size = 5, max_overflow =
10) are sized for a single worker against a small database. Bump them
before you go live:
[databases.default]
provider = "postgresql"
database_uri = "${DATABASE_URL}"
pool_size = 10
max_overflow = 20
pool_recycle = 1800
Do the same for the Redis broker and cache so their ConnectionPool
scales with request volume:
[brokers.default]
provider = "redis"
URI = "${REDIS_URL}"
max_connections = 50
[caches.default]
provider = "redis"
URI = "${REDIS_URL}"
max_connections = 50
Run protean check before deploying. It surfaces a LOW_POOL_SIZE
warning for SQLAlchemy databases configured below the production
default, and catches other misconfigurations at the same time.
Size the pool against your database's ceiling
Each worker owns its own pool, so the peak connection count is
workers × (pool_size + max_overflow). With four engine workers and
the configuration above (pool_size = 10, max_overflow = 20):
4 × (10 + 20) = 120 connections
If the same database also serves an API tier with its own pool, add
those connections too. Compare the total against your database's
max_connections setting and leave headroom for admin sessions,
migrations, and read replicas. On PostgreSQL the default
max_connections is 100, so an undersized database limit is a more
common failure than an undersized Protean pool.
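The arithmetic above can be scripted as a pre-deploy sanity check. The following is a minimal sketch, not part of Protean: the function names and the `reserved` headroom figure are illustrative assumptions, and each tier is given as a (workers, pool_size, max_overflow) tuple.

```python
from typing import Iterable, Tuple

Tier = Tuple[int, int, int]  # (workers, pool_size, max_overflow)


def peak_connections(workers: int, pool_size: int, max_overflow: int) -> int:
    """Worst-case connections one tier can open: workers x (pool_size + max_overflow)."""
    return workers * (pool_size + max_overflow)


def check_capacity(
    tiers: Iterable[Tier], db_max_connections: int, reserved: int = 10
) -> Tuple[int, bool]:
    """Sum the peak demand of every tier and compare it with the database limit.

    `reserved` is a hypothetical headroom figure for admin sessions and
    migrations; choose a value that matches your own operations.
    """
    total = sum(peak_connections(*tier) for tier in tiers)
    return total, total <= db_max_connections - reserved


# Four engine workers with pool_size = 10 and max_overflow = 20,
# checked against PostgreSQL's default max_connections of 100.
total, fits = check_capacity([(4, 10, 20)], db_max_connections=100)
print(total, fits)  # 120 False
```

With the configuration from this guide the check fails against a stock PostgreSQL, which is the signal to raise max_connections on the database, add a pooler, or lower the per-worker limits.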
Expose health probes to Kubernetes
Async engine (protean server)
The engine embeds a lightweight HTTP server on port 8080 by default.
No configuration is needed unless you want to move the port, bind to a
different interface, or turn the server off; to override, add a
[server.health] section to domain.toml:
[server.health]
host = "0.0.0.0"
port = 8080
Wire the probes into your Deployment:
containers:
  - name: server
    image: my-app:latest
    command: ["protean", "server", "--domain=my_domain"]
    ports:
      - containerPort: 8080
        name: health
    livenessProbe:
      httpGet: { path: /livez, port: health }
      periodSeconds: 10
    readinessProbe:
      httpGet: { path: /readyz, port: health }
      periodSeconds: 5
/livez proves the event loop is responsive; /readyz inspects every
provider, broker, cache, and the event store, and returns 503 when
any component is unhealthy or the engine is shutting down.
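Before wiring the probes into a cluster, it can help to hit the endpoints by hand. The sketch below uses only the standard library and assumes nothing beyond the /livez and /readyz paths and the 503 behaviour described above; the `probe` and `summarize` helpers are illustrative names, not Protean APIs.

```python
import urllib.error
import urllib.request
from typing import Dict


def probe(base_url: str, path: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a health endpoint, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}{path}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises for non-2xx responses; a 503 from /readyz lands here.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def summarize(base_url: str) -> Dict[str, int]:
    """Collect the status code of both probe paths."""
    return {path: probe(base_url, path) for path in ("/livez", "/readyz")}
```

Calling `summarize("http://localhost:8080")` against a locally running engine should report 200 for both paths when every component is healthy, and 503 for /readyz while a dependency is down or the engine is shutting down.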
FastAPI apps
Mount the equivalent router on your API process:
from fastapi import FastAPI
from protean.integrations.fastapi.health import create_health_router
app = FastAPI()
app.include_router(create_health_router(domain))
The router exposes the same /healthz, /livez, and /readyz paths
with the same readiness semantics as the engine server.
Keep the DLQ under control
By default, messages that exhaust their retries pile up in
{stream}:dlq forever. Enable the maintenance task to trim old entries
and alert when a queue grows:
[server.dlq]
enabled = true
retention_hours = 168 # Keep 7 days of history
alert_threshold = 100 # Warn when depth hits 100
alert_callback = "myapp.alerts.on_dlq_alert"
The alert callback runs inside the engine; keep it cheap and non-blocking (page the on-call rotation, post to Slack, open a ticket). Time-based trimming requires a Redis Streams broker.
A minimal Slack webhook alert:
# myapp/alerts.py
import logging
import os

import httpx

logger = logging.getLogger(__name__)

_SLACK_WEBHOOK = os.environ.get("SLACK_DLQ_WEBHOOK")


def on_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> None:
    """Post a Slack message when a DLQ crosses its depth threshold."""
    if not _SLACK_WEBHOOK:
        logger.warning(
            "DLQ alert: %s depth=%d threshold=%d", dlq_stream, depth, threshold
        )
        return
    try:
        httpx.post(
            _SLACK_WEBHOOK,
            json={
                "text": (
                    f":warning: DLQ `{dlq_stream}` has {depth} messages "
                    f"(threshold {threshold}). Investigate before replaying."
                )
            },
            timeout=2.0,
        )
    except httpx.HTTPError:
        logger.exception("Failed to post DLQ alert to Slack")
The callback fires once per maintenance cycle while the threshold is breached. Any exception it raises is caught and logged — a broken alert handler will not stall the engine.
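Because the callback fires on every maintenance cycle while the threshold stays breached, a noisy queue can flood a channel. One way to soften that is a small throttling wrapper. This is a sketch under assumptions: `throttled` is a hypothetical helper, not something Protean ships, it reuses the (dlq_stream, depth, threshold) signature shown above, and it assumes the engine invokes the callback from a single thread.

```python
import time
from typing import Callable, Dict

AlertCallback = Callable[[str, int, int], None]


def throttled(
    callback: AlertCallback,
    min_interval: float = 900.0,
    clock: Callable[[], float] = time.monotonic,
) -> AlertCallback:
    """Wrap an alert callback so each DLQ stream alerts at most once per interval."""
    last_sent: Dict[str, float] = {}

    def wrapper(dlq_stream: str, depth: int, threshold: int) -> None:
        now = clock()
        previous = last_sent.get(dlq_stream)
        if previous is not None and now - previous < min_interval:
            return  # Still inside the quiet period for this stream.
        last_sent[dlq_stream] = now
        callback(dlq_stream, depth, threshold)

    return wrapper
```

Since alert_callback is configured as a dotted path, the wrapped function would need to be a module-level name, for example `on_dlq_alert_throttled = throttled(on_dlq_alert)` in myapp/alerts.py, with the configuration pointing at that name.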
Override retention or alerting per handler when a subscription needs different SLAs — for example, an auditing handler that must keep 30 days of failures:
@domain.event_handler(
    part_of=Order,
    subscription_config={
        "dlq_retention_hours": 720,
        "dlq_alert_threshold": 10,
    },
)
class AuditHandler(BaseEventHandler):
    ...
For discovery, inspection, and replay of individual DLQ messages, see Dead Letter Queues.
Pick a subscription profile
Every handler resolves to a SubscriptionConfig at startup. Pick a
profile that matches its workload instead of tuning fields one at a
time:
from protean.server.subscription.profiles import SubscriptionProfile
@domain.event_handler(
    part_of=Order,
    subscription_profile=SubscriptionProfile.PRODUCTION,
)
class OrderEventHandler(BaseEventHandler):
    ...
Override individual fields without abandoning the profile:
@domain.event_handler(
    part_of=Order,
    subscription_profile=SubscriptionProfile.PRODUCTION,
    subscription_config={"messages_per_tick": 50},
)
class BulkOrderHandler(BaseEventHandler):
    ...
Emit OpenTelemetry metrics
Install the telemetry extra and enable it in domain.toml:
pip install "protean[telemetry]"
[telemetry]
enabled = true
exporter = "otlp"
endpoint = "http://otel-collector:4317"
service_name = "my-service"
The engine emits per-subscription counters and histograms
(protean.subscription.messages_processed,
protean.subscription.processing_duration, …), DLQ maintenance
counters (protean.dlq.trimmed, protean.dlq.alerts), and engine
gauges (protean.engine.up, protean.engine.uptime_seconds, …)
directly through the OTLP exporter.
Connection-pool and backpressure gauges
(protean.db.pool_*, protean.broker.pool_active_connections,
protean.subscription.consumer_lag) are lazily registered on the
Observatory's /metrics endpoint. Scrape it with Prometheus
alongside your OTLP exporter — see Monitoring.
Every metric is a no-op when opentelemetry-api is not installed.
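If you add application-level metrics of your own, the same degrade-to-no-op behaviour is easy to mirror. The sketch below is not Protean's implementation: the `counter` helper, the `_NoOpCounter` class, and the metric name are illustrative assumptions. Only `metrics.get_meter`, `create_counter`, and `Counter.add` come from the opentelemetry-api package.

```python
from typing import Mapping, Optional

try:
    from opentelemetry import metrics

    _meter = metrics.get_meter("my-service")
except ImportError:
    # opentelemetry-api is not installed; fall back to no-op instruments.
    _meter = None


class _NoOpCounter:
    """Stand-in with the same add() shape as an OpenTelemetry counter."""

    def add(self, amount: int, attributes: Optional[Mapping[str, str]] = None) -> None:
        pass


def counter(name: str):
    """Return a real counter when OpenTelemetry is available, else a no-op."""
    if _meter is None:
        return _NoOpCounter()
    return _meter.create_counter(name)


# Hypothetical application metric; calling code is identical either way.
orders_handled = counter("myapp.orders.handled")
orders_handled.add(1, {"handler": "OrderEventHandler"})
```

The point of the pattern is that handler code never branches on whether telemetry is installed; the decision is made once at import time.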
Shut down gracefully
protean server handles SIGINT, SIGTERM, and SIGHUP by stopping
the health server, signalling every subscription and outbox processor
to stop, waiting up to 10 seconds for in-flight handlers to finish,
then closing the event store, brokers, caches, and providers in
reverse initialisation order. Send SIGTERM and give the pod a
terminationGracePeriodSeconds of at least 15, long enough for
the drain and close steps to complete. A preStop hook runs inside the
same grace period, so add its duration to the budget:
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: server
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]  # Let the load balancer drain
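The drain-then-close sequence described above can be illustrated with plain asyncio. This is a simplified sketch, not Protean's engine code: `drain_and_close` and the resource names are hypothetical, and cancelling handlers that outlive the timeout is an assumption the guide does not confirm.

```python
import asyncio
from typing import Callable, List, Sequence, Set, Tuple


async def drain_and_close(
    in_flight: Set["asyncio.Task"],
    resources: Sequence[Tuple[str, Callable[[], None]]],
    drain_timeout: float = 10.0,
) -> List[str]:
    """Wait for in-flight tasks up to a timeout, then close resources in reverse order.

    `resources` is listed in initialisation order as (name, close_function) pairs.
    Returns the names in the order they were closed.
    """
    if in_flight:
        _done, pending = await asyncio.wait(in_flight, timeout=drain_timeout)
        # Assumption: tasks still running after the timeout are cancelled.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)

    closed = []
    for name, close in reversed(resources):
        close()
        closed.append(name)
    return closed
```

The worst case for this sequence is the drain timeout plus the time the close calls take, which is why the grace period needs to exceed the 10-second drain window by a margin.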
When you create and tear down domains from test or tooling code, call
domain.close() yourself:
from my_domain import domain

try:
    with domain.domain_context():
        ...  # do work
finally:
    domain.close()
Custom adapters inherit a no-op close(); override it when your
adapter holds sockets, file handles, or background threads.
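As an illustration of what such an override might look like, here is a standalone sketch of an adapter that owns a background thread. `HeartbeatAdapter` is hypothetical and deliberately does not subclass any Protean base class, since the exact base and its constructor are not covered in this guide. The part that carries over is the idempotent close() that stops the thread and waits for it.

```python
import threading


class HeartbeatAdapter:
    """Hypothetical adapter that runs periodic work on a background thread."""

    def __init__(self, interval: float = 1.0) -> None:
        self._closed = False
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True
        )
        self._thread.start()

    def _run(self, interval: float) -> None:
        # Event.wait returns True once close() sets the event, ending the loop.
        while not self._stop.wait(interval):
            pass  # Periodic work would go here.

    def close(self) -> None:
        """Stop the thread and wait for it; safe to call more than once."""
        if self._closed:
            return
        self._closed = True
        self._stop.set()
        self._thread.join(timeout=5.0)
```

Making close() idempotent matters because shutdown paths can overlap, for example a signal handler and a test fixture both tearing down the same domain.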
See also
- Server Hardening reference — every option, default, and metric catalogued.
- Production Deployment — process management, Docker, and Kubernetes manifests.
- Error Handling — retry flow per subscription type and version-conflict auto-retry.
- Dead Letter Queues — inspect, replay, and purge DLQ entries.
- OpenTelemetry Integration — exporter setup, TraceParent propagation, and the full span catalogue.