# Server Hardening Reference

Every option, default, and metric shipped by the Server Hardening epic — pool tuning, health probes, DLQ policy, subscription profiles, OpenTelemetry metrics, shutdown, and optimistic locking. For the operational walkthrough that ties these together, see the Server Hardening guide. For the `--reload` development flag, see Run the Server.
## Connection pools

### SQLAlchemy providers

Pool defaults for the `postgresql` and `mssql` providers (SQLite uses `SingletonThreadPool` and ignores these keys).

| Key | Default | Purpose |
|---|---|---|
| `pool_size` | 5 | Base connections kept open per worker |
| `max_overflow` | 10 | Additional temporary connections beyond `pool_size` |
| `pool_recycle` | unset | Recycle connections older than N seconds |
```toml
[databases.default]
provider = "postgresql"
database_uri = "${DATABASE_URL}"
pool_size = 10
max_overflow = 20
pool_recycle = 1800
```
### Redis broker and cache

Both adapters forward pool parameters to `redis.ConnectionPool`.

| Key | Purpose |
|---|---|
| `max_connections` | Cap on connections in the pool |
| `socket_timeout` | Read/write timeout, seconds |
| `socket_connect_timeout` | Connection timeout, seconds |
| `retry_on_timeout` | Retry reads that time out |
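For instance, a broker configuration forwarding these keys — a sketch that assumes a `[brokers.default]` table analogous to `[databases.default]`; verify the section and URI key names against your domain configuration:

```toml
# Assumed section name; the four pool keys are the documented ones.
[brokers.default]
provider = "redis"
broker_uri = "${REDIS_URL}"
max_connections = 50
socket_timeout = 5
socket_connect_timeout = 2
retry_on_timeout = true
```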
### MessageDB event store

Forward `max_connections` directly through `conn_info`.

```toml
[event_store]
provider = "message_db"
database_uri = "${MESSAGE_DB_URL}"
max_connections = 20
```
### LOW_POOL_SIZE warning

`Domain.check()` emits a `LOW_POOL_SIZE` warning for any SQLAlchemy database with `pool_size < 5` unless `PROTEAN_ENV` is `development` or `testing`. Memory providers are skipped. The warning is advisory — it does not block startup.

Sample output from `protean check` when `pool_size = 2` on a PostgreSQL provider:

```shell
$ protean check --domain=my_domain
Domain: my_domain  WARN
  1 warning(s)

Warnings (1):
  ! LOW_POOL_SIZE: Database 'default' has pool_size=2 (production
    default is 5). Consider raising it for production workloads.
```

`protean check` exits with code 2 on warnings, so CI pipelines that enforce `--strict` will fail. Raise `pool_size` or set `PROTEAN_ENV` to `development`/`testing` to silence the warning.
## Health checks

### `[server.health]`

| Key | Default | Purpose |
|---|---|---|
| `enabled` | `true` | Start the health HTTP server |
| `host` | `"0.0.0.0"` | Bind address |
| `port` | `8080` | Listen port |
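For example, to move the probe server off the default port:

```toml
[server.health]
enabled = true
host = "0.0.0.0"
port = 9090   # default is 8080
```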
### Engine endpoints

| Path | Probe | Response |
|---|---|---|
| `GET /healthz` | Liveness | 200 with `{"status": "ok", "checks": {"event_loop": "responsive"}}` |
| `GET /livez` | Liveness (alias for `/healthz`) | Same as `/healthz` |
| `GET /readyz` | Readiness | 200 when all checks pass, 503 otherwise |
Sample responses — liveness while the engine is running:

```shell
$ curl -i http://localhost:8080/livez
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 55

{"status": "ok", "checks": {"event_loop": "responsive"}}
```
Readiness when every dependency is reachable:

```shell
$ curl -i http://localhost:8080/readyz
HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "ok",
  "checks": {
    "shutting_down": false,
    "providers": {"default": "ok"},
    "brokers": {"default": "ok"},
    "event_store": "ok",
    "caches": {"default": "ok"},
    "subscriptions": 12
  }
}
```
Readiness when one component is unreachable — the status flips to `"degraded"` and the HTTP code to 503, which Kubernetes treats as "not ready":

```shell
$ curl -i http://localhost:8080/readyz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "degraded",
  "checks": {
    "shutting_down": false,
    "providers": {"default": "ok"},
    "brokers": {"default": "unavailable"},
    "event_store": "ok",
    "caches": {"default": "ok"},
    "subscriptions": 12
  }
}
```
Readiness after SIGTERM arrives — the engine reports `"unavailable"` immediately so the load balancer drains traffic before in-flight handlers are affected:

```shell
$ curl -i http://localhost:8080/readyz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{"status": "unavailable", "checks": {"shutting_down": true}}
```
`/livez` keeps returning 200 during the drain window; it only fails when the event loop itself is blocked. This asymmetry is deliberate: liveness triggers a restart, while readiness merely pulls the pod out of rotation.
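That asymmetry maps directly onto Kubernetes probe configuration. A minimal sketch — the timings and thresholds are illustrative examples, not shipped defaults; only the paths and the port 8080 default come from the tables above:

```yaml
# Illustrative probe wiring for a pod running the Protean engine.
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # restart only when the event loop is truly stuck
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 1   # pull the pod from rotation on the first 503
```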
### FastAPI router factory

```python
from protean.integrations.fastapi.health import create_health_router

create_health_router(
    domain,                        # Domain instance
    *,
    prefix: str = "",              # URL prefix for all health routes
    tags: list[str] | None = None, # OpenAPI tags
)
```

Mounts `GET /healthz`, `GET /livez`, and `GET /readyz`. The `/readyz` check runs the same provider, broker, event-store, and cache inspection as the engine server. The `/healthz` and `/livez` bodies differ — the FastAPI router returns `{"status": "ok", "checks": {"application": "running"}}`, since there is no event-loop task inside the request cycle to probe.
## Dead-letter queue policy

### `[server.dlq]`

| Key | Default | Purpose |
|---|---|---|
| `enabled` | `false` | Start the DLQ maintenance task |
| `retention_hours` | 168 (7 days) | Trim DLQ entries older than this |
| `alert_threshold` | 100 | Log a warning when DLQ depth ≥ this |
| `alert_callback` | unset | Dotted path to a callable, invoked on alert |
| `check_interval_seconds` | 60 | Seconds between maintenance cycles |
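A sample configuration using the keys above (the dotted callback path is illustrative):

```toml
[server.dlq]
enabled = true
retention_hours = 72                          # trim entries older than 3 days
alert_threshold = 50                          # warn earlier than the default 100
alert_callback = "myapp.alerts.on_dlq_alert"  # illustrative dotted path
check_interval_seconds = 120
```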
The alert callback is invoked with keyword arguments:

```python
def on_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> None:
    ...
```
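As a sketch, a callback that logs the breach and leaves a hook for real alerting (the `notify_oncall` hook is a hypothetical stand-in for your pager or chat client):

```python
import logging

logger = logging.getLogger("dlq.alerts")


def format_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> str:
    # Kept separate from the callback so the formatting is easy to unit-test.
    return f"DLQ '{dlq_stream}' depth {depth} breached threshold {threshold}"


def on_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> None:
    # Matches the keyword signature the maintenance task invokes.
    logger.warning(format_dlq_alert(dlq_stream, depth, threshold))
    # notify_oncall(...)  # hypothetical pager/Slack hook
```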
### Per-subscription overrides

Fields on `SubscriptionConfig` that override the global defaults for a single subscription:

| Field | Type | Default | Purpose |
|---|---|---|---|
| `dlq_retention_hours` | `int \| None` | inherit global | Per-handler retention window |
| `dlq_alert_threshold` | `int \| None` | inherit global | Per-handler alert threshold |
The maintenance task only runs when a configured broker advertises the `DEAD_LETTER_QUEUE` capability. Redis Streams implements time-based trimming via `XTRIM MINID`; other brokers fall back to a no-op `dlq_trim()`.
## Subscription profiles

Five profiles — `PRODUCTION`, `FAST`, `BATCH`, `DEBUG`, `PROJECTION` — resolve at engine startup to concrete `SubscriptionConfig` values. For the full per-profile value dictionaries (`messages_per_tick`, `blocking_timeout_ms`, `max_retries`, `enable_dlq`, etc.), see Subscription Configuration → Profile Defaults.

`SubscriptionConfig` fields resolvable at every precedence level:
| Field | Type | Default | Applies to |
|---|---|---|---|
| `subscription_type` | `SubscriptionType` | `STREAM` | — |
| `messages_per_tick` | `int` | `10` | Both |
| `tick_interval` | `int` | `0` | Both |
| `blocking_timeout_ms` | `int` | `5000` | STREAM |
| `max_retries` | `int` | `3` | STREAM |
| `retry_delay_seconds` | `float` | `1.0` | STREAM |
| `enable_dlq` | `bool` | `true` | STREAM |
| `position_update_interval` | `int` | `10` | EVENT_STORE |
| `origin_stream` | `str \| None` | `None` | Both |
| `dlq_retention_hours` | `int \| None` | `None` | STREAM |
| `dlq_alert_threshold` | `int \| None` | `None` | STREAM |
See Subscription Configuration for the full precedence hierarchy.
## OpenTelemetry metrics

Every metric below is registered on `DomainMetrics` and emitted as a no-op when `opentelemetry-api` is not installed. For the exporter and propagation setup, see OpenTelemetry Integration.
### Per-subscription counters and histograms

Emitted directly by the engine.

| Metric | Type | Unit | Attributes |
|---|---|---|---|
| `protean.subscription.messages_processed` | Counter | `{message}` | `subscription`, `handler`, `stream`, `status` (ok/error) |
| `protean.subscription.retries` | Counter | `{retry}` | `subscription`, `handler`, `stream` |
| `protean.subscription.dlq_routed` | Counter | `{message}` | `subscription`, `handler`, `stream` |
| `protean.subscription.processing_duration` | Histogram | `s` | `subscription`, `handler`, `stream` |
### Engine gauges

Emitted directly by the engine.

| Metric | Type | Unit | Meaning |
|---|---|---|---|
| `protean.engine.up` | Observable gauge | `1` | 1 while running, 0 during shutdown |
| `protean.engine.uptime_seconds` | Observable gauge | `s` | Seconds since the engine started |
| `protean.engine.active_subscriptions` | Observable gauge | `{subscription}` | Current count of live subscriptions |
### DLQ maintenance counters

Emitted by `DLQMaintenanceTask`.

| Metric | Type | Unit | Attributes |
|---|---|---|---|
| `protean.dlq.trimmed` | Counter | `{message}` | `dlq_stream` |
| `protean.dlq.alerts` | Counter | `{alert}` | `dlq_stream` |
### Infrastructure gauges (Observatory `/metrics`)

Lazily registered on the first scrape of the Observatory's Prometheus endpoint. See Observability.

| Metric | Type | Attributes |
|---|---|---|
| `protean.db.pool_size` | Observable gauge | `provider_name`, `database_type` |
| `protean.db.pool_checked_out` | Observable gauge | `provider_name`, `database_type` |
| `protean.db.pool_overflow` | Observable gauge | `provider_name`, `database_type` |
| `protean.db.pool_checked_in` | Observable gauge | `provider_name`, `database_type` |
| `protean.broker.pool_active_connections` | Observable gauge | `broker_name` |
| `protean.subscription.consumer_lag` | Observable gauge | `domain`, `handler`, `stream`, `type` |
| `protean.subscription.pending_messages` | Observable gauge | `domain`, `handler`, `stream`, `type` |
| `protean.outbox.pending_count` | Observable gauge | `domain` |
`BaseProvider.pool_stats()` returns `{size, checked_out, overflow, checked_in}`. SQLAlchemy providers return live counts; memory and Elasticsearch providers return an empty dict.
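A small sketch of turning a `pool_stats()` dict into a saturation ratio for dashboards — the helper is illustrative; only the dict shape comes from the API above:

```python
def pool_utilization(stats: dict) -> float:
    """Fraction of current capacity (base pool plus active overflow) in use.

    `stats` follows the BaseProvider.pool_stats() shape:
    {size, checked_out, overflow, checked_in}. Empty dicts (memory and
    Elasticsearch providers) report zero utilization.
    """
    if not stats:
        return 0.0
    in_use = stats["checked_out"]
    # `overflow` counts temporary connections beyond the base pool;
    # clamp at zero since idle pools can report non-positive overflow.
    capacity = stats["size"] + max(stats["overflow"], 0)
    return in_use / capacity if capacity else 0.0
```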
## Shutdown sequence

`Engine.shutdown()` runs these steps on SIGINT, SIGTERM, or SIGHUP:

1. Stop the health HTTP server (probes start failing immediately).
2. Signal every subscription, broker subscription, outbox processor, and DLQ maintenance task to stop.
3. Wait up to 10 seconds for in-flight handler tasks to complete; cancel any that remain.
4. Call `Domain.close()` — closes the event store, brokers, caches, and providers in reverse initialisation order.
5. Remove signal handlers and stop the event loop.
`Domain.close()` is callable from application code for tests and tooling that create and tear down domains on demand.
## Optimistic locking

`ExpectedVersionError` is raised when two writers race for the same aggregate version. Atomicity guarantees per adapter:
| Adapter | Mechanism |
|---|---|
| SQLAlchemy repository | Version compared inside the same transaction as the update |
| Elasticsearch repository | Native `if_seq_no` + `if_primary_term` on index operations |
| Memory repository | `threading.Lock` serialises writes |
| Memory event store | `threading.Lock` guards `write()` |
| MessageDB event store | Stored-procedure API enforces expected version inside PostgreSQL |
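The memory-adapter mechanism can be sketched as a compare-and-set guarded by a lock — illustrative, not Protean's actual repository code; `VersionConflict` stands in for `ExpectedVersionError`:

```python
import threading


class VersionConflict(Exception):
    """Stand-in for ExpectedVersionError in this sketch."""


class InMemoryStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._versions: dict[str, int] = {}
        self._data: dict[str, object] = {}

    def write(self, aggregate_id: str, expected_version: int, state: object) -> int:
        # The lock makes the version compare and the update a single
        # atomic step, mirroring how the memory adapters serialise writes.
        with self._lock:
            current = self._versions.get(aggregate_id, -1)  # -1: no aggregate yet
            if current != expected_version:
                raise VersionConflict(f"expected {expected_version}, found {current}")
            self._versions[aggregate_id] = current + 1
            self._data[aggregate_id] = state
            return current + 1
```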
Command handlers auto-retry on `ExpectedVersionError`; see Error Handling.