Server Hardening Reference

Every option, default, and metric shipped by the Server Hardening epic — pool tuning, health probes, DLQ policy, subscription profiles, OpenTelemetry metrics, shutdown, and optimistic locking. For the operational walkthrough that ties these together, see the Server Hardening guide. For the --reload development flag, see Run the Server.

Connection pools

SQLAlchemy providers

Pool defaults for postgresql and mssql providers (SQLite uses SingletonThreadPool and ignores these keys).

Key	Default	Purpose
`pool_size`	`5`	Base connections kept open per worker
`max_overflow`	`10`	Additional temporary connections beyond `pool_size`
`pool_recycle`	unset	Recycle connections older than N seconds

[databases.default]
provider = "postgresql"
database_uri = "${DATABASE_URL}"
pool_size = 10
max_overflow = 20
pool_recycle = 1800
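
When sizing these keys against the database's own connection limit, the worst case is every worker holding its base pool plus its full overflow. A minimal sketch of that arithmetic follows; the helper name and the worker count are illustrative assumptions, not part of Protean.

```python
def max_db_connections(pool_size: int, max_overflow: int, workers: int) -> int:
    """Upper bound on connections opened against one database.

    Each worker keeps its own pool and can hold up to
    pool_size + max_overflow connections at the same time.
    """
    return workers * (pool_size + max_overflow)


# Values from the configuration above, with an assumed 4 workers.
print(max_db_connections(pool_size=10, max_overflow=20, workers=4))  # 120
```

Keeping this figure below the server-side connection limit leaves headroom for migrations and ad-hoc sessions.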

Redis broker and cache

Both adapters forward pool parameters to redis.ConnectionPool.

Key	Purpose
`max_connections`	Cap on connections in the pool
`socket_timeout`	Read/write timeout, seconds
`socket_connect_timeout`	Connection timeout, seconds
`retry_on_timeout`	Retry reads that time out
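
As a rough illustration of what forwarding these keys could look like, the sketch below filters a settings mapping down to the four keys in the table and hands them to redis-py. The helper names and the sample settings are hypothetical; the adapters' actual wiring may differ.

```python
from typing import Any, Mapping

# The four pool keys listed in the table above.
POOL_KEYS = (
    "max_connections",
    "socket_timeout",
    "socket_connect_timeout",
    "retry_on_timeout",
)


def pool_kwargs(config: Mapping[str, Any]) -> dict:
    """Keep only the pool-related keys that are actually set."""
    return {key: config[key] for key in POOL_KEYS if key in config}


def build_pool(url: str, config: Mapping[str, Any]):
    """Build a redis-py pool; needs the third-party `redis` package."""
    import redis  # imported lazily so pool_kwargs() works without redis

    return redis.ConnectionPool.from_url(url, **pool_kwargs(config))


# Hypothetical settings; unrelated keys are dropped.
settings = {"provider": "redis", "max_connections": 50, "socket_timeout": 5}
print(pool_kwargs(settings))  # {'max_connections': 50, 'socket_timeout': 5}
```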

MessageDB event store

max_connections is forwarded directly through conn_info.

[event_store]
provider = "message_db"
database_uri = "${MESSAGE_DB_URL}"
max_connections = 20

LOW_POOL_SIZE warning

Domain.check() emits a LOW_POOL_SIZE warning for any SQLAlchemy database with pool_size < 5 unless PROTEAN_ENV is development or testing. Memory providers are skipped. The warning is advisory — it does not block startup.

Sample output from protean check when pool_size = 2 on a PostgreSQL provider:

$ protean check --domain=my_domain

  Domain: my_domain  WARN
  1 warning(s)

  Warnings (1):
    ! LOW_POOL_SIZE: Database 'default' has pool_size=2 (production
      default is 5). Consider raising it for production workloads.

protean check exits with code 2 on warnings, so CI pipelines that enforce --strict will fail. Raise pool_size or set PROTEAN_ENV to development/testing to silence the warning.
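
The rule described in this section can be approximated as below. This is a sketch, not the code behind Domain.check(): the function name, the set of providers treated as pooled, and the fallback to 5 when pool_size is absent are assumptions made for illustration.

```python
import os
from typing import Optional

POOLED_PROVIDERS = {"postgresql", "mssql"}  # assumption: providers that honour pool_size
EXEMPT_ENVS = {"development", "testing"}
DEFAULT_POOL_SIZE = 5


def low_pool_size_warnings(databases: dict, env: Optional[str] = None) -> list:
    """Return one advisory message per database whose pool_size is below 5."""
    if env is None:
        env = os.environ.get("PROTEAN_ENV", "")
    if env in EXEMPT_ENVS:
        return []  # the warning is suppressed in development and testing

    warnings = []
    for name, cfg in databases.items():
        if cfg.get("provider") not in POOLED_PROVIDERS:
            continue  # memory and other non-pooled providers are skipped
        size = cfg.get("pool_size", DEFAULT_POOL_SIZE)
        if size < DEFAULT_POOL_SIZE:
            warnings.append(
                f"LOW_POOL_SIZE: Database '{name}' has pool_size={size} "
                f"(production default is {DEFAULT_POOL_SIZE})."
            )
    return warnings


dbs = {"default": {"provider": "postgresql", "pool_size": 2}}
print(low_pool_size_warnings(dbs, env="production"))
print(low_pool_size_warnings(dbs, env="testing"))  # []
```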

Health checks

`[server.health]`

Key	Default	Purpose
`enabled`	`true`	Start the health HTTP server
`host`	`"0.0.0.0"`	Bind address
`port`	`8080`	Listen port

Engine endpoints

Path	Probe	Response
`GET /healthz`	Liveness	`200` with `{"status": "ok", "checks": {"event_loop": "responsive"}}`
`GET /livez`	Liveness (alias for `/healthz`)	Same as `/healthz`
`GET /readyz`	Readiness	`200` when all checks pass, `503` otherwise

Sample responses — liveness while the engine is running:

$ curl -i http://localhost:8080/livez
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 56

{"status": "ok", "checks": {"event_loop": "responsive"}}

Readiness when every dependency is reachable:

$ curl -i http://localhost:8080/readyz
HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "ok",
  "checks": {
    "shutting_down": false,
    "providers": {"default": "ok"},
    "brokers": {"default": "ok"},
    "event_store": "ok",
    "caches": {"default": "ok"},
    "subscriptions": 12
  }
}

Readiness when one component is unreachable: the status flips to "degraded" and the HTTP code to 503, which Kubernetes treats as "not ready":

$ curl -i http://localhost:8080/readyz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "status": "degraded",
  "checks": {
    "shutting_down": false,
    "providers": {"default": "ok"},
    "brokers": {"default": "unavailable"},
    "event_store": "ok",
    "caches": {"default": "ok"},
    "subscriptions": 12
  }
}

Readiness after SIGTERM arrives — the engine reports "unavailable" immediately so the load balancer drains traffic before in-flight handlers are affected:

$ curl -i http://localhost:8080/readyz
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{"status": "unavailable", "checks": {"shutting_down": true}}

/livez keeps returning 200 during the drain window; it only fails when the event loop itself is blocked. This asymmetry is deliberate: liveness triggers a restart, readiness pulls the pod out of rotation.
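
For dashboards or smoke tests that need to know which dependency failed, the /readyz bodies shown above can be reduced to a list of unhealthy checks. The helper below is a sketch that assumes the payload shapes in the samples; its name is illustrative.

```python
import json


def failing_checks(readyz_body: str) -> list:
    """Return dotted names of readiness checks that are not healthy."""
    checks = json.loads(readyz_body).get("checks", {})
    failing = []
    for name, value in checks.items():
        if name == "shutting_down":
            if value:  # true once the engine has begun shutting down
                failing.append(name)
        elif isinstance(value, dict):
            # Grouped checks such as providers, brokers, and caches.
            failing += [f"{name}.{key}" for key, state in value.items() if state != "ok"]
        elif isinstance(value, str) and value != "ok":
            failing.append(name)
        # Numeric entries such as "subscriptions" are counts, not health states.
    return failing


degraded = (
    '{"status": "degraded", "checks": {"shutting_down": false, '
    '"providers": {"default": "ok"}, "brokers": {"default": "unavailable"}, '
    '"event_store": "ok", "caches": {"default": "ok"}, "subscriptions": 12}}'
)
print(failing_checks(degraded))  # ['brokers.default']
```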

FastAPI router factory

from protean.integrations.fastapi.health import create_health_router

create_health_router(
    domain,                # Domain instance
    *,
    prefix: str = "",      # URL prefix for all health routes
    tags: list[str] | None = None,  # OpenAPI tags
)

Mounts GET /healthz, GET /livez, and GET /readyz. The /readyz check runs the same provider, broker, event-store, and cache inspection as the engine server. The /healthz and /livez bodies differ — the FastAPI router returns {"status": "ok", "checks": {"application": "running"}}, since there is no event-loop task inside the request cycle to probe.

Dead-letter queue policy

`[server.dlq]`

Key	Default	Purpose
`enabled`	`false`	Start the DLQ maintenance task
`retention_hours`	`168` (7 days)	Trim DLQ entries older than this
`alert_threshold`	`100`	Log a warning when DLQ depth ≥ this
`alert_callback`	unset	Dotted path to a callable, invoked on alert
`check_interval_seconds`	`60`	Seconds between maintenance cycles

The alert callback is invoked with keyword arguments:

def on_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> None:
    ...
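
A minimal sketch of a callback body that logs the alert. The logger name and the sample stream name are illustrative assumptions; a real callback might page an on-call channel instead.

```python
import logging

logger = logging.getLogger("dlq.alerts")  # logger name is illustrative


def on_dlq_alert(dlq_stream: str, depth: int, threshold: int) -> None:
    """Log a DLQ alert using the keyword arguments described above."""
    logger.warning(
        "DLQ %s holds %d messages (threshold %d, over by %d)",
        dlq_stream,
        depth,
        threshold,
        depth - threshold,
    )


# Simulate an invocation with keyword arguments and a hypothetical stream name.
on_dlq_alert(dlq_stream="orders:dlq", depth=120, threshold=100)
```

With the function in a module such as my_app/alerts.py (a hypothetical location), alert_callback would be set to the dotted path "my_app.alerts.on_dlq_alert".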

Per-subscription overrides

Fields on SubscriptionConfig that override the global defaults for a single subscription:

Field	Type	Default	Purpose
`dlq_retention_hours`	`int | None`	inherit global	Per-handler retention window
`dlq_alert_threshold`	`int | None`	inherit global	Per-handler alert threshold

The maintenance task only runs when a broker that advertises the DEAD_LETTER_QUEUE capability is configured. Redis Streams implements time-based trimming via XTRIM MINID; other brokers fall back to a no-op dlq_trim().
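
The sketch below illustrates two ideas from this section: a per-subscription value of None inheriting the global setting, and a retention window translated into the minimum stream ID that XTRIM MINID expects. It assumes DLQ entries carry auto-generated Redis stream IDs, which start with a millisecond timestamp; the function names are hypothetical.

```python
from typing import Optional

MS_PER_HOUR = 3_600_000


def effective(override: Optional[int], global_value: int) -> int:
    """A per-subscription value of None inherits the global setting."""
    return global_value if override is None else override


def trim_min_id(now_ms: int, retention_hours: int) -> str:
    """Smallest stream ID to keep; entries with lower IDs are older than the window."""
    return f"{now_ms - retention_hours * MS_PER_HOUR}-0"


retention = effective(None, 168)  # no override, so the global 168 hours applies
print(retention)  # 168
print(trim_min_id(1_700_000_000_000, retention))  # 1699395200000-0
print(effective(24, 168))  # 24
```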

Subscription profiles

Five profiles — PRODUCTION, FAST, BATCH, DEBUG, PROJECTION — resolve at engine startup to concrete SubscriptionConfig values. For the full per-profile value dictionaries (messages_per_tick, blocking_timeout_ms, max_retries, enable_dlq, etc.), see Subscription Configuration → Profile Defaults.

SubscriptionConfig fields resolvable at every precedence level:

Field	Type	Default	Applies to
`subscription_type`	`SubscriptionType`	`STREAM`	—
`messages_per_tick`	`int`	`10`	Both
`tick_interval`	`int`	`0`	Both
`blocking_timeout_ms`	`int`	`5000`	`STREAM`
`max_retries`	`int`	`3`	`STREAM`
`retry_delay_seconds`	`float`	`1.0`	`STREAM`
`enable_dlq`	`bool`	`true`	`STREAM`
`position_update_interval`	`int`	`10`	`EVENT_STORE`
`origin_stream`	`str | None`	`None`	Both
`dlq_retention_hours`	`int | None`	`None`	`STREAM`
`dlq_alert_threshold`	`int | None`	`None`	`STREAM`

See Subscription Configuration for the full precedence hierarchy.

OpenTelemetry metrics

Every metric below is registered on DomainMetrics and falls back to a no-op when opentelemetry-api is not installed. For the exporter and propagation setup, see OpenTelemetry Integration.

Per-subscription counters and histograms

Emitted directly by the engine.

Metric	Type	Unit	Attributes
`protean.subscription.messages_processed`	Counter	`{message}`	`subscription`, `handler`, `stream`, `status` (`ok`/`error`)
`protean.subscription.retries`	Counter	`{retry}`	`subscription`, `handler`, `stream`
`protean.subscription.dlq_routed`	Counter	`{message}`	`subscription`, `handler`, `stream`
`protean.subscription.processing_duration`	Histogram	`s`	`subscription`, `handler`, `stream`
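
To show how a counter with these attributes behaves when the OpenTelemetry API is missing, here is a sketch of the general no-op fallback pattern. It is not how DomainMetrics is implemented; the meter name, the helper, and the sample attribute values are assumptions.

```python
try:
    from opentelemetry import metrics

    _meter = metrics.get_meter("example.subscription")  # illustrative meter name
    _processed = _meter.create_counter(
        "protean.subscription.messages_processed", unit="{message}"
    )
except ImportError:

    class _NoOpCounter:
        """Stand-in used when opentelemetry-api is not installed."""

        def add(self, amount, attributes=None):
            pass

    _processed = _NoOpCounter()


def record_processed(subscription: str, handler: str, stream: str, ok: bool) -> None:
    """Increment the counter with the attribute set listed in the table."""
    _processed.add(
        1,
        {
            "subscription": subscription,
            "handler": handler,
            "stream": stream,
            "status": "ok" if ok else "error",
        },
    )


# Safe to call whether or not OpenTelemetry is installed.
record_processed("orders-sub", "OrderHandler", "orders", ok=True)
```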

Engine gauges

Emitted directly by the engine.

Metric	Type	Unit	Meaning
`protean.engine.up`	Observable gauge	`1`	`1` while running, `0` during shutdown
`protean.engine.uptime_seconds`	Observable gauge	`s`	Seconds since the engine started
`protean.engine.active_subscriptions`	Observable gauge	`{subscription}`	Current count of live subscriptions

DLQ maintenance counters

Emitted by DLQMaintenanceTask.

Metric	Type	Unit	Attributes
`protean.dlq.trimmed`	Counter	`{message}`	`dlq_stream`
`protean.dlq.alerts`	Counter	`{alert}`	`dlq_stream`

Infrastructure gauges (Observatory `/metrics`)

Lazily registered on the first scrape of the Observatory's Prometheus endpoint. See Observability.

Metric	Type	Attributes
`protean.db.pool_size`	Observable gauge	`provider_name`, `database_type`
`protean.db.pool_checked_out`	Observable gauge	`provider_name`, `database_type`
`protean.db.pool_overflow`	Observable gauge	`provider_name`, `database_type`
`protean.db.pool_checked_in`	Observable gauge	`provider_name`, `database_type`
`protean.broker.pool_active_connections`	Observable gauge	`broker_name`
`protean.subscription.consumer_lag`	Observable gauge	`domain`, `handler`, `stream`, `type`
`protean.subscription.pending_messages`	Observable gauge	`domain`, `handler`, `stream`, `type`
`protean.outbox.pending_count`	Observable gauge	`domain`

BaseProvider.pool_stats() returns {size, checked_out, overflow, checked_in}. SQLAlchemy providers return live counts; memory and Elasticsearch providers return an empty dict.
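
A small sketch of consuming the dictionary described above, for example to alert on pool pressure. The helper is hypothetical and only assumes the four keys named here; it treats an empty dict as "no pool to report".

```python
from typing import Optional


def pool_utilization(stats: dict) -> Optional[float]:
    """Fraction of the base pool in use, or None when no stats are reported."""
    size = stats.get("size")
    if not size:
        return None  # memory and Elasticsearch providers return an empty dict
    return stats.get("checked_out", 0) / size


print(pool_utilization({"size": 10, "checked_out": 4, "overflow": 0, "checked_in": 6}))  # 0.4
print(pool_utilization({}))  # None
```

A value above 1.0 would indicate that overflow connections are in use.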

Shutdown sequence

Engine.shutdown() runs these steps on SIGINT, SIGTERM, or SIGHUP:

1. Stop the health HTTP server (probes start failing immediately).
2. Signal every subscription, broker subscription, outbox processor, and DLQ maintenance task to stop.
3. Wait up to 10 seconds for in-flight handler tasks to complete; cancel any that remain.
4. Call Domain.close(), which closes the event store, brokers, caches, and providers in reverse initialisation order.
5. Remove signal handlers and stop the event loop.
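
The bounded wait in the sequence above (wait up to 10 seconds, then cancel what remains) is a common asyncio pattern. The sketch below shows that pattern in isolation; it is not Engine.shutdown() itself, and the function names are illustrative.

```python
import asyncio


async def drain(tasks: set, timeout: float = 10.0) -> tuple:
    """Wait up to `timeout` seconds for tasks, then cancel the stragglers.

    Returns (number completed, number cancelled).
    """
    if not tasks:
        return 0, 0  # asyncio.wait() rejects an empty set
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind; return_exceptions absorbs CancelledError.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def main() -> tuple:
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes within the window
    slow = asyncio.create_task(asyncio.sleep(60))    # still running at the deadline
    return await drain({fast, slow}, timeout=0.2)


print(asyncio.run(main()))  # (1, 1)
```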

Domain.close() is callable from application code for tests and tooling that create and tear down domains on demand.

Optimistic locking

ExpectedVersionError is raised when two writers race for the same aggregate version. Atomicity guarantees per adapter:

Adapter	Mechanism
SQLAlchemy repository	Version compared inside the same transaction as the update
Elasticsearch repository	Native `if_seq_no` + `if_primary_term` on index operations
Memory repository	`threading.Lock` serialises writes
Memory event store	`threading.Lock` guards `write()`
MessageDB event store	Stored-procedure API enforces expected version inside PostgreSQL

Command handlers auto-retry on ExpectedVersionError; see Error Handling.
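
To make the race and the retry concrete, the sketch below models a lock-guarded versioned store and a reload-and-retry loop. Everything here is a stand-in written for illustration: the exception class is a local substitute for the framework's ExpectedVersionError, and the store and retry helper are hypothetical, not Protean's repository or command-handler code.

```python
import threading


class ExpectedVersionError(Exception):
    """Local stand-in for the framework exception of the same name."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        with self._lock:  # version check and update happen atomically
            current, _ = self._data.get(key, (0, None))
            if current != expected_version:
                raise ExpectedVersionError(
                    f"{key}: expected {expected_version}, found {current}"
                )
            self._data[key] = (current + 1, value)


def update_with_retry(store, key, fn, attempts=3):
    """Reload and reapply `fn` when another writer wins the race."""
    for _ in range(attempts):
        version, value = store.read(key)
        try:
            store.write(key, fn(value), expected_version=version)
            return version + 1
        except ExpectedVersionError:
            continue  # stale read; fetch the latest version and try again
    raise ExpectedVersionError(f"{key}: gave up after {attempts} attempts")


store = VersionedStore()
print(update_with_retry(store, "account-1", lambda v: (v or 0) + 10))  # 1
print(store.read("account-1"))  # (1, 10)
```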