Prevent and repair df/_duroxide inconsistency #258
Proposes making the duroxide start-enqueue part of the caller's backend transaction (Option 3) so df.start() is atomic: rollback leaves neither df.* rows nor a _duroxide orchestrator-queue orphan, and the worker only picks up an instance after its df.* rows commit. Grounded in the current start path (single orchestrator_queue INSERT via _duroxide.enqueue_orchestrator_work), the SECURITY DEFINER privilege model needed for the enqueue, the resulting removal of the load-function-graph 5s race, alternatives considered, and an Upgrade & Migration section.
Add a Dual-write inventory section that classifies every df.* writer vs duroxide-client call into two seams: backend session-side dual-writes (df.start, df.cancel) that Option 3 fixes, and worker-side best-effort mirror activities (update-instance-status, update-node-status) that it does not. Make explicit that df.cancel is a genuine dual-write (Phase 2) while df.signal writes only _duroxide and is not a cross-store divergence.
…on 4 decision Add a 'Design options considered' section near the top enumerating the four originally-explored options (single source of truth, our-SQL-in-their-tx, their-enqueue-in-our-tx, and tolerate-and-reconcile) with a Decision subsection, so the Option N references throughout resolve. Update the Overview to state we implement Option 3 (primary) + Option 4 (backstop), remove the now-redundant mid-document summary, and reframe the reconciler as in-scope Phase 3.
Add the trade-off matrix (atomic with df.*, atomic with caller tx, kills the 5s load race, fixes crash-time drift, effort) to the Design options considered section, with footnotes explaining Option 1's convergence with Option 3 and why Option 4 is kept as a crash-drift backstop.
Enqueue the StartOrchestration work item inside the caller's backend transaction instead of out-of-band on the duroxide client pool, so the df.nodes/df.instances INSERTs and the orchestrator-queue INSERT commit or roll back together. - src/lib.rs: add df._enqueue_orchestrator_start(text,text), a SECURITY DEFINER wrapper (SET search_path=pg_catalog) that performs the privileged INSERT via _duroxide.enqueue_orchestrator_work(); EXECUTE revoked from PUBLIC and granted through df.grant_usage(). - src/dsl.rs: df.start() builds the duroxide WorkItem::StartOrchestration, gates on worker readiness, and enqueues via SPI through the wrapper, raising (aborting the tx) on failure. - src/client.rs: drop the now-unused out-of-band start_durable_function; expose is_worker_ready() to the start path. Verified: committed start runs to completion; rolled-back start leaves no df.* rows AND no _duroxide orphan (the leak this fixes); E2E subset incl. join/race/signals/cancel/loops passes; non-superuser df_e2e_user path works. POC scope: fresh-install only (no upgrade script); _duroxide schema name hard-coded; wrapper hardening deferred per docs/spec-atomic-start.md.
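The atomicity argument in this commit can be illustrated with a small in-memory model. This is only a sketch, not pg_durable code: `Stores`, `Tx`, `start_out_of_band`, and `orphans` are hypothetical names, and a real transaction is of course handled by PostgreSQL rather than by a staging struct.

```rust
use std::collections::BTreeSet;

/// Toy model of the two stores that must stay in step.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Stores {
    /// Instance ids that have a df.instances row (control plane).
    pub df_instances: BTreeSet<String>,
    /// Instance ids that have a start item in the orchestrator queue (runtime).
    pub orchestrator_queue: BTreeSet<String>,
}

/// Writes staged by one transaction; nothing is visible until commit.
#[derive(Debug, Default)]
pub struct Tx {
    staged: Stores,
}

impl Tx {
    pub fn begin() -> Self {
        Tx::default()
    }

    /// Models the atomic start: both writes are staged in the same transaction.
    pub fn start(&mut self, instance_id: &str) {
        self.staged.df_instances.insert(instance_id.to_string());
        self.staged.orchestrator_queue.insert(instance_id.to_string());
    }

    /// Publishes every staged write at once.
    pub fn commit(self, stores: &mut Stores) {
        stores.df_instances.extend(self.staged.df_instances);
        stores.orchestrator_queue.extend(self.staged.orchestrator_queue);
    }

    /// Discards every staged write.
    pub fn rollback(self) {}
}

/// Models the old out-of-band start: the df row is staged in the caller's
/// transaction, but the queue write commits immediately on its own.
pub fn start_out_of_band(tx: &mut Tx, stores: &mut Stores, instance_id: &str) {
    tx.staged.df_instances.insert(instance_id.to_string());
    stores.orchestrator_queue.insert(instance_id.to_string());
}

/// Queue entries that have no matching df.instances row.
pub fn orphans(stores: &Stores) -> Vec<String> {
    stores
        .orchestrator_queue
        .difference(&stores.df_instances)
        .cloned()
        .collect()
}
```

In this model, rolling back a `Tx` that used `start` leaves `orphans` empty, while rolling back one that used `start_out_of_band` leaves the queue entry behind, which is the leak the commit describes.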
Add a 'Proof of concept (validated)' section documenting what was built (the SECURITY DEFINER df._enqueue_orchestrator_start wrapper + in-transaction SPI enqueue), the confirmed findings (privilege model is the only wrinkle; visible_at=now() suffices; reusing the duroxide WorkItem type), and the validation matrix (atomicity, 8/8 E2E subset, 5/5 JOIN, non-superuser path). Update Status, Phase 1, and the visible_at open question to reflect the POC; note the DROP EXTENSION CASCADE / _duroxide recovery gotcha.
…nciler Address the abstraction-breaking direct _duroxide call and add the GC backstop. - df._enqueue_orchestrator_start now resolves the duroxide schema dynamically via df.duroxide_schema() and EXECUTE format(%I.enqueue_orchestrator_work), instead of hard-coding _duroxide. - df.start() probes for the duroxide-pg SQL surface (enqueue_orchestrator_work in the resolved schema): present -> atomic in-transaction SPI enqueue; absent (non-pg provider / pre-wrapper schema) -> fall back to the out-of-band client path. start_durable_function restored for the fallback. - Add df.reconcile(grace) (Option 4): SECURITY DEFINER, admin-only. GCs orphaned ROOT duroxide instances (parent_instance_id IS NULL, no df row) via delete_instances_atomic and fails stuck df.instances; excludes sub-orchestrations. Schedulable as a durable cron loop. Validated: atomicity holds via dynamic schema; reconcile deletes a planted root orphan and a stuck instance while leaving sub-orchestrations + live instances intact; E2E subset 8/8; fmt clean, no new clippy warnings. Spec updated.
The background worker now keeps one reconciler instance running per cluster,
dogfooding pg_durable as a durable cron loop:
df.start(df.loop(df.seq('SELECT * FROM df.reconcile()',
df.wait_for_schedule(<cron>))), 'df_reconciler')
- worker::ensure_reconciler runs each epoch: idempotent (skips if a reconciler
is pending/running; self-heals if it died), submits as a dedicated
NON-superuser role df_reconciler (worker-created; granted df.grant_usage() +
EXECUTE on df.reconcile()). Non-superuser identity avoids the
enable_superuser_instances guard and bounds blast radius to 'trigger GC'.
- New GUC pg_durable.reconciler_cron (default '*/5 * * * *'; empty disables).
Validated: auto-starts as df_reconciler; exactly one instance (idempotent);
scheduled loop GCs a planted orphan on its next tick without self-GC'ing; E2E
subset 6/6 unaffected with it running. Two gotchas handled: pg_-prefixed role
names are reserved (-> df_reconciler), and a set-returning reconciler must be
called as SELECT * FROM df.reconcile() (bare call yields an unsupported RECORD
column in execute-sql). Spec updated.
Fixes from an SoT review of the atomic-start + reconciler work: - HIGH (security): df._enqueue_orchestrator_start no longer accepts an opaque work item. It builds the StartOrchestration server-side and authorizes only a brand-new, not-yet-started instance (pending df.instances row, no queue entry, no duroxide instance), so a df user can no longer forge Cancel/Signal/Activity work items against another user's instance. Wrapper is now (text,text,text). - HIGH (correctness): df.reconcile() gathers the FULL orphan subtree (recursive on parent_instance_id) before delete_instances_atomic, which refuses to delete a parent without its children; each pass is wrapped in EXCEPTION so a failure never aborts reconcile or kills the built-in loop. Count now reflects rows actually deleted. - HIGH (security): reconciler singleton liveness keys on the unforgeable submitted_by = df_reconciler role, not the user-writable label, so a user cannot suppress GC with a look-alike instance. - MED/HIGH: WAIT_SCHEDULE recomputes its delay from the deterministic clock (ctx.utc_now) each generation instead of replaying a fixed build-time offset, fixing a df.loop(df.wait_for_schedule) busy-loop (calculate_cron_wait_from). - MED: ensure_reconciler is re-asserted from the steady-state poll loop, so a reconciler that dies mid-epoch restarts within the poll interval. - MED: df.start() warns when it takes the non-atomic fallback path. - Docs: keep df_reconciler LOGIN (connect_as_user needs it); correct stale role name in comments; qualify the 5s-race claim to the atomic path; spec Review hardening section. Verified: forged enqueues denied; root+child orphan GC'd with no error; label spoof does not suppress GC; reconciler fires on cron (no spin) and self-heals in ~4s after cancel; E2E subset 6/6; fmt clean, no new clippy warnings.
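The full-subtree gather described in the correctness fix can be sketched outside SQL as a walk over parent links. This is an illustrative model only: `RuntimeInstance` and `orphan_subtrees` are hypothetical names, and the shipped code does this with a recursive query on parent_instance_id rather than in Rust.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

/// One runtime instance: its id and its parent (None for a root).
pub struct RuntimeInstance {
    pub id: String,
    pub parent: Option<String>,
}

/// Returns the ids to delete together: every orphaned root (no parent and no
/// df.instances row) plus all of its descendants, so that a parent is never
/// deleted without its children.
pub fn orphan_subtrees(
    instances: &[RuntimeInstance],
    df_rows: &HashSet<String>,
) -> BTreeSet<String> {
    // Index children by parent id.
    let mut children: HashMap<&str, Vec<&str>> = HashMap::new();
    for inst in instances {
        if let Some(parent) = &inst.parent {
            children
                .entry(parent.as_str())
                .or_default()
                .push(inst.id.as_str());
        }
    }

    // Seed the walk with orphaned roots only; sub-orchestrations of live
    // instances are never seeds, so they are left alone.
    let mut stack: Vec<&str> = instances
        .iter()
        .filter(|i| i.parent.is_none() && !df_rows.contains(&i.id))
        .map(|i| i.id.as_str())
        .collect();

    let mut doomed = BTreeSet::new();
    while let Some(id) = stack.pop() {
        if doomed.insert(id.to_string()) {
            if let Some(kids) = children.get(id) {
                stack.extend(kids.iter().copied());
            }
        }
    }
    doomed
}
```

Collecting only the seeds would reproduce the original defect: the delete of an orphan root that still has children would be refused.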
Route df.signal()/df.cancel() through SECURITY DEFINER wrappers (df._enqueue_orchestrator_signal, df._enqueue_orchestrator_cancel) called via SPI, so their _duroxide enqueue commits or rolls back with the caller's transaction, matching the atomic df.start() path. - Work items (ExternalRaised / CancelInstance) are built server-side, so a caller cannot choose the variant or target a foreign instance. - Ownership is authorized via pg_has_role(session_user, submitted_by::oid, 'MEMBER'): session_user is unforgeable under SECURITY DEFINER and membership honors SET ROLE. - Signal preserves the root + running-descendant fan-out via a recursive CTE over parent_instance_id. - Both retain the out-of-band fallback (with a WARNING) when the duroxide-pg SQL surface is absent. Update docs/spec-atomic-start.md to reflect signal/cancel as migrated.
The spec had drifted from the shipped code (Rust-side work-item construction, a duroxide-pg wrapper placement that never landed, stale df.__enqueue_start naming, POC/phasing framing) and triplicated the reconciler and signal/cancel descriptions. Rewrite end-to-end against reality: server-side work-item build in the df-schema SECURITY DEFINER wrappers, start authorized by instance state and cancel/signal by pg_has_role(session_user, ...), the actual worker reconciler DSL, and a single prevent/repair structure. Drop POC language and dead jargon. 676 -> 368 lines; no code changes.
24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered). Fails against the pre-change out-of-band enqueue. 25_enqueue_wrapper_authz: the SECURITY DEFINER enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, while the owner is allowed. Fails if the pg_has_role authorization check is removed. 26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that has running child sub-orchestrations (parallel join), gathering the full subtree, and leaves healthy instances untouched. Fails if reconcile collects only orphan roots (delete_instances_atomic refuses a parent without its children). Each test fails against a behavioral defect, verified by mutation testing, not merely the absence of the new functions.
…ement The in-transaction enqueue and df.reconcile() are the first code to call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, reads of orchestrator_queue/instances/executions), unlike the rest of pg_durable which uses duroxide's Rust provider/client API. Document why (only direct SQL can share the caller's transaction), the existing precedent (the _worker_ready table co-located in _duroxide; duroxide-pg migrations), and that non-PG providers fall back to the out-of-band path.
Resolve the df.grant_usage() conflict: main refactored grant_usage to gate ordinary df.* functions via schema USAGE (PUBLIC EXECUTE) and grant only sensitive functions explicitly, dropping the per-function func_sigs[] array. The three in-transaction enqueue wrappers (df._enqueue_orchestrator_start/_cancel/_signal) are revoked from PUBLIC, so they are now granted unconditionally in grant_usage() and revoked in revoke_usage(), preserving the prior behavior (every df user could call them) under the new model. This also drops the stale df.debug_connection() reference (removed from the SQL surface by #250).
…pper hardening Fixes from the PR self-review: - Complete the 0.2.3->0.2.4 upgrade script with the atomic enqueue wrappers, df.reconcile(), and matching grant_usage()/revoke_usage() grants/revokes. - Make the in-transaction enqueue probe require both surfaces: the duroxide-pg enqueue SQL function and the df._enqueue_orchestrator_* wrappers. This keeps the new .so B1-compatible with older df schemas that have not run ALTER EXTENSION UPDATE, falling back to the out-of-band client path instead of calling missing wrappers. - Log (server-side) rather than emit a client-visible WARNING for the fallback, so scripts that capture SELECT df.start(...) output are not contaminated. - Harden df._enqueue_orchestrator_start so it is not a generic privileged start-any-orchestration primitive: it only accepts the root graph executor, validates input.instance_id, and enqueues the hard-coded root orchestration. - Extend the authz E2E test to cover the start-wrapper hardening and make its cleanup robust across interrupted prior runs. Verified with scripts/test-upgrade.sh --verbose and targeted E2E tests 24/25/26.
Duroxide does not buffer external events until an orchestration has a pending subscription. A signal sent before the root runtime row exists would be accepted but skipped before the workflow can observe it, so df.signal now rejects that case with object_not_in_prerequisite_state instead of returning OK. - Add the root-materialization check to df._enqueue_orchestrator_signal in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. - Extend 24_atomic_rollback to assert same-transaction signal attempts are rejected before runtime materialization. - Align the spec/upgrade docs: wait_for_schedule in-flight instances may need restart after the history-shape change; early signals are rejected, not buffered.
The wait_for_schedule loop fix records a utc_now event before the timer, which changes the durable replay event sequence for in-flight scheduled-loop instances. Document the drain/restart requirement explicitly in the spec and upgrade notes.
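Recomputing the delay from the clock on each generation can be sketched as follows. This is a simplified model and not the shipped calculate_cron_wait_from: it assumes a `*/N * * * *` schedule where N divides 60, ignores leap seconds, and takes the recorded clock value as a plain Unix timestamp. `secs_until_next_tick` and `wake_times_recomputed` are hypothetical names.

```rust
/// Seconds from `now_unix_secs` to the next `*/N * * * *` boundary that is
/// strictly in the future. Assumes `every_minutes` divides 60, so boundaries
/// align with multiples of the period since the Unix epoch.
pub fn secs_until_next_tick(now_unix_secs: u64, every_minutes: u64) -> u64 {
    let period = every_minutes * 60;
    period - (now_unix_secs % period)
}

/// Simulates `iterations` loop generations. Each generation reads the current
/// (recorded) clock value and derives a fresh delay from it, instead of
/// reusing a delay that was computed once when the graph was built.
pub fn wake_times_recomputed(start: u64, every_minutes: u64, iterations: usize) -> Vec<u64> {
    let mut now = start;
    let mut wakes = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        now += secs_until_next_tick(now, every_minutes);
        wakes.push(now);
    }
    wakes
}
```

Because each delay depends on a clock value read during that generation, the value has to be part of the recorded history for replay to be deterministic, which is why the event sequence changes for in-flight instances.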
…ule caveat - During 0.2.3->0.2.4 upgrade, backfill EXECUTE on the new private atomic enqueue wrappers to roles that already had explicit USAGE on schema df, preserving grant option. Without this, existing df users could lose df.start()/df.cancel()/df.signal() after ALTER EXTENSION UPDATE because the new binary would choose the atomic path and then fail on wrapper EXECUTE. - Extend scripts/test-upgrade.sh with a B2 assertion that a pre-upgrade df user retains wrapper privileges after upgrade. - Broaden the wait_for_schedule upgrade caveat from looped schedules to any in-flight WAIT_SCHEDULE node, because the recorded replay event sequence now includes utc_now before the timer.
Use SET search_path = pg_catalog, pg_temp for the new SECURITY DEFINER wrappers and df.reconcile() in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. This satisfies the pgspot PS004 gate while preserving the fully qualified SQL references inside the functions.
Trim implementation details that are visible in the diff and keep the spec focused on the problem, chosen prevent/repair design, provider coupling, behavior changes, upgrade caveats, validation, and remaining risks.
-- the same transaction, so this state is reachable only for the instance the
-- caller (df.start) just inserted in the current transaction — never another
-- user's committed instance. This is what stops a caller from forging work
-- against a foreign instance.
Comment nit, and to check my understanding...
Given the permissions, a user can bypass df.start and populate df.instances and df.nodes tables, then call this function. The checks you are doing therefore limit the user to enqueueing any pending instance that they own, not necessarily one that's in the same transaction.
Yes, that understanding is correct. The old comment overstated the guarantee: the wrapper permits a same-owner pending/not-yet-started instance, not necessarily one inserted by df.start in the same transaction. I updated the comment in both fresh-install SQL and the upgrade script to say that explicitly, and to point at the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target.
The SECURITY DEFINER start wrapper allows same-owner pending instances that have not yet been enqueued; that state can be created directly by a df user, not only by df.start in the same transaction. Clarify the comment and point to the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target.
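The safety properties named in this thread can be written down as a single predicate. The sketch below is a model of the checks, not the wrapper's SQL: `authorize_start`, `StartRequest`, `InstanceState`, and `Deny` are hypothetical names, the value of `ROOT_ORCHESTRATION` is a placeholder because the real orchestration name is not given here, and ownership is simplified to string equality.

```rust
/// Reasons this sketch refuses a start enqueue.
#[derive(Debug, PartialEq)]
pub enum Deny {
    WrongOrchestration,
    InstanceIdMismatch,
    NotOwner,
    NotPending,
    AlreadyQueued,
    AlreadyStarted,
}

/// Placeholder for the fixed root orchestration name (assumption).
pub const ROOT_ORCHESTRATION: &str = "root_graph_executor";

/// What the caller asks the wrapper to enqueue.
pub struct StartRequest<'a> {
    pub caller: &'a str,
    pub orchestration: &'a str,
    pub instance_id: &'a str,
    pub input_instance_id: &'a str,
}

/// What the wrapper can observe about the target instance.
pub struct InstanceState<'a> {
    pub owner: &'a str,
    pub status: &'a str,
    pub has_queue_entry: bool,
    pub has_runtime_instance: bool,
}

/// Allows only a same-owner, pending, not-yet-enqueued, not-yet-started
/// instance, and only for the fixed root orchestration with a matching
/// instance id in the input.
pub fn authorize_start(req: &StartRequest, state: &InstanceState) -> Result<(), Deny> {
    if req.orchestration != ROOT_ORCHESTRATION {
        return Err(Deny::WrongOrchestration);
    }
    if req.input_instance_id != req.instance_id {
        return Err(Deny::InstanceIdMismatch);
    }
    if state.owner != req.caller {
        return Err(Deny::NotOwner);
    }
    if state.status != "pending" {
        return Err(Deny::NotPending);
    }
    if state.has_queue_entry {
        return Err(Deny::AlreadyQueued);
    }
    if state.has_runtime_instance {
        return Err(Deny::AlreadyStarted);
    }
    Ok(())
}
```

Note that nothing in the predicate refers to the current transaction, which matches the reviewer's reading: the guarantee comes from ownership and instance state, not from where the pending row was inserted.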
Problem
df.start(), df.cancel(), and df.signal() write the control plane (df.*) on the caller's transaction but enqueue the data-plane work in _duroxide out-of-band, on a separate connection that commits independently. The two are not atomic, so the stores can diverge: a rolled-back df.start() leaves an orphaned orchestration in _duroxide with no df.* rows, and the worker retries it until a 5s timeout.

Addressed in two parts: prevent the inconsistency where possible, repair what still slips through.
Part 1 — Prevent: one transaction for both stores
df.start(), df.cancel(), and df.signal() now enqueue their runtime work over SPI inside the caller's transaction, so the control-plane writes and the runtime enqueue commit or roll back together.

- Each call goes through a SECURITY DEFINER wrapper (df._enqueue_orchestrator_start/_cancel/_signal) in the df schema. Each wrapper builds the work item server-side (the caller cannot choose the type or a foreign target) and runs no caller-supplied SQL. The start wrapper is not a generic privileged entrypoint: it only accepts the root graph-executor orchestration and validates input.instance_id.
- df.start permits the enqueue only for a brand-new, not-yet-started instance owned by the caller (pending df.instances row, no queue entry, no runtime instance). df.cancel/df.signal target an already-committed instance, so they check pg_has_role(session_user, <owner>, 'MEMBER'); session_user is unforgeable under SECURITY DEFINER, and membership honors SET ROLE.
- df.signal preserves the fan-out to the root instance and every running sub-orchestration via a recursive CTE. Signals sent before the root runtime row exists are rejected instead of returning OK and being skipped before the workflow can observe them.
- The atomic path requires both the duroxide-pg enqueue SQL function and the df._enqueue_orchestrator_* wrappers. If either is absent, the call uses the previous out-of-band path (non-atomic) and logs server-side. The runtime schema is resolved at runtime via df.duroxide_schema().

Part 2 — Repair: periodic reconciliation
For divergence that prevention cannot cover (crashes mid-execution, the non-atomic fallback):
- df.reconcile() (admin-only, SECURITY DEFINER) deletes orphaned runtime instance subtrees that have no df.instances row, and marks stuck df.instances rows failed.
- The background worker keeps one reconciler instance running on the schedule in pg_durable.reconciler_cron (default */5 * * * *; empty disables), submitted by a dedicated non-superuser role df_reconciler, and re-asserts it from the poll loop so a failed one restarts.
- df.wait_for_schedule recomputes its delay from the orchestration's recorded clock each iteration instead of replaying a build-time offset (fixes a busy-loop in df.loop(df.wait_for_schedule(...))).

Design notes: docs/spec-atomic-start.md.

Tests
New E2E regression tests. Each was confirmed to fail on the specific defect via mutation testing (the new functions present but the guarded behavior broken), not merely because a function is absent:
- 24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered), and signaling before runtime materialization is rejected. Fails against the out-of-band/early-drop behavior.
- 25_enqueue_wrapper_authz: the enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, reject attempts to use the start wrapper for an internal orchestration, and allow the owner. Fails if the pg_has_role check or start-wrapper validation is removed.
- 26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that has running child sub-orchestrations (full subtree) and leaves healthy instances untouched. Fails if reconcile collects only roots.

Existing E2E tests also pass (07_signals, 23_signal_in_race fan-out, 22_cancel_status_consistency, plus core/loops/join/race/heartbeat); reconciler auto-start, single-instance liveness, and cron scheduling were verified manually. scripts/test-upgrade.sh --verbose passes all 37 checks (including existing df-user wrapper-grant backfill); cargo fmt clean, no new clippy warnings.

Notes
- No duroxide/duroxide-pg or Cargo.* changes.
- df.start/df.cancel/df.signal now take part in the caller's transaction, so e.g. BEGIN; df.start(...); ROLLBACK; no longer starts the workflow on the atomic path.
- The df.wait_for_schedule fix changes the recorded replay event sequence (utc_now is recorded before the timer). Drain or restart any in-flight instance already waiting in a WAIT_SCHEDULE node during upgrade; otherwise it may fail replay as nondeterministic.
- The in-transaction enqueue and df.reconcile() call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, and reads of _duroxide.orchestrator_queue/instances/executions), so they depend on the duroxide-pg (PostgreSQL) provider. Everywhere else pg_durable talks to the runtime through duroxide's Rust provider/client API; this is the first code to reach into duroxide-pg's own tables and functions. It is the only way to share the caller's transaction (the Rust client uses a separate pool), and pg_durable already assumes a PG provider in a known schema (it co-locates its _worker_ready table in _duroxide and relies on duroxide-pg's migrations). A non-PG provider (which doesn't really apply to a PostgreSQL extension storing its state in the same database) falls back to the out-of-band path; this fallback is logged server-side, not emitted as a client-visible SQL warning.

Remaining work