Prevent and repair df/_duroxide inconsistency by tjgreen42 · Pull Request #258 · microsoft/pg_durable

tjgreen42 · 2026-06-19T00:22:35Z

Problem

df.start(), df.cancel(), and df.signal() write the control plane (df.*) on the caller's transaction but enqueue the data-plane work in _duroxide out-of-band, on a separate connection that commits independently. The two are not atomic, so the stores can diverge: a rolled-back df.start() leaves an orphaned orchestration in _duroxide with no df.* rows, and the worker retries it until a 5s timeout.

Addressed in two parts: prevent the inconsistency where possible, repair what still slips through.

Part 1 — Prevent: one transaction for both stores

df.start(), df.cancel(), and df.signal() now enqueue their runtime work over SPI inside the caller's transaction, so the control-plane writes and the runtime enqueue commit or roll back together.

The runtime queue is writable by its owner only, so each enqueue goes through a SECURITY DEFINER wrapper (df._enqueue_orchestrator_start / _cancel / _signal) in the df schema. Each wrapper builds the work item server-side (the caller cannot choose the type or a foreign target) and runs no caller-supplied SQL. The start wrapper is not a generic privileged entrypoint: it only accepts the root graph-executor orchestration and validates input.instance_id.
Authorization: df.start permits the enqueue only for a brand-new, not-yet-started instance (reachable only for the row the caller just inserted). df.cancel / df.signal target an already-committed instance, so they check pg_has_role(session_user, <owner>, 'MEMBER') — session_user is unforgeable under SECURITY DEFINER, and membership honors SET ROLE.
df.signal preserves the fan-out to the root instance and every running sub-orchestration via a recursive CTE. Signals before the root runtime row exists are rejected instead of returning OK and being skipped before the workflow can observe them.
Each path probes for both required surfaces — the duroxide-pg SQL enqueue function and the df._enqueue_orchestrator_* wrappers. If either is absent it uses the previous out-of-band path (non-atomic) and logs server-side. The runtime schema is resolved at runtime via df.duroxide_schema().

Part 2 — Repair: periodic reconciliation

For divergence prevention cannot cover (crashes mid-execution, the non-atomic fallback):

df.reconcile() (admin-only, SECURITY DEFINER) deletes orphaned runtime instance subtrees that have no df.instances row, and marks stuck df.instances rows failed.
The background worker keeps one reconciler instance per cluster on pg_durable.reconciler_cron (default */5 * * * *; empty disables), submitted by a dedicated non-superuser role df_reconciler, and re-asserts it from the poll loop so a failed one restarts.
Supporting fix: df.wait_for_schedule recomputes its delay from the orchestration's recorded clock each iteration instead of replaying a build-time offset (fixes a busy-loop in df.loop(df.wait_for_schedule(...))).

Design notes: docs/spec-atomic-start.md.

Tests

New E2E regression tests. Each was confirmed to fail on the specific defect via mutation testing (the new functions present but the guarded behavior broken), not merely because a function is absent:

24_atomic_rollback — a rolled-back df.start / df.cancel / df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered), and signaling before runtime materialization is rejected. Fails against the out-of-band/early-drop behavior.
25_enqueue_wrapper_authz — the enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, reject attempts to use the start wrapper for an internal orchestration, and allow the owner. Fails if the pg_has_role check or start-wrapper validation is removed.
26_reconcile_orphan_gc — df.reconcile() collects a planted root orphan that has running child sub-orchestrations (full subtree) and leaves healthy instances untouched. Fails if reconcile collects only roots.

Existing E2E also pass (07_signals, 23_signal_in_race fan-out, 22_cancel_status_consistency, plus core/loops/join/race/heartbeat); reconciler auto-start, single-instance liveness, and cron scheduling verified manually. scripts/test-upgrade.sh --verbose passes all 37 checks (including existing df-user wrapper-grant backfill); cargo fmt clean, no new clippy warnings.

Notes

pg_durable-only; no duroxide / duroxide-pg or Cargo.* changes.
Behavior change: df.start / df.cancel / df.signal now take part in the caller's transaction, so e.g. BEGIN; df.start(...); ROLLBACK; no longer starts the workflow on the atomic path.
Upgrade caveat: the df.wait_for_schedule fix changes the recorded replay event sequence (utc_now is recorded before the timer). Drain or restart any in-flight instance already waiting in a WAIT_SCHEDULE node during upgrade; otherwise it may fail replay as nondeterministic.
Provider coupling: the in-transaction enqueue and df.reconcile() call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, and reads of _duroxide.orchestrator_queue / instances / executions), so they depend on the duroxide-pg (PostgreSQL) provider. Everywhere else pg_durable talks to the runtime through duroxide's Rust provider/client API; this is the first code to reach into duroxide-pg's own tables and functions. It is the only way to share the caller's transaction (the Rust client uses a separate pool), and pg_durable already assumes a PG provider in a known schema (it co-locates its _worker_ready table in _duroxide and relies on duroxide-pg's migrations). A non-PG provider — which doesn't really apply to a PostgreSQL extension storing its state in the same database — falls back to the out-of-band path; this fallback is logged server-side, not emitted as a client-visible SQL warning.

Remaining work

Optionally move the enqueue entrypoint into duroxide-pg, owned by the schema owner.
Tune reconciler grace/cadence; reconsider role provisioning.

Proposes making the duroxide start-enqueue part of the caller's backend transaction (Option 3) so df.start() is atomic: rollback leaves neither df.* rows nor a _duroxide orchestrator-queue orphan, and the worker only picks up an instance after its df.* rows commit. Grounded in the current start path (single orchestrator_queue INSERT via _duroxide.enqueue_orchestrator_work), the SECURITY DEFINER privilege model needed for the enqueue, the resulting removal of the load-function-graph 5s race, alternatives considered, and an Upgrade & Migration section.

Add a Dual-write inventory section that classifies every df.* writer vs duroxide-client call into two seams: backend session-side dual-writes (df.start, df.cancel) that Option 3 fixes, and worker-side best-effort mirror activities (update-instance-status, update-node-status) that it does not. Make explicit that df.cancel is a genuine dual-write (Phase 2) while df.signal writes only _duroxide and is not a cross-store divergence.

…on 4 decision Add a 'Design options considered' section near the top enumerating the four originally-explored options (single source of truth, our-SQL-in-their-tx, their-enqueue-in-our-tx, and tolerate-and-reconcile) with a Decision subsection, so the Option N references throughout resolve. Update the Overview to state we implement Option 3 (primary) + Option 4 (backstop), remove the now -redundant mid-document summary, and reframe the reconciler as in-scope Phase 3.

Add the trade-off matrix (atomic with df.*, atomic with caller tx, kills the 5s load race, fixes crash-time drift, effort) to the Design options considered section, with footnotes explaining Option 1's convergence with Option 3 and why Option 4 is kept as a crash-drift backstop.

Enqueue the StartOrchestration work item inside the caller's backend transaction instead of out-of-band on the duroxide client pool, so the df.nodes/df.instances INSERTs and the orchestrator-queue INSERT commit or roll back together. - src/lib.rs: add df._enqueue_orchestrator_start(text,text), a SECURITY DEFINER wrapper (SET search_path=pg_catalog) that performs the privileged INSERT via _duroxide.enqueue_orchestrator_work(); EXECUTE revoked from PUBLIC and granted through df.grant_usage(). - src/dsl.rs: df.start() builds the duroxide WorkItem::StartOrchestration, gates on worker readiness, and enqueues via SPI through the wrapper, raising (aborting the tx) on failure. - src/client.rs: drop the now-unused out-of-band start_durable_function; expose is_worker_ready() to the start path. Verified: committed start runs to completion; rolled-back start leaves no df.* rows AND no _duroxide orphan (the leak this fixes); E2E subset incl. join/race/signals/cancel/loops passes; non-superuser df_e2e_user path works. POC scope: fresh-install only (no upgrade script); _duroxide schema name hard-coded; wrapper hardening deferred per docs/spec-atomic-start.md.

Add a 'Proof of concept (validated)' section documenting what was built (the SECURITY DEFINER df._enqueue_orchestrator_start wrapper + in-transaction SPI enqueue), the confirmed findings (privilege model is the only wrinkle; visible_at=now() suffices; reusing the duroxide WorkItem type), and the validation matrix (atomicity, 8/8 E2E subset, 5/5 JOIN, non-superuser path). Update Status, Phase 1, and the visible_at open question to reflect the POC; note the DROP EXTENSION CASCADE / _duroxide recovery gotcha.

…nciler Address the abstraction-breaking direct _duroxide call and add the GC backstop. - df._enqueue_orchestrator_start now resolves the duroxide schema dynamically via df.duroxide_schema() and EXECUTE format(%I.enqueue_orchestrator_work), instead of hard-coding _duroxide. - df.start() probes for the duroxide-pg SQL surface (enqueue_orchestrator_work in the resolved schema): present -> atomic in-transaction SPI enqueue; absent (non-pg provider / pre-wrapper schema) -> fall back to the out-of-band client path. start_durable_function restored for the fallback. - Add df.reconcile(grace) (Option 4): SECURITY DEFINER, admin-only. GCs orphaned ROOT duroxide instances (parent_instance_id IS NULL, no df row) via delete_instances_atomic and fails stuck df.instances; excludes sub-orchestrations. Schedulable as a durable cron loop. Validated: atomicity holds via dynamic schema; reconcile deletes a planted root orphan and a stuck instance while leaving sub-orchestrations + live instances intact; E2E subset 8/8; fmt clean, no new clippy warnings. Spec updated.

The background worker now keeps one reconciler instance running per cluster, dogfooding pg_durable as a durable cron loop: df.start(df.loop(df.seq('SELECT * FROM df.reconcile()', df.wait_for_schedule(<cron>))), 'df_reconciler') - worker::ensure_reconciler runs each epoch: idempotent (skips if a reconciler is pending/running; self-heals if it died), submits as a dedicated NON-superuser role df_reconciler (worker-created; granted df.grant_usage() + EXECUTE on df.reconcile()). Non-superuser identity avoids the enable_superuser_instances guard and bounds blast radius to 'trigger GC'. - New GUC pg_durable.reconciler_cron (default '*/5 * * * *'; empty disables). Validated: auto-starts as df_reconciler; exactly one instance (idempotent); scheduled loop GCs a planted orphan on its next tick without self-GC'ing; E2E subset 6/6 unaffected with it running. Two gotchas handled: pg_-prefixed role names are reserved (-> df_reconciler), and a set-returning reconciler must be called as SELECT * FROM df.reconcile() (bare call yields an unsupported RECORD column in execute-sql). Spec updated.

Fixes from an SoT review of the atomic-start + reconciler work: - HIGH (security): df._enqueue_orchestrator_start no longer accepts an opaque work item. It builds the StartOrchestration server-side and authorizes only a brand-new, not-yet-started instance (pending df.instances row, no queue entry, no duroxide instance), so a df user can no longer forge Cancel/Signal/Activity work items against another user's instance. Wrapper is now (text,text,text). - HIGH (correctness): df.reconcile() gathers the FULL orphan subtree (recursive on parent_instance_id) before delete_instances_atomic, which refuses to delete a parent without its children; each pass is wrapped in EXCEPTION so a failure never aborts reconcile or kills the built-in loop. Count now reflects rows actually deleted. - HIGH (security): reconciler singleton liveness keys on the unforgeable submitted_by = df_reconciler role, not the user-writable label, so a user cannot suppress GC with a look-alike instance. - MED/HIGH: WAIT_SCHEDULE recomputes its delay from the deterministic clock (ctx.utc_now) each generation instead of replaying a fixed build-time offset, fixing a df.loop(df.wait_for_schedule) busy-loop (calculate_cron_wait_from). - MED: ensure_reconciler is re-asserted from the steady-state poll loop, so a reconciler that dies mid-epoch restarts within the poll interval. - MED: df.start() warns when it takes the non-atomic fallback path. - Docs: keep df_reconciler LOGIN (connect_as_user needs it); correct stale role name in comments; qualify the 5s-race claim to the atomic path; spec Review hardening section. Verified: forged enqueues denied; root+child orphan GC'd with no error; label spoof does not suppress GC; reconciler fires on cron (no spin) and self-heals in ~4s after cancel; E2E subset 6/6; fmt clean, no new clippy warnings.

Route df.signal()/df.cancel() through SECURITY DEFINER wrappers (df._enqueue_orchestrator_signal, df._enqueue_orchestrator_cancel) called via SPI, so their _duroxide enqueue commits or rolls back with the caller's transaction, matching the atomic df.start() path. - Work items (ExternalRaised / CancelInstance) are built server-side, so a caller cannot choose the variant or target a foreign instance. - Ownership is authorized via pg_has_role(session_user, submitted_by::oid, 'MEMBER'): session_user is unforgeable under SECURITY DEFINER and membership honors SET ROLE. - Signal preserves the root + running-descendant fan-out via a recursive CTE over parent_instance_id. - Both retain the out-of-band fallback (with a WARNING) when the duroxide-pg SQL surface is absent. Update docs/spec-atomic-start.md to reflect signal/cancel as migrated.

The spec had drifted from the shipped code (Rust-side work-item construction, a duroxide-pg wrapper placement that never landed, stale df.__enqueue_start naming, POC/phasing framing) and triplicated the reconciler and signal/cancel descriptions. Rewrite end-to-end against reality: server-side work-item build in the df-schema SECURITY DEFINER wrappers, start authorized by instance state and cancel/signal by pg_has_role(session_user, ...), the actual worker reconciler DSL, and a single prevent/repair structure. Drop POC language and dead jargon. 676 -> 368 lines; no code changes.

24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered). Fails against the pre-change out-of-band enqueue. 25_enqueue_wrapper_authz: the SECURITY DEFINER enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, while the owner is allowed. Fails if the pg_has_role authorization check is removed. 26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that has running child sub-orchestrations (parallel join), gathering the full subtree, and leaves healthy instances untouched. Fails if reconcile collects only orphan roots (delete_instances_atomic refuses a parent without its children). Each test fails against a behavioral defect, verified by mutation testing, not merely the absence of the new functions.

…ement The in-transaction enqueue and df.reconcile() are the first code to call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, reads of orchestrator_queue/instances/executions), unlike the rest of pg_durable which uses duroxide's Rust provider/client API. Document why (only direct SQL can share the caller's transaction), the existing precedent (the _worker_ready table co-located in _duroxide; duroxide-pg migrations), and that non-PG providers fall back to the out-of-band path.

Resolve the df.grant_usage() conflict: main refactored grant_usage to gate ordinary df.* functions via schema USAGE (PUBLIC EXECUTE) and grant only sensitive functions explicitly, dropping the per-function func_sigs[] array. The three in-transaction enqueue wrappers (df._enqueue_orchestrator_start/ _cancel/_signal) are revoked from PUBLIC, so they are now granted unconditionally in grant_usage() and revoked in revoke_usage(), preserving the prior behavior (every df user could call them) under the new model. This also drops the stale df.debug_connection() reference (removed from the SQL surface by #250).

…pper hardening Fixes from the PR self-review: - Complete the 0.2.3->0.2.4 upgrade script with the atomic enqueue wrappers, df.reconcile(), and matching grant_usage()/revoke_usage() grants/revokes. - Make the in-transaction enqueue probe require both surfaces: the duroxide-pg enqueue SQL function and the df._enqueue_orchestrator_* wrappers. This keeps the new .so B1-compatible with older df schemas that have not run ALTER EXTENSION UPDATE, falling back to the out-of-band client path instead of calling missing wrappers. - Log (server-side) rather than emit a client-visible WARNING for the fallback, so scripts that capture SELECT df.start(...) output are not contaminated. - Harden df._enqueue_orchestrator_start so it is not a generic privileged start-any-orchestration primitive: it only accepts the root graph executor, validates input.instance_id, and enqueues the hard-coded root orchestration. - Extend the authz E2E test to cover the start-wrapper hardening and make its cleanup robust across interrupted prior runs. Verified with scripts/test-upgrade.sh --verbose and targeted E2E tests 24/25/26.

Duroxide does not buffer external events until an orchestration has a pending subscription. A signal sent before the root runtime row exists would be accepted but skipped before the workflow can observe it, so df.signal now rejects that case with object_not_in_prerequisite_state instead of returning OK. - Add the root-materialization check to df._enqueue_orchestrator_signal in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. - Extend 24_atomic_rollback to assert same-transaction signal attempts are rejected before runtime materialization. - Align the spec/upgrade docs: wait_for_schedule in-flight instances may need restart after the history-shape change; early signals are rejected, not buffered.

The wait_for_schedule loop fix records an utc_now event before the timer, which changes the durable replay event sequence for in-flight scheduled-loop instances. Document the drain/restart requirement explicitly in the spec and upgrade notes.

…ule caveat - During 0.2.3->0.2.4 upgrade, backfill EXECUTE on the new private atomic enqueue wrappers to roles that already had explicit USAGE on schema df, preserving grant option. Without this, existing df users could lose df.start()/df.cancel()/df.signal() after ALTER EXTENSION UPDATE because the new binary would choose the atomic path and then fail on wrapper EXECUTE. - Extend scripts/test-upgrade.sh with a B2 assertion that a pre-upgrade df user retains wrapper privileges after upgrade. - Broaden the wait_for_schedule upgrade caveat from looped schedules to any in-flight WAIT_SCHEDULE node, because the recorded replay event sequence now includes utc_now before the timer.

Use SET search_path = pg_catalog, pg_temp for the new SECURITY DEFINER wrappers and df.reconcile() in both fresh-install SQL and the 0.2.3->0.2.4 upgrade script. This satisfies the pgspot PS004 gate while preserving the fully qualified SQL references inside the functions.

Trim implementation details that are visible in the diff and keep the spec focused on the problem, chosen prevent/repair design, provider coupling, behavior changes, upgrade caveats, validation, and remaining risks.

pinodeca · 2026-06-19T22:37:12Z

+    -- the same transaction, so this state is reachable only for the instance the
+    -- caller (df.start) just inserted in the current transaction — never another
+    -- user's committed instance. This is what stops a caller from forging work
+    -- against a foreign instance.


Comment nit, and to check my understanding...

Given the permissions, a user can bypass df.start and populate df.instances and df.nodes tables, then call this function. The checks you are doing therefore limit the user to enqueueing any pending instance that they own, not necessarily one that's in the same transaction.

Yes, that understanding is correct. The old comment overstated the guarantee: the wrapper permits a same-owner pending/not-yet-started instance, not necessarily one inserted by df.start in the same transaction. I updated the comment in both fresh-install SQL and the upgrade script to say that explicitly, and to point at the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target.

The SECURITY DEFINER start wrapper allows same-owner pending instances that have not yet been enqueued; that state can be created directly by a df user, not only by df.start in the same transaction. Clarify the comment and point to the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target.

tjgreen42 added 8 commits June 18, 2026 21:29

tjgreen42 changed the title ~~Atomic df.start(): spec + validated POC (Option 3)~~ Make df.start() atomic; add reconciler GC and built-in cron loop Jun 19, 2026

tjgreen42 changed the title ~~Make df.start() atomic; add reconciler GC and built-in cron loop~~ Atomic df.start(); reconciler for df/_duroxide divergence Jun 19, 2026

tjgreen42 changed the title ~~Atomic df.start(); reconciler for df/_duroxide divergence~~ Keep df and _duroxide consistent: atomic df.start() (prevent) + reconciler (repair) Jun 19, 2026

tjgreen42 changed the title ~~Keep df and _duroxide consistent: atomic df.start() (prevent) + reconciler (repair)~~ Prevent and repair df/_duroxide inconsistency Jun 19, 2026

tjgreen42 added 4 commits June 19, 2026 04:36

tjgreen42 marked this pull request as ready for review June 19, 2026 17:02

tjgreen42 marked this pull request as draft June 19, 2026 17:07

tjgreen42 marked this pull request as ready for review June 19, 2026 17:19

tjgreen42 marked this pull request as draft June 19, 2026 17:24

tjgreen42 added 6 commits June 19, 2026 17:56

docs: shorten atomic consistency spec to high-level design

a38eb9f

Trim implementation details that are visible in the diff and keep the spec focused on the problem, chosen prevent/repair design, provider coupling, behavior changes, upgrade caveats, validation, and remaining risks.

pinodeca reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent and repair df/_duroxide inconsistency#258

Prevent and repair df/_duroxide inconsistency#258
tjgreen42 wants to merge 21 commits into
mainfrom
tjgreen42/atomic-start

tjgreen42 commented Jun 19, 2026 •

edited

Loading

Uh oh!

pinodeca Jun 19, 2026

Uh oh!

tjgreen42 Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tjgreen42 commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Part 1 — Prevent: one transaction for both stores

Part 2 — Repair: periodic reconciliation

Tests

Notes

Remaining work

Uh oh!

pinodeca Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

tjgreen42 Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tjgreen42 commented Jun 19, 2026 •

edited

Loading