Skip to content

Prevent and repair df/_duroxide inconsistency#258

Draft
tjgreen42 wants to merge 21 commits into
mainfrom
tjgreen42/atomic-start
Draft

Prevent and repair df/_duroxide inconsistency#258
tjgreen42 wants to merge 21 commits into
mainfrom
tjgreen42/atomic-start

Conversation

@tjgreen42

@tjgreen42 tjgreen42 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Problem

df.start(), df.cancel(), and df.signal() write the control plane (df.*) on the caller's transaction but enqueue the data-plane work in _duroxide out-of-band, on a separate connection that commits independently. The two are not atomic, so the stores can diverge: a rolled-back df.start() leaves an orphaned orchestration in _duroxide with no df.* rows, and the worker retries it until a 5s timeout.

Addressed in two parts: prevent the inconsistency where possible, repair what still slips through.

Part 1 — Prevent: one transaction for both stores

df.start(), df.cancel(), and df.signal() now enqueue their runtime work over SPI inside the caller's transaction, so the control-plane writes and the runtime enqueue commit or roll back together.

  • The runtime queue is writable by its owner only, so each enqueue goes through a SECURITY DEFINER wrapper (df._enqueue_orchestrator_start / _cancel / _signal) in the df schema. Each wrapper builds the work item server-side (the caller cannot choose the type or a foreign target) and runs no caller-supplied SQL. The start wrapper is not a generic privileged entrypoint: it only accepts the root graph-executor orchestration and validates input.instance_id.
  • Authorization: df.start permits the enqueue only for a brand-new, not-yet-started instance (reachable only for the row the caller just inserted). df.cancel / df.signal target an already-committed instance, so they check pg_has_role(session_user, <owner>, 'MEMBER')session_user is unforgeable under SECURITY DEFINER, and membership honors SET ROLE.
  • df.signal preserves the fan-out to the root instance and every running sub-orchestration via a recursive CTE. Signals before the root runtime row exists are rejected instead of returning OK and being skipped before the workflow can observe them.
  • Each path probes for both required surfaces — the duroxide-pg SQL enqueue function and the df._enqueue_orchestrator_* wrappers. If either is absent it uses the previous out-of-band path (non-atomic) and logs server-side. The runtime schema is resolved at runtime via df.duroxide_schema().

Part 2 — Repair: periodic reconciliation

For divergence prevention cannot cover (crashes mid-execution, the non-atomic fallback):

  • df.reconcile() (admin-only, SECURITY DEFINER) deletes orphaned runtime instance subtrees that have no df.instances row, and marks stuck df.instances rows failed.
  • The background worker keeps one reconciler instance per cluster on pg_durable.reconciler_cron (default */5 * * * *; empty disables), submitted by a dedicated non-superuser role df_reconciler, and re-asserts it from the poll loop so a failed one restarts.
  • Supporting fix: df.wait_for_schedule recomputes its delay from the orchestration's recorded clock each iteration instead of replaying a build-time offset (fixes a busy-loop in df.loop(df.wait_for_schedule(...))).

Design notes: docs/spec-atomic-start.md.

Tests

New E2E regression tests. Each was confirmed to fail on the specific defect via mutation testing (the new functions present but the guarded behavior broken), not merely because a function is absent:

  • 24_atomic_rollback — a rolled-back df.start / df.cancel / df.signal leaves no _duroxide effect (no orphan; the instance is not cancelled; the signal is not delivered), and signaling before runtime materialization is rejected. Fails against the out-of-band/early-drop behavior.
  • 25_enqueue_wrapper_authz — the enqueue wrappers deny a non-owner forging a cancel/signal against a foreign instance, reject attempts to use the start wrapper for an internal orchestration, and allow the owner. Fails if the pg_has_role check or start-wrapper validation is removed.
  • 26_reconcile_orphan_gcdf.reconcile() collects a planted root orphan that has running child sub-orchestrations (full subtree) and leaves healthy instances untouched. Fails if reconcile collects only roots.

Existing E2E also pass (07_signals, 23_signal_in_race fan-out, 22_cancel_status_consistency, plus core/loops/join/race/heartbeat); reconciler auto-start, single-instance liveness, and cron scheduling verified manually. scripts/test-upgrade.sh --verbose passes all 37 checks (including existing df-user wrapper-grant backfill); cargo fmt clean, no new clippy warnings.

Notes

  • pg_durable-only; no duroxide / duroxide-pg or Cargo.* changes.
  • Behavior change: df.start / df.cancel / df.signal now take part in the caller's transaction, so e.g. BEGIN; df.start(...); ROLLBACK; no longer starts the workflow on the atomic path.
  • Upgrade caveat: the df.wait_for_schedule fix changes the recorded replay event sequence (utc_now is recorded before the timer). Drain or restart any in-flight instance already waiting in a WAIT_SCHEDULE node during upgrade; otherwise it may fail replay as nondeterministic.
  • Provider coupling: the in-transaction enqueue and df.reconcile() call duroxide-pg's SQL surface directly (enqueue_orchestrator_work, delete_instances_atomic, and reads of _duroxide.orchestrator_queue / instances / executions), so they depend on the duroxide-pg (PostgreSQL) provider. Everywhere else pg_durable talks to the runtime through duroxide's Rust provider/client API; this is the first code to reach into duroxide-pg's own tables and functions. It is the only way to share the caller's transaction (the Rust client uses a separate pool), and pg_durable already assumes a PG provider in a known schema (it co-locates its _worker_ready table in _duroxide and relies on duroxide-pg's migrations). A non-PG provider — which doesn't really apply to a PostgreSQL extension storing its state in the same database — falls back to the out-of-band path; this fallback is logged server-side, not emitted as a client-visible SQL warning.

Remaining work

  • Optionally move the enqueue entrypoint into duroxide-pg, owned by the schema owner.
  • Tune reconciler grace/cadence; reconsider role provisioning.

Proposes making the duroxide start-enqueue part of the caller's backend
transaction (Option 3) so df.start() is atomic: rollback leaves neither
df.* rows nor a _duroxide orchestrator-queue orphan, and the worker only
picks up an instance after its df.* rows commit.

Grounded in the current start path (single orchestrator_queue INSERT via
_duroxide.enqueue_orchestrator_work), the SECURITY DEFINER privilege model
needed for the enqueue, the resulting removal of the load-function-graph
5s race, alternatives considered, and an Upgrade & Migration section.
Add a Dual-write inventory section that classifies every df.* writer vs
duroxide-client call into two seams: backend session-side dual-writes
(df.start, df.cancel) that Option 3 fixes, and worker-side best-effort
mirror activities (update-instance-status, update-node-status) that it does
not. Make explicit that df.cancel is a genuine dual-write (Phase 2) while
df.signal writes only _duroxide and is not a cross-store divergence.
…on 4 decision

Add a 'Design options considered' section near the top enumerating the four
originally-explored options (single source of truth, our-SQL-in-their-tx,
their-enqueue-in-our-tx, and tolerate-and-reconcile) with a Decision
subsection, so the Option N references throughout resolve. Update the Overview
to state we implement Option 3 (primary) + Option 4 (backstop), remove the now
-redundant mid-document summary, and reframe the reconciler as in-scope Phase 3.
Add the trade-off matrix (atomic with df.*, atomic with caller tx, kills the
5s load race, fixes crash-time drift, effort) to the Design options considered
section, with footnotes explaining Option 1's convergence with Option 3 and why
Option 4 is kept as a crash-drift backstop.
Enqueue the StartOrchestration work item inside the caller's backend
transaction instead of out-of-band on the duroxide client pool, so the
df.nodes/df.instances INSERTs and the orchestrator-queue INSERT commit or
roll back together.

- src/lib.rs: add df._enqueue_orchestrator_start(text,text), a SECURITY
  DEFINER wrapper (SET search_path=pg_catalog) that performs the privileged
  INSERT via _duroxide.enqueue_orchestrator_work(); EXECUTE revoked from
  PUBLIC and granted through df.grant_usage().
- src/dsl.rs: df.start() builds the duroxide WorkItem::StartOrchestration,
  gates on worker readiness, and enqueues via SPI through the wrapper,
  raising (aborting the tx) on failure.
- src/client.rs: drop the now-unused out-of-band start_durable_function;
  expose is_worker_ready() to the start path.

Verified: committed start runs to completion; rolled-back start leaves no
df.* rows AND no _duroxide orphan (the leak this fixes); E2E subset incl.
join/race/signals/cancel/loops passes; non-superuser df_e2e_user path works.

POC scope: fresh-install only (no upgrade script); _duroxide schema name
hard-coded; wrapper hardening deferred per docs/spec-atomic-start.md.
Add a 'Proof of concept (validated)' section documenting what was built
(the SECURITY DEFINER df._enqueue_orchestrator_start wrapper + in-transaction
SPI enqueue), the confirmed findings (privilege model is the only wrinkle;
visible_at=now() suffices; reusing the duroxide WorkItem type), and the
validation matrix (atomicity, 8/8 E2E subset, 5/5 JOIN, non-superuser path).
Update Status, Phase 1, and the visible_at open question to reflect the POC;
note the DROP EXTENSION CASCADE / _duroxide recovery gotcha.
…nciler

Address the abstraction-breaking direct _duroxide call and add the GC backstop.

- df._enqueue_orchestrator_start now resolves the duroxide schema dynamically
  via df.duroxide_schema() and EXECUTE format(%I.enqueue_orchestrator_work),
  instead of hard-coding _duroxide.
- df.start() probes for the duroxide-pg SQL surface (enqueue_orchestrator_work
  in the resolved schema): present -> atomic in-transaction SPI enqueue; absent
  (non-pg provider / pre-wrapper schema) -> fall back to the out-of-band client
  path. start_durable_function restored for the fallback.
- Add df.reconcile(grace) (Option 4): SECURITY DEFINER, admin-only. GCs orphaned
  ROOT duroxide instances (parent_instance_id IS NULL, no df row) via
  delete_instances_atomic and fails stuck df.instances; excludes
  sub-orchestrations. Schedulable as a durable cron loop.

Validated: atomicity holds via dynamic schema; reconcile deletes a planted root
orphan and a stuck instance while leaving sub-orchestrations + live instances
intact; E2E subset 8/8; fmt clean, no new clippy warnings. Spec updated.
The background worker now keeps one reconciler instance running per cluster,
dogfooding pg_durable as a durable cron loop:
  df.start(df.loop(df.seq('SELECT * FROM df.reconcile()',
                          df.wait_for_schedule(<cron>))), 'df_reconciler')

- worker::ensure_reconciler runs each epoch: idempotent (skips if a reconciler
  is pending/running; self-heals if it died), submits as a dedicated
  NON-superuser role df_reconciler (worker-created; granted df.grant_usage() +
  EXECUTE on df.reconcile()). Non-superuser identity avoids the
  enable_superuser_instances guard and bounds blast radius to 'trigger GC'.
- New GUC pg_durable.reconciler_cron (default '*/5 * * * *'; empty disables).

Validated: auto-starts as df_reconciler; exactly one instance (idempotent);
scheduled loop GCs a planted orphan on its next tick without self-GC'ing; E2E
subset 6/6 unaffected with it running. Two gotchas handled: pg_-prefixed role
names are reserved (-> df_reconciler), and a set-returning reconciler must be
called as SELECT * FROM df.reconcile() (bare call yields an unsupported RECORD
column in execute-sql). Spec updated.
@tjgreen42 tjgreen42 changed the title Atomic df.start(): spec + validated POC (Option 3) Make df.start() atomic; add reconciler GC and built-in cron loop Jun 19, 2026
Fixes from an SoT review of the atomic-start + reconciler work:

- HIGH (security): df._enqueue_orchestrator_start no longer accepts an opaque
  work item. It builds the StartOrchestration server-side and authorizes only a
  brand-new, not-yet-started instance (pending df.instances row, no queue entry,
  no duroxide instance), so a df user can no longer forge Cancel/Signal/Activity
  work items against another user's instance. Wrapper is now (text,text,text).
- HIGH (correctness): df.reconcile() gathers the FULL orphan subtree (recursive
  on parent_instance_id) before delete_instances_atomic, which refuses to delete
  a parent without its children; each pass is wrapped in EXCEPTION so a failure
  never aborts reconcile or kills the built-in loop. Count now reflects rows
  actually deleted.
- HIGH (security): reconciler singleton liveness keys on the unforgeable
  submitted_by = df_reconciler role, not the user-writable label, so a user
  cannot suppress GC with a look-alike instance.
- MED/HIGH: WAIT_SCHEDULE recomputes its delay from the deterministic clock
  (ctx.utc_now) each generation instead of replaying a fixed build-time offset,
  fixing a df.loop(df.wait_for_schedule) busy-loop (calculate_cron_wait_from).
- MED: ensure_reconciler is re-asserted from the steady-state poll loop, so a
  reconciler that dies mid-epoch restarts within the poll interval.
- MED: df.start() warns when it takes the non-atomic fallback path.
- Docs: keep df_reconciler LOGIN (connect_as_user needs it); correct stale role
  name in comments; qualify the 5s-race claim to the atomic path; spec Review
  hardening section.

Verified: forged enqueues denied; root+child orphan GC'd with no error; label
spoof does not suppress GC; reconciler fires on cron (no spin) and self-heals in
~4s after cancel; E2E subset 6/6; fmt clean, no new clippy warnings.
@tjgreen42 tjgreen42 changed the title Make df.start() atomic; add reconciler GC and built-in cron loop Atomic df.start(); reconciler for df/_duroxide divergence Jun 19, 2026
@tjgreen42 tjgreen42 changed the title Atomic df.start(); reconciler for df/_duroxide divergence Keep df and _duroxide consistent: atomic df.start() (prevent) + reconciler (repair) Jun 19, 2026
@tjgreen42 tjgreen42 changed the title Keep df and _duroxide consistent: atomic df.start() (prevent) + reconciler (repair) Prevent and repair df/_duroxide inconsistency Jun 19, 2026
Route df.signal()/df.cancel() through SECURITY DEFINER wrappers
(df._enqueue_orchestrator_signal, df._enqueue_orchestrator_cancel)
called via SPI, so their _duroxide enqueue commits or rolls back with
the caller's transaction, matching the atomic df.start() path.

- Work items (ExternalRaised / CancelInstance) are built server-side,
  so a caller cannot choose the variant or target a foreign instance.
- Ownership is authorized via pg_has_role(session_user,
  submitted_by::oid, 'MEMBER'): session_user is unforgeable under
  SECURITY DEFINER and membership honors SET ROLE.
- Signal preserves the root + running-descendant fan-out via a
  recursive CTE over parent_instance_id.
- Both retain the out-of-band fallback (with a WARNING) when the
  duroxide-pg SQL surface is absent.

Update docs/spec-atomic-start.md to reflect signal/cancel as migrated.
The spec had drifted from the shipped code (Rust-side work-item
construction, a duroxide-pg wrapper placement that never landed, stale
df.__enqueue_start naming, POC/phasing framing) and triplicated the
reconciler and signal/cancel descriptions.

Rewrite end-to-end against reality: server-side work-item build in the
df-schema SECURITY DEFINER wrappers, start authorized by instance state
and cancel/signal by pg_has_role(session_user, ...), the actual
worker reconciler DSL, and a single prevent/repair structure. Drop POC
language and dead jargon. 676 -> 368 lines; no code changes.
24_atomic_rollback: a rolled-back df.start/df.cancel/df.signal leaves no
  _duroxide effect (no orphan; the instance is not cancelled; the signal is
  not delivered). Fails against the pre-change out-of-band enqueue.
25_enqueue_wrapper_authz: the SECURITY DEFINER enqueue wrappers deny a
  non-owner forging a cancel/signal against a foreign instance, while the
  owner is allowed. Fails if the pg_has_role authorization check is removed.
26_reconcile_orphan_gc: df.reconcile() collects a planted root orphan that
  has running child sub-orchestrations (parallel join), gathering the full
  subtree, and leaves healthy instances untouched. Fails if reconcile
  collects only orphan roots (delete_instances_atomic refuses a parent
  without its children).

Each test fails against a behavioral defect, verified by mutation testing,
not merely the absence of the new functions.
…ement

The in-transaction enqueue and df.reconcile() are the first code to call
duroxide-pg's SQL surface directly (enqueue_orchestrator_work,
delete_instances_atomic, reads of orchestrator_queue/instances/executions),
unlike the rest of pg_durable which uses duroxide's Rust provider/client API.
Document why (only direct SQL can share the caller's transaction), the existing
precedent (the _worker_ready table co-located in _duroxide; duroxide-pg
migrations), and that non-PG providers fall back to the out-of-band path.
@tjgreen42 tjgreen42 marked this pull request as ready for review June 19, 2026 17:02
@tjgreen42 tjgreen42 marked this pull request as draft June 19, 2026 17:07
Resolve the df.grant_usage() conflict: main refactored grant_usage to gate
ordinary df.* functions via schema USAGE (PUBLIC EXECUTE) and grant only
sensitive functions explicitly, dropping the per-function func_sigs[] array.
The three in-transaction enqueue wrappers (df._enqueue_orchestrator_start/
_cancel/_signal) are revoked from PUBLIC, so they are now granted
unconditionally in grant_usage() and revoked in revoke_usage(), preserving the
prior behavior (every df user could call them) under the new model.

This also drops the stale df.debug_connection() reference (removed from the SQL
surface by #250).
@tjgreen42 tjgreen42 marked this pull request as ready for review June 19, 2026 17:19
@tjgreen42 tjgreen42 marked this pull request as draft June 19, 2026 17:24
…pper hardening

Fixes from the PR self-review:

- Complete the 0.2.3->0.2.4 upgrade script with the atomic enqueue wrappers,
  df.reconcile(), and matching grant_usage()/revoke_usage() grants/revokes.
- Make the in-transaction enqueue probe require both surfaces: the duroxide-pg
  enqueue SQL function and the df._enqueue_orchestrator_* wrappers. This keeps
  the new .so B1-compatible with older df schemas that have not run ALTER
  EXTENSION UPDATE, falling back to the out-of-band client path instead of
  calling missing wrappers.
- Log (server-side) rather than emit a client-visible WARNING for the fallback,
  so scripts that capture SELECT df.start(...) output are not contaminated.
- Harden df._enqueue_orchestrator_start so it is not a generic privileged
  start-any-orchestration primitive: it only accepts the root graph executor,
  validates input.instance_id, and enqueues the hard-coded root orchestration.
- Extend the authz E2E test to cover the start-wrapper hardening and make its
  cleanup robust across interrupted prior runs.

Verified with scripts/test-upgrade.sh --verbose and targeted E2E tests 24/25/26.
Duroxide does not buffer external events until an orchestration has a pending
subscription. A signal sent before the root runtime row exists would be accepted
but skipped before the workflow can observe it, so df.signal now rejects that
case with object_not_in_prerequisite_state instead of returning OK.

- Add the root-materialization check to df._enqueue_orchestrator_signal in both
  fresh-install SQL and the 0.2.3->0.2.4 upgrade script.
- Extend 24_atomic_rollback to assert same-transaction signal attempts are
  rejected before runtime materialization.
- Align the spec/upgrade docs: wait_for_schedule in-flight instances may need
  restart after the history-shape change; early signals are rejected, not
  buffered.
The wait_for_schedule loop fix records an utc_now event before the timer, which
changes the durable replay event sequence for in-flight scheduled-loop instances.
Document the drain/restart requirement explicitly in the spec and upgrade notes.
…ule caveat

- During 0.2.3->0.2.4 upgrade, backfill EXECUTE on the new private atomic
  enqueue wrappers to roles that already had explicit USAGE on schema df,
  preserving grant option. Without this, existing df users could lose
  df.start()/df.cancel()/df.signal() after ALTER EXTENSION UPDATE because the
  new binary would choose the atomic path and then fail on wrapper EXECUTE.
- Extend scripts/test-upgrade.sh with a B2 assertion that a pre-upgrade df user
  retains wrapper privileges after upgrade.
- Broaden the wait_for_schedule upgrade caveat from looped schedules to any
  in-flight WAIT_SCHEDULE node, because the recorded replay event sequence now
  includes utc_now before the timer.
Use SET search_path = pg_catalog, pg_temp for the new SECURITY DEFINER wrappers
and df.reconcile() in both fresh-install SQL and the 0.2.3->0.2.4 upgrade
script. This satisfies the pgspot PS004 gate while preserving the fully qualified
SQL references inside the functions.
Trim implementation details that are visible in the diff and keep the spec focused
on the problem, chosen prevent/repair design, provider coupling, behavior changes,
upgrade caveats, validation, and remaining risks.
Comment thread sql/pg_durable--0.2.3--0.2.4.sql Outdated
-- the same transaction, so this state is reachable only for the instance the
-- caller (df.start) just inserted in the current transaction — never another
-- user's committed instance. This is what stops a caller from forging work
-- against a foreign instance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment nit, and to check my understanding...

Given the permissions, a user can bypass df.start and populate df.instances and df.nodes tables, then call this function. The checks you are doing therefore limit the user to enqueueing any pending instance that they own, not necessarily one that's in the same transaction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that understanding is correct. The old comment overstated the guarantee: the wrapper permits a same-owner pending/not-yet-started instance, not necessarily one inserted by df.start in the same transaction. I updated the comment in both fresh-install SQL and the upgrade script to say that explicitly, and to point at the actual safety properties: fixed root orchestration, matching input instance id, and no foreign already-started target.

The SECURITY DEFINER start wrapper allows same-owner pending instances that have
not yet been enqueued; that state can be created directly by a df user, not only
by df.start in the same transaction. Clarify the comment and point to the actual
safety properties: fixed root orchestration, matching input instance id, and no
foreign already-started target.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants