feat(api): make orphan-task recovery configurable and drop the Jira idempotency table #11472

Open
AdriiiPRodri wants to merge 6 commits into master from idem-feature-flags

@AdriiiPRodri AdriiiPRodri commented Jun 5, 2026

Context

This builds on the orphan-task recovery added in #11416, which re-enqueues background tasks whose worker died mid-run (deploy, OOM, eviction). That work landed on master without an operational kill switch or any per-area control, relied on a dedicated jira_issue_dispatches table for Jira de-duplication, and made scan re-runs idempotent. This PR makes recovery toggleable per task group (opt-in), drops the Jira idempotency table, and removes the scan re-run idempotency so scans are no longer auto-recovered. The whole #11416 feature is still unreleased, so removing the migrations does not affect any released schema.

Description

Feature flags for orphan-task recovery. Recovery is gated by Django settings (environment variables). The master switch is OFF by default, so recovery is opt-in; the per-group flags default to enabled, so once the master is on every group recovers unless explicitly turned off.

| Env var | Default | Scope |
| --- | --- | --- |
| DJANGO_TASK_RECOVERY_ENABLED | false | Master switch for the whole recovery sweep (opt-in). |
| DJANGO_TASK_RECOVERY_SUMMARIES_ENABLED | true | Group: scan-summary, scan-compliance-overviews, scan-provider-compliance-scores, scan-daily-severity, scan-finding-group-summaries, scan-reset-ephemeral-resources. |
| DJANGO_TASK_RECOVERY_DELETIONS_ENABLED | true | Group: provider-deletion, tenant-deletion. |
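
To illustrate the defaults in the table, here is a minimal stdlib-only sketch of how the three flags could be read. The helper names (env_bool, load_recovery_flags), the accepted truthy strings, and the exact TASK_RECOVERY_* setting names are assumptions for illustration; the PR's settings module may use a different mechanism.

```python
import os

# Assumption: which strings count as "true" is illustrative, not taken from the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def env_bool(name, default, environ=os.environ):
    """Return the boolean value of an env var, or `default` when it is unset."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def load_recovery_flags(environ=os.environ):
    """Hypothetical loader mirroring the table: master off, groups on."""
    return {
        "TASK_RECOVERY_ENABLED": env_bool(
            "DJANGO_TASK_RECOVERY_ENABLED", False, environ
        ),
        "TASK_RECOVERY_SUMMARIES_ENABLED": env_bool(
            "DJANGO_TASK_RECOVERY_SUMMARIES_ENABLED", True, environ
        ),
        "TASK_RECOVERY_DELETIONS_ENABLED": env_bool(
            "DJANGO_TASK_RECOVERY_DELETIONS_ENABLED", True, environ
        ),
    }
```

With an empty environment this yields the master switch off and both group flags on, which matches the opt-in behavior described above.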

The flat reenqueueable allowlist is replaced by RECOVERY_TASK_GROUPS (summaries, deletions) plus a reenqueueable_tasks() helper that unions only the enabled groups. A task in a disabled group is still detected and marked terminal (clearing the stuck "in progress" state), but it is not re-enqueued. With the master flag off, the task-recovery sweep is skipped entirely; the attack-paths stale cleanup, a separate concern, keeps running.
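
The grouped allowlist and its helper can be sketched roughly as follows. RECOVERY_TASK_GROUPS and reenqueueable_tasks() are named in this PR, but the GroupFlags dataclass and the function signature are assumptions made to keep the example self-contained; the real helper presumably reads the flags from Django settings.

```python
from dataclasses import dataclass

# Group membership as listed in the description above.
RECOVERY_TASK_GROUPS = {
    "summaries": frozenset({
        "scan-summary",
        "scan-compliance-overviews",
        "scan-provider-compliance-scores",
        "scan-daily-severity",
        "scan-finding-group-summaries",
        "scan-reset-ephemeral-resources",
    }),
    "deletions": frozenset({"provider-deletion", "tenant-deletion"}),
}


@dataclass(frozen=True)
class GroupFlags:
    """Hypothetical stand-in for the per-group settings (both default to on)."""
    summaries: bool = True
    deletions: bool = True


def reenqueueable_tasks(flags=GroupFlags()):
    """Union the task names of every enabled group; disabled groups add nothing."""
    names = set()
    for group, tasks in RECOVERY_TASK_GROUPS.items():
        if getattr(flags, group):
            names |= tasks
    return frozenset(names)
```

A task whose group is disabled simply drops out of the returned set, so the sweep would mark it terminal without re-enqueueing it.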

Scans excluded from recovery. The scan re-run idempotency added in #11416 is removed (the pre-run _clear_scan_rerun_state delete and the compliance summary/requirement re-deletes), Scan.recovery_count and its migration are dropped, and scan-perform/scan-perform-scheduled are moved into the watchdog's skip set. An orphaned scan is now left untouched (not detected, marked, or re-enqueued), reverting scans to their pre-#11416 behavior, because re-running a scan is not safe to do automatically.

Remove the Jira idempotency dispatch table. The JiraIssueDispatch model and its 0096_jiraissuedispatch migration are removed, send_findings_to_jira is reverted to its pre-#11416 form, and integration-jira is dropped from the reenqueueable allowlist because, without the dedup table, re-running it would create duplicate Jira issues. The dispatch cleanup step is also removed from provider deletion.

Tasks never re-enqueued. Only the two groups above are ever re-enqueued. Every other task is detected and marked terminal (so it stops showing as "in progress"), but never re-run, for one of three reasons:

  • External side effects, where re-running duplicates or loses work: integration-jira (duplicate Jira issues), integration-s3 (upload rebuilt from worker-local files that do not survive the crash), integration-security-hub (pushes findings to AWS), scan-report, scan-compliance-reports (generate/compress/upload output files from worker-local tmp).
  • Ephemeral checks, nothing to recover since they run again on demand or schedule: integration-check, integration-connection-check, provider-connection-check, lighthouse-connection-check, lighthouse-provider-connection-check, lighthouse-provider-models-refresh.
  • Not audited for idempotency, kept terminal as a conservative default: backfill-compliance-summaries, backfill-daily-severity-summaries, backfill-finding-group-summaries, backfill-provider-compliance-scores, backfill-scan-resource-summaries, scan-attack-surface-overviews, scan-category-summaries, scan-resource-group-summaries, reaggregate-all-finding-group-summaries, findings-mute-historical.

Some tasks are skipped entirely (not even detected): scan-perform and scan-perform-scheduled (not auto-recovered), attack-paths-scan-perform (handled by its own stale cleanup, which drops the temporary Neo4j database), and attack-paths-cleanup-stale-scans and reconcile-orphan-tasks (they re-run on their own schedule).
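
The three outcomes described above (skipped, marked terminal only, re-enqueued) can be summarized in a small decision function. This is a sketch that assumes the master flag is on; Disposition, SKIP_SET, and classify_orphan are hypothetical names, and the actual watchdog code may be structured differently.

```python
from enum import Enum


class Disposition(Enum):
    SKIPPED = "skipped"              # not detected, marked, or re-enqueued
    TERMINAL_ONLY = "terminal-only"  # marked terminal, never re-run
    RE_ENQUEUED = "re-enqueued"      # marked terminal and run again


# Tasks the watchdog ignores entirely, as listed in the description.
SKIP_SET = frozenset({
    "scan-perform",
    "scan-perform-scheduled",
    "attack-paths-scan-perform",
    "attack-paths-cleanup-stale-scans",
    "reconcile-orphan-tasks",
})


def classify_orphan(task_name, reenqueueable):
    """Decide what happens to an orphaned task, given the enabled allowlist."""
    if task_name in SKIP_SET:
        return Disposition.SKIPPED
    if task_name in reenqueueable:
        return Disposition.RE_ENQUEUED
    # Everything else (integrations, ephemeral checks, unaudited tasks)
    # is cleared from "in progress" but not re-run.
    return Disposition.TERMINAL_ONLY
```

For example, with reenqueueable = {"provider-deletion"}, an orphaned integration-jira task is classified as terminal-only, while scan-perform is skipped.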

Migration cleanup

This PR deletes two unreleased migrations: 0094_scan_recovery_count (added Scan.recovery_count) and 0096_jiraissuedispatch (created jira_issue_dispatches). Neither shipped in a release, so no released schema is affected. If an environment already applied either one while tracking master, the column/table and their migration records are left behind on pull; drop them manually:

ALTER TABLE scans DROP COLUMN IF EXISTS recovery_count;
DROP TABLE IF EXISTS jira_issue_dispatches;
DELETE FROM django_migrations WHERE app = 'api' AND name IN ('0094_scan_recovery_count', '0096_jiraissuedispatch');

Steps to review

  1. Recovery is OFF by default: the master flag DJANGO_TASK_RECOVERY_ENABLED defaults to false, so the sweep does nothing until you set it to true.
  2. With the master on, set a group flag to false (for example DJANGO_TASK_RECOVERY_SUMMARIES_ENABLED=false) to exclude that group; its orphaned tasks are then marked terminal instead of re-enqueued.
  3. Scans are no longer recovered: an orphaned scan-perform/scan-perform-scheduled is ignored (not detected, marked, or re-enqueued).
  4. Settings-only change plus migration deletions, so no new migration is required; python manage.py makemigrations --check reports no changes.
  5. Tests: pytest tasks/tests/test_orphan_recovery.py (master/per-group flags and the scan skip) and pytest tasks/tests/test_integrations.py -k Jira (Jira send reverted).

Checklist

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (if applicable) - N/A, no endpoint changes
  • EXPLAIN ANALYZE output for new/modified queries or indexes (if applicable) - N/A
  • Performance test results (if applicable) - N/A
  • Verify if API specs need to be regenerated - not needed, no endpoint or schema changes
  • Check if version updates are required - covered by the existing unreleased section
  • Ensure new entries are added to api/CHANGELOG.md

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AdriiiPRodri AdriiiPRodri requested a review from a team as a code owner June 5, 2026 10:03
@github-actions github-actions Bot added the component/api and review-django-migrations (This PR contains changes in Django migrations) labels Jun 5, 2026

github-actions Bot commented Jun 5, 2026

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.


github-actions Bot commented Jun 5, 2026

✅ All necessary CHANGELOG.md files have been updated.


github-actions Bot commented Jun 5, 2026

🔒 osv-scanner: 2 finding(s) in api/uv.lock

Severity gate: HIGH,CRITICAL,UNKNOWN

| Severity | ID | Package | Version | Summary |
| --- | --- | --- | --- | --- |
| 🟠 HIGH (8.8) | GHSA-897w-fcg9-f6xj | PyPI/dulwich | 0.23.0 | Dulwich has an arbitrary file write via NTFS-hostile tree entries on Windows |
| 🟠 HIGH (7.4) | PYSEC-2026-179 | PyPI/pyjwt | 2.12.1 | (no summary) |

To accept a finding, add an [[IgnoredVulns]] entry to osv-scanner.toml at the repo root with a reason and ignoreUntil.



github-actions Bot commented Jun 5, 2026

🔒 Container Security Scan

Image: prowler-api:aae8a7e
Last scan: 2026-06-05 13:15:41 UTC

📊 Vulnerability Summary

| Severity | Count |
| --- | --- |
| 🔴 Critical | 21 |
| Total | 21 |

15 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable



codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.97%. Comparing base (f7f8747) to head (97cbb12).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #11472    +/-   ##
========================================
  Coverage   93.96%   93.97%            
========================================
  Files         242      240     -2     
  Lines       35619    35407   -212     
========================================
- Hits        33471    33273   -198     
+ Misses       2148     2134    -14     

| Flag | Coverage Δ |
| --- | --- |
| api | 93.97% <100.00%> (+<0.01%) ⬆️ |


| Components | Coverage Δ |
| --- | --- |
| prowler | ∅ <ø> (∅) |
| api | 93.97% <100.00%> (+<0.01%) ⬆️ |


Copilot AI left a comment


Pull request overview

This PR makes orphaned-task recovery operationally controllable via Django settings (master opt-in plus per-task-group toggles) and removes the unreleased Jira idempotency mechanism (JiraIssueDispatch) along with its migration and related cleanup logic.

Changes:

  • Add master + per-group feature flags for orphan-task recovery and refactor the allowlist into grouped RECOVERY_TASK_GROUPS.
  • Remove the Jira idempotency dispatch table/model/migration and revert Jira send behavior accordingly (and ensure Jira tasks are no longer auto re-enqueued).
  • Update tests, docs, and changelog entries to match the new recovery semantics and the Jira de-dup removal.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| api/src/backend/tasks/tests/test_orphan_recovery.py | Adds coverage for master/per-group recovery flags and updates Jira recovery expectations. |
| api/src/backend/tasks/tests/test_integrations.py | Removes Jira dispatch-based idempotency assertions and updates expected return shape. |
| api/src/backend/tasks/tests/test_deletion.py | Removes provider-deletion cleanup test for the deleted Jira dispatch table. |
| api/src/backend/tasks/jobs/orphan_recovery.py | Introduces grouped re-enqueue allowlist + master flag gate in reconcile_orphans(). |
| api/src/backend/tasks/jobs/integrations.py | Removes Jira dispatch reservation logic and simplifies Jira send loop/return payload. |
| api/src/backend/tasks/jobs/deletion.py | Drops Jira dispatch cleanup step from provider deletion workflow. |
| api/src/backend/config/django/base.py | Adds TASK_RECOVERY_* settings sourced from DJANGO_TASK_RECOVERY_* env vars. |
| api/src/backend/api/models.py | Deletes the JiraIssueDispatch model. |
| api/src/backend/api/migrations/0096_jiraissuedispatch.py | Removes the migration that created the Jira dispatch table. |
| api/docs/orphan-task-recovery.md | Updates operational docs for recovery flags and removes Jira idempotency claims. |
| api/CHANGELOG.md | Removes the Jira idempotency entry tied to the now-removed dispatch table. |


Comment thread api/src/backend/tasks/jobs/orphan_recovery.py
Comment thread api/src/backend/tasks/jobs/integrations.py

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Comment thread api/src/backend/tasks/jobs/scan.py
Comment thread api/src/backend/tasks/jobs/scan.py Outdated
Comment thread api/src/backend/tasks/jobs/scan.py
Comment thread api/src/backend/tasks/jobs/integrations.py
…livering scan-perform/integration-jira on crash

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Comment thread api/src/backend/tasks/tasks.py
Comment thread api/src/backend/tasks/jobs/orphan_recovery.py Outdated
Comment thread api/CHANGELOG.md Outdated
Comment thread api/docs/orphan-task-recovery.md Outdated
