Skip to content

eth/protocols/wit, consensus/bor: WIT2 — BP-signed witness announcements with transitive relay and pre-import serving#2208

Open
lucca30 wants to merge 14 commits into
developfrom
lmartins/wit2-signed-announce
Open

eth/protocols/wit, consensus/bor: WIT2 — BP-signed witness announcements with transitive relay and pre-import serving#2208
lucca30 wants to merge 14 commits into
developfrom
lmartins/wit2-signed-announce

Conversation

@lucca30

@lucca30 lucca30 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds WIT2 (witness protocol version 3): block producers sign a commitment over each witness, peers verify the signature and relay the announce at network-RTT speed without executing the block, and any peer that has fetched the body can serve it pre-import from an in-memory cache. The slow part of witness propagation — re-execution before relay — is removed from the critical path. Mixed mesh with WIT1 nodes is tolerated; no flag-day rollout required.

Devnet result (4 scenarios, post-fork-only window, hop-chain topology with +300 ms per-hop import knob):

  • Stateless validator at hop 2 — milestone-vote lag p95 425 ms → 1.0 ms (−99.8%).
  • Stateless validator at hop 3 — milestone-vote lag p95 719 ms → 260 ms (−64%).
  • Mixed-version meshes (only stateless WIT2; only one BP WIT2): zero errors, no peer drops, full sample counts. Backward compatible.

What we're solving

Today on Polygon mainnet, witness propagation through a stateless validator that is multiple hops away from a block producer accumulates a per-hop ~500 ms execution gate: each intermediate node must finish executing the block before it will relay the witness downstream. This serialises along the path and shows up at the receiver as milestone-voting latency — slow milestone votes on a fraction of blocks at multi-hop stateless validators. Adding more peers does not help; the chain of dependencies is fan-in × execution time.

The deliverable is to detach announce from execute so witness availability propagates at gossip speed, while keeping the same byte-correctness guarantee (hash check at the requester, with on-chain blame) and the same content-correctness guarantee (state-root, with BP blame).

How the code achieves it

1. BP-signed witness commitment

The producer needs to commit to which witness bytes are correct without paying ~88 ms of single-thread keccak on the announce path (otherwise we re-introduce the same gate we're trying to remove, just on a different node). See Signing-scheme evaluation below — short version: chunked-parallel keccak at 1 MiB chunks beats the next-best viable candidate by a clear margin and keeps the WIT1 wire format intact.

  • core/stateless/witness_commit.go::WitnessCommitHash(bytes) = keccak256(concat(per-1MiB-chunk-keccak)). Each 1 MiB chunk is hashed in parallel; final aggregate is one extra keccak over <1 KiB of chunk hashes. ~13.5 ms wall-clock for 50 MiB witnesses on 8 cores vs ~88 ms single-shot keccak — 6.5× speedup, no wire-format change. Producer and verifier agree on the chunk size as a protocol constant.
  • Producer signs the commitment via consensus/bor.SignBytes reusing the engine's SignerFn, with a dedicated mimetype application/x-bor-wit2-announce and a domain-separated digest tag — replay-resistant at both the digest and signer-call levels.
  • Operator note: validators running Clef must whitelist the new mimetype, otherwise the producer falls back to unsigned WIT1 announces.

2. Verify-and-relay without execution

  • New protocol version WIT2 = 3 (eth/protocols/wit/protocol.go), new message SignedNewWitnessHashesMsg = 0x06 carrying up to 64 announcements per packet.
  • eth/handler_wit2.go::handleSignedWitnessAnnouncements does ecrecover against the scheduled producer for the announced block; on success the announce is cached and immediately relayed to peers that have not seen this hash. No state execution is touched.
  • Header-not-yet-local case is handled as a deferral (no strike, retry on the next packet for the same hash) so the block-cosend race does not punish honest relayers. Strikes (rate-limited disconnect at 5/min) are reserved for confirmed misbehaviour: bad signature, signer ≠ scheduled producer with a known header.

3. Pre-import serving cache

  • pendingWitnessBodies (capacity 10) in the WIT2 handler is fed from the paged-fetch path the moment byte-correctness verification against the BP-signed WitnessHash passes — i.e. before chain write. handleGetWitness consults this cache before chain storage, so a peer that just received the body can serve it to a downstream stateless node before it has finished executing.
  • Entries are gated on the BP-signed WitnessHash being on file — relayers never cache unverified bytes, and WIT1 fallback paths skip the cache entirely (no path to mix unsigned and signed bytes).

4. Blame model preserved

  • Byte-correctness: requester verifies the body against the BP's signed WitnessHash; failure attaches to the server that returned the bytes.
  • Content-correctness (state-root): same as today — failure attaches to the BP that signed.
  • Conflicting WitnessHash for the same BlockHash is rejected via signedWitnessCache.putIfNewer, so a peer cannot equivocate witnesses across announcements.

5. Rate-limits & DoS shape

  • Per-(blockHash, peer) relay rate-limit: 200 ms.
  • Announcement TTL: 30 s.
  • Per-peer token bucket: burst 256, refill 64/s.
  • Strike disconnect at 5 invalid signed announces / minute.

6. Compatibility

  • WIT1 peers continue using NewWitnessHashes. Mixed WIT1/WIT2 mesh is tolerated: WIT2 nodes downgrade to WIT1 wire when peering with WIT1 peers (relay handler skips peers with Version() < wit.WIT2).
  • New WitnessHash field on WitnessMetadataResponse is set by WIT2 servers and ignored by WIT1 readers — wire forward-compatible.

Signing-scheme evaluation

Picking the right commitment function for the announce signature is load-bearing for the whole PR: too slow on the producer and we just move the per-hop gate from "execute the block" to "hash the witness"; too weak and we lose the byte-blame property that lets a downstream node disconnect a peer that returned tampered bytes. Four candidates were evaluated end-to-end on synthetic 1–50 MiB witnesses (Apple M4 Pro, Go 1.26.2, go test -benchtime=3s -count=3, median of three).

Candidates

Candidate Mechanism Wire change
A — current baseline keccak256(canonical_RLP(witness)) single-thread none
B — chunked-parallel keccak keccak(chunk0_hash ‖ … ‖ chunkN_hash), chunks hashed concurrently none
C — per-node Merkle hash every state node, sort, build Merkle tree, sign root none
D — intrinsic (drop separate hash) sign nothing extra; rely on header.StateRoot to detect bad bytes shrinks announce, bumps to WIT3

Result at 50 MiB — verifier wall-clock (best parallel config)

Candidate Best wall-clock Speedup vs A Producer cost Allocs/op Notes
A 88.4 ms (1 thread) 1.0× 88 ms 1 scales 1× even on 12-core hardware
B (1 MiB chunks, 8 cores) 13.5 ms 6.5× 13.5 ms 16 wire identical, only producer's signing input changes
B (15 MiB pages, 4 cores) 26.7 ms 3.3× 27 ms 16 one chunk per wire page; less internal parallelism
C (per-node Merkle, 4 cores) 122 ms 0.7× (worse) ~50 ms (precomputed hashes) 614 k Merkle build over 204k node hashes overwhelms the parallel keccak win
D (intrinsic, 4 cores) 44 ms 2.0× 0 ms 410 k rejected — see below

Why D was rejected post-bench

D had the most attractive numbers (zero producer cost, 2× verifier speedup, no signature on the announce path) — but a peer can serve a truncated witness whose included nodes all hash consistently up to the BP-signed header.StateRoot. Branch nodes embed child references as 32-byte hashes inside their own bytes, so dropping a subtree leaves the parent branch nodes' hashes unchanged. The intrinsic walker has no way to distinguish "this hash-reference belongs to a path that was never touched and is intentionally absent" from "this hash-reference belongs to a path that was touched and was adversarially omitted" — only attempting execution would. That destroys pre-execute byte-blame, which is the whole reason WIT2 introduced a content commitment in the first place. A/B/C all preserve byte-blame because they sign over content; truncation changes the commitment, signature mismatch, peer dropped pre-execute.

Why B at 1 MiB chunks won

A chunk-size sweep at 50 MiB / 8 cores:

chunk size wall-clock aggregate throughput speedup vs A
512 KiB 13.5 ms 3.9 GB/s 6.5×
1 MiB 13.8 ms 3.85 GB/s 6.4×
2 MiB 15.9 ms 3.3 GB/s 5.6×
4 MiB 17.1 ms 3.1 GB/s 5.2×
15 MiB (= one wire page) 27.7 ms 1.9 GB/s 3.2×

512 KiB shaves a tenth of a ms over 1 MiB at the cost of doubling the chunk count and the per-chunk overhead — 1 MiB is the knee of the curve. Below 512 KiB, per-chunk setup starts dominating. The 4 GB/s ceiling is the M4 Pro's aggregate keccak throughput across 8 P-cores; further parallelism doesn't help with the current keccak primitive.

Verifier-side scaling — B beats A non-trivially only ≥ 30 MiB

witness size A (single-thread) B (1 MiB chunks, 8 cores)
1 MiB 1.8 ms 1.8 ms (one chunk, no parallelism)
5 MiB 8.9 ms 4.0 ms (~2.2×)
15 MiB 26.6 ms 6.2 ms (~4.3×)
30 MiB 53.2 ms 9.1 ms (~5.8×)
50 MiB 88.4 ms 13.5 ms (~6.5×)

For the small witnesses Polygon emits today (typically 1–10 MiB) B is comparable to A; for the large witnesses we already see at the upper tail (30–50 MiB) B is the difference between the producer/verifier paying a ~90 ms gate vs ~14 ms. The fix is most impactful exactly where the problem is worst.

Why not C

C is dominated by every other viable candidate on these numbers: slower verifier than A (122 ms vs 88 ms), 91 MiB / 614 k allocations per verify at 50 MiB, no wire saving. C only becomes interesting if a future design needs sub-witness proofs (proving a specific node belongs to the committed set without sending the full body) — that's not on the roadmap, so C is a no-vote here.

Sensitivity caveats

  • Synthetic 256-byte avg node size; real mainnet shape may shift C/D's relative position but not B vs A — B operates on contiguous bytes regardless of node distribution.
  • Producer cost for B reuses the same parallel-keccak; not amortized into anything else, so it's a real wall-clock cost on the announce path. 14 ms is well under the per-hop savings (~500 ms) so the change is a net win even at 50 MiB.
  • A faster hash primitive (BLAKE3 / KangarooTwelve) would push verifier toward ~5 ms but adds a non-Ethereum dependency; out of scope, can be re-evaluated separately if 14 ms is still too slow.

Full bench artifact (raw numbers, reproduction commands, allocation breakdown): agent-zero/investigations/witness-propagation/witness-commit-bench.md.

Local devnet validation

A 9-node hop-chain devnet on kurtosis-pos: 4 BPs full-mesh, two relay full-nodes (F1/F2) carrying a +300 ms per-hop import-delay knob to amplify the gate without heavy tx loads, and three stateless validators at hop distances 1 / 2 / 3 from the closest BP (S1 ↔ BP1, S2 ↔ F1, S3 ↔ F2). Topology was enforced post-launch via admin_removePeer after every node imported past Giugliano (block 128 + 72-block settle), so the measurement window is post-fork and post-prune only — pre-fork blocks (different code path) are excluded.

Four scenarios, ~30 measured blocks each:

# Image map Headline result
1 All 9 = bor:develop (control) Reproduces the bug. S2 milestone-lag p95 425 ms / max 482 ms. S3 p95 719 ms / max 898 ms.
2 All 9 = bor:wit2 S2 p95 1.0 ms (−99.8%). S3 p95 260 ms (−64%). 700–1000 RX_SIGNED_ANNOUNCE per node confirms protocol active end-to-end.
3 S1/S2/S3 = bor:wit2, rest = bor:develop Same lag as develop (no relay path for signed announces because intermediate hops are WIT1), but zero errors, no peer drops, full sample counts; WIT2 stateless nodes correctly downgrade to WIT1 wire and serve develop peers via the chain path.
4 BP1 = bor:wit2, rest = bor:develop Same as scenario 3 from the other direction. BP1 served 195 witnesses on the WIT1 chain path; develop validators kept whitelisting milestones throughout.

F2 import-lag (the relay just before S3) shows the mechanism: median drops 805 → 305 ms in scenario 2 — one full per-hop inject overlapped with WIT2 announcement-driven pre-fetch, exactly what the design predicts.

S3's residual p95 of 260 ms in scenario 2 is the single +300 ms inject on F2 still in the critical path: WIT2 lets the F1 hop overlap, but F2 still has to receive and execute the block before serving S3. Without the artificial knob (i.e., on mainnet), the natural per-hop gate is ~50–100 ms and this residual shrinks proportionally.

Full report (per-scenario logs, lag tables, errors/warnings, peer-count snapshots, prune timestamps, image map): agent-zero/investigations/witness-propagation/devnet-validation-2026-04-30b.md.

Re-validation at the current head (2026-06-09, commit 8c16b39): the s2 all-WIT2 scenario was re-run at the convergence-review head with per-node Prometheus capture, which surfaced a defect all code review rounds missed — any node holding a signer key signed announces for foreign blocks, causing honest validators to strike each other. Fixed by binding the sign path to the scheduled producer (maySignAnnouncementForBlock, regression-tested) with a truthful WIT1 hash-announce fallback for non-producers. Post-fix: strikes 0 across all validators, RX_SIGNED +20%, latency unchanged. A Phase-2 large-witness scenario (~18.9 MB witnesses, above the 16 MiB push/gossip cap) verified the multi-page pull path end-to-end: all stateless nodes imported every padded block in lockstep.

Backward compatibility — explicit checks

Property Check Result
WIT2 binary in develop mesh scenario 4 (BP1 only) and scenario 3 (S* only) full sample counts, zero ERRORs, no peer isolation
WIT1 fallback path pendingWitnessBodies skipped when no signed WitnessHash on file gated check in eth/handler_wit2.go::resolveWitnessBytes
Mixed-version metadata new WitnessHash field on WitnessMetadataResponse ignored by WIT1 readers served-chain path serves both equally
Replay across forks/networks mimetype + domain tag in consensus/bor.SignBytes covered by consensus/bor/signbytes_test.go

Deployment & rollout safety — why a mixed mesh has no downside

Rollout is fully incremental: no flag day, no coordinated upgrade, no config change. The wit protocol negotiates per-connection at handshake (ProtocolVersions = [WIT2, WIT1, WIT0], highest mutual version wins), so one node simultaneously speaks WIT2 to upgraded peers and WIT1 to old peers. The new message (SignedNewWitnessHashesMsg = 0x06) is outside WIT1's message range and every send site is gated on peer.Version() >= WIT2 — an old node can never receive a frame it doesn't understand. The witness body transfer (paged GetWitness pull) is byte-identical in both versions.

Per-peer announce split (producer side). When a producer (or any node holding the witness) announces, each recipient gets the variant its negotiated version supports: WIT2 peers the signed announcement, WIT1/WIT0 peers the legacy unsigned NewWitnessHashes. Both are truthful (every announce path is gated on HasWitness) and both lead to the same paged pull. Old peers observe exactly the protocol they ran before this PR. The signature is computed once per produced block (cached; ~14 ms keccak + ~100 µs ECDSA) and the same announcement struct is reused across all peers, both broadcast phases, and every relay hop — relayers forward the producer's signature verbatim and cannot re-sign.

WIT2 → WIT1 boundary. A node that learns of a witness via a signed announce does not translate it for WIT1 peers at receive time (the transitive relay deliberately skips peers below WIT2 — at that point the node only knows a hash, and the WIT1 announce contract is "I have this witness"). The WIT1 peer is served on the node's post-import announce instead, and can even pull the body from the WIT2 node's in-flight cache pre-import. Net effect: WIT1 peers keep today's latency profile exactly; WIT2's pre-import fast path simply doesn't extend through them.

No cross-version discipline risk. Strike-disconnect only triggers on signed announcements (bad signature / signer ≠ scheduled producer). WIT1 peers physically cannot send those, so an old peer can never be struck and never strikes anyone. On the WIT2 side, only the block's sealer signs announcements (producer binding on the sign path, maySignAnnouncementForBlock), an announce racing ahead of its header is deferred rather than punished, and rate limiting drops packets without striking — devnet-verified at zero strikes across all honest topologies.

Benefits track validator adoption; the upgrade is inert until then. No signed announces exist until block producers run the new version, so a WIT2 full/RPC node in an old mesh behaves byte-for-byte like a WIT1 node. Deploying sentries/RPC/archive first is safe and changes nothing; each validator upgrade then switches on the latency win for its blocks. The worst case anywhere in a mixed mesh is today's propagation behavior, never worse — a WIT2 island reachable only through WIT1 hops degrades to the status quo (measured: scenarios 3/4 above, zero errors, no peer drops).

Suggested rollout order. Upgrade a validator together with its sentries as a unit — the signed announce dies at the first WIT1 hop, so a validator behind old sentries gets no benefit past hop 1. On canaries watch the eth_wit2_* meters: strike_disconnect, invalid_sig, not_validator should be ~0 in an honest mesh (non-zero = a buggy/malicious peer being handled); header_unknown and deferred counts are non-zero by design (announce/block races absorbed by the deferral queue).

No hardfork gate. WIT2 negotiates immediately at handshake; what gates real traffic is witness production/consumption, which is already live. The PR changes announcement transport and serving, not witness semantics, block validity, or consensus.

Test plan

  • Unit tests: core/stateless/witness_commit_test.go, witness_commit_bench_test.go, consensus/bor/signbytes_test.go, eth/handler_wit2_test.go, eth/handler_wit_test.go, eth/peerset_test.go, eth/protocols/wit/protocol_wit2_test.go, eth/fetcher/witness_manager_wit2_test.go.
  • Local devnet validation, 4 scenarios, post-fork-only measurement window — see report above.
  • Reviewers: confirm the chunk-size choice (1 MiB) for the parallel keccak — small enough to be useful on smaller witnesses, large enough to amortise goroutine overhead.
  • Reviewers: confirm the cache capacity (10) for pendingWitnessBodies. We don't expect more than a few in-flight unique witnesses at a time, but worth a second opinion under burst conditions.
  • Operator readiness: comms to validators running Clef about the new mimetype application/x-bor-wit2-announce.

Diffguard / quality-gate notes

  • Dead code: clean (earlier leftovers removed).
  • eth/handler_wit2.go split into three files (handler_wit2.go ~410 lines, handler_wit2_bodies.go, handler_wit2_announces.go) — all under the 500-line threshold.
  • handleWitnessBroadcast refactored into three accept-path helpers (was 96 lines / complexity 17, now within thresholds); deferredAnnounceCache.put eviction extracted (complexity 13 → under 10).
  • WitnessCommitHash complexity (was 18) resolved by extracting the worker-pool fan-out into hashWitnessChunks / witnessCommitWorkerCount; commitment recipe unchanged (pinned by the shape tests).
  • witness_manager.go growth (+161 vs base, over the oversized-file delta tolerance) resolved by moving the PR-added WIT2 code (byte-verification, pre-import serving handoff, empty-response backoff) into eth/fetcher/witness_manager_wit2.go.
  • Remaining flags are pre-existing legacy hot spots this PR touches but does not own (newHandler, BroadcastBlock, handleGetWitness, file sizes of handler.go/bor.go/block_fetcher.go/peer.go) plus one accepted item: (Peer).broadcastWitness complexity 12 — upstream-shaped select loop; this PR adds one case.
  • govulncheck failures are unrelated to this PR: two Go 1.26.3 standard-library vulns (GO-2026-5039 net/textproto, GO-2026-5037 crypto/x509, both fixed in go1.26.4) reached through pre-existing code (node/endpoints.go, ethstats, internal/cli); develop's latest Govuln run fails identically. Needs a repo-wide toolchain bump, separate from this PR.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@claude

claude Bot commented Apr 30, 2026

Copy link
Copy Markdown

test

@claude

claude Bot commented Apr 30, 2026

Copy link
Copy Markdown

Code Review

3 issues found. Checked for bugs and CLAUDE.md compliance.


1. Performance: redundant witness encoding

File: eth/fetcher/witness_manager.go L670-L680

verifyAgainstSignedHash (line 671) encodes the witness via encodedWitnessHash and computes WitnessCommitHash, then cacheVerifiedWitnessForServing (line 679) repeats the exact same EncodeRLP + WitnessCommitHash. Neither call reuses the other's result.

On a 50 MiB witness this adds ~100–300 ms of redundant CPU work per verified fetch — meaningful given WIT2's goal of eliminating per-hop latency.

Suggested fix: Have verifyAgainstSignedHash return the encoded bytes and hash on success, then pass them directly to parentCacheWitnessForServing instead of re-encoding.


2. Performance: unconditional encode+hash before signed-announcement check

File: eth/handler_wit.go L95-L119

EncodeRLP (line 96–97) and WitnessCommitHash (line 101) run before checking whether a signed announcement exists (line 102). On the WIT1 fallback path (default case, lines 112–118), the encoded bytes and hash are discarded.

Every witness broadcast — including from WIT1 peers — pays the full encode+hash cost (~150–450 ms on 50 MiB witnesses) even when the result is never used.

Suggested fix: Check signedWitnesses.get(hash) first. Only encode+hash when hasSigned is true. This mirrors the pattern already used in cacheVerifiedWitnessForServing in witness_manager.go (lines 697–701), which correctly checks before encoding.


3. Bug: peer dropped on local EncodeRLP failure

File: eth/fetcher/witness_manager.go L725-L730

When encodedWitnessHash fails (line 725–726), the non-empty peer string is passed to handleWitnessFetchFailureExt (line 728), which calls parentDropPeer(peer). An EncodeRLP failure on a successfully-decoded Witness is a local error (the peer delivered valid RLP that decoded fine), not evidence of peer misbehavior.

This is inconsistent with the pattern in cacheVerifiedWitnessForServing just below (line 704–705), which correctly logs the failure without dropping anyone.

Suggested fix: Change peer to "" on line 728:

m.handleWitnessFetchFailureExt(hash, "", fmt.Errorf("witness encode failed: %w", err), false)

@codecov

codecov Bot commented Apr 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.42718% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.66%. Comparing base (706b800) to head (c9d4756).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
eth/handler_wit2.go 87.85% 26 Missing ⚠️
eth/handler_wit.go 92.57% 10 Missing and 5 partials ⚠️
eth/fetcher/witness_manager_wit2.go 81.96% 8 Missing and 3 partials ⚠️
core/stateless/witness_commit.go 84.74% 6 Missing and 3 partials ⚠️
eth/handler.go 86.20% 6 Missing and 2 partials ⚠️
eth/peerset.go 73.91% 3 Missing and 3 partials ⚠️
eth/handler_eth.go 66.66% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2208      +/-   ##
===========================================
+ Coverage    53.41%   53.66%   +0.24%     
===========================================
  Files          896      900       +4     
  Lines       159713   160550     +837     
===========================================
+ Hits         85314    86157     +843     
+ Misses       69071    69055      -16     
- Partials      5328     5338      +10     
Files with missing lines Coverage Δ
accounts/accounts.go 100.00% <ø> (ø)
consensus/bor/bor.go 85.73% <100.00%> (+0.20%) ⬆️
core/stateless/encoding.go 65.15% <100.00%> (+1.65%) ⬆️
eth/fetcher/block_fetcher.go 73.80% <100.00%> (+0.06%) ⬆️
eth/fetcher/witness_manager.go 90.77% <100.00%> (+0.17%) ⬆️
eth/handler_wit2_announces.go 100.00% <100.00%> (ø)
eth/handler_wit2_bodies.go 100.00% <100.00%> (ø)
eth/peer.go 95.80% <ø> (ø)
eth/protocols/wit/broadcast.go 54.16% <100.00%> (+15.27%) ⬆️
eth/protocols/wit/handler.go 29.33% <100.00%> (+1.93%) ⬆️
... and 10 more

... and 18 files with indirect coverage changes

Files with missing lines Coverage Δ
accounts/accounts.go 100.00% <ø> (ø)
consensus/bor/bor.go 85.73% <100.00%> (+0.20%) ⬆️
core/stateless/encoding.go 65.15% <100.00%> (+1.65%) ⬆️
eth/fetcher/block_fetcher.go 73.80% <100.00%> (+0.06%) ⬆️
eth/fetcher/witness_manager.go 90.77% <100.00%> (+0.17%) ⬆️
eth/handler_wit2_announces.go 100.00% <100.00%> (ø)
eth/handler_wit2_bodies.go 100.00% <100.00%> (ø)
eth/peer.go 95.80% <ø> (ø)
eth/protocols/wit/broadcast.go 54.16% <100.00%> (+15.27%) ⬆️
eth/protocols/wit/handler.go 29.33% <100.00%> (+1.93%) ⬆️
... and 10 more

... and 18 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lucca30

lucca30 commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Review responses

All 3 review issues addressed in 12368a3:

  1. Redundant witness encodingverifyAgainstSignedHash now returns the canonically-encoded body and signed hash on success; cacheVerifiedWitnessForServing reuses them directly instead of re-encoding (~14 ms saved per verified fetch on 50 MiB witnesses on the bench machine).
  2. Unconditional encode+hash on WIT1 pathsignedWitnesses.get(hash) is now hoisted above EncodeRLP/WitnessCommitHash in handleBroadcastWitness. WIT1 broadcasts (no signed announcement) short-circuit before paying the encode/hash cost.
  3. Peer drop on local EncodeRLP failure — re-encoding RLP bytes the peer already delivered as valid is a local invariant violation, not peer misbehavior. Now logs without dropping, mirroring the cache path.

The wit2_test.go no-signed-hash regression test was retargeted onto verifyAgainstSignedHash where the invariant now lives.

CI status

  • ✅ All review issues addressed.
  • ⚠️ diffguard (Quality metrics) failed at install, not on this PR's diff: upstream tool break — *Language does not implement lang.Language (missing method DeadCodeDetector) in github.com/0xPolygon/diffguard@latest. Fix belongs in the diffguard repo.
  • ⚠️ Kurtosis Stateless E2E Tests Test 4 (extreme network latency recovery) failed because validator node-2 stayed at block 397 while all other 8 nodes recovered fully (768+, milestone lag 0). This same test has flaked on develop recently (Apr 27, Apr 16). I'll retrigger.
  • ⚠️ codecov patch coverage 52%: WIT2 protocol paths exercised by integration/devnet tests rather than unit tests; matches coverage profile of the existing WIT1 handler code.
  • ⚠️ SonarCloud duplication 5.3%: concentrated in _test.go files (witness_manager_test.go, witness_manager_wit2_test.go) — table-driven test setup, not production-code duplication.

lucca30 added a commit that referenced this pull request May 5, 2026
Three high-severity issues from the Codex adversarial review on PR #2208,
each with a TDD regression test added first then fixed:

1. Header-race: signed announce arriving before header was silently
   downgraded. handleSignedWitnessAnnouncements called peer.AddKnownAnnounce
   unconditionally before the verification gate, leaving a peer marked
   announce-known even on bad-signature / header-unknown rejection paths.
   That suppressed our own re-relay back to that peer if a valid version
   of the same hash arrived from someone else, killing the natural
   recovery path. Fix: gate AddKnownAnnounce on acceptSignedAnnouncement
   success so the announce-known bit only reflects verified delivery.

   Test: TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown.

2. pendingWitnessBodies TTL didn't actually evict. get() observed expiry
   and returned false but left the entry in the map; gcLocked only ran
   from put(), so a node that stopped receiving witnesses retained up to
   capacity (10) ~50MB blobs indefinitely. Fix: when get() observes an
   expired entry, upgrade to write lock and delete it (re-checking under
   the write lock to avoid clobbering a concurrent put).

   Test: TestPendingWitnessBodyCacheGetEvictsExpired.

3. Honest body-server dropped on bad producer commitment.
   verifyAgainstSignedHash dropped the byte-server on every signed-hash
   mismatch, but the announcement only proves *some* BP signed *some*
   hash — not that the hash matches the canonical witness. A faulty or
   malicious scheduled producer that signed a bogus hash would weaponise
   this to disconnect every honest peer serving the real witness. Fix:
   reject the bytes (don't cache for serving) and back off the request
   without dropping the byte-server. TODO comment left for follow-up
   signer-quarantine work, which needs (signer, relayer) provenance the
   manager doesn't currently have.

   Test: TestProcessWitnessResponseDoesNotDropOnByteMismatch (replaces
   the previous TestProcessWitnessResponseDropsOnHashMismatch, whose
   policy this commit reverses).
@lucca30

lucca30 commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Adversarial-review findings — addressed via TDD

Codex flagged 3 high-severity issues in an adversarial pass. For each I wrote a failing regression test first, then implemented the smallest correct fix, then confirmed all 919 tests across eth, eth/fetcher, eth/protocols/wit, consensus/bor, and core/stateless still pass. Pushed as 9485452.

# Codex finding Fix Regression test
1 Header-race: signed announce arriving before header silently degrades to unsigned-fallback because AddKnownAnnounce was called before the verification gate, suppressing recovery via re-relay Move AddKnownAnnounce into the acceptSignedAnnouncement success branch so the announce-known bit only reflects verified delivery. Re-receipt now drives recovery. TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown
2 pendingWitnessBodies TTL never actually evicted: get() returned false on expiry but left entries in the map; gcLocked only ran from put(). Up to 10 × ~50 MiB blobs could persist indefinitely past the 30s TTL. get() now upgrades to a write lock and deletes the expired entry (re-checking the entry pointer under the write lock to avoid clobbering a concurrent put). TestPendingWitnessBodyCacheGetEvictsExpired
3 verifyAgainstSignedHash dropped the byte-server on every signed-hash mismatch, but the announcement only proves some BP signed some hash. A faulty/malicious scheduled producer signing bogus could weaponise this to disconnect honest peers serving canonical witnesses. On mismatch, reject the bytes (no serving cache) and back off the request without dropping the byte-server. Existing TestProcessWitnessResponseDropsOnHashMismatch was replaced — its policy reversed — with TestProcessWitnessResponseDoesNotDropOnByteMismatch. TODO comment for the proper long-term fix (signer quarantine), which needs (signer, relayer) provenance the manager doesn't currently track. TestProcessWitnessResponseDoesNotDropOnByteMismatch

What's still out of scope (filed as TODOs at the call site, not in this PR)

  • Signer quarantine ban-list (real long-term fix for finding 3) — requires plumbing announcement signer + relayer pair through to the witness manager.
  • Deferred-announce queue with chain-head drain (more complete fix for finding 1) — re-gossip already drives recovery in practice; the queue is overkill for this PR.
  • Byte-budgeted witness body cache cap (finding 2 follow-up) — current capacity-10 cap with working TTL eviction is enough to close the leak.

@lucca30

lucca30 commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Adversarial review — all 4 items now closed

Pushed b3cb00e: deferred-announce queue closes the cosend race.

Codex finding Status Test
Header-race: announce arriving before header silently downgrades to WIT1 fallback done TestDeferredSignedAnnounceDrainedAfterHeaderArrives
pendingWitnessBodies TTL never evicts done TestPendingWitnessBodyCacheGetEvictsExpired
AddKnownAnnounce set on rejected announces done TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown
Honest byte-server dropped on bad commitment done TestProcessWitnessResponseDoesNotDropOnByteMismatch

A small in-memory cache (capacity 256, TTL 30s) holds signed announcements whose producer-binding could not be checked at receive time because the matching block header wasn't local yet. handler.Start subscribes to ChainHeadEvent; on each new block, drainDeferredAnnouncesFor(blockHash) re-runs verification, caches the announcement on success, credits the original sender as announce-known, and relays. Mirrors the existing pendingWitnessBodyCache lifecycle.

isScheduledProducer was also tightened: header presence is now checked first regardless of consensus engine. The previous non-bor early-return skipped the header check entirely, which was incorrect on its own — an announce we can't tie to a local block is unverifiable here.

920/920 tests pass across eth, eth/fetcher, eth/protocols/wit, consensus/bor, core/stateless. go build ./... clean.

Still deferred (not adversarial-review-flagged): signer-quarantine ban-list (needs signer/relayer provenance plumbed to the witness manager — TODO at the call site); byte-budgeted cap on pendingWitnessBodies (count cap of 10 with working TTL is bounded; future-proofing item).

@sonarqubecloud

sonarqubecloud Bot commented May 5, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
4.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@claude

claude Bot commented May 5, 2026

Copy link
Copy Markdown

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@lucca30 lucca30 marked this pull request as draft May 25, 2026 18:00
lucca30 added 7 commits June 8, 2026 13:54
…ncements with transitive relay and pre-import serving

Adds WIT2 (protocol version 3): block producers sign a chunked-parallel
commitment over each witness, peers verify the signature and relay the
announcement at network-RTT speed without execution, and any peer holding
the body can serve it pre-import from an in-memory cache. Byte-correctness
is verified by requesters against the BP-signed WitnessHash, attaching
tampering blame to the server; content-correctness (state-root) failures
attach to the BP. Removes the per-hop ~500 ms execution gate that today
serialises witness propagation through stateless validators.

Witness commitment uses 1 MiB chunked-parallel keccak (keccak256 of the
concatenation of per-chunk hashes), measured at ~13.5 ms wall-clock for
50 MiB witnesses on 8 cores vs ~88 ms single-shot. Wire format and
signature shape are unchanged from a single-keccak commitment; only the
function mapping bytes to the 32-byte commitment changes.

Producer-side signing reuses the engine SignerFn via consensus/bor.SignBytes
with a dedicated mimetype (application/x-bor-wit2-announce) and a
domain-separated digest tag, replay-resistant at both the digest and
signer-call levels. Receivers verify ecrecover against the scheduled
producer for the announced block; announces for blocks whose header is
not yet locally available are deferred (no strike) so the block-cosend
race does not punish honest relayers.

Pre-import serving cache (capacity 10) is fed from the paged-fetch path
the moment byte-correctness check passes, before chain write. Cache
entries are gated on a BP-signed WitnessHash being on file — relayers
never cache unverified bytes, and WIT1 fallback paths skip the cache
entirely. handleGetWitness consults the cache before chain storage.

Wire: new protocol version WIT2 = 3, new message
SignedNewWitnessHashesMsg = 0x06 with up to 64 announcements per packet.
WitnessMetadataResponse extended with WitnessHash. WIT1 peers continue
using NewWitnessHashes; mixed mesh tolerated.

Rate-limits: 200 ms per-(blockHash, peer) relay rate-limit, 30 s announce
TTL, per-peer token bucket (burst 256, refill 64/s), strike disconnect
at 5 invalid signed announces per minute. Conflicting WitnessHash for
the same BlockHash is rejected via signedWitnessCache.putIfNewer.

Operator note: validators running Clef as their signer must whitelist
the mimetype application/x-bor-wit2-announce; without it the producer
falls back to unsigned WIT1 announces.
- eth/handler_wit2.go: remove unused errInvalidSigner, contextBackground,
  wit2SpanLookupMissMeter, and now-unused context import
- core/stateless/witness_commit_bench_test.go: drop redundant c := c
  loop-var copies (Go 1.22+ copyloopvar)
- goimports formatting on accounts/accounts.go, witness_commit_bench_test.go,
  witness_commit_helpers_test.go, eth/fetcher/witness_manager.go,
  eth/fetcher/witness_manager_wit2_test.go, eth/handler_wit2.go,
  eth/protocols/wit/protocol.go
… drop

- eth/fetcher/witness_manager.go: verifyAgainstSignedHash now returns the
  canonically-encoded body and signed hash on success, so the pre-import
  serving cache no longer re-encodes the same witness (~14 ms saved per
  verified fetch on 50 MiB witnesses). cacheVerifiedWitnessForServing
  takes the precomputed body directly.
- eth/fetcher/witness_manager.go: local EncodeRLP failure inside
  verifyAgainstSignedHash no longer drops the peer — re-encoding bytes
  the peer already delivered as valid RLP is a local invariant violation,
  not peer misbehavior. Mirrors the pattern already used by the cache
  path.
- eth/handler_wit.go: hoist signedWitnesses.get(hash) above the EncodeRLP
  + WitnessCommitHash work in handleBroadcastWitness. WIT1 broadcasts
  (no signed announcement on file) used to pay the full encode+hash cost
  only to discard the result; now they short-circuit.
- eth/fetcher/witness_manager_wit2_test.go: rename + retarget the
  no-signed-hash regression test onto verifyAgainstSignedHash, where the
  invariant now lives.
Three high-severity issues from the Codex adversarial review on PR #2208,
each with a TDD regression test added first then fixed:

1. Header-race: signed announce arriving before header was silently
   downgraded. handleSignedWitnessAnnouncements called peer.AddKnownAnnounce
   unconditionally before the verification gate, leaving a peer marked
   announce-known even on bad-signature / header-unknown rejection paths.
   That suppressed our own re-relay back to that peer if a valid version
   of the same hash arrived from someone else, killing the natural
   recovery path. Fix: gate AddKnownAnnounce on acceptSignedAnnouncement
   success so the announce-known bit only reflects verified delivery.

   Test: TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown.

2. pendingWitnessBodies TTL didn't actually evict. get() observed expiry
   and returned false but left the entry in the map; gcLocked only ran
   from put(), so a node that stopped receiving witnesses retained up to
   capacity (10) ~50MB blobs indefinitely. Fix: when get() observes an
   expired entry, upgrade to write lock and delete it (re-checking under
   the write lock to avoid clobbering a concurrent put).

   Test: TestPendingWitnessBodyCacheGetEvictsExpired.

3. Honest body-server dropped on bad producer commitment.
   verifyAgainstSignedHash dropped the byte-server on every signed-hash
   mismatch, but the announcement only proves *some* BP signed *some*
   hash — not that the hash matches the canonical witness. A faulty or
   malicious scheduled producer that signed a bogus hash would weaponise
   this to disconnect every honest peer serving the real witness. Fix:
   reject the bytes (don't cache for serving) and back off the request
   without dropping the byte-server. TODO comment left for follow-up
   signer-quarantine work, which needs (signer, relayer) provenance the
   manager doesn't currently have.

   Test: TestProcessWitnessResponseDoesNotDropOnByteMismatch (replaces
   the previous TestProcessWitnessResponseDropsOnHashMismatch, whose
   policy this commit reverses).
Fourth and final adversarial-review item. Block + signed-announce gossip
streams travel independently and can reach a node in either order. When
the announce arrives first, isScheduledProducer returns
(ok=false, headerAvailable=false) and the previous code dropped the
announcement on the floor — relying on mesh re-gossip to reconstruct the
signed-hash for that block. In sparse meshes (single-cosend window, small
fanout) re-gossip never fires and subsequent witness fetches silently
fall back to the unsigned WIT1 path, leaking the WIT2 byte-verification
guarantee for that block.

This commit holds the announcement instead of dropping it:

- New deferredAnnounceCache mirrors pendingWitnessBodyCache: capacity 256,
  TTL = wit2AnnounceTTL (30s), oldest-evict, in-place expiry on take().
- acceptSignedAnnouncement's deferral branch now puts the announcement
  into deferredAnnounces.
- New drainDeferredAnnouncesFor(blockHash) re-runs verification for the
  matching announcement, caches it on success, credits the original
  sender as announce-known, and relays. On still-header-unknown (rare:
  the chain-head fired but the indexed header isn't reachable yet by
  hash) the entry is re-stashed to ride the next chain-head event.
- handler.Start subscribes to ChainHeadEvent and runs
  deferredAnnouncesLoop, which calls drainDeferredAnnouncesFor on each
  imported block. handler.Stop unsubscribes via quitSync.

isScheduledProducer was reordered to check header presence first
regardless of consensus engine. The previous early-return for non-bor
test chains skipped the header check entirely, which was incorrect on
its own (an announce we can't tie to a local block is unverifiable
here) and prevented unit tests from exercising the deferral path. Bor
producer recovery still runs only when a bor engine is present.

Test: TestDeferredSignedAnnounceDrainedAfterHeaderArrives covers the
full lifecycle — announce arrives header-unknown (deferred, not cached,
sender not credited), header lands, drain runs, announcement is now
cached and the deferred entry is consumed.
…ce entries (W-1)

handleGetWitness: bound len(WitnessPages) to MaxWitnessPagesServe. A request
packed with unknown hashes or out-of-range pages accumulates zero data bytes
and trips neither byte guard, while still forcing one DB size lookup per
distinct hash and one response entry per page — a CPU/IO/alloc amplification
vector. Legitimate requests carry a single page, so the bound is never
approached. Also fix an error-string typo.

deferredAnnounceCache: add a per-peer live-entry cap (capacity/8) so a single
peer cannot saturate the deferred-announce queue and evict honest
header-racing announces. The cache is keyed by blockHash, so bounding the
claimed BlockNumber is no defence (an attacker reuses a near-tip number with
distinct fake hashes); the per-peer cap is the bound that holds. Per-peer
accounting is maintained across put/take/gc/evict; add
wit2DeferredPerPeerDropMeter.

Tests: TestHandleGetWitness_PageCountBound, TestDeferredAnnounceCachePerPeerCap.
Fixes the stateless-consumer sync regression where, in an all-WIT2 fleet,
a node always fetches the witness body from an announce-only relayer (no
peer is ever marked as a body-holder) and re-polls it with empty
GetWitness until that relayer obtains the body. WIT1 stays in lockstep
because its hash-announce both implies the sender holds the body and marks
it as a holder, so the first pull lands; WIT2 relays the signed announce
ahead of the body, leaving the consumer to poll.

- handler: record peers that ask for a body we don't yet hold but have a
  BP-signed announcement for (witnessWaiterRegistry, bounded + 30s TTL),
  and push the full witness to them the moment we obtain it. Three triggers
  cover how a node comes to hold a body: our own verified fetch
  (cacheVerifiedWitnessForServing), a gossip broadcast
  (handleWitnessBroadcast), and — the dominant case for full/producing
  nodes — generating it during native block import, flushed from the
  chain-head loop (flushWitnessWaitersForImported). Restores the WIT1-style
  hand-off without flooding: at most one body per peer that actually asked.
- fetcher: route empty ("body not ready yet") responses to a dedicated
  backoff (first retries immediate, then exponential to a 1s cap) instead
  of a tight ~gatherSlack re-poll; never drop the request (the witness
  provably exists) or penalise the responder.

Covered by TestEmptyGetWitnessForSignedHashPushesBodyOnArrival,
TestFlushWitnessWaitersForImportedPushesFromChainStorage, and
TestEmptyResponseBacksOffToAvoidHammering (all fail before the fix).
@lucca30 lucca30 force-pushed the lmartins/wit2-signed-announce branch from b3cb00e to fe20334 Compare June 9, 2026 19:15
@lucca30

lucca30 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: stateless-sync fix + devnet validation (fe20334b7)

While validating this PR on a Kurtosis devnet I found and fixed a stateless-consumer sync regression, and rebased the branch onto current develop (it had been sitting on an older base, so the diff against develop is now smaller/cleaner).

Problem

In an all-WIT2 fleet the witness body is never pushed — both WIT1 and WIT2 are announce-then-pull, and the full-body broadcast path is unused. WIT2 relays the signed announce transitively, ahead of the body, and a relayer is (correctly) only marked announce-known, not body-known. So a stateless consumer always pulls the body from a relayer that doesn't hold it yet → empty response → re-poll until that relayer catches up. WIT1 stays in lockstep because it only announces a witness it already holds (and that announce marks the sender a body-holder), so the first pull lands.

Fix

  • Push the body to waiters. Record a peer that asks for a body we don't yet hold but have a BP-signed announcement for (bounded registry + 30 s TTL), and push the full witness to it the moment we obtain the body. Three triggers cover how a node comes to hold a body: our own verified fetch, a gossip broadcast, and — the dominant case for producing/full nodes — generating it during native block import (flushed from the chain-head loop). Restores the WIT1-style hand-off without flooding: ≤ 1 body per peer that actually asked, exactly what the pull would have cost.
  • Back off empty fetches. Route empty ("body not ready yet") responses to a dedicated backoff (first retries immediate, then exponential to a 1 s cap) instead of a tight re-poll loop; never drop the request (the witness provably exists) or penalise the responder.

Observed gains (Kurtosis devnet, 9-node hop chain, all-WIT2, instrumented build)

Stateless node S1's only peer is the producing validator BP1 (the case the native-import trigger targets):

metric before fix after fix
S1 milestone-lag p50 646 ms 0.9 ms
stateless height vs BP1 (at teardown) 4–6 blocks behind 0–1
S1 blocks served in ≤ 2 fetches 35 / 38 (92 %)

The median-lag collapse (646 ms → 0.9 ms) is only explicable if the body is pushed the instant BP1 imports the block, rather than discovered on a later poll. All stateless nodes end height-locked within 0–1 of BP1, versus 4–6 behind before.

It took two iterations: a first version with only the fetch + broadcast triggers showed no improvement, because S1's peer (a producing node) obtains witnesses by generating them on import — which fired neither hook. The chain-head trigger closed that gap.

Honest residual

A p95 tail of ~1.3–2 s remains, from a few straggler blocks per node (3 of 38 on S1) where the consumer's sole relayer is itself slow to obtain the block — no push can deliver a body the relayer doesn't yet hold. This is partly inherent to single-source stateless sync and is present in pre-regression (develop-era) runs too; it is not the witness-fetch regression. Numbers are from a single short run — the p50 signal is robust, the p95 noisier.

Tests

TestEmptyGetWitnessForSignedHashPushesBodyOnArrival, TestFlushWitnessWaitersForImportedPushesFromChainStorage, and TestEmptyResponseBacksOffToAvoidHammering — each fails before the fix and passes after.

lucca30 added 6 commits June 9, 2026 17:41
…g test

TestReconstructWitness/DuplicatePageMisreconstructs documents that the
multi-page path has no (hash,page) dedup: a duplicated page index satisfies
the TotalPages count with a real page missing and reassembly fires over the
wrong byte stream. Pinned for a future dedup fix; not fixed here (out of
scope). See PR discussion.
Two hardening fixes on the witness broadcast/push path (convergence round 1):

- pushWitnessToWaiters now refuses to full-push a witness whose canonical
  encoding exceeds witnessPushMaxSize (wit message cap minus envelope
  margin). The receiver enforces a 16MB cap on inbound wit messages, so an
  oversized NewWitness push would get this node dropped as a protocol
  violator by the very peers it is trying to help. Oversized witnesses stay
  on the paged pull path; the bytes are servable by the time a push could
  fire, so the waiter's backed-off poll succeeds. New meter:
  eth/wit2/serve/waiter_push_oversize.

- handleWitnessBroadcast is now verify-or-drop. Bytes contradicting a
  BP-signed witnessHash on file are fully rejected (previously they skipped
  the serving cache but were still injected into the block fetcher and the
  sender was marked as a body-holder - a bypass of the byte verification
  the paged-fetch path enforces). With no signed announcement on file the
  broadcast is only accepted for a locally known block header, restoring
  the known-block binding the inline comment promised. AddKnownWitness
  moved to accept paths only. New meter:
  eth/wit2/serve/broadcast_unknown_header_drop.

Tests: TestHandleWitnessBroadcastByteMismatchNotInjected,
TestHandleWitnessBroadcastDropsUnknownHeader,
TestWaiterPushSkipsOversizedWitness.
The verify-or-drop hardening of handleWitnessBroadcast dropped the one
NewWitness push that matters: a stateless consumer at the tip has, by
definition, not imported the block it needs the witness for - so its header
is unknown and the signed announce sits in deferredAnnounces (producer
binding needs the header). Both acceptance paths missed and the pushed body
was discarded, re-opening the stateless lag the push was added to cure
(convergence round 2 catch).

Add an import-only acceptance tier between the two: a broadcast whose bytes
match a fresh deferred commitment is injected into the block fetcher so the
pending block can import - import re-verifies everything via stateless
execution and the state-root check - and the sender is marked a
body-holder. It is NOT cached for serving, NOT promoted into
signedWitnesses, and NOT relayed: those carry the verified-announce trust
property, and a deferred entry's producer is unverified until the
post-import drain checks it against the chain-validated header. Verifying
against the header embedded in the pushed witness instead would let a peer
self-seal a fabricated header and pass its own announce as the producer's.
The deferred entry is read with a non-consuming peek so the drain still
runs. Bytes contradicting the deferred commitment still drop. New meter:
eth/wit2/serve/broadcast_deferred_import_only.

Test: TestHandleWitnessBroadcastAcceptedWhileAnnounceDeferred.
At the stateless tip the deferred state is structural, not transient: a
signed announce cannot be producer-verified before its block imports, the
block cannot import without the witness, and a deferred (unverified)
announce deliberately marks no peer announce-known. getOnePeerWithWitness
therefore has no candidate, and for witnesses above the full-push size cap
there is no NewWitness push either - leaving the consumer with no body
source at all (convergence round 3 catch).

resolveWitnessFetchPeer now falls back to the peer recorded on the fresh
deferred entry (peek also returns the relayer ID): the relayer that
announced the witness is on the propagation path and is exactly who a pull
should target. Its bytes are NOT byte-checked against the deferred
commitment - an unverified announcement must not be able to veto or bless
data, else a Byzantine announce could reject the real witness pages.
Import (stateless execution + state-root check) remains the verifier,
exactly as on every WIT1 fetch.

Test: TestResolveWitnessFetchPeerFallsBackToDeferredAnnouncer.
Every node with an authorized signer (i.e. every validator) signed WIT2
announcements for every block it announced or cosent — including blocks
other validators produced. Receivers bind announce-signer to the header
sealer and strike on mismatch, so on a devnet with all-validator WIT2
peers, honest validators strike-disconnected each other roughly every
20s (observed: 21-32 strike_disconnects per validator per 8min run).
The self-signed foreign announce was also cached in signedWitnesses,
where it can shadow the producer's real announcement (putIfNewer dedups
by block hash) and suppress its transitive relay.

Gate the sign path on the same producer binding the receive side
enforces: signLocalWitnessAnnouncement refuses any block the local
signer did not seal (maySignAnnouncementForBlock, a thin wrapper over
verifyScheduledProducer). For foreign blocks the announce and cosend
paths fall back to the unsigned WIT1 hash announce, which is truthful —
both paths are gated on HasWitness.
Patch coverage 64.75% -> ~91% (codecov target 90%):
- new unit tests for the wit2 caches/registries (rate-limit tracker,
  waiter registry caps/expiry, deferred-announce cache lifecycle,
  signed-witness cache), the announce receive path (accept/dedup/relay,
  rate-limit drop, number-mismatch strike), the cosend and BroadcastBlock
  per-version announce split, the chain-head deferred drain, and the wit
  wire path (signed-announce round-trip, malformed-packet rejection,
  queue-full drop)
- codecov: ignore eth/peer_mock.go (generated gomock, like the other
  ignored mocks)

Sonar new-code duplication 4.9% -> ~2% (threshold 3%):
- shared fixtures in witness_manager_wit2_test.go (primePendingWitness,
  witnessResponse, encodedCommitHash), handler_wit2_test.go
  (persistedSignedWitness, requestFirstWitnessPage), and
  witness_bench_test.go (benchWitnessSizes)

diffguard:
- split handler_wit2.go (988 lines) into handler_wit2.go +
  handler_wit2_bodies.go + handler_wit2_announces.go, all under the
  500-line threshold
- handleWitnessBroadcast (96 lines, complexity 17) extracted into three
  accept-path helpers; deferredAnnounceCache.put eviction scan extracted
  (complexity 13 -> under 10)

No behavioral changes outside test code: the broadcast/eviction
refactors are mechanical extractions covered by the existing and new
tests.
@lucca30 lucca30 marked this pull request as ready for review June 10, 2026 14:16

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

…and file size

- Extract hashWitnessChunks/witnessCommitWorkerCount from WitnessCommitHash
  (complexity 18 -> under threshold); commitment recipe unchanged, pinned by
  the existing shape tests.
- Move the PR-added WIT2 fetcher code (verifyAgainstSignedHash,
  cacheVerifiedWitnessForServing, handleWitnessBodyNotReady, empty-response
  backoff and its constants) into eth/fetcher/witness_manager_wit2.go so
  witness_manager.go stays within the oversized-file growth tolerance.
- Add coverage for the patch lines CI flagged: SignBytes error paths and
  CurrentSigner, WitnessCommitHashFromWitness, the TTL gc sweep of all four
  wit2 caches, drainDeferredAnnouncesFor guards, AddKnownAnnounce, and the
  announce-packet decode failure.
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant