# Pipeline vs Zellic V12 — Autonolas Registries cross-check

**Pre-commit timestamp.** Pipeline outputs frozen 2026-04-21 before this comparison was written. IPFS directory CID `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi` (27 files, 165 KB). Nostr kind-1 commit event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a` (propagated to damus.io / nos.lol / primal.net). Those artifacts predate this writeup; any per-finding claim here is verifiable against them.

**What this is.** A per-finding comparison between a dual-LLM + Slither audit pipeline's blind output on the Code4rena `2026-01-olas` autonolas-registries subset (8 files, 2831 nLOC) and the V12 findings file that ships in the contest repo. V12 is Zellic's in-house AI auditor. The contest README marks V12 findings as out-of-scope for C4 rewards. This cross-check is AI-vs-AI, not AI-vs-human-wardens.

**What this is not.** The wardens' final report for 2026-01-olas has not been published. Everything below is preliminary. When the wardens' report lands, the same pipeline outputs will be scored against it in a separate writeup. Any claim here could be revised up or down once wardens weigh in. V12's findings aren't ground truth — they're one AI auditor's read, and both systems are LLM-backed and likely share a failure-mode footprint.

## Scope and scoping decisions

Our pipeline ran against the `autonolas-registries` submodule at commit `be1057a5e37f17f26b13c41311fe0e8e40259484` on 2026-04-21. Eight files, picked for coherence within the pipeline's 1000–5000 LOC envelope:

- `contracts/ServiceManager.sol` (472 nLOC)
- `contracts/ServiceManagerProxy.sol` (79)
- `contracts/multisigs/RecoveryModule.sol` (416)
- `contracts/multisigs/SafeMultisigWithRecoveryModule.sol` (106)
- `contracts/multisigs/PolySafeCreatorWithRecoveryModule.sol` (240)
- `contracts/staking/StakingBase.sol` (1221)
- `contracts/utils/HashCheckpoint.sol` (125)
- `contracts/utils/ComplementaryServiceMetadata.sol` (172)

V12's registries findings file covers the same codebase. Of V12's 11 findings, 10 target a file we scanned; 1 has a split target: part-A (`OperatorSignedHashes._verifySignedHash`) is in `contracts/utils/OperatorSignedHashes.sol`, a file we did NOT include in our 8-file cut, while part-B of that same finding (`getEnableModuleTransactionHash` in `PolySafeCreatorWithRecoveryModule`) IS in scope. We handle this with a split score below: part-A is out-of-scope for the comparison; part-B is scored normally.

## V12 findings inventory

Severity column uses V12's labels verbatim (Medium = M; Qa = Q).
| # | Title (abbreviated) | Sev | Primary target |
|---|---|---|---|
| F1 | Signature/domain binding + validator discovery errors | M | `OperatorSignedHashes` (OOS) + `PolySafeCreatorWithRecoveryModule` |
| F2 | Unprotected initializer allows first caller to seize ownership (+ downstream abuse of `changeImplementation`) | Q | `ServiceManager.initialize` |
| F3 | Inconsistent/missing event emissions after failed external calls | Q | `RecoveryModule`, `ServiceManager.unbondWithSignature`, `StakingBase._claim`, `HashCheckpoint.changeOwner` |
| F4 | Inconsistent hash URI accessors, zero-hash sentinel | Q | `HashCheckpoint.latestHash` + `latestHashURI` |
| F5 | Eviction logic treats `serviceId == 0` as a sentinel | Q | `StakingBase.checkpoint` / `_evict` |
| F6 | ETH path omits bond validation, allows zero-bond agents | Q | `ServiceManager.create` |
| F7 | Invalid `rewardDistributionType` enum -> DoS via `address(0)` call | Q | `StakingBase._stake` + `_getRewardReceiversAndAmounts` |
| F8 | Zero-value `bytes32` encoded as valid CID-like URI | Q | `ComplementaryServiceMetadata.tokenURI`, `HashCheckpoint.latestHash`/`latestHashURI` |
| F9 | Packed `(operator, serviceId)` key truncates serviceId high bits | Q | `ServiceManager.registerAgentsWithSignature` + `unbondWithSignature` |
| F10 | Rounding mismatch: `calculateStakingReward` view under-reports for `eligibleServiceIds[0]` | Q | `StakingBase.calculateStakingReward` |
| F11 | Ignored `execTransaction` return value can silently skip module enablement | Q | `PolySafeCreatorWithRecoveryModule.create` |

1 Medium, 10 Qa. Even for a QA-heavy contest that's a lean finding set — consistent with the codebase already having a Zellic pre-audit file committed before the contest opened, i.e. a well-audited starting point.

## Per-finding comparison

Rating scheme:

- **CATCH** — a pipeline finding points at the same function + the same root cause V12 identified. Severity need not match exactly; if it's in the same family (Low ~ Qa, Medium matches Medium), that's a catch.
- **PARTIAL** — pipeline observed the symptom or affected code path but either (a) dismissed severity / rated it materially lower than V12, (b) covered only some of V12's multi-target sub-items, or (c) got the function right but the root cause wrong.
- **MISS** — pipeline either said nothing about the affected code, or explicitly reasoned that the code is safe in exactly the way V12 says it isn't.

### F1 — Signature/domain binding + validator discovery (M)

Part-A (`OperatorSignedHashes._verifySignedHash`): the file is **out-of-scope** for our run. No assessment.

Part-B (`PolySafeCreatorWithRecoveryModule.getEnableModuleTransactionHash`): V12 argues the contract trusts `computeProxyAddress(owner)` as the canonical Safe address without verifying that the factory uses identical inputs/initializer/salt as the actual deployment. Both our reviews looked at this code and concluded it was fine. Claude's high-confidence observations: "EIP-712 domain separator recomputed with `block.chainid` and the multisig address — consistent with Safe's on-chain domain separator" and "`multisig.code.length > 0` pre-check plus `multisig.codehash == polySafeProxyBytecodeHash` post-check bind the resulting contract to the expected Poly Safe proxy bytecode; substituting a malicious proxy at a counterfactual address is infeasible." Gemini said the EIP-712 chain-id + domain evaluation "directly lines up with the Gnosis Safe `v1.3.0` specification." Both stopped at the assumption that the codehash check is sufficient.

V12's concern is about the *computed* address diverging from the *deployed* address upstream of the codehash check — if the factory's deployment path differs (different salt/initializer) from the path implicit in `computeProxyAddress(owner)`, the signed digest references an address that may never host the intended Safe. Neither reviewer surfaced this failure mode. **MISS** on part-B.
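To make the gap concrete, here is a minimal Solidity sketch of the failure mode V12 describes. It is not the Autonolas code: the contract name, salt derivation, and digest layout are invented, and only the shape of the checks (a locally computed CREATE2 prediction feeding the signed digest, a codehash check running against whatever was actually deployed) mirrors what both reviewers quoted.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Hypothetical sketch, not PolySafeCreatorWithRecoveryModule. Names and
/// encodings are invented; only the check structure is the point.
contract CounterfactualBindingSketch {
    address public immutable proxyFactory;          // hypothetical factory address
    bytes32 public immutable proxyRuntimeCodeHash;  // expected runtime codehash
    bytes32 public immutable proxyCreationCodeHash; // keccak256 of creationCode plus ctor args

    constructor(address factory, bytes32 runtimeHash, bytes32 creationHash) {
        proxyFactory = factory;
        proxyRuntimeCodeHash = runtimeHash;
        proxyCreationCodeHash = creationHash;
    }

    /// Local CREATE2 prediction: the salt here is derived only from `owner`.
    /// If the factory's real deployment path uses a different salt or
    /// initializer, the address it deploys to will not be this one.
    function computeProxyAddress(address owner) public view returns (address) {
        bytes32 salt = keccak256(abi.encode(owner));
        bytes32 create2Hash = keccak256(
            abi.encodePacked(bytes1(0xff), proxyFactory, salt, proxyCreationCodeHash)
        );
        return address(uint160(uint256(create2Hash)));
    }

    /// The digest an owner signs to enable the recovery module. It binds to
    /// the *predicted* address, not to whatever the factory later deploys.
    function getEnableModuleDigest(address owner, address module) external view returns (bytes32) {
        return keccak256(abi.encode(computeProxyAddress(owner), module, block.chainid));
    }

    /// The post-deployment check both reviewers accepted as sufficient: it
    /// constrains the bytecode at `multisig`, but not whether `multisig`
    /// matches the prediction embedded in the digest above.
    function checkDeployed(address multisig) external view returns (bool) {
        return multisig.code.length > 0 && multisig.codehash == proxyRuntimeCodeHash;
    }
}
```

The codehash check constrains what code lives at the deployed address; it does not establish that the deployed address is the one the signature was bound to, which is the divergence V12 flags.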
### F2 — Unprotected initializer (Q)

V12: `ServiceManager.initialize()` has no initializer modifier; the first caller after a proxy is deployed (if the initializer isn't run atomically) becomes the permanent owner. V12 lists `changeImplementation` as a second target — not as a second bug, but as the privileged function the attacker-owner would next use for a total takeover.

Claude, finding #2: "Unprotected `initialize()` — front-run ownership if the proxy is deployed without atomic initialization. Severity: Low." Same function. Same root cause. Same "assume atomic deploy, otherwise an MEV actor takes ownership" risk articulation. Claude also correctly notes the mitigator inside `ServiceManagerProxy` (the proxy's constructor invokes `initialize()` as part of its deployment delegatecall), which is why it rates Low rather than High.

V12 rates this Qa while describing impact as "High / Critical — an attacker who calls initialize() first gains full administrative control of ServiceManager," which suggests V12 downrates to Qa on deployment-pattern-avoidance grounds similar to Claude's reasoning, though V12 doesn't make that chain explicit.

**CATCH** with aligned severity.
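For readers who want the pattern in code, a minimal sketch of this bug class follows, together with the atomic-deployment mitigation both Claude and V12 lean on. The contracts and function bodies are simplified stand-ins, not ServiceManager's actual code.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Simplified stand-in for the unprotected-initializer pattern. Not the
/// ServiceManager implementation; storage layout and upgrade logic are elided.
contract UnprotectedInitSketch {
    address public owner;
    address public serviceRegistry;

    /// First caller wins: there is no access control beyond "not yet set".
    /// If the proxy is deployed in one transaction and initialized in another,
    /// anyone can slot themselves in as owner during the gap.
    function initialize(address registry) external {
        require(owner == address(0), "already initialized");
        owner = msg.sender;
        serviceRegistry = registry;
    }

    /// The privileged follow-up V12 lists as the takeover vector: an
    /// attacker-owner points the proxy at their own implementation.
    function changeImplementation(address newImplementation) external {
        require(msg.sender == owner, "not owner");
        // upgrade logic elided
    }
}

/// The mitigation Claude credits: the proxy constructor delegatecalls
/// initialize() inside the deployment transaction, leaving no front-run window.
contract AtomicInitProxySketch {
    constructor(address implementation, bytes memory initData) {
        // fallback/implementation plumbing elided; only the atomic init matters here
        (bool ok, ) = implementation.delegatecall(initData);
        require(ok, "initialization failed");
    }
}
```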
### F3 — Inconsistent event emissions (Q)

4 sub-targets. Pipeline coverage is mixed:

- `recoverAccess` (RecoveryModule): Gemini's RecoveryModule finding #2 flags the unchecked return value of `execTransactionFromModule`, and explicitly says: "`recoverAccess`, the protocol will bizarrely act as though access was perfectly recovered, emitting an untruthful `AccessRecovered` log, but leaving the user stranded." Same failure pattern V12 describes.
- `unbondWithSignature` (ServiceManager): neither reviewer flags anything about event consistency after the external call here. Claude's ServiceManager high-confidence observations even state that reentrancy-events is a false positive because "Nonces are updated before the external calls in the signature variants." That's addressing reentrancy, not the event-truthfulness concern V12 raises.
- `_claim` (StakingBase): Claude observes "`_claim` zeros `sInfo.reward` before `_withdraw`" as a correctness note but doesn't flag the emitted-event semantics after the external `transferFrom`.
- `changeOwner` (HashCheckpoint): no review flags an event issue here; Claude's high-confidence list says "changeOwner / setBaseURI are properly owner-gated with zero-address / zero-length checks" without addressing event ordering.

1 of 4 sub-targets covered by pipeline. V12 treats F3 as a single finding-unit so any sub-target hit justifies PARTIAL, but 25% coverage is weak. **PARTIAL** (weak).

### F4 — Inconsistent hash URI accessors (Q)

V12: `latestHash` and `latestHashURI` expose the same stored value in two different representations; callers of `latestHash` won't observe `baseURI` updates, so two consumers get different strings for the same address.

Claude: "latestHash for an address that never called checkpoint returns `CID_PREFIX + 64 zero chars` — not a vulnerability, just expected default." That observation notices the zero-default symptom (which is F8) but does not address the baseURI-divergence-between-accessors concern that F4 articulates. Gemini found nothing here. **MISS.**

### F5 — Eviction logic treats `serviceId == 0` as sentinel (Q)

V12: `checkpoint` records candidate eviction entries into a sparse array indexed by slot, but uses 0 as a "real" service ID at the same time. `_evict` then treats 0 as an empty-sentinel and skips it during compaction, while still consuming the caller-supplied eviction count — the mismatch can lead to wrong entries being removed from `setServiceIds` and an ID-0 service escaping eviction.

Claude on StakingBase high-confidence obs: "`setServiceIds` swap-and-pop in `_evict` iterates `numEvictServices..1` using pre-captured indexes, which is safe because indexes are processed in descending order." That's reasoning about index correctness in descending order, not about the sparse-array consumer treating 0 as empty while the producer treats 0 as valid.

Gemini: "`_evict` uses a backward-iterating swap-and-pop to remove multi-target inactive services. Because `serviceIndexes` is strictly monotonically increasing and processed backwards, a swapped element from `totalNumServices` is mathematically guaranteed not to be an index slated for eviction later in the loop." Similar analysis — correct about indexing order, misses the producer/consumer 0-vs-0 disagreement.

Both reviewers explicitly concluded the eviction logic is safe, while V12 argues it isn't. **MISS.**

### F6 — ETH bond-validation omission (Q)

V12: `ServiceManager.create` enforces nonzero agent bonds only when `token != ETH_TOKEN_ADDRESS`; the ETH branch omits that check. Downstream `IService.create` may or may not enforce it; the defensive local check is missing.

Claude's ServiceManager review does not flag this. Its finding #1 is a related but distinct msg.value mismatch in `registerAgentsWithSignature`, not a missing-bond-check in `create`. Gemini's ServiceManager review doesn't flag it either. **MISS.**

### F7 — Invalid `rewardDistributionType` enum (Q)

V12: `_stake` validates the distributor address only when the low 8 bits encode the Custom variant; any uint8 value that isn't one of the 4 enum members is accepted and later dispatched via the "else" branch of `_getRewardReceiversAndAmounts` as Custom, resulting in a call to `address(0)` — DoS on claim.

Claude's StakingBase L-3 is closely adjacent: "Upper bits of `rewardDistributionInfo` are not validated. Shifting right by 8 and casting to `uint160` validates only bits [8..167]. Bits [168..255] are never checked, nor stored in a canonicalized form. Today this is harmless because only the low 168 bits are read back. Any future upgrade that extends the encoding could silently collide with user-supplied garbage."

Different slice of the same pattern — Claude focused on the unused upper 88 bits, V12 on the low 8-bit enum cast. The shared observation is "validation of the packed word is incomplete." Claude didn't connect that to the `address(0)` DoS chain. The two slices (bits 168-255 vs low 8 bits) are disjoint portions of the same packed word; one could reasonably argue Claude's L-3 and V12's F7 are adjacent rather than overlapping. We score this PARTIAL on the grounds that the pipeline identified a missing-validation pattern in the same word V12 flags, while acknowledging that the specific exploit V12 describes (dispatch into the Custom branch → `address(0)` call) is absent from pipeline output. **PARTIAL** (generous read).
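The disagreement is easier to see with the validation and the dispatch side by side. The sketch below is not StakingBase's encoding or function names; it only reproduces the shape V12 describes, with validation keyed to one exact enum value and dispatch keyed to "everything else".

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

interface ICustomDistributor {
    function distribute(uint256 amount) external;
}

/// Hypothetical encoding: low 8 bits select the distribution type, the next
/// 160 bits carry a distributor address. Not the real StakingBase layout.
contract EnumFallthroughSketch {
    enum RewardDistributionType { Staker, Operator, Agent, Custom } // 0..3

    uint256 public rewardDistributionInfo;

    function stake(uint256 info) external {
        uint8 kind = uint8(info);
        address distributor = address(uint160(info >> 8));
        // Validation fires only for the exact Custom value (3). Any kind in
        // 4..255 skips the check even though dispatch below will treat it as
        // Custom, so a zero distributor slips through.
        if (kind == uint8(RewardDistributionType.Custom)) {
            require(distributor != address(0), "zero distributor");
        }
        rewardDistributionInfo = info;
    }

    function claim(uint256 amount) external {
        uint8 kind = uint8(rewardDistributionInfo);
        address distributor = address(uint160(rewardDistributionInfo >> 8));
        if (kind == uint8(RewardDistributionType.Staker)) {
            // pay the staker (elided)
        } else if (kind == uint8(RewardDistributionType.Operator)) {
            // pay the operator (elided)
        } else if (kind == uint8(RewardDistributionType.Agent)) {
            // pay the agent instances (elided)
        } else {
            // Custom *and every invalid kind* land here. With kind >= 4 the
            // distributor was never validated; if it is address(0) this call
            // reverts and claiming is bricked for that configuration.
            ICustomDistributor(distributor).distribute(amount);
        }
    }
}
```

Either rejecting `kind > uint8(RewardDistributionType.Custom)` at stake time or reverting on the unknown branch at claim time would close both Claude's and V12's readings of the incomplete validation.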
### F8 — Zero-value `bytes32` CID-like URI (Q)

2 sub-targets.

- `ComplementaryServiceMetadata.tokenURI`: Claude's high-confidence observations include "Hash validation: Any `bytes32` is accepted (including zero). Since CID prefix encodes fixed multihash length/algo, a zero hash produces a syntactically valid but meaningless CID. Not a protocol-funds issue." The symptom is identified; the severity assessment ("not a protocol-funds issue") is the disagreement point with V12, which rates it Q and articulates metadata-collision / indexer-confusion consequences.
- `HashCheckpoint.latestHash`/`latestHashURI`: Claude's high-confidence observations also include "`latestHash` for an address that never called `checkpoint` returns `CID_PREFIX + 64 zero chars` — not a vulnerability, just expected default." Same pattern: symptom noted, severity dismissed.

Gemini found nothing in either file. Both sub-targets observed by Claude, severity materially under-rated vs V12 ("not a vulnerability" vs Q). **PARTIAL** — symptom-caught, impact-missed.

### F9 — Packed `(operator, serviceId)` key truncation (Q)

V12: `operatorService = uint256(uint160(operator)) | (serviceId << 160)` preserves only 96 bits of `serviceId`. Two serviceIds that differ only in their high bits collide, so their nonces share storage; advancing the shared nonce via a signed op on one service invalidates previously-signed ops on the other.

Claude's ServiceManager high-confidence observations: "operator in lower 160 bits, serviceId shifted by 160 — collision-free as long as serviceId fits in 96 bits (consistent with the uint32 service-id invariant in `IService`)." Claude noticed the shift margin but argued the external invariant makes it safe.

Gemini's ServiceManager high-confidence observations: "Shift logic combining `operator | (serviceId << 160)` uses cleanly separated boolean segments (address is exactly 160 bits long); `serviceId` boundaries safely avoid intra-uint collisions." Gemini asserted safety directly.

V12's argument is that the uint32 invariant Claude cites isn't actually enforced by the packing code itself; the invariant lives outside this file, and enforcing it locally is exactly the class of defensive validation V12 says is missing. Both reviewers identified the exact shift margin V12 is concerned about — and both explicitly reasoned it was safe on grounds that the external uint32 `serviceId` convention would hold. V12 argues the packing code itself should enforce the invariant; Claude and Gemini accepted the invariant as given. No symptom / severity disagreement — a clean opposite conclusion. **MISS.**
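The packing expression itself is the one both reviewers quoted; everything around it in the sketch below (contract name, mapping, the collision helper) is invented for illustration.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Simplified sketch of the (operator, serviceId) nonce keying. Only the
/// packing expression mirrors what the reviews quoted; the rest is invented.
contract PackedKeySketch {
    mapping(uint256 => uint256) public operatorServiceNonces;

    /// serviceId occupies bits [160..255] of the key, i.e. only its low 96
    /// bits survive; anything above bit 95 is shifted out.
    function _packKey(address operator, uint256 serviceId) internal pure returns (uint256) {
        return uint256(uint160(operator)) | (serviceId << 160);
    }

    function consumeNonce(address operator, uint256 serviceId) external {
        operatorServiceNonces[_packKey(operator, serviceId)] += 1;
    }

    /// Two serviceIds that differ only above bit 95 share a key, so advancing
    /// one service's nonce invalidates signed ops pending on the other.
    function collides(address operator, uint256 serviceId) external pure returns (bool) {
        uint256 other;
        unchecked { other = serviceId + (1 << 96); }
        return _packKey(operator, serviceId) == _packKey(operator, other);
    }
}
```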
### F10 — Rounding mismatch in `calculateStakingReward` (Q)

V12: `calculateStakingReward` (the view) applies floor division uniformly, whereas `checkpoint` (the state writer) collects the integer-division remainder and adds it to `eligibleServiceIds[0]`. The view thus under-reports the pending reward for the first eligible service by up to a few wei.

Claude on StakingBase high-confidence observations: "Pro-rata leftover loop in `checkpoint()` correctly delivers the rounding dust to service index 0; sum of distributed rewards equals `lastAvailableRewards` exactly." Claude observed the checkpoint-side rounding dust behavior, but didn't cross-check it against the view function.

Gemini similarly noted: "Reward Math Truncation: In `checkpoint()`, when recalculating `updatedReward` due to protocol reward shortage (`totalRewards > lastAvailableRewards`), truncation losses (up to bounded wei equivalent to `numServices-1`) are collected across execution and specifically appended back locally to the initial subset at index `0`. This preserves `availableRewards` balance integrity to the exact wei." Also checkpoint-only; also didn't cross-check against the view.

The pipeline had the information needed to notice the mismatch (it saw the checkpoint-side behavior and reasoned about it correctly) but didn't compare against the adjacent view. Neither model flagged it. **MISS.**

### F11 — Ignored `execTransaction` return value (Q)

V12: `PolySafeCreatorWithRecoveryModule.create` calls `ISafe.execTransaction(...)` to enable the recovery module on the freshly created Safe and does not check the boolean return value. Safe returns `false` (doesn't revert) on internal failure.

Gemini: "Unchecked Return Value on Safe Execution Allows Silent Evasion of Recovery Module Enablement. Severity: Medium." Same function, same root cause, with the specific attack-vector story (gas-limit grief to starve the nested `enableModule` call) and the concrete "pre-signed `enableModuleSig` is permanently burned" consequence. Gemini rated Medium; V12 rated Q. Both agree it's real. Gemini also opportunistically caught the sibling issue in `RecoveryModule.recoverAccess` and `RecoveryModule.create` (the `execTransactionFromModule` return-value), which is adjacent to F3 above.

Claude's PolySafeCreator review explicitly reasoned through the execution path and concluded safe — partly on grounds that a freshly-deployed Safe passes the codehash check, which rules out a prior module being installed. That argument doesn't address the out-of-gas silent-failure path Gemini / V12 describe.

**CATCH** (Gemini), with a stricter severity than V12. Claude MISS.
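The return-value shape Gemini and V12 both flag, reduced to a sketch. The interface below is a minimal local declaration rather than an import of Safe's contracts, and whether a failed inner call makes `execTransaction` revert or return `false` depends on the Safe's gas/refund parameters; the sketch just shows the dropped boolean and two ways to stop dropping it.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Minimal local interface for the two Safe entry points the sketch needs.
interface ISafeLike {
    function execTransaction(
        address to,
        uint256 value,
        bytes calldata data,
        uint8 operation,
        uint256 safeTxGas,
        uint256 baseGas,
        uint256 gasPrice,
        address gasToken,
        address payable refundReceiver,
        bytes calldata signatures
    ) external payable returns (bool success);

    function isModuleEnabled(address module) external view returns (bool);
}

/// Hypothetical creator, not PolySafeCreatorWithRecoveryModule.
contract UncheckedExecSketch {
    address public immutable recoveryModule;

    constructor(address module) {
        recoveryModule = module;
    }

    /// Vulnerable shape: the returned bool is dropped. If the nested
    /// enableModule call fails (e.g. starved of gas) while execTransaction
    /// itself completes, the pre-signed signature is consumed and the Safe
    /// silently ends up without the recovery module.
    function enableUnchecked(ISafeLike safe, uint256 safeTxGas, bytes calldata signatures) external {
        bytes memory data = abi.encodeWithSignature("enableModule(address)", recoveryModule);
        safe.execTransaction(
            address(safe), 0, data, 0, safeTxGas, 0, 0, address(0), payable(address(0)), signatures
        );
    }

    /// Hardened shape: require the bool, and better yet assert the post-condition.
    function enableChecked(ISafeLike safe, uint256 safeTxGas, bytes calldata signatures) external {
        bytes memory data = abi.encodeWithSignature("enableModule(address)", recoveryModule);
        bool ok = safe.execTransaction(
            address(safe), 0, data, 0, safeTxGas, 0, 0, address(0), payable(address(0)), signatures
        );
        require(ok && safe.isModuleEnabled(recoveryModule), "module enablement failed");
    }
}
```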
## Summary table

| V12 finding | Pipeline rating | Notes |
|---|---|---|
| F1 part-A | OUT-OF-SCOPE | `OperatorSignedHashes` not in our 8-file cut |
| F1 part-B | MISS | Both models reasoned about codehash check; neither surfaced computed-vs-deployed address divergence |
| F2 | CATCH | Claude, Low ~ V12's Q |
| F3 | PARTIAL | 1 of 4 sub-targets covered (Gemini on `recoverAccess`) |
| F4 | MISS | Neither model noticed the `baseURI`-update divergence between accessors |
| F5 | MISS | Both models concluded eviction safe |
| F6 | MISS | `create` ETH branch not flagged |
| F7 | PARTIAL | Same "incomplete packed-word validation" observation, different attack path |
| F8 | PARTIAL | Symptom observed in both sub-targets, severity dismissed ("not a vulnerability") |
| F9 | MISS | Both models explicitly concluded safe; V12 says otherwise |
| F10 | MISS | Pipeline saw checkpoint-side behavior but didn't compare vs view |
| F11 | CATCH | Gemini, Medium > V12's Q |

Scoreline on V12's 11 in-scope finding-units (F1-B plus F2 through F11): **2 catches, 3 partials, 6 misses.**

Gemini's value-add concentrated on ServiceManager, RecoveryModule, PolySafeCreatorWithRecoveryModule, and the high-confidence-obs layer of the ServiceManager + StakingBase reviews. For 4 of the 8 files (ServiceManagerProxy, SafeMultisigWithRecoveryModule, HashCheckpoint, ComplementaryServiceMetadata) Gemini produced no findings. On StakingBase Gemini also produced no findings while Claude produced 3. The catches and partials split across the two models: Gemini took F11 and the F3 partial; Claude took F2 plus the F7 and F8 partials. The models covered different files non-redundantly, which is the design intent of running both, though a natural follow-up question is whether one model's coverage fully subsumes the other at any given file.

Comparison to Phase-5 scoreline on Intuition (same pipeline, 6 V12 findings, no partials reported): 2/6 catches + 4 misses. Strict catch rate went from 33% (Intuition) to 18% (Olas). Lenient rate ((catches + partials) / total) went from 33% to 45% because this run produced 3 items where the pipeline observed the symptom or adjacent pattern but got the severity / exploit chain wrong. Either way the absolute hit rate is modest and both datasets show the severity-under-rating pattern the Phase-5 writeup flagged.

## Pipeline findings not in V12's list

The pipeline produced 6 findings V12 didn't, spread across 3 files. Each is unverified — could be true-positives V12 missed, or false-positives. Listed for transparency:

1. **Claude SM-1**: `registerAgentsWithSignature` doesn't validate `msg.value` in the `isTokenSecured` branch, allowing stranded ETH or locked-wei re-spend. Low. Plausible true-positive; value at risk is tiny (1 wei per agent).
2. **Gemini SM-1**: `registerAgentsWithSignature` bypasses the global `operatorWhitelist` check that `registerAgents` enforces. Medium. Plausible true-positive if the contest's threat model treats the whitelist as protocol-enforced; the code does look asymmetric.
3. **Gemini SM-2**: EIP-712 signatures lack a `deadline` / cancellation mechanism; a service owner holding a stale operator signature can use it later. Medium. Plausible true-positive as a UX / access-control issue; whether it's counted in wardens' corpus may depend on how judges see signature-replay vs signature-reuse. (A minimal sketch of the missing-deadline shape follows this list.)
4. **Gemini RM-1**: `recoverAccess` permanently locks recovery if `msg.sender` is the last processed multisig owner in the linked-list traversal. High. Plausible true-positive; this is a self-inflicted-DoS scenario that depends on specific owner ordering. The warden report will be the arbiter.
5. **Claude SB L-1**: Inactivity accounting stalls while `availableRewards == 0`, then spikes on re-funding — an eviction footgun at the next checkpoint after a pool top-up. Low / UX footgun. Plausible true-positive.
6. **Claude SB L-2**: `_unstake(enforced=true)` returns `reward` although funds were pushed back to `availableRewards`. Integrator-confusion; no fund loss. Plausible true-positive as a return-value-semantics bug; wardens may not have counted it if it was considered too low-impact or documented behavior.

If wardens report similar findings, some of these move into the "catches against wardens" column for the eventual wardens writeup. If not, they land in the false-positive column.
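For item 3, here is a minimal sketch of the missing-deadline shape. The struct fields and names are assumptions for illustration; the real ServiceManager typehashes are not reproduced here.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Illustrative only: invented struct layout, not ServiceManager's typehash.
contract SignatureDeadlineSketch {
    // As flagged: nothing in the signed struct expires, so a stale operator
    // signature stays usable until its nonce happens to be consumed.
    bytes32 public constant UNBOND_TYPEHASH =
        keccak256("Unbond(address operator,uint256 serviceId,uint256 nonce)");

    // Common mitigation shape: include a deadline field and check it on use.
    bytes32 public constant UNBOND_WITH_DEADLINE_TYPEHASH =
        keccak256("Unbond(address operator,uint256 serviceId,uint256 nonce,uint256 deadline)");

    function _requireNotExpired(uint256 deadline) internal view {
        require(block.timestamp <= deadline, "signature expired");
    }
}
```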
## Caveats

1. **V12 is not ground truth.** V12 is an AI auditor with its own failure modes. Both V12 and our pipeline are LLM-backed and may share systematic blind spots. The AI-vs-AI comparison cannot establish absolute detection rates. The wardens' final report is the closer-to-ground-truth reference and is still pending.
2. **Scope subset, not full contest.** Our 8-file cut is ~20% of the contest's 40 in-scope files; V12's findings for `autonolas-governance` and `autonolas-tokenomics` are separately published and not covered here.
3. **Pre-commit only helps on the wardens side.** V12's findings were public in the contest repo before our pipeline ran. The README of the pinned artifact states explicitly that V12 output was NOT fed to the pipeline; any overlap is independent discovery at inference time. However, V12's findings may have entered the models' training data indirectly via public crawls of the contest repo, and we have no way to prove they didn't. The pre-commit timestamp proves pipeline outputs existed before the wardens' report publishes; it doesn't prove the pipeline was insulated from V12's influence during training.
4. **Severity calibration disagreement.** V12's labels on this finding set are 1 Medium + 10 Qa; our pipeline uses High / Medium / Low. When scoring "severity aligned," Low ~ Qa was treated as a match; it is not a precise match in either direction.
5. **Sample of one contest.** Drawing general pipeline-capability conclusions from a single 11-finding comparison would be overclaiming. The eventual wardens' report will raise the denominator and sharpen the scoreline.

## Next artifact

When the Code4rena 2026-01-olas wardens' final report publishes, a companion writeup will score the same pinned pipeline outputs against the wardens' corpus. That comparison is AI-vs-human-judged and is the harder bar. The wardens' report timing is set by the C4 judging queue; typical lag from contest close to published report is weeks to months, and I can't predict it from here.

---

**Artifacts referenced**

- Pipeline raw outputs (pinned 2026-04-21): IPFS `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi`
- Nostr pre-commit: event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a`
- Contest repo: `https://github.com/code-423n4/2026-01-olas`
- V12 registries findings file: `code_423n4_autonolas_v12_registries_v2__main_d4f7d34_findings_2026-01-23-findings.md` (committed to the contest repo 2026-01-24)
- Pipeline toolchain: Claude Opus 4.7 (thinking=medium) + Gemini 3 Pro (thinking=medium) + Slither v0.11.5.