# Pipeline vs Zellic V12 — Autonolas Registries cross-check

**Pre-commit timestamp.** Pipeline outputs frozen 2026-04-21 before this comparison was written. IPFS directory CID `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi` (27 files, 165 KB). Nostr kind-1 commit event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a` (propagated to damus.io / nos.lol / primal.net). Those artifacts predate this writeup; any per-finding claim here is verifiable against them.

**What this is.** A per-finding comparison between a dual-LLM + Slither audit pipeline's blind output on the Code4rena `2026-01-olas` autonolas-registries subset (8 files, 2831 nLOC) and the V12 findings file that ships in the contest repo. V12 is Zellic's in-house AI auditor. The contest README marks V12 findings as out-of-scope for C4 rewards. This cross-check is AI-vs-AI, not AI-vs-human-wardens.

**What this is not.** The wardens' final report for 2026-01-olas has not been published. Everything below is preliminary. When the wardens' report lands, the same pipeline outputs will be scored against it in a separate writeup. Any claim here could be revised up or down once wardens weigh in. V12's findings aren't ground truth — they're one AI auditor's read, and both systems are LLM-backed and likely share a failure-mode footprint.

## Scope and scoping decisions

Our pipeline ran against the `autonolas-registries` submodule at commit `be1057a5e37f17f26b13c41311fe0e8e40259484` on 2026-04-21. Eight files, picked for coherence within the pipeline's 1000–5000 LOC envelope:

- `contracts/ServiceManager.sol` (472 nLOC)
- `contracts/ServiceManagerProxy.sol` (79)
- `contracts/multisigs/RecoveryModule.sol` (416)
- `contracts/multisigs/SafeMultisigWithRecoveryModule.sol` (106)
- `contracts/multisigs/PolySafeCreatorWithRecoveryModule.sol` (240)
- `contracts/staking/StakingBase.sol` (1221)
- `contracts/utils/HashCheckpoint.sol` (125)
- `contracts/utils/ComplementaryServiceMetadata.sol` (172)

V12's registries findings file covers the same codebase. Of V12's 11 findings, 10 target a file we scanned; 1 has a split target: part-A (`OperatorSignedHashes._verifySignedHash`) is in `contracts/utils/OperatorSignedHashes.sol`, a file we did NOT include in our 8-file cut, while part-B of that same finding (`getEnableModuleTransactionHash` in `PolySafeCreatorWithRecoveryModule`) IS in scope. We handle this with a split score below: part-A is out-of-scope for the comparison; part-B is scored normally.

## V12 findings inventory

Severity column uses V12's labels verbatim (Medium = M; Qa = Q).
| # | Title (abbreviated) | Sev | Primary target |
|---|---|---|---|
| F1 | Signature/domain binding + validator discovery errors | M | `OperatorSignedHashes` (OOS) + `PolySafeCreatorWithRecoveryModule` |
| F2 | Unprotected initializer allows first caller to seize ownership (+ downstream abuse of `changeImplementation`) | Q | `ServiceManager.initialize` |
| F3 | Inconsistent/missing event emissions after failed external calls | Q | `RecoveryModule`, `ServiceManager.unbondWithSignature`, `StakingBase._claim`, `HashCheckpoint.changeOwner` |
| F4 | Inconsistent hash URI accessors, zero-hash sentinel | Q | `HashCheckpoint.latestHash` + `latestHashURI` |
| F5 | Eviction logic treats `serviceId == 0` as a sentinel | Q | `StakingBase.checkpoint` / `_evict` |
| F6 | ETH path omits bond validation, allows zero-bond agents | Q | `ServiceManager.create` |
| F7 | Invalid `rewardDistributionType` enum -> DoS via `address(0)` call | Q | `StakingBase._stake` + `_getRewardReceiversAndAmounts` |
| F8 | Zero-value `bytes32` encoded as valid CID-like URI | Q | `ComplementaryServiceMetadata.tokenURI`, `HashCheckpoint.latestHash`/`latestHashURI` |
| F9 | Packed `(operator, serviceId)` key truncates serviceId high bits | Q | `ServiceManager.registerAgentsWithSignature` + `unbondWithSignature` |
| F10 | Rounding mismatch: `calculateStakingReward` view under-reports for `eligibleServiceIds[0]` | Q | `StakingBase.calculateStakingReward` |
| F11 | Ignored `execTransaction` return value can silently skip module enablement | Q | `PolySafeCreatorWithRecoveryModule.create` |

1 Medium, 10 Qa. Even for a QA-heavy contest that's a lean finding set — consistent with the codebase already having a Zellic pre-audit file committed before the contest opened, i.e. a well-audited starting point.

## Per-finding comparison

Rating scheme:

- **CATCH** — a pipeline finding points at the same function + the same root cause V12 identified. Severity need not match exactly; if it's in the same family (Low ~ Qa, Medium matches Medium), that's a catch.
- **PARTIAL** — pipeline observed the symptom or affected code path but either (a) dismissed severity / rated it materially lower than V12, (b) covered only some of V12's multi-target sub-items, or (c) got the function right but the root cause wrong.
- **MISS** — pipeline either said nothing about the affected code, or explicitly reasoned that the code is safe in exactly the way V12 says it isn't.

### F1 — Signature/domain binding + validator discovery (M)

Part-A (`OperatorSignedHashes._verifySignedHash`): the file is **out-of-scope** for our run. No assessment.

Part-B (`PolySafeCreatorWithRecoveryModule.getEnableModuleTransactionHash`): V12 argues the contract trusts `computeProxyAddress(owner)` as the canonical Safe address without verifying that the factory uses identical inputs/initializer/salt as the actual deployment. Both our reviews looked at this code and concluded it was fine. Claude's high-confidence observations: "EIP-712 domain separator recomputed with `block.chainid` and the multisig address — consistent with Safe's on-chain domain separator" and "`multisig.code.length > 0` pre-check plus `multisig.codehash == polySafeProxyBytecodeHash` post-check bind the resulting contract to the expected Poly Safe proxy bytecode; substituting a malicious proxy at a counterfactual address is infeasible." Gemini said the EIP-712 chain-id + domain evaluation "directly lines up with the Gnosis Safe `v1.3.0` specification." Both stopped at the assumption that the codehash check is sufficient.

V12's concern is about the *computed* address diverging from the *deployed* address upstream of the codehash check — if the factory's deployment path differs (different salt/initializer) from the path implicit in `computeProxyAddress(owner)`, the signed digest references an address that may never host the intended Safe. Neither reviewer surfaced this failure mode. **MISS** on part-B.
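To make the gap concrete, here is a minimal Solidity sketch of the failure mode V12 describes. It is not the Autonolas code: the contract name, salt derivation, and digest layout are invented, and only the shape of the checks (a locally computed CREATE2 prediction feeding the signed digest, a codehash check running against whatever was actually deployed) mirrors what both reviewers quoted.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Hypothetical sketch, not PolySafeCreatorWithRecoveryModule. Names and
/// encodings are invented; only the check structure is the point.
contract CounterfactualBindingSketch {
    address public immutable proxyFactory;          // hypothetical factory address
    bytes32 public immutable proxyRuntimeCodeHash;  // expected runtime codehash
    bytes32 public immutable proxyCreationCodeHash; // keccak256 of creationCode plus ctor args

    constructor(address factory, bytes32 runtimeHash, bytes32 creationHash) {
        proxyFactory = factory;
        proxyRuntimeCodeHash = runtimeHash;
        proxyCreationCodeHash = creationHash;
    }

    /// Local CREATE2 prediction: the salt here is derived only from `owner`.
    /// If the factory's real deployment path uses a different salt or
    /// initializer, the address it deploys to will not be this one.
    function computeProxyAddress(address owner) public view returns (address) {
        bytes32 salt = keccak256(abi.encode(owner));
        bytes32 create2Hash = keccak256(
            abi.encodePacked(bytes1(0xff), proxyFactory, salt, proxyCreationCodeHash)
        );
        return address(uint160(uint256(create2Hash)));
    }

    /// The digest an owner signs to enable the recovery module. It binds to
    /// the *predicted* address, not to whatever the factory later deploys.
    function getEnableModuleDigest(address owner, address module) external view returns (bytes32) {
        return keccak256(abi.encode(computeProxyAddress(owner), module, block.chainid));
    }

    /// The post-deployment check both reviewers accepted as sufficient: it
    /// constrains the bytecode at `multisig`, but not whether `multisig`
    /// matches the prediction embedded in the digest above.
    function checkDeployed(address multisig) external view returns (bool) {
        return multisig.code.length > 0 && multisig.codehash == proxyRuntimeCodeHash;
    }
}
```

The codehash check constrains what code lives at the deployed address; it does not establish that the deployed address is the one the signature was bound to, which is the divergence V12 flags.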
### F2 — Unprotected initializer (Q)

V12: `ServiceManager.initialize()` has no initializer modifier; the first caller after a proxy is deployed (if the initializer isn't run atomically) becomes the permanent owner. V12 lists `changeImplementation` as a second target — not as a second bug, but as the privileged function the attacker-owner would next use for a total takeover.

Claude, finding #2: "Unprotected `initialize()` — front-run ownership if the proxy is deployed without atomic initialization. Severity: Low." Same function. Same root cause. Same "assume atomic deploy, otherwise an MEV actor takes ownership" risk articulation. Claude also correctly notes the mitigator inside `ServiceManagerProxy` (the proxy's constructor invokes `initialize()` as part of its deployment delegatecall), which is why it rates Low rather than High.

V12 rates this Qa while describing impact as "High / Critical — an attacker who calls initialize() first gains full administrative control of ServiceManager," which suggests V12 downrates to Qa on deployment-pattern-avoidance grounds similar to Claude's reasoning, though V12 doesn't make that chain explicit.

**CATCH** with aligned severity.
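For readers who want the pattern in code, a minimal sketch of this bug class follows, together with the atomic-deployment mitigation both Claude and V12 lean on. The contracts and function bodies are simplified stand-ins, not ServiceManager's actual code.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Simplified stand-in for the unprotected-initializer pattern. Not the
/// ServiceManager implementation; storage layout and upgrade logic are elided.
contract UnprotectedInitSketch {
    address public owner;
    address public serviceRegistry;

    /// First caller wins: there is no access control beyond "not yet set".
    /// If the proxy is deployed in one transaction and initialized in another,
    /// anyone can slot themselves in as owner during the gap.
    function initialize(address registry) external {
        require(owner == address(0), "already initialized");
        owner = msg.sender;
        serviceRegistry = registry;
    }

    /// The privileged follow-up V12 lists as the takeover vector: an
    /// attacker-owner points the proxy at their own implementation.
    function changeImplementation(address newImplementation) external {
        require(msg.sender == owner, "not owner");
        // upgrade logic elided
    }
}

/// The mitigation Claude credits: the proxy constructor delegatecalls
/// initialize() inside the deployment transaction, leaving no front-run window.
contract AtomicInitProxySketch {
    constructor(address implementation, bytes memory initData) {
        // fallback/implementation plumbing elided; only the atomic init matters here
        (bool ok, ) = implementation.delegatecall(initData);
        require(ok, "initialization failed");
    }
}
```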
### F3 — Inconsistent event emissions (Q)

4 sub-targets. Pipeline coverage is mixed:

- `recoverAccess` (RecoveryModule): Gemini's RecoveryModule finding #2 flags the unchecked return value of `execTransactionFromModule`, and explicitly says: "`recoverAccess`, the protocol will bizarrely act as though access was perfectly recovered, emitting an untruthful `AccessRecovered` log, but leaving the user stranded." Same failure pattern V12 describes.
- `unbondWithSignature` (ServiceManager): neither reviewer flags anything about event consistency after the external call here. Claude's ServiceManager high-confidence observations even state that reentrancy-events is a false positive because "Nonces are updated before the external calls in the signature variants." That's addressing reentrancy, not the event-truthfulness concern V12 raises.
- `_claim` (StakingBase): Claude observes "`_claim` zeros `sInfo.reward` before `_withdraw`" as a correctness note but doesn't flag the emitted-event semantics after the external `transferFrom`.
- `changeOwner` (HashCheckpoint): no review flags an event issue here; Claude's high-confidence list says "changeOwner / setBaseURI are properly owner-gated with zero-address / zero-length checks" without addressing event ordering.

1 of 4 sub-targets covered by pipeline. V12 treats F3 as a single finding-unit so any sub-target hit justifies PARTIAL, but 25% coverage is weak. **PARTIAL** (weak).

### F4 — Inconsistent hash URI accessors (Q)

V12: `latestHash` and `latestHashURI` expose the same stored value in two different representations; callers of `latestHash` won't observe `baseURI` updates, so two consumers get different strings for the same address.

Claude: "latestHash for an address that never called checkpoint returns `CID_PREFIX + 64 zero chars` — not a vulnerability, just expected default." That observation notices the zero-default symptom (which is F8) but does not address the baseURI-divergence-between-accessors concern that F4 articulates. Gemini found nothing here. **MISS.**

### F5 — Eviction logic treats `serviceId == 0` as sentinel (Q)

V12: `checkpoint` records candidate eviction entries into a sparse array indexed by slot, but uses 0 as a "real" service ID at the same time. `_evict` then treats 0 as an empty-sentinel and skips it during compaction, while still consuming the caller-supplied eviction count — the mismatch can lead to wrong entries being removed from `setServiceIds` and an ID-0 service escaping eviction.

Claude on StakingBase high-confidence obs: "`setServiceIds` swap-and-pop in `_evict` iterates `numEvictServices..1` using pre-captured indexes, which is safe because indexes are processed in descending order." That's reasoning about index correctness in descending order, not about the sparse-array consumer treating 0 as empty while the producer treats 0 as valid.

Gemini: "`_evict` uses a backward-iterating swap-and-pop to remove multi-target inactive services. Because `serviceIndexes` is strictly monotonically increasing and processed backwards, a swapped element from `totalNumServices` is mathematically guaranteed not to be an index slated for eviction later in the loop." Similar analysis — correct about indexing order, misses the producer/consumer 0-vs-0 disagreement.

Both reviewers explicitly concluded the eviction logic is safe, while V12 argues it isn't. **MISS.**

### F6 — ETH bond-validation omission (Q)

V12: `ServiceManager.create` enforces nonzero agent bonds only when `token != ETH_TOKEN_ADDRESS`; the ETH branch omits that check. Downstream `IService.create` may or may not enforce it; the defensive local check is missing.

Claude's ServiceManager review does not flag this. Its finding #1 is a related but distinct msg.value mismatch in `registerAgentsWithSignature`, not a missing-bond-check in `create`. Gemini's ServiceManager review doesn't flag it either. **MISS.**

### F7 — Invalid `rewardDistributionType` enum (Q)

V12: `_stake` validates the distributor address only when the low 8 bits encode the Custom variant; any uint8 value that isn't one of the 4 enum members is accepted and later dispatched via the "else" branch of `_getRewardReceiversAndAmounts` as Custom, resulting in a call to `address(0)` — DoS on claim.

Claude's StakingBase L-3 is closely adjacent: "Upper bits of `rewardDistributionInfo` are not validated. Shifting right by 8 and casting to `uint160` validates only bits [8..167]. Bits [168..255] are never checked, nor stored in a canonicalized form. Today this is harmless because only the low 168 bits are read back. Any future upgrade that extends the encoding could silently collide with user-supplied garbage."

Different slice of the same pattern — Claude focused on the unused upper 88 bits, V12 on the low 8-bit enum cast. The shared observation is "validation of the packed word is incomplete." Claude didn't connect that to the `address(0)` DoS chain. The two slices (bits 168-255 vs low 8 bits) are disjoint portions of the same packed word; one could reasonably argue Claude's L-3 and V12's F7 are adjacent rather than overlapping. We score this PARTIAL on the grounds that the pipeline identified a missing-validation pattern in the same word V12 flags, while acknowledging that the specific exploit V12 describes (dispatch into the Custom branch → `address(0)` call) is absent from pipeline output. **PARTIAL** (generous read).
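The disagreement is easier to see with the validation and the dispatch side by side. The sketch below is not StakingBase's encoding or function names; it only reproduces the shape V12 describes, with validation keyed to one exact enum value and dispatch keyed to "everything else".

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

interface ICustomDistributor {
    function distribute(uint256 amount) external;
}

/// Hypothetical encoding: low 8 bits select the distribution type, the next
/// 160 bits carry a distributor address. Not the real StakingBase layout.
contract EnumFallthroughSketch {
    enum RewardDistributionType { Staker, Operator, Agent, Custom } // 0..3

    uint256 public rewardDistributionInfo;

    function stake(uint256 info) external {
        uint8 kind = uint8(info);
        address distributor = address(uint160(info >> 8));
        // Validation fires only for the exact Custom value (3). Any kind in
        // 4..255 skips the check even though dispatch below will treat it as
        // Custom, so a zero distributor slips through.
        if (kind == uint8(RewardDistributionType.Custom)) {
            require(distributor != address(0), "zero distributor");
        }
        rewardDistributionInfo = info;
    }

    function claim(uint256 amount) external {
        uint8 kind = uint8(rewardDistributionInfo);
        address distributor = address(uint160(rewardDistributionInfo >> 8));
        if (kind == uint8(RewardDistributionType.Staker)) {
            // pay the staker (elided)
        } else if (kind == uint8(RewardDistributionType.Operator)) {
            // pay the operator (elided)
        } else if (kind == uint8(RewardDistributionType.Agent)) {
            // pay the agent instances (elided)
        } else {
            // Custom *and every invalid kind* land here. With kind >= 4 the
            // distributor was never validated; if it is address(0) this call
            // reverts and claiming is bricked for that configuration.
            ICustomDistributor(distributor).distribute(amount);
        }
    }
}
```

Either rejecting `kind > uint8(RewardDistributionType.Custom)` at stake time or reverting on the unknown branch at claim time would close both Claude's and V12's readings of the incomplete validation.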
### F8 — Zero-value `bytes32` CID-like URI (Q)

2 sub-targets.

- `ComplementaryServiceMetadata.tokenURI`: Claude's high-confidence observations include "Hash validation: Any `bytes32` is accepted (including zero). Since CID prefix encodes fixed multihash length/algo, a zero hash produces a syntactically valid but meaningless CID. Not a protocol-funds issue." The symptom is identified; the severity assessment ("not a protocol-funds issue") is the disagreement point with V12, which rates it Q and articulates metadata-collision / indexer-confusion consequences.
- `HashCheckpoint.latestHash`/`latestHashURI`: Claude's high-confidence observations also include "`latestHash` for an address that never called `checkpoint` returns `CID_PREFIX + 64 zero chars` — not a vulnerability, just expected default." Same pattern: symptom noted, severity dismissed.

Gemini found nothing in either file. Both sub-targets observed by Claude, severity materially under-rated vs V12 ("not a vulnerability" vs Q). **PARTIAL** — symptom-caught, impact-missed.

### F9 — Packed `(operator, serviceId)` key truncation (Q)

V12: `operatorService = uint256(uint160(operator)) | (serviceId << 160)` preserves only 96 bits of `serviceId`. Two serviceIds that differ only in their high bits collide, so their nonces share storage; advancing the shared nonce via a signed op on one service invalidates previously-signed ops on the other.

Claude's ServiceManager high-confidence observations: "operator in lower 160 bits, serviceId shifted by 160 — collision-free as long as serviceId fits in 96 bits (consistent with the uint32 service-id invariant in `IService`)." Claude noticed the shift margin but argued the external invariant makes it safe.

Gemini's ServiceManager high-confidence observations: "Shift logic combining `operator | (serviceId << 160)` uses cleanly separated boolean segments (address is exactly 160 bits long); `serviceId` boundaries safely avoid intra-uint collisions." Gemini asserted safety directly.

V12's argument is that the uint32 invariant Claude cites isn't actually enforced by the packing code itself; the invariant lives outside this file, and enforcing it locally is exactly the class of defensive validation V12 says is missing. Both reviewers identified the exact shift margin V12 is concerned about — and both explicitly reasoned it was safe on grounds that the external uint32 `serviceId` convention would hold. V12 argues the packing code itself should enforce the invariant; Claude and Gemini accepted the invariant as given. No symptom / severity disagreement — a clean opposite conclusion. **MISS.**
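The packing expression itself is the one both reviewers quoted; everything around it in the sketch below (contract name, mapping, the collision helper) is invented for illustration.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Simplified sketch of the (operator, serviceId) nonce keying. Only the
/// packing expression mirrors what the reviews quoted; the rest is invented.
contract PackedKeySketch {
    mapping(uint256 => uint256) public operatorServiceNonces;

    /// serviceId occupies bits [160..255] of the key, i.e. only its low 96
    /// bits survive; anything above bit 95 is shifted out.
    function _packKey(address operator, uint256 serviceId) internal pure returns (uint256) {
        return uint256(uint160(operator)) | (serviceId << 160);
    }

    function consumeNonce(address operator, uint256 serviceId) external {
        operatorServiceNonces[_packKey(operator, serviceId)] += 1;
    }

    /// Two serviceIds that differ only above bit 95 share a key, so advancing
    /// one service's nonce invalidates signed ops pending on the other.
    function collides(address operator, uint256 serviceId) external pure returns (bool) {
        uint256 other;
        unchecked { other = serviceId + (1 << 96); }
        return _packKey(operator, serviceId) == _packKey(operator, other);
    }
}
```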
### F10 — Rounding mismatch in `calculateStakingReward` (Q)

V12: `calculateStakingReward` (the view) applies floor division uniformly, whereas `checkpoint` (the state writer) collects the integer-division remainder and adds it to `eligibleServiceIds[0]`. The view thus under-reports the pending reward for the first eligible service by up to a few wei.

Claude on StakingBase high-confidence observations: "Pro-rata leftover loop in `checkpoint()` correctly delivers the rounding dust to service index 0; sum of distributed rewards equals `lastAvailableRewards` exactly." Claude observed the checkpoint-side rounding dust behavior, but didn't cross-check it against the view function.

Gemini similarly noted: "Reward Math Truncation: In `checkpoint()`, when recalculating `updatedReward` due to protocol reward shortage (`totalRewards > lastAvailableRewards`), truncation losses (up to bounded wei equivalent to `numServices-1`) are collected across execution and specifically appended back locally to the initial subset at index `0`. This preserves `availableRewards` balance integrity to the exact wei." Also checkpoint-only; also didn't cross-check against the view.

The pipeline had the information needed to notice the mismatch (it saw the checkpoint-side behavior and reasoned about it correctly) but didn't compare against the adjacent view. Neither model flagged it. **MISS.**

### F11 — Ignored `execTransaction` return value (Q)

V12: `PolySafeCreatorWithRecoveryModule.create` calls `ISafe.execTransaction(...)` to enable the recovery module on the freshly created Safe and does not check the boolean return value. Safe returns `false` (doesn't revert) on internal failure.

Gemini: "Unchecked Return Value on Safe Execution Allows Silent Evasion of Recovery Module Enablement. Severity: Medium." Same function, same root cause, with the specific attack-vector story (gas-limit grief to starve the nested `enableModule` call) and the concrete "pre-signed `enableModuleSig` is permanently burned" consequence. Gemini rated Medium; V12 rated Q. Both agree it's real. Gemini also opportunistically caught the sibling issue in `RecoveryModule.recoverAccess` and `RecoveryModule.create` (the `execTransactionFromModule` return-value), which is adjacent to F3 above.

Claude's PolySafeCreator review explicitly reasoned through the execution path and concluded safe — partly on grounds that a freshly-deployed Safe passes the codehash check, which rules out a prior module being installed. That argument doesn't address the out-of-gas silent-failure path Gemini / V12 describe.

**CATCH** (Gemini), with a stricter severity than V12. Claude MISS.
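The return-value shape Gemini and V12 both flag, reduced to a sketch. The interface below is a minimal local declaration rather than an import of Safe's contracts, and whether a failed inner call makes `execTransaction` revert or return `false` depends on the Safe's gas/refund parameters; the sketch just shows the dropped boolean and two ways to stop dropping it.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Minimal local interface for the two Safe entry points the sketch needs.
interface ISafeLike {
    function execTransaction(
        address to,
        uint256 value,
        bytes calldata data,
        uint8 operation,
        uint256 safeTxGas,
        uint256 baseGas,
        uint256 gasPrice,
        address gasToken,
        address payable refundReceiver,
        bytes calldata signatures
    ) external payable returns (bool success);

    function isModuleEnabled(address module) external view returns (bool);
}

/// Hypothetical creator, not PolySafeCreatorWithRecoveryModule.
contract UncheckedExecSketch {
    address public immutable recoveryModule;

    constructor(address module) {
        recoveryModule = module;
    }

    /// Vulnerable shape: the returned bool is dropped. If the nested
    /// enableModule call fails (e.g. starved of gas) while execTransaction
    /// itself completes, the pre-signed signature is consumed and the Safe
    /// silently ends up without the recovery module.
    function enableUnchecked(ISafeLike safe, uint256 safeTxGas, bytes calldata signatures) external {
        bytes memory data = abi.encodeWithSignature("enableModule(address)", recoveryModule);
        safe.execTransaction(
            address(safe), 0, data, 0, safeTxGas, 0, 0, address(0), payable(address(0)), signatures
        );
    }

    /// Hardened shape: require the bool, and better yet assert the post-condition.
    function enableChecked(ISafeLike safe, uint256 safeTxGas, bytes calldata signatures) external {
        bytes memory data = abi.encodeWithSignature("enableModule(address)", recoveryModule);
        bool ok = safe.execTransaction(
            address(safe), 0, data, 0, safeTxGas, 0, 0, address(0), payable(address(0)), signatures
        );
        require(ok && safe.isModuleEnabled(recoveryModule), "module enablement failed");
    }
}
```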
## Summary table

| V12 finding | Pipeline rating | Notes |
|---|---|---|
| F1 part-A | OUT-OF-SCOPE | `OperatorSignedHashes` not in our 8-file cut |
| F1 part-B | MISS | Both models reasoned about codehash check; neither surfaced computed-vs-deployed address divergence |
| F2 | CATCH | Claude, Low ~ V12's Q |
| F3 | PARTIAL | 1 of 4 sub-targets covered (Gemini on `recoverAccess`) |
| F4 | MISS | Neither model noticed the `baseURI`-update divergence between accessors |
| F5 | MISS | Both models concluded eviction safe |
| F6 | MISS | `create` ETH branch not flagged |
| F7 | PARTIAL | Same "incomplete packed-word validation" observation, different attack path |
| F8 | PARTIAL | Symptom observed in both sub-targets, severity dismissed ("not a vulnerability") |
| F9 | MISS | Both models explicitly concluded safe; V12 says otherwise |
| F10 | MISS | Pipeline saw checkpoint-side behavior but didn't compare vs view |
| F11 | CATCH | Gemini, Medium > V12's Q |

Scoreline on V12's 11 in-scope finding-units (F1-B plus F2 through F11): **2 catches, 3 partials, 6 misses.**

Gemini's value-add concentrated on ServiceManager, RecoveryModule, PolySafeCreatorWithRecoveryModule, and the high-confidence-obs layer of the ServiceManager + StakingBase reviews. For 4 of the 8 files (ServiceManagerProxy, SafeMultisigWithRecoveryModule, HashCheckpoint, ComplementaryServiceMetadata) Gemini produced no findings. On StakingBase Gemini also produced no findings while Claude produced 3. The catches and partials split across the two models: Gemini took F11 and the F3 partial; Claude took F2 plus the F7 and F8 partials. The models covered different files non-redundantly, which is the design intent of running both, though a natural follow-up question is whether one model's coverage fully subsumes the other at any given file.

Comparison to Phase-5 scoreline on Intuition (same pipeline, 6 V12 findings, no partials reported): 2/6 catches + 4 misses. Strict catch rate went from 33% (Intuition) to 18% (Olas). Lenient rate ((catches + partials) / total) went from 33% to 45% because this run produced 3 items where the pipeline observed the symptom or adjacent pattern but got the severity / exploit chain wrong. Either way the absolute hit rate is modest and both datasets show the severity-under-rating pattern the Phase-5 writeup flagged.

## Pipeline findings not in V12's list

The pipeline produced 6 findings V12 didn't, spread across 3 files. Each is unverified — could be true-positives V12 missed, or false-positives. Listed for transparency:

1. **Claude SM-1**: `registerAgentsWithSignature` doesn't validate `msg.value` in the `isTokenSecured` branch, allowing stranded ETH or locked-wei re-spend. Low. Plausible true-positive; value at risk is tiny (1 wei per agent).
2. **Gemini SM-1**: `registerAgentsWithSignature` bypasses the global `operatorWhitelist` check that `registerAgents` enforces. Medium. Plausible true-positive if the contest's threat model treats the whitelist as protocol-enforced; the code does look asymmetric.
3. **Gemini SM-2**: EIP-712 signatures lack a `deadline` / cancellation mechanism; a service owner holding a stale operator signature can use it later. Medium. Plausible true-positive as a UX / access-control issue; whether it's counted in wardens' corpus may depend on how judges see signature-replay vs signature-reuse. (A minimal sketch of the missing-deadline shape follows this list.)
4. **Gemini RM-1**: `recoverAccess` permanently locks recovery if `msg.sender` is the last processed multisig owner in the linked-list traversal. High. Plausible true-positive; this is a self-inflicted-DoS scenario that depends on specific owner ordering. The warden report will be the arbiter.
5. **Claude SB L-1**: Inactivity accounting stalls while `availableRewards == 0`, then spikes on re-funding — an eviction footgun at the next checkpoint after a pool top-up. Low / UX footgun. Plausible true-positive.
6. **Claude SB L-2**: `_unstake(enforced=true)` returns `reward` although funds were pushed back to `availableRewards`. Integrator-confusion; no fund loss. Plausible true-positive as a return-value-semantics bug; wardens may not have counted it if it was considered too low-impact or documented behavior.

If wardens report similar findings, some of these move into the "catches against wardens" column for the eventual wardens writeup. If not, they land in the false-positive column.
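For item 3, here is a minimal sketch of the missing-deadline shape. The struct fields and names are assumptions for illustration; the real ServiceManager typehashes are not reproduced here.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.19;

/// Illustrative only: invented struct layout, not ServiceManager's typehash.
contract SignatureDeadlineSketch {
    // As flagged: nothing in the signed struct expires, so a stale operator
    // signature stays usable until its nonce happens to be consumed.
    bytes32 public constant UNBOND_TYPEHASH =
        keccak256("Unbond(address operator,uint256 serviceId,uint256 nonce)");

    // Common mitigation shape: include a deadline field and check it on use.
    bytes32 public constant UNBOND_WITH_DEADLINE_TYPEHASH =
        keccak256("Unbond(address operator,uint256 serviceId,uint256 nonce,uint256 deadline)");

    function _requireNotExpired(uint256 deadline) internal view {
        require(block.timestamp <= deadline, "signature expired");
    }
}
```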
## Caveats

1. **V12 is not ground truth.** V12 is an AI auditor with its own failure modes. Both V12 and our pipeline are LLM-backed and may share systematic blind spots. The AI-vs-AI comparison cannot establish absolute detection rates. The wardens' final report is the closer-to-ground-truth reference and is still pending.
2. **Scope subset, not full contest.** Our 8-file cut is ~20% of the contest's 40 in-scope files; V12's findings for `autonolas-governance` and `autonolas-tokenomics` are separately published and not covered here.
3. **Pre-commit only helps on the wardens side.** V12's findings were public in the contest repo before our pipeline ran. The README of the pinned artifact states explicitly that V12 output was NOT fed to the pipeline; any overlap is independent discovery at inference time. However, V12's findings may have entered the models' training data indirectly via public crawls of the contest repo, and we have no way to prove they didn't. The pre-commit timestamp proves pipeline outputs existed before the wardens' report publishes; it doesn't prove the pipeline was insulated from V12's influence during training.
4. **Severity calibration disagreement.** V12's labels on this finding set are 1 Medium + 10 Qa; our pipeline uses High / Medium / Low. When scoring "severity aligned," Low ~ Qa was treated as a match; it is not a precise match in either direction.
5. **Sample of one contest.** Drawing general pipeline-capability conclusions from a single 11-finding comparison would be overclaiming. The eventual wardens' report will raise the denominator and sharpen the scoreline.

## Next artifact

When the Code4rena 2026-01-olas wardens' final report publishes, a companion writeup will score the same pinned pipeline outputs against the wardens' corpus. That comparison is AI-vs-human-judged and is the harder bar. The wardens' report timing is set by the C4 judging queue; typical lag from contest close to published report is weeks to months, and I can't predict it from here.

---

**Artifacts referenced**

- Pipeline raw outputs (pinned 2026-04-21): IPFS `bafybeiduaa37fuzqimqd3473pqkzfgtcvnnzzdhkctkazvygzzuibimihi`
- Nostr pre-commit: event `ec1e0ad24ed85893e9a435d047bfbd9d4b0882ae3ceeb44eee6013fcedceb69a`
- Contest repo: `https://github.com/code-423n4/2026-01-olas`
- V12 registries findings file: `code_423n4_autonolas_v12_registries_v2__main_d4f7d34_findings_2026-01-23-findings.md` (committed to the contest repo 2026-01-24)
- Pipeline toolchain: Claude Opus 4.7 (thinking=medium) + Gemini 3 Pro (thinking=medium) + Slither v0.11.5.