# What the dual-LLM + Slither pipeline caught vs. missed against V12's full finding list — Intuition, read per-file

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Published:** 2026-04-21

This is a companion to the [Intuition benchmark](./audit_pipeline_intuition_benchmark.md)
and the [pipeline how-to](./audit_pipeline_howto.md). The benchmark
reported what the pipeline surfaced. This writeup takes on the harder question:
given V12's complete finding list across the Intuition scope — six
findings total, spread across the main repo and the periphery repo —
which ones did the pipeline rediscover, which did it miss, and what
does the pattern of hits and misses actually tell us?

Of V12's six findings, the pipeline rediscovered **two** (both at a
lower severity than V12 rated them — one by one tier, the other by
two). It missed four, including V12's second Critical on AtomWallet
and all three V12 findings on TrustBonding. It also produced four
additional findings not in V12, which are plausibly novel but
currently unverified against V13 or any other independent auditor.

### V12's full finding list (public in the contest repos since 2026-03-03 and 2026-03-04)

| # | File | V12 severity | V12 title | Pipeline? |
|---|---|---|---|---|
| 1 | AtomWallet | Critical | Unsigned validity window metadata | **Caught** (Claude High) |
| 2 | AtomWallet | Critical | Ownership slot mismatch bricks wallet | Missed |
| 3 | TrustBonding | Medium | Zero-amount claims bypass claimed check | Missed |
| 4 | TrustBonding | Low | No epoch snapshot for reward parameters | Missed |
| 5 | TrustBonding | Low | Division by zero in utilization interpolation | Missed |
| 6 | TrustSwapAndBridgeRouter (periphery) | High | Bridge fee quoted from slippage minimum | **Caught** (Claude Low) |

Two caught, four missed, and the two caught were both rated by the
pipeline below V12's call (High vs Critical on AtomWallet's unsigned
window; Low vs High on the bridge fee). The rest of this writeup is a
per-file breakdown of why each file broke the way it did, on this
corpus.
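
The tally and the tier gaps can be recomputed directly from the table. This is a minimal sketch; the four-step ladder Low < Medium < High < Critical is an assumption chosen to match the "one tier" and "two tiers" wording above, and the names `TIERS`, `FINDINGS` and `tier_gap` are illustrative only.

```python
# Severity ladder, lowest to highest (assumed ordering, see lead-in).
TIERS = ["Low", "Medium", "High", "Critical"]

# (file, V12 severity, pipeline severity or None if missed), copied from the table.
FINDINGS = [
    ("AtomWallet", "Critical", "High"),
    ("AtomWallet", "Critical", None),
    ("TrustBonding", "Medium", None),
    ("TrustBonding", "Low", None),
    ("TrustBonding", "Low", None),
    ("TrustSwapAndBridgeRouter", "High", "Low"),
]


def tier_gap(v12: str, pipeline: str) -> int:
    """Number of tiers the pipeline rated below V12 (negative if above)."""
    return TIERS.index(v12) - TIERS.index(pipeline)


caught = [row for row in FINDINGS if row[2] is not None]
missed = [row for row in FINDINGS if row[2] is None]
gaps = {file: tier_gap(v12, pipeline) for file, v12, pipeline in caught}

print(len(caught), "caught,", len(missed), "missed")
print(gaps)
```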

### Up-front caveats

Repeated because they matter:

1. **The V12 findings lists were public in the Code4rena repos from
   each contest's open date** (main repo opened 2026-03-04 with the
   V12 list committed; periphery sub-repo has V12 findings dated
   2026-03-03). My run was 2026-04-20, ~six weeks later. The models
   almost certainly have or could retrieve these lists; a rediscovery
   on this corpus is not evidence of novel-bug-finding capability.
2. **n=1 contest.** Everything below generalizes from a single public
   benchmark with known ground truth. A blind run on a fresh contest
   is the only way to separate "rediscovery" from "discovery," and
   hasn't happened yet.
3. **The pipeline was tuned on this benchmark.** I ran it once, looked
   at outputs, tweaked `max_tokens` and cache version, ran again. The
   tweaks are small but the benchmark is no longer blind.

With those in place, here's what each file shows.

## 1. `AtomWallet.sol` — one Critical caught, one Critical missed

### What was caught

Both of Claude Opus 4.7's runs independently produced the same finding:

> `validUntil` / `validAfter` are not covered by the ECDSA signature.
> Attacker strips the 12-byte validity-window suffix from a user's
> signed UserOp, re-encodes as a 65-byte signature, and replays the
> UserOp well outside the user's intended window.

Same root cause, same location (`_validateSignature` /
`_extractValidUntilAndValidAfterFromSignature`), same PoC shape as
V12's Critical #1 "Unsigned validity window metadata." Claude rated it
High; V12 rated it Critical. Under C4's rubric (Critical = loss of
user funds under realistic conditions with a permissionless attack), a
smart-contract wallet where any mempool observer can replay your
signed operations well after your intended expiry is plainly
Critical. Claude over-weighted the "preconditions" column as if
mempool observability were a gating factor; on reflection V12's call
is defensible and I'd align with it.
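
The shape of the attack can be reproduced in a toy model. This is a sketch under loud assumptions, not the contract's code: HMAC stands in for ECDSA, the 12-byte suffix is assumed to be two 6-byte big-endian integers in the order `validUntil`, `validAfter`, and a zero bound is assumed to mean "unbounded". The only property carried over from the finding is that the signature covers the operation hash but not the window.

```python
import hashlib
import hmac

SIG_LEN = 65      # bare signature length named in the finding
WINDOW_LEN = 12   # validity-window suffix length named in the finding
KEY = b"toy-owner-key"  # stand-in for the owner's private key


def toy_sign(op_hash: bytes) -> bytes:
    """Stand-in for ECDSA: HMAC-SHA256 padded to 65 bytes. Not real crypto."""
    return hmac.new(KEY, op_hash, hashlib.sha256).digest().ljust(SIG_LEN, b"\x00")


def pack_window(valid_until: int, valid_after: int) -> bytes:
    """Assumed layout: 6 bytes validUntil followed by 6 bytes validAfter."""
    return valid_until.to_bytes(6, "big") + valid_after.to_bytes(6, "big")


def validate(op_hash: bytes, sig_blob: bytes, now: int) -> bool:
    """Flawed check: the window is parsed from the blob but never signed."""
    if len(sig_blob) == SIG_LEN + WINDOW_LEN:
        sig, suffix = sig_blob[:SIG_LEN], sig_blob[SIG_LEN:]
        valid_until = int.from_bytes(suffix[:6], "big")
        valid_after = int.from_bytes(suffix[6:], "big")
    elif len(sig_blob) == SIG_LEN:
        sig, valid_until, valid_after = sig_blob, 0, 0  # 0 = unbounded (assumed)
    else:
        return False
    if not hmac.compare_digest(sig, toy_sign(op_hash)):
        return False
    if valid_after and now < valid_after:
        return False
    if valid_until and now > valid_until:
        return False
    return True


op_hash = hashlib.sha256(b"toy user operation").digest()
signed = toy_sign(op_hash) + pack_window(valid_until=1_000, valid_after=0)
stripped = signed[:SIG_LEN]  # attacker drops the unsigned suffix

expired_rejected = not validate(op_hash, signed, now=2_000)
replay_accepted = validate(op_hash, stripped, now=2_000)
```

In this model the full 77-byte blob is rejected after expiry while the stripped 65-byte blob is accepted, because nothing binds the window to the signed digest. Including the window in the signed hash would make the stripped blob fail signature verification.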

### What was missed

V12's **second** Critical on AtomWallet — "Ownership slot mismatch
bricks wallet" — was not surfaced by either model in any run. The bug
is that AtomWallet splits ownership state across a custom storage
slot and the inherited OZ `Ownable2Step` slots, with `owner()`
conditionally switching its read source based on an `isClaimed` flag.
Different parts of the lifecycle (`initialize`, `transferOwnership`,
`acceptOwnership`, `owner`) read and write *different* slots, so the
post-claim state can end up with `owner()` returning `address(0)` and
every `onlyOwner` path breaking. This bricks the wallet.
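
A toy model makes the lifecycle failure concrete. It is a sketch only: which function touches which slot is an assumption made for illustration, as are the names `ToyWallet`, `custom_owner` and `default_owner`. What it preserves from the description above is the structure, namely that writes land in one slot while the post-claim read comes from another.

```python
ZERO = "0x0"


class ToyWallet:
    """Toy model only: the slot assignments below are assumptions."""

    def __init__(self, default_owner: str):
        self.default_owner = default_owner  # pre-claim fallback owner
        self.custom_owner = ZERO            # custom storage slot
        self.oz_owner = ZERO                # inherited Ownable2Step owner slot
        self.oz_pending = ZERO              # inherited Ownable2Step pending slot
        self.is_claimed = False

    def owner(self) -> str:
        # The read source switches on the lifecycle flag.
        return self.custom_owner if self.is_claimed else self.default_owner

    def transfer_ownership(self, caller: str, new_owner: str) -> None:
        if caller != self.owner():
            raise PermissionError("not owner")
        self.oz_pending = new_owner

    def accept_ownership(self, caller: str) -> None:
        if caller != self.oz_pending:
            raise PermissionError("not pending owner")
        self.oz_owner = caller   # the write lands in the inherited slot
        self.oz_pending = ZERO
        self.is_claimed = True   # owner() now reads the custom slot instead


wallet = ToyWallet("warden")
wallet.transfer_ownership("warden", "alice")
wallet.accept_ownership("alice")
bricked = wallet.owner() == ZERO
```

Each method is locally plausible; only the sequence transfer, accept, then `owner()` exposes that the value written is never the value read.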

It's instructive to ask why the pipeline missed this one while
catching the other. Three features of the missed bug push it away
from what a per-file LLM prompt can see:

- **It requires cross-function state-transition reasoning.** The bug
  only manifests when moving from "pre-claim" to "claimed" mode. You
  have to simulate a lifecycle across four functions plus an
  `isClaimed` flip, not just read one function's logic.
- **It looks like standard OZ inheritance.** Each individual function,
  read in isolation, looks reasonable — it's only the mismatch across
  them that's buggy. An LLM asked "does this function work?" gets a
  locally-correct "yes" on each.
- **The Slither prior didn't flag it.** Slither's detectors don't have
  a "cross-function storage-slot mismatch" rule; its flags are mostly
  single-function issues. Without Slither's wedge, Claude had no
  orienting signal to start the cross-function trace.

V12's writeup on this finding explicitly notes: "Because the bug only
manifests when moving from 'pre-claim' to 'claimed' mode, it is easy
to miss in isolated function review." That's the structural failure
mode. Any per-file prompt-driven pipeline will share this weakness
unless the prompt scaffolding adds explicit cross-function
state-transition analysis.

### Gemini on AtomWallet

Three runs, three different outputs: a truncated mid-sentence
`SIG_VALIDATION_FAILED` claim (factually wrong because the relevant
revert branch is unreachable after the length-77 early-return); a
different AA-UX finding about `onlyOwner + nonReentrant` interacting
badly with ERC-4337 UserOps on `claimAtomWalletDepositFees` and
`transferOwnership`; and finally a clean "no findings" output with
`thinking="medium"`. None reached either V12 Critical. On the file
where V12 had the highest-signal ground truth, Gemini 3 Pro's
contribution was actively incorrect on the first run and near-zero
on the other two.

### Takeaway

On single-function, single-file bugs — "is this input covered by the
digest?" "is this modifier checking what you think?" — a single
Claude run at `thinking="medium"` carried this finding on its own.
Gemini's contribution on this class of file was net-negative in at
least one of three runs. I would not weight Gemini 3 Pro equally on
AA / signature-verification files.

But the half of AtomWallet the pipeline missed — an ownership bug
spanning four functions and a lifecycle flag — is exactly the class
of bug that a per-file LLM prompt is structurally bad at. That's a
pipeline limitation, not a one-off miss. Closing it needs a second
pass whose prompt scaffolds "list every function that reads or writes
`owner()` / `pendingOwner()`, then argue whether the reads match the
writes across the lifecycle." That's a concrete next iteration.

## 2. `ProgressiveCurve.sol` — clean "no findings" with substantive justification

V12 reported nothing on this file. Claude reported nothing on this
file, with ~15 lines of "what I checked" covering rounding direction
on all four ERC-4626 conversion paths, MAX_SHARES / MAX_ASSETS bounds,
the even-slope check, initializer modifier placement, view-only
purity, and storage-slot isolation.

Gemini produced one speculative Low on a `previewWithdraw` edge-case
revert that didn't survive manual review. The reverting path requires
a `(shares, HALF_SLOPE)` combination that doesn't occur under the
production `minDeposit = 1e16` / `minShare = 1e6` floors, and the
revert is arguably the correct UX vs. silent underflow.

### Why the clean output is evidence, not just silence

The "what I checked" content, which the output files under its
high-confidence observations, is the difference. A
"no findings" output with no justification is hard to trust; one with
an explicit checklist of invariants examined lets a reviewer
spot-check. On curve math the right places to look are rounding
direction and bound behavior, and the output names both.
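
For a reviewer spot-checking the rounding item, the ERC-4626 convention is that `previewDeposit` and `previewRedeem` round down while `previewMint` and `previewWithdraw` round up, so every path rounds in the vault's favor. The sketch below checks that on a linear-price toy vault; the linear pricing and the names `ToyVault` and `mul_div` are assumptions for illustration and do not reproduce the progressive curve's math.

```python
def mul_div(a: int, b: int, d: int, round_up: bool) -> int:
    """Integer a*b/d, rounded up or down as requested."""
    q, r = divmod(a * b, d)
    return q + 1 if (round_up and r) else q


class ToyVault:
    """Linear-price vault used only to illustrate rounding direction."""

    def __init__(self, total_assets: int, total_shares: int):
        self.total_assets = total_assets
        self.total_shares = total_shares

    def preview_deposit(self, assets: int) -> int:   # shares minted, round down
        return mul_div(assets, self.total_shares, self.total_assets, False)

    def preview_mint(self, shares: int) -> int:      # assets charged, round up
        return mul_div(shares, self.total_assets, self.total_shares, True)

    def preview_withdraw(self, assets: int) -> int:  # shares burned, round up
        return mul_div(assets, self.total_shares, self.total_assets, True)

    def preview_redeem(self, shares: int) -> int:    # assets paid out, round down
        return mul_div(shares, self.total_assets, self.total_shares, False)


def rounding_favors_vault(vault: ToyVault, amount: int) -> bool:
    """True if no path hands the user more than the opposite path charges."""
    return (
        vault.preview_deposit(amount) <= vault.preview_withdraw(amount)
        and vault.preview_redeem(amount) <= vault.preview_mint(amount)
        and vault.preview_redeem(vault.preview_deposit(amount)) <= amount
    )


vault = ToyVault(total_assets=1_000, total_shares=333)
all_ok = all(rounding_favors_vault(vault, amount) for amount in range(1, 500))
```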

Neither model reasoned about `PCMath.square` / `PCMath.squareUp`
semantics in the math library one level down. The pipeline's per-file
scoping is the reason — the LLM sees the file's public surface and
the Slither output, but not the library's implementation, so it flags
the library as a residual verification item rather than reasoning
about it. This is a structural limitation of the current prompt, not
a one-off miss.

### Takeaway

A clean no-findings output with substantive "what was checked"
content is a useful artifact. It lets a reviewer move on from the
file without re-doing the routine invariant checks. It is not
positive evidence that the file is bug-free — V12 also agreed this
file was clean, so the corpus only shows we *agreed* with V12 on a
file where V12 was clean, not that we'd have caught anything V12
missed.

## 3. `TrustBonding.sol` — 0 of 3 V12 findings caught; 3 different unverified findings produced

This is the file where the pipeline's rediscovery rate against V12
is weakest: V12 has three findings here and the pipeline rediscovered
none of them. Instead the pipeline surfaced three different findings
in the same file, which may be genuine additions, may be false
positives, and are currently unverified against V13 or any other
auditor.

### What V12 had and the pipeline missed

- **Medium — "Zero-amount claims bypass claimed check."** The helper
  `_hasClaimedRewardsForEpoch` infers claimed status from
  `userClaimedRewardsForEpoch[account][epoch] > 0`. If a user's
  computed reward is zero for an epoch (rounding or zero utilization),
  the mapping stays at 0, the helper returns "unclaimed," and a later
  recomputation under different state can yield a second non-zero
  claim for the same epoch. This is a state-invariant bug: the
  protocol's invariant "one claim per epoch per user" is encoded via
  an amount-based proxy that fails on the zero-amount edge.
- **Low — "No epoch snapshot for reward parameters."** Mutable
  governance parameters (`multiVault`, `satelliteEmissionsController`,
  `personalUtilizationLowerBound`) are read at claim time against the
  *current* value, not snapshotted per epoch. Admins can retroactively
  change historical epoch entitlements. This is a governance /
  parameter-upgrade pattern bug.
- **Low — "Division by zero in utilization interpolation."**
  `_getNormalizedUtilizationRatio` divides by a `target` that isn't
  validated as non-zero before use. Upstream state can push `target`
  to zero, DoSing reward-claim paths. This is a single-line
  input-validation bug.
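
The zero-amount bypass in the first item is easy to reproduce in a toy model. This is a sketch, not the contract's logic: `computed_reward` is passed in by the caller to stand for "recomputation under different state", and the `FixedBonding` variant with an explicit claimed set is one possible remediation assumed here for contrast, not V12's recommendation.

```python
class ToyBonding:
    """Claimed status inferred from the stored amount, as described above."""

    def __init__(self):
        self.claimed_amount = {}  # (account, epoch) -> amount; missing means 0

    def has_claimed(self, account: str, epoch: int) -> bool:
        # Amount-based proxy: a zero-amount claim looks like "never claimed".
        return self.claimed_amount.get((account, epoch), 0) > 0

    def claim(self, account: str, epoch: int, computed_reward: int) -> int:
        if self.has_claimed(account, epoch):
            raise RuntimeError("already claimed")
        self.claimed_amount[(account, epoch)] = computed_reward
        return computed_reward


class FixedBonding(ToyBonding):
    """Assumed remediation: track the claim itself, not the amount."""

    def __init__(self):
        super().__init__()
        self.claimed = set()

    def has_claimed(self, account: str, epoch: int) -> bool:
        return (account, epoch) in self.claimed

    def claim(self, account: str, epoch: int, computed_reward: int) -> int:
        paid = super().claim(account, epoch, computed_reward)
        self.claimed.add((account, epoch))
        return paid


buggy = ToyBonding()
first = buggy.claim("alice", 7, 0)    # reward rounds to zero
second = buggy.claim("alice", 7, 50)  # same epoch, state has changed
```

In the buggy model the second call succeeds and pays 50 for an epoch already claimed; in `FixedBonding` the same second call raises.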

What the pipeline surfaced instead:

- **Claude — Critical:** `_balanceOf(user, past_ts)` in `VotingEscrow`
  extrapolates from the user's latest checkpoint even when the
  queried timestamp is before that checkpoint, chaining through
  `userBondedBalanceAtEpochEnd` → `_userEligibleRewardsForEpoch` into
  an emission-share miscount. Unverified against V13.
- **Gemini — High:** "Zero-balance reward eclipse" — a lock whose
  expiration lands exactly on `_epochTimestampEnd(epoch)` evaluates
  to balance=0 at that timestamp. Unverified against V13.
- **Gemini — High:** "JIT reward extraction" —
  `_userEligibleRewardsForEpoch` takes an end-of-epoch snapshot
  rather than time-integrating, letting a last-block locker claim the
  full epoch. Unverified against V13.

### Why V12's findings were missed

The three missed V12 findings span three different bug patterns, and
each is the kind of bug a per-file LLM prompt has trouble with for a
different reason:

- **The zero-amount bypass is an invariant-violation bug.** The bug
  only shows up when you ask "what invariant is this proxy supposed
  to enforce?" and notice that "claimed ↔ amount > 0" fails on the
  zero-amount edge. The pipeline's prompt asks for "exploitable
  vulnerabilities," which biases the LLM toward surface-level code
  flaws rather than invariant reasoning. A purpose-built
  "list every invariant this contract assumes, then probe the edges"
  pass would likely catch this.
- **The epoch-snapshot bug is a parameter-upgrade pattern.** It
  requires reasoning about *admin actions taken in the future* and
  their retroactive effect on past-epoch state. The LLM sees the
  contract in a static snapshot; reasoning about the operational
  lifecycle of governance actions is outside what a per-file prompt
  elicits.
- **The div-by-zero is a single-line oversight.** Slither should
  arguably have flagged this, but its div-by-zero detector keys on
  `divide-by-zero` literal patterns and didn't fire on
  `(delta * ratioRange) / target`. Either an extra Slither pass with
  looser division-safety heuristics, or an explicit "for every division
  in this file, prove the divisor is non-zero" prompt step, would
  close this gap.
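
The "for every division, prove the divisor is non-zero" step needs a list of divisors to feed into the prompt. A crude way to build one is sketched below. It is a regex heuristic, not a Solidity parser and not a Slither feature: it only reports divisors that start with an identifier, so numeric literals are skipped on purpose and parenthesized divisors are missed. The function name `list_divisors` is illustrative.

```python
import re

# A "/" that is not part of "//", "/*", "*/" or "/=", followed by an identifier
# (optionally with member access). Deliberately crude: a checklist generator.
DIVISION = re.compile(r"(?<![/*])/(?![/*=])\s*([A-Za-z_][\w.]*)")


def list_divisors(source: str) -> list[tuple[int, str]]:
    """Return (line number, divisor name) for each identifier divisor found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # drop trailing line comments
        for match in DIVISION.finditer(code):
            hits.append((lineno, match.group(1)))
    return hits


SAMPLE = """\
uint256 ratio = (delta * ratioRange) / target;
uint256 half = total / 2; // constant divisor, skipped
// a / b inside a comment is ignored
uint256 share = amount / state.totalSupply;
"""

divisors = list_divisors(SAMPLE)
print(divisors)
```

Each reported name then becomes one question for the second pass: where is this value validated as non-zero before the division runs?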

All three are plausibly *closable* with targeted prompt additions —
an invariant probe, a division-safety probe, a parameter-upgrade
second pass. Whether those would actually close the misses is an
untested hypothesis; it would take a fresh contest run to check.
None look like "LLMs fundamentally can't do this" failures.

### The unverified additions

Claude's `_balanceOf` Critical and Gemini's two Highs are distinct
findings with concrete PoC sketches. Whether they represent genuine
novel additions, or false positives the pipeline confabulates on
hard-to-reason files, is **not resolvable from this writeup** — it
needs a cross-check against V13 (the post-mitigation report) or a
project-team response. I didn't have V13 open at run time. For an
honest reader, three unverified findings on a file where the pipeline
rediscovered nothing V12 caught is ambiguous at best.

### Takeaway

TrustBonding is the file where the pipeline's limitations are most
visible, both in absolute terms (0/3 V12 rediscovery) and in the
uncertain status of its additions (3 unverified). The underlying
diagnosis is that V12's three findings each require a different
*kind* of reasoning than the prompt's "find exploitable vulnerabilities"
frame induces — invariants, parameter-upgrade lifecycle,
division-safety sweeps. Adding purpose-built sub-prompts for each of
those would likely close most of the gap. Without those, the pipeline's
output on TrustBonding-class files should be treated as exploratory
(worth reading) rather than definitive: the absence of V12-matching
findings does not mean nothing is there.

## 4. `OffsetProgressiveCurve.sol` — clean, and it mirrors #2

Both Claude and Gemini returned zero findings with substantive
"what was checked" sections. V12 also reported nothing. The file is a
coordinate-transform wrapper around `ProgressiveCurve` (a 2D offset
applied in the curve's share-asset plane), so repeat-clean across the
two files is consistent with treating the offset as a sound
transform. This is mild evidence the pipeline is *consistent* across
structurally similar files, not evidence that either file is actually
clean — V12 agreed on both, so we can't distinguish "we caught what
was there" from "we agreed with V12 about there being nothing there."

## 5. `TrustSwapAndBridgeRouter.sol` — V12's High caught by Claude at Low; Gemini added an unverified Medium

### What was caught

Claude's finding on this file matches V12's **High** severity
"Bridge fee quoted from slippage minimum" almost exactly in root
cause and PoC shape:

> `bridgeFee` is quoted via
> `MetaERC20Hub.quoteTransferRemote(recipient, minTrustOut)`, but the
> later `_bridgeTrust` call forwards the *actual* `amountOut` (≥
> `minTrustOut`). If Metalayer's fee scales with amount, the router
> reverts with insufficient `msg.value` whenever
> `amountOut > minTrustOut` — or if fee sufficiency is not strictly
> enforced, subsidy / DoS. Remediation: re-quote `bridgeFee` using
> `amountOut` post-swap.

This is the same bug V12 reports. Claude rated it Low (possibly
Medium, flagged in the finding). V12 rated it High. Under C4's
rubric, a user-controlled lower bound letting a caller underpay
bridge fees while bridging a larger amount — with plausible DoS or
subsidy outcomes — is defensibly High. **Same under-rating pattern
as AtomWallet**: on this corpus the pipeline's two caught findings
are both below V12's severity — AtomWallet by one tier (High vs
Critical), bridge fee by two (Low vs High). Two data points is not a
pattern claim, but the direction is consistent.
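
The mismatch is simple arithmetic once a fee model is fixed. The sketch below assumes a hypothetical fee of 10 basis points of the bridged amount, rounded up; the real Metalayer fee schedule is not stated in the finding, which only conditions on the fee scaling with amount. `quote_fee` and `swap_and_bridge` are illustrative names.

```python
def quote_fee(amount: int, bps: int = 10) -> int:
    """Hypothetical amount-proportional bridge fee, rounded up."""
    return -(-amount * bps // 10_000)


def swap_and_bridge(amount_out: int, min_trust_out: int, requote: bool) -> bool:
    """Return True if the fee set aside covers the fee for the amount bridged."""
    if amount_out < min_trust_out:
        raise ValueError("slippage check failed")
    # Buggy path quotes on the slippage floor; fixed path quotes on the real output.
    quoted_on = amount_out if requote else min_trust_out
    fee_budget = quote_fee(quoted_on)
    fee_needed = quote_fee(amount_out)
    return fee_budget >= fee_needed


covered_buggy = swap_and_bridge(1_000_000, 900_000, requote=False)
covered_fixed = swap_and_bridge(1_000_000, 900_000, requote=True)
```

Under this assumed model the floor-based quote is 900 against a required 1,000, so the budget falls short whenever the swap beats its minimum; re-quoting on `amountOut` closes the gap.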

### The unverified Gemini addition

Gemini's Medium on `deadline = block.timestamp` in the Slipstream
swap structs — "provides no MEV protection, sandwich/delay attacks
stay live under mempool congestion" — is not in V12. On Base today
the MEV surface is small and this would likely be judged a Low or
QA; if the router gets used on higher-MEV L2s it becomes more
defensible. This is an unverified addition to V12, not a
rediscovery.
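
Why `deadline = block.timestamp` gives no protection can be shown in a few lines. The sketch assumes the usual guard form, "revert if `block.timestamp` is greater than `deadline`"; whether Slipstream's check is written exactly this way is an assumption here, and the function names are illustrative.

```python
def deadline_guard_passes(deadline: int, block_timestamp: int) -> bool:
    """Assumed guard: the swap proceeds only while the deadline has not passed."""
    return block_timestamp <= deadline


def executes_with_user_deadline(user_deadline: int, inclusion_time: int) -> bool:
    # The deadline was fixed when the user signed, so late inclusion fails.
    return deadline_guard_passes(user_deadline, inclusion_time)


def executes_with_inline_deadline(inclusion_time: int) -> bool:
    # deadline = block.timestamp is evaluated at inclusion, so it always matches.
    deadline = inclusion_time
    return deadline_guard_passes(deadline, inclusion_time)


late_blocked = not executes_with_user_deadline(user_deadline=100, inclusion_time=500)
inline_always_passes = all(executes_with_inline_deadline(t) for t in (100, 500, 10_000))
```

A deadline computed inside the transaction is compared against itself, so the guard cannot reject a delayed inclusion; only the slippage minimum still bounds the outcome.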

### Takeaway

Two things sit together on this file. First, the pipeline
**rediscovered V12's one finding on the periphery repo** — that's a
hit on this corpus, and the one where the dual-LLM + Slither prompt
produced content that lines up with V12's own PoC. Second, Claude
rated it Low where V12 rated it High — two severity tiers below.
This is the second data point consistent with the under-rating
direction in observation (A). Two data points is not a pattern
claim, but the direction holds on both findings the pipeline caught.

## Reading across six V12 findings and the pipeline's output

Five observations, stated with the bounds the corpus actually
supports.

### (A) Severity under-rating on both caught findings

Both V12 findings the pipeline rediscovered came back at a lower
severity than V12's call. The AtomWallet unsigned-window bug was
Critical (V12) / High (Claude) — one tier down. The bridge-fee bug
was High (V12) / Low-possibly-Medium (Claude) — two tiers. The
common thread: Claude over-weighted a "preconditions" column as if
observing a signed UserOp in the mempool or picking a low
`minTrustOut` were gating factors. Both are trivial in practice.
With n=2 this is a direction, not a pattern claim; a prompt revision
that says "if the attack is permissionless and the exploit
observable on public state, do not demote on preconditions alone"
seems worth testing, and would need more contest runs to validate.

### (B) Cross-function state-transition bugs were a recurring miss

Two of the four missed findings on this corpus (AtomWallet ownership
slot mismatch; TrustBonding epoch-snapshot for reward parameters)
share a common shape: the bug only manifests when you simulate a
*lifecycle* across multiple functions. Per-file prompts ask "does
this function work?" and get locally-correct answers. Nothing in the
prompt scaffolds the reasoning required to notice a read-write
mismatch across four functions and an `isClaimed` flip. A
second-pass prompt that lists every function reading or writing a
given piece of state and asks whether the reads match the writes
across the lifecycle is a plausible next thing to try — whether it
would actually close this class of miss is an open question until
it's run.
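
One way to give that second pass something mechanical to start from is a read/write table per piece of state. The sketch below flags any slot that some function reads but no function writes. The access map is filled in by hand and is hypothetical: the slot names and the assignment of reads and writes to functions are assumptions for illustration, not extracted from the contract.

```python
from collections import defaultdict


def read_write_gaps(access: dict[str, dict[str, set[str]]]) -> dict[str, list[str]]:
    """Map each slot that is read but never written to the functions reading it."""
    written: set[str] = set()
    readers: dict[str, list[str]] = defaultdict(list)
    for function, usage in access.items():
        written |= usage.get("writes", set())
        for slot in usage.get("reads", set()):
            readers[slot].append(function)
    return {
        slot: sorted(functions)
        for slot, functions in readers.items()
        if slot not in written
    }


# Hypothetical, hand-written access map for an ownership lifecycle.
ACCESS = {
    "initialize": {"reads": set(), "writes": {"oz_owner"}},
    "transferOwnership": {"reads": {"custom_owner"}, "writes": {"oz_pending"}},
    "acceptOwnership": {"reads": {"oz_pending"}, "writes": {"oz_owner", "is_claimed"}},
    "owner": {"reads": {"is_claimed", "custom_owner"}, "writes": set()},
}

gaps = read_write_gaps(ACCESS)
print(gaps)
```

A non-empty result does not prove a bug, since a slot may be set by a constructor or a proxy; it is a short list of places where the lifecycle trace should begin.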

### (C) Invariant-violation bugs aren't framed by the current prompt

The TrustBonding zero-amount claim bypass is the clearest example.
The bug only surfaces when you ask "what invariant is this code
trying to encode, and where does the encoding fail?" The current
prompt asks for "exploitable vulnerabilities," which biases toward
surface-level flaws. A dedicated invariant-probe sub-prompt ("list
every invariant the contract depends on; for each, describe the edge
case that would break it") would help.

### (D) Slither's value is real but uneven

On TrustBonding, Slither's asymmetry flag on `_balanceOf` was what
seeded Claude's `_balanceOf` extrapolation finding — real leverage,
even though that finding is currently unverified. On AtomWallet,
Slither's output didn't contain the wedge for the ownership slot
mismatch, because Slither doesn't have a "cross-function storage
slot mismatch" detector. On TrustBonding, Slither's div-by-zero
detector didn't fire on `(delta * ratioRange) / target`, so V12's
div-by-zero Low wasn't wedge-assisted either. On this corpus,
Slither is good leverage for per-file-visible asymmetries and
library-FP filtering; it doesn't cover the bug classes that most
need cross-function reasoning.

### (E) Library false-positive filtering was the most consistent value add on this run

Across the Intuition corpus, both LLMs tagged Slither detectors that
fired on Solady `FixedPointMathLib`, OZ `Initializable`, prb-math,
etc. as library-internal and out-of-scope; I didn't see a
library-FP tag the review would call wrong. Most Slither hits on the
Intuition scope landed on library code, so the pre-filtered
annotation saved meaningful lookup time. "Highly consistent on this
corpus" is what I can say; I don't have numbers beyond this one
contest to generalize further.

## What would raise the confidence floor (revised priorities)

Three next steps, in the order I'd prioritize them given what the
per-file breakdown above actually shows:

1. **Second-pass prompt for cross-function state-transition bugs.**
   On files the first pass tags as "stateful" (e.g., anything with an
   `Ownable` inheritance, `initialize`, `isClaimed`, or an
   `onlyOwner`-gated lifecycle flag), run a second prompt that lists
   every function reading or writing the state variable and asks
   whether the reads match the writes across the lifecycle. Would
   plausibly catch the AtomWallet ownership slot mismatch that V12
   flagged and the pipeline missed.
2. **Invariant-probe sub-prompt.** On files tagged as
   accounting / reward / claim / emission, run a second prompt that
   lists every invariant the contract assumes ("one claim per epoch
   per user," "historical epoch rewards are immutable," etc.) and
   probes the edge cases that would violate each. Would plausibly
   catch the TrustBonding zero-amount claim bypass.
3. **Blind run on a fresh contest.** Everything above is post-hoc
   analysis on a corpus where V12's ground truth has been public for
   ~six weeks. A run on a contest *before* V12 (or the equivalent
   ground-truth list) is posted is the only way to distinguish
   "rediscovery against a known list" from "discovery." I plan to do
   this on whichever Code4rena / Sherlock contest closes next with a
   short enough scope to run in a phase.
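
The routing in the first two steps could be prototyped before any prompt work. The sketch below uses a plain keyword heuristic standing in for the first-pass tagging, which is a swapped-in simplification: the markers come from the examples in the list above, substring matching will over-trigger on some files, and the pass names `state-transition` and `invariant-probe` are illustrative labels.

```python
# Markers taken from the examples in steps 1 and 2; matched case-insensitively.
STATEFUL_MARKERS = ("ownable", "initialize", "isclaimed", "onlyowner")
ACCOUNTING_MARKERS = ("accounting", "reward", "claim", "emission")


def second_passes(source: str) -> list[str]:
    """Pick which second-pass prompts to run for one file's source text."""
    lowered = source.lower()
    passes = []
    if any(marker in lowered for marker in STATEFUL_MARKERS):
        passes.append("state-transition")
    if any(marker in lowered for marker in ACCOUNTING_MARKERS):
        passes.append("invariant-probe")
    return passes


wallet_like = "contract W is Ownable2Step { function initialize() external {} }"
bonding_like = "function claimRewards(uint256 epoch) external {}"
curve_like = "function previewDeposit(uint256 assets) external view {}"

routes = {
    "wallet_like": second_passes(wallet_like),
    "bonding_like": second_passes(bonding_like),
    "curve_like": second_passes(curve_like),
}
print(routes)
```

A file can land in both buckets, which is acceptable for a first cut: the cost of an unnecessary second pass is tokens, while the cost of a skipped one is a missed finding.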

Severity-calibration and auto-PoC-compilation are lower-priority next
steps but are easier to scope. Severity is a one-line prompt change;
auto-PoC is a Foundry harness + error-feedback loop, straightforward
for AtomWallet-class findings and substantially harder for
TrustBonding-class ones.

## What this writeup is and isn't

This is a post-hoc analysis of one pipeline run against V12's complete
finding list on the Intuition scope. It's an argument about *why*
each V12 finding was caught or missed, anchored in the file's
structure and the prompt's scope. It's not a benchmark paper, not a
generalization claim, and not evidence of novel-bug-finding. Anyone
who wants to fork this analysis should (a) treat "pipeline caught X"
as shorthand for "on this corpus, under this prompt, with Claude
Opus 4.7 + Gemini 3 Pro at this point in time" and (b) run the
pipeline blind on their own scope before deciding whether any of the
generalizations in sections (A) through (E) hold for them.

## Corrections to prior material

The QF submission draft's summary table previously characterized
V12's finding on `TrustSwapAndBridgeRouter` as "peripheral Low/Medium
in V12"; V12 actually has a single **High** finding on that file
("Bridge fee quoted from slippage minimum"), with root cause
matching Claude's Low. The same table's AtomWallet row mentioned
only V12's first Critical ("Unsigned validity window metadata") and
omitted V12's second Critical ("Ownership slot mismatch bricks
wallet"), which the pipeline did not surface. The TrustBonding row
previously said "matches V12 area-of-risk; distinct framings," which
on review overstates the match — V12's three findings on
TrustBonding (zero-amount claim bypass; no epoch snapshot for reward
parameters; div-by-zero in utilization interpolation) are distinct
bugs from the pipeline's three, not alternative framings of the
same ones. The QF submission draft has been corrected to reflect all
three.

— `merovan`