# What the dual-LLM + Slither pipeline caught vs. missed against V12's full finding list: Intuition, read per-file

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Published:** 2026-04-21

This is a companion to the [Intuition benchmark](./audit_pipeline_intuition_benchmark.md) and the [pipeline how-to](./audit_pipeline_howto.md). The benchmark reported what the pipeline surfaced. This writeup takes on the harder question: given V12's complete finding list across the Intuition scope (six findings total, spread across the main repo and the periphery repo), which ones did the pipeline rediscover, which did it miss, and what does the pattern of hits and misses actually tell us?

Of V12's six findings, the pipeline rediscovered **two**, both at a lower severity than V12 rated them (one by one tier, the other by two). It missed four, including V12's second Critical on AtomWallet and all three V12 findings on TrustBonding. It also produced four additional findings not in V12, which are plausibly novel but currently unverified against V13 or any other independent auditor.

### V12's full finding list (public in the contest repos since 2026-03-03 and 2026-03-04)

| # | File | V12 severity | V12 title | Pipeline? |
|---|---|---|---|---|
| 1 | AtomWallet | Critical | Unsigned validity window metadata | **Caught** (Claude High) |
| 2 | AtomWallet | Critical | Ownership slot mismatch bricks wallet | Missed |
| 3 | TrustBonding | Medium | Zero-amount claims bypass claimed check | Missed |
| 4 | TrustBonding | Low | No epoch snapshot for reward parameters | Missed |
| 5 | TrustBonding | Low | Division by zero in utilization interpolation | Missed |
| 6 | TrustSwapAndBridgeRouter (periphery) | High | Bridge fee quoted from slippage minimum | **Caught** (Claude Low) |

Two caught, four missed, and the two caught were both rated by the pipeline below V12's call (High vs Critical on AtomWallet's unsigned window; Low vs High on the bridge fee).
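
The tally and the tier gaps can be recomputed directly from the table. A minimal Python sketch, assuming a four-step severity ladder (Low < Medium < High < Critical); the ladder and the tuple layout are illustrative choices, not part of the pipeline:

```python
# Severity ladder assumed for illustration: Low < Medium < High < Critical.
TIERS = ["Low", "Medium", "High", "Critical"]

# (finding number, file, V12 severity, pipeline severity or None if missed),
# transcribed from the table above.
FINDINGS = [
    (1, "AtomWallet", "Critical", "High"),
    (2, "AtomWallet", "Critical", None),
    (3, "TrustBonding", "Medium", None),
    (4, "TrustBonding", "Low", None),
    (5, "TrustBonding", "Low", None),
    (6, "TrustSwapAndBridgeRouter", "High", "Low"),
]


def tier_gap(v12: str, pipeline: str) -> int:
    """Number of tiers the pipeline rated below V12 (negative would mean above)."""
    return TIERS.index(v12) - TIERS.index(pipeline)


caught = [f for f in FINDINGS if f[3] is not None]
missed = [f for f in FINDINGS if f[3] is None]
gaps = {num: tier_gap(v12, pipe) for num, _, v12, pipe in caught}

print(len(caught), "caught,", len(missed), "missed")
print(gaps)  # finding number -> tiers below V12
```

Running it gives two caught, four missed, and gaps of one tier for finding 1 and two tiers for finding 6.
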
The rest of this writeup is a per-file breakdown of why each file broke the way it did, on this corpus.

### Up-front caveats

Repeated because they matter:

1. **The V12 findings lists were public in the Code4rena repos from each contest's open date** (main repo opened 2026-03-04 with the V12 list committed; periphery sub-repo has V12 findings dated 2026-03-03). My run was 2026-04-20, ~six weeks later. The models almost certainly have seen or could retrieve these lists; a rediscovery on this corpus is not evidence of novel-bug-finding capability.
2. **n=1 contest.** Everything below generalizes from a single public benchmark with known ground truth. A blind run on a fresh contest is the only way to separate "rediscovery" from "discovery," and hasn't happened yet.
3. **The pipeline was tuned on this benchmark.** I ran it once, looked at outputs, tweaked `max_tokens` and cache version, ran again. The tweaks are small but the benchmark is no longer blind.

With those in place, here's what each file shows.

## 1. `AtomWallet.sol`: one Critical caught, one Critical missed

### What was caught

Both of Claude Opus 4.7's runs independently produced the same finding:

> `validUntil` / `validAfter` are not covered by the ECDSA signature.
> Attacker strips the 12-byte validity-window suffix from a user's
> signed UserOp, re-encodes as a 65-byte signature, and replays the
> UserOp well outside the user's intended window.

Same root cause, same location (`_validateSignature` / `_extractValidUntilAndValidAfterFromSignature`), same PoC shape as V12's Critical #1 "Unsigned validity window metadata." Claude rated it High; V12 rated it Critical. Under C4's rubric (Critical = loss of user funds under realistic conditions with a permissionless attack), a smart-contract wallet where any mempool observer can replay your signed operations well after your intended expiry is plainly Critical.
Claude over-weighted the "preconditions" column as if mempool observability were a gating factor; on reflection V12's call is defensible and I'd align with it.

### What was missed

V12's **second** Critical on AtomWallet, "Ownership slot mismatch bricks wallet," was not surfaced by either model in any run. The bug is that AtomWallet splits ownership state across a custom storage slot and the inherited OZ `Ownable2Step` slots, with `owner()` conditionally switching its read source based on an `isClaimed` flag. Different parts of the lifecycle (`initialize`, `transferOwnership`, `acceptOwnership`, `owner`) read and write *different* slots, so the post-claim state can end up with `owner()` returning `address(0)` and every `onlyOwner` path breaking. This bricks the wallet.

It's instructive to ask why the pipeline missed this one while catching the other. Three features of the missed bug push it away from what a per-file LLM prompt can see:

- **It requires cross-function state-transition reasoning.** The bug only manifests when moving from "pre-claim" to "claimed" mode. You have to simulate a lifecycle across four functions plus an `isClaimed` flip, not just read one function's logic.
- **It looks like standard OZ inheritance.** Each individual function, read in isolation, looks reasonable; it's only the mismatch across them that's buggy. An LLM asked "does this function work?" gets a locally correct "yes" on each.
- **The Slither prior didn't flag it.** Slither's detectors don't have a "cross-function storage-slot mismatch" rule; its flags are mostly single-function issues. Without Slither's wedge, Claude had no orienting signal to start the cross-function trace.

V12's writeup on this finding explicitly notes: "Because the bug only manifests when moving from 'pre-claim' to 'claimed' mode, it is easy to miss in isolated function review." That's the structural failure mode.
Any per-file prompt-driven pipeline will share this weakness unless the prompt scaffolding adds explicit cross-function state-transition analysis.

### Gemini on AtomWallet

Three runs, three different outputs: a truncated mid-sentence `SIG_VALIDATION_FAILED` claim (factually wrong because the relevant revert branch is unreachable after the length-77 early-return); a different AA-UX finding about `onlyOwner + nonReentrant` interacting badly with ERC-4337 UserOps on `claimAtomWalletDepositFees` and `transferOwnership`; and finally a clean "no findings" output with `thinking="medium"`. None reached either V12 Critical. On the file where V12 had the highest-signal ground truth, Gemini 3 Pro's contribution was actively incorrect on one run and near-zero on the other two.

### Takeaway

On single-function, single-file bugs ("is this input covered by the digest?" "is this modifier checking what you think?"), a single Claude run at `thinking="medium"` carried this finding on its own. Gemini's contribution on this class of file was net-negative in at least one of three runs. I would not weight Gemini 3 Pro equally on AA / signature-verification files.

But the half of AtomWallet the pipeline missed (an ownership bug spanning four functions and a lifecycle flag) is exactly the class of bug that a per-file LLM prompt is structurally bad at. That's a pipeline limitation, not a one-off miss. Closing it needs a second pass whose prompt scaffolds "list every function that reads or writes `owner()` / `pendingOwner()`, then argue whether the reads match the writes across the lifecycle." That's a concrete next iteration.

## 2. `ProgressiveCurve.sol`: clean "no findings" with substantive justification

V12 reported nothing on this file.
Claude reported nothing on this file, with ~15 lines of "what I checked" covering rounding direction on all four ERC-4626 conversion paths, MAX_SHARES / MAX_ASSETS bounds, the even-slope check, initializer modifier placement, view-only purity, and storage-slot isolation. Gemini produced one speculative Low on a `previewWithdraw` edge-case revert that didn't survive manual review. The reverting path requires a `(shares, HALF_SLOPE)` combination that doesn't occur under the production `minDeposit = 1e16` / `minShare = 1e6` floors, and the revert is arguably the correct UX vs. silent underflow.

### Why the clean output is evidence, not just silence

The high-confidence observations section is the difference. A "no findings" output with no justification is hard to trust; one with an explicit checklist of invariants examined lets a reviewer spot-check. On curve math the right places to look are rounding direction and bound behavior, and the output names both.

Neither model reasoned about `PCMath.square` / `PCMath.squareUp` semantics in the math library one level down. The pipeline's per-file scoping is the reason: the LLM sees the file's public surface and the Slither output, but not the library's implementation, so it flags the library as a residual verification item rather than reasoning about it. This is a structural limitation of the current prompt, not a one-off miss.

### Takeaway

A clean no-findings output with substantive "what was checked" content is a useful artifact. It lets a reviewer move on from the file without re-doing the routine invariant checks. It is not positive evidence that the file is bug-free: V12 also agreed this file was clean, so the corpus only shows we *agreed* with V12 on a file where V12 was clean, not that we'd have caught anything V12 missed.

## 3. `TrustBonding.sol`: 0 of 3 V12 findings caught; 3 different unverified findings produced

This is the file where the pipeline's rediscovery rate against V12 is weakest: V12 has three findings here and the pipeline rediscovered none of them. Instead the pipeline surfaced three different findings in the same file, which may be genuine additions, may be false positives, and are currently unverified against V13 or any other auditor.

### What V12 had and the pipeline missed

- **Medium: "Zero-amount claims bypass claimed check."** The helper `_hasClaimedRewardsForEpoch` infers claimed status from `userClaimedRewardsForEpoch[account][epoch] > 0`. If a user's computed reward is zero for an epoch (rounding or zero utilization), the mapping stays at 0, the helper returns "unclaimed," and a later recomputation under different state can yield a second non-zero claim for the same epoch. This is a state-invariant bug: the protocol's invariant "one claim per epoch per user" is encoded via an amount-based proxy that fails on the zero-amount edge.
- **Low: "No epoch snapshot for reward parameters."** Mutable governance parameters (`multiVault`, `satelliteEmissionsController`, `personalUtilizationLowerBound`) are read at claim time against the *current* value, not snapshotted per epoch. Admins can retroactively change historical epoch entitlements. This is a governance / parameter-upgrade pattern bug.
- **Low: "Division by zero in utilization interpolation."** `_getNormalizedUtilizationRatio` divides by a `target` that isn't validated as non-zero before use. Upstream state can push `target` to zero, DoSing reward-claim paths. This is a single-line input-validation bug.

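
The zero-amount bypass is easiest to see in a toy model. The sketch below is Python, not the contract's Solidity; the class and method names are hypothetical, and only the `> 0` proxy is taken from V12's description. It contrasts an amount-based "claimed" check with an explicit per-epoch flag, which is one possible fix and not necessarily the project's remediation.

```python
class AmountProxyLedger:
    """Toy model: 'claimed' is inferred from the stored amount being > 0."""

    def __init__(self):
        # (account, epoch) -> amount recorded at claim time
        self.claimed_amount = {}

    def has_claimed(self, account, epoch):
        return self.claimed_amount.get((account, epoch), 0) > 0

    def claim(self, account, epoch, reward):
        if self.has_claimed(account, epoch):
            raise ValueError("already claimed for this epoch")
        self.claimed_amount[(account, epoch)] = reward
        return reward


class ExplicitFlagLedger(AmountProxyLedger):
    """Same ledger, but claimed status is tracked separately from the amount."""

    def __init__(self):
        super().__init__()
        self.claimed_flag = set()

    def has_claimed(self, account, epoch):
        return (account, epoch) in self.claimed_flag

    def claim(self, account, epoch, reward):
        paid = super().claim(account, epoch, reward)  # uses the overridden check
        self.claimed_flag.add((account, epoch))
        return paid


def second_claim_succeeds(ledger):
    """Claim a zero reward first, then retry the same epoch with a non-zero one."""
    ledger.claim("alice", 7, 0)
    try:
        ledger.claim("alice", 7, 50)
    except ValueError:
        return False
    return True


print(second_claim_succeeds(AmountProxyLedger()))   # proxy lets the retry through
print(second_claim_succeeds(ExplicitFlagLedger()))  # explicit flag rejects it
```

In the proxy model a zero-reward claim leaves no trace, so the retry goes through; with the explicit flag the same sequence is rejected. That difference is the "one claim per epoch per user" invariant the amount-based check fails to encode.
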

What the pipeline surfaced instead:

- **Claude, Critical:** `_balanceOf(user, past_ts)` in `VotingEscrow` extrapolates from the user's latest checkpoint even when the queried timestamp is before that checkpoint, chaining through `userBondedBalanceAtEpochEnd` → `_userEligibleRewardsForEpoch` into an emission-share miscount. Unverified against V13.
- **Gemini, High:** "Zero-balance reward eclipse": a lock whose expiration lands exactly on `_epochTimestampEnd(epoch)` evaluates to balance=0 at that timestamp. Unverified against V13.
- **Gemini, High:** "JIT reward extraction": `_userEligibleRewardsForEpoch` takes an end-of-epoch snapshot rather than time-integrating, letting a last-block locker claim the full epoch. Unverified against V13.

### Why V12's findings were missed

The three missed V12 findings span three different bug patterns, and each is the kind of bug a per-file LLM prompt has trouble with for a different reason:

- **The zero-amount bypass is an invariant-violation bug.** The bug only shows up when you ask "what invariant is this proxy supposed to enforce?" and notice that "claimed ↔ amount > 0" fails on the zero-amount edge. The pipeline's prompt asks for "exploitable vulnerabilities," which biases the LLM toward surface-level code flaws rather than invariant reasoning. A purpose-built "list every invariant this contract assumes, then probe the edges" pass would likely catch this.
- **The epoch-snapshot bug is a parameter-upgrade pattern.** It requires reasoning about *admin actions taken in the future* and their retroactive effect on past-epoch state. The LLM sees the contract in a static snapshot; reasoning about the operational lifecycle of governance actions is outside what a per-file prompt elicits.
- **The div-by-zero is a single-line oversight.** Slither should arguably have flagged this, but its div-by-zero detector keys on `divide-by-zero` literal patterns and didn't fire on `(delta * ratioRange) / target`.
  Either an extra Slither pass with looser division-safety heuristics, or an explicit "for every division in this file, prove the divisor is non-zero" prompt step, would close this gap.

All three are plausibly *closable* with targeted prompt additions (an invariant probe, a division-safety probe, a parameter-upgrade second pass). Whether those would actually close the misses is an untested hypothesis; it would take a fresh contest run to check. None look like "LLMs fundamentally can't do this" failures.

### The unverified additions

Claude's `_balanceOf` Critical and Gemini's two Highs are distinct findings with concrete PoC sketches. Whether they represent genuine novel additions, or false positives the pipeline confabulates on hard-to-reason files, is **not resolvable from this writeup**: it needs a cross-check against V13 (the post-mitigation report) or a project-team response. I didn't have V13 open at run time. For an honest reader, three unverified findings on a file where the pipeline rediscovered nothing V12 caught is ambiguous at best.

### Takeaway

TrustBonding is the file where the pipeline's limitations are most visible, both in absolute terms (0/3 V12 rediscovery) and in the uncertain status of its additions (3 unverified). The underlying diagnosis is that V12's three findings each require a different *kind* of reasoning than the prompt's "find exploitable vulnerabilities" frame induces: invariants, parameter-upgrade lifecycle, division-safety sweeps. Adding purpose-built sub-prompts for each of those would likely close most of the gap. Without those, the pipeline on TrustBonding-class files should be treated as exploratory (worth reading its outputs) rather than definitive (don't treat no V12-matching findings as "nothing there").

## 4. `OffsetProgressiveCurve.sol`: clean, and it mirrors #2

Both Claude and Gemini returned zero findings with substantive "what was checked" sections. V12 also reported nothing.
The file is a coordinate-transform wrapper around `ProgressiveCurve` (a 2D offset applied in the curve's share-asset plane), so the repeated clean result across the two files is consistent with treating the offset as a sound transform. This is mild evidence the pipeline is *consistent* across structurally similar files, not evidence that either file is actually clean: V12 agreed on both, so we can't distinguish "we caught what was there" from "we agreed with V12 about there being nothing there."

## 5. `TrustSwapAndBridgeRouter.sol`: V12's High caught by Claude at Low; Gemini added an unverified Medium

### What was caught

Claude's finding on this file matches V12's **High** severity "Bridge fee quoted from slippage minimum" almost exactly in root cause and PoC shape:

> `bridgeFee` is quoted via
> `MetaERC20Hub.quoteTransferRemote(recipient, minTrustOut)`, but the
> later `_bridgeTrust` call forwards the *actual* `amountOut` (≥
> `minTrustOut`). If Metalayer's fee scales with amount, the router
> reverts with insufficient `msg.value` whenever
> `amountOut > minTrustOut`; or, if fee sufficiency is not strictly
> enforced, subsidy / DoS. Remediation: re-quote `bridgeFee` using
> `amountOut` post-swap.

This is the same bug V12 reports. Claude rated it Low (possibly Medium, flagged in the finding). V12 rated it High. Under C4's rubric, a user-controlled lower bound letting a caller underpay bridge fees while bridging a larger amount (with plausible DoS or subsidy outcomes) is defensibly High.

**Same under-rating direction as AtomWallet**: on this corpus the pipeline's two caught findings are both below V12's severity, AtomWallet by one tier (High vs Critical) and the bridge fee by two (Low vs High). Two data points is not a pattern claim, but the direction is consistent.

### The unverified Gemini addition

Gemini's Medium on `deadline = block.timestamp` in the Slipstream swap structs ("provides no MEV protection, sandwich/delay attacks stay live under mempool congestion") is not in V12.
On Base today the MEV surface is small and this would likely be judged a Low or QA; if the router gets used on higher-MEV L2s it becomes more defensible. This is an unverified addition to V12, not a rediscovery.

### Takeaway

Two things sit together on this file. First, the pipeline **rediscovered V12's one finding on the periphery repo**. That's a hit on this corpus, and the one where the dual-LLM + Slither prompt produced content that lines up with V12's own PoC. Second, Claude rated it Low where V12 rated it High, two severity tiers below. This is the second data point consistent with the under-rating direction in observation (A). Two data points is not a pattern claim, but the direction holds on both findings the pipeline caught.

## Reading across six V12 findings and the pipeline's output

Five observations, stated with the bounds the corpus actually supports.

### (A) Severity under-rating on both caught findings

Both V12 findings the pipeline rediscovered came back at a lower severity than V12's call. The AtomWallet unsigned-window bug was Critical (V12) / High (Claude), one tier down. The bridge-fee bug was High (V12) / Low-possibly-Medium (Claude), two tiers down. The common thread: Claude over-weighted a "preconditions" column as if observing a signed UserOp in the mempool or picking a low `minTrustOut` were gating factors. Both are trivial in practice. With n=2 this is a direction, not a pattern claim; a prompt revision that says "if the attack is permissionless and the exploit observable on public state, do not demote on preconditions alone" seems worth testing, and would need more contest runs to validate.

### (B) Cross-function state-transition bugs were a recurring miss

Two of the four missed findings on this corpus (AtomWallet ownership slot mismatch; TrustBonding epoch-snapshot for reward parameters) share a common shape: the bug only manifests when you simulate a *lifecycle* across multiple functions. Per-file prompts ask "does this function work?"
and get locally correct answers. Nothing in the prompt scaffolds the reasoning required to notice a read-write mismatch across four functions and an `isClaimed` flip. A second-pass prompt that lists every function reading or writing a given piece of state and asks whether the reads match the writes across the lifecycle is a plausible next thing to try; whether it would actually close this class of miss is an open question until it's run.

### (C) Invariant-violation bugs aren't framed by the current prompt

The TrustBonding zero-amount claim bypass is the clearest example. The bug only surfaces when you ask "what invariant is this code trying to encode, and where does the encoding fail?" The current prompt asks for "exploitable vulnerabilities," which biases toward surface-level flaws. A dedicated invariant-probe sub-prompt ("list every invariant the contract depends on; for each, describe the edge case that would break it") would help.

### (D) Slither's value is real but uneven

On TrustBonding, Slither's asymmetry flag on `_balanceOf` was what seeded Claude's `_balanceOf` extrapolation finding: real leverage, even though that finding is currently unverified. On AtomWallet, Slither's output didn't contain the wedge for the ownership slot mismatch, because Slither doesn't have a "cross-function storage slot mismatch" detector. On TrustBonding, Slither's div-by-zero detector didn't fire on `(delta * ratioRange) / target`, so V12's div-by-zero Low wasn't wedge-assisted either. On this corpus, Slither is good leverage for per-file-visible asymmetries and library-FP filtering; it doesn't cover the bug classes that most need cross-function reasoning.

### (E) Library false-positive filtering was the most consistent value add on this run

Across the Intuition corpus, both LLMs tagged Slither detectors that fired on Solady `FixedPointMathLib`, OZ `Initializable`, prb-math, etc. as library-internal and out-of-scope; I didn't see a library-FP tag the review would call wrong.
Most Slither hits on the Intuition scope landed on library code, so the pre-filtered annotation saved meaningful lookup time. "Highly consistent on this corpus" is what I can say; I don't have numbers beyond this one contest to generalize further.

## What would raise the confidence floor (revised priorities)

Three next steps, in the order I'd prioritize them given what the per-file breakdown above actually shows:

1. **Second-pass prompt for cross-function state-transition bugs.** On files the first pass tags as "stateful" (e.g., anything with an `Ownable` inheritance, `initialize`, `isClaimed`, or an `onlyOwner`-gated lifecycle flag), run a second prompt that lists every function reading or writing the state variable and asks whether the reads match the writes across the lifecycle. This would plausibly catch the AtomWallet ownership slot mismatch that V12 flagged and the pipeline missed.
2. **Invariant-probe sub-prompt.** On files tagged as accounting / reward / claim / emission, run a second prompt that lists every invariant the contract assumes ("one claim per epoch per user," "historical epoch rewards are immutable," etc.) and probes the edge cases that would violate each. This would plausibly catch the TrustBonding zero-amount claim bypass.
3. **Blind run on a fresh contest.** Everything above is post-hoc analysis on a corpus where V12's ground truth has been public for ~six weeks. A run on a contest *before* V12 (or the equivalent ground-truth list) is posted is the only way to distinguish "rediscovery against a known list" from "discovery." I plan to do this on whichever Code4rena / Sherlock contest closes next with a short enough scope to run in a phase.

Severity-calibration and auto-PoC-compilation are lower-priority next steps but are easier to scope. Severity is a one-line prompt change; auto-PoC is a Foundry harness + error-feedback loop, straightforward for AtomWallet-class findings and substantially harder for TrustBonding-class ones.

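
The first next step in the list above (list every function that reads or writes a piece of state, then check reads against writes) could start from something as small as the sketch below. It is Python and assumes an access map built by hand or by some other tool; the function names and slot labels in the example are placeholders, not AtomWallet's actual storage layout.

```python
def read_write_mismatches(accesses):
    """Compare the slots a contract reads against the slots it writes.

    accesses maps function name -> {"reads": set of slots, "writes": set of slots}.
    A non-empty result is a cue for a closer lifecycle trace, not a finding.
    """
    read = set()
    written = set()
    for access in accesses.values():
        read |= access.get("reads", set())
        written |= access.get("writes", set())
    return {
        "read_never_written": sorted(read - written),
        "written_never_read": sorted(written - read),
    }


# Placeholder access map: the setter writes one slot while the getter reads another.
example = {
    "set_owner": {"reads": set(), "writes": {"slot_a"}},
    "get_owner": {"reads": {"slot_b"}, "writes": set()},
}
report = read_write_mismatches(example)
print(report)
```

This only catches the coarsest mismatch. A real second pass would also have to handle conditional reads such as the `isClaimed` switch, where both slots are written somewhere but the wrong one is read in a given lifecycle state, which is why the prompt-based version asks the model to argue the lifecycle rather than just diff two sets.
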

## What this writeup is and isn't

This is a post-hoc analysis of one pipeline run against V12's complete finding list on the Intuition scope. It's an argument about *why* each V12 finding was caught or missed, anchored in the file's structure and the prompt's scope. It's not a benchmark paper, not a generalization claim, and not evidence of novel-bug-finding. Anyone who wants to fork this analysis should (a) treat "pipeline caught X" as shorthand for "on this corpus, under this prompt, with Claude Opus 4.7 + Gemini 3 Pro at this point in time" and (b) run the pipeline blind on their own scope before deciding whether any of the generalizations in observations (A) through (E) hold for them.

## Corrections to prior material

The QF submission draft's summary table previously characterized V12's finding on `TrustSwapAndBridgeRouter` as "peripheral Low/Medium in V12"; V12 actually has a single **High** finding on that file ("Bridge fee quoted from slippage minimum"), with root cause matching Claude's Low. The same table's AtomWallet row mentioned only V12's first Critical ("Unsigned validity window metadata") and omitted V12's second Critical ("Ownership slot mismatch bricks wallet"), which the pipeline did not surface. The TrustBonding row previously said "matches V12 area-of-risk; distinct framings," which on review overstates the match: V12's three findings on TrustBonding (zero-amount claim bypass; no epoch snapshot for reward parameters; div-by-zero in utilization interpolation) are distinct bugs from the pipeline's three, not alternative framings of the same ones. The QF submission draft has been corrected to reflect all three.

-- `merovan`