# Dual-LLM + Slither audit-review pipeline: Intuition (Code4rena) benchmark

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Date:** 2026-04-20

## TL;DR

I built a Python pipeline that runs Slither on a Solidity codebase and forwards the source + Slither output to **Claude Opus 4.7** and **Gemini 3 Pro** in parallel, asking each independently for structured findings + Slither-false-positive annotations.

I ran it against the recently closed **Intuition (Code4rena)** contest: 5 files, all in-scope (AtomWallet, ProgressiveCurve, and a three-file extension run covering TrustBonding, OffsetProgressiveCurve, and TrustSwapAndBridgeRouter).

Results by file:

- Claude identified the same root cause as V12's Critical finding on `AtomWallet._validateSignature` (unauthenticated `validUntil` / `validAfter` suffix). It rated it High; V12 rated it Critical. Root cause, location, and PoC sketch align.
- On `ProgressiveCurve.sol`, Claude reported no findings with a detailed "what was checked" section; Gemini produced one speculative Low that didn't survive review.
- On AtomWallet, Gemini produced **different findings on different runs** (see "Model variance on Gemini" below); none of them matched V12's Critical.
- On the three extension files, both Claude and Gemini produced distinct non-trivial findings on `TrustBonding` (Claude Critical, Gemini two Highs; see "Extension run" below); zero findings on `OffsetProgressiveCurve`; distinct periphery-grade Low/Medium findings on `TrustSwapAndBridgeRouter`.
- Both LLMs correctly filtered ~10 Slither detectors that fired on library code (Solady `FixedPointMathLib`, OZ `Initializable`, prb-math).

**Honest caveats** (before the rest of the writeup is read):

1. The V12 findings list was **public in the contest repo the day the contest opened** (2026-03-04) and had been publicly indexable for more than six weeks by the time this run happened (2026-04-20). I can't rule out that either LLM's pretraining or retrieval saw it.
   This benchmark does not demonstrate *novel*-bug-finding capability, only that when shown an isolated contract file, the pipeline does not miss a syntactically visible bug that is also known.
2. n=1 contest. The primary-scope run covered 2 files (AtomWallet, ProgressiveCurve); the Phase-1 extension run covered the remaining 3 (TrustBonding, OffsetProgressiveCurve, TrustSwapAndBridgeRouter). Proof-of-concept, not generalizable evidence.
3. Gemini's output varies meaningfully between `thinking="medium"` and `thinking="high"`. Documented below.

## Pipeline shape

```
(project_root, scope files, context) ──┐
                                       ▼
                  Slither per file ──► slither_.txt
                                       │
                                       ▼
          ┌── Claude Opus 4.7 ──► claude_.md
src ────┤
          └── Gemini 3 Pro    ──► gemini_.md
                                       │
                                       ▼
                           aggregated_findings.md
```

Design choices:

- **Per-file Slither**, not whole-project. Whole-project Slither on modern OZ + solady codebases produces 30 KB+ of library noise that clutters LLM context. Per-file keeps the source+context tight.
- **Two LLMs, parallel, independent**. Each LLM is asked for its own finding set; the human reviewer does the intersection analysis. This is not a committee: treating them as independent annotators is the point.
- **Structured prompt** asking for C4/Secure3-rubric severity, PoC sketch, Slither-FP annotations, and a "high-confidence observations" section (what was checked that looked correct).
- **FileCache-backed** on a SHA256 key over `(src, slither, context, model, prompt_version)`. Same inputs are free on rerun.

## Benchmark: Intuition (Code4rena, contest closed 2026-03-09)

Scope fetched from `https://github.com/code-423n4/2026-03-intuition` at commit `0a19e25`.

Ground-truth finding set: `code_423n4_2026_03_intuition__main_0a19e25_findings.md` (V12 / Zellic's AI auditor). **This file has been committed to the public repo since the day the contest opened**; see the caveat above.

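The "two LLMs, parallel, independent" design choice from "Pipeline shape" can be illustrated with a minimal sketch. The function `review_in_parallel` and the stub reviewers are hypothetical names for illustration; the real API clients and prompt are not shown in this writeup, so the stubs only stand in for them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical type: a reviewer takes a prompt and returns markdown findings.
Reviewer = Callable[[str], str]


def review_in_parallel(prompt: str, reviewers: Dict[str, Reviewer]) -> Dict[str, str]:
    """Send the same prompt to every reviewer concurrently and keep outputs separate."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in reviewers.items()}
        # No reviewer sees another's output; intersection analysis is left to the human.
        return {name: fut.result() for name, fut in futures.items()}


# Stub reviewers standing in for the real model clients (assumption, not the real code).
def fake_claude(prompt: str) -> str:
    return f"claude findings for {len(prompt)} chars"


def fake_gemini(prompt: str) -> str:
    return f"gemini findings for {len(prompt)} chars"


results = review_in_parallel(
    "SOURCE...\nSLITHER OUTPUT...",
    {"claude": fake_claude, "gemini": fake_gemini},
)
for name, text in results.items():
    print(name, "->", text)
```

Keeping the outputs in a per-model dictionary, rather than merging them, mirrors the "independent annotators" framing: each result can be written to its own file and compared by hand.
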
Primary-scope files reviewed in this initial run (the 3 Phase-1 extension files are covered separately in the "Extension run" section below; together the pipeline covers all 5 in-scope files):

| File | Lines |
|---|---:|
| `src/protocol/wallet/AtomWallet.sol` | 382 |
| `src/protocol/curves/ProgressiveCurve.sol` | 244 |

### Result on `AtomWallet.sol`

**Claude Opus 4.7**, both runs (thinking=high with max_tokens=12000, then thinking=medium with max_tokens=32000), produced the same finding at **High** severity:

> `validUntil` / `validAfter` are not covered by the ECDSA signature:
> attacker can extend or bypass user-specified time windows.
>
> The ECDSA signature is computed over `keccak256("\x19Ethereum Signed
> Message:\n32" || userOpHash)` only. The 12-byte `validUntil ||
> validAfter` suffix appended to `userOp.signature` is parsed out before
> recovery and is never authenticated. userOpHash itself (as computed by
> the EntryPoint) is derived from userOp fields excluding signature. Any
> observer can strip or mutate the 12-byte suffix (e.g., re-encode as a
> 65-byte signature → `validUntil=0` interpreted as "no expiry") and
> re-submit at an arbitrary later time. The nonce still matches and the
> underlying ECDSA still verifies.

This is **the same bug** as V12's Critical "Unsigned validity window metadata". Root cause, affected location, and PoC shape align.

**Severity call**: Claude rated High, V12 rated Critical. V12's Critical is defensible: the bug lets an attacker execute a user's signed UserOp well after the user's intended window, on a wallet whose purpose is executing arbitrary user calls (swap / transfer / etc). Under the C4 rubric this is closer to Critical (loss of user funds under realistic conditions with a permissionless attack). I'd align with V12 on reflection.

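To make the root cause concrete, here is a minimal Python model of the parsing described in the finding. It is a sketch under stated assumptions, not the contract's code: the 6-byte/6-byte big-endian split of the 12-byte suffix is assumed, `split_signature` and `signed_digest` are hypothetical names, and `hashlib.sha3_256` is only a stand-in for keccak256 (the two hashes differ).

```python
import hashlib

SIG_LEN = 65     # r || s || v
SUFFIX_LEN = 12  # validUntil || validAfter; 6 bytes each is an assumption


def split_signature(signature: bytes):
    """Split a 65- or 77-byte blob into (ecdsa_sig, valid_until, valid_after)."""
    if len(signature) == SIG_LEN:
        return signature, 0, 0  # no suffix: 0 is read as "no expiry"
    if len(signature) == SIG_LEN + SUFFIX_LEN:
        sig, suffix = signature[:SIG_LEN], signature[SIG_LEN:]
        return sig, int.from_bytes(suffix[:6], "big"), int.from_bytes(suffix[6:], "big")
    raise ValueError("signature must be 65 or 77 bytes")


def signed_digest(user_op_hash: bytes) -> bytes:
    """Digest the signer commits to. Note that the suffix is not an input."""
    # sha3_256 is a stand-in for keccak256 in this illustration only.
    return hashlib.sha3_256(b"\x19Ethereum Signed Message:\n32" + user_op_hash).digest()


user_op_hash = bytes(32)       # placeholder hash
ecdsa_part = bytes(range(65))  # placeholder bytes, not a real signature
windowed = ecdsa_part + (1_700_000_000).to_bytes(6, "big") + (0).to_bytes(6, "big")
stripped = windowed[:SIG_LEN]  # what any observer can resubmit

sig_a, until_a, after_a = split_signature(windowed)
sig_b, until_b, after_b = split_signature(stripped)

# Same ECDSA bytes and same digest, but the expiry has silently disappeared.
print(sig_a == sig_b, until_a, until_b)
```

Because `signed_digest` never receives the suffix, nothing distinguishes the windowed blob from the stripped one at recovery time, which is the gap the finding describes.
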
### Model variance on Gemini

Gemini 3 Pro's AtomWallet output varied substantially between thinking levels:

- **thinking=high, max_tokens=12000**: output was *truncated mid-sentence* but produced a finding claiming `_validateSignature` reverts instead of returning `SIG_VALIDATION_FAILED`. On review this finding had a factual problem: by the time `ECDSA.tryRecover(hash, signature)` is called, `_extractValidUntilAndValidAfterFromSignature` has already reverted any signature whose length isn't 65 or 77, so the `RecoverError.InvalidSignatureLength` branch is effectively dead code. The only reachable revert is on `InvalidSignatureS` (high-S), and OZ / ethers enforce low-S by default. Narrow, not Medium.
- **thinking=high, max_tokens=24000** (rerun after raising the limit): also truncated; produced a *different* finding: "claimAtomWalletDepositFees and transferOwnership are unusable via ERC-4337 UserOperations due to onlyOwner + nonReentrant". This is an interesting AA-UX finding, but it is not the Critical.
- **thinking=medium, max_tokens=32000** (final run): clean output, no truncation, *no findings*, just a high-confidence observations section and Slither-FP filter.

This is the honest signal: **Gemini is high-variance on this task** and didn't reliably identify the same Critical root cause that Claude found in both of its runs. Gemini's contribution to the pipeline on AtomWallet was near zero.

### Result on `ProgressiveCurve.sol`

**Claude**: zero findings, with a detailed high-confidence observations section (rounding directions, MAX_SHARES / MAX_ASSETS bounds, even-slope check, `initializer` modifier, view-only functions, storage isolation).

**Gemini**: one speculative Low on a `previewWithdraw` edge-case revert.
Didn't survive manual review: the preconditions require a `(shares, HALF_SLOPE)` combination that doesn't happen under the production `minDeposit = 1e16` / `minShare = 1e6` floors, and the reverting path is arguably the correct UX (vs silent underflow).

V12 also reported nothing on `ProgressiveCurve.sol`. That's **one negative signal**, not a proof the curve math is sound.

### Slither false-positive handling

All Slither detectors on `ProgressiveCurve` fired on library code (`FixedPointMathLib`, `Initializable`, prb-math). Both LLMs correctly annotated them as library-internal and out-of-scope with justifications. This is probably the most repeatable high-value signal from the pipeline: without it, an auditor spends real time ruling these detectors in/out.

## What the pipeline does NOT do

- Write runnable Foundry PoCs. The LLM produces a sketch; the human writes + runs + verifies the test in the project's test suite. C4 requires a coded, runnable PoC for High/Medium submissions.
- Pick severity definitively. The LLM suggests; the judge decides. The AtomWallet High/Critical disagreement illustrates this.
- Find cross-contract economic / governance / MEV / oracle-manipulation exploits. The pipeline is strongest on signature verification / access control / ERC-20 integration / rounding direction. Untested on protocol-level invariant bugs that require multi-contract state reasoning.
- Generalize beyond EVM. Slither is EVM-only.

## Reproducibility

```
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
    \
    --files src/A.sol src/B.sol \
    --context "Scope notes, known invariants, non-goals"
```

Output lands in `sandbox_out/reviews//`, cached in `file_cache_dir/`. Cache keys include a version integer; bump it when you change prompt / max_tokens / thinking config.

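As an illustration of the cache-key idea (a SHA256 over the source, Slither output, context, model, and a version integer), here is a minimal sketch. The function name `cache_key` and the JSON encoding are assumptions; the actual FileCache implementation is not shown in this writeup.

```python
import hashlib
import json


def cache_key(src: str, slither: str, context: str, model: str, prompt_version: int) -> str:
    """Derive a deterministic hex key from every input that can change the LLM output."""
    # Encoding the fields as a JSON list keeps field boundaries unambiguous,
    # so ("ab", "c") and ("a", "bc") cannot collide the way plain concatenation could.
    payload = json.dumps(
        [src, slither, context, model, prompt_version], ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


k1 = cache_key("contract A {}", "no detectors", "scope notes", "model-a", 1)
k2 = cache_key("contract A {}", "no detectors", "scope notes", "model-a", 1)
k3 = cache_key("contract A {}", "no detectors", "scope notes", "model-a", 2)

print(k1 == k2)  # identical inputs reuse the cached result
print(k1 == k3)  # bumping the version forces a fresh call
```

Settings such as max_tokens or the thinking level are not part of this key, which is why the version integer has to be bumped by hand when they change.
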
## Cost (rough, from this run)

Claude Opus 4.7 + Gemini 3 Pro per file, `thinking="high"` or `"medium"` with max_tokens up to 32K: empirical spend on the initial 2-file Intuition run was on the order of a few USD total. Total cost scales linearly with the number of files reviewed. Per-run instrumentation is now written out as `sandbox_out/reviews//llm_cost.json`; see the "llm_cost.json" section at the bottom. Cached re-runs are free (billed $0).

## Limitations: please read before trusting the result

- **V12 findings were public at run time** (see top). Rediscovery of a publicly disclosed bug is not the same as novel-bug-finding.
- **n=1 contest, 5 files** (2 primary + 3 extension). Directional, not representative.
- **Gemini variance** on AtomWallet between thinking levels is a real pipeline issue, not a benchmarking nit. I'd budget for it by treating Gemini as an annotator to discount heavily on this class of file, rather than as a co-equal second opinion.
- **Pipeline was tuned on this benchmark**. I ran it once, looked at the output, tweaked max_tokens and cache version, and ran again. The tweaks are small but the benchmark is no longer blind.
- **No automated PoC verification**. The pipeline tells you what to write as a PoC; it doesn't compile and run it.

## Extension run (Segment 4 Phase 1, 2026-04-21)

I extended the benchmark to the remaining three in-scope Intuition files: `TrustBonding.sol`, `OffsetProgressiveCurve.sol`, and `TrustSwapAndBridgeRouter.sol`. The last one lives in a sub-repo at `intuition-contracts-v2-periphery/contracts/` (periphery; separate Foundry project from the core emissions / curves / wallet contracts).

Outputs are in `sandbox_out/reviews/intuition_c1_extended/` + `sandbox_out/reviews/intuition_trust_swap_bridge/`, plus the new `llm_cost.json` files in each.

### `TrustBonding.sol`

Both Claude and Gemini found **real epoch-level accounting bugs**, distinct from each other but related.

- **Claude (Critical):** `_balanceOf(user, past_ts)` (in `VotingEscrow.sol`, called by `TrustBonding.userBondedBalanceAtEpochEnd`) extrapolates from the user's *latest* checkpoint even when the queried timestamp is before that checkpoint. A user who has never locked can create a lock just after epoch N ends, then call `claimRewards`; the numerator of their emission share uses the inflated extrapolation, so they siphon epoch-N emissions they did not earn. Slither flagged the 8-line `_balanceOf` function (lines 630–638) as short / asymmetric to how `_totalSupply` is computed; Claude connected it to the TrustBonding emission pipeline.
- **Gemini (two High findings on the same file):**
  - (1) **Zero-balance reward eclipse**: a lock whose expiration lands exactly on `_epochTimestampEnd(epoch)` evaluates to `balance=0` at that exact timestamp (Curve decay semantics), so `userBondedBalanceAtEpochEnd` returns 0 and the user forfeits that final epoch even though the funds were locked through most of it.
  - (2) **JIT reward extraction**: `_userEligibleRewardsForEpoch` takes a single end-of-epoch snapshot instead of time-integrating, so an attacker can lock at the last block before `_epochTimestampEnd` and still claim the full epoch's proportional emissions.

These are distinct, non-redundant findings. I didn't have V13 (post-mitigation report) open at run time to cross-check which of these are already known, but the Claude Critical in particular has a clean PoC sketch and cites a specific Slither artifact (asymmetry between `_balanceOf` and `_totalSupply`) that a human reviewer can verify in ~5 min.

### `OffsetProgressiveCurve.sol`

Both Claude and Gemini returned **zero findings** with thorough "what was checked" sections. Rounding directions in all four conversion paths (deposit, mint, withdraw, redeem) favor the protocol and are symmetric with the non-offset version.
Both models flagged `PCMath.square` / `PCMath.squareUp` semantics in `BaseCurve` as a residual verification item (out of the file's scope to resolve).

### `TrustSwapAndBridgeRouter.sol`

A stateless periphery router that wraps Aerodrome Slipstream swaps + Metalayer bridge transfers for the Trust token.

- **Claude (Low, possibly Medium):** `bridgeFee` is quoted via `MetaERC20Hub.quoteTransferRemote(recipient, minTrustOut)` but the later `_bridgeTrust` call forwards the *actual* `amountOut` (≥ `minTrustOut`). If Metalayer's fee is amount-independent (typical of Hyperlane-style IGP), this is harmless. If it scales with amount, the router reverts with insufficient msg.value whenever `amountOut > minTrustOut`, turning the router into a DoS unless users set `minTrustOut` very tight. Remediation: re-quote `bridgeFee` using `amountOut` post-swap.
- **Gemini (Medium):** `deadline = block.timestamp` on Slipstream swap structs provides no MEV protection: the AMM interprets this as "this block, always," so sandwich/delay attacks at high mempool congestion stay live. Remediation: add an explicit `uint256 deadline` user parameter.

Neither is a Critical, and both depend on off-file context (bridge fee semantics; MEV mempool assumptions). Worth verifying against V13 / project comment.

### What the extension run shows

- **The pipeline produces non-trivial, distinct findings across two models on `TrustBonding`**: Claude and Gemini found different bugs, both plausibly valid, without overlap. That makes the dual-LLM pattern productive on this class of file, in contrast to `AtomWallet` where Gemini was high-variance.
- **Both models correctly return "no findings" on `OffsetProgressiveCurve` with evidence**: not a refusal, but a substantive clean bill of health with "what I checked" items. This matters for auditor trust.
- **On `TrustSwapAndBridgeRouter`, both models surfaced distinct periphery-grade issues** (Low/Medium).
  Neither is a Critical, but both are concrete, non-duplicative, and carry a remediation proposal. A human auditor would treat the same file the same way: "clean on critical-path, two small periphery items to verify."
- **Cache reused from earlier exploratory runs**: most calls hit `file_cache_dir` (see `cached: true` in each `llm_cost.json`), so this phase's run incurred $0 on the API. The original token-cost estimate is in the per-run `llm_cost.json` as `input_chars` + `output_chars`.

Caveats unchanged: V12 was public, V13 may or may not already cover these findings, and this is still a rediscovery pass.

## llm_cost.json (new this phase)

Each review run now emits `sandbox_out/reviews//llm_cost.json` with per-call input/output char counts, estimated tokens (chars / 4), model pricing, and an estimated $USD per call. Cached calls are `$0` in the per-run total because tokens were billed on the prior run. Ground-truth cumulative spend is tracked in `total_cost.jsonl` via `cost_tracker.py` elsewhere in the repo; the per-run file is a guide, not a bill.

## Next steps (if anyone finds this useful)

- Run on a contest scope *before* V12 / other-auditor findings are public, as a blind test. That's the only way to validate novel-bug-finding.
- Cross-check the TrustBonding findings against V13 / a project-team response.
- Add an automated PoC compilation step: drop the LLM's sketch into a Foundry `test_*.sol`, `forge test`, feed error surface back.

Contact: `merovan@envs.net`. Happy to do a pro-bono review pass if you want this pattern run on your scope.

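For reference, the per-call estimate described in the `llm_cost.json` section (chars / 4 as a token estimate, multiplied by model pricing, with cached calls counted as $0) can be sketched as follows. The function name, the `est_*` field names, and the prices are hypothetical placeholders; only `input_chars`, `output_chars`, and `cached` are field names mentioned in this writeup.

```python
def estimate_call_cost(
    input_chars: int,
    output_chars: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
    cached: bool = False,
) -> dict:
    """Estimate one call's cost from character counts (rough guide, not a bill)."""
    est_input_tokens = input_chars / 4    # heuristic from the writeup: chars / 4
    est_output_tokens = output_chars / 4
    if cached:
        est_usd = 0.0  # tokens were billed on the earlier, uncached run
    else:
        est_usd = (
            est_input_tokens * usd_per_mtok_in
            + est_output_tokens * usd_per_mtok_out
        ) / 1_000_000
    return {
        "input_chars": input_chars,
        "output_chars": output_chars,
        "est_input_tokens": est_input_tokens,
        "est_output_tokens": est_output_tokens,
        "cached": cached,
        "est_usd": est_usd,
    }


# Placeholder prices (USD per million tokens), chosen for easy arithmetic only.
fresh = estimate_call_cost(40_000, 8_000, usd_per_mtok_in=10.0, usd_per_mtok_out=50.0)
reused = estimate_call_cost(40_000, 8_000, 10.0, 50.0, cached=True)
run_total = fresh["est_usd"] + reused["est_usd"]
print(fresh["est_usd"], reused["est_usd"], run_total)
```

The chars / 4 ratio is a coarse heuristic, so the figure is best read alongside the ground-truth spend in `total_cost.jsonl`.
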