# Dual-LLM + Slither audit-review pipeline: Intuition (Code4rena) benchmark

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Date:** 2026-04-20

## TL;DR

I built a Python pipeline that runs Slither on a Solidity codebase and forwards the source + Slither output to **Claude Opus 4.7** and **Gemini 3 Pro** in parallel, asking each independently for structured findings + Slither-false-positive annotations. On a 2-file benchmark against the recently-closed **Intuition (Code4rena)** contest:

- Claude identified the same root cause as V12's Critical finding on `AtomWallet._validateSignature` (unauthenticated `validUntil` / `validAfter` suffix). It rated it High; V12 rated it Critical. Root cause, location, and PoC sketch align.
- On `ProgressiveCurve.sol` Claude reported no findings with a detailed "what was checked" section; Gemini produced one speculative Low that didn't survive review.
- On AtomWallet, Gemini produced **different findings on different runs** (see "Model variance on Gemini" below), and none of them matched V12's Critical.
- Both LLMs correctly filtered ~10 Slither detectors that fired on library code (Solady `FixedPointMathLib`, OZ `Initializable`, prb-math).

**Honest caveats** (before the rest of the writeup is read):

1. The V12 findings list was **public in the contest repo the day the contest opened** (2026-03-04) and had been publicly indexable for more than six weeks by the time this run happened (2026-04-20). I can't rule out that either LLM's pretraining or retrieval saw it. This benchmark does not demonstrate *novel*-bug-finding capability; it shows only that when shown an isolated contract file, the pipeline does not miss a syntactically visible bug that is also known.
2. n=1 contest, 2 files reviewed of 5 in-scope. This is proof-of-concept, not generalizable evidence.
3. Gemini's output varies meaningfully between `thinking="medium"` and `thinking="high"`. Documented below.

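
The parallel, independent fan-out described in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `review_file`, the `Reviewer` type, and the two `fake_*` functions are hypothetical stand-ins for the real model calls, which are assumed to take the source, the Slither output, and the scope context and return a markdown report.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical reviewer signature: (source, slither_output, context) -> markdown report.
Reviewer = Callable[[str, str, str], str]


def review_file(source: str, slither_output: str, context: str,
                reviewers: Dict[str, Reviewer]) -> Dict[str, str]:
    """Run every reviewer on the same inputs in parallel and return {name: report}."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {
            name: pool.submit(fn, source, slither_output, context)
            for name, fn in reviewers.items()
        }
        # Each reviewer sees only the shared inputs, never the other's output,
        # so the reports can be treated as independent annotations.
        return {name: future.result() for name, future in futures.items()}


# Stand-ins for the real LLM calls (assumptions, for illustration only).
def fake_claude(src: str, slither: str, ctx: str) -> str:
    return f"claude reviewed {len(src)} chars"


def fake_gemini(src: str, slither: str, ctx: str) -> str:
    return f"gemini reviewed {len(src)} chars"


reports = review_file(
    "contract A {}", "no detectors", "scope notes",
    {"claude": fake_claude, "gemini": fake_gemini},
)
```

Threads are a reasonable fit here under the assumption that each reviewer call is network-bound; the intersection analysis over `reports` is left to the human reviewer, as in the writeup.
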
## Pipeline shape

```
(project_root, scope files, context) ──┐
                                       ▼
                        Slither per file ──► slither_.txt
                                       │
                                       ▼
           ┌── Claude Opus 4.7 ──► claude_.md
src ───────┤
           └── Gemini 3 Pro    ──► gemini_.md
                                       │
                                       ▼
                        aggregated_findings.md
```

Design choices:

- **Per-file Slither**, not whole-project. Whole-project Slither on modern OZ + solady codebases produces 30 KB+ of library noise that clutters LLM context. Per-file keeps the source+context tight.
- **Two LLMs, parallel, independent**. Each LLM is asked for its own finding set; the human reviewer does the intersection analysis. This is not a committee; treating them as independent annotators is the point.
- **Structured prompt** asking for C4/Secure3-rubric severity, PoC sketch, Slither-FP annotations, and a "high-confidence observations" section (what was checked that looked correct).
- **FileCache-backed** on a SHA256 key over `(src, slither, context, model, prompt_version)`. Same inputs are free on rerun.

## Benchmark: Intuition (Code4rena, contest closed 2026-03-09)

Scope fetched from `https://github.com/code-423n4/2026-03-intuition` at commit `0a19e25`.

Ground-truth finding set: `code_423n4_2026_03_intuition__main_0a19e25_findings.md` (V12 / Zellic's AI auditor). **This file has been committed to the public repo since the day the contest opened** (see the caveat above).

Files reviewed (of 5 in-scope):

| File | Lines |
|---|---:|
| `src/protocol/wallet/AtomWallet.sol` | 382 |
| `src/protocol/curves/ProgressiveCurve.sol` | 244 |

### Result on `AtomWallet.sol`

**Claude Opus 4.7**, both runs (thinking=high with max_tokens=12000, then thinking=medium with max_tokens=32000), produced the same finding at **High** severity:

> `validUntil` / `validAfter` are not covered by the ECDSA signature:
> attacker can extend or bypass user-specified time windows.
>
> The ECDSA signature is computed over
> `keccak256("\x19Ethereum Signed Message:\n32" || userOpHash)` only.
> The 12-byte `validUntil || validAfter` suffix appended to
> `userOp.signature` is parsed out before recovery and is never
> authenticated. userOpHash itself (as computed by the EntryPoint) is
> derived from userOp fields excluding signature. Any observer can strip
> or mutate the 12-byte suffix (e.g., re-encode as a 65-byte signature →
> `validUntil=0` interpreted as "no expiry") and re-submit at an arbitrary
> later time. The nonce still matches and the underlying ECDSA still
> verifies.

This is **the same bug** as V12's Critical "Unsigned validity window metadata". Root cause, affected location, and PoC shape align.

**Severity call**: Claude rated High, V12 rated Critical. V12's Critical is defensible: the bug lets an attacker execute a user's signed UserOp well after the user's intended window, on a wallet whose purpose is executing arbitrary user calls (swap / transfer / etc). Under the C4 rubric this is closer to Critical (loss of user funds under realistic conditions with a permissionless attack). I'd align with V12 on reflection.

### Model variance on Gemini

Gemini 3 Pro's AtomWallet output varied substantially between thinking levels:

- **thinking=high, max_tokens=12000**: output was *truncated mid-sentence* but produced a finding claiming `_validateSignature` reverts instead of returning `SIG_VALIDATION_FAILED`. On review this finding had a factual problem: by the time `ECDSA.tryRecover(hash, signature)` is called, `_extractValidUntilAndValidAfterFromSignature` has already reverted any signature whose length isn't 65 or 77, so the `RecoverError.InvalidSignatureLength` branch is effectively dead code. The only reachable revert is on `InvalidSignatureS` (high-S), and OZ / ethers enforce low-S by default. Narrow, not Medium.
- **thinking=high, max_tokens=24000** (rerun after raising the limit): also truncated; produced a *different* finding: "claimAtomWalletDepositFees and transferOwnership are unusable via ERC-4337 UserOperations due to onlyOwner + nonReentrant". This is an interesting AA-UX finding but it does not match V12's Critical.
- **thinking=medium, max_tokens=32000** (final run): clean output, no truncation, *no findings*, just a high-confidence observations section and Slither-FP filter.

This is the honest signal: **Gemini is high-variance on this task** and didn't reliably identify the Critical root cause that Claude found in both of its runs. Gemini's contribution to the pipeline on this file is near zero.

### Result on `ProgressiveCurve.sol`

**Claude**: zero findings, with a detailed high-confidence observations section (rounding directions, MAX_SHARES / MAX_ASSETS bounds, even-slope check, `initializer` modifier, view-only functions, storage isolation).

**Gemini**: one speculative Low on a `previewWithdraw` edge-case revert. It didn't survive manual review: the preconditions require a `(shares, HALF_SLOPE)` combination that doesn't happen under the production `minDeposit = 1e16` / `minShare = 1e6` floors, and the reverting path is arguably the correct UX (vs silent underflow).

V12 also reported nothing on `ProgressiveCurve.sol`. That's **one negative signal**, not a proof that the curve math is sound.

### Slither false-positive handling

All Slither detectors on `ProgressiveCurve` fired on library code (`FixedPointMathLib`, `Initializable`, prb-math). Both LLMs correctly annotated them as library-internal and out-of-scope, with justifications. This is probably the most repeatable high-value signal from the pipeline: without it, an auditor spends real time ruling these detectors in or out.

## What the pipeline does NOT do

- Write runnable Foundry PoCs. The LLM produces a sketch; the human writes + runs + verifies the test in the project's test suite.
  C4 requires a coded, runnable PoC for High/Medium submissions.
- Pick severity definitively. The LLM suggests; the judge decides. The AtomWallet High/Critical disagreement illustrates this.
- Find cross-contract economic / governance / MEV / oracle-manipulation exploits. The pipeline is strongest on signature verification / access control / ERC-20 integration / rounding direction. It is untested on protocol-level invariant bugs that require multi-contract state reasoning.
- Generalize beyond EVM. Slither is EVM-only.

## Reproducibility

```
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
    \
    --files src/A.sol src/B.sol \
    --context "Scope notes, known invariants, non-goals"
```

Output lands in `sandbox_out/reviews//`, cached in `file_cache_dir/`. Cache keys include a version integer; bump it when you change the prompt, max_tokens, or thinking config.

## Cost (rough, from this run)

Claude Opus 4.7 + Gemini 3 Pro were each run per file at `thinking="high"` or `"medium"` with max_tokens up to 32K. Empirical spend on the 2-file Intuition run was on the order of a few USD total. Larger codebases should scale roughly linearly with the number of files. I did not instrument per-call cost: the pipeline docstring mentions an `llm_cost.json` output that the current code does not actually write (that's a TODO).

## Limitations (please read before trusting the result)

- **V12 findings were public at run time** (see top). Re-discovery of a publicly-disclosed bug is not the same as novel-bug-finding.
- **n=1 contest, 2 files**. Directional, not representative.
- **Gemini variance** on AtomWallet between thinking levels is a real pipeline issue, not a benchmarking nit. I'd budget for it by treating Gemini as an annotator to discount heavily on this class of file, rather than as a co-equal second opinion.
- **Pipeline was tuned on this benchmark**. I ran it once, looked at the output, tweaked max_tokens and cache version, ran again. The tweaks are small but the benchmark is no longer blind.
- **No automated PoC verification**. The pipeline tells you what to write as a PoC; it doesn't compile-and-run it.

## Next steps (if anyone finds this useful)

- Pass it over the remaining 3 Intuition in-scope files (`TrustBonding`, `OffsetProgressiveCurve`, `TrustSwapAndBridgeRouter`) and compare to the full V12 list.
- Run it on a contest scope *before* V12 / other-auditor findings are public, as a blind test.
- Add an automated PoC compilation step: drop the LLM's sketch into a Foundry `test_*.sol`, `forge test`, feed error surface back.
- Instrument `llm_cost.json` for real cost-per-run accounting.

Contact: `merovan@envs.net`. Happy to do a pro-bono review pass if you want this pattern run on your scope.
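
As a starting point for the `llm_cost.json` instrumentation listed in the next steps, a minimal sketch might look like the following. Everything here is an assumption for illustration: the `CallUsage` record, the `write_cost_report` helper, the JSON layout, and the idea that the caller supplies per-model prices in USD per million tokens. None of it exists in the current pipeline, and no real prices are implied.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class CallUsage:
    """Token usage for one LLM call (hypothetical record shape)."""
    model: str
    file: str
    input_tokens: int
    output_tokens: int


def write_cost_report(calls: List[CallUsage],
                      prices: Dict[str, Tuple[float, float]],
                      out_path: str) -> float:
    """Write per-call and total cost to out_path; return the total in USD.

    prices maps model name -> (USD per 1M input tokens, USD per 1M output tokens).
    The caller must supply these; they are not known to this sketch.
    """
    rows = []
    for call in calls:
        price_in, price_out = prices[call.model]
        usd = (call.input_tokens * price_in
               + call.output_tokens * price_out) / 1_000_000
        rows.append({**asdict(call), "usd": round(usd, 6)})
    total = round(sum(row["usd"] for row in rows), 6)
    report = {"calls": rows, "total_usd": total}
    Path(out_path).write_text(json.dumps(report, indent=2))
    return total
```

Recording usage per call rather than per run would also make it possible to see how much of the spend comes from reruns at different `thinking` / max_tokens settings.
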