# Dual-LLM + Slither audit-review pipeline — Intuition (Code4rena) benchmark

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Date:** 2026-04-20

## TL;DR

I built a Python pipeline that runs Slither on a Solidity codebase and
forwards the source + Slither output to **Claude Opus 4.7** and **Gemini 3
Pro** in parallel, asking each independently for structured findings +
Slither-false-positive annotations. On a 2-file benchmark against the
recently-closed **Intuition (Code4rena)** contest:

- Claude identified the same root cause as V12's Critical finding on
  `AtomWallet._validateSignature` (unauthenticated `validUntil` /
  `validAfter` suffix). It rated it High; V12 rated it Critical. Root
  cause, location, and PoC sketch align.
- On `ProgressiveCurve.sol` Claude reported no findings with a detailed
  "what was checked" section; Gemini produced one speculative Low that
  didn't survive review.
- On AtomWallet, Gemini produced **different findings on different runs**
  (see "model variance" below) — none of them matched V12's Critical.
- Both LLMs correctly filtered ~10 Slither detectors that fired on
  library code (Solady `FixedPointMathLib`, OZ `Initializable`, prb-math).

**Honest caveats** (before the rest of the writeup is read):

1. The V12 findings list was **public in the contest repo from the day the
   contest opened** (2026-03-04) and had been publicly indexable for more
   than six weeks by the time of this run (2026-04-20). I can't rule out
   that either LLM saw it in pretraining or retrieval. This benchmark does
   not demonstrate *novel*-bug-finding capability; it shows only that, when
   given an isolated contract file, the pipeline does not miss a bug that
   is visible in that file's source and is also publicly known.
2. n=1 contest, 2 files reviewed of 5 in-scope. This is proof-of-concept,
   not generalizable evidence.
3. Gemini's output varies meaningfully across runs: between
   `thinking="medium"` and `thinking="high"`, and between two
   `thinking="high"` runs with different `max_tokens`. Documented below.

## Pipeline shape

```
(project_root, scope files, context) ──┐
                                       ▼
                                    Slither per file ──► slither_<file>.txt
                                                │
                                                ▼
        ┌── Claude Opus 4.7 ──► claude_<file>.md
src ────┤
        └── Gemini 3 Pro    ──► gemini_<file>.md
                                                │
                                                ▼
                                         aggregated_findings.md
```

Design choices:

- **Per-file Slither**, not whole-project. Whole-project Slither on modern
  OZ + solady codebases produces 30 KB+ of library-noise that clutters LLM
  context. Per-file keeps the source+context tight.
- **Two LLMs, parallel, independent**. Each LLM is asked for its own
  finding set; the human reviewer does the intersection analysis. Not a
  committee — treating them as independent annotators is the point.
- **Structured prompt** asking for C4/Secure3-rubric severity, PoC sketch,
  Slither-FP annotations, and a "high-confidence observations" section
  (what was checked that looked correct).
- **FileCache-backed** on `(src, slither, context, model, prompt_version)`
  SHA256 key. Same inputs are free on rerun.
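
The caching and parallel fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `PROMPT_VERSION`, `cache_key`, `review_file`, and the plain `dict` standing in for the file cache are all hypothetical names, and the reviewers are arbitrary callables rather than real API clients.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical version integer; bump on any prompt / max_tokens / thinking change.
PROMPT_VERSION = 3

# A reviewer takes (src, slither, context) and returns a markdown report.
Reviewer = Callable[[str, str, str], str]


def cache_key(src: str, slither: str, context: str, model: str,
              prompt_version: int = PROMPT_VERSION) -> str:
    """SHA-256 over every input that should invalidate a cached review."""
    h = hashlib.sha256()
    for part in (src, slither, context, model, str(prompt_version)):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def review_file(src: str, slither: str, context: str,
                reviewers: Dict[str, Reviewer],
                cache: Dict[str, str]) -> Dict[str, str]:
    """Run every reviewer on identical inputs in parallel, checking the cache first."""

    def one(model: str):
        key = cache_key(src, slither, context, model)
        if key not in cache:
            # Each model sees the same inputs and never sees the other's output.
            cache[key] = reviewers[model](src, slither, context)
        return model, cache[key]

    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        return dict(pool.map(one, reviewers))
```

Because the model name is part of the key, the two reviewers never share a cache entry, and a rerun with unchanged inputs makes no new calls.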

## Benchmark: Intuition (Code4rena, contest closed 2026-03-09)

Scope fetched from `https://github.com/code-423n4/2026-03-intuition` at
commit `0a19e25`. Ground-truth finding set:
`code_423n4_2026_03_intuition__main_0a19e25_findings.md` (V12 / Zellic's
AI auditor). **This file has been in the public repo since the day
the contest opened**; see the caveat above.

Files reviewed (of 5 in-scope):

| File | Lines |
|---|---:|
| `src/protocol/wallet/AtomWallet.sol` | 382 |
| `src/protocol/curves/ProgressiveCurve.sol` | 244 |

### Result on `AtomWallet.sol`

**Claude Opus 4.7**, both runs (thinking=high with max_tokens=12000, then
thinking=medium with max_tokens=32000), produced the same finding at
**High** severity:

> `validUntil` / `validAfter` are not covered by the ECDSA signature —
> attacker can extend or bypass user-specified time windows.
>
> The ECDSA signature is computed over `keccak256("\x19Ethereum Signed
> Message:\n32" || userOpHash)` only. The 12-byte `validUntil ||
> validAfter` suffix appended to `userOp.signature` is parsed out before
> recovery and is never authenticated. userOpHash itself (as computed by
> the EntryPoint) is derived from userOp fields excluding signature. Any
> observer can strip or mutate the 12-byte suffix (e.g., re-encode as a
> 65-byte signature → `validUntil=0` interpreted as "no expiry") and
> re-submit at an arbitrary later time. The nonce still matches and the
> underlying ECDSA still verifies.

This is **the same bug** as V12's Critical "Unsigned validity window
metadata". Root cause, affected location, and PoC shape align.
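
To make the malleability concrete, here is a small Python model of the parsing step as the finding describes it. It is not the contract's Solidity. The byte layout (6-byte big-endian `validUntil` followed by 6-byte `validAfter`), the function names, and the simplified window check are assumptions drawn from the finding text and the ERC-4337 convention that `validUntil == 0` means "no expiry"; exact boundary comparisons differ between EntryPoint versions.

```python
from typing import Tuple

ECDSA_LEN = 65   # r || s || v
SUFFIX_LEN = 12  # validUntil (6 bytes) || validAfter (6 bytes), per the finding


def split_signature(signature: bytes) -> Tuple[bytes, int, int]:
    """Model of the parsing step: returns (ecdsa_sig, valid_until, valid_after)."""
    if len(signature) == ECDSA_LEN:
        # No suffix present: both bounds default to zero.
        return signature, 0, 0
    if len(signature) == ECDSA_LEN + SUFFIX_LEN:
        sig, suffix = signature[:ECDSA_LEN], signature[ECDSA_LEN:]
        return sig, int.from_bytes(suffix[:6], "big"), int.from_bytes(suffix[6:], "big")
    raise ValueError("invalid signature length")


def is_within_window(now: int, valid_until: int, valid_after: int) -> bool:
    """Simplified time-range check; valid_until == 0 is treated as 'never expires'."""
    return now >= valid_after and (valid_until == 0 or now <= valid_until)


# Placeholder bytes standing in for a real ECDSA signature over userOpHash.
ecdsa_sig = bytes(ECDSA_LEN)

# What the user submitted: signature plus a window that expires at t=1000.
signed = ecdsa_sig + (1_000).to_bytes(6, "big") + (0).to_bytes(6, "big")
sig_a, until_a, after_a = split_signature(signed)

# What an observer resubmits: the same 65 bytes with the suffix stripped.
stripped = signed[:ECDSA_LEN]
sig_b, until_b, after_b = split_signature(stripped)

# The ECDSA part is byte-identical, so recovery yields the same signer,
# but the expiry the user chose is gone.
assert sig_a == sig_b
assert not is_within_window(5_000, until_a, after_a)
assert is_within_window(5_000, until_b, after_b)
```

The point of the model is that nothing in the signed hash changes between the two submissions; only unauthenticated bytes differ.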

**Severity call**: Claude rated High, V12 rated Critical. V12's Critical
is defensible — the bug lets an attacker execute a user's signed UserOp
well after the user's intended window, on a wallet whose purpose is
executing arbitrary user calls (swap / transfer / etc). Under the C4
rubric this is closer to Critical (loss of user funds under realistic
conditions with a permissionless attack). I'd align with V12 on reflection.

### Model variance on Gemini

Gemini 3 Pro's AtomWallet output varied substantially across runs and
settings:

- **thinking=high, max_tokens=12000**: output was *truncated mid-sentence*
  but produced a finding claiming `_validateSignature` reverts instead of
  returning `SIG_VALIDATION_FAILED`. On review this finding had a factual
  problem: by the time `ECDSA.tryRecover(hash, signature)` is called,
  `_extractValidUntilAndValidAfterFromSignature` has already reverted
  any signature whose length isn't 65 or 77, so the
  `RecoverError.InvalidSignatureLength` branch is effectively dead code.
  The only reachable revert is on `InvalidSignatureS` (high-S), and OZ /
  ethers enforce low-S by default. Narrow, not Medium.
- **thinking=high, max_tokens=24000** (rerun after raising limit): also
  truncated; produced a *different* finding: "claimAtomWalletDepositFees
  and transferOwnership are unusable via ERC-4337 UserOperations due to
  onlyOwner + nonReentrant". This is an interesting AA-UX finding, but it
  is unrelated to the Critical root cause.
- **thinking=medium, max_tokens=32000** (final run): clean output, no
  truncation, *no findings* — just a high-confidence observations section
  and Slither-FP filter.

This is the honest signal: **Gemini is high-variance on this task** and
did not identify the Critical root cause in any of its three runs, whereas
Claude found it in both of its runs. Gemini's contribution on AtomWallet
was near-zero.

### Result on `ProgressiveCurve.sol`

**Claude**: zero findings, with a detailed high-confidence observations
section (rounding directions, MAX_SHARES / MAX_ASSETS bounds, even-slope
check, `initializer` modifier, view-only functions, storage isolation).

**Gemini**: one speculative Low on `previewWithdraw` edge-case revert.
Didn't survive manual review — the preconditions require a
`(shares, HALF_SLOPE)` combination that doesn't happen under the
production `minDeposit = 1e16` / `minShare = 1e6` floors, and the
reverting path is arguably the correct UX (vs silent underflow).

V12 also reported nothing on `ProgressiveCurve.sol`. That's **one negative
signal**, not a proof the curve math is sound.

### Slither false-positive handling

All Slither detectors on `ProgressiveCurve` fired on library code
(`FixedPointMathLib`, `Initializable`, prb-math). Both LLMs correctly
annotated them as library-internal and out-of-scope with justifications.
This is probably the most repeatable high-value signal from the pipeline
— without it, an auditor spends real time ruling these detectors in/out.

## What the pipeline does NOT do

- Write runnable Foundry PoCs. The LLM produces a sketch; the human writes
  + runs + verifies the test in the project's test suite. C4 requires a
  coded, runnable PoC for High/Medium submissions.
- Pick severity definitively. The LLM suggests; the judge decides. The
  AtomWallet High/Critical disagreement illustrates this.
- Find cross-contract economic / governance / MEV / oracle-manipulation
  exploits. The pipeline is strongest on signature verification / access
  control / ERC-20 integration / rounding direction. Untested on
  protocol-level invariant bugs that require multi-contract state reasoning.
- Generalize beyond EVM. Slither is EVM-only.

## Reproducibility

```
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
  <project_root> <project_name> \
  --files src/A.sol src/B.sol \
  --context "Scope notes, known invariants, non-goals"
```

Output lands in `sandbox_out/reviews/<project_name>/`, cached in
`file_cache_dir/`. Cache keys include a version integer — bump it when
you change prompt / max_tokens / thinking config.

## Cost (rough, from this run)

Each file goes through one Claude Opus 4.7 call and one Gemini 3 Pro call,
at `thinking="high"` or `"medium"` with max_tokens up to 32K. Observed
spend on the 2-file Intuition run was on the order of a few USD total.
Because files are reviewed independently, cost should scale roughly
linearly with the number and size of files.

I did not instrument per-call cost (the pipeline docstring mentions an
`llm_cost.json` output that the current code does not actually write —
that's a TODO).
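
One possible shape for that TODO is sketched below. It assumes each LLM call reports its input and output token counts and that per-million-token prices are supplied by the caller; `CallUsage`, `summarize_cost`, `write_llm_cost`, and the JSON layout are hypothetical, and no real prices are hard-coded.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class CallUsage:
    model: str
    file: str
    input_tokens: int
    output_tokens: int


def summarize_cost(calls: List[CallUsage],
                   prices_per_mtok: Dict[str, Tuple[float, float]]) -> dict:
    """Aggregate USD cost per model from token counts.

    prices_per_mtok maps model -> (input USD per 1M tokens, output USD per 1M tokens).
    """
    usd_by_model: Dict[str, float] = {}
    for call in calls:
        price_in, price_out = prices_per_mtok[call.model]
        usd = (call.input_tokens * price_in + call.output_tokens * price_out) / 1_000_000
        usd_by_model[call.model] = usd_by_model.get(call.model, 0.0) + usd
    return {
        "calls": [asdict(c) for c in calls],
        "usd_by_model": usd_by_model,
        "usd_total": sum(usd_by_model.values()),
    }


def write_llm_cost(out_dir: Path, summary: dict) -> Path:
    """Write the summary next to the review output as llm_cost.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "llm_cost.json"
    path.write_text(json.dumps(summary, indent=2))
    return path
```

Cached reruns would simply contribute no `CallUsage` entries, so the file reflects actual spend rather than nominal per-file cost.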

## Limitations — please read before trusting the result

- **V12 findings were public at run time** (see top). Re-discovery on a
  publicly-disclosed bug is not the same as novel-bug-finding.
- **n=1 contest, 2 files**. Directional, not representative.
- **Gemini variance** on AtomWallet across runs and thinking levels is a
  real pipeline issue, not a benchmarking nit. I'd budget for it by
  treating Gemini as an annotator to discount heavily on this class of
  file, rather than as a co-equal second opinion.
- **Pipeline was tuned on this benchmark**. I ran it, looked at the
  output, adjusted max_tokens, thinking level, and cache version, and
  reran. The tweaks are small but the benchmark is no longer blind.
- **No automated PoC verification**. The pipeline tells you what to write
  as a PoC; it doesn't compile-and-run it.

## Next steps (if anyone finds this useful)

- Pass it over the remaining 3 Intuition in-scope files (`TrustBonding`,
  `OffsetProgressiveCurve`, `TrustSwapAndBridgeRouter`) and compare to the
  full V12 list.
- Run it on a contest scope *before* V12 / other-auditor findings are
  public, as a blind test.
- Add an automated PoC compilation step: drop the LLM's sketch into a
  Foundry `*.t.sol` test file, run `forge test`, and feed the compiler or
  test errors back to the model.
- Instrument `llm_cost.json` for real cost-per-run accounting.
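
The PoC compilation step could start from something like the sketch below. It is an assumption-laden outline, not existing pipeline code: `stage_poc`, `run_poc`, and the `test/poc/` location are hypothetical, and it assumes `forge` is on `PATH` and the project uses Foundry's default layout. Staging is kept separate from execution so the file placement can be checked without Foundry installed.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def stage_poc(project_root: Path, name: str, solidity_source: str) -> Tuple[Path, List[str]]:
    """Write the PoC sketch into the project and return (path, forge command)."""
    test_path = project_root / "test" / "poc" / f"{name}.t.sol"
    test_path.parent.mkdir(parents=True, exist_ok=True)
    test_path.write_text(solidity_source)
    rel = test_path.relative_to(project_root).as_posix()
    # --match-path limits the run to the staged file; -vvv prints traces on failure.
    return test_path, ["forge", "test", "--match-path", rel, "-vvv"]


def run_poc(project_root: Path, cmd: List[str], timeout_s: int = 600) -> Tuple[bool, str]:
    """Run the staged command; return (passed, combined output) for feedback."""
    proc = subprocess.run(
        cmd, cwd=project_root, capture_output=True, text=True, timeout=timeout_s
    )
    return proc.returncode == 0, proc.stdout + proc.stderr
```

The combined output from `run_poc` is what would be handed back to the model on failure; a passing run still needs a human to confirm the test asserts the claimed impact.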

Contact: `merovan@envs.net`. Happy to do a pro-bono review pass if you
want this pattern run on your scope.