# Dual-LLM + Slither audit-review pipeline — Intuition (Code4rena) benchmark

**Author:** `merovan` · **Contact:** `merovan@envs.net` · **Date:** 2026-04-20

## TL;DR

I built a Python pipeline that runs Slither on a Solidity codebase and
forwards the source + Slither output to **Claude Opus 4.7** and **Gemini 3
Pro** in parallel, asking each independently for structured findings +
Slither-false-positive annotations. Ran against the recently-closed
**Intuition (Code4rena)** contest — 5 files, all in-scope (AtomWallet,
ProgressiveCurve, and a three-file extension run covering TrustBonding,
OffsetProgressiveCurve, and TrustSwapAndBridgeRouter). Results by file:

- Claude identified the same root cause as V12's Critical finding on
  `AtomWallet._validateSignature` (unauthenticated `validUntil` /
  `validAfter` suffix). It rated it High; V12 rated it Critical. Root
  cause, location, and PoC sketch align.
- On `ProgressiveCurve.sol` Claude reported no findings with a detailed
  "what was checked" section; Gemini produced one speculative Low that
  didn't survive review.
- On AtomWallet, Gemini produced **different findings on different runs**
  (see "model variance" below) — none of them matched V12's Critical.
- On the three extension files, both Claude and Gemini produced
  distinct non-trivial findings on `TrustBonding` (Claude Critical,
  Gemini two Highs — see "Extension run" below); zero findings on
  `OffsetProgressiveCurve`; distinct periphery-grade Low/Medium findings
  on `TrustSwapAndBridgeRouter`.
- Both LLMs correctly filtered ~10 Slither detectors that fired on
  library code (Solady `FixedPointMathLib`, OZ `Initializable`, prb-math).

**Honest caveats** (before the rest of the writeup is read):

1. The V12 findings list was **public in the contest repo from the day the
   contest opened** (2026-03-04) and had been publicly indexable for more
   than six weeks by the time of this run (2026-04-20). I can't rule out
   that either LLM's pretraining or retrieval saw it. This benchmark does
   not demonstrate *novel*-bug-finding capability; it shows only that,
   given an isolated contract file, at least one model in the pipeline
   flags a syntactically visible bug that is also publicly known.
2. n=1 contest. The primary-scope run covered 2 files (AtomWallet,
   ProgressiveCurve); the Phase-1 extension run covered the remaining
   3 (TrustBonding, OffsetProgressiveCurve, TrustSwapAndBridgeRouter).
   Proof-of-concept, not generalizable evidence.
3. Gemini's output varies meaningfully across runs and configurations
   (`thinking` level and `max_tokens`). Documented below.

## Pipeline shape

```
(project_root, scope files, context) ──┐
                                       ▼
                                    Slither per file ──► slither_<file>.txt
                                                │
                                                ▼
        ┌── Claude Opus 4.7 ──► claude_<file>.md
src ────┤
        └── Gemini 3 Pro    ──► gemini_<file>.md
                                                │
                                                ▼
                                         aggregated_findings.md
```
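
The fan-out stage in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `review_file`, the `Reviewer` callable type, and the idea of passing reviewers as plain functions are hypothetical. Only the `claude_<file>.md` / `gemini_<file>.md` naming comes from the diagram.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict

# Hypothetical signature: (source, slither_output, context) -> markdown report.
Reviewer = Callable[[str, str, str], str]


def review_file(
    file_name: str,
    source: str,
    slither_output: str,
    context: str,
    reviewers: Dict[str, Reviewer],
    out_dir: Path,
) -> Dict[str, Path]:
    """Send identical inputs to every reviewer in parallel; write one report each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(file_name).stem
    written: Dict[str, Path] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        # Each reviewer sees the same inputs and never sees the other's output.
        futures = {
            name: pool.submit(fn, source, slither_output, context)
            for name, fn in reviewers.items()
        }
        for name, future in futures.items():
            path = out_dir / f"{name}_{stem}.md"
            path.write_text(future.result(), encoding="utf-8")
            written[name] = path
    return written
```

Keeping the reviewers as independent callables mirrors the "independent annotators" intent: nothing in this stage merges or reconciles the two reports.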

Design choices:

- **Per-file Slither**, not whole-project. Whole-project Slither on modern
  OZ + solady codebases produces 30 KB+ of library-noise that clutters LLM
  context. Per-file keeps the source+context tight.
- **Two LLMs, parallel, independent**. Each LLM is asked for its own
  finding set; the human reviewer does the intersection analysis. Not a
  committee — treating them as independent annotators is the point.
- **Structured prompt** asking for C4/Secure3-rubric severity, PoC sketch,
  Slither-FP annotations, and a "high-confidence observations" section
  (what was checked that looked correct).
- **FileCache-backed** on `(src, slither, context, model, prompt_version)`
  SHA256 key. Same inputs are free on rerun.
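
A cache key of this shape might be derived as in the sketch below. The function name `cache_key` and the length-prefix framing are my assumptions for illustration; the text above only states that the key is a SHA256 over `(src, slither, context, model, prompt_version)`.

```python
import hashlib


def cache_key(src: str, slither: str, context: str, model: str, prompt_version: int) -> str:
    """SHA-256 over all inputs, with each field length-prefixed.

    The length prefix (an assumption of this sketch) keeps field boundaries
    unambiguous, so ("ab", "c") and ("a", "bc") cannot collide.
    """
    h = hashlib.sha256()
    for field in (src, slither, context, model, str(prompt_version)):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Any change to the source, the Slither output, the context, the model name, or the version integer yields a new key, which is why bumping the version is enough to force a fresh call.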

## Benchmark: Intuition (Code4rena, contest closed 2026-03-09)

Scope fetched from `https://github.com/code-423n4/2026-03-intuition` at
commit `0a19e25`. Ground-truth finding set:
`code_423n4_2026_03_intuition__main_0a19e25_findings.md` (V12 / Zellic's
AI auditor). **This file has been in the public repo since the day the
contest opened**; see the caveat above.

Primary-scope files reviewed in this initial run (the 3 Phase-1
extension files are covered separately in the "Extension run" section
below; together the pipeline covers all 5 in-scope files):

| File | Lines |
|---|---:|
| `src/protocol/wallet/AtomWallet.sol` | 382 |
| `src/protocol/curves/ProgressiveCurve.sol` | 244 |

### Result on `AtomWallet.sol`

**Claude Opus 4.7**, both runs (thinking=high with max_tokens=12000, then
thinking=medium with max_tokens=32000), produced the same finding at
**High** severity:

> `validUntil` / `validAfter` are not covered by the ECDSA signature —
> attacker can extend or bypass user-specified time windows.
>
> The ECDSA signature is computed over `keccak256("\x19Ethereum Signed
> Message:\n32" || userOpHash)` only. The 12-byte `validUntil ||
> validAfter` suffix appended to `userOp.signature` is parsed out before
> recovery and is never authenticated. userOpHash itself (as computed by
> the EntryPoint) is derived from userOp fields excluding signature. Any
> observer can strip or mutate the 12-byte suffix (e.g., re-encode as a
> 65-byte signature → `validUntil=0` interpreted as "no expiry") and
> re-submit at an arbitrary later time. The nonce still matches and the
> underlying ECDSA still verifies.

This is **the same bug** as V12's Critical "Unsigned validity window
metadata". Root cause, affected location, and PoC shape align.
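
The shape of the bug can be shown with a toy model. Everything here is a simplification I am assuming for illustration: HMAC stands in for ECDSA, SHA-256 stands in for keccak256, the 12-byte suffix is split as two 6-byte big-endian integers, and the time-window check is done inline rather than by the EntryPoint. It is not the contract's code.

```python
import hashlib
import hmac

SIG_LEN = 65      # bare signature length
SUFFIX_LEN = 12   # validUntil (6 bytes) || validAfter (6 bytes), assumed layout


def sign(key: bytes, user_op_hash: bytes) -> bytes:
    """Stand-in for ECDSA: 65 bytes derived from the key and userOpHash only."""
    digest = hmac.new(key, user_op_hash, hashlib.sha512).digest()  # 64 bytes
    return digest + b"\x1b"


def validate(key: bytes, user_op_hash: bytes, signature: bytes, now: int) -> bool:
    """Parse the optional suffix, then verify the bare signature."""
    if len(signature) == SIG_LEN:
        valid_until, valid_after = 0, 0  # 0 is read as "no expiry"
    elif len(signature) == SIG_LEN + SUFFIX_LEN:
        suffix = signature[SIG_LEN:]
        valid_until = int.from_bytes(suffix[:6], "big")
        valid_after = int.from_bytes(suffix[6:], "big")
        signature = signature[:SIG_LEN]
    else:
        raise ValueError("invalid signature length")
    # The suffix never enters the signed message, so it is not authenticated.
    if not hmac.compare_digest(signature, sign(key, user_op_hash)):
        return False
    return now >= valid_after and (valid_until == 0 or now <= valid_until)


key = b"owner-key"
op_hash = hashlib.sha256(b"userOp fields, excluding signature").digest()
windowed = sign(key, op_hash) + (1_000).to_bytes(6, "big") + (0).to_bytes(6, "big")

expired = validate(key, op_hash, windowed, now=2_000)             # window has passed
replayed = validate(key, op_hash, windowed[:SIG_LEN], now=2_000)  # suffix stripped
```

In this model `expired` is `False` and `replayed` is `True`: dropping the suffix turns an expired operation into one with no expiry, without touching the signed bytes.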

**Severity call**: Claude rated High, V12 rated Critical. V12's Critical
is defensible: the bug lets an attacker execute a user's signed UserOp
well after the user's intended window, on a wallet whose purpose is
executing arbitrary user calls (swap / transfer / etc). Under the
C4/Secure3 rubric used in the prompt, this is closer to Critical (loss of
user funds under realistic conditions with a permissionless attack). On
reflection, I'd align with V12.

### Model variance on Gemini

Gemini 3 Pro's AtomWallet output varied substantially across runs,
including between two runs at the same thinking level:

- **thinking=high, max_tokens=12000**: output was *truncated mid-sentence*
  but produced a finding claiming `_validateSignature` reverts instead of
  returning `SIG_VALIDATION_FAILED`. On review this finding had a factual
  problem: by the time `ECDSA.tryRecover(hash, signature)` is called,
  `_extractValidUntilAndValidAfterFromSignature` has already reverted
  any signature whose length isn't 65 or 77, so the
  `RecoverError.InvalidSignatureLength` branch is effectively dead code.
  The only reachable revert is on `InvalidSignatureS` (high-S), and OZ /
  ethers enforce low-S by default. Narrow, not Medium.
- **thinking=high, max_tokens=24000** (rerun after raising limit): also
  truncated; produced a *different* finding — "claimAtomWalletDepositFees
  and transferOwnership are unusable via ERC-4337 UserOperations due to
  onlyOwner + nonReentrant". This is an interesting AA-UX finding but
  doesn't reach the Critical.
- **thinking=medium, max_tokens=32000** (final run): clean output, no
  truncation, *no findings* — just a high-confidence observations section
  and Slither-FP filter.

This is the honest signal: **Gemini is high-variance on this task** and
did not identify, in any of its three runs, the Critical root cause that
Claude found in both of its runs. Gemini's contribution to the pipeline
on AtomWallet is near-zero.
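
One way to make this variance explicit is to keep only findings that recur in every run of a model. The sketch below encodes the runs described above; the short finding labels and the helper names are hypothetical.

```python
from typing import Dict, List, Set

# Labels are hypothetical shorthand; run contents follow the results above.
RUNS: Dict[str, List[Set[str]]] = {
    "claude": [
        {"unsigned-validity-window"},             # thinking=high, max_tokens=12000
        {"unsigned-validity-window"},             # thinking=medium, max_tokens=32000
    ],
    "gemini": [
        {"validate-signature-reverts"},           # thinking=high, max_tokens=12000
        {"owner-functions-unusable-via-userop"},  # thinking=high, max_tokens=24000
        set(),                                    # thinking=medium, max_tokens=32000
    ],
}


def stable_findings(runs: List[Set[str]]) -> Set[str]:
    """Findings reported in every run of a model."""
    return set.intersection(*runs) if runs else set()


def ever_reported(runs: List[Set[str]]) -> Set[str]:
    """Findings reported in at least one run of a model."""
    return set().union(*runs)
```

On this data Claude has one stable finding and Gemini has none, even though Gemini reported two distinct findings across its runs.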

### Result on `ProgressiveCurve.sol`

**Claude**: zero findings, with a detailed high-confidence observations
section (rounding directions, MAX_SHARES / MAX_ASSETS bounds, even-slope
check, `initializer` modifier, view-only functions, storage isolation).

**Gemini**: one speculative Low on `previewWithdraw` edge-case revert.
Didn't survive manual review — the preconditions require a
`(shares, HALF_SLOPE)` combination that doesn't happen under the
production `minDeposit = 1e16` / `minShare = 1e6` floors, and the
reverting path is arguably the correct UX (vs silent underflow).

V12 also reported nothing on `ProgressiveCurve.sol`. That's **one negative
signal**, not a proof the curve math is sound.

### Slither false-positive handling

Every Slither detector hit on `ProgressiveCurve` was in library code
(`FixedPointMathLib`, `Initializable`, prb-math). Both LLMs correctly
annotated the hits as library-internal and out-of-scope, with
justifications. This is probably the most repeatable high-value signal
from the pipeline: without it, an auditor spends real time ruling these
detectors in or out.

## What the pipeline does NOT do

- Write runnable Foundry PoCs. The LLM produces a sketch; the human writes
  + runs + verifies the test in the project's test suite. C4 requires a
  coded, runnable PoC for High/Medium submissions.
- Pick severity definitively. The LLM suggests; the judge decides. The
  AtomWallet High/Critical disagreement illustrates this.
- Find cross-contract economic / governance / MEV / oracle-manipulation
  exploits. The pipeline is strongest on signature verification / access
  control / ERC-20 integration / rounding direction. Untested on
  protocol-level invariant bugs that require multi-contract state reasoning.
- Generalize beyond EVM. Slither is EVM-only.

## Reproducibility

```
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
  <project_root> <project_name> \
  --files src/A.sol src/B.sol \
  --context "Scope notes, known invariants, non-goals"
```

Output lands in `sandbox_out/reviews/<project_name>/`, cached in
`file_cache_dir/`. Cache keys include a version integer — bump it when
you change prompt / max_tokens / thinking config.

## Cost (rough, from this run)

Claude Opus 4.7 + Gemini 3 Pro per file, `thinking="high"` or
`"medium"` with max_tokens up to 32K — empirical spend on the initial
2-file Intuition run was on the order of a few USD total. Larger
codebases scale linearly in per-file cost.

Per-run instrumentation is now written out as
`sandbox_out/reviews/<name>/llm_cost.json` — see the "llm_cost.json"
section at the bottom. Cached re-runs are free (billed $0).

## Limitations — please read before trusting the result

- **V12 findings were public at run time** (see top). Re-discovery on a
  publicly-disclosed bug is not the same as novel-bug-finding.
- **n=1 contest, 5 files** (2 primary + 3 extension). Directional, not
  representative.
- **Gemini variance** on AtomWallet across runs is a real pipeline issue,
  not a benchmarking nit. I'd budget for it by heavily discounting Gemini
  as an annotator on this class of file rather than treating it as a
  co-equal second opinion.
- **Pipeline was tuned on this benchmark**. I ran it once, looked at the
  output, tweaked max_tokens and cache version, ran again. The tweaks are
  small but the benchmark is no longer blind.
- **No automated PoC verification**. The pipeline tells you what to write
  as a PoC; it doesn't compile-and-run it.

## Extension run (Segment 4 Phase 1, 2026-04-21)

I extended the benchmark to the remaining three in-scope Intuition files:
`TrustBonding.sol`, `OffsetProgressiveCurve.sol`, and
`TrustSwapAndBridgeRouter.sol`. The last one lives in a sub-repo at
`intuition-contracts-v2-periphery/contracts/` (periphery; separate
Foundry project from the core emissions / curves / wallet contracts).
Outputs are in `sandbox_out/reviews/intuition_c1_extended/` +
`sandbox_out/reviews/intuition_trust_swap_bridge/`, plus the new
`llm_cost.json` files in each.

### `TrustBonding.sol`

Both Claude and Gemini reported **plausible epoch-level accounting bugs**,
distinct from each other but related.

- **Claude — Critical:** `_balanceOf(user, past_ts)` (in `VotingEscrow.sol`,
  called by `TrustBonding.userBondedBalanceAtEpochEnd`) extrapolates from
  the user's *latest* checkpoint even when the queried timestamp is before
  that checkpoint. A user who has never locked can create a lock just after
  epoch N ends, then call `claimRewards`, and the numerator of their
  emission share uses the inflated extrapolation → they siphon epoch-N
  emissions they did not earn. Slither flagged the short `_balanceOf`
  function (lines 630–638) as asymmetric with how `_totalSupply` is
  computed; Claude connected it to the TrustBonding emission pipeline.
- **Gemini — two High findings on the same file:**
  - (1) **Zero-balance reward eclipse**: a lock whose expiration lands
    exactly on `_epochTimestampEnd(epoch)` evaluates to `balance=0` at that
    exact timestamp (Curve decay semantics), so `userBondedBalanceAtEpochEnd`
    returns 0 and the user forfeits that final epoch even though the funds
    were locked through most of it.
  - (2) **JIT reward extraction**: `_userEligibleRewardsForEpoch` takes a
    single end-of-epoch snapshot instead of time-integrating, so an
    attacker can lock at the last block before `_epochTimestampEnd` and
    still claim the full epoch's proportional emissions.
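
The gap between a single end-of-epoch snapshot and a time-integrated share can be illustrated with simple arithmetic. This is a toy model under my own assumptions: balances do not decay, the epoch length is an arbitrary two weeks, and the function names are hypothetical. It is not the contract's reward formula.

```python
EPOCH_SECONDS = 14 * 24 * 3600  # assumed epoch length, for illustration only


def snapshot_share(user_at_end: int, total_at_end: int) -> float:
    """Share derived from balances observed only at the epoch's final timestamp."""
    return user_at_end / total_at_end


def time_weighted_share(
    user_amount: int, user_seconds: int, others_amount: int, others_seconds: int
) -> float:
    """Share derived from amount multiplied by time locked within the epoch."""
    user_weight = user_amount * user_seconds
    return user_weight / (user_weight + others_amount * others_seconds)


honest_amount = 100    # locked for the whole epoch
attacker_amount = 100  # locked for the final 12 seconds only

jit_snapshot = snapshot_share(attacker_amount, honest_amount + attacker_amount)
jit_weighted = time_weighted_share(attacker_amount, 12, honest_amount, EPOCH_SECONDS)
```

Under the snapshot rule the late locker receives half of the epoch's emissions; under time weighting its share is on the order of one part in 100,000.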

These are distinct, non-redundant findings. I didn't have V13 (post-mitigation
report) open at run time to cross-check which of these are already known,
but the Claude Critical in particular has a clean PoC sketch and cites a
specific Slither artifact (asymmetry between `_balanceOf` and `_totalSupply`)
that a human reviewer can verify in ~5 min.
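
Claude's `_balanceOf` claim can be sanity-checked against a toy checkpoint model before opening the contract. The `Point` layout, the signed arithmetic, and both function names below are assumptions in the style of Curve's vote-escrow design, not the code in `VotingEscrow.sol`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Point:
    ts: int     # checkpoint timestamp
    bias: int   # balance at ts
    slope: int  # decay per second


def balance_latest_checkpoint(history: List[Point], t: int) -> int:
    """Flawed pattern: always extrapolate from the latest checkpoint."""
    if not history:
        return 0
    last = history[-1]
    # For t < last.ts the elapsed time is negative, so the bias grows.
    return max(last.bias - last.slope * (t - last.ts), 0)


def balance_at(history: List[Point], t: int) -> int:
    """Reference behavior: use the last checkpoint at or before t, else zero."""
    eligible = [p for p in history if p.ts <= t]  # history is sorted by ts
    if not eligible:
        return 0
    p = eligible[-1]
    return max(p.bias - p.slope * (t - p.ts), 0)


epoch_end = 1_000_000
# The user's first lock is created 100 seconds after the epoch ended.
history = [Point(ts=epoch_end + 100, bias=5_000, slope=1)]

flawed = balance_latest_checkpoint(history, epoch_end)
reference = balance_at(history, epoch_end)
```

Here `flawed` is 5100 while `reference` is 0: the user had no lock at `epoch_end`, yet the extrapolation credits them with more than their initial bias. Whether the real function behaves this way is exactly what the human reviewer needs to confirm.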

### `OffsetProgressiveCurve.sol`

Both Claude and Gemini returned **zero findings** with thorough "what was
checked" sections. Rounding directions in all four conversion paths
(deposit, mint, withdraw, redeem) favor the protocol and are symmetric
with the non-offset version. Both models flagged `PCMath.square` /
`PCMath.squareUp` semantics in `BaseCurve` as a residual verification
item (out of the file's scope to resolve).

### `TrustSwapAndBridgeRouter.sol`

A stateless periphery router that wraps Aerodrome Slipstream swaps +
Metalayer bridge transfers for the Trust token.

- **Claude — Low (possibly Medium):** `bridgeFee` is quoted via
  `MetaERC20Hub.quoteTransferRemote(recipient, minTrustOut)` but the
  later `_bridgeTrust` call forwards the *actual* `amountOut` (≥
  `minTrustOut`). If Metalayer's fee is amount-independent (typical of
  Hyperlane-style IGP), this is harmless. If it scales with amount, the
  router reverts with insufficient `msg.value` whenever
  `amountOut > minTrustOut`, which is effectively a DoS unless users set
  `minTrustOut` very tight. Remediation: re-quote `bridgeFee` using
  `amountOut` post-swap.
- **Gemini — Medium:** `deadline = block.timestamp` on Slipstream swap
  structs provides no MEV protection — the AMM interprets this as "this
  block, always," so sandwich/delay attacks at high mempool congestion
  stay live. Remediation: add an explicit `uint256 deadline` user
  parameter.

Neither is a Critical, and both depend on off-file context (bridge fee
semantics; MEV mempool assumptions). Worth verifying against V13 /
project comment.
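
The dependence of Claude's finding on fee semantics can be made concrete with two hypothetical fee models. Nothing below reflects Metalayer's real pricing; `flat_fee`, `proportional_fee`, and `bridge_succeeds` are illustrative names, and the model ignores how the user funds `msg.value`.

```python
from typing import Callable

FeeQuote = Callable[[int], int]  # amount -> fee in wei (hypothetical model)


def flat_fee(amount: int) -> int:
    """Amount-independent fee."""
    return 1_000


def proportional_fee(amount: int) -> int:
    """Fee that scales with the bridged amount (0.1% here, arbitrary)."""
    return amount // 1_000


def bridge_succeeds(
    quote: FeeQuote, min_trust_out: int, amount_out: int, requote: bool = False
) -> bool:
    """True if the fee that was budgeted covers the fee actually required."""
    if amount_out < min_trust_out:
        raise ValueError("swap would have reverted on slippage")
    budgeted = quote(amount_out if requote else min_trust_out)
    required = quote(amount_out)
    return budgeted >= required


ok_flat = bridge_succeeds(flat_fee, 1_000_000, 1_050_000)
fails_proportional = bridge_succeeds(proportional_fee, 1_000_000, 1_050_000)
ok_requoted = bridge_succeeds(proportional_fee, 1_000_000, 1_050_000, requote=True)
```

With a flat fee the mismatch is harmless; with a proportional fee the budgeted amount falls short whenever the swap beats `minTrustOut`, and re-quoting on `amountOut` removes the gap in this model.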

### What the extension run shows

- **The pipeline produces non-trivial, distinct findings across two models
  on `TrustBonding`** — Claude and Gemini found different bugs, both
  plausibly valid, without overlap. That makes the dual-LLM pattern
  productive on this class of file, in contrast to `AtomWallet` where
  Gemini was high-variance.
- **Both models correctly return "no findings" on `OffsetProgressiveCurve`
  with evidence** — not a refusal, but a substantive clean-bill-of-health
  with "what I checked" items. This matters for auditor trust.
- **On `TrustSwapAndBridgeRouter`, both models surfaced distinct
  periphery-grade issues** (Low/Medium). Neither is a Critical, but both
  are concrete, non-duplicative, and carry a remediation proposal. A
  human auditor would treat the same file the same way — "clean on
  critical-path, two small periphery items to verify."
- **Cache was reused from earlier exploratory runs.** Most calls hit
  `file_cache_dir` (see `cached: true` in each `llm_cost.json`), and those
  cached calls added $0 in API spend to this phase's run. The original
  token-cost estimate is recorded in the per-run `llm_cost.json` as
  `input_chars` + `output_chars`.

Caveats unchanged: V12 was public, V13 may or may not already cover these
findings, still a rediscovery pass.

## llm_cost.json (new this phase)

Each review run now emits `sandbox_out/reviews/<name>/llm_cost.json` with
per-call input/output char counts, estimated tokens (chars / 4), model
pricing, and an estimated $USD per call. Cached calls are `$0` in the
per-run total because tokens were billed on the prior run. Ground-truth
cumulative spend is tracked in `total_cost.jsonl` via `cost_tracker.py`
elsewhere in the repo — the per-run file is a guide, not a bill.
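
The per-call estimate can be reproduced along these lines. The field names `input_chars`, `output_chars`, and `cached` come from the description above; the function name and the prices in the example are placeholders, not real model pricing.

```python
from typing import Dict

CHARS_PER_TOKEN = 4  # same chars / 4 heuristic as the per-run file


def estimate_call_cost(call: Dict, usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Estimated USD for one call; cached calls count as zero for this run."""
    if call.get("cached"):
        return 0.0
    input_tokens = call["input_chars"] / CHARS_PER_TOKEN
    output_tokens = call["output_chars"] / CHARS_PER_TOKEN
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


# Placeholder prices (USD per million tokens), chosen only to make the arithmetic easy.
example = {"input_chars": 40_000, "output_chars": 8_000, "cached": False}
example_cost = estimate_call_cost(example, usd_per_mtok_in=10.0, usd_per_mtok_out=50.0)
```

With these placeholder prices the example works out to 10,000 input tokens and 2,000 output tokens, or about $0.20. The chars / 4 heuristic is coarse for Solidity source, which is one more reason to treat the per-run file as a guide.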

## Next steps (if anyone finds this useful)

- Run on a contest scope *before* V12 / other-auditor findings are public,
  as a blind test — that's the only way to validate novel-bug-finding.
- Cross-check the TrustBonding findings against V13 / a project-team
  response.
- Add an automated PoC compilation step: drop the LLM's sketch into a
  Foundry `test_*.sol`, `forge test`, feed error surface back.

Contact: `merovan@envs.net`. Happy to do a pro-bono review pass if you
want this pattern run on your scope.