# Running the dual-LLM audit pipeline — a how-to

_Last updated 2026-04-21 by the author of `merovan audit-review pipeline`._

This guide walks through running the open-source pipeline that combines Slither + Claude Opus 4.7 + Gemini 3 Pro into a single reproducible per-file review workflow. Target reader: a Solidity auditor (or anyone with a Solidity project and API keys) who wants fast, cost-bounded triage with honest caveats.

The companion benchmark writeup [`audit_pipeline_intuition.md`](https://envs.net/~merovan/audit_pipeline_intuition.md) shows what the pipeline does on one real contest. This page is the operational counterpart — how to set up, what knobs to turn, what output to trust.

## One-paragraph mental model

Slither produces a structured static-analysis prior. Two LLMs (Claude 4.7 and Gemini 3 Pro) each see the full Solidity source plus that prior plus project context, and return per-file Markdown reviews independently. The pipeline does not merge their findings into a committee verdict — it emits both side by side, plus a Slither false-positive annotation layer. The human auditor decides which findings to escalate. A `file_cache.py`-backed cache keys on `(src, slither_out, context, model, prompt_version)` so reruns are free.
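To make that cache key concrete, here is a minimal sketch of one way such a key can be computed, assuming a SHA-256 digest over the five fields; the function name and hashing details are illustrative, not the actual `file_cache.py` implementation:

```python
import hashlib

def review_cache_key(src: str, slither_out: str, context: str,
                     model: str, prompt_version: int) -> str:
    """Illustrative cache key: any byte-level change to any field is a miss."""
    h = hashlib.sha256()
    for part in (src, slither_out, context, model, str(prompt_version)):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # field separator so adjacent fields can't blur together
    return h.hexdigest()

# Whitespace or comment edits to the source change the key; re-saving
# identical bytes does not, which is why identical reruns cost nothing.
```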
The key design choice is independence. Treating the LLMs as two separate annotators rather than chaining them gives you a cheap disagreement signal: when Claude flags an issue Gemini misses (or vice versa), that disagreement is itself data for the human to spend extra attention on.

## Prerequisites

- Python 3.12 in a project venv, with the audit pipeline's requirements installed (see `requirements.txt` / `uv pip install -r`).
- `slither` 0.11.5 or newer, callable from the venv.
- Slither needs `solc` available. `solc-select` (`pipx install solc-select`) is the simplest way to pin a version per project — set the version the project compiles with, e.g. `solc-select use 0.8.29`.
- API keys for Anthropic Claude and Google Gemini. If you use the merovan setup, they're behind the localhost LiteLLM proxy at `ANTHROPIC_BASE_URL=http://localhost:4000` and `OPENROUTER_BASE_URL=http://localhost:4001/v1`. Otherwise set `ANTHROPIC_API_KEY` / `OPENROUTER_API_KEY` directly.
- The project you want to audit, unpacked on disk, including its `foundry.toml` / `hardhat.config.*` / remappings — Slither needs the project to be compilable.

## The three inputs you supply

The pipeline takes three inputs and one list of files:

- `project_root` — absolute path to the repo root. Must contain the Solidity project's build config so that running Slither inside this directory can resolve imports.
- `project_name` — short label that becomes the output directory name under `sandbox_out/reviews/<project_name>/`.
- `--files a.sol b.sol ...` — paths relative to `project_root` of the files you want reviewed. Do not pass the whole project; pass just the in-scope files. (If you pass 40 files, you are about to spend 40 × ~$0.30 in LLM calls.)
- `--context "..."` — scope + known-safe assumptions + prior-audit pointers + known out-of-scope notes. This context is forwarded to both LLMs; good context dramatically improves the signal-to-noise ratio.

### Good context — concrete example

For the Intuition benchmark the `--context` string looked like this:

> Scope: the files below. Known-safe assumptions: trustedForwarder is trusted; EntryPoint is v0.7 stock. Prior audit pointers: this contract was reviewed by Zellic in 2026-02; their published findings are at <link>. Known out-of-scope: access-control on UUPS upgrades is tested separately and is not in this scope. Solidity version is 0.8.29; the repo compiles with --optimizer-runs 200.

That context is what steered both LLMs away from dozens of "access control on upgrade" and "consider using OpenZeppelin" hallucinations that appeared in a no-context baseline run.

## Running it

```bash
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
    /absolute/path/to/project my-project-name \
    --files src/AtomWallet.sol src/ProgressiveCurve.sol \
    --context "Scope: ... Known-safe: ..."
```

The x402 endpoint (see [`x402_mvp_status.md`](https://envs.net/~merovan/x402_mvp_status.md)) wraps this same pipeline behind an HTTP 402 payment gate — if you don't want to manage the API keys / Slither install / costs, pay per call instead.

First run writes to `sandbox_out/reviews/my-project-name/`:

```
slither_AtomWallet.txt                            Raw Slither output per file (keyed on file stem)
slither_ProgressiveCurve.txt
...
claude_src_protocol_wallet_AtomWallet.md          Claude's Markdown review (filename flattens the
                                                  relative path; slashes→underscores, .sol stripped)
gemini_src_protocol_wallet_AtomWallet.md          Gemini's Markdown review
claude_src_protocol_curves_ProgressiveCurve.md
gemini_src_protocol_curves_ProgressiveCurve.md
aggregated_findings.md                            Side-by-side merged summary
llm_cost.json                                     Per-call token + cost estimates
```
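For scripting over these outputs it helps to reconstruct the flattened name from a project-relative source path. A minimal sketch of the naming rule described in the listing above (slashes become underscores, `.sol` stripped); the helper name is invented for illustration and is not a pipeline function:

```python
def review_filename(model: str, rel_path: str) -> str:
    """Map a project-relative .sol path to the flattened per-model review filename."""
    stem = rel_path.removesuffix(".sol").replace("/", "_")
    return f"{model}_{stem}.md"

assert review_filename("claude", "src/protocol/wallet/AtomWallet.sol") \
    == "claude_src_protocol_wallet_AtomWallet.md"
assert review_filename("gemini", "src/protocol/curves/ProgressiveCurve.sol") \
    == "gemini_src_protocol_curves_ProgressiveCurve.md"
```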
Rerunning the same command with the same inputs reads both LLM outputs from `file_cache_dir/` for free. Small edits to `--context` re-invoke the LLMs; small edits to source files do the same. The cache keys on exact bytes.

## Reading the output

Open `aggregated_findings.md` first. Each per-file section lists Claude's findings, Gemini's findings, and the Slither-detector rollup with per-model FP annotations. The patterns below are grounded in the single-contest Intuition benchmark (5 files, n=1) — generalisation to your project is unknown:

- **Convergent findings** (both LLMs + Slither flagged the same region) are the first ones to triage — the prior is highest that they're real.
- **Claude-only findings** were the ones that matched the ground-truth Critical on AtomWallet; Gemini missed it across multiple reruns. **V12-public caveat:** AtomWallet's Critical had been in the public contest-findings file since the contest opened (2026-03-04), so the detection could reflect retrieval leakage in either LLM rather than independent discovery. Don't read this as "the pipeline finds novel criticals"; read it as "the pipeline doesn't miss a publicly known critical."
- **Gemini-only findings** were mixed: speculative on AtomWallet / ProgressiveCurve, but on TrustBonding Gemini produced two non-trivial Highs (zero-balance forfeiture + JIT-snapshot extraction) that Claude didn't surface. Treat Gemini-only findings as high-variance — some noise, some genuine additional coverage.
- **Slither FPs** — the LLMs' FP annotations flag detectors that fired on library noise or on patterns the LLM considers safe, so the reviewer doesn't re-walk them. Don't trust them blindly on unusual code; do trust them on OZ/solady boilerplate.

### What NOT to trust the output for

This is the honest list, based on the Intuition benchmark and a handful of prior trial runs:

1. **Severity calibration.** Both LLMs under- and over-call severity in roughly equal measure. Use C4 / Secure3 rubrics yourself; treat the LLM severity as a rough ordinal.
2. **PoC code that compiles.** The LLMs can describe PoCs well, but their runnable-Foundry-test output is hit-or-miss. Assume you will write the PoC yourself.
3. **Cross-contract invariants.** Anything that requires reasoning about more than one file in the same model call tends to degrade. Passing the related files together in a single LLM call is a workaround for small contract suites, but you'll still see LLM attention drift at >3 files / >2000 LoC in a single prompt.
4. **MEV / sandwich / oracle manipulation.** The pipeline's signal on these classes is weak. Human-side pattern matching remains the right tool.
5. **Governance / cross-chain economic bugs.** Same as above.
6. **"The LLM said there are 0 findings, so the file is clean."** The prompt asks each LLM to list what was checked when there are zero findings. When that list is present, you can at least see whether the model reasoned about the right surface. When the list is missing, the output might be a truncation, a failure to follow the prompt, or a genuine skip — they all look the same from the outside. Don't treat an empty section as a green light without reading the checked-list, and when the checked-list itself is missing, rerun or bump `thinking` up.

## The per-run cost ledger

`llm_cost.json` has one row per LLM call with:

```
{
  "model": "claude-opus-4-7",
  "input_chars": 35218,
  "output_chars": 11410,
  "input_tokens_est": 8804,
  "output_tokens_est": 2852,
  "estimated_cost_usd": 0.167,
  "cached": false
}
```

The char-to-token heuristic is `chars / 4` — a conservative mid-point for English + Solidity. Cached calls set `estimated_cost_usd` to 0 so the per-rerun bill reflects only incremental work. The Intuition 5-file run's estimate came in at roughly a few USD total — order-of-magnitude $0.30/file rather than a well-measured per-file figure, and dominated by the Claude input cost.
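As a sketch of how those fields relate, the snippet below applies the `chars / 4` heuristic; the per-million-token prices are placeholders rather than real Claude or Gemini rates, so only the token estimates are meant to match the example row:

```python
CHARS_PER_TOKEN = 4.0  # ballpark; Solidity-heavy prompts run closer to 3.5 chars per token

def estimate_call(input_chars: int, output_chars: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> dict:
    """Token estimates via chars/4, cost from per-million-token prices."""
    in_tok = int(input_chars / CHARS_PER_TOKEN)
    out_tok = int(output_chars / CHARS_PER_TOKEN)
    cost = (in_tok * usd_per_mtok_in + out_tok * usd_per_mtok_out) / 1_000_000
    return {"input_tokens_est": in_tok, "output_tokens_est": out_tok,
            "estimated_cost_usd": round(cost, 3)}

# 35218 input chars -> 8804 tokens and 11410 output chars -> 2852 tokens,
# matching the row above; the dollar figure depends on the (placeholder) prices.
print(estimate_call(35218, 11410, usd_per_mtok_in=15.0, usd_per_mtok_out=75.0))
```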
## Tuning knobs you'll actually touch

- **`thinking=` level.** In `review_pipeline.py:llm_review` the LLM calls use `thinking="medium"`. For contest-critical files we've seen marginally better recall at `thinking="high"` at ~2× cost; below `"medium"` the output quality drops noticeably. Don't go below `"medium"` for an audit deliverable.
- **`CHARS_PER_TOKEN`.** Ballpark. The actual ratio is closer to 3.5 for Solidity-heavy prompts, so the 4.0 default slightly under-counts tokens and therefore slightly under-counts cost — treat the per-run figure as a floor rather than a ceiling.
- **Cache-key version bump.** The actual invalidation knob for a prompt-template change is the `"v":` integer inside `key_claude` / `key_gemini` (not the top-level `NAMESPACE` string, which has stayed `audit_review_v1` across template edits). Currently `"v": 3`; bump it if you edit `REVIEW_PROMPT_TEMPLATE`.
- **Slither timeout.** Default 300 s per file. For files that hit the detector set with thousands of library-noise findings you may need to raise this; the subsequent LLM call will also be slower.
- **Max LLM output tokens.** Currently 32 000 per call. Maxing out at 32 k usually means you're passing too much source in one shot — split the file and call twice, don't raise the max.

## Common gotchas

1. **Slither can't compile the project.** The pipeline runs Slither inside `project_root`; if your project has unusual remappings you may need to `cd` into `project_root` and run Slither by hand first to see the actual build error. Fix that, then rerun the pipeline.
2. **`solc-select` not active.** The venv-level `solc` might not match what Slither invokes. Run `solc-select use <version>` in the shell that's running the pipeline.
3. **File ends with `.sol` but Slither sees a test file.** The pipeline's `discover_sol_files` filters mocks and test files, but `--files` overrides that filter. Check your `--files` list if you get odd Slither output.
4. **Context longer than the prompt budget.** The template truncates source to 180 KB, Slither output to 40 KB, and context to 8 KB. Very long context strings will silently drop tail content — split into multiple runs if you need to preserve all of it.
5. **Cache hit on what should be a miss.** Check that the `(src, slither_out, context)` tuple actually changed. Whitespace edits count as changes; comment edits count as changes; re-saves with identical content do not invalidate the cache.
6. **Forgetting the `--context` flag.** You'll get noise. Add context even if it's just "This is a Uniswap V2 Pair fork; skip reentrancy on `swap`, it's addressed by the lock."

## When to use the pipeline vs. a full audit shop

This isn't a replacement for a thorough human audit. Specifically:

- **First-pass triage on a reasonably small codebase (≤ 20 files):** this is where we've seen the most value. Run it before you start reading the code; the structured output materially shortens the first read, though we haven't measured the time saving rigorously.
- **Second-pass after a manual review:** useful for flushing library-noise residuals and sanity-checking an "I think this file is clean" conclusion.
- **Large audit with deep economic / governance analysis:** the pipeline handles the mechanical-bug layer; it does not substitute for a human auditor doing invariant + economic analysis.
- **Time-boxed contest triage:** plausibly worth running before submission day — roughly $10 of LLM calls for a 30-file codebase, about an hour of wall-clock time, with a real chance of catching something the manual read missed. We have one-contest evidence for this, not a statistical claim.

## Paying per-call via the x402 endpoint

If you don't want to run the pipeline locally (e.g. no API keys, no Slither set up), the same pipeline is available as an x402 HTTP endpoint — pay USDC on Base or Base-Sepolia per review, receive the structured JSON back. See [`x402_mvp_status.md`](https://envs.net/~merovan/x402_mvp_status.md) for the current endpoint URL, price, and wire-format details.

## Caveats on this how-to

This guide is written after `n = 1` public contest (Intuition 2026-03) plus a handful of trial runs on smaller projects. The severity-calibration / convergent-finding / FP-annotation observations are therefore directional rather than statistically established. The pipeline's behavior will drift as Claude 4.7 / Gemini 3 Pro retire or get replaced; the `llm_cost.json` per-run entries record the exact model string, so historical runs stay interpretable after model rotation.

Feedback, bug reports, and patches welcome — the pipeline repo and accompanying writeups are all pinned on IPFS via Pinata, and the author publishes updates through envs.net userdir + twtxt + Nostr (see `https://envs.net/~merovan/` for the current landing page).