# Running the dual-LLM audit pipeline — a how-to

_Last updated 2026-04-21 by the author of `merovan audit-review pipeline`._

This guide walks through running the open-source pipeline that combines Slither + Claude Opus 4.7 + Gemini 3 Pro into a single reproducible per-file review workflow. Target reader: a Solidity auditor (or anyone with a Solidity project and API keys) who wants fast, cost-bounded triage with honest caveats.

The companion benchmark writeup [`audit_pipeline_intuition.md`](https://envs.net/~merovan/audit_pipeline_intuition.md) shows what the pipeline does on one real contest. This page is the operational counterpart — how to set up, what knobs to turn, what output to trust.

## One-paragraph mental model

Slither produces a structured static-analysis prior. Two LLMs (Claude 4.7 and Gemini 3 Pro) each see the full Solidity source plus that prior plus project context, and return per-file Markdown reviews independently. The pipeline does not merge their findings into a committee verdict — it emits both side by side, plus a Slither false-positive annotation layer. The human auditor decides which findings to escalate. A `file_cache.py`-backed cache keys on `(src, slither_out, context, model, prompt_version)` so reruns are free.
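To make that cache key concrete, here is a minimal sketch of one way such a key can be computed, assuming a SHA-256 digest over the five fields; the function name and hashing details are illustrative, not the actual `file_cache.py` implementation:

```python
import hashlib

def review_cache_key(src: str, slither_out: str, context: str,
                     model: str, prompt_version: int) -> str:
    """Illustrative cache key: any byte-level change to any field is a miss."""
    h = hashlib.sha256()
    for part in (src, slither_out, context, model, str(prompt_version)):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # field separator so adjacent fields can't blur together
    return h.hexdigest()

# Whitespace or comment edits to the source change the key; re-saving
# identical bytes does not, which is why identical reruns cost nothing.
```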
The key design choice is independence. Treating the LLMs as two separate annotators rather than chaining them gives you a cheap disagreement signal: when Claude flags an issue Gemini misses (or vice versa), that disagreement is itself data for the human to spend extra attention on.

## Prerequisites

- Python 3.12 in a project venv, with the audit pipeline's requirements installed (see `requirements.txt` / `uv pip install -r`).
- `slither` 0.11.5 or newer, callable from the venv.
- Slither needs `solc` available. `solc-select` (`pipx install solc-select`) is the simplest way to pin a version per project — set the version the project compiles with, e.g. `solc-select use 0.8.29`.
- API keys for Anthropic Claude and Google Gemini. If you use the merovan setup, they're behind the localhost LiteLLM proxy at `ANTHROPIC_BASE_URL=http://localhost:4000` and `OPENROUTER_BASE_URL=http://localhost:4001/v1`. Otherwise set `ANTHROPIC_API_KEY` / `OPENROUTER_API_KEY` directly.
- The project you want to audit, unpacked on disk, including its `foundry.toml` / `hardhat.config.*` / remappings — Slither needs the project to be compilable.

## The three inputs you supply

The pipeline takes three inputs and one list of files:

- `project_root` — absolute path to the repo root. Must contain the Solidity project's build config so that running Slither inside this directory can resolve imports.
- `project_name` — short label that becomes the output directory name under `sandbox_out/reviews/<project_name>/`.
- `--files a.sol b.sol ...` — paths relative to `project_root` of the files you want reviewed. Do not pass the whole project; pass just the in-scope files. (If you pass 40 files, you are about to spend 40 × ~$0.30 in LLM calls.)
- `--context "..."` — scope + known-safe assumptions + prior-audit pointers + known out-of-scope notes. This context is forwarded to both LLMs; good context dramatically improves the signal-to-noise ratio.

### Good context — concrete example

For the Intuition benchmark the `--context` string looked like this:

> Scope: the files below. Known-safe assumptions: trustedForwarder is trusted; EntryPoint is v0.7 stock. Prior audit pointers: this contract was reviewed by Zellic in 2026-02; their published findings are at <link>. Known out-of-scope: access-control on UUPS upgrades is tested separately and is not in this scope. Solidity version is 0.8.29; the repo compiles with --optimizer-runs 200.

That context is what steered both LLMs away from dozens of "access control on upgrade" and "consider using OpenZeppelin" hallucinations that appeared in a no-context baseline run.

## Running it

```bash
source .venv/bin/activate
python scripts_execute_audit_pipeline/review_pipeline.py \
    /absolute/path/to/project my-project-name \
    --files src/AtomWallet.sol src/ProgressiveCurve.sol \
    --context "Scope: ... Known-safe: ..."
```

The x402 endpoint (see [`x402_mvp_status.md`](https://envs.net/~merovan/x402_mvp_status.md)) wraps this same pipeline behind an HTTP 402 payment gate — if you don't want to manage the API keys / Slither install / costs, pay per call instead.

First run writes to `sandbox_out/reviews/my-project-name/`:

```
slither_AtomWallet.txt                            Raw Slither output per file (keyed on file stem)
slither_ProgressiveCurve.txt
...
claude_src_protocol_wallet_AtomWallet.md          Claude's Markdown review (filename flattens the
                                                  relative path; slashes→underscores, .sol stripped)
gemini_src_protocol_wallet_AtomWallet.md          Gemini's Markdown review
claude_src_protocol_curves_ProgressiveCurve.md
gemini_src_protocol_curves_ProgressiveCurve.md
aggregated_findings.md                            Side-by-side merged summary
llm_cost.json                                     Per-call token + cost estimates
```
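For scripting over these outputs it helps to reconstruct the flattened name from a project-relative source path. A minimal sketch of the naming rule described in the listing above (slashes become underscores, `.sol` stripped); the helper name is invented for illustration and is not a pipeline function:

```python
def review_filename(model: str, rel_path: str) -> str:
    """Map a project-relative .sol path to the flattened per-model review filename."""
    stem = rel_path.removesuffix(".sol").replace("/", "_")
    return f"{model}_{stem}.md"

assert review_filename("claude", "src/protocol/wallet/AtomWallet.sol") \
    == "claude_src_protocol_wallet_AtomWallet.md"
assert review_filename("gemini", "src/protocol/curves/ProgressiveCurve.sol") \
    == "gemini_src_protocol_curves_ProgressiveCurve.md"
```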
Rerunning the same command with the same inputs reads both LLM outputs from `file_cache_dir/` for free. Small edits to `--context` re-invoke the LLMs; small edits to source files do the same. The cache keys on exact bytes.

## Reading the output

Open `aggregated_findings.md` first. Each per-file section lists Claude's findings, Gemini's findings, and the Slither-detector rollup with per-model FP annotations. The patterns below are grounded in the single-contest Intuition benchmark (5 files, n=1) — generalisation to your project is unknown:

- **Convergent findings** (both LLMs + Slither flagged the same region) are the first ones to triage — the prior is highest that they're real.
- **Claude-only findings** were the ones that matched the ground-truth Critical on AtomWallet; Gemini missed it across multiple reruns. **V12-public caveat:** AtomWallet's Critical had been in the public contest-findings file since the contest opened (2026-03-04), so the detection could reflect retrieval leakage in either LLM rather than independent discovery. Don't read this as "the pipeline finds novel criticals"; read it as "the pipeline doesn't miss a publicly known critical."
- **Gemini-only findings** were mixed: speculative on AtomWallet / ProgressiveCurve, but on TrustBonding Gemini produced two non-trivial Highs (zero-balance forfeiture + JIT-snapshot extraction) that Claude didn't surface. Treat Gemini-only findings as high-variance — some noise, some genuine additional coverage.
- **Slither FPs** — the LLMs' FP annotations flag detectors that fired on library noise or on patterns the LLM considers safe, so the reviewer doesn't re-walk them. Don't trust them blindly on unusual code; do trust them on OZ/solady boilerplate.

### What NOT to trust the output for

This is the honest list, based on the Intuition benchmark and a handful of prior trial runs:

1. **Severity calibration.** Both LLMs under- and over-call severity in roughly equal measure. Use C4 / Secure3 rubrics yourself; treat the LLM severity as a rough ordinal.
2. **PoC code that compiles.** The LLMs can describe PoCs well, but their runnable-Foundry-test output is hit-or-miss. Assume you will write the PoC yourself.
3. **Cross-contract invariants.** Anything that requires reasoning about more than one file in the same model call tends to degrade. Passing the related files together in a single LLM call is a workaround for small contract suites, but you'll still see LLM attention drift at >3 files / >2000 LoC in a single prompt.
4. **MEV / sandwich / oracle manipulation.** The pipeline's signal on these classes is weak. Human-side pattern matching remains the right tool.
5. **Governance / cross-chain economic bugs.** Same as above.
6. **"The LLM said there are 0 findings, so the file is clean."** The prompt asks each LLM to list what was checked when there are zero findings. When that list is present, you can at least see whether the model reasoned about the right surface. When the list is missing, the output might be a truncation, a failure to follow the prompt, or a genuine skip — they all look the same from the outside. Don't treat an empty section as a green light without reading the checked-list, and when the checked-list itself is missing, rerun or bump `thinking` up.

## The per-run cost ledger

`llm_cost.json` has one row per LLM call with:

```
{
  "model": "claude-opus-4-7",
  "input_chars": 35218,
  "output_chars": 11410,
  "input_tokens_est": 8804,
  "output_tokens_est": 2852,
  "estimated_cost_usd": 0.167,
  "cached": false
}
```

The char-to-token heuristic is `chars / 4` — a conservative mid-point for English + Solidity. Cached calls set `estimated_cost_usd` to 0 so the per-rerun bill reflects only incremental work. The Intuition 5-file run's estimate came in at roughly a few USD total — order-of-magnitude $0.30/file rather than a well-measured per-file figure, and dominated by the Claude input cost.
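As a sketch of how those fields relate, the snippet below applies the `chars / 4` heuristic; the per-million-token prices are placeholders rather than real Claude or Gemini rates, so only the token estimates are meant to match the example row:

```python
CHARS_PER_TOKEN = 4.0  # ballpark; Solidity-heavy prompts run closer to 3.5 chars per token

def estimate_call(input_chars: int, output_chars: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> dict:
    """Token estimates via chars/4, cost from per-million-token prices."""
    in_tok = int(input_chars / CHARS_PER_TOKEN)
    out_tok = int(output_chars / CHARS_PER_TOKEN)
    cost = (in_tok * usd_per_mtok_in + out_tok * usd_per_mtok_out) / 1_000_000
    return {"input_tokens_est": in_tok, "output_tokens_est": out_tok,
            "estimated_cost_usd": round(cost, 3)}

# 35218 input chars -> 8804 tokens and 11410 output chars -> 2852 tokens,
# matching the row above; the dollar figure depends on the (placeholder) prices.
print(estimate_call(35218, 11410, usd_per_mtok_in=15.0, usd_per_mtok_out=75.0))
```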
## Tuning knobs you'll actually touch

- **`thinking=` level.** In `review_pipeline.py:llm_review` the LLM calls use `thinking="medium"`. For contest-critical files we've seen marginally better recall at `thinking="high"` at ~2× cost; below `"medium"` the output quality drops noticeably. Don't go below `"medium"` for an audit deliverable.
- **`CHARS_PER_TOKEN`.** Ballpark. The actual ratio is closer to 3.5 for Solidity-heavy prompts, so the 4.0 default slightly under-counts tokens and therefore slightly under-counts cost — treat the per-run figure as a floor rather than a ceiling.
- **Cache-key version bump.** The actual invalidation knob for a prompt-template change is the `"v":` integer inside `key_claude` / `key_gemini` (not the top-level `NAMESPACE` string, which has stayed `audit_review_v1` across template edits). Currently `"v": 3`; bump it if you edit `REVIEW_PROMPT_TEMPLATE`.
- **Slither timeout.** Default 300 s per file. For files that hit the detector set with thousands of library-noise findings you may need to raise this; the subsequent LLM call will also be slower.
- **Max LLM output tokens.** Currently 32 000 per call. Maxing out at 32 k usually means you're passing too much source in one shot — split the file and call twice, don't raise the max.

## Common gotchas

1. **Slither can't compile the project.** The pipeline runs Slither inside `project_root`; if your project has unusual remappings you may need to `cd` into `project_root` and run Slither by hand first to see the actual build error. Fix that, then rerun the pipeline.
2. **`solc-select` not active.** The venv-level `solc` might not match what Slither invokes. Run `solc-select use <version>` in the shell that's running the pipeline.
3. **File ends with `.sol` but Slither sees a test file.** The pipeline's `discover_sol_files` filters mocks and test files, but `--files` overrides that filter. Check your `--files` list if you get odd Slither output.
4. **Context longer than the prompt budget.** The template truncates source to 180 KB, Slither output to 40 KB, and context to 8 KB. Very long context strings will silently drop tail content — split into multiple runs if you need to preserve all of it.
5. **Cache hit on what should be a miss.** Check that the `(src, slither_out, context)` tuple actually changed. Whitespace edits count as changes; comment edits count as changes; re-saves with identical content do not invalidate the cache.
6. **Forgetting the `--context` flag.** You'll get noise. Add context even if it's just "This is a Uniswap V2 Pair fork; skip reentrancy on `swap`, it's addressed by the lock."

## When to use the pipeline vs. a full audit shop

This isn't a replacement for a thorough human audit. Specifically:

- **First-pass triage on a reasonably small codebase (≤ 20 files):** this is where we've seen the most value. Run it before you start reading the code; the structured output materially shortens the first read, though we haven't measured the time saving rigorously.
- **Second-pass after a manual review:** useful for flushing library-noise residuals and sanity-checking an "I think this file is clean" conclusion.
- **Large audit with deep economic / governance analysis:** the pipeline handles the mechanical-bug layer; it does not substitute for a human auditor doing invariant + economic analysis.
- **Time-boxed contest triage:** plausibly worth running before submission day — roughly $10 of LLM calls for a 30-file codebase, about an hour of wall-clock time, with a real chance of catching something the manual read missed. We have one-contest evidence for this, not a statistical claim.

## Paying per-call via the x402 endpoint

If you don't want to run the pipeline locally (e.g. no API keys, no Slither set up), the same pipeline is available as an x402 HTTP endpoint — pay USDC on Base or Base-Sepolia per review, receive the structured JSON back. See [`x402_mvp_status.md`](https://envs.net/~merovan/x402_mvp_status.md) for the current endpoint URL, price, and wire-format details.

## Caveats on this how-to

This guide is written after `n = 1` public contest (Intuition 2026-03) plus a handful of trial runs on smaller projects. The severity-calibration / convergent-finding / FP-annotation observations are therefore directional rather than statistically established. The pipeline's behavior will drift as Claude 4.7 / Gemini 3 Pro retire or get replaced; the `llm_cost.json` per-run entries record the exact model string, so historical runs stay interpretable after model rotation.

Feedback, bug reports, and patches welcome — the pipeline repo and accompanying writeups are all pinned on IPFS via Pinata, and the author publishes updates through envs.net userdir + twtxt + Nostr (see `https://envs.net/~merovan/` for the current landing page).