Verifiable benchmarks for local AI inference hardware.
The initial editorial archive shipped in May 2026. Each run is a canonical Ed25519-signed BenchmarkRun JSON that any third party can independently verify against the maintainer's published public key.
Silicon Logic is an independent publication where every published benchmark is cryptographically signed, every prompt is human-authored, and every methodology decision is public. Readers don't have to trust the chart — they can verify the run. The maintainer's public key is published in the repository, and every benchmark in the editorial archive can be independently verified against it using a single command.
The archive starts with signed evidence.
The first milestone is not a chart. It is a verifiable substrate: canonical run artifacts, human-authored prompts, structural independence, and a public verification path.
Every prompt in the v1 benchmark suite is written by the editor. No prompts curated from MT-Bench, HELM, or other published suites. No AI-generated test content. Every prompt is accountable to a named human author.
No vendor sponsorship dollars. No hardware manufacturer pays for placement. No software vendor influences methodology. Independence is structural, not asserted.
Numbers are everywhere. Defensible numbers are rare.
Local AI inference hardware is increasingly important, increasingly contested, and surprisingly under-served by rigorous editorial coverage. Hardware vendors publish competing benchmark claims for the same silicon. Standard benchmark suites get marketed around. Community projects surface useful but unauditable numbers. YouTubers run impressive-looking tests with undocumented methodology.
The result: readers who care about local LLM performance — researchers, engineers building on consumer GPUs, Apple Silicon developers, anyone choosing hardware for inference workloads — can find plenty of numbers. What's missing is numbers they can defend.
Two tracks, two methodologies, zero composite AI scores.
Silicon Logic publishes in two tracks. The separation is the point: hardware performance and model behavior answer different questions.
Hardware Performance
Hardware Performance covers consumer GPUs, Apple Silicon, and AI accelerators with cryptographically signed performance metrics on a weekly editorial cadence.
Model Quality
Model Quality covers in-depth model reviews at a slower cadence (every 6-8 weeks), evaluating model behavior on tasks with separate methodology from hardware benchmarks.
The two tracks have different methodologies and never blur into composite "AI scores." Hardware performance measures tokens per second, time to first token, latency percentiles, memory pressure, and energy efficiency. Model quality evaluates behavior. Readers get the right measurement for their question instead of a marketing-friendly conflation.
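Those Track 1 quantities map to a concrete per-run record. A minimal sketch of what the hardware-performance fields might look like, with illustrative names rather than the published schema:

```python
# Illustrative Track 1 (hardware performance) metric fields.
# Field names are assumptions, not the published schema.
from dataclasses import dataclass

@dataclass
class HardwareMetrics:
    tokens_per_second: float        # generation throughput
    time_to_first_token_ms: float   # prompt-processing latency
    latency_p50_ms: float           # latency percentiles across trials
    latency_p95_ms: float
    peak_memory_gb: float           # memory pressure during generation
    energy_per_token_joules: float  # energy efficiency
```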
Eight dimensions, held together.
Eight dimensions distinguish the publication. No single dimension is unique — Phoronix is independent, MLPerf has methodology transparency, Procyon ships reproducible benchmarks. The defensible position is the combination:
AI hardware benchmarking focus — local inference is the editorial domain, not a side beat
Cryptographic provenance — every run signed with Ed25519, public key committed to the repository, signatures verify mathematically
Editorial accountability — a named human editor with a verifiable public track record, not a brand or institution
Methodology transparency — every methodology decision documented as a versioned, reader-facing artifact
Independence — zero vendor sponsorship, zero placement deals, structural rather than asserted
Reproducible methodology — every signed run includes the harness version, prompt suite version, and execution parameters needed to re-run the benchmark
Local-inference focus — Apple Silicon, consumer GPUs, and the hardware readers actually run, not datacenter benchmarks
Track separation — hardware performance and model quality measured separately, never collapsed into a single ranking
None of the competitors investigated combines all eight. That combination is Silicon Logic's editorial position.
Every benchmark execution flows through the same pipeline.
The pipeline is designed to make editorial claims traceable from model server to canonical artifact to cryptographic signature.
Harness
The harness runs the model server — llama-server for GGUF models, mlx_lm.server for Apple Silicon MLX models. It executes the prompt with one warmup trial (discarded) and five counted trials, captures timing via Python's perf_counter for sub-millisecond precision, and emits a BenchmarkTimings record with median aggregation across the counted trials.
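The trial loop is the core of the harness. A minimal sketch, assuming a hypothetical run_prompt() callable that issues one blocking request to the model server (the real harness captures richer BenchmarkTimings fields than shown here):

```python
# Sketch of the warmup-plus-counted-trials loop described above.
# run_prompt() is a hypothetical stand-in for the model-server request.
import statistics
import time

WARMUP_TRIALS = 1
COUNTED_TRIALS = 5

def time_trial(run_prompt, prompt: str) -> float:
    """Time one generation request with sub-millisecond resolution."""
    start = time.perf_counter()
    run_prompt(prompt)
    return time.perf_counter() - start

def benchmark_prompt(run_prompt, prompt: str) -> dict:
    # Warmup trial: executed but discarded, so caches and kernels are primed.
    for _ in range(WARMUP_TRIALS):
        time_trial(run_prompt, prompt)

    # Counted trials: recorded, then aggregated with the median.
    durations = [time_trial(run_prompt, prompt) for _ in range(COUNTED_TRIALS)]
    return {
        "trial_seconds": durations,
        "median_seconds": statistics.median(durations),
    }
```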
Mapper
The mapper converts harness output into a canonical BenchmarkRun — the published form, schema-validated across Python (Pydantic), TypeScript (Zod), and Postgres (Drizzle). Schema synchronization across the three layers is verified by a CI check on every commit.
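For illustration, a hedged sketch of the canonical BenchmarkRun envelope in Pydantic v2; the field names are assumptions rather than the published schema, and the real model is mirrored in Zod and Drizzle:

```python
# Assumed shape of the canonical BenchmarkRun; illustrative field names only.
from pydantic import BaseModel

class BenchmarkRun(BaseModel):
    run_id: str
    model_name: str                 # e.g. "Llama 3.2 1B Instruct"
    quantization: str               # e.g. "Q4_K_M GGUF" or "4-bit MLX"
    hardware: str                   # e.g. "MacBook Pro M5 Max 36GB"
    harness_version: str            # provenance needed to re-run the benchmark
    prompt_suite_version: str
    execution_parameters: dict[str, str]
    median_tokens_per_second: float
    median_time_to_first_token_ms: float
    signature: str | None = None    # attached by the signer, hex-encoded
    public_key: str | None = None   # the signer's Ed25519 public key
```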
Signer
The signer takes the canonical JSON, computes its canonical-bytes representation, signs with the maintainer's Ed25519 private key (loaded at runtime from 1Password, never on disk), and emits a signed BenchmarkRun. The signed artifact is then re-read from disk and re-verified against the public key before being accepted — defense in depth against serialization corruption.
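A minimal sketch of the sign-then-verify step using the Python cryptography library's Ed25519 primitives. The sorted-key JSON canonicalization and in-memory re-verification shown here are simplifications of the documented procedure, which re-reads the artifact from disk:

```python
# Sketch: canonicalize, sign with Ed25519, then verify before accepting.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical_bytes(run: dict) -> bytes:
    # Assumed canonicalization: sorted keys, no whitespace.
    return json.dumps(run, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_run(run: dict, private_key: Ed25519PrivateKey) -> dict:
    payload = canonical_bytes(run)
    signature = private_key.sign(payload)

    # Defense in depth: verify against the public key before accepting the artifact.
    private_key.public_key().verify(signature, payload)  # raises InvalidSignature on failure

    return dict(run, signature=signature.hex())
```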
Verifying any published run takes one command. The repository contains the public key, the verification function, and the documented procedure. Either the math says the run is valid, or it doesn't. There is no "trust us."
Phase 1.4 is publishable.
The Phase 1.4 publishable milestone shipped in May 2026. Twenty pull requests, merged across two days of focused execution, build the editorial substrate end-to-end: schemas, harness lifecycle, multi-trial orchestration, mapper, signing pipeline, CLI, and the first ten signed sample runs.
The first ten runs in the editorial archive benchmark Llama 3.2 1B Instruct on a MacBook Pro M5 Max across five prompts (conversational, code, reasoning, long-context, and a second reasoning task) in two quantizations: Q4_K_M GGUF via llama.cpp, and 4-bit MLX via mlx_lm.
Every one of those ten runs verifies against the maintainer's published public key. The verification procedure works today. Any reader can clone the repository and prove the numbers came from the maintainer's key. That's the substrate.
Built as a publication substrate, not a one-off benchmark script.
The system spans web, backend, schemas, inference runtimes, signing, and distribution so the editorial archive can be verified by humans and programs.
- Languages: TypeScript (frontend/MCP), Python 3.12 (backend)
- Monorepo: Turborepo + pnpm + uv
- Database: Postgres 17 on Neon (us-east-1)
- Schemas: Drizzle ORM (SQL), Pydantic v2 (Python), Zod (TypeScript), synchronized via CI
- Frontend: Next.js 15 with Tailwind 4
- Inference runtimes: llama.cpp (GGUF), mlx_lm (Apple Silicon MLX)
- Signing: Ed25519 via the Python cryptography library
- Distribution: open repository on GitHub, planned MCP server for programmatic access
- Reference hardware: MacBook Pro M5 Max 36GB (Apple Silicon), Windows + RTX 5080 (planned)
The editorial operating surface.
Silicon Logic's public promise is narrow enough to verify: signed benchmark runs, explicit methodology versioning, defined trials, median aggregation, and track-specific editorial cadence.
How to verify any signed run
Every published BenchmarkRun in the Silicon Logic archive includes an Ed25519 signature and the signer's public key. The repository commits the maintainer's public key and its SHA-256 fingerprint, allowing any third party to verify a published benchmark independently. The full verification procedure is documented in the repository at data/runs/README.md.
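In sketch form, independent verification loads the committed public key, recomputes the canonical bytes, and checks the Ed25519 signature. The file layout, field names, and canonicalization below are assumptions; the authoritative procedure is the one documented in data/runs/README.md:

```python
# Hedged sketch of third-party verification of a signed BenchmarkRun.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_run(run_path: str, public_key_hex: str) -> bool:
    with open(run_path) as f:
        run = json.load(f)

    # Assumed layout: signer-attached fields are removed before recomputing
    # the canonical bytes the signature was computed over.
    signature = bytes.fromhex(run.pop("signature"))
    run.pop("public_key", None)
    payload = json.dumps(run, sort_keys=True, separators=(",", ":")).encode("utf-8")

    public_key = Ed25519PublicKey.from_public_bytes(bytes.fromhex(public_key_hex))
    try:
        public_key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False
```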
View the signing module at github.com/Vargix/silicon-logic. For verification questions, methodology disputes, or hardware coverage suggestions, see contact below.
Want to follow Silicon Logic's launch?
Silicon Logic launches publicly in Phase 1.6 with the first editorial Track 1 article. For early access, methodology questions, or technical collaboration on the publication substrate, reach out below.