
WP-Bench

The official WordPress AI benchmark. Evaluate how well language models understand WordPress development—from core APIs and coding standards to plugin architecture and security best practices.

Overview

WP-Bench measures AI model capabilities across two dimensions:

  • Knowledge — Multiple-choice and short-answer questions testing WordPress concepts, APIs, and best practices
  • Execution — Code generation tasks graded by a real WordPress runtime for correctness and quality

The benchmark uses WordPress itself as the grader, running generated code in a sandboxed environment with static analysis and runtime assertions.

Quick Start

1. Install

python3 -m venv .venv && source .venv/bin/activate
pip install -e ./python

2. Configure API Keys

Create a .env file with your model provider API keys:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...

3. Start the WordPress Runtime

cd runtime
npm install
npm start

4. Run the Benchmark

cd ..
wp-bench run --config wp-bench.example.yaml

Results are written to output/results.json with per-test logs in output/results.jsonl.
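The per-test log is line-delimited JSON, so it is easy to aggregate with a few lines of Python. This is only a sketch: the field names used here (model, test_id, score) are assumptions for illustration, not the harness's documented schema — inspect your own output/results.jsonl to confirm the real keys.

```python
import json

# Sample records in the assumed shape of output/results.jsonl
# (one JSON object per line; field names are hypothetical).
sample_lines = [
    '{"model": "gpt-4o", "test_id": "hooks-001", "score": 1.0}',
    '{"model": "gpt-4o", "test_id": "db-003", "score": 0.5}',
]

totals: dict[str, float] = {}
counts: dict[str, int] = {}
for line in sample_lines:
    record = json.loads(line)
    model = record["model"]
    totals[model] = totals.get(model, 0.0) + record["score"]
    counts[model] = counts.get(model, 0) + 1

# Average score per model across all logged tests.
averages = {m: totals[m] / counts[m] for m in totals}
print(averages)  # {'gpt-4o': 0.75}
```

To run this against a real log, replace sample_lines with the lines of output/results.jsonl.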

Multi-Model Benchmarking

Compare multiple models in a single run by listing them in your config:

models:
  - name: gpt-4o
  - name: gpt-4o-mini
  - name: claude-sonnet-4-20250514
  - name: claude-opus-4-5-20251101
  - name: gemini/gemini-2.5-pro
  - name: gemini/gemini-2.5-flash

The harness runs each model sequentially and outputs a comparison table. Model names follow LiteLLM conventions.

Configuration

Copy wp-bench.example.yaml and customize:

dataset:
  source: local              # 'local' or 'huggingface'
  name: wp-core-v1           # suite name

models:
  - name: gpt-4o

grader:
  kind: docker
  wp_env_dir: ./runtime      # path to wp-env project

run:
  suite: wp-core-v1
  limit: 10                  # limit tests (null = all)
  concurrency: 4

output:
  path: output/results.json
  jsonl_path: output/results.jsonl

CLI Options

# Run from project root
wp-bench run --config wp-bench.yaml          # run with config file
wp-bench run --model-name gpt-4o --limit 5   # quick single-model test
wp-bench run --test-type knowledge           # run only knowledge tests (no WordPress env needed)
wp-bench run --test-type execution           # run only execution tests
wp-bench dry-run --config wp-bench.yaml      # validate config without calling models

Repository Structure

.
├── python/          # Benchmark harness (pip installable)
├── runtime/         # WordPress grader plugin + wp-env config
├── datasets/        # Test suites (local JSON + Hugging Face builder)
├── notebooks/       # Results visualization and reporting
└── output/          # Benchmark results (gitignored)

Test Suites

Test suites live in datasets/suites/<suite-name>/ with two directories per suite:

  • execution/ — Code generation tasks with assertions (one JSON file per category)
  • knowledge/ — Multiple-choice and short-answer knowledge questions (one JSON file per category)

The default suite wp-core-v1 covers WordPress core APIs, hooks, database operations, and security patterns.
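Given that layout, loading a local suite amounts to reading every JSON file under the two category directories. The sketch below assumes each file holds a JSON array of test objects; the per-test schema shown ("id", "prompt") is a hypothetical stand-in — check the files under datasets/suites/wp-core-v1/ for the actual shape.

```python
import json
import tempfile
from pathlib import Path

def load_suite(suite_dir) -> dict:
    """Collect all tests from <suite_dir>/knowledge/ and <suite_dir>/execution/."""
    suite = {"knowledge": [], "execution": []}
    for kind in suite:
        for path in sorted((Path(suite_dir) / kind).glob("*.json")):
            suite[kind].extend(json.loads(path.read_text()))
    return suite

# Demo against a throwaway suite directory with one knowledge file.
tmp = Path(tempfile.mkdtemp())
(tmp / "knowledge").mkdir()
(tmp / "execution").mkdir()
(tmp / "knowledge" / "hooks.json").write_text(
    json.dumps([{"id": "k-001", "prompt": "Which hook fires after a post is saved?"}])
)

suite = load_suite(tmp)
print(len(suite["knowledge"]))  # 1
```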

Loading from Hugging Face

dataset:
  source: huggingface
  name: WordPress/wp-bench-v1

Results & Reporting

After running benchmarks, visualize results with the included Jupyter notebook:

pip install jupyter pandas plotly
jupyter notebook notebooks/results_report.ipynb

The notebook generates:

  • Overall scores bar chart
  • Knowledge vs Correctness comparison
  • Radar chart for top models
  • Exportable HTML report

How Grading Works

  1. The harness sends a prompt to the model requesting WordPress code
  2. Generated code is sent to the WordPress runtime via WP-CLI
  3. The runtime performs static analysis (syntax, coding standards, security)
  4. Code executes in a sandbox with test assertions
  5. Results return as JSON with scores and detailed feedback
# Manual grading example (run from runtime/ directory)
npm run wp-bench -- verify --payload=$(echo '{"code":"<?php echo 1;"}' | base64)
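The payload above is just a base64-encoded JSON object. If you are scripting manual grading from Python, the encoding step can be sketched as follows — only the "code" key is taken from the example above; any other fields the verifier may accept are not documented here.

```python
import base64
import json

# Build the base64 payload for `npm run wp-bench -- verify --payload=...`.
code = '<?php echo 1;'
payload = base64.b64encode(json.dumps({"code": code}).encode()).decode()
print(payload)
```

Pass the printed string as the --payload argument; decoding it on the other side recovers the original JSON.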

Development

pip install -e "./python[dev]"  # install with dev dependencies (quoted so zsh doesn't glob the brackets)
ruff check python/              # lint
mypy python/                    # type check
pytest python/                  # test

License

GPL-2.0-or-later
