The official WordPress AI benchmark. Evaluate how well language models understand WordPress development—from core APIs and coding standards to plugin architecture and security best practices.
WP-Bench measures AI model capabilities across two dimensions:

- Knowledge: multiple-choice and short-answer questions about WordPress APIs, conventions, and security practices
- Execution: code generation tasks whose output is run and graded inside a real WordPress environment
The benchmark uses WordPress itself as the grader, running generated code in a sandboxed environment with static analysis and runtime assertions.
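For a sense of what the grader consumes, here is an illustrative execution test case. The field names are hypothetical, invented for this sketch; the authoritative schema is whatever the JSON suite files under datasets/suites/ define:

{
  "id": "exec-options-001",
  "prompt": "Write a function that saves a sanitized tagline to the options table.",
  "assertions": [
    { "type": "static", "expect": "code calls update_option()" },
    { "type": "runtime", "expect": "the stored option matches the sanitized input" }
  ]
}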
python3 -m venv .venv && source .venv/bin/activate
pip install -e ./python
Create a .env file with your model provider API keys:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
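Whether the harness loads .env automatically may depend on your setup; if you need the keys exported into your shell (LiteLLM reads provider keys from environment variables), you can source the file:

# Export every variable defined in .env into the current shell
set -a && source .env && set +a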
cd runtime
npm install
npm start
cd ..
wp-bench run --config wp-bench.example.yaml
Results are written to output/results.json with per-test logs in output/results.jsonl.
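The JSONL log holds one JSON object per test, which makes quick ad-hoc analysis easy. A minimal sketch, assuming hypothetical "model" and "passed" fields (adjust to the record schema your wp-bench version actually emits):

# Tally pass/fail counts per model from the per-test JSONL log.
import json
from collections import Counter

counts = Counter()
with open("output/results.jsonl") as fh:
    for line in fh:
        record = json.loads(line)
        # "model" and "passed" are assumed field names; check your output.
        counts[(record["model"], record["passed"])] += 1

for (model, passed), n in sorted(counts.items()):
    print(f"{model}: {'pass' if passed else 'fail'} x {n}")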
Compare multiple models in a single run by listing them in your config:
models:
  - name: gpt-4o
  - name: gpt-4o-mini
  - name: claude-sonnet-4-20250514
  - name: claude-opus-4-5-20251101
  - name: gemini/gemini-2.5-pro
  - name: gemini/gemini-2.5-flash
The harness runs each model sequentially and outputs a comparison table. Model names follow LiteLLM conventions.
Copy wp-bench.example.yaml and customize:
dataset:
  source: local           # 'local' or 'huggingface'
  name: wp-core-v1        # suite name

models:
  - name: gpt-4o

grader:
  kind: docker
  wp_env_dir: ./runtime   # path to wp-env project

run:
  suite: wp-core-v1
  limit: 10               # limit tests (null = all)
  concurrency: 4

output:
  path: output/results.json
  jsonl_path: output/results.jsonl
# Run from project root
wp-bench run --config wp-bench.yaml # run with config file
wp-bench run --model-name gpt-4o --limit 5 # quick single-model test
wp-bench run --test-type knowledge # run only knowledge tests (no WordPress env needed)
wp-bench run --test-type execution # run only execution tests
wp-bench dry-run --config wp-bench.yaml # validate config without calling models
.
├── python/ # Benchmark harness (pip installable)
├── runtime/ # WordPress grader plugin + wp-env config
├── datasets/ # Test suites (local JSON + Hugging Face builder)
├── notebooks/ # Results visualization and reporting
└── output/ # Benchmark results (gitignored)
Test suites live in datasets/suites/<suite-name>/ with two directories per suite:
- execution/ — Code generation tasks with assertions (one JSON file per category)
- knowledge/ — Multiple-choice and short-answer knowledge questions (one JSON file per category)

The default suite wp-core-v1 covers WordPress core APIs, hooks, database operations, and security patterns.
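As a sketch of the knowledge side, a multiple-choice entry might look like the following; the field names are illustrative, and the actual format is defined by the suite JSON files:

{
  "id": "hooks-001",
  "question": "Which function registers a callback on an action hook?",
  "choices": ["add_action()", "do_action()", "apply_filters()", "add_filter()"],
  "answer": "add_action()"
}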
To load a suite from Hugging Face instead of local JSON, point the dataset source at the hosted repo:

dataset:
  source: huggingface
  name: WordPress/wp-bench-v1
After running benchmarks, visualize results with the included Jupyter notebook:
pip install jupyter pandas plotly
jupyter notebook notebooks/results_report.ipynb
The notebook produces summary tables and charts of per-model and per-category results.
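For a quick look without the notebook, the same files load directly with pandas. A minimal sketch, assuming results.json holds a list or dict of per-test records (adjust the flattening to the actual structure):

import json
import pandas as pd

with open("output/results.json") as fh:
    results = json.load(fh)

# Flatten nested records into a table; tweak for the real layout.
df = pd.json_normalize(results)
print(df.head())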
# Manual grading example (run from runtime/ directory)
npm run wp-bench -- verify --payload=$(echo '{"code":"<?php echo 1;"}' | base64)
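Note that GNU coreutils base64 wraps its output at 76 characters, which can break longer payloads passed this way; add -w0 to disable wrapping (macOS/BSD base64 does not wrap by default):

npm run wp-bench -- verify --payload=$(echo '{"code":"<?php echo 1;"}' | base64 -w0)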
pip install -e "./python[dev]"   # install with dev dependencies (quotes keep zsh from globbing the brackets)
ruff check python/ # lint
mypy python/ # type check
pytest python/ # test
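To run all three checks in one go before opening a pull request:

ruff check python/ && mypy python/ && pytest python/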