Autonomous GPU Kernel Generation & Optimization via Deep Agents
KernelAgent turns PyTorch programs into verified Triton kernels and optimizes their performance. It was designed around KernelBench workloads and combines:
- GPU Kernel Synthesis blog post: PyTorch KernelFalcon
- GPU Kernel Optimization blog post: PyTorch KernelAgent
Every stage writes artifacts to a run directory under .fuse/<run_id>/, including the fused PyTorch code, subgraphs.json, individual KernelAgent sessions, and the final compose_out/composed_kernel.py.
Every stage writes artifacts to a run directory under .optimize/<run_id>/, including the input Triton kernel, artifacts, individual optimization worker sessions, and the final output/best_kernel.py.
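The run layouts above can be navigated programmatically. A minimal sketch, assuming only the directory structure described here (the helper name and default root are illustrative):

```python
from pathlib import Path

def latest_run(root=".fuse"):
    """Return the most recently modified run directory under `root`,
    following the .fuse/<run_id>/ layout described above."""
    root = Path(root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)

run = latest_run()
if run is not None:
    # Final composed kernel for a generation run.
    print("Composed kernel:", run / "compose_out" / "composed_kernel.py")
```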
Install Triton (pip install triton, or a nightly build from source), then install KernelAgent:

pip install -e .
Note: Intel XPU support requires the XPU build of PyTorch:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
Verify your XPU installation:
import torch
print(torch.xpu.is_available()) # Should print True
print(torch.xpu.device_count()) # Number of Intel GPUs
git clone https://github.com/ScalingIntelligence/KernelBench.git
Note: By default, the KernelAgent UI looks for KernelBench alongside KernelAgent (i.e., ../KernelBench).
You can export keys directly or use an .env file that the CLIs load automatically.
OPENAI_MODEL=gpt-5 # default model for extraction
NUM_KERNEL_SEEDS=4 # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10 # retry budget per worker
LOG_LEVEL=INFO # logging level
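These variables follow the usual environment-variable pattern; a sketch that reads them with the defaults documented above (the CLIs' actual parsing may differ):

```python
import os

# Fall back to the documented defaults when a variable is unset.
model = os.environ.get("OPENAI_MODEL", "gpt-5")                   # extraction model
num_seeds = int(os.environ.get("NUM_KERNEL_SEEDS", "4"))          # parallel workers per kernel
max_rounds = int(os.environ.get("MAX_REFINEMENT_ROUNDS", "10"))   # retry budget per worker
log_level = os.environ.get("LOG_LEVEL", "INFO")

print(model, num_seeds, max_rounds, log_level)
```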
KernelAgent currently supports OpenAI and Anthropic out-of-the-box. You can also use a custom OpenAI endpoint.
These can be configured in .env or via environment variables.
# OpenAI (models like `o4-mini`, `gpt-5`)
OPENAI_API_KEY=sk-...
# Anthropic (default; `claude-sonnet-4-20250514` is used when `OPENAI_MODEL` is unset)
ANTHROPIC_API_KEY=sk-ant-...
# Relay configuration for self-hosted gateways
LLM_RELAY_URL=http://127.0.0.1:11434
LLM_RELAY_TIMEOUT_S=120
More knobs live in triton_kernel_agent/agent.py and Fuser/config.py.
Auto-route a KernelBench problem — static analysis picks between the direct KernelAgent path and the full Fuser pipeline, with automatic fallback if the first attempt fails:
# --no-router-cache avoids caching or using cached results;
# --verify ensures the final composition test runs.
python -m Fuser.auto_agent \
  --problem /abs/path/to/KernelBench/level1/19_ReLU.py \
  --no-router-cache \
  --verify
Enable --no-router-cache to skip any cached router results and to prevent new results from being written to the cache.
Manually run the pipeline (extract → dispatch → compose) when you want explicit control over models or concurrency:
python -m Fuser.pipeline \
--problem /abs/path/to/problem.py \
--extract-model gpt-5 \
--dispatch-model o4-mini \
--dispatch-jobs auto \
--compose-model o4-mini \
--workers 4 \
--max-iters 5 \
--verify
# For Intel XPU
python -m Fuser.pipeline \
--problem /abs/path/to/problem.py \
--target-platform xpu \
--extract-model gpt-5 \
--dispatch-model o4-mini \
--dispatch-jobs auto \
--compose-model o4-mini \
--workers 4 \
--max-iters 5 \
--verify
--dispatch-jobs auto matches the number of discovered subgraphs; artifacts are placed under .fuse/<run_id>/.
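In other words, auto amounts to counting the entries in subgraphs.json. A sketch, assuming the file holds one JSON object per discovered subgraph (the helper name is illustrative):

```python
import json

def auto_dispatch_jobs(subgraphs_path):
    """Number of dispatch jobs when --dispatch-jobs is 'auto':
    one per discovered subgraph."""
    with open(subgraphs_path) as f:
        return len(json.load(f))
```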
Direct KernelAgent run — bypass Fuser and provide a plain-language problem description or a KernelBench snippet:
from triton_kernel_agent import TritonKernelAgent

agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5")
result = agent.generate_kernel(
    problem_description="Implement ReLU over a contiguous 1D tensor of length 1024"
)
if result["success"]:
    print("Kernel path:", result["kernel_path"])
    print("Session directory:", result["session_dir"])
else:
    print("Failure:", result["message"])
UIs — interactive runs with Gradio frontends:
- kernel-agent or python scripts/triton_ui.py
- fuser-ui or python scripts/fuser_ui
- pipeline-ui or python scripts/pipeline_ui

AutoRouter (Fuser/auto_agent.py): parses the problem’s AST, looks for attention blocks, transposed convolutions, control flow, and long op chains. It caches decisions under .fuse/router_cache.json and can fall back to the other path if the first attempt fails. Use --no-router-cache to ignore the existing cache and skip caching new routes. Use --ignore-router-config to ignore router-provided tuning and rely on CLI args.
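A toy version of that AST scan might look like the following. The hint set and helper name are illustrative, not the router's actual rules:

```python
import ast

# Call names that would suggest routing to the full Fuser pipeline.
FUSER_HINTS = {"conv_transpose2d", "scaled_dot_product_attention"}

def needs_fuser(source):
    """Return True if the problem's AST contains a call that hints at
    a fusable multi-op workload (toy heuristic)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", "")
            if name in FUSER_HINTS:
                return True
    return False
```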
Fuser Orchestrator (Fuser/orchestrator.py): rewrites the PyTorch module into fusable modules, executes them for validation, and packages a tarball of the fused code. Run IDs and directories are managed via Fuser/paths.py.
Subgraph Extractor (Fuser/subgraph_extractor.py): prompts the LLM to emit a JSON array describing each unique subgraph, including ops, shapes, dtypes, and parameter tensors. Entries are deduplicated by shape signature so the dispatcher can reuse kernels.
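To make that concrete, here is a hypothetical subgraphs.json entry and the idea behind shape-signature deduplication. The field names are illustrative; inspect a real run's subgraphs.json for the exact schema:

```python
import json

# Hypothetical entry shape; the real schema may differ.
entry = {
    "id": "subgraph_0",
    "ops": ["linear", "relu"],
    "input_shapes": [[64, 128], [128, 256]],
    "dtypes": ["float32", "float32"],
    "params": ["weight", "bias"],
}

def shape_signature(e):
    """Two subgraphs with the same signature can reuse one kernel."""
    return json.dumps(
        {"ops": e["ops"], "shapes": e["input_shapes"], "dtypes": e["dtypes"]},
        sort_keys=True,
    )

dup = dict(entry, id="subgraph_3")  # same ops/shapes/dtypes, different id
assert shape_signature(entry) == shape_signature(dup)
```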
Dispatcher (Fuser/dispatch_kernel_agent.py): converts each JSON item into a precise Triton generation spec, then spins up TritonKernelAgent processes in parallel. Each worker writes its own session directory with the candidate kernel, test harness, and verification logs.
TritonKernelAgent (triton_kernel_agent/): manages a pool of verification workers (worker.py, manager.py). Each worker iteratively asks an LLM for improvements, executes unit tests under sandboxed subprocesses (Fuser/runner.py), and enforces strict bans on PyTorch fallbacks. A run succeeds only when the test prints PASS (or the sentinel string) and exits with status 0.
Composer (Fuser/compose_end_to_end.py): stitches the verified kernels back into a single Triton program. The composed file contains one or more @triton.jit kernels plus a kernel_function(...) wrapper and a self-test that replays the original PyTorch problem. With --verify, the test is executed immediately and must succeed.
KernelAgent includes a hardware-guided optimization pipeline that iteratively improves a verified Triton kernel’s performance using GPU profiling feedback.
The loop runs for up to N rounds, with early termination when the kernel reaches roofline (≥95% SOL) or when performance converges.
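A minimal sketch of that termination logic. The 95% SOL cutoff comes from the text above; the convergence rule and tolerance here are illustrative assumptions:

```python
def should_stop(sol_fraction, timings_ms, sol_threshold=0.95, rel_tol=0.01):
    """Stop at roofline (>= 95% of speed-of-light) or when the last
    round's runtime improved by less than rel_tol (converged)."""
    if sol_fraction >= sol_threshold:
        return True
    if len(timings_ms) >= 2:
        prev, cur = timings_ms[-2], timings_ms[-1]
        if (prev - cur) / prev < rel_tol:
            return True
    return False
```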
python scripts/optimization_ui.py --port 8085
Optimize a kernel using beam search — parallel exploration with the top-N kernels and M bottleneck directions:
cd examples && python run_opt_manager.py \
--kernel-dir optimize_01_matvec/ \
--strategy beam_search \
--max-rounds 5
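The strategy can be sketched abstractly. Here kernels are stand-ins scored by a benchmark function; all names and defaults are illustrative:

```python
def beam_search(seed, expand, score, beam_width=2, directions=3, rounds=5):
    """Each round: expand every kernel in the beam along `directions`
    bottleneck-fix directions, then keep the `beam_width` best."""
    beam = [seed]
    for _ in range(rounds):
        candidates = list(beam)
        for kernel in beam:
            for d in range(directions):
                candidates.append(expand(kernel, d))
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)
```

In the real pipeline, expand would ask a worker to rewrite the kernel for one diagnosed bottleneck and score would benchmark the result.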
| Component | Location | Role |
|---|---|---|
| **OptimizationOrchestrator** | triton_kernel_agent/opt_worker_component/orchestrator/ | Main optimization loop |
| **KernelProfiler** | triton_kernel_agent/opt_worker_component/profiling/ | NCU hardware profiling |
| **BottleneckAnalyzer** | triton_kernel_agent/opt_worker_component/prescribing/ | LLM-based bottleneck diagnosis |
| **RooflineAnalyzer** | kernel_perf_agent/kernel_opt/roofline/ | SOL classification and early stopping |
| **Benchmark** | triton_kernel_agent/opt_worker_component/benchmarking/ | CUDA event timing |
.optimize/workers/<worker_id>/<run_id>/artifacts
kernel_round_0.py # baseline kernel
kernel_round_N.py # kernel after round N
round001_opt_prompt.txt # optimization prompt sent to LLM
round001_opt_reply.txt # LLM response
round001_strategy.json # bottleneck analysis result
...
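Given that naming scheme, the most recent kernel can be picked out programmatically (the helper name is illustrative):

```python
import re
from pathlib import Path

def latest_round_kernel(artifacts_dir):
    """Return the kernel_round_N.py file with the highest round number N."""
    best, best_n = None, -1
    for p in Path(artifacts_dir).glob("kernel_round_*.py"):
        m = re.fullmatch(r"kernel_round_(\d+)\.py", p.name)
        if m and int(m.group(1)) > best_n:
            best_n, best = int(m.group(1)), p
    return best
```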
KernelAgent supports multiple GPU platforms for Triton kernel execution:
| Platform | Device String | Flag | Status |
|---|---|---|---|
| NVIDIA CUDA | cuda | --target-platform cuda (default) | Fully supported |
| Intel XPU | xpu | --target-platform xpu | Supported |
When targeting Intel XPU, KernelAgent automatically uses device='xpu' for all tensor allocations.

# Check CUDA availability
import torch
print("CUDA available:", torch.cuda.is_available())

# Check XPU availability
print("XPU available:", hasattr(torch, 'xpu') and torch.xpu.is_available())
A successful pipeline run yields a structure similar to:
.fuse/<run_id>/
orchestrator/code.py.tgz # fused PyTorch refactor
subgraphs.json # shape-specialized subgraph descriptions
kernels_out/
<subgraph_id>/* # per-subgraph KernelAgent sessions
summary.json # success/failure per subgraph
compose_out/
composed_kernel.py # final Triton program + self-test
summary.json # composition metadata
These artifacts are designed for reproducibility: you can re-run a single kernel session, inspect prompts/responses, or feed composed_kernel.py directly into downstream tooling.
Looking for ready-to-browse kernel generation outputs? See the curated artifacts repo:
It includes selected L1/L2/L3 problems with:
Subgraph descriptions (subgraphs.json) and per-subgraph Triton kernels

Looking for ready-to-browse kernel optimization outputs? See the curated artifacts repo:
It includes selected L1 problems with:
- triton_kernel_agent/ — KernelAgent core (agent, worker manager, provider adapters, prompt templates)
- triton_kernel_agent/opt_worker_component/ — optimization pipeline (profiler, benchmarker, bottleneck analyzer, orchestrator)
- kernel_perf_agent/kernel_opt — roofline analysis, hardware specs, and benchmarking utilities
- Fuser/ — auto-router, orchestration pipeline, CLIs, Gradio UIs
- triton_kernel_agent/templates/ — Jinja templates used when prompting TritonKernelAgent
- examples/ — sample problems and prompt snippets
- tests/ — unit tests for agents and utilities
- e2e_test.py — example end-to-end kernel generation harness
- scripts/ — coverage/benchmark tooling, profiling helpers, CLI entry points (e.g., autoroute coverage runners, Triton UI)

For development, install the dev extras and run the tests:

pip install -e .[dev]
pytest -v

See CONTRIBUTING.md for contribution guidelines.

KernelAgent is released under the Apache License 2.0; see LICENSE.