A durable AI agent built with Inngest and pi-ai that experiments with its own prompts over time. It runs a normal think/act/observe loop, scores responses after the fact, and uses scheduled evaluation jobs to create, test, and promote better behavioral prompts.
The interesting part is not just that the agent can rewrite prompts. It is that the first version learned to game its own scoring system. When the evaluation pipeline asked an LLM to improve an underperforming prompt, the model started embedding scoring criteria directly into the generated SOUL.md, turning the metric into the target.
This repo explores that self-learning loop and the guardrails needed to keep it useful.
Read the blog post about this project: https://www.inngest.com/blog/build-self-learning-agent
This project is a fork of Inngest’s Utah agent example, extended with response scoring, prompt versioning, and an automated evaluation pipeline.
Simple TypeScript that gives you `connect()`, no server needed:

```
Channel (e.g. Telegram) → Inngest Cloud (webhook + transform) → WebSocket → Local Worker → LLM (Anthropic/OpenAI/Google) → Reply Event → Channel API
```
The worker connects to Inngest Cloud via WebSocket. No public endpoint. No ngrok. No VPS. Messages flow through Inngest as events, and the agent processes them locally with full filesystem access.
```bash
git clone https://github.com/mitchellalderson/inngest-self-learning-agent
cd inngest-self-learning-agent
pnpm install
cp .env.example .env
```
Edit .env with your keys:
```bash
ANTHROPIC_API_KEY=sk-ant-...
INNGEST_EVENT_KEY=...
INNGEST_SIGNING_KEY=signkey-prod-...
```
Then add the environment variables for your channel(s) — see setup guides below.
Start the worker:
```bash
# Production mode (connects to Inngest Cloud via WebSocket)
pnpm run start

# Development mode (uses local Inngest dev server)
npx inngest-cli@latest dev &
pnpm run dev
```
On startup, the worker automatically sets up webhooks and transforms for each configured channel.
The agent supports multiple messaging channels. Each channel has its own setup guide.
```
src/
├── worker.ts                    # Entry point — connect() or serve()
├── client.ts                    # Inngest client
├── config.ts                    # Configuration from env vars
├── agent-loop.ts                # Core think → act → observe cycle
├── setup.ts                     # Channel setup orchestration
├── lib/
│   ├── llm.ts                   # pi-ai wrapper (multi-provider: Anthropic, OpenAI, Google)
│   ├── tools.ts                 # Tool definitions (TypeBox schemas) + execution
│   ├── context.ts               # System prompt builder with workspace file injection
│   ├── session.ts               # JSONL session persistence
│   ├── memory.ts                # File-based memory system (daily logs + distillation)
│   ├── prompt-version.ts        # Prompt versioning and A/B testing
│   ├── scoring.ts               # Response quality evaluation
│   ├── evaluation.ts            # Prompt performance analysis and improvement
│   ├── compaction.ts            # LLM-powered conversation summarization
│   └── logger.ts                # Structured logging utility
├── functions/
│   ├── message.ts               # Main agent function (singleton + cancelOn)
│   ├── send-reply.ts            # Channel-agnostic reply dispatch
│   ├── score.ts                 # Async response quality scoring
│   ├── acknowledge-message.ts   # Message acknowledgment (typing indicator, etc.)
│   ├── heartbeat.ts             # Cron-based memory maintenance
│   ├── evaluate-prompts.ts      # Cron-based prompt improvement
│   ├── sub-agent.ts             # Isolated sub-agent loops (sync/async task delegation)
│   └── failure-handler.ts       # Global error handler with notifications
└── channels/
    ├── types.ts                 # ChannelHandler interface
    ├── index.ts                 # Channel registry
    ├── setup-helpers.ts         # Inngest REST API helpers for webhook setup
    └── <channel-name>/          # A channel implementation (see README for setup)
        ├── handler.ts           # ChannelHandler implementation
        ├── api.ts               # API client
        ├── setup.ts             # Webhook setup automation
        ├── transform.ts         # Webhook transform
        └── format.ts            # Formatting for channel messages

workspace/                       # Agent workspace (persisted across runs)
├── IDENTITY.md                  # Agent name, role, emoji
├── SOUL.md                      # Agent personality and behavioral guidelines (fallback)
├── USER.md                      # User information
├── MEMORY.md                    # Long-term memory (agent-writable)
├── memory/                      # Daily logs (YYYY-MM-DD.md, auto-managed)
├── prompts/                     # Versioned prompts for A/B testing
│   ├── registry.json            # Version metadata
│   └── v1/SOUL.md               # Versioned behavioral prompts
├── scores/                      # Response quality scores (YYYY-MM-DD.jsonl)
└── sessions/                    # JSONL conversation files (gitignored)
```
The core is a while loop where each iteration is an Inngest step:
step.run("think") calls the LLM via pi-ai’s complete()step.run("tool-read")Inngest auto-indexes duplicate step IDs in loops (think:0, think:1, etc.), so you don’t need to track iteration numbers in step names.
One incoming message triggers multiple independent functions:
| Function | Purpose | Config |
|---|---|---|
| `agent-handle-message` | Run the agent loop | Singleton per chat, cancel on new message |
| `acknowledge-message` | Show "typing…" immediately | No retries (best effort) |
| `send-reply` | Format and send the response | 3 retries, channel dispatch |
| `agent-handle-score` | Evaluate response quality (async) | 1 retry, fires after reply sent |
| `agent-heartbeat` | Distill daily logs into long-term memory | Cron (every 30 min) |
| `evaluate-prompts` | Analyze scores, improve prompts | Cron (every 6 hours) |
| `agent-sub-agent` | Run isolated sub-agent for delegated tasks | 1 retry, sync or async |
| `global-failure-handler` | Catch errors, notify user | Triggered by `inngest/function.failed` |
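For illustration, here is roughly how the main function can be declared with the Inngest TypeScript SDK. The match key `data.chatId` and the exported `inngest` client are assumptions; the real configuration lives in src/functions/message.ts:

```ts
import { inngest } from "../client"; // assumes client.ts exports `inngest`

export const handleMessage = inngest.createFunction(
  {
    id: "agent-handle-message",
    // Singleton per chat: at most one run per chatId at a time.
    concurrency: [{ key: "event.data.chatId", limit: 1 }],
    // A newer message in the same chat cancels the in-flight run.
    cancelOn: [{ event: "agent.message.received", match: "data.chatId" }],
  },
  { event: "agent.message.received" },
  async ({ event, step }) => {
    // ... run the agent loop and emit a reply event for send-reply
  }
);
```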
The agent reads markdown files from the workspace directory and injects them into the system prompt:
| File | Purpose |
|---|---|
| `IDENTITY.md` | Agent name, role, and emoji |
| `SOUL.md` | Agent personality, behavioral guidelines, tone, boundaries |
| `USER.md` | Info about the user (name, timezone, preferences) |
| `MEMORY.md` | Curated long-term memory (agent-writable) |
Edit these files to customize your agent’s personality and knowledge. The agent can also update MEMORY.md using the write tool to remember things across conversations.
Note: SOUL.md supports versioning for A/B testing — see Prompt Versioning below.
The agent has a two-tier memory system:
- Daily logs (`workspace/memory/YYYY-MM-DD.md`) — append-only notes written via the `remember` tool during conversations
- Long-term memory (`workspace/MEMORY.md`) — curated summary distilled from daily logs by the heartbeat function

The `agent-heartbeat` function runs on a cron schedule (default: every 30 minutes). It checks if daily logs have accumulated enough content, then uses the LLM to distill them into MEMORY.md. Old daily logs are pruned after a configurable retention period (default: 30 days).
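A sketch of that distillation check, assuming a 2,000-character threshold and a `summarize()` helper (the real logic is in src/lib/memory.ts):

```ts
import { promises as fs } from "node:fs";
import path from "node:path";

// Hypothetical LLM summarization helper; see src/lib/llm.ts for the real wrapper.
declare function summarize(text: string): Promise<string>;

export async function maybeDistill(workspace: string): Promise<void> {
  const dir = path.join(workspace, "memory");
  const files = (await fs.readdir(dir)).filter((f) => f.endsWith(".md"));
  const logs = await Promise.all(
    files.map((f) => fs.readFile(path.join(dir, f), "utf8"))
  );
  const combined = logs.join("\n");

  // Only distill once the daily logs have accumulated enough content.
  if (combined.length < 2000) return;

  await fs.writeFile(path.join(workspace, "MEMORY.md"), await summarize(combined));
}
```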
After each agent response, a lightweight LLM evaluates response quality across four dimensions:
| Dimension | Scale | Description |
|---|---|---|
| Relevance | 0-10 | Did the response address the user’s question? |
| Completeness | 0-10 | Was anything important missing? |
| Tool Efficiency | 0-10 | Were tool calls necessary and well-targeted? |
| Tone Alignment | 0-10 | Did it match SOUL.md guidelines? |
Scores are persisted to workspace/scores/YYYY-MM-DD.jsonl as JSON lines:
```json
{
  "timestamp": "2026-03-12T...",
  "sessionKey": "main",
  "promptVersion": "v1",
  "relevance": 8,
  "completeness": 7,
  "toolEfficiency": 9,
  "tone": 8,
  "composite": 8.0,
  "rationale": "Addressed the core question..."
}
```
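In this example the composite is the plain mean of the four dimensions, (8 + 7 + 9 + 8) / 4 = 8.0. Assuming equal weighting (which matches the example; src/lib/scoring.ts may weight differently), the calculation is just:

```ts
// Unweighted mean of the four dimensions: (8 + 7 + 9 + 8) / 4 = 8.0.
type Score = { relevance: number; completeness: number; toolEfficiency: number; tone: number };

const composite = (s: Score): number =>
  (s.relevance + s.completeness + s.toolEfficiency + s.tone) / 4;
```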
How it works: once `send-reply` completes, the `agent-handle-score` function runs asynchronously, asks the scoring LLM to grade the exchange across the four dimensions above, and appends the result to the day's JSONL file.
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| `SCORING_ENABLED` | `true` | Enable/disable scoring |
| `SCORING_PROVIDER` | `anthropic` | Provider for scoring LLM |
| `SCORING_MODEL` | `claude-3-5-haiku-20241022` | Model for scoring |
The agent supports A/B testing of behavioral prompts through versioned SOUL.md files:
```
workspace/prompts/
├── registry.json        # Version metadata + active assignments
├── v1/
│   └── SOUL.md          # Baseline prompt
├── v2/
│   └── SOUL.md          # First variation
└── v3/
    └── SOUL.md          # Second variation
```
registry.json schema:
```json
{
  "versions": [
    {
      "id": "v1",
      "created": "2026-03-15T00:00:00Z",
      "source": "baseline",
      "active": true,
      "weight": 0.5
    },
    {
      "id": "v2",
      "created": "2026-03-16T00:00:00Z",
      "source": "evaluation-pipeline",
      "active": true,
      "weight": 0.5,
      "parentVersion": "v1"
    }
  ],
  "currentDefault": "v2"
}
```
How it works:

- Each session is assigned an active prompt version according to the weights in `registry.json` (see the sketch below)
- The assigned version's `SOUL.md` is injected into the system prompt
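A sketch of that selection, assuming a simple weighted random pick (the real logic lives in src/lib/prompt-version.ts):

```ts
type PromptVersion = { id: string; active: boolean; weight: number };

// Weighted random selection over active versions, e.g. two versions at
// weight 0.5 each split traffic 50/50.
export function pickVersion(versions: PromptVersion[]): PromptVersion {
  const active = versions.filter((v) => v.active);
  const total = active.reduce((sum, v) => sum + v.weight, 0);
  let roll = Math.random() * total;
  for (const v of active) {
    roll -= v.weight;
    if (roll <= 0) return v;
  }
  return active[active.length - 1]; // guard against floating-point drift
}
```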
Automatic initialization:

- If `registry.json` is missing, the system auto-creates a fresh v1 registry
- An existing `workspace/SOUL.md` is migrated to `workspace/prompts/v1/SOUL.md`
- If `SOUL.md` doesn't exist, a default prompt is created

Configuration:
| Variable | Default | Description |
|---|---|---|
| `PROMPT_VERSIONING_ENABLED` | `true` | Enable/disable prompt versioning |
The evaluation pipeline automatically analyzes scored responses and generates improved prompt versions. In production you can run it on whatever cadence gives the agent enough fresh data, such as nightly; by default this repo runs the evaluator every 6 hours.
The main lesson from the first runs was Goodhart’s Law in miniature: once the prompt generator saw enough performance data, it began producing prompts that mirrored the evaluation criteria instead of simply improving the agent’s behavior. The prompt-generation step now includes explicit output rules that prohibit scoring targets, metrics, and evaluation data from appearing in generated SOUL.md files.
For testing the evaluation pipeline, use a smaller model like Haiku instead of Sonnet. Sonnet produces consistently high-quality responses, making it difficult to generate the score variance needed to trigger prompt improvements:
```bash
# Switch to Haiku for testing
AGENT_MODEL=claude-3-5-haiku-20241022
```
Haiku is more likely to produce varied scores across different question types, which helps the evaluation pipeline identify underperforming prompt versions and generate meaningful improvements.
How it works:

A cron function (`evaluate-prompts`) runs on a configurable schedule (default: every 6 hours):

1. Loads recent scores from `workspace/scores/`
2. Aggregates composite scores for each active prompt version
3. When a version has enough data points (`EVAL_MIN_DATA_POINTS`) and trails the best performer by `EVAL_SIGNIFICANT_GAP` or more, asks the LLM to generate an improved prompt
4. Writes the new version to `workspace/prompts/vN/SOUL.md` and activates it in the registry

Weight redistribution:

- A new version starts at `EVAL_NEW_VERSION_WEIGHT`; the remaining traffic is split among the existing active versions (see the worked example below)
- A version that holds at least `EVAL_PROMOTION_TRAFFIC` of traffic and leads by `EVAL_PROMOTION_SCORE_GAP` or more is promoted to the default

Safety rails:

- Versions are never rewritten before accumulating `EVAL_MIN_DATA_POINTS` scores
- At most `EVAL_MAX_VERSIONS` versions stay active before the weakest are retired
- Generated prompts may not embed scoring targets, metrics, or evaluation data (the Goodhart guardrail described above)
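As a worked example with the defaults below: introducing v3 at weight 0.5 while v1 and v2 each hold 0.5 leaves them at 0.25 apiece. A sketch of that redistribution (the real logic is in src/lib/evaluation.ts; the function shape is an assumption):

```ts
// Scale existing weights into the traffic left over after the new version
// takes EVAL_NEW_VERSION_WEIGHT. { v1: 0.5, v2: 0.5 } plus v3 at 0.5
// becomes { v3: 0.5, v1: 0.25, v2: 0.25 }.
function redistribute(
  existing: Record<string, number>,
  newId: string,
  newWeight = 0.5
): Record<string, number> {
  const remaining = 1 - newWeight;
  const total = Object.values(existing).reduce((a, b) => a + b, 0);
  const out: Record<string, number> = { [newId]: newWeight };
  for (const [id, w] of Object.entries(existing)) {
    out[id] = remaining * (w / total);
  }
  return out;
}
```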
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| `EVALUATION_CRON` | `0 */6 * * *` | Cron schedule for evaluation runs |
| `EVAL_MIN_DATA_POINTS` | `10` | Minimum scores before version can be rewritten |
| `EVAL_TARGET_COMPOSITE` | `7.0` | Target composite score threshold |
| `EVAL_MAX_VERSIONS` | `5` | Maximum active versions before retirement |
| `EVAL_NEW_VERSION_WEIGHT` | `0.5` | Initial weight for new versions |
| `EVAL_PROMOTION_TRAFFIC` | `0.8` | Traffic share required for promotion |
| `EVAL_PROMOTION_SCORE_GAP` | `1.0` | Score advantage required for promotion |
| `EVAL_SIGNIFICANT_GAP` | `1.0` | Points below best to trigger rewrite |
Long conversations get summarized automatically so the agent doesn't lose context or hit token limits.
Compaction runs as an Inngest step (step.run("compact")), so it’s durable and retryable.
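A sketch of how a compaction step can slot into the loop, assuming a message-count threshold and a `summarize()` helper (see src/lib/compaction.ts for the real logic):

```ts
type Msg = { role: string; content: string };

// Hypothetical LLM summarizer for older conversation turns.
declare function summarize(messages: Msg[]): Promise<string>;

export async function maybeCompact(step: any, messages: Msg[]): Promise<Msg[]> {
  if (messages.length < 40) return messages; // assumed threshold

  // Summarize everything but the last 10 messages inside a durable step.
  const summary = await step.run("compact", () => summarize(messages.slice(0, -10)));

  return [{ role: "system", content: `Conversation so far: ${summary}` }, ...messages.slice(-10)];
}
```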
Long tool results bloat the conversation context and cause the LLM to lose focus, so the agent prunes them in two tiers (sketched below).
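A sketch of what the two tiers can look like; the boundaries here (a length cap for recent results, a stub for older ones) are assumptions rather than the repo's exact rules:

```ts
type Msg = { role: string; content: string };

// Tier 1: recent tool results are truncated to a length cap (soft).
// Tier 2: older tool results are collapsed to a placeholder (hard).
function pruneToolResults(messages: Msg[], recentWindow = 6, cap = 4000): Msg[] {
  return messages.map((m, i) => {
    if (m.role !== "tool") return m;
    if (i >= messages.length - recentWindow) {
      return m.content.length > cap
        ? { ...m, content: m.content.slice(0, cap) + "\n[truncated]" }
        : m;
    }
    return { ...m, content: "[tool result pruned]" };
  });
}
```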
The agent is channel-agnostic. Each channel implements a ChannelHandler interface (src/channels/types.ts) with methods for sending replies, acknowledging messages, and setup. Each channel directory follows the same structure:
```
src/channels/<name>/
├── handler.ts      # ChannelHandler implementation (sendReply, acknowledge)
├── api.ts          # API client for the channel's platform
├── setup.ts        # Webhook setup automation
├── transform.ts    # Plain JS transform for Inngest webhook
└── format.ts       # Markdown → channel-specific format conversion
```
To add Discord, WhatsApp, or any other channel:
1. Create a new directory in `src/channels/` following the structure above
2. Implement the `ChannelHandler` interface in `handler.ts`
3. Write a webhook transform that emits `agent.message.received` events
4. Register the channel in `src/channels/index.ts`

The agent loop, reply dispatch, and acknowledgment functions are all channel-agnostic — no changes needed outside `src/channels/`.
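For reference, a sketch of what the interface might look like; the real one is in src/channels/types.ts, and the method signatures here are assumptions based on the responsibilities described above:

```ts
export interface ChannelHandler {
  /** Channel identifier, e.g. "telegram" */
  name: string;
  /** Deliver the agent's formatted reply to the channel */
  sendReply(chatId: string, text: string): Promise<void>;
  /** Best-effort acknowledgment, e.g. a typing indicator */
  acknowledge(chatId: string): Promise<void>;
  /** Create webhooks and transforms on startup */
  setup(): Promise<void>;
}
```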
Deployment uses `connect()` — a WebSocket-based worker.

This project uses pi-ai (`@mariozechner/pi-ai`) by Mario Zechner for its unified LLM interface and `@mariozechner/pi-coding-agent` for its standard tools. pi-ai provides a single `complete()` function that works across Anthropic, OpenAI, Google, and other providers — making it easy to swap models without changing any agent code. It's a great library.
Apache-2.0