A durable AI agent built with Inngest and pi-ai that experiments with its own prompts over time. It runs a normal think/act/observe loop, scores responses after the fact, and uses scheduled evaluation jobs to create, test, and promote better behavioral prompts.
The interesting part is not just that the agent can rewrite prompts. It is that the first version learned to game its own scoring system. When the evaluation pipeline asked an LLM to improve an underperforming prompt, the model started embedding scoring criteria directly into the generated SOUL.md, turning the metric into the target.
This repo explores that self-learning loop and the guardrails needed to keep it useful.
Read the blog post about this project: https://www.inngest.com/blog/build-self-learning-agent
This project is a fork of Inngest’s Utah agent example, extended with response scoring, prompt versioning, and an automated evaluation pipeline.
Simple TypeScript that gives you `connect()`, no server needed:

```
Channel (e.g. Telegram) → Inngest Cloud (webhook + transform) → WebSocket → Local Worker → LLM (Anthropic/OpenAI/Google) → Reply Event → Channel API
```
The worker connects to Inngest Cloud via WebSocket. No public endpoint. No ngrok. No VPS. Messages flow through Inngest as events, and the agent processes them locally with full filesystem access.
```bash
git clone https://github.com/mitchellalderson/inngest-self-learning-agent
cd inngest-self-learning-agent
pnpm install
cp .env.example .env
```
Edit .env with your keys:
```bash
ANTHROPIC_API_KEY=sk-ant-...
INNGEST_EVENT_KEY=...
INNGEST_SIGNING_KEY=signkey-prod-...
```
Then add the environment variables for your channel(s) — see setup guides below.
Start the worker:
```bash
# Production mode (connects to Inngest Cloud via WebSocket)
pnpm run start

# Development mode (uses local Inngest dev server)
npx inngest-cli@latest dev &
pnpm run dev
```
On startup, the worker automatically sets up webhooks and transforms for each configured channel.
The agent supports multiple messaging channels. Each channel has its own setup guide.
```
src/
├── worker.ts                    # Entry point — connect() or serve()
├── client.ts                    # Inngest client
├── config.ts                    # Configuration from env vars
├── agent-loop.ts                # Core think → act → observe cycle
├── setup.ts                     # Channel setup orchestration
├── lib/
│   ├── llm.ts                   # pi-ai wrapper (multi-provider: Anthropic, OpenAI, Google)
│   ├── tools.ts                 # Tool definitions (TypeBox schemas) + execution
│   ├── context.ts               # System prompt builder with workspace file injection
│   ├── session.ts               # JSONL session persistence
│   ├── memory.ts                # File-based memory system (daily logs + distillation)
│   ├── prompt-version.ts        # Prompt versioning and A/B testing
│   ├── scoring.ts               # Response quality evaluation
│   ├── evaluation.ts            # Prompt performance analysis and improvement
│   ├── compaction.ts            # LLM-powered conversation summarization
│   └── logger.ts                # Structured logging utility
├── functions/
│   ├── message.ts               # Main agent function (singleton + cancelOn)
│   ├── send-reply.ts            # Channel-agnostic reply dispatch
│   ├── score.ts                 # Async response quality scoring
│   ├── acknowledge-message.ts   # Message acknowledgment (typing indicator, etc.)
│   ├── heartbeat.ts             # Cron-based memory maintenance
│   ├── evaluate-prompts.ts      # Cron-based prompt improvement
│   ├── sub-agent.ts             # Isolated sub-agent loops (sync/async task delegation)
│   └── failure-handler.ts       # Global error handler with notifications
└── channels/
    ├── types.ts                 # ChannelHandler interface
    ├── index.ts                 # Channel registry
    ├── setup-helpers.ts         # Inngest REST API helpers for webhook setup
    └── <channel-name>/          # A channel implementation (see README for setup)
        ├── handler.ts           # ChannelHandler implementation
        ├── api.ts               # API client
        ├── setup.ts             # Webhook setup automation
        ├── transform.ts         # Webhook transform
        └── format.ts            # Formatting for channel messages

workspace/                       # Agent workspace (persisted across runs)
├── IDENTITY.md                  # Agent name, role, emoji
├── SOUL.md                      # Agent personality and behavioral guidelines (fallback)
├── USER.md                      # User information
├── MEMORY.md                    # Long-term memory (agent-writable)
├── memory/                      # Daily logs (YYYY-MM-DD.md, auto-managed)
├── prompts/                     # Versioned prompts for A/B testing
│   ├── registry.json            # Version metadata
│   └── v1/SOUL.md               # Versioned behavioral prompts
├── scores/                      # Response quality scores (YYYY-MM-DD.jsonl)
└── sessions/                    # JSONL conversation files (gitignored)
```
The core is a while loop where each iteration is an Inngest step:
step.run("think") calls the LLM via pi-ai’s complete()step.run("tool-read")Inngest auto-indexes duplicate step IDs in loops (think:0, think:1, etc.), so you don’t need to track iteration numbers in step names.
One incoming message triggers multiple independent functions:
| Function | Purpose | Config |
|---|---|---|
| `agent-handle-message` | Run the agent loop | Singleton per chat, cancel on new message |
| `acknowledge-message` | Show "typing…" immediately | No retries (best effort) |
| `send-reply` | Format and send the response | 3 retries, channel dispatch |
| `agent-handle-score` | Evaluate response quality (async) | 1 retry, fires after reply sent |
| `agent-heartbeat` | Distill daily logs into long-term memory | Cron (every 30 min) |
| `evaluate-prompts` | Analyze scores, improve prompts | Cron (every 6 hours) |
| `agent-sub-agent` | Run isolated sub-agent for delegated tasks | 1 retry, sync or async |
| `global-failure-handler` | Catch errors, notify user | Triggered by `inngest/function.failed` |
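For illustration, here is roughly how the main function can be declared with the Inngest TypeScript SDK. The match key `data.chatId` and the exported `inngest` client are assumptions; the real configuration lives in src/functions/message.ts:

```ts
import { inngest } from "../client"; // assumes client.ts exports `inngest`

export const handleMessage = inngest.createFunction(
  {
    id: "agent-handle-message",
    // Singleton per chat: at most one run per chatId at a time.
    concurrency: [{ key: "event.data.chatId", limit: 1 }],
    // A newer message in the same chat cancels the in-flight run.
    cancelOn: [{ event: "agent.message.received", match: "data.chatId" }],
  },
  { event: "agent.message.received" },
  async ({ event, step }) => {
    // ... run the agent loop and emit a reply event for send-reply
  }
);
```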
The agent reads markdown files from the workspace directory and injects them into the system prompt:
| File | Purpose |
|---|---|
| `IDENTITY.md` | Agent name, role, and emoji |
| `SOUL.md` | Agent personality, behavioral guidelines, tone, boundaries |
| `USER.md` | Info about the user (name, timezone, preferences) |
| `MEMORY.md` | Curated long-term memory (agent-writable) |
Edit these files to customize your agent’s personality and knowledge. The agent can also update MEMORY.md using the write tool to remember things across conversations.
Note: SOUL.md supports versioning for A/B testing — see Prompt Versioning below.
The agent has a two-tier memory system:
- Daily logs (`workspace/memory/YYYY-MM-DD.md`) — append-only notes written via the `remember` tool during conversations
- Long-term memory (`workspace/MEMORY.md`) — curated summary distilled from daily logs by the heartbeat function

The `agent-heartbeat` function runs on a cron schedule (default: every 30 minutes). It checks if daily logs have accumulated enough content, then uses the LLM to distill them into MEMORY.md. Old daily logs are pruned after a configurable retention period (default: 30 days).
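A sketch of that distillation check, assuming a 2,000-character threshold and a `summarize()` helper (the real logic is in src/lib/memory.ts):

```ts
import { promises as fs } from "node:fs";
import path from "node:path";

// Hypothetical LLM summarization helper; see src/lib/llm.ts for the real wrapper.
declare function summarize(text: string): Promise<string>;

export async function maybeDistill(workspace: string): Promise<void> {
  const dir = path.join(workspace, "memory");
  const files = (await fs.readdir(dir)).filter((f) => f.endsWith(".md"));
  const logs = await Promise.all(
    files.map((f) => fs.readFile(path.join(dir, f), "utf8"))
  );
  const combined = logs.join("\n");

  // Only distill once the daily logs have accumulated enough content.
  if (combined.length < 2000) return;

  await fs.writeFile(path.join(workspace, "MEMORY.md"), await summarize(combined));
}
```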
After each agent response, a lightweight LLM evaluates response quality across four dimensions:
| Dimension | Scale | Description |
|---|---|---|
| Relevance | 0-10 | Did the response address the user’s question? |
| Completeness | 0-10 | Was anything important missing? |
| Tool Efficiency | 0-10 | Were tool calls necessary and well-targeted? |
| Tone Alignment | 0-10 | Did it match SOUL.md guidelines? |
Scores are persisted to workspace/scores/YYYY-MM-DD.jsonl as JSON lines:
```json
{
  "timestamp": "2026-03-12T...",
  "sessionKey": "main",
  "promptVersion": "v1",
  "relevance": 8,
  "completeness": 7,
  "toolEfficiency": 9,
  "tone": 8,
  "composite": 8.0,
  "rationale": "Addressed the core question..."
}
```
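In this example the composite is the plain mean of the four dimensions, (8 + 7 + 9 + 8) / 4 = 8.0. Assuming equal weighting (which matches the example; src/lib/scoring.ts may weight differently), the calculation is just:

```ts
// Unweighted mean of the four dimensions: (8 + 7 + 9 + 8) / 4 = 8.0.
type Score = { relevance: number; completeness: number; toolEfficiency: number; tone: number };

const composite = (s: Score): number =>
  (s.relevance + s.completeness + s.toolEfficiency + s.tone) / 4;
```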
How it works: once `send-reply` completes, the `agent-handle-score` function runs asynchronously, asks the scoring LLM to grade the exchange across the four dimensions above, and appends the result to the day's JSONL file.
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| `SCORING_ENABLED` | `true` | Enable/disable scoring |
| `SCORING_PROVIDER` | `anthropic` | Provider for scoring LLM |
| `SCORING_MODEL` | `claude-3-5-haiku-20241022` | Model for scoring |
The agent supports A/B testing of behavioral prompts through versioned SOUL.md files:
```
workspace/prompts/
├── registry.json        # Version metadata + active assignments
├── v1/
│   └── SOUL.md          # Baseline prompt
├── v2/
│   └── SOUL.md          # First variation
└── v3/
    └── SOUL.md          # Second variation
```
registry.json schema:
```json
{
  "versions": [
    {
      "id": "v1",
      "created": "2026-03-15T00:00:00Z",
      "source": "baseline",
      "active": true,
      "weight": 0.5
    },
    {
      "id": "v2",
      "created": "2026-03-16T00:00:00Z",
      "source": "evaluation-pipeline",
      "active": true,
      "weight": 0.5,
      "parentVersion": "v1"
    }
  ],
  "currentDefault": "v2"
}
```
How it works:

- Each session is assigned an active prompt version according to the weights in `registry.json` (see the sketch below)
- The assigned version's `SOUL.md` is injected into the system prompt
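A sketch of that selection, assuming a simple weighted random pick (the real logic lives in src/lib/prompt-version.ts):

```ts
type PromptVersion = { id: string; active: boolean; weight: number };

// Weighted random selection over active versions, e.g. two versions at
// weight 0.5 each split traffic 50/50.
export function pickVersion(versions: PromptVersion[]): PromptVersion {
  const active = versions.filter((v) => v.active);
  const total = active.reduce((sum, v) => sum + v.weight, 0);
  let roll = Math.random() * total;
  for (const v of active) {
    roll -= v.weight;
    if (roll <= 0) return v;
  }
  return active[active.length - 1]; // guard against floating-point drift
}
```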
Automatic initialization:

- If `registry.json` is missing, the system auto-creates a fresh v1 registry
- An existing `workspace/SOUL.md` is migrated to `workspace/prompts/v1/SOUL.md`
- If `SOUL.md` doesn't exist, a default prompt is created

Configuration:
| Variable | Default | Description |
|---|---|---|
| `PROMPT_VERSIONING_ENABLED` | `true` | Enable/disable prompt versioning |
The evaluation pipeline automatically analyzes scored responses and generates improved prompt versions. In production you can run it on whatever cadence gives the agent enough fresh data, such as nightly; by default this repo runs the evaluator every 6 hours.
The main lesson from the first runs was Goodhart’s Law in miniature: once the prompt generator saw enough performance data, it began producing prompts that mirrored the evaluation criteria instead of simply improving the agent’s behavior. The prompt-generation step now includes explicit output rules that prohibit scoring targets, metrics, and evaluation data from appearing in generated SOUL.md files.
For testing the evaluation pipeline, use a smaller model like Haiku instead of Sonnet. Sonnet produces consistently high-quality responses, making it difficult to generate the score variance needed to trigger prompt improvements:
```bash
# Switch to Haiku for testing
AGENT_MODEL=claude-3-5-haiku-20241022
```
Haiku is more likely to produce varied scores across different question types, which helps the evaluation pipeline identify underperforming prompt versions and generate meaningful improvements.
How it works:

A cron function (`evaluate-prompts`) runs on a configurable schedule (default: every 6 hours):

1. Loads recent scores from `workspace/scores/`
2. Aggregates composite scores for each active prompt version
3. When a version has enough data points (`EVAL_MIN_DATA_POINTS`) and trails the best performer by `EVAL_SIGNIFICANT_GAP` or more, asks the LLM to generate an improved prompt
4. Writes the new version to `workspace/prompts/vN/SOUL.md` and activates it in the registry

Weight redistribution:

- A new version starts at `EVAL_NEW_VERSION_WEIGHT`; the remaining traffic is split among the existing active versions (see the worked example below)
- A version that holds at least `EVAL_PROMOTION_TRAFFIC` of traffic and leads by `EVAL_PROMOTION_SCORE_GAP` or more is promoted to the default

Safety rails:

- Versions are never rewritten before accumulating `EVAL_MIN_DATA_POINTS` scores
- At most `EVAL_MAX_VERSIONS` versions stay active before the weakest are retired
- Generated prompts may not embed scoring targets, metrics, or evaluation data (the Goodhart guardrail described above)
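As a worked example with the defaults below: introducing v3 at weight 0.5 while v1 and v2 each hold 0.5 leaves them at 0.25 apiece. A sketch of that redistribution (the real logic is in src/lib/evaluation.ts; the function shape is an assumption):

```ts
// Scale existing weights into the traffic left over after the new version
// takes EVAL_NEW_VERSION_WEIGHT. { v1: 0.5, v2: 0.5 } plus v3 at 0.5
// becomes { v3: 0.5, v1: 0.25, v2: 0.25 }.
function redistribute(
  existing: Record<string, number>,
  newId: string,
  newWeight = 0.5
): Record<string, number> {
  const remaining = 1 - newWeight;
  const total = Object.values(existing).reduce((a, b) => a + b, 0);
  const out: Record<string, number> = { [newId]: newWeight };
  for (const [id, w] of Object.entries(existing)) {
    out[id] = remaining * (w / total);
  }
  return out;
}
```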
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| `EVALUATION_CRON` | `0 */6 * * *` | Cron schedule for evaluation runs |
| `EVAL_MIN_DATA_POINTS` | `10` | Minimum scores before version can be rewritten |
| `EVAL_TARGET_COMPOSITE` | `7.0` | Target composite score threshold |
| `EVAL_MAX_VERSIONS` | `5` | Maximum active versions before retirement |
| `EVAL_NEW_VERSION_WEIGHT` | `0.5` | Initial weight for new versions |
| `EVAL_PROMOTION_TRAFFIC` | `0.8` | Traffic share required for promotion |
| `EVAL_PROMOTION_SCORE_GAP` | `1.0` | Score advantage required for promotion |
| `EVAL_SIGNIFICANT_GAP` | `1.0` | Points below best to trigger rewrite |
Long conversations get summarized automatically so the agent doesn't lose context or hit token limits.
Compaction runs as an Inngest step (step.run("compact")), so it’s durable and retryable.
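A sketch of how a compaction step can slot into the loop, assuming a message-count threshold and a `summarize()` helper (see src/lib/compaction.ts for the real logic):

```ts
type Msg = { role: string; content: string };

// Hypothetical LLM summarizer for older conversation turns.
declare function summarize(messages: Msg[]): Promise<string>;

export async function maybeCompact(step: any, messages: Msg[]): Promise<Msg[]> {
  if (messages.length < 40) return messages; // assumed threshold

  // Summarize everything but the last 10 messages inside a durable step.
  const summary = await step.run("compact", () => summarize(messages.slice(0, -10)));

  return [{ role: "system", content: `Conversation so far: ${summary}` }, ...messages.slice(-10)];
}
```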
Long tool results bloat the conversation context and cause the LLM to lose focus, so the agent prunes them in two tiers (sketched below).
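A sketch of what the two tiers can look like; the boundaries here (a length cap for recent results, a stub for older ones) are assumptions rather than the repo's exact rules:

```ts
type Msg = { role: string; content: string };

// Tier 1: recent tool results are truncated to a length cap (soft).
// Tier 2: older tool results are collapsed to a placeholder (hard).
function pruneToolResults(messages: Msg[], recentWindow = 6, cap = 4000): Msg[] {
  return messages.map((m, i) => {
    if (m.role !== "tool") return m;
    if (i >= messages.length - recentWindow) {
      return m.content.length > cap
        ? { ...m, content: m.content.slice(0, cap) + "\n[truncated]" }
        : m;
    }
    return { ...m, content: "[tool result pruned]" };
  });
}
```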
The agent is channel-agnostic. Each channel implements a ChannelHandler interface (src/channels/types.ts) with methods for sending replies, acknowledging messages, and setup. Each channel directory follows the same structure:
```
src/channels/<name>/
├── handler.ts      # ChannelHandler implementation (sendReply, acknowledge)
├── api.ts          # API client for the channel's platform
├── setup.ts        # Webhook setup automation
├── transform.ts    # Plain JS transform for Inngest webhook
└── format.ts       # Markdown → channel-specific format conversion
```
To add Discord, WhatsApp, or any other channel:
1. Create a new directory in `src/channels/` following the structure above
2. Implement the `ChannelHandler` interface in `handler.ts`
3. Write a webhook transform that emits `agent.message.received` events
4. Register the channel in `src/channels/index.ts`

The agent loop, reply dispatch, and acknowledgment functions are all channel-agnostic — no changes needed outside `src/channels/`.
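For reference, a sketch of what the interface might look like; the real one is in src/channels/types.ts, and the method signatures here are assumptions based on the responsibilities described above:

```ts
export interface ChannelHandler {
  /** Channel identifier, e.g. "telegram" */
  name: string;
  /** Deliver the agent's formatted reply to the channel */
  sendReply(chatId: string, text: string): Promise<void>;
  /** Best-effort acknowledgment, e.g. a typing indicator */
  acknowledge(chatId: string): Promise<void>;
  /** Create webhooks and transforms on startup */
  setup(): Promise<void>;
}
```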
Deployment uses `connect()` — a WebSocket-based worker.

This project uses pi-ai (`@mariozechner/pi-ai`) by Mario Zechner for its unified LLM interface and `@mariozechner/pi-coding-agent` for its standard tools. pi-ai provides a single `complete()` function that works across Anthropic, OpenAI, Google, and other providers — making it easy to swap models without changing any agent code. It's a great library.
Apache-2.0