A self-improving agent platform that uses Ninja Harness as its evaluation gate. Memory, skills, profiles, traces, and a propose-only improvement loop.
Website / docs: gagans23.github.io/agent-os
agent-os is the runtime layer: command routing, agent profiles, persistent
memory, a reusable skill library, full trace recording, and a propose-only
self-improvement loop. Ninja Harness is the evaluation/certification layer it
calls. Keeping them separate is deliberate: one runs agents, the other grades them.
flowchart LR
A["Command<br/>(WhatsApp / CLI)"] --> R[Router]
R --> K{Risk}
K -- read-only --> X[Execute]
K -- write/send/deploy --> Q["Approval<br/>/approve · /reject"]
Q --> X
X --> T[Trace] --> N[Ninja Harness] --> P["Propose<br/>improvement"] --> Z[Report]
Full diagrams & module map: docs/architecture.md ·
Modular roadmap (toward a personal agent OS): docs/roadmap.md
Status: v0.11, the governed swarm + hardware model advisor, on top of
Agent Skills compatibility, a one-command install + local web UI, model
onboarding (Ollama/OpenAI/Claude), the Brain, and a tamper-evident governance
spine. The three levels (Core · Reliability · Controlled Autonomy) work and are
tested. Live integrations (WhatsApp/Meta, Gmail, Cloudflare Tunnel, GitHub
publish) are pluggable adapters you wire with your own credentials; none are
bundled or faked.
agent-os is the orchestration + evaluation + controlled-autonomy + personal-brain
spine for your own agents. Its job isn't to be the biggest pile of connectors or
the flashiest chat UI; it's to make every agent action traced → scored by
Ninja Harness → risk-gated → improved,
running local-first (SQLite + stdlib) so a non-technical person can run it and
heavier tools can be plugged in behind the same governed spine.
One command (creates a local .venv, installs the eval gate + agent-os, no sudo):
curl -fsSL https://raw.githubusercontent.com/gagans23/agent-os/main/install.sh | bash
# or, from a clone: ./install.sh
pip install "ninja-harness @ git+https://github.com/gagans23/ninja-harness.git" # eval gate
pip install -e ".[dev]" # agent-os
No terminal? Double-click the launcher for your system in launchers/
(macOS agent-os.app · Windows agent-os-windows.bat · Linux agent-os-linux.sh).
It sets up a local environment on first run, then opens the UI in your browser;
you type nothing. (On macOS use agent-os.app; a bare .command often does
nothing on double-click because of a missing file association.) Full walkthrough
(incl. the one-click model setup): docs/no-terminal.md.
Prefer the terminal?
agent-os setup # guided steps to a working local model (prints commands; changes nothing)
agent-os setup --run # also pulls the model + remembers your choice (never installs Ollama for you)
agent-os ui # opens http://127.0.0.1:8765 (auto-picks a free port if busy)
A single local page (stdlib server: nothing extra to install, localhost-only)
to teach the brain, ask it questions, run tasks, swarm a goal, and approve actions,
driving the same governed command router as the CLI, so every action is still
traced, scored, audited, and risk-gated. The startup prints the exact URL; if the
port is taken it falls back to the next free one. See
docs/install-and-ui.md
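The "next free port" fallback can be sketched with the standard library alone. This is an illustrative sketch, not the agent-os implementation: `pick_port` and its `attempts` parameter are hypothetical names, and only the default port 8765 and the localhost-only binding come from the text above.

```python
import socket


def pick_port(preferred: int = 8765, host: str = "127.0.0.1", attempts: int = 20) -> int:
    """Return the first port at or above `preferred` that can be bound on `host`."""
    last = min(preferred + attempts, 65536)  # never probe past the valid port range
    for port in range(preferred, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is busy (or not permitted); try the next one
            return port
    raise RuntimeError(f"no free port in {preferred}..{last - 1}")
```

A probe like this is inherently racy (another process can take the port between the check and the real bind), so a server would typically still handle a bind failure at startup.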
No-terminal model setup. The first time you open the UI with no model
configured, it shows a setup card with a "Pull recommended model" button:
one click detects Ollama, pulls the right model for your machine, remembers the
choice, and reloads smart. No commands to type. (It never installs Ollama itself;
if it's missing, the card links the normal app installer. Same flow on the CLI:
agent-os setup --run.)
agent-os setup is the one-stop guided flow: it detects your machine, tells you
the exact steps to a working local model, and with --run pulls the model and
remembers your choice in ~/.agent-os/config.json (no shell-profile editing). It
never installs Ollama for you; that stays an explicit command you run, per
default-deny. The UI shows the same steps as a first-time empty state when no model
is configured. (agent-os doctor and /doctor give just the hardware → model advice.)
python examples/demo_run.py
# or
agent-os run "research the top Hacker News stories" --profile researcher
Example output:
Job complete.
Result: PASS
Ninja score: 94.3
Safety: PASS
Artifact: traces/<job_id>/final.md
The keystone of the personal-OS vision: agents that are self-aware of your
context. context.py is a local-first, dependency-light knowledge base:
SQLite + BM25-lite retrieval, standard library only, zero infrastructure.
Semantic search is a pluggable embedder you supply (Ollama/OpenAI/etc.); it is
never bundled and never a hidden network call.
# Ahaan's maths brain: teach it, then ask; answers are grounded in his own notes.
agent-os cmd "/learn ~/ahaan_maths_notes.md" # ingest a file
agent-os cmd "/learn To add fractions with the same denominator, add the numerators."
agent-os cmd "/ask how do I add fractions?"
Based on your notes:
[source: note] To add fractions with the same denominator, add the numerators...
[PASS · grounding 0.75 · Job a1b2c3d4]
/ask retrieves the top chunks, answers only from your context, and hands
those chunks to Ninja Harness as grounding references, so the answer is
scored against the source, and ungrounded answers get flagged. Upload notes,
files, or whole folders; it becomes the brain every agent retrieves from.
from agent_os.context import ContextStore
ctx = ContextStore() # or ContextStore(embedder=my_embedder)
ctx.ingest_file("ahaan_maths_notes.md")
print(ctx.build_context("how do I add fractions?")) # → grounded, source-tagged
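For readers curious what BM25-style keyword retrieval involves, here is a minimal standard-library sketch. It uses the textbook Okapi BM25 formula (with the non-negative IDF variant); the "BM25-lite" scoring in context.py may well differ, and `tokenize` and `bm25_rank` are hypothetical names, not part of the agent-os API.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[tuple[float, str]]:
    """Score each chunk against the query with Okapi BM25; highest score first."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))
    scored = []
    for chunk, doc in zip(chunks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, chunk))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A chunk that shares no term with the query scores exactly 0.0, which is one simple signal a caller could use to decide that nothing relevant was retrieved.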
Run the keystone demo, and read the deep dive:
python examples/ahaan_maths_demo.py # ingest notes → grounded, scored answers
Deep dive (chunking, BM25-lite, hybrid semantic search, the grounding/scoring
loop, the graph roadmap): docs/brain.md
agent-os ships no model and no keys. You plug in your own with one
environment variable, and it stays Ollama-first so a non-technical user
runs everything locally, for free, with no account:
export AGENT_OS_PROVIDER="ollama:llama3" # local + free, no key
export AGENT_OS_PROVIDER="openai:gpt-4o-mini" # needs OPENAI_API_KEY
export AGENT_OS_PROVIDER="anthropic:claude-3-5-sonnet-20241022" # needs ANTHROPIC_API_KEY
agent-os cmd "/model" # show what's wired
One small adapter (providers.py, standard-library HTTP, no SDK) powers three
roles at once: reasoner (the /ask answer, the /digest prose), embedder
(semantic search in the Brain, where /ask becomes hybrid keyword + meaning), and the
agent_fn behind /run. Any OpenAI-compatible endpoint (Together, vLLM, LM
Studio, Replit's proxy) works via openai:<model> + OPENAI_BASE_URL.
With nothing configured, agent-os stays in deterministic mode and makes no
model calls; every external call is opt-in and uses your credentials. No
bundled keys, no hidden network calls.
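To make the `<provider>:<model>` convention and the "nothing configured means deterministic mode" rule concrete, here is a small sketch. It is not the code in providers.py: `ProviderSpec` and `parse_provider_spec` are hypothetical names, and the only things taken from the text are the AGENT_OS_PROVIDER variable and the colon-separated format.

```python
import os
from typing import Mapping, NamedTuple, Optional


class ProviderSpec(NamedTuple):
    provider: str  # e.g. "ollama", "openai", "anthropic"
    model: str     # e.g. "llama3"


def parse_provider_spec(env: Optional[Mapping[str, str]] = None) -> Optional[ProviderSpec]:
    """Return the parsed spec, or None to signal deterministic (no-model) mode."""
    env = os.environ if env is None else env
    raw = (env.get("AGENT_OS_PROVIDER") or "").strip()
    if not raw:
        return None  # nothing configured: make no model calls
    # Split on the first colon only, so model names that contain a colon survive.
    provider, sep, model = raw.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>:<model>', got {raw!r}")
    return ProviderSpec(provider.lower(), model)
```

Returning `None` rather than a default provider keeps the opt-in property explicit: a caller has to handle the no-model case instead of silently reaching the network.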
from agent_os.providers import get_provider, provider_from_env
p = get_provider("ollama:llama3") # or provider_from_env() to read the env
answer = p.complete("Explain adding fractions to a 10-year-old.")
vectors = p.embed(["add fractions", "multiply fractions"])
python examples/provider_demo.py # offline walkthrough of all three roles
Deep dive (the three roles, every provider, OpenAI-compatible endpoints, the
opt-in wiring): docs/providers.md
Skills are reusable SKILL.md procedures the agent matches and follows. agent-os
speaks both its own format and the open Agent Skills
standard (YAML-frontmatter SKILL.md), so you can point it at any folder of open
skills and import them with no code:
export AGENT_OS_SKILLS_PATH="/path/to/any/skills/dir" # recursive, multi-root
agent-os cmd "/skills" # your skills + imported ones
A matched skill is injected into the prompt sent to your model; set
AGENT_OS_PROVIDER=ollama:llama3 to run it locally and free. Nothing is
hardwired to one vendor, and privileged tasks still pass the risk gate, the audit
log, and the Ninja Harness score. See docs/skills.md
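As a rough illustration of what reading a YAML-frontmatter SKILL.md involves, the sketch below splits a file into metadata and body using only the standard library. It is an assumption-laden simplification, not agent-os's loader: `split_frontmatter` is a hypothetical name, and it handles only flat `key: value` lines, so nested YAML would need a real YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (metadata, body).

    Only flat `key: value` frontmatter lines are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    meta: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter: the rest is the body
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip().strip("\"'")
    return {}, text  # unterminated frontmatter: treat the whole file as body
```

Falling back to "everything is body" on malformed input means a broken skill file degrades to plain instructions rather than raising during a folder scan.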
One goal → a coordinator decomposes it → sub-tasks run in parallel → the
coordinator synthesizes one deliverable. The parallel-swarm pattern, but placed
under agent-os's trust spine, because speed without verification just produces
scaled-up errors, not scaled-up value.
export AGENT_OS_PROVIDER=ollama:llama3 # local + free; your model, your data
agent-os swarm "research the top 5 local LLM runtimes; for each: license, RAM, speed; compare in a table"
# or: agent-os cmd "/swarm ..." · or the swarm card in `agent-os ui`
Swarm: ...
3 sub-task(s) · 2 done · 1 gated · 0 failed
- [PASS 89] summarize the intro
- [GATED:WRITE] delete the prod database (default-deny: never auto-run)
Synthesis scored 88.8 (Job 2-14f5f9)
Every sub-task is a real job (traced, risk-gated, Ninja-scored, queryable via
/job and /trace); privileged sub-tasks are gated, not auto-executed; the
synthesis is scored too. Local-first, your model, honest concurrency (a
bounded pool sized to your machine, no fictional "300"). See
docs/orchestrator.md
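The "bounded pool, gated sub-tasks" idea can be sketched as follows. This is a toy model under stated assumptions, not the orchestrator's code: `run_swarm`, `PRIVILEGED_WORDS`, and the cap of four workers are all hypothetical, and the keyword check merely stands in for the real risk gate.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the risk gate: any of these words marks a sub-task privileged.
PRIVILEGED_WORDS = ("delete", "send", "deploy", "write")


def run_swarm(subtasks, worker, max_workers=None):
    """Run non-privileged sub-tasks in a bounded pool; gate privileged ones unrun."""
    # Bound the pool by the machine's CPU count (capped at 4 in this sketch).
    max_workers = max_workers or min(4, os.cpu_count() or 1)
    gated = [task for task in subtasks if any(word in task.lower() for word in PRIVILEGED_WORDS)]
    runnable = [task for task in subtasks if task not in gated]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, runnable))  # map preserves input order
    return {"done": dict(zip(runnable, results)), "gated": gated}
```

For example, `run_swarm(["summarize the intro", "delete the prod database"], worker=str.upper)` runs only the first task and reports the second under `"gated"`.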
Every command is recorded into a hash-chained, tamper-evident audit log
(audit.py): each entry's hash covers its content plus the previous hash, so any
edit or deletion breaks the chain and is detectable. The risk classifier
(risk.py) is default-deny (anything ambiguous, or that writes/sends/deploys,
is gated for approval) and tool-aware, so a task is escalated if the agent
merely can send or delete. A global error boundary means users never see a raw
stack trace.
agent-os cmd "/audit" # recent entries + chain integrity
agent-os cmd "/risk make the prod table empty" # → WRITE: REQUIRES APPROVAL
Audit log: 12 entries · chain ✅ intact
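The hash-chaining idea ("each entry's hash covers its content plus the previous hash") can be illustrated in a few lines. This is a generic sketch of the technique, not the code in audit.py: SHA-256, the JSON serialization, the all-zero genesis value, and the function names are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's "previous hash"


def entry_hash(content: dict, prev_hash: str) -> str:
    """Hash the entry's content together with the previous entry's hash."""
    payload = json.dumps(content, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, content: dict) -> None:
    """Append an entry that is chained to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"content": content, "prev": prev_hash, "hash": entry_hash(content, prev_hash)})


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edit or mid-chain deletion returns False."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(entry["content"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

One limitation of this bare sketch: removing entries from the tail leaves a shorter but still valid chain, so detecting truncation needs something extra, such as storing the latest hash or entry count elsewhere.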
Every run is also metered: latency, estimated tokens, and estimated cost
($0 on local models). /cost rolls it up, so a run is quality- and cost- and
safety-accounted.
agent-os cmd "/cost" # tokens · latency · est. cost across recent runs
See SECURITY.md for the full threat model and known limitations.
insights.py turns per-source summaries into structured insights, each a
claim, its evidence (every point cites a source), an implication for a chosen
lens, and a delta vs the previous run (the part that compounds, via memory):
python examples/podcast_digest_demo.py
Cross-episode insights
1) Shared theme: 'incentives'
- Claim/theme: Acquired and Huberman Lab both touch on 'incentives'.
- Evidence: Acquired: ... aligning incentives with investors; Huberman Lab: ...
- Delta vs previous: New theme vs previous digest.
[Ninja Harness] grounding=… hygiene=… cert=…
You supply the LLM reasoner= (for the rich prose) and your feed fetcher (to
build EpisodeSummary objects). A deterministic keyword fallback ships so the
loop runs and is testable without a model, and the digest is scored by Ninja
Harness (grounding = claims backed by evidence; hygiene = concise), so weak
synthesis gets flagged.
Run it from the command surface, plug in your model, or watch it compound:
agent-os cmd "/digest" # synthesize β score β persist as a job
python examples/multiday_digest_demo.py # New → Reinforced deltas across days
from agent_os.reasoners import LLMReasoner
from agent_os.insights import CrossEpisodeSynthesizer
synth = CrossEpisodeSynthesizer(reasoner=LLMReasoner(complete=my_model), memory=mem)
Deep dive (philosophy, schema, grounding, the compounding loop, the LLM
prompt contract): docs/insights.md ·
skill: podcast-digest.
| Module | Responsibility |
|---|---|
| trace_recorder.py | Record every job into traces/<job_id>/ (command, stdout, screenshots, final, trace.json, ninja_report.json). Produces a Ninja-Harness-parseable trace. |
| agent_memory.py | Persistent memory: MEMORY.md, USER.md, state.db (facts, prefs, outcomes), sessions/. |
| skill_registry.py | Load reusable procedures from skills/*/SKILL.md (triggers, procedure, pitfalls, verification, artifacts) and match a command to the best skill. |
Plus: profiles.py (researcher / operator / builder / qa), improvement.py
(propose-only patches), and runner.py (the loop).
Each profile has its own allowed tools, memory namespace, personality, and a
quality threshold for the eval gate.
After a weak run (NARI < profile threshold), propose_improvement() builds a
structured proposal (failure reason, suggested memory update, suggested skill
patch) that requires explicit human approval. The agent never rewrites
itself automatically.
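The propose-only contract can be pictured as a data structure that cannot be applied until a human flips a flag. The sketch below is hypothetical throughout: `Proposal`, `propose_if_weak`, and `apply_proposal` are illustrative names and do not reflect the real signature of propose_improvement().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    failure_reason: str
    memory_update: str
    skill_patch: str
    approved: bool = False  # set to True only by an explicit human decision


def propose_if_weak(score: float, threshold: float, failure_reason: str) -> Optional[Proposal]:
    """Build a proposal only when the run scored below the profile threshold."""
    if score >= threshold:
        return None
    return Proposal(
        failure_reason=failure_reason,
        memory_update=f"Remember: {failure_reason}",
        skill_patch=f"Add pitfall: {failure_reason}",
    )


def apply_proposal(proposal: Proposal) -> str:
    """Refuse to apply anything that has not been approved."""
    if not proposal.approved:
        raise PermissionError("proposal requires explicit human approval")
    return "applied"
```

Keeping approval as state on the proposal, and checking it at the single point of application, is one simple way to make "never rewrites itself automatically" an enforced invariant rather than a convention.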
The platform exposes a transport-agnostic command router: the same commands you'll
wire to WhatsApp later. Try them locally with agent-os cmd:
agent-os cmd "/ping" # liveness
agent-os cmd "/status" # health + recent jobs
agent-os cmd "/agents" # list agent profiles
agent-os cmd "/skills" # list skills + triggers
agent-os cmd "/eval" # run the Ninja Harness suite (or summarize jobs)
agent-os cmd "/browser-demo" # run the demo agent end-to-end
agent-os cmd "/learn <path|text>" # ingest notes/files into the brain
agent-os cmd "/ask <question>" # answer from your knowledge base (grounded + scored)
agent-os cmd "/audit" # recent audit entries + chain integrity
agent-os cmd "/model" # show the configured model provider
agent-os cmd "/job f6df6f7d" # show a persisted job (id or short prefix)
agent-os cmd "/trace f6df6f7d" # show a job's trajectory + score
Every run is persisted to SQLite (agent_state/jobs.db), so jobs survive
restarts and you can look them up by id (or short prefix) afterward. Each major
run leaves behind a trace, a Ninja Harness score, and (if weak) an
improvement proposal; that's how the system compounds.
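Looking a job up by a short id prefix, as in the `/job f6df6f7d` example, might look like the sketch below. The `jobs` table and its columns are assumptions made for illustration; the actual schema of agent_state/jobs.db is not described here, and `find_job` is a hypothetical helper.

```python
import sqlite3


def find_job(conn: sqlite3.Connection, id_or_prefix: str):
    """Return the single job row whose id starts with `id_or_prefix`, else None."""
    rows = conn.execute(
        # substr() comparison avoids LIKE wildcards in user-supplied prefixes.
        "SELECT id, result, score FROM jobs WHERE substr(id, 1, ?) = ? LIMIT 2",
        (len(id_or_prefix), id_or_prefix),
    ).fetchall()
    # Zero matches or an ambiguous prefix both yield None in this sketch.
    return rows[0] if len(rows) == 1 else None
```

Fetching at most two rows is enough to tell "unique" from "ambiguous" without scanning every match.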
agent-os run "<command>" [--profile P] [--agent-cmd "python my_agent.py"] [--case case.yaml] [--json]
agent-os cmd "/status" # WhatsApp-style command surface
agent-os skills # list skills + triggers
agent-os memory # recent job outcomes
agent-os run (and cmd for write actions) exit non-zero when a run is flagged,
so they work as a CI gate.
Level 1: Agent OS Core ✅ (this release)
SQLite persistent jobs · trace recorder · skill registry · agent profiles ·
command router (/eval /skills /agents /job /trace /status /ping /browser-demo) ·
Ninja Harness report after every run.
Level 2: Reliability Layer ✅ (this release)
Bridge process supervisor (supervisor.py, restart + backoff) · health checks
(health.py, /health) · structured JSON logs (logging_setup.py) · retries +
timeout policy (reliability.py) · token health without leaking secrets
(token_health.py) · sender allowlist, fail-closed (allowlist.py) · daily eval
summary (daily_eval.py). Deploy templates for the permanent Cloudflare named
tunnel, systemd/launchd supervisor service, and daily-eval schedule are in
deploy/; you run those with your own accounts.
agent-os health # detailed health checks
agent-os supervise -- python bridge.py # keep your bridge alive
agent-os daily-eval # daily reliability summary
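A "restart + backoff" supervisor, in its simplest form, re-runs a command and waits longer after each failure. The sketch below is a generic illustration of that pattern, not supervisor.py: the function names, the exponential schedule, the 60-second cap, and the stop-on-exit-0 rule are all assumptions.

```python
import subprocess
import time


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0):
    """Yield an endless exponential delay schedule, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def supervise(argv, max_restarts: int = 5, sleep=time.sleep) -> int:
    """Re-run `argv` until it exits 0 or `max_restarts` is used up; return the last exit code."""
    delays = backoff_delays()
    code = subprocess.call(argv)
    for _ in range(max_restarts):
        if code == 0:
            break
        sleep(next(delays))  # wait longer after each consecutive failure
        code = subprocess.call(argv)
    return code
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a long-running bridge would more likely restart indefinitely and reset the schedule after a healthy stretch.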
Level 3: Controlled Autonomy ✅ (this release)
Risk classifier (risk.py) · approval queue (approvals.py, /pending
/approve /reject) · read-only tasks auto-run, write/send/deploy gated ·
github-publish + gmail-digest skills. The agent never takes a privileged action
without explicit human approval.
agent-os cmd "/run summarize the inbox" # READ_ONLY → auto-runs
agent-os cmd "/run send the weekly update" # SEND → queued for approval
agent-os cmd "/pending" # list what's waiting
agent-os cmd "/approve <id>" # execute it
agent-os cmd "/reject <id>" # cancel it
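The default-deny rule behind these commands (only clearly read-only tasks auto-run; everything else, including anything unrecognized, waits for approval) can be sketched as below. This is a deliberately naive keyword model, not risk.py: the `Risk` enum, the keyword lists, and both functions are hypothetical, and the real classifier is also described as tool-aware, which this sketch does not attempt.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    SEND = "send"
    DEPLOY = "deploy"
    UNKNOWN = "unknown"


# Hypothetical keyword lists, chosen only to make the example run.
KEYWORDS = {
    Risk.DEPLOY: ("deploy", "publish", "release"),
    Risk.SEND: ("send", "email", "post"),
    Risk.WRITE: ("write", "delete", "update", "empty", "drop"),
    Risk.READ_ONLY: ("summarize", "read", "list", "show", "research"),
}


def classify(task: str) -> Risk:
    """Check privileged categories first so a mixed task is never downgraded."""
    words = set(task.lower().split())
    for risk in (Risk.DEPLOY, Risk.SEND, Risk.WRITE, Risk.READ_ONLY):
        if words & set(KEYWORDS[risk]):
            return risk
    return Risk.UNKNOWN  # ambiguous input is treated as risky


def needs_approval(task: str) -> bool:
    """Default-deny: only a positively identified read-only task may auto-run."""
    return classify(task) is not Risk.READ_ONLY
```

The key design point is the direction of the default: the function asks "is this provably safe?" rather than "is this provably dangerous?", so a gap in the keyword lists produces an approval prompt instead of an unreviewed action.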
Live integrations (WhatsApp/Meta, Gmail, Cloudflare) are pluggable adapters you
wire with your own credentials; none are bundled or faked.
PRs welcome. See CONTRIBUTING.md for setup and the workflow,
and docs/code-review.md for how we review: small
self-contained changes, tests in the same change, and descriptions that say what
and why.
Apache-2.0.