An internal workflow for Pearson Specter Litt that ingests messy legal-style documents, extracts grounded evidence, and produces a Case Fact Summary an operator can edit. Operator edits feed back into a learned-rules memory so future drafts improve.
Built on Next.js + Vercel AI SDK + Anthropic Claude + Voyage embeddings, backed by libSQL with built-in vector search and FTS5.
| Layer | Choice |
|---|---|
| Framework | Next.js 16, React 19, TS strict |
| AI | ai v6 + @ai-sdk/anthropic |
| Models | Haiku 4.5 for OCR and rule extraction; Opus 4.7 for doc-level extraction and draft generation |
| Embeddings | Voyage AI voyage-3 (1024-d), REST |
| Storage | @libsql/client (pure JS) — relational tables + F32_BLOB(1024) vectors + FTS5 |
| Files | Local disk under data/uploads/<doc_id>/ |
| Upload | formidable (multipart) |
| PDF rasterize | pdf-to-img (pdfjs + napi-rs canvas — no system poppler dep) |
| Diff | diff (unified diff for prompts and audit) |
pnpm i
Next, copy .env.example to .env and fill in the environment variables.
pnpm dev
The server boots at http://localhost:3000. The first API request lazily creates data/clerk.db and runs migrations.
The data/ directory holds the SQLite file and uploaded files. It is gitignored.
# 1. Create a matter (a "case")
curl -s -X POST localhost:3000/api/matters \
-H 'content-type: application/json' \
-d '{"name":"Smith v. Acme"}'
# → {"id":"<MATTER_ID>", "name":"Smith v. Acme", "created_at":...}
# 2. Upload a document. Ingest auto-starts in the background.
curl -s -X POST localhost:3000/api/documents/upload \
-F file=@./test-data/complaint.pdf \
-F matter_id=<MATTER_ID>
# → {"document_id":"<DOC_ID>", "status":"uploaded", "filename":"complaint.pdf"}
# 3. Poll until ingestion is "ready"
watch -n 2 'curl -s "localhost:3000/api/ingest/status?document_id=<DOC_ID>"'
# Status walks: uploaded → rasterizing → ocring → extracting → chunking → embedding → ready
# 4. Sanity-check retrieval (grounding inspection)
curl -s -X POST localhost:3000/api/retrieval/query \
-H 'content-type: application/json' \
-d '{"matter_id":"<MATTER_ID>","query":"who are the parties","k":5}'
# 5. Generate the Case Fact Summary
curl -s -X POST localhost:3000/api/drafts/generate \
-H 'content-type: application/json' \
-d '{"matter_id":"<MATTER_ID>"}'
# → {"draft_id":"<DRAFT_ID>", "content_md":"# Case Fact Summary\n...", "citations":[...]}
# 6. Submit an edited version. Rules are extracted and stored.
curl -s -X POST localhost:3000/api/drafts/<DRAFT_ID>/edits \
-H 'content-type: application/json' \
-d '{"edited_content":"<the markdown you edited>"}'
# → {"edit_id":"...", "rules_extracted":[{"rule_text":"...","rationale":"..."}]}
# 7. Confirm rules persisted
curl -s localhost:3000/api/rules
# 8. Generate again. The new draft's prompt now includes the learned rules.
curl -s -X POST localhost:3000/api/drafts/generate \
-H 'content-type: application/json' \
-d '{"matter_id":"<MATTER_ID>"}'
Upload → documents row → spawnIngest(docId) runs fire-and-forget. documents.status walks through each stage so polling /api/ingest/status is meaningful mid-flight.
upload → rasterize → ocr (Haiku 4.5, concurrency 4) → extract fields (Opus 4.7)
↓
ready ← embed ← chunk
- Rasterize: pdf-to-img. Saved to data/uploads/<doc_id>/page-NNNN.png.
- OCR: Haiku 4.5; unreadable text is marked [illegible]. ocr_confidence = 1 − illegible_chars / total_chars.
- Extract fields: generateText({ output: Output.object({ schema }) }) with ExtractedFieldsSchema → JSON in extractions.fields_json.
- Chunk: pages-aware sliding window, ~600 tokens, 100-token overlap.
- Embed: voyage-3, batches of 128, input_type: "document". Vectors stored on chunks.embedding.

Hybrid by default: vector top-k + BM25 (via libsql FTS5) fused with Reciprocal Rank Fusion (score = Σ 1/(60 + rank_i)). Both paths filter by matter_id so a draft can never pull evidence from another matter.
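The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with the constant 60 quoted above; the function name `fuseRankings` and the plain string-array inputs are assumptions for the example, not the code in this repo.

```typescript
// Reciprocal Rank Fusion over ranked lists of chunk IDs.
// RRF_K = 60 matches the constant in the formula above; everything else
// (function name, input shape) is a hypothetical sketch.
const RRF_K = 60;

function fuseRankings(rankings: string[][], k: number): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((chunkId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(chunkId, (scores.get(chunkId) ?? 0) + 1 / (RRF_K + rank));
    });
  }
  // Highest fused score first, then keep the top k.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([chunkId]) => chunkId);
}

// Example: one list from the vector path, one from the BM25 path.
const fusedExample = fuseRankings(
  [
    ["c1", "c2", "c3"],
    ["c3", "c1", "c4"],
  ],
  3,
);
```

A chunk that appears in both lists (here `c1` and `c3`) outranks one that appears in only one, which is the behaviour the hybrid mode relies on.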
- Vector: vector_top_k('idx_chunks_embedding', vector32(?), k) returns rowids ordered by ANN distance; we over-fetch (k * 4), apply the matter filter, and order by vector_distance_cos.
- BM25: chunks_fts MATCH ? against the FTS5 virtual table that triggers keep synced with chunks.

The Case Fact Summary has five sections: parties, timeline, claims, procedural history, key documents. For each section we run a tuned retrieval query and union the top-k=6 hits into a deduped evidence pool.
Then a single generateText call with Output.object produces the entire structured summary in one shot. The schema requires every item to include a citations: string[] of chunk_ids. Citations not present in the evidence pool are dropped (hallucination guard). Draft + citation rows are written in a single libsql batch transaction.
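As a rough sketch of the pooling step, the per-section hits can be unioned and deduplicated by chunk ID. The `Hit` shape, the `buildEvidencePool` name, and the choice to keep the higher score when a chunk appears twice are assumptions made for illustration.

```typescript
// Hypothetical shape of one retrieval hit.
interface Hit {
  chunkId: string;
  score: number;
}

const TOP_K = 6; // per-section top-k, as stated above

// Union the top-k hits of every section into one pool, deduplicated by chunkId.
// Assumption: when a chunk is returned for several sections, keep the best score.
function buildEvidencePool(hitsBySection: Record<string, Hit[]>): Hit[] {
  const pool = new Map<string, Hit>();
  for (const hits of Object.values(hitsBySection)) {
    for (const hit of hits.slice(0, TOP_K)) {
      const existing = pool.get(hit.chunkId);
      if (!existing || hit.score > existing.score) {
        pool.set(hit.chunkId, hit);
      }
    }
  }
  return [...pool.values()];
}
```

The IDs in the resulting pool are what the citation filter later checks model output against.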
matter_id
│
▼
per-section hybrid retrieval ──┐
│ evidence pool
active style_rules ─────────────┤
▼
generateText + Output.object
│
▼
CaseFactSummary (with citations[] on every item)
│
▼
drafts + draft_citations (atomic batch)
When an operator submits an edited draft:
1. An edits row is written with original_content, edited_content, and a unified diff.
2. Rules extracted from the edit are stored in style_rules (active=1, weight=1.0).
3. The draft is marked status='edited'.

Future draft generations call loadActiveRules() and inject renderRulesForPrompt(rules) into the system prompt under a heading: ## House style rules learned from prior operator edits. The active rule IDs at the time of generation are snapshotted into drafts.rules_used for auditability.
This is not just a per-draft diff. The rules accumulate across edits and across matters, so the system gets generically better at producing drafts the way this firm wants them to look.
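To make the injection step concrete, here is a minimal sketch of how active rules could be turned into a prompt block and an audit snapshot. The heading text and the cap of 20 rules by weight come from this README; the `StyleRule` shape and the function name `renderRulesSketch` are hypothetical and do not claim to match renderRulesForPrompt in the repo.

```typescript
// Hypothetical row shape for style_rules.
interface StyleRule {
  id: string;
  ruleText: string;
  weight: number;
  active: boolean;
}

const MAX_RULES = 20; // cap on injected rules, by weight
const HEADING = "## House style rules learned from prior operator edits";

// Select active rules (heaviest first, capped) and render them as a bullet
// list for the system prompt. rulesUsed is what would be snapshotted for audit.
function renderRulesSketch(rules: StyleRule[]): { promptBlock: string; rulesUsed: string[] } {
  const selected = rules
    .filter((rule) => rule.active) // filter() copies, so sort() does not mutate the input
    .sort((a, b) => b.weight - a.weight)
    .slice(0, MAX_RULES);
  if (selected.length === 0) {
    return { promptBlock: "", rulesUsed: [] };
  }
  const lines = selected.map((rule) => `- ${rule.ruleText}`);
  return {
    promptBlock: [HEADING, ...lines].join("\n"),
    rulesUsed: selected.map((rule) => rule.id),
  };
}
```

Returning the rule IDs alongside the prompt text keeps the snapshot consistent with exactly what the model saw.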
matters ── one case
documents ── one source file, with a status enum
pages ── one row per page, holds OCR'd text
extractions ── JSON of doc-level structured fields
chunks ── retrieval unit; has embedding (F32_BLOB(1024))
chunks_fts ── virtual FTS5 table mirrored from chunks
drafts ── generated Case Fact Summary (json + markdown)
draft_citations ── chunk_id → section/field path (grounding inspection)
edits ── operator-submitted edits + unified diff
style_rules ── extracted from edits; active rules injected into prompts
See lib/db.ts for the full schema. The vector index is built with CREATE INDEX ... ON chunks (libsql_vector_idx(embedding)) and queried via vector_top_k(idx_name, vector32(?), k).
| Method | Path | Purpose |
|---|---|---|
| POST | /api/matters | Create a matter |
| GET | /api/matters | List matters |
| GET | /api/matters/:id | Matter + its documents |
| POST | /api/documents/upload | Multipart upload; auto-kicks ingest |
| GET | /api/documents/:id | Doc + pages + extraction |
| GET | /api/ingest/status?document_id=... | Status + counts (pages_ocred, chunks, embedded) |
| POST | /api/retrieval/query | Debug retrieval (`mode: vector` …) |
| POST | /api/drafts/generate | Generate Case Fact Summary |
| GET | /api/drafts/:id | Draft + citations + chunks_lookup |
| POST | /api/drafts/:id/edits | Submit an edited draft → extracts rules |
| GET | /api/rules | List all rules |
| PATCH | /api/rules/:id | Toggle active, adjust weight |
All bodies are validated with zod; method dispatch and error handling go through lib/api.ts.
- Matters are separated only by the matter_id foreign key. No auth.
- No pdf-parse fast path: the brief calls out scanned/noisy inputs as the common case, and using one OCR path keeps quality consistent.
- English only: voyage-3 is multilingual but the OCR prompt, the chunker, and the extraction schema all assume English legal vocabulary.
- … created_at DESC).

| Choice | What we gained | What we gave up |
|---|---|---|
| libSQL over better-sqlite3 + sqlite-vec | Pure-JS install, no native compile, vectors and FTS5 in one engine, file-portable demo | A small async-API tax everywhere; vector index API less documented than pgvector |
| libSQL over Postgres + pgvector + Docker | Single-process, single-file, pnpm dev and you’re done | Won’t scale past one machine; no concurrent writers |
| Claude vision OCR for everything | Handles scans, handwriting, stamps, marginalia in one model; one prompt; calibrated to legal vocabulary | More expensive per page than pdf-parse on born-digital PDFs; rate limits matter |
| Haiku 4.5 for OCR + rule extraction, Opus 4.7 for drafts + extraction | ~5–10× cheaper on the page-OCR hot path while keeping the smart model where reasoning matters | Two model IDs to maintain; small quality regression on tricky handwriting vs. Opus-everywhere |
| Voyage voyage-3 (1024d) over OpenAI text-embedding-3-small (1536d) | Documented as Anthropic’s recommended embedding partner; smaller vectors → cheaper index; better legal/long-context quality on Voyage’s published benchmarks | Adds a second API key |
| Hybrid retrieval (vector + BM25, RRF) | Recovers exact docket numbers and dates that pure vector misses; recovers paraphrased mentions that pure BM25 misses | Two queries per retrieval call (parallelized); one tunable hyperparameter (RRF_K=60) |
| Per-section retrieval queries | Drafts get focused evidence for each section — procedural posture isn’t drowned out by claims text | Five retrieval calls per draft (parallelized); section-query strings are hand-tuned |
| generateText + Output.object for the whole summary at once | One model call → fully populated structured output → atomic transaction write; the model can balance evidence across sections | A bad single call ruins the whole draft; we mitigate with temperature: 0.2 + a strong system prompt + the hallucination guard |
| Style rules (not few-shot pairs) as the improvement signal | Stable, inspectable, toggle-able via API; survives editor turnover | Loses literal phrasing the partner liked; can’t capture “match the cadence of past summary X” |
| Fire-and-forget ingest from upload | Upload returns in ms; client polls status; trivial to demo | If the dev server is killed mid-ingest, the row is stuck in a non-terminal state. No retry, no queue |
| Pages-aware sliding window, ~600 tokens, 100-token overlap | Citations can render page ranges (“p. 3–4”); overlap stops sentence-level evidence falling between chunks | A document-aware semantic chunker (e.g. by heading) would be better for long contracts — out of scope for v1 |
| No auth | Faster to demo; less code | Anyone with the URL can read every matter. A shared-bearer middleware is ~10 lines if needed |
| No rule dedup / merge | Simpler v1; rules are diffable, inspectable | After many edits rules can accumulate and contradict. Mitigated by capping injected rules at 20 by weight; a v2 would embed and merge near-duplicates |
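The page-aware sliding window from the table can be sketched as follows. This is an illustration only: tokens are approximated by whitespace-separated words (an assumption, since the real tokenizer is not described here), and the `Page`, `Chunk`, and `chunkPages` names are hypothetical.

```typescript
interface Page {
  pageNum: number;
  text: string;
}

interface Chunk {
  text: string;
  pageStart: number;
  pageEnd: number;
}

// Sliding window over all pages of a document, remembering which page each
// token came from so every chunk can report a page range.
// Assumption: one whitespace-separated word counts as one token.
function chunkPages(pages: Page[], windowSize = 600, overlap = 100): Chunk[] {
  if (overlap >= windowSize) {
    throw new Error("overlap must be smaller than windowSize");
  }
  const tokens: { word: string; pageNum: number }[] = [];
  for (const page of pages) {
    for (const word of page.text.split(/\s+/).filter(Boolean)) {
      tokens.push({ word, pageNum: page.pageNum });
    }
  }
  const chunks: Chunk[] = [];
  const step = windowSize - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    const span = tokens.slice(start, start + windowSize);
    chunks.push({
      text: span.map((t) => t.word).join(" "),
      pageStart: span[0].pageNum,
      pageEnd: span[span.length - 1].pageNum,
    });
    // Stop once the window has reached the end, to avoid a tail chunk
    // that only repeats the overlap.
    if (start + windowSize >= tokens.length) break;
  }
  return chunks;
}
```

A chunk whose window crosses a page boundary reports both pages, which is what lets a citation render as a page range.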
Grounding is enforced at write time. After generateText returns the structured object, we walk every citations: string[] and drop any chunk_id that wasn’t in the evidence pool we sent to the model. Dropped citations are logged. The draft and citation rows are then written in a single libsql batch transaction, so a draft is never visible without its citations.
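A minimal sketch of that write-time filter is shown below. The `Summary` shape and the `dropUngroundedCitations` name are assumptions for illustration; only the behaviour (drop any chunk_id outside the evidence pool and record what was dropped) is taken from the description above.

```typescript
interface SummaryItem {
  text: string;
  citations: string[];
}

// Hypothetical structured summary: section name -> items.
type Summary = Record<string, SummaryItem[]>;

// Remove every cited chunk_id that was not in the evidence pool sent to the
// model, and collect the dropped IDs so they can be logged.
function dropUngroundedCitations(
  summary: Summary,
  evidenceIds: Set<string>,
): { cleaned: Summary; dropped: string[] } {
  const dropped: string[] = [];
  const cleaned: Summary = {};
  for (const [section, items] of Object.entries(summary)) {
    cleaned[section] = items.map((item) => ({
      ...item,
      citations: item.citations.filter((id) => {
        const grounded = evidenceIds.has(id);
        if (!grounded) dropped.push(id);
        return grounded;
      }),
    }));
  }
  return { cleaned, dropped };
}
```

The length of `dropped` corresponds to the "dropped citations" count tracked in the evaluation.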
The system accepts any of:
Drop your own test files into data/uploads/_samples/ or upload them directly with curl. A messy two-page scanned complaint is the most representative test.
For reproducibility, here’s a minimal synthetic input you can paste into a .txt, screenshot as a PNG, and upload:
IN THE SUPERIOR COURT OF NEW YORK
COUNTY OF NEW YORK
JOHN SMITH, an individual, Case No. 24-CV-1138
Plaintiff,
v. COMPLAINT FOR BREACH
ACME CORP., a Delaware corporation, OF CONTRACT
Defendant.
1. On March 14, 2024, Plaintiff and Defendant entered into a
Services Agreement (the "Agreement").
2. On August 2, 2024, Defendant unilaterally terminated the
Agreement without the 30 days' notice required by Section 8.
3. Plaintiff has suffered damages in excess of $250,000.
WHEREFORE, Plaintiff demands:
(a) compensatory damages in the amount of $250,000;
(b) attorneys' fees and costs of suit;
(c) such other relief as the Court deems just and proper.
Dated: November 11, 2024.
[Handwritten margin: "see also Exhibit C — termination email"]
POST /api/drafts/generate returns JSON with content_md, content_json, and citations. The markdown form is human-readable; the JSON form is what the operator’s editor can re-submit. Example:
# Case Fact Summary
## Caption
- **Case name:** Smith v. Acme Corp.
- **Docket:** 24-CV-1138
- **Court:** Superior Court of New York, County of New York
## Parties
- **John Smith** (plaintiff)
- **Acme Corp.** (defendant) — a Delaware corporation
## Timeline of Events
- **2024-03-14** — Plaintiff and Defendant entered into the Services Agreement
- **2024-08-02** — Defendant terminated the Agreement without the 30 days'
notice required by Section 8
- **2024-11-11** — Complaint filed
## Claims
- **Breach of contract** against Acme Corp.; relief sought: $250,000
compensatory damages plus attorneys' fees
## Procedural History
- **2024-11-11** — Complaint filed in Superior Court of New York
## Key Documents
- **Services Agreement** — contract between Smith and Acme, terminated 2024-08-02
- **Exhibit C — termination email** — referenced in handwritten marginalia;
not in record
## Open Questions / Gaps
- Exhibit C is referenced but not provided; obtain before drafting response.
- No date of execution given for the Services Agreement itself.
And the matching citations array (excerpted):
[
{ "section": "caption", "chunk_id": "f3a1…", "field_path": "caption" },
{ "section": "parties", "chunk_id": "f3a1…", "field_path": "parties[0]" },
{ "section": "parties", "chunk_id": "f3a1…", "field_path": "parties[1]" },
{ "section": "timeline", "chunk_id": "9c2e…", "field_path": "timeline_events[0]" },
{ "section": "timeline", "chunk_id": "9c2e…", "field_path": "timeline_events[1]" },
{ "section": "claims", "chunk_id": "9c2e…", "field_path": "claims[0]" },
{ "section": "key_documents", "chunk_id": "b71f…", "field_path": "key_documents[1]" }
]
GET /api/drafts/:id adds a chunks_lookup map (chunk_id → { text, page_num, page_end, doc_id, filename }) so the UI (or a reviewer) can resolve every citation to its source paragraph without another round trip.
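For example, a reviewer tool could turn each citation into a readable source label using that map. The field names below follow the response shape described above; the `describeCitation` helper and its label format are hypothetical.

```typescript
interface ChunkInfo {
  text: string;
  page_num: number;
  page_end: number;
  doc_id: string;
  filename: string;
}

interface Citation {
  section: string;
  chunk_id: string;
  field_path: string;
}

// Resolve one citation to "field_path: filename, p. N" (or a page range)
// using the chunks_lookup map returned with the draft.
function describeCitation(citation: Citation, lookup: Record<string, ChunkInfo>): string {
  const chunk = lookup[citation.chunk_id];
  if (!chunk) {
    return `${citation.field_path}: (chunk ${citation.chunk_id} not found)`;
  }
  const pages =
    chunk.page_num === chunk.page_end
      ? `p. ${chunk.page_num}`
      : `p. ${chunk.page_num}-${chunk.page_end}`;
  return `${citation.field_path}: ${chunk.filename}, ${pages}`;
}
```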
Submit an edited markdown via POST /api/drafts/:id/edits and the response is:
{
"edit_id": "ad04…",
"rules_extracted": [
{
"id": "0e15…",
"rule_text": "Always include the docket number in the caption, even when only the case name is on the cover page.",
"rationale": "Operator added '24-CV-1138' to the caption block where the original draft left Docket blank."
},
{
"id": "2b6a…",
"rule_text": "Use the form 'Plaintiff' / 'Defendant' (capitalized) on first reference, then party name on subsequent mentions.",
"rationale": "Operator rewrote three references from 'the plaintiff' to 'Plaintiff' and then to 'Smith'."
}
]
}
These rules are injected verbatim into the system prompt of every subsequent POST /api/drafts/generate call until toggled off via PATCH /api/rules/:id { "active": false }.
For a take-home with no labelled corpus and no production traffic, I focused evaluation on the three things that actually distinguish a working system from a confident hallucinator:
I deliberately did not evaluate “legal correctness” — the brief is explicit that it’s out of scope, and I’m not equipped to be a graded judge on that axis.
| Axis | Method | Signal |
|---|---|---|
| OCR fidelity | For each test page, compare pages.extracted_text to the source. Count [illegible] markers; spot-check handwriting and stamps. | ocr_confidence column (heuristic), plus eyeball pass on a handful of pages with marginalia |
| Grounding | After every POST /api/drafts/generate, the hallucination guard filter logs any chunk_id the model produced that wasn’t in the evidence pool. We count those and inspect each | Number of dropped citations (target: 0); for sampled draft items, read the cited chunk and decide whether it actually supports the item |
| Retrieval recall | For each section query, scan the top-k=6 hits and check that the right page is in the top-3. A “right page” is one a human would point to for that section’s content | Hit rate @ top-3 across the section queries on a 3-document test matter |
| Improvement loop | Generate draft 1. Edit it deliberately to introduce a generalizable change (e.g. “always include docket number”). Submit. Generate draft 2 on the same matter. Check that draft 2 follows the rule | Pre/post comparison; rule inspection via GET /api/rules |
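The retrieval-recall signal in the table reduces to a small calculation. The sketch below assumes the human-judged "right pages" are recorded by hand per section query; the `QueryResult` shape and the `hitRateAtK` name are hypothetical.

```typescript
// One section query: the pages of the ranked hits, plus the pages a human
// marked as correct for that section.
interface QueryResult {
  section: string;
  rankedPages: number[];
  relevantPages: number[];
}

// Fraction of queries with at least one relevant page among the first k hits.
function hitRateAtK(results: QueryResult[], k = 3): number {
  if (results.length === 0) return 0;
  const hits = results.filter((result) =>
    result.rankedPages.slice(0, k).some((page) => result.relevantPages.includes(page)),
  ).length;
  return hits / results.length;
}
```

With five section queries, a value of 0.8 corresponds to the "4/5" style of result reported for this axis.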
These are demo-scale numbers from a handful of documents. They are useful as a sanity check, not a benchmark.
| Axis | Result | Notes |
|---|---|---|
| OCR fidelity | Handwriting captured on 4/5 marginalia samples; printed text effectively perfect; stamps captured | The one failure was a faint pencil note diagonally across a stamped page. Mitigated by an explicit “preserve handwritten notes, stamps, footnotes, and marginalia” instruction in the OCR prompt |
| [illegible] rate | 0–3% of characters per page on the test set | Confidence column reflects this; pages with handwriting trend lower |
| Hallucination guard activations | 0 dropped citations across 6 generated drafts | The model reliably uses the chunk_ids we hand it. When temperature was raised to 0.7 in a side experiment, drops appeared — kept at 0.2 |
| Retrieval recall @ top-3 | 5/5 section queries surfaced the correct page in the top-3 for the synthetic complaint; 4/5 for one of the real scanned PDFs | The miss was “procedural posture” on a transactional document with no procedural posture — the section correctly fell back to an empty array |
| Grounding spot-check | 24/25 claims in 3 sampled drafts had at least one citation that a reader would agree supports the claim | The one ungrounded claim was a paraphrased restatement of two earlier facts — technically supportable, but the model didn’t attach a citation |
| Improvement loop | After editing a draft to add the docket number to the caption and re-submitting, the next generated draft on the same matter included the docket number in the caption without further prompting | Rule was extracted as: “Always include the docket number in the caption when available, even if it is only on the cover page.” |
- Rule dedup: for a near-duplicate rule, bump the existing rule's weight instead of writing a duplicate.
- Measure the [illegible] rate against a hand-keyed gold transcription.
- Rules can accumulate and contradict (there is no merge step that bumps an existing rule's weight instead of inserting). Mitigated for now by capping injected rules at 20 by weight.
- pdf-to-img is the bottleneck on large PDFs. It rasterizes in-process; >200 pages will use real RAM.
- ocr_confidence is 1 − illegible_chars / total_chars. Claude doesn’t expose token-level confidence; this is a proxy, not a calibrated score.
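The heuristic 1 − illegible_chars / total_chars is simple enough to show directly. In this sketch, illegible_chars is taken to be the total length of the [illegible] markers in the page text and an empty page scores 0; both are assumptions, since the exact counting rule is not spelled out here.

```typescript
const ILLEGIBLE_MARKER = "[illegible]";

// Proxy confidence for one OCR'd page: 1 - illegible_chars / total_chars.
// Assumptions: illegible_chars = number of markers * marker length,
// and an empty page gets confidence 0.
function ocrConfidence(pageText: string): number {
  const totalChars = pageText.length;
  if (totalChars === 0) return 0;
  const markerCount = pageText.split(ILLEGIBLE_MARKER).length - 1;
  const illegibleChars = markerCount * ILLEGIBLE_MARKER.length;
  return 1 - illegibleChars / totalChars;
}
```

Because the marker itself is counted as text, a page consisting only of markers scores 0 and a page with none scores 1.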