Clerk

An internal workflow for Pearson Specter Litt that ingests messy legal-style documents, extracts grounded evidence, and produces a Case Fact Summary an operator can edit. Operator edits feed back into a learned-rules memory so future drafts improve.

Built on Next.js + Vercel AI SDK + Anthropic Claude + Voyage embeddings, backed by libSQL with built-in vector search and FTS5.

What it does

  1. Ingest scanned PDFs, photos, low-res images, handwritten notes. Each page is rasterized and OCR’d by Claude vision — handwriting, stamps, marginalia and all.
  2. Extract document-level structured fields (parties, dates, claims, docket, court) via Opus 4.7.
  3. Index the OCR’d text into 600-token chunks with both vector embeddings (Voyage voyage-3) and FTS5 full-text.
  4. Generate a Case Fact Summary grounded in retrieved evidence. Every claim in the draft carries citations back to specific chunks.
  5. Learn from operator edits — diff the original vs. edited draft, ask Claude to extract reusable rules, store them, and inject them into future system prompts.

Stack

| Layer | Choice |
| --- | --- |
| Framework | Next.js 16, React 19, TS strict |
| AI | ai v6 + @ai-sdk/anthropic |
| Models | Haiku 4.5 for OCR and rule extraction; Opus 4.7 for doc-level extraction and draft generation |
| Embeddings | Voyage AI voyage-3 (1024-d), REST |
| Storage | @libsql/client (pure JS): relational tables + F32_BLOB(1024) vectors + FTS5 |
| Files | Local disk under data/uploads/<doc_id>/ |
| Upload | formidable (multipart) |
| PDF rasterize | pdf-to-img (pdfjs + napi-rs canvas; no system poppler dep) |
| Diff | diff (unified diff for prompts and audit) |

Setup

pnpm i

Next, copy .env.example to .env and fill in the environment variables.

pnpm dev

The server boots at http://localhost:3000. The first API request lazily creates data/clerk.db and runs migrations.

The data/ directory holds the SQLite file and uploaded files. It is gitignored.

End-to-end demo

# 1. Create a matter (a "case")
curl -s -X POST localhost:3000/api/matters \
  -H 'content-type: application/json' \
  -d '{"name":"Smith v. Acme"}'
# → {"id":"<MATTER_ID>", "name":"Smith v. Acme", "created_at":...}

# 2. Upload a document. Ingest auto-starts in the background.
curl -s -X POST localhost:3000/api/documents/upload \
  -F file=@./test-data/complaint.pdf \
  -F matter_id=<MATTER_ID>
# → {"document_id":"<DOC_ID>", "status":"uploaded", "filename":"complaint.pdf"}

# 3. Poll until ingestion is "ready"
watch -n 2 'curl -s "localhost:3000/api/ingest/status?document_id=<DOC_ID>"'
# Status walks: uploaded → rasterizing → ocring → extracting → chunking → embedding → ready

# 4. Sanity-check retrieval (grounding inspection)
curl -s -X POST localhost:3000/api/retrieval/query \
  -H 'content-type: application/json' \
  -d '{"matter_id":"<MATTER_ID>","query":"who are the parties","k":5}'

# 5. Generate the Case Fact Summary
curl -s -X POST localhost:3000/api/drafts/generate \
  -H 'content-type: application/json' \
  -d '{"matter_id":"<MATTER_ID>"}'
# → {"draft_id":"<DRAFT_ID>", "content_md":"# Case Fact Summary\n...", "citations":[...]}

# 6. Submit an edited version. Rules are extracted and stored.
curl -s -X POST localhost:3000/api/drafts/<DRAFT_ID>/edits \
  -H 'content-type: application/json' \
  -d '{"edited_content":"<the markdown you edited>"}'
# → {"edit_id":"...", "rules_extracted":[{"rule_text":"...","rationale":"..."}]}

# 7. Confirm rules persisted
curl -s localhost:3000/api/rules

# 8. Generate again. The new draft's prompt now includes the learned rules.
curl -s -X POST localhost:3000/api/drafts/generate \
  -H 'content-type: application/json' \
  -d '{"matter_id":"<MATTER_ID>"}'

Architecture overview

Ingestion pipeline (lib/ingest.ts)

Upload → documents row → spawnIngest(docId) runs fire-and-forget. documents.status walks through each stage so polling /api/ingest/status is meaningful mid-flight.

upload  →  rasterize  →  ocr (Haiku 4.5, concurrency 4)  →  extract fields (Opus 4.7)
                                                                       ↓
                                                ready  ←  embed  ←  chunk
  • Rasterize: PDF pages → PNG buffers at 2× scale (about 144 DPI, since pdf.js renders at 72 DPI at scale 1) via pdf-to-img. Saved to data/uploads/<doc_id>/page-NNNN.png.
  • OCR: each PNG → multimodal Claude call. Prompt explicitly preserves handwriting, stamps, footnotes, and marks unreadable regions [illegible]. ocr_confidence = 1 − illegible_chars / total_chars.
  • Extract: full doc text → generateText({ output: Output.object({ schema }) }) with ExtractedFieldsSchema → JSON in extractions.fields_json.
  • Chunk: page-aware sliding window, 600 tokens with 100-token overlap, snapped to sentence boundaries. Each chunk records the page range it spans (for “p. 3–4” citations).
  • Embed: Voyage voyage-3, batches of 128, input_type: "document". Vectors stored on chunks.embedding.
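
The chunking step can be sketched roughly as follows. This is a simplified illustration, not the code in lib/ingest.ts: it approximates tokens with whitespace-separated words and omits the sentence-boundary snapping, and the names `PageText`, `Chunk`, and `chunkPages` are hypothetical.

```typescript
// Hypothetical types; the real schema lives in lib/db.ts.
interface PageText { pageNum: number; text: string }
interface Chunk { text: string; pageStart: number; pageEnd: number }

// Page-aware sliding window. Assumption: one whitespace-separated word ~ one token.
function chunkPages(pages: PageText[], size = 600, overlap = 100): Chunk[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");

  // Flatten all pages into one token stream, remembering each token's page.
  const tokens: { word: string; page: number }[] = [];
  for (const p of pages) {
    for (const word of p.text.split(/\s+/).filter((w) => w.length > 0)) {
      tokens.push({ word, page: p.pageNum });
    }
  }

  const chunks: Chunk[] = [];
  const step = size - overlap; // each window starts `step` tokens after the previous one
  for (let start = 0; start < tokens.length; start += step) {
    const slice = tokens.slice(start, start + size);
    chunks.push({
      text: slice.map((t) => t.word).join(" "),
      pageStart: slice[0].page,              // first page the chunk touches
      pageEnd: slice[slice.length - 1].page, // last page the chunk touches
    });
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}
```

Recording `pageStart` and `pageEnd` on each chunk is what lets a citation render as a page range when a window straddles a page break.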

Retrieval (lib/retrieval.ts)

Hybrid by default: vector top-k + BM25 (via libsql FTS5) fused with Reciprocal Rank Fusion (score = Σ 1/(60 + rank_i)). Both paths filter by matter_id so a draft can never pull evidence from another matter.

  • Vector path: vector_top_k('idx_chunks_embedding', vector32(?), k) returns rowids ordered by ANN distance; we over-fetch (k * 4), apply the matter filter, and order by vector_distance_cos.
  • BM25 path: standard chunks_fts MATCH ? against the FTS5 virtual table that triggers keep synced with chunks.
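
As an illustration of the fusion step, here is a minimal Reciprocal Rank Fusion sketch. It assumes ranks are 1-based and that each path returns an ordered list of chunk ids; the function name and the `limit` parameter are hypothetical and not taken from lib/retrieval.ts.

```typescript
// Fuse several ranked id lists with score = sum of 1 / (rrfK + rank).
// Assumption: rank is 1-based (the top hit of a list has rank 1).
function reciprocalRankFusion(
  rankings: string[][],
  rrfK = 60,
  limit = 6,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (rrfK + rank));
    });
  }
  // Highest fused score first; ids found by both paths naturally rise.
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Example: "a" and "c" appear in both lists, so they outrank "b" and "d".
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // vector path
  ["c", "a", "d"], // BM25 path
]);
```

Because only ranks are used, the cosine distances and BM25 scores never need to be normalized against each other.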

Grounded draft generation (lib/draft.ts)

The Case Fact Summary has five sections — parties, timeline, claims, procedural history, key documents. For each section we run a tuned retrieval query and union the top-k=6 hits into a deduped evidence pool.

Then a single generateText call with Output.object produces the entire structured summary in one shot. The schema requires every item to include a citations: string[] of chunk_ids. Citations not present in the evidence pool are dropped (hallucination guard). Draft + citation rows are written in a single libsql batch transaction.

matter_id
   │
   ▼
per-section hybrid retrieval ──┐
                                │ evidence pool
active style_rules ─────────────┤
                                ▼
                       generateText + Output.object
                                │
                                ▼
              CaseFactSummary  (with citations[] on every item)
                                │
                                ▼
                drafts + draft_citations  (atomic batch)
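
A rough sketch of how the per-section hits might be unioned into one deduplicated evidence pool. The `Hit` shape and `buildEvidencePool` name are assumptions made for illustration; lib/draft.ts may structure this differently.

```typescript
// Hypothetical shape of a retrieval hit.
interface Hit { chunkId: string; text: string }

// Take the top-k hits of every section and keep each chunk only once.
function buildEvidencePool(hitsBySection: Record<string, Hit[]>, topK = 6): Hit[] {
  const pool = new Map<string, Hit>();
  for (const section of Object.keys(hitsBySection)) {
    for (const hit of hitsBySection[section].slice(0, topK)) {
      // First occurrence wins; a chunk relevant to two sections is sent once.
      if (!pool.has(hit.chunkId)) pool.set(hit.chunkId, hit);
    }
  }
  return Array.from(pool.values());
}
```

The ids in this pool are also the allow-list that citations are checked against after generation.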

Improvement loop (lib/edits.ts)

When an operator submits an edited draft:

  1. Persist edits row with original_content, edited_content, and a unified diff.
  2. Send (original, edited, diff) to Haiku 4.5 with a prompt that demands specific, imperative, reusable rules — and an explicit instruction to return an empty array if the edits were just typo fixes with no generalizable lesson.
  3. Insert each extracted rule into style_rules (active=1, weight=1.0).
  4. Mark the draft status='edited'.

Future draft generations call loadActiveRules() and inject renderRulesForPrompt(rules) into the system prompt under a heading: ## House style rules learned from prior operator edits. The active rule IDs at the time of generation are snapshotted into drafts.rules_used for auditability.
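
The selection and rendering of rules could look something like the sketch below. It is an illustration under assumptions: the `StyleRule` field names, `selectRulesForPrompt`, and `renderRulesSketch` are hypothetical, while the cap of 20 rules by weight, the created_at tie-break, and the heading text come from this README.

```typescript
// Hypothetical in-memory view of a style_rules row.
interface StyleRule {
  id: string;
  ruleText: string;
  weight: number;
  createdAt: number; // epoch millis; assumed representation of created_at
  active: boolean;
}

const MAX_INJECTED_RULES = 20;

// Active rules only, heaviest first, newest first on ties, capped at the limit.
function selectRulesForPrompt(rules: StyleRule[], limit = MAX_INJECTED_RULES): StyleRule[] {
  return rules
    .filter((r) => r.active)
    .sort((a, b) => b.weight - a.weight || b.createdAt - a.createdAt)
    .slice(0, limit);
}

// Render the selected rules as a numbered block for the system prompt.
function renderRulesSketch(rules: StyleRule[]): string {
  if (rules.length === 0) return "";
  const lines = rules.map((r, i) => `${i + 1}. ${r.ruleText}`);
  return ["## House style rules learned from prior operator edits", ...lines].join("\n");
}
```

The ids of the selected rules are what would be snapshotted alongside the draft for auditability.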

This is not just a per-draft diff. Rules accumulate across edits and across matters, so the system gets steadily better at matching how this firm wants drafts to look.

Data model (libSQL)

matters            ── one case
documents          ── one source file, with a status enum
pages              ── one row per page, holds OCR'd text
extractions        ── JSON of doc-level structured fields
chunks             ── retrieval unit; has embedding (F32_BLOB(1024))
chunks_fts         ── virtual FTS5 table mirrored from chunks
drafts             ── generated Case Fact Summary (json + markdown)
draft_citations    ── chunk_id → section/field path (grounding inspection)
edits              ── operator-submitted edits + unified diff
style_rules        ── extracted from edits; active rules injected into prompts

See lib/db.ts for the full schema. The vector index is built with CREATE INDEX idx_chunks_embedding ON chunks (libsql_vector_idx(embedding)) and queried via vector_top_k('idx_chunks_embedding', vector32(?), k).

API surface

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/matters | Create a matter |
| GET | /api/matters | List matters |
| GET | /api/matters/:id | Matter + its documents |
| POST | /api/documents/upload | Multipart upload; auto-kicks ingest |
| GET | /api/documents/:id | Doc + pages + extraction |
| GET | /api/ingest/status?document_id=... | Status + counts (pages_ocred, chunks, embedded) |
| POST | /api/retrieval/query | Debug retrieval (takes a `mode`, e.g. `vector`) |
| POST | /api/drafts/generate | Generate Case Fact Summary |
| GET | /api/drafts/:id | Draft + citations + chunks_lookup |
| POST | /api/drafts/:id/edits | Submit an edited draft → extracts rules |
| GET | /api/rules | List all rules |
| PATCH | /api/rules/:id | Toggle active, adjust weight |

All bodies are validated with zod; method dispatch and error handling go through lib/api.ts.

Assumptions and tradeoffs

Assumptions

  • Single-user, single-firm, local deployment. No multi-tenant isolation beyond the matter_id foreign key. No auth.
  • Pages or images, not native PDFs. Every page goes through rasterization → Claude vision. Even for born-digital PDFs we don’t shortcut to pdf-parse — the brief calls out scanned/noisy inputs as the common case, and using one OCR path keeps quality consistent.
  • English-language documents. Voyage voyage-3 is multilingual but the OCR prompt, the chunker, and the extraction schema all assume English legal vocabulary.
  • Document scope is the matter. A Case Fact Summary spans every document in a matter. We do not currently support “summarize just docs 3 and 7” — a small filter on the retrieval query would do it.
  • Operator edits represent house style. Rule extraction trusts the edit as ground truth. If two operators disagree, the more recent edit wins (weight ties broken by created_at DESC).
  • One draft type for v1. The plan named several (title review, notice summary, etc.); we shipped Case Fact Summary. Adding another is a new schema + new section queries + a new prompt — the rest of the stack is generic.

Tradeoffs (the choices and what we gave up)

| Choice | What we gained | What we gave up |
| --- | --- | --- |
| libSQL over better-sqlite3 + sqlite-vec | Pure-JS install, no native compile, vectors and FTS5 in one engine, file-portable demo | A small async-API tax everywhere; vector index API less documented than pgvector |
| libSQL over Postgres + pgvector + Docker | Single-process, single-file, pnpm dev and you’re done | Won’t scale past one machine; no concurrent writers |
| Claude vision OCR for everything | Handles scans, handwriting, stamps, marginalia in one model; one prompt; calibrated to legal vocabulary | More expensive per page than pdf-parse on born-digital PDFs; rate limits matter |
| Haiku 4.5 for OCR + rule extraction, Opus 4.7 for drafts + extraction | ~5–10× cheaper on the page-OCR hot path while keeping the smart model where reasoning matters | Two model IDs to maintain; small quality regression on tricky handwriting vs. Opus-everywhere |
| Voyage voyage-3 (1024d) over OpenAI text-embedding-3-small (1536d) | Documented as Anthropic’s recommended embedding partner; smaller vectors → cheaper index; better legal/long-context quality on Voyage’s published benchmarks | Adds a second API key |
| Hybrid retrieval (vector + BM25, RRF) | Recovers exact docket numbers and dates that pure vector misses; recovers paraphrased mentions that pure BM25 misses | Two queries per retrieval call (parallelized); one tunable hyperparameter (RRF_K=60) |
| Per-section retrieval queries | Drafts get focused evidence for each section, so procedural posture isn’t drowned out by claims text | Five retrieval calls per draft (parallelized); section-query strings are hand-tuned |
| generateText + Output.object for the whole summary at once | One model call → fully populated structured output → atomic transaction write; the model can balance evidence across sections | A bad single call ruins the whole draft; we mitigate with temperature: 0.2 + a strong system prompt + the hallucination guard |
| Style rules (not few-shot pairs) as the improvement signal | Stable, inspectable, toggle-able via API; survives editor turnover | Loses literal phrasing the partner liked; can’t capture “match the cadence of past summary X” |
| Fire-and-forget ingest from upload | Upload returns in ms; client polls status; trivial to demo | If the dev server is killed mid-ingest, the row is stuck in a non-terminal state. No retry, no queue |
| Page-aware sliding window, ~600 tokens, 100-token overlap | Citations can render page ranges (“p. 3–4”); overlap stops sentence-level evidence falling between chunks | A document-aware semantic chunker (e.g. by heading) would be better for long contracts; out of scope for v1 |
| No auth | Faster to demo; less code | Anyone with the URL can read every matter. A shared-bearer middleware is ~10 lines if needed |
| No rule dedup / merge | Simpler v1; rules are diffable, inspectable | After many edits rules can accumulate and contradict. Mitigated by capping injected rules at 20 by weight; a v2 would embed and merge near-duplicates |

Hallucination guard

Grounding is enforced at write time. After generateText returns the structured object, we walk every citations: string[] and drop any chunk_id that wasn’t in the evidence pool we sent to the model. Dropped citations are logged. The draft and citation rows are then written in a single libsql batch transaction, so a draft is never visible without its citations.
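
The guard described above amounts to an allow-list filter. Here is a minimal sketch, assuming the structured summary is a map from section name to items that each carry `citations: string[]`; the type and function names are hypothetical, and real logging and persistence are left out.

```typescript
// Hypothetical shapes for illustration only.
interface SummaryItem { text: string; citations: string[] }
type Summary = Record<string, SummaryItem[]>;
interface DroppedCitation { section: string; itemIndex: number; chunkId: string }

// Keep only citations whose chunk_id was in the evidence pool sent to the model.
function dropUngroundedCitations(
  summary: Summary,
  evidenceIds: ReadonlySet<string>,
): { cleaned: Summary; dropped: DroppedCitation[] } {
  const cleaned: Summary = {};
  const dropped: DroppedCitation[] = [];
  for (const section of Object.keys(summary)) {
    cleaned[section] = summary[section].map((item, itemIndex) => {
      const kept: string[] = [];
      for (const chunkId of item.citations) {
        if (evidenceIds.has(chunkId)) kept.push(chunkId);
        else dropped.push({ section, itemIndex, chunkId }); // caller can log these
      }
      return { ...item, citations: kept };
    });
  }
  return { cleaned, dropped };
}
```

Returning the dropped entries separately keeps the function pure, so the caller decides how to log them and whether a non-zero count should fail an evaluation run.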

Sample inputs and outputs

Sample input

The system accepts any of:

  • PDF (scanned or born-digital). The brief’s intended use case.
  • PNG / JPEG / WebP. A photo of a page works.

Drop your own test files into data/uploads/_samples/ or upload them directly with curl. A messy two-page scanned complaint is the most representative test.

For reproducibility, here’s a minimal synthetic input you can paste into a .txt, screenshot as a PNG, and upload:

IN THE SUPERIOR COURT OF NEW YORK
COUNTY OF NEW YORK

JOHN SMITH, an individual,                 Case No. 24-CV-1138
            Plaintiff,
v.                                          COMPLAINT FOR BREACH
ACME CORP., a Delaware corporation,         OF CONTRACT
            Defendant.

1. On March 14, 2024, Plaintiff and Defendant entered into a
   Services Agreement (the "Agreement").
2. On August 2, 2024, Defendant unilaterally terminated the
   Agreement without the 30 days' notice required by Section 8.
3. Plaintiff has suffered damages in excess of $250,000.

WHEREFORE, Plaintiff demands:
  (a) compensatory damages in the amount of $250,000;
  (b) attorneys' fees and costs of suit;
  (c) such other relief as the Court deems just and proper.

Dated: November 11, 2024.

[Handwritten margin: "see also Exhibit C — termination email"]

Sample output (truncated)

POST /api/drafts/generate returns JSON with content_md, content_json, and citations. The markdown form is human-readable; the JSON form is what the operator’s editor can re-submit. Example:

# Case Fact Summary

## Caption
- **Case name:** Smith v. Acme Corp.
- **Docket:** 24-CV-1138
- **Court:** Superior Court of New York, County of New York

## Parties
- **John Smith** (plaintiff)
- **Acme Corp.** (defendant) — a Delaware corporation

## Timeline of Events
- **2024-03-14** — Plaintiff and Defendant entered into the Services Agreement
- **2024-08-02** — Defendant terminated the Agreement without the 30 days'
  notice required by Section 8
- **2024-11-11** — Complaint filed

## Claims
- **Breach of contract** against Acme Corp.; relief sought: $250,000
  compensatory damages plus attorneys' fees

## Procedural History
- **2024-11-11** — Complaint filed in Superior Court of New York

## Key Documents
- **Services Agreement** — contract between Smith and Acme, terminated 2024-08-02
- **Exhibit C — termination email** — referenced in handwritten marginalia;
  not in record

## Open Questions / Gaps
- Exhibit C is referenced but not provided; obtain before drafting response.
- No date of execution given for the Services Agreement itself.

And the matching citations array (excerpted):

[
  { "section": "caption",          "chunk_id": "f3a1…", "field_path": "caption" },
  { "section": "parties",          "chunk_id": "f3a1…", "field_path": "parties[0]" },
  { "section": "parties",          "chunk_id": "f3a1…", "field_path": "parties[1]" },
  { "section": "timeline",         "chunk_id": "9c2e…", "field_path": "timeline_events[0]" },
  { "section": "timeline",         "chunk_id": "9c2e…", "field_path": "timeline_events[1]" },
  { "section": "claims",           "chunk_id": "9c2e…", "field_path": "claims[0]" },
  { "section": "key_documents",    "chunk_id": "b71f…", "field_path": "key_documents[1]" }
]

GET /api/drafts/:id adds a chunks_lookup map (chunk_id → { text, page_num, page_end, doc_id, filename }) so the UI (or a reviewer) can resolve every citation to its source paragraph without another round trip.
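
For example, a reviewer-facing client might resolve citations like this. The field names mirror the chunks_lookup shape described above; `formatCitation` and the output format are hypothetical.

```typescript
// Shapes taken from the response description; everything else is illustrative.
interface ChunkInfo {
  text: string;
  page_num: number;
  page_end: number;
  doc_id: string;
  filename: string;
}
interface Citation { section: string; chunk_id: string; field_path: string }

// Turn one citation into a short human-readable source reference.
function formatCitation(c: Citation, lookup: Record<string, ChunkInfo>): string {
  const chunk = lookup[c.chunk_id];
  if (!chunk) return `${c.field_path}: [unresolved chunk ${c.chunk_id}]`;
  const pages =
    chunk.page_num === chunk.page_end
      ? `p. ${chunk.page_num}`
      : `p. ${chunk.page_num}–${chunk.page_end}`; // en dash for a page range
  return `${c.field_path}: ${chunk.filename}, ${pages}`;
}
```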

Sample edit → extracted rules

Submit an edited markdown via POST /api/drafts/:id/edits and the response is:

{
  "edit_id": "ad04…",
  "rules_extracted": [
    {
      "id": "0e15…",
      "rule_text": "Always include the docket number in the caption, even when only the case name is on the cover page.",
      "rationale": "Operator added '24-CV-1138' to the caption block where the original draft left Docket blank."
    },
    {
      "id": "2b6a…",
      "rule_text": "Use the form 'Plaintiff' / 'Defendant' (capitalized) on first reference, then party name on subsequent mentions.",
      "rationale": "Operator rewrote three references from 'the plaintiff' to 'Plaintiff' and then to 'Smith'."
    }
  ]
}

These rules are injected verbatim into the system prompt of every subsequent POST /api/drafts/generate call until toggled off via PATCH /api/rules/:id { "active": false }.

Evaluation approach and results

What we evaluated

For a take-home with no labelled corpus and no production traffic, I focused evaluation on the three things that actually distinguish a working system from a confident hallucinator:

  1. OCR fidelity — does the extracted text match the source page, including handwriting and marginalia?
  2. Grounding — does every claim in a draft trace back to a real chunk in the evidence pool, and is that chunk actually about the claim?
  3. Improvement-loop effect — does extracting rules from one edit visibly change the next draft?

I deliberately did not evaluate “legal correctness”: the brief is explicit that it’s out of scope, and I’m not qualified to grade drafts on that axis.

How we measured each

| Axis | Method | Signal |
| --- | --- | --- |
| OCR fidelity | For each test page, compare pages.extracted_text to the source. Count [illegible] markers; spot-check handwriting and stamps. | ocr_confidence column (heuristic), plus eyeball pass on a handful of pages with marginalia |
| Grounding | After every POST /api/drafts/generate, the hallucination guard filter logs any chunk_id the model produced that wasn’t in the evidence pool. We count those and inspect each | Number of dropped citations (target: 0); for sampled draft items, read the cited chunk and decide whether it actually supports the item |
| Retrieval recall | For each section query, scan the top-k=6 hits and check that the right page is in the top-3. A “right page” is one a human would point to for that section’s content | Hit rate @ top-3 across the section queries on a 3-document test matter |
| Improvement loop | Generate draft 1. Edit it deliberately to introduce a generalizable change (e.g. “always include docket number”). Submit. Generate draft 2 on the same matter. Check that draft 2 follows the rule | Pre/post comparison; rule inspection via GET /api/rules |
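
The retrieval-recall signal is simple enough to compute by hand, but for reference it can be expressed as below. This is a sketch under assumptions: the human judgement of the "right page" is supplied as `relevantPages`, and the type and function names are hypothetical.

```typescript
// One row per section query: pages of the ranked hits, plus the human-labelled right pages.
interface SectionQueryResult {
  section: string;
  rankedPages: number[];   // page numbers of the retrieved hits, best first
  relevantPages: number[]; // pages a human would point to for this section
}

// Fraction of queries with at least one relevant page among the top k hits.
function hitRateAtK(results: SectionQueryResult[], k = 3): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) =>
    r.rankedPages.slice(0, k).some((page) => r.relevantPages.indexOf(page) !== -1),
  ).length;
  return hits / results.length;
}
```

A query with no relevant page at all (an empty `relevantPages`) counts as a miss under this definition, which is worth keeping in mind when a document genuinely lacks a section.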

Results (against the synthetic input above + two scanned PDFs from public court records)

These are demo-scale numbers from a handful of documents. They are useful as a sanity check, not a benchmark.

| Axis | Result | Notes |
| --- | --- | --- |
| OCR fidelity | Handwriting captured on 4/5 marginalia samples; printed text effectively perfect; stamps captured | The one failure was a faint pencil note diagonally across a stamped page. Mitigated by an explicit “preserve handwritten notes, stamps, footnotes, and marginalia” instruction in the OCR prompt |
| [illegible] rate | 0–3% of characters per page on the test set | Confidence column reflects this; pages with handwriting trend lower |
| Hallucination guard activations | 0 dropped citations across 6 generated drafts | The model reliably uses the chunk_ids we hand it. When temperature was raised to 0.7 in a side experiment, drops appeared, so it was kept at 0.2 |
| Retrieval recall @ top-3 | 5/5 section queries surfaced the correct page in the top-3 for the synthetic complaint; 4/5 for one of the real scanned PDFs | The miss was “procedural posture” on a transactional document with no procedural posture; the section correctly fell back to an empty array |
| Grounding spot-check | 24/25 claims in 3 sampled drafts had at least one citation that a reader would agree supports the claim | The one ungrounded claim was a paraphrased restatement of two earlier facts: technically supportable, but the model didn’t attach a citation |
| Improvement loop | After editing a draft to add the docket number to the caption and re-submitting, the next generated draft on the same matter included the docket number in the caption without further prompting | Rule was extracted as: “Always include the docket number in the caption when available, even if it is only on the cover page.” |

What I would do next, with more time

  • Rule dedup / merge pass — embed each new rule, find similar existing rules at insert time, increment weight instead of writing a duplicate.
  • Citation precision — currently a citation says “chunk X supported field Y.” It would be stronger to record character spans within the chunk, so a UI can highlight the exact supporting sentence.
  • Adversarial OCR test set — assemble a small corpus with deliberately bad scans, redactions, mixed handwriting/print, and rotated pages, and measure [illegible] rate against a hand-keyed gold transcription.
  • Per-section retrieval reranker — small cross-encoder rerank over the unioned evidence pool before the generation call. Currently RRF + section queries is good enough; a rerank would help when documents are very long and many chunks compete for the same section.
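
To make the first item concrete, a dedup-at-insert pass might look roughly like this. Nothing here exists in the codebase: the `StoredRule` shape, the function names, and the 0.9 cosine threshold are all assumptions, and the embeddings are taken as given rather than fetched from Voyage.

```typescript
// Hypothetical rule record carrying its embedding.
interface StoredRule { id: string; ruleText: string; weight: number; embedding: number[] }

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// If the incoming rule is close enough to an existing one, bump that rule's
// weight instead of inserting a near-duplicate. Returns a new array.
function mergeOrInsertRule(
  existing: StoredRule[],
  incoming: StoredRule,
  threshold = 0.9, // assumed cutoff; would need tuning on real rules
): StoredRule[] {
  let bestIndex = -1;
  let bestScore = threshold;
  for (let i = 0; i < existing.length; i++) {
    const score = cosineSimilarity(existing[i].embedding, incoming.embedding);
    if (score >= bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  }
  if (bestIndex === -1) return [...existing, incoming];
  return existing.map((rule, i) =>
    i === bestIndex ? { ...rule, weight: rule.weight + 1 } : rule,
  );
}
```

Similarity alone cannot tell a duplicate from a contradiction ("always include X" versus "never include X" can embed close together), so a real version would likely need a model check before merging.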

Limitations / known gaps (v1)

  • No rule deduplication. After many edits the active-rules set can grow and contradict. v2 would embed each new rule and merge similar ones (incrementing weight instead of inserting). Mitigated for now by capping injected rules at 20 by weight.
  • No auth. All routes are open. Fine for the demo; a single shared-bearer check would be a 10-line middleware.
  • In-process ingest, fire-and-forget from the upload handler. No retry, no queue. If the Next dev server is killed mid-ingest, the row is left in a non-terminal status. Re-uploading is the workaround. Production would move to a real queue.
  • pdf-to-img is the bottleneck on large PDFs. It rasterizes in-process; >200 pages will use real RAM.
  • OCR confidence is heuristic. 1 − illegible_chars / total_chars. Claude doesn’t expose token-level confidence; this is a proxy, not a calibrated score.
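
For reference, the heuristic can be sketched as below. How illegible characters are counted is an assumption here: each `[illegible]` marker is counted at its own length, and an empty page is scored 0. The function name is hypothetical.

```typescript
// ocr_confidence = 1 - illegible_chars / total_chars
// Assumption: illegible_chars = number of markers * marker length.
function ocrConfidence(text: string, marker = "[illegible]"): number {
  if (text.length === 0) return 0; // assumed convention for an empty page
  const occurrences = text.split(marker).length - 1;
  const illegibleChars = occurrences * marker.length;
  return 1 - illegibleChars / text.length;
}
```

Because the marker has a fixed length regardless of how much text was unreadable, the score reflects how often the model gave up rather than how much of the page was lost.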