Priyanshu Mahey.

Anthill

A collaborative research-paper editor where AI agents edit the document directly over a shared Yjs CRDT, with citations grounded against the actual cited PDF.

Hero demo: the live editor on a draft titled "Retrieval-Augmented Generation for Code", with two humans (PM, SC) and one agent (AI) in the avatar stack. A §3 Method paragraph carries verified citation badges ([arXiv:2401.07983] 81%, [arXiv:2305.06983] 73%), a ghost-text suggestion ("We follow the dense-retrieval line but swap the encoder for an instruction-tuned model.") waits on Tab, and the status bar shows the document ID (003cb3da), Yjs sync, and baseRevision v42a3. Built with Plate · @platejs/yjs.
Timeline: April 2026 · weekend hackathon
Stack: Next.js · Plate · Yjs · Hocuspocus · FastAPI · ChromaDB · llama.cpp · Supabase
Focus: Multi-agent collaboration · Citation grounding · CRDTs · Reviewer-driven revision
Partners: Nia (cited Q&A) · AgentMail (review inbox)
GitHub repository: priyanshumahey/anthill
Languages: TypeScript 62% · Python 28% · TSX 8% · Other 2%

MOTIVATION

Why put AI inside the document?

Writing a research paper is mostly context-switching, and the tools we use treat AI like a sidebar.

A normal revision loop looks like this. Draft a paragraph. Tab away to find prior work. Tab away again to format the citation. Tab away a third time to read the reviewer's email and figure out which paragraph it's actually about. By the time you get back to the document, you've forgotten what you were trying to say. Most of the published research about "AI for writing" has measured the wrong thing: it has measured the quality of the suggestion, not the cost of leaving the page to ask for one.

Today's AI writing tools all share the same shape. A chat panel sits next to the document and talks about it without ever touching it. The chat suggests, the user copies, the user pastes, the user formats. The chrome is different but the cost is the same.

Anthill is a research-paper editor that puts the AI on the same surface as the human. It is built on Plate over a Yjs CRDT, and it exposes a small HTTP bridge that lets agents read the live document and write back into it as Yjs transactions. The browser sees agent edits the same way it sees a collaborator's edits: they appear in real time, with an avatar in the corner attributing the change. There is no copy, no paste, and no sidebar. Citations, reviewer feedback, and literature search all happen on the same page as the prose.

It was built over a weekend at the Anthropic × Nia × AgentMail hackathon, and the post below walks the system end-to-end: the embedding store, the agent bridge, the auto-cite plugin, Nia-grounded verification, and the review-response agent.


OVERVIEW

The four pieces

A Plate editor in the browser, a Bun process holding the CRDT, a FastAPI backend behind a dev tunnel, and one shared Postgres.

Editor (web)

A Next.js + Plate editor with a citation-suggest plugin, a five-state citation badge, an agents panel, and a "connect agent" dialog. Live edits flow over a WebSocket; agent runs stream over Server-Sent Events. The browser never talks to the FastAPI backend or to Nia directly.

Collab (collab/)

One Bun process exposing two ports. Hocuspocus on the WebSocket handles Yjs sync and presence. The agent bridge on :8889 accepts HTTP edit ops and writes them into the same Y.Doc via openDirectConnection. That colocation is what makes "agents and humans share one CRDT" cheap.

Backend (backend/)

FastAPI on a laptop, exposed via dev tunnel. Hosts the Harrier-embedded local arXiv corpus, the /search and /embed endpoints, and an agent runner that spawns one of eight named agents per call and streams events back through a per-run SSE channel.

Architecture: the browser runs the Next.js web app (Plate editor with @platejs/yjs presence and plugins; citation-suggest at 1.2 s debounce and top-k = 5; agents panel, run forms, and connect-agent dialog; the five-state citation badge; API routes for /citations and /agents/runs with SSE; a Supabase client for documents, yjs_state, and plain_text). It talks to the Bun collab process (Hocuspocus on :1234 for Yjs sync; the agent bridge on :8889 applying Y.transact edits over openDirectConnection) and, behind X-Anthill-Secret, to the FastAPI backend (/search and /embed on Harrier 270M + Chroma in-process; /agents/runs with eight agents and one SSE channel per run), which calls OpenAI (planner), Nia v2 (cited Q&A), Claude Sonnet 4, and AgentMail (inbox). Everything persists to Supabase Postgres (yjs_state, content, plain_text).
Anthill: humans and agents share one Yjs CRDT.
Edges: WebSocket (Yjs) · auto-cite · agent run (SSE) · bridge edit (Y.transact) · persistence
Three processes, two HTTP secrets, one Postgres, and one shared CRDT. Agents and humans converge on the same Y.Doc.
CORPUS

A thousand papers, embedded locally

The retrieval substrate is ~1,000 cs.AI papers, embedded with Harrier-OSS-v1 270M through llama.cpp.

At hackathon scale you embed once and query a thousand times, and the GPU sits right under the desk. So the embedding step lives in the same FastAPI process that serves search, and the index is a single Chroma collection on disk. The chunker walks each PDF with PyMuPDF, slides a 512-word window with 64-word overlap, and writes (arxiv_id, chunk_index, char_start, char_end) alongside every vector. Storing char offsets per chunk costs nothing during ingest and pays for itself the first time someone wants a citation badge that deep-links to the source PDF.
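The windowing logic is simple enough to sketch. The real chunker is Python with PyMuPDF; the TypeScript below only illustrates the window-and-offset bookkeeping, and every name in it is made up for the sketch.

interface Chunk {
  arxivId: string;
  chunkIndex: number;
  charStart: number;
  charEnd: number;
  text: string;
}

// Slide a word window across the extracted text, remembering char offsets
// so every chunk can deep-link back into the source PDF text.
function chunkPaperText(arxivId: string, fullText: string,
                        windowWords = 512, overlapWords = 64): Chunk[] {
  const words: { word: string; start: number }[] = [];
  const re = /\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(fullText)) !== null) words.push({ word: m[0], start: m.index });

  const chunks: Chunk[] = [];
  const step = windowWords - overlapWords;   // 448-word stride
  for (let i = 0, idx = 0; i < words.length; i += step, idx++) {
    const slice = words.slice(i, i + windowWords);
    const last = slice[slice.length - 1];
    const charStart = slice[0].start;
    const charEnd = last.start + last.word.length;
    chunks.push({ arxivId, chunkIndex: idx, charStart, charEnd,
                  text: fullText.slice(charStart, charEnd) });
    if (i + windowWords >= words.length) break;   // final window reached the end
  }
  return chunks;
}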

Pipeline: at index time, the arXiv API (cs.AI / cs.IR) and OpenAlex supply PDFs and metadata; PyMuPDF extracts text with char offsets; a 512-word window with 64-word overlap feeds Harrier 270M one chunk at a time; each L2-normalized vector lands in the ChromaDB papers collection with (id, chunk, span). At query time, the user query gets the "Instruct: … Query:" prefix, is embedded by the same Harrier instance, and queries the same collection.
Local cs.AI corpus · ~1k papers · embed once, query a thousand times
PDF in, L2-normalized vector out. The same Harrier instance handles both index-time and query-time embeddings.

Two implementation gotchas were worth more time than the rest of the pipeline combined. Harrier expects an instruction prefix on every query ("Instruct: Given a scientific query, retrieve relevant paper passages\nQuery: "); without it the cosine scores collapse because the query vectors land in a different region of the space than the indexed chunks. And llama.cpp's batched embedding path errors on this model, so the script falls back to one create_embedding(text) call per chunk. Slower, reliable, and worth the warm-up time on the first /search request.
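A sketch of that asymmetric contract, with embedOne standing in for whatever single-text embedding call the backend makes (the real path is one llama.cpp create_embedding call per chunk, in Python); the helper names here are assumptions.

declare function embedOne(text: string): Promise<number[]>;   // one call per text, no batching

const QUERY_PREFIX =
  "Instruct: Given a scientific query, retrieve relevant paper passages\nQuery: ";

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1;
  return v.map((x) => x / norm);
}

// Index time: chunks are embedded raw.
async function embedChunk(chunkText: string): Promise<number[]> {
  return l2Normalize(await embedOne(chunkText));
}

// Query time: same model, but the instruction prefix is mandatory; without it
// the query vectors land in a different region than the indexed chunks.
async function embedQuery(query: string): Promise<number[]> {
  return l2Normalize(await embedOne(QUERY_PREFIX + query));
}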


AUTO-CITE

Type, pause, Tab

The smallest moving part in the system, and the one that makes the editor feel alive.

The Plate plugin in citation-suggest-kit.tsx tracks the active block in module-scope refs (no React re-renders on every keystroke). After 1.2 seconds of inactivity on a paragraph with at least 30 characters, the block text is sent to the FastAPI /search endpoint via a Next.js proxy that holds the shared secret. The top hit becomes a ghost-text pill anchored at the caret. Tab inserts it as a Plate citation inline element carrying the full search trace. Esc dismisses, and the plugin remembers the dismissed text in a Map<blockId, lastQueriedText> so it doesn't re-fire on the next keystroke that lands the cursor back on the block.
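The core of that loop fits in a few lines. A sketch of the debounce-and-drop-stale pattern, with searchCitations and showGhostPill as stand-ins for the proxied /search call and the ghost-pill UI; both names, and the hit shape, are assumptions.

interface SearchHit { arxivId: string; chunkIndex: number; score: number; title: string }

declare function searchCitations(text: string): Promise<SearchHit[]>;   // Next.js proxy to /search
declare function showGhostPill(hit: SearchHit | undefined): void;

const refs = { blockId: "", blockText: "" };            // module scope: no re-render per keystroke
const lastQueried = new Map<string, string>();          // blockId -> last text queried or dismissed
let timer: ReturnType<typeof setTimeout> | null = null;

function onBlockChanged(blockId: string, blockText: string) {
  refs.blockId = blockId;
  refs.blockText = blockText;
  if (timer) clearTimeout(timer);
  if (blockText.length < 30) return;                    // minChars
  if (lastQueried.get(blockId) === blockText) return;   // unchanged or already dismissed
  timer = setTimeout(async () => {
    lastQueried.set(blockId, blockText);
    const hits = await searchCitations(blockText);
    // Drop the result if the user kept typing or moved away while the request was in flight.
    if (refs.blockId !== blockId || refs.blockText !== blockText) return;
    showGhostPill(hits[0]);
  }, 1200);                                             // debounceMs
}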

Interactive demo: a "3 · Method" block counting characters toward the 30-character minimum before the suggester fires, alongside the plugin defaults:
debounceMs 1200 · minChars 30 · topK 5 · minScore 0.55 · scoreGap 0.08
Live config, lifted from DEFAULT_OPTIONS in the plugin source.

Three small behaviors took most of the time. The plugin re-checks refs.blockId and refs.blockText when the response comes back, so if the user kept typing while the request was in flight, the result is dropped and no ghost pill appears. The same memoization map prevents re-querying an unchanged paragraph after every cursor move. And when the top-k contains several closely-tied scores (top.score - score ≤ 0.08), Tab inserts up to maxInsert = 3 badges instead of one, because most paragraphs cite one paper but the ones that need a cluster really need a cluster.
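The tie-cluster rule is the one piece with logic worth writing down. A sketch, using the plugin defaults above; the hit shape is an assumption.

interface RankedHit { arxivId: string; chunkIndex: number; score: number; title: string }

// Insert one citation normally, up to maxInsert when the top of the ranking is effectively tied.
function pickCitations(hits: RankedHit[], minScore = 0.55,
                       scoreGap = 0.08, maxInsert = 3): RankedHit[] {
  const top = hits[0];
  if (!top || top.score < minScore) return [];
  return hits.filter((h) => top.score - h.score <= scoreGap).slice(0, maxInsert);
}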

The inserted citation node carries the entire search trace (query, latency, top-k candidates), so clicking the badge opens a popover that shows why the agent picked this paper. That trace is the affordance that makes auto-cite trustworthy enough to leave on while you write.


VERIFICATION

Five badge states

Embeddings are good for "probably relevant" and bad for "actually supports the claim". That's where Nia comes in.

The moment a citation is accepted, the editor fires a ground_citation agent run with the inserted node's identity and the surrounding paragraph as a claim. Nia's document/agent endpoint reads the actual cited PDF and returns a structured verdict against a JSON schema we hand it. The browser is listening on the agent's SSE stream; when a finding event arrives with kind: 'grounded_citation', the verification driver finds the matching node by (arxivId, chunkIndex, searchedAt) (stable across Yjs reorderings) and merges the verdict into its verification field. The badge re-renders in place.
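A sketch of that merge step; the node and event shapes here are assumptions based on the description, not the real types.

interface Verdict { supports_claim: boolean; exact_quote: string; confidence: number; rationale: string }
interface CitationNode { arxivId: string; chunkIndex: number; searchedAt: string; verification?: Verdict }
interface GroundedFinding {
  kind: string;
  target: { arxivId: string; chunkIndex: number; searchedAt: string };
  verdict: Verdict;
}

function onFinding(event: GroundedFinding, nodes: CitationNode[]) {
  if (event.kind !== "grounded_citation") return;
  // Match on identity fields that survive Yjs reorderings, not on position.
  const node = nodes.find((n) =>
    n.arxivId === event.target.arxivId &&
    n.chunkIndex === event.target.chunkIndex &&
    n.searchedAt === event.target.searchedAt);
  if (node) node.verification = { ...node.verification, ...event.verdict };
}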

Badge demo: on the sentence "…grounding generation in dense retrieval [arXiv:2401.07983] 81%", the badge shows SUPPORTS (91% confidence · Nia · 4180 ms), the cited paper's title ("Retrieval-Augmented Generation Reduces Hallucinations in Long-Tail Question Answering", arXiv:2401.07983 · chunk 4 · 81% match), the verbatim quote ("We find that retrieval-augmented generation reduces hallucination by 39% on the long-tail subset of TriviaQA-Web compared to the no-retrieval baseline."), its location (p. 7 · Methods > Architecture), and the rationale ("Quote directly compares the two baselines on the cited benchmark"), all produced by the ground_citation agent. Click a state to preview the other badges.
The five terminal states. not_ready exists because Nia silently hallucinates if you query a still-indexing source.

The schema we send is intentionally strict, so the agent has to answer the verification question rather than wax on. Page number and section path are nullable, confidence is bounded to [0, 1], and the exact quote is required, with the contract that it must be verbatim, not paraphrased.

Schema we send
POST /v2/document/agent · structured_output
{
  "type": "object",
  "properties": {
    "supports_claim": {
      "type": "boolean",
      "description": "True if the cited paper directly supports..."
    },
    "exact_quote": {
      "type": "string",
      "description": "Verbatim sentence — no paraphrase. Empty if none."
    },
    "page_number":   { "type": ["integer", "null"] },
    "section_path":  { "type": ["string", "null"] },
    "confidence":    { "type": "number", "minimum": 0, "maximum": 1 },
    "rationale":     { "type": "string" }
  },
  "required": ["supports_claim", "exact_quote",
               "confidence", "rationale"]
}
Verdict that comes back
merged into citation node's verification field
{
  "supports_claim": true,
  "exact_quote": "We find that retrieval-augmented generation reduces hallucination by 39% on the long-tail subset of TriviaQA-Web compared to the no-retrieval baseline.",
  "page_number": 7,
  "section_path": "Methods > Architecture",
  "confidence": 0.91,
  "rationale": "Quote directly compares the two baselines on the cited benchmark."
}
War stories from nia_client.py
claude-opus-4-7 → 502
Default Nia model returns 'temperature is deprecated for this model'. Pin claude-sonnet-4-20250514.
Source not ready
document/agent happily accepts a still-indexing source and hallucinates with 0 citations + 0 confidence. We raise NiaSourceNotReady on that exact signature.
POST /sources duplicates
Per-user row created every time, even when Nia has the paper globally. Always GET /sources first to dedup; cache source_id in SQLite.
haiku model 404
claude-haiku-35-20241022 from the docs returns 404. Sonnet stays.
Schema in, structured verdict out, plus the war stories from nia_client.py.
Sequence: the editor POSTs { agent: ground_citation, input: { arxiv_id, claim } } to /api/agents/runs (Next.js), gets a run_id, and listens on SSE at /runs/:id/events. The agent checks the SQLite nia_cache for the arxiv_id, GETs /sources on Nia v2 to dedup (skipped when cached), POSTs /document/agent with the json_schema, gets back { supports, quote, page, conf }, emits an SSE finding with kind=grounded_citation, and the editor patches the citation node's verification field.
The verification round-trip. Cache the source id in SQLite and Nia stays cheap on reloads.
BRIDGE

How agents share the CRDT

The naive way to put an agent in a doc is to give it a tool that calls setValue(plate_value). That nukes anyone else's in-flight edits. The Anthill bridge does something more careful.

The bridge opens a Hocuspocus direct connection to the live document and mutates the Yjs fragment inside a Y.transact block, with the agent's identity stamped on the origin so undo and presence stay sane. It is a small Bun HTTP server with a deliberately small surface: discovery, snapshot, full state, edit, presence, and a dev-only repair endpoint. Edits are validated, applied as a single transaction, and broadcast through the same WebSocket every browser is already on.
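The transaction itself is plain Yjs. A sketch of the single-transaction, stamped-origin pattern: applyOp stands in for the bridge's op handlers, and the fragment name is an assumption.

import * as Y from "yjs";

declare function applyOp(fragment: Y.XmlFragment, op: unknown): void;

function applyEdit(doc: Y.Doc, ops: unknown[], agentName: string) {
  const fragment = doc.getXmlFragment("content");    // shared fragment Plate binds to (name assumed)
  Y.transact(
    doc,
    () => {
      for (const op of ops) applyOp(fragment, op);   // all ops land as one atomic Yjs update
    },
    `ai:${agentName}`                                // origin stamp keeps undo and attribution sane
  );
}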

Request
POST /documents/:id/edit
{
  "ops": [{
    "type": "appendInline",
    "ref": "b3",
    "element": {
      "type": "citation",
      "arxivId": "2305.06983",
      "chunkIndex": 4,
      "title": "CodeT5+",
      "score": 0.79,
      "children": [{ "text": "" }]
    }
  }]
}
How citation_inserter / insert_citation drop a citation badge into a block. Stamped with proofAuthor:ai:insert_citation.
Playground "before" state: the document "Retrieval-Augmented Generation for Code" with b1 (the "Method" heading), b2 (a paragraph ending in the [arXiv:2401.07983] citation), and b3 ("We extend this to code with an instruction-tuned encoder.").
Pick an op, see the request body and the document mutation. The 9 ops in the playground match collab/src/types.ts exactly.

Stable block refs

Every snapshot exposes blocks as b1, b2, ... by ordinal position. Agents reason about "the third paragraph", not about Slate paths that shift on every keystroke. Snapshots also carry inline children (citations, mentions) so the agent knows what would be lost on a destructive edit.
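A sketch of the snapshot shape agents reason over; the field names are assumptions, but the idea is ordinal refs plus enough inline context to know what a destructive edit would lose.

interface InlineElement { type: "citation" | "mention"; arxivId?: string; chunkIndex?: number; text?: string }

interface SnapshotBlock {
  ref: string;                 // "b1", "b2", ... by ordinal position, not a Slate path
  type: string;                // "h2", "p", ...
  text: string;
  inlines: InlineElement[];    // what would be lost on replaceBlock / deleteBlock
}

interface Snapshot {
  documentId: string;
  baseRevision: string;        // content-hashed token, required on POST /edit
  blocks: SnapshotBlock[];
}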

Optimistic locking

Every snapshot returns a baseRevision: a content-hashed token derived from the Yjs state. Pass it on POST /edit and the bridge refuses stale writes with 409 STALE_REVISION. That's how we avoid an agent stomping on a human typing in the same paragraph.

Idempotency keys

Every edit requires an Idempotency-Key. Same key with the same body returns the cached response, so a flaky network never double-applies. Same key with a different body returns 409 IDEMPOTENCY_KEY_REUSED_DIFFERENT_BODY, so a buggy agent can't quietly mutate history.
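Both guards amount to a handful of lines on the bridge side. A sketch, assuming an in-memory cache; currentRevision, hashBody, and applyOps are stand-ins, not the real helpers.

interface EditRequest { baseRevision: string; idempotencyKey: string; rawBody: string; ops: unknown[] }

const idempotencyCache = new Map<string, { bodyHash: string; response: unknown }>();

declare function currentRevision(): string;          // content-hashed token from the Yjs state
declare function hashBody(rawBody: string): string;
declare function applyOps(ops: unknown[]): unknown;

function handleEdit(req: EditRequest): { status: number; body: unknown } {
  const cached = idempotencyCache.get(req.idempotencyKey);
  if (cached) {
    if (cached.bodyHash !== hashBody(req.rawBody)) {
      return { status: 409, body: { code: "IDEMPOTENCY_KEY_REUSED_DIFFERENT_BODY" } };
    }
    return { status: 200, body: cached.response };   // replay: a retry never double-applies
  }
  if (req.baseRevision !== currentRevision()) {
    return { status: 409, body: { code: "STALE_REVISION" } };   // someone edited since the snapshot
  }
  const response = applyOps(req.ops);
  idempotencyCache.set(req.idempotencyKey, { bodyHash: hashBody(req.rawBody), response });
  return { status: 200, body: response };
}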

Carry-inlines on destructive edits

replaceBlock and setBlockText auto-reattach existing inline citations to the new text. deleteBlock refuses with 409 INLINE_ELEMENTS_WOULD_BE_LOST when the block carries inlines, unless the agent passes dropInlineElements: true. Agents reading text-only snapshots rarely know inlines exist; silently losing them would destroy the bibliography.

Sequence: a human types a paragraph (Yjs delta through Hocuspocus). The citation_inserter agent GETs /snapshot from the bridge on :8889, which holds a warm openDirectConnection handle to the Y.Doc and returns { blocks, baseRevision }. The agent embeds the block, picks top-k chunks, and POSTs /edit with an appendInline citation op plus Idempotency-Key and baseRevision. The bridge applies it in a Y.transact with origin ai:agent, the Yjs update broadcasts over the WebSocket, and the agent gets back { applied: 1, newRefs: [c1] }.
A representative round-trip: snapshot, embed, edit, broadcast.

Because the bridge is just HTTP and a documented op vocabulary, any agent can drive the document. The "Connect agent" dialog in the editor hands you a copy-paste prompt prefilled with this document's bridge URL, ID, headers, and op reference, ready to paste into Claude Code or ChatGPT. Within a minute of opening the dialog, you can have an external LLM rewriting your introduction in real time.

Connect an external agent
Paste this prompt into Claude Code, Copilot, ChatGPT, or any agent that can call HTTP. It's prefilled with this document's bridge URL and ID. Every change the agent makes appears here in real time.
Bridge URL: https://collab.anthill.app:8889 · Document ID: 003cb3da-9f17-4c44-9d3e-2a8e0f1b7e1c · Protocol: anthill-agent-bridge/1
The actual prompt. Copy, paste into Claude Code, watch the editor fill in.
PRESENCE

One CRDT, many cursors

Agents publish to the same Yjs awareness map every human is on. The editor doesn't care what kind of client you are.

Hocuspocus is the only thing that owns presence. Browsers publish their cursors through the standard @platejs/yjs awareness binding. Agents publish through the bridge's POST /presence endpoint, which writes into the same awareness map from the bridge's openDirectConnection handle. The avatar stack in the top-right of the editor renders both the same way (a colored circle with initials, agents marked with a soft pulse ring).
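What an agent sends is small. A sketch of the presence call: the endpoint comes from the description above, but the path shape (assumed to parallel /edit), the secret header name, and every body field are assumptions.

async function publishAgentPresence(bridgeUrl: string, documentId: string, secret: string) {
  await fetch(`${bridgeUrl}/documents/${documentId}/presence`, {   // path assumed to parallel /edit
    method: "POST",
    headers: { "Content-Type": "application/json", "X-Anthill-Secret": secret },   // header name assumed
    body: JSON.stringify({
      name: "citation_inserter",
      kind: "agent",                 // rendered with the soft pulse ring in the avatar stack
      color: "#8b5cf6",
      cursor: { ref: "b3" },         // the block the agent is working on (field name assumed)
    }),
  });
}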

Presence demo: one client (you, PM) connected to "RAG for Code · §3 Method"; the Yjs awareness map has a single entry, your cursor. Who's connected:
The awareness map is the awareness map. Humans, agents, all in the same row.
SEARCH

Plan, discover, rank

Auto-cite handles paragraphs you've written. The literature-search agent handles the inverse: "I want to write about X, what's out there?"

The agent in literature_search.py is a planner-then-searcher with optional discovery. The planner uses gpt-4o-mini (cheap, fast, perfectly fine at this; the rest of the system uses Claude Sonnet 4 but the planner doesn't need it) to fan the topic out into 4 short sub-queries. Each sub-query gets its own Chroma top-k pass, and the results are merged best-per-paper so you don't see the same arXiv ID four times.
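The merge is the only non-obvious step: each sub-query returns its own top-k, and a paper keeps only its best-scoring chunk across all of them. A sketch (the real agent is Python; the shapes here are assumptions).

interface PaperHit { arxivId: string; chunkIndex: number; score: number; query: string }

function mergeBestPerPaper(perQueryHits: PaperHit[][]): PaperHit[] {
  const best = new Map<string, PaperHit>();
  for (const hits of perQueryHits) {
    for (const hit of hits) {
      const prev = best.get(hit.arxivId);
      if (!prev || hit.score > prev.score) best.set(hit.arxivId, hit);
    }
  }
  // One row per paper, best match first, so the same arXiv ID never shows up four times.
  return [...best.values()].sort((a, b) => b.score - a.score);
}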

Planner demo: the topic "retrieval-augmented generation for code" in the planning step (gpt-4o-mini, n=4).
Pick a topic, watch the planner expand. Same SSE shape as the live agent's plan_done event.

The interesting bit is discovery. If the user opts in, the agent also queries arXiv directly for fresh papers that aren't in the local Chroma yet. Each candidate is downloaded, chunked, embedded with Harrier, and inserted into the same collection during the run. By the time the search step runs, the new papers are queryable like everything else. Every step of the way (plan, discover, discover_ingest, search, rank, each finding) is published as an SSE event, so the agents panel renders the run live: planner queries appear, new papers download, the rankings update.
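From the browser's side, the whole run is just an event stream. A sketch of how the agents panel might consume it; the event names come from the run described above, the payload fields are assumptions.

type RunEvent =
  | { type: "plan_done"; queries: string[] }
  | { type: "discover"; candidates: { arxivId: string; title: string }[] }
  | { type: "discover_ingest"; arxivId: string; chunks: number }
  | { type: "search"; query: string; hits: number }
  | { type: "rank"; papers: { arxivId: string; score: number }[] }
  | { type: "finding"; kind: string; payload: unknown };

function subscribeToRun(runId: string, onEvent: (e: RunEvent) => void): () => void {
  const source = new EventSource(`/api/agents/runs/${runId}/events`);
  source.onmessage = (msg) => onEvent(JSON.parse(msg.data) as RunEvent);
  return () => source.close();   // caller closes the stream when the run view unmounts
}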

Run view: literature_search, running live as r_8c2f1a on "graph neural networks for code" (created Apr 26, 10:14:01 · plan_n=4 · discover_max=5 · k_per_query=8). Trace: 0 events, streaming. Findings: 0 papers, waiting for results. Planner: gpt-4o-mini · Embeddings: Harrier 270M · /api/agents/runs/r_8c2f1a/events.
A real run from start to finish, rendered the same way the dashboard renders it.
REVIEW

Reviewer email to tracked changes

A peer reviewer sends an email. Four tracked-change cards appear in the doc, anchored to the right paragraphs.

This is the AgentMail integration, and it is the feature that most concretely shows the value of the bridge: an external event becomes a document event with no human in the loop. The shipped UX is a paste-text form in the agents panel; the agent itself accepts both modes (paste-text or AgentMail by inbox_id+message_id), and the bridge end of the pipeline is identical either way.

The agent in review_response.py snapshots the doc, asks Claude Sonnet 4 to map the email into a structured action list ({kind, anchor_ref, replacement, rationale}), validates each anchor_ref against the snapshot (Claude doesn't get to invent block IDs), and posts each surviving action through the bridge as an addNote op with idempotency key review:{run_id}:{anchor_ref}:{kind}. Same key on retry, same response, no double-applied edits.
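A sketch of that validate-then-post step; the action shape mirrors the structured list above, postEdit stands in for the bridge call, and everything else is an assumption.

interface ReviewAction {
  kind: "suggestion" | "comment";
  anchor_ref: string;
  replacement?: string;
  rationale: string;
}

async function postReviewActions(
  runId: string,
  actions: ReviewAction[],
  snapshotRefs: Set<string>,                                    // block refs from GET /snapshot
  postEdit: (op: unknown, idempotencyKey: string) => Promise<void>,
) {
  for (const action of actions) {
    if (!snapshotRefs.has(action.anchor_ref)) continue;         // Claude doesn't get to invent block IDs
    await postEdit(
      { type: "addNote", ref: action.anchor_ref, kind: action.kind,
        text: action.replacement ?? "", rationale: action.rationale },
      `review:${runId}:${action.anchor_ref}:${action.kind}`,    // same key on retry, same response
    );
  }
}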

Mock review response
Skips AgentMail; runs Claude on the pasted text.
Backend calls claude-sonnet-4-20250514 with a JSON-schema system prompt. Each action posts an addNote op with idempotency key review:{run}:{anchor}:{kind}.
review_response · Succeeded · r_8c2f1a
4 actions queued · 3 suggestions · 1 comment
addNote · suggestion · b2: "Dense bi-encoders dominate the modern IR leaderboard; sparse baselines (BM25) remain a strong starting point but typically lag on multi-hop queries." (why: reviewer asked to separate dense vs sparse retrieval)
addNote · suggestion · b5: "attention(q,k) = softmax(q·kᵀ / √d)" (why: Eq. 4 missing the 1/√d normalization)
addNote · suggestion · b9: "…on the BEIR benchmark [Thakur 2021]." (why: citation 12 should be Thakur 2021, not Karpukhin 2020)
addNote · comment · b12: "Add a one-paragraph limitations note about how the encoder's 512-token window degrades on long-context retrieval (cite §6 of Beltagy 2020)." (why: reviewer asked for a limitations paragraph on long-context retrieval)
idempotency: review:r_8c2f1a:bN:[edit|comment] · Sonnet 4
Paste the email on the left, watch the run-detail page on the right fill in with accept/reject suggestions.

The crucial UX decision is that the agent never edits prose destructively. Every textual change goes in as an addNote with kind: 'suggestion', rendered in the editor as an accept/reject card anchored under the original block. It is the peer-review experience everyone already knows from Word, with an LLM as the proposer instead of a human.

Sequence: a reviewer sends a review email to the AgentMail inbox (anthill@…). watch_inbox.py polls list_messages every 5 s, sees the new message_id, and POSTs /agents/runs with the document and message. The review_response agent calls get_message, GETs /snapshot for the blocks, prompts Claude with the review plus the blocks, receives { actions: [edit b3, comment b7, …] }, POSTs /edit addNote once per action with an Idempotency-Key per anchor, and optionally replies to the reviewer with a summary.
The full round-trip from inbox to tracked change.
REFLECTION