Priyanshu Mahey.

Lookout

A distributed multi-camera video RAG system with on-device multimodal embeddings and peer-to-peer transport over QUIC.

[Architecture diagram: three followers on edge devices (cam-lab-1, cam-lab-2, cam-door, each running a webcam + Gemma on Cactus) connect over iroh QUIC (ALPN lookout/ingest/v1) to the leader's iroh endpoint; the leader holds ChromaDB with per-modality vectors, runs query/RAG with Gemma synthesis, and serves an axum HTTP API on :8080 (/api/cameras, /api/query); a React UI in the browser shows live tiles and chat, with lazy MP4 clip playback. Flows: ingest, query, playback.]
Distributed video RAG — three processes, one QUIC mesh
Timeline: April 2026 · 24h hackathon
Stack: Rust · iroh · Cactus · Gemma 4 · ChromaDB · React
Focus: On-Device AI · Multimodal RAG · P2P Systems
Recognition: Grand Prize · Best B2B · Best Technical Hack at Cactus × DeepMind × YC
GitHub Repository: priyanshumahey/voice-agents-hack
Languages: Rust 79.4% · TypeScript 17.6% · Shell 2% · Other 1%

MOTIVATION

Why build distributed video RAG?

Watching many cameras at once, without shipping a single raw frame to the cloud.

Surveillance, robotics, and "smart space" deployments increasingly involve many concurrent video feeds: several rooms, several robots, several cameras on a vehicle. Operators need to ask natural-language questions across all of them, like "when did the package get delivered?", "which camera last saw the red backpack?", or "did anyone enter the lab between 3 and 5pm?".

Today those questions get answered one of two bad ways: ship every raw frame to the cloud (expensive, privacy-hostile, bandwidth-bound), or stay single-camera and lose the cross-feed narrative. Neither works once you care about privacy, latency, or operating offline.

Lookout treats each camera as a small edge computer. It captures locally, embeds locally with Gemma 4 on Cactus, and only ships tiny embedding vectors (not raw video) over a peer-to-peer QUIC link to a central leader. The leader indexes everything into ChromaDB and answers natural-language queries over the combined index. Raw clips stay on the camera that recorded them and are only ever streamed on explicit request.

It was built in ~24 hours at the Cactus × DeepMind × YC hackathon, where it won Best B2B, Best Technical Hack, and the Grand Prize.


OVERVIEW

The three pieces

A follower lives on every camera. A leader lives on one server. A UI lives in a browser. Everything between them is QUIC.

Follower

A Rust binary that runs on each camera node. It captures webcam frames, slides a short window over them, calls Gemma 4 through Cactus to produce a single multimodal embedding per window, and pushes {embedding, caption, camera_id, timestamp} upstream over iroh. Raw video never leaves the device by default.
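
Concretely, the per-window payload looks roughly like this. A hedged sketch in Rust; the field names are pulled from the descriptions on this page, not the exact wire schema:

use serde::{Deserialize, Serialize};

/// One embedded window as shipped from follower to leader.
/// Illustrative shape only, not the actual wire schema.
#[derive(Debug, Clone, Serialize, Deserialize)]
struct Chunk {
    chunk_id: String,     // stable id, used for dedup on the leader
    camera_id: String,    // e.g. "cam-lab-1"
    start_ts_ms: u64,     // window start, unix millis
    end_ts_ms: u64,       // window end, unix millis
    embedding: Vec<f32>,  // L2-normalized multimodal vector from Gemma on Cactus
    caption: String,      // one-sentence caption from the same model
    jpeg: Vec<u8>,        // q60 JPEG of the middle frame, for thumbnails
}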

Leader

A Rust binary that accepts ingest connections from any number of followers over iroh QUIC, dedupes and persists chunks into ChromaDB, and serves a small HTTP API for the UI: GET /api/cameras, GET /api/live/:id, POST /api/query. Query answers are synthesized by Gemma 4 on Cactus, also running on-device.
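
That HTTP surface is small enough to sketch as an axum router. Only the route paths come from the description above; the handlers are stubbed out, and the path syntax assumes axum 0.7-style parameters:

use axum::{routing::{get, post}, Router};

// Sketch of the leader's HTTP surface with stub handlers; the real handlers
// hit the camera registry, the follower's QUIC stream, and ChromaDB + Gemma.
fn api_router() -> Router {
    Router::new()
        .route("/api/cameras", get(|| async { "[]" }))                 // list connected followers
        .route("/api/live/:id", get(|| async { "one JPEG frame" }))    // proxy a frame from that follower
        .route("/api/query", post(|| async { "answer + citations" }))  // RAG over the combined index
}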

UI

A React + Vite dashboard that shows every connected follower as a live tile, lets the operator ask questions in natural language, and plays back raw clips fetched lazily from the originating camera. It talks only to the leader, so it never has to know anything about the P2P mesh underneath.


ARCHITECTURE

System anatomy

The whole system is three processes that could each live on a different continent. Followers dial the leader over iroh, an authenticated QUIC transport with built-in NAT traversal, so there's no port-forwarding, no VPN, and no CA infrastructure to manage. Every peer is identified by its Ed25519 NodeId.

The hero animation at the top of this page cycles through the three interactions the system actually supports: steady-state ingest (vectors flowing up), operator queries (UI to leader to ChromaDB to Gemma), and on-demand playback (UI to leader to originating follower). Everything between edge and leader is a single authenticated QUIC connection per camera.


FOLLOWER

On-device embedding pipeline

Each follower runs a tight loop: capture, chunk, embed, ship. The embedding step is the interesting one. Gemma 4 runs on-device via Cactus and produces a multimodal vector over a short window of frames in a few hundred milliseconds on Apple Silicon. The same model produces a one-sentence caption that we also embed into a text collection so retrieval can fuse dense video similarity with sparse caption matches.
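
The sampling step is the easiest piece to show concretely. A minimal, self-contained sketch of picking K evenly spaced frames out of the sliding window before they go to Gemma; the function name and the generic frame type are illustrative:

/// Pick `k` frames spread evenly across the sliding window.
/// Illustrative sketch; the real follower also pairs this with an audio segment.
fn sample_frames<T: Clone>(window: &[T], k: usize) -> Vec<T> {
    if window.is_empty() || k == 0 {
        return Vec::new();
    }
    let step = window.len() as f32 / k as f32;
    (0..k)
        .map(|i| window[(i as f32 * step) as usize].clone())
        .collect()
}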

[Pipeline diagram: Capture (webcam 1080p @ 15fps, microphone 16kHz mono, 10s sliding frame buffer) → Inference (sample K frames every 5s plus an audio segment, Gemma 4 on Cactus vision + audio towers, L2-normalized embedding) → Package (chunk of {emb, caption, jpeg}, JPEG q60 middle frame) → Transport (iroh QUIC ingest/v1, disk spool ring buffer).]
Follower pipeline — capture, embed, ship
Capture, embed, ship. A disk spool keeps ingest at-least-once even over flaky networks.
A follower capturing and embedding a live webcam feed in real time. Only the resulting vectors leave the device.

Zero-copy embedding

Gemma 4 on Cactus uses mmap and zero-copy weight loading, which is why a 4B multimodal model comfortably runs on an M-series Mac mini with headroom for the capture pipeline on the same box.

Graceful degradation

If inference falls behind real time, the follower drops frame samples first, captions second, and entire chunks last, emitting a metric every time, so the pipeline never silently loses fidelity.
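
The ordering can be written down as a small ladder. A hedged sketch; the enum, thresholds, and the lag signal are illustrative, not the follower's actual implementation:

// Illustrative degradation ladder: shed the cheapest fidelity first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Degradation {
    None,        // full window: K frames + caption
    FewerFrames, // drop some frame samples, keep the caption
    NoCaption,   // embedding only, skip the caption pass
    DropChunk,   // skip the whole window (last resort), still emit a metric
}

fn pick_degradation(inference_lag_ms: u64) -> Degradation {
    match inference_lag_ms {
        0..=500 => Degradation::None,
        501..=2_000 => Degradation::FewerFrames,
        2_001..=10_000 => Degradation::NoCaption,
        _ => Degradation::DropChunk,
    }
}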

Disk spool on disconnect

When the iroh stream errors, chunks are written to a bounded ring buffer on disk and drained oldest-first on reconnect, deduped by chunk_id. Followers on flaky Wi-Fi or LTE recover transparently.
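
Shape-wise the spool is a bounded queue plus a dedupe set. A hedged sketch that reuses the Chunk shape sketched earlier and stands an in-memory VecDeque in for the on-disk ring buffer:

use std::collections::{HashSet, VecDeque};

// Illustrative spool: bounded, oldest-first drain, dedup by chunk_id.
// The real spool is disk-backed so it survives a process restart.
struct Spool {
    buf: VecDeque<Chunk>,
    capacity: usize,
    drained: HashSet<String>,
}

impl Spool {
    fn push(&mut self, chunk: Chunk) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front(); // bounded: the oldest chunk is sacrificed first
        }
        self.buf.push_back(chunk);
    }

    /// Drain oldest-first on reconnect, skipping chunk_ids already shipped.
    fn drain(&mut self) -> Vec<Chunk> {
        let drained = &mut self.drained;
        self.buf
            .drain(..)
            .filter(|c| drained.insert(c.chunk_id.clone()))
            .collect()
    }
}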

Synthetic mode

A --synthetic flag short-circuits Gemma and emits random vectors on the same cadence, so the transport layer and leader can be exercised end-to-end without spinning up the model at all.


TRANSPORT

Wire protocol over iroh

Every node-to-node link is a single iroh QUIC connection, identified by an ALPN string and carrying length-prefixed postcard-encoded frames. The leader is the accepting side; followers dial it using a ticket it prints at startup. After the initial Hello, the stream is long-lived and bidirectional: the follower pushes Chunk and FrameResponse frames, the leader pushes Ack and FrameRequest frames.
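
Both the frame set and the framing are small enough to sketch. A hedged version assuming serde + postcard; the variant payloads are abbreviated, and the 4-byte little-endian length prefix is an assumption (the protocol is only described as length-prefixed):

use serde::{Deserialize, Serialize};
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Abbreviated sketch of the wire frames exchanged over the ingest stream.
#[derive(Debug, Serialize, Deserialize)]
enum Frame {
    Hello { camera_id: String },
    Chunk { chunk_id: String, embedding: Vec<f32>, caption: String, jpeg: Vec<u8>, ts: u64 },
    Ack { chunk_id: String },
    FrameRequest { req_id: u64 },
    FrameResponse { req_id: u64, jpeg: Vec<u8> },
}

/// Length-prefixed postcard encoding: length, then the serialized frame.
async fn write_frame<W: AsyncWrite + Unpin>(w: &mut W, frame: &Frame) -> anyhow::Result<()> {
    let bytes = postcard::to_allocvec(frame).map_err(|e| anyhow::anyhow!("encode: {e}"))?;
    w.write_all(&(bytes.len() as u32).to_le_bytes()).await?;
    w.write_all(&bytes).await?;
    Ok(())
}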

[Sequence diagram: Follower (cam-lab-1, Rust + Cactus), Leader (iroh + axum + ChromaDB), UI (React in the browser). Handshake: the follower dials the leader's ticket with ALPN cactus/ingest/v1 and sends Hello{ camera_id: "cam-lab-1" }; the leader registers the camera under /api/cameras; authenticated over Ed25519, no CA, no port forwarding. Ingest: Chunk{ chunk_id, embedding, caption, jpeg, ts } frames are persisted into the video / audio / caption collections in ChromaDB and acknowledged with Ack{ chunk_id }. Live frame proxy: GET /api/live/cam-lab-1 reuses the follower's existing QUIC connection via FrameRequest{ req_id } / FrameResponse{ req_id, jpeg } and returns 200 image/jpeg. Query: POST /api/query { q: "when did the package arrive?" } runs ANN + RRF + on-device Gemma synthesis on the leader and returns an answer with citations (cam-door @ 14:03, cam-door @ 14:04).]
One long-lived QUIC stream carries embeddings up and requests down
One long-lived bidirectional QUIC stream per follower carries embeddings up and live-frame requests down.

iroh does the unglamorous plumbing: Ed25519-authenticated TLS 1.3, NAT traversal with relay fallback, connection migration across network changes, and 0-RTT resumption. A follower moving from Wi-Fi to LTE keeps the same logical stream alive. There is no CA, no shared secret, and no port forwarding. The leader prints a dial ticket and that's the whole bootstrap.
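
The follower's side of that bootstrap is only a few lines. A hedged sketch assuming a recent iroh release where Endpoint mirrors quinn's connect / open_bi surface; builder and ticket APIs differ between iroh versions, so treat the exact calls as approximate:

use iroh::{Endpoint, NodeAddr};

const ALPN: &[u8] = b"cactus/ingest/v1";

// Approximate sketch: dial the leader at the NodeAddr recovered from its
// printed ticket, then open the single long-lived bidirectional stream.
async fn dial_leader(leader: NodeAddr) -> anyhow::Result<()> {
    let endpoint = Endpoint::builder().bind().await?; // node key generated/loaded locally
    let conn = endpoint.connect(leader, ALPN).await?; // NAT traversal + relay fallback handled by iroh
    let (_send, _recv) = conn.open_bi().await?;       // Chunks up, Acks and FrameRequests down
    // From here: write Hello on the send half, then loop shipping Chunk frames
    // (see write_frame above) while reading Ack / FrameRequest frames.
    Ok(())
}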


QUERY

Natural-language retrieval

A query comes in as plain English. The leader embeds it with Gemma on Cactus in text-only mode, runs filtered ANN search across the per-modality collections in ChromaDB, fuses the results with reciprocal rank fusion, and hands the top-K captions plus their (camera, timestamp) metadata back to Gemma for synthesis. The final answer cites specific clips that the UI can play back on demand.

[Query pipeline diagram: operator question ("when did the package arrive?") → embed query (Gemma 4, text-only) → per-modality ANN (video_clips with camera/time filters, audio_clips dense audio vectors, captions sparse text match) → RRF rank fusion → top-K chunks + metadata → Gemma 4 on Cactus, on-device synthesis → answer + citations (cam-door @ 14:03 …) → clip playback (iroh pull on click).]
Natural-language retrieval — per-modality ANN + RRF + on-device synthesis
Dense video, dense audio, and text caption vectors are fused before Gemma writes the final answer.
Asking the leader a natural-language question across every connected camera. Answer, citations, and clip playback in one pass.

Stage 1: parse the question

Most surveillance questions come with implicit filters: "did anyone enter the lab between 3 and 5pm", "what did cam-front see in the last 30 minutes". Before we touch ChromaDB, the leader asks Gemma to rewrite the query into a structured JSON envelope of the form {time_start_ms, time_end_ms, camera_ids, top_k}. If the user doesn't name a window, the parser defaults to the last 30 minutes; top_k defaults to 20 and is capped at 50. Gemma's output runs through a thinking-marker stripper (<|channel|>analysis / <|message|>) before JSON extraction, because Harmony-style reasoning will otherwise wrap the envelope.
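
The envelope and its defaults are easy to pin down in code. A hedged sketch; the struct name and the resolve helper are illustrative, while the field names and defaults are the ones described above:

use serde::Deserialize;

// Illustrative shape of the parsed-query envelope.
#[derive(Debug, Deserialize)]
struct QueryEnvelope {
    time_start_ms: Option<u64>,      // None → last 30 minutes
    time_end_ms: Option<u64>,        // None → now
    camera_ids: Option<Vec<String>>, // None → all cameras
    top_k: Option<usize>,            // None → 20, hard-capped at 50
}

impl QueryEnvelope {
    /// Apply the documented defaults against the current time.
    fn resolve(self, now_ms: u64) -> (u64, u64, Option<Vec<String>>, usize) {
        let end = self.time_end_ms.unwrap_or(now_ms);
        let start = self.time_start_ms.unwrap_or(end.saturating_sub(30 * 60 * 1000));
        let top_k = self.top_k.unwrap_or(20).min(50);
        (start, end, self.camera_ids, top_k)
    }
}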

Stage 2: embed the query

The remaining text is embedded by the same Gemma 4 instance that captions incoming chunks, just invoked in text-only mode. The call stays on the leader's hardware, L2-normalizes the output, rejects non-finite or zero-norm vectors, and goes through a 128-entry LRU on the raw query string. The cache matters more than it sounds: during iteration and demos you re-ask the same question many times, and a warm cache turns the embed step into a hash lookup.
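
Those guard rails are compact. A hedged sketch assuming the lru crate for the 128-entry cache; embed_text stands in for the Cactus/Gemma call:

use lru::LruCache;
use std::num::NonZeroUsize;

fn new_query_cache() -> LruCache<String, Vec<f32>> {
    LruCache::new(NonZeroUsize::new(128).unwrap()) // keyed on the raw query string
}

// Illustrative sketch: memoize, L2-normalize, and reject junk vectors.
fn embed_query(
    cache: &mut LruCache<String, Vec<f32>>,
    query: &str,
    embed_text: impl Fn(&str) -> Vec<f32>, // stand-in for the on-device Gemma call
) -> Option<Vec<f32>> {
    if let Some(v) = cache.get(query) {
        return Some(v.clone()); // warm cache: a repeated question becomes a hash lookup
    }
    let mut v = embed_text(query);
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if !norm.is_finite() || norm == 0.0 {
        return None; // reject non-finite or zero-norm vectors instead of searching with garbage
    }
    v.iter_mut().for_each(|x| *x /= norm);
    cache.put(query.to_string(), v.clone());
    Some(v)
}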

Stage 3: filtered ANN search

Chunks live in two ChromaDB collections, video-clips and audio-clips, indexed under cosine distance. Splitting by modality keeps the query embedding dimension aligned with the stored dimension and lets an audio-only question skip video entirely. The where filter is assembled from the parsed envelope:

{
  "start_ts_ms": { "$lte": end_ms },
  "end_ts_ms":   { "$gte": start_ms },
  "camera_id":   { "$in": ["cam-front", "cam-lab"] }
}

Each collection is queried with n_results = top_k * 2 so we have headroom to fuse. If ChromaDB isn't reachable, the leader degrades to a brute-force cosine scan over the in-memory store. Slower, but the API contract doesn't change.
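
That degraded path is just a cosine scan. A self-contained sketch, with chunks given as (id, embedding) pairs already filtered by camera and time:

// Brute-force fallback: same top-k contract as the ANN path, no index required.
fn brute_force_top_k(query: &[f32], chunks: &[(String, Vec<f32>)], k: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = chunks
        .iter()
        .map(|(id, emb)| (id.clone(), cosine(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest similarity first
    scored.truncate(k);
    scored
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}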

Stage 4: reciprocal rank fusion and caption boost

Video and audio hit-lists are merged with Reciprocal Rank Fusion (RRF_K = 60), with the fused rank term scaled by a tunable rrf_weight (default 2.0, set via LEADER_RRF_WEIGHT). The final ranking score per chunk is:

score = modality_aware_cosine(q, chunk)
      + rrf_weight * Σ 1 / (RRF_K + rank_in_modality)
      + caption_boost

modality_aware_cosine slices the stored embedding back into its video and audio halves (video[0..video_dim] || audio[video_dim..]) so similarity is computed against whichever half matches the collection that produced the hit. caption_boost (default cap 0.15, via LEADER_CAPTION_BOOST) adds a lightweight keyword-overlap term between the query and the chunk caption (words shorter than 3 characters are treated as stop-words and ignored) so exact phrases like "red backpack" don't get drowned out by near-neighbors in vector space.
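
Spelled out, the fusion step looks roughly like the sketch below: the modality-aware cosine is taken per hit, the reciprocal-rank terms are summed across the lists a chunk appears in, and the precomputed caption boost is added last. All names here are illustrative:

use std::collections::HashMap;

const RRF_K: f32 = 60.0;

// Illustrative fusion: ranked_lists holds the per-modality hit-lists as
// (chunk_id, modality_aware_cosine) in rank order; caption_boost holds the
// precomputed keyword-overlap term (capped at 0.15) per chunk.
fn fuse(
    ranked_lists: &[Vec<(String, f32)>],
    caption_boost: &HashMap<String, f32>,
    rrf_weight: f32, // default 2.0, via LEADER_RRF_WEIGHT
) -> Vec<(String, f32)> {
    let mut best_cos: HashMap<String, f32> = HashMap::new();
    let mut rrf_sum: HashMap<String, f32> = HashMap::new();
    for list in ranked_lists {
        for (rank, (chunk_id, cos)) in list.iter().enumerate() {
            let best = best_cos.entry(chunk_id.clone()).or_insert(f32::MIN);
            *best = best.max(*cos);
            *rrf_sum.entry(chunk_id.clone()).or_insert(0.0) += 1.0 / (RRF_K + rank as f32 + 1.0);
        }
    }
    let mut fused: Vec<(String, f32)> = best_cos
        .into_iter()
        .map(|(id, cos)| {
            let score = cos
                + rrf_weight * rrf_sum.get(&id).copied().unwrap_or(0.0)
                + caption_boost.get(&id).copied().unwrap_or(0.0);
            (id, score)
        })
        .collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1));
    fused
}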

Stage 5: synthesize the answer

The top 10 chunks are flattened into a numbered caption list of the form {idx}. [{camera_id} {start_ts_ms}ms..{end_ts_ms}ms] {caption} and passed to Gemma with max_tokens = 256, temperature = 0.2, and a terse system prompt that requires citations in the form (#2, cam-lab). Deliberately, only captions go in. No re-sent JPEGs. An early version that re-attached thumbnails for each citation candidate caused Gemma to ignore the question and start describing the scene instead. Text-only synthesis was both faster and more accurate.
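
The flattening itself is mechanical. A sketch of building the numbered caption list in the format quoted above; the Hit struct is illustrative:

// Flatten the top hits into the numbered caption list handed to Gemma.
struct Hit {
    camera_id: String,
    start_ts_ms: u64,
    end_ts_ms: u64,
    caption: String,
}

fn caption_list(hits: &[Hit]) -> String {
    hits.iter()
        .take(10) // only the top 10 chunks go into the prompt
        .enumerate()
        .map(|(i, h)| {
            // 1-based index so citations can read (#2, cam-lab).
            format!(
                "{}. [{} {}ms..{}ms] {}",
                i + 1,
                h.camera_id,
                h.start_ts_ms,
                h.end_ts_ms,
                h.caption
            )
        })
        .collect::<Vec<_>>()
        .join("\n")
}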

If fewer than top_k hits clear a score floor of 0.3, the prompt requires Gemma to answer "No, the captured footage does not show that." rather than confabulate. If Cactus isn't loaded at all, the endpoint falls through to "Found N chunks (LLM not loaded)" so the UI still renders hits.

Splitting retrieve from synthesize

The HTTP surface exposes three endpoints rather than one. POST /api/search returns hits only (~100-500ms); POST /api/answer takes a list of chunk ids and runs synthesis (~1-5s); POST /api/query does both in a single call. The UI calls search first so the grid of citations lights up immediately, then calls answer in the background. The perceived latency is the cheap half, not the Gemma half. A typical end-to-end /api/query p95 lands under 5 seconds on a single Mac mini against a week of 8-camera data, which was the latency bar we set going into the build.


KEY DECISIONS

What I learned

Shipping a distributed multimodal RAG system in 24 hours forced a lot of decisions that would normally get over-engineered. A few things I'll carry forward:

On-device multimodal is genuinely ready.

Gemma 4 on Cactus produces useful video and audio embeddings on consumer hardware fast enough for soft-real-time ingest. The "send raw frames to a hosted vision model" era of video understanding is ending. The interesting systems push inference to the edge and only centralize vectors.

P2P transport is a superpower for enterprise demos.

Plugging four laptops into four different networks and watching them all connect to one leader with zero configuration (no port forwarding, no VPN, no firewall rules) is the single most compelling demo moment. It's also the feature judges and customers latched onto fastest.

The hard part is the retrieval, not the model.

Swapping the embedding cadence or the synthetic backend was easy. Getting multi-camera, time-filtered, multi-modal retrieval to produce good top-K results took the most iteration. Window size, caption quality, and fusion weights all mattered more than any individual knob on the model.

Feature flags are how you demo fast.

Having --synthetic, --no-camera, and a cactus feature flag meant the system ran end-to-end on any machine in the room within two minutes. That flexibility was worth more during the hackathon than any single piece of the pipeline.