Lookout
A distributed multi-camera video RAG system with on-device multimodal embeddings and peer-to-peer transport over QUIC.
MOTIVATION
Why build distributed video RAG?
Watching many cameras at once, without shipping a single raw frame to the cloud.
Surveillance, robotics, and "smart space" deployments increasingly involve many concurrent video feeds: several rooms, several robots, several cameras on a vehicle. Operators need to ask natural-language questions across all of them, like "when did the package get delivered?", "which camera last saw the red backpack?", or "did anyone enter the lab between 3 and 5pm?".
Today those questions get answered one of two bad ways: ship every raw frame to the cloud (expensive, privacy-hostile, bandwidth-bound), or stay single-camera and lose the cross-feed narrative. Neither works once you care about privacy, latency, or operating offline.
Lookout treats each camera as a small edge computer. It captures locally, embeds locally with Gemma 4 on Cactus, and only ships tiny embedding vectors (not raw video) over a peer-to-peer QUIC link to a central leader. The leader indexes everything into ChromaDB and answers natural-language queries over the combined index. Raw clips stay on the camera that recorded them and are only ever streamed on explicit request.
It was built in ~24 hours at the Cactus × DeepMind × YC hackathon, where it won Best B2B, Best Technical Hack, and the Grand Prize.
OVERVIEW
The three pieces
A follower lives on every camera. A leader lives on one server. A UI lives in a browser. Everything between them is QUIC.
Follower
A Rust binary that runs on each camera node. It captures webcam frames, slides a short window over them, calls Gemma 4 through Cactus to produce a single multimodal embedding per window, and pushes {embedding, caption, camera_id, timestamp} upstream over iroh. Raw video never leaves the device by default.
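On the wire that payload is just a small serde struct. The field names below are illustrative (they mirror the tuple above, not the actual source), but the shape is essentially this:

use serde::{Deserialize, Serialize};

/// One embedded window of frames, pushed from follower to leader.
/// Field names are illustrative, not lifted from the codebase.
#[derive(Serialize, Deserialize, Debug, Clone)]
struct Chunk {
    chunk_id: String,    // stable id; the leader dedupes on this
    camera_id: String,   // which follower produced the window
    start_ts_ms: u64,    // window start, unix millis
    end_ts_ms: u64,      // window end, unix millis
    embedding: Vec<f32>, // multimodal vector from Gemma on Cactus
    caption: String,     // one-sentence caption for the same window
}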
Leader
A Rust binary that accepts ingest connections from any number of followers over iroh QUIC, dedupes and persists chunks into ChromaDB, and serves a small HTTP API for the UI: GET /api/cameras, GET /api/live/:id, POST /api/query. Query answers are synthesized by Gemma 4 on Cactus, also running on-device.
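A minimal sketch of that HTTP surface, assuming an axum-style router and hypothetical handler names (the actual web framework and handlers aren't shown here):

use axum::{routing::{get, post}, Router};

// Hypothetical handlers; the real ones talk to ChromaDB and Gemma.
async fn list_cameras() -> &'static str { "[]" }
async fn live_frame() -> &'static str { "{}" }
async fn query() -> &'static str { "{}" }

// The three endpoints the UI depends on.
fn api_router() -> Router {
    Router::new()
        .route("/api/cameras", get(list_cameras))
        .route("/api/live/:id", get(live_frame))
        .route("/api/query", post(query))
}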
UI
A React + Vite dashboard that shows every connected follower as a live tile, lets the operator ask questions in natural language, and plays back raw clips fetched lazily from the originating camera. It talks only to the leader, so it never has to know anything about the P2P mesh underneath.
ARCHITECTURE
System anatomy
The whole system is three processes that could each live on a different continent. Followers dial the leader over iroh, an authenticated QUIC transport with
built-in NAT traversal, so there's no port-forwarding, no VPN, and no CA infrastructure to manage. Every peer is identified by its Ed25519 NodeId.
The hero animation at the top of this page cycles through the three interactions the system actually supports: steady-state ingest (vectors flowing up), operator queries (UI to leader to ChromaDB to Gemma), and on-demand playback (UI to leader to originating follower). Everything between edge and leader is a single authenticated QUIC connection per camera.
FOLLOWER
On-device embedding pipeline
Each follower runs a tight loop: capture, chunk, embed, ship. The embedding step is the interesting one. Gemma 4 runs on-device via Cactus and produces a multimodal vector over a short window of frames in a few hundred milliseconds on Apple Silicon. The same model produces a one-sentence caption that we also embed into a text collection so retrieval can fuse dense video similarity with sparse caption matches.
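The loop itself is small. A sketch under stated assumptions: capture_frame, embed_window, and push_chunk stand in for the real capture, Cactus, and iroh code, and the window/stride sizes are placeholders rather than the shipped values.

// Sketch of the follower's capture -> chunk -> embed -> ship loop.
struct Frame(Vec<u8>);

fn capture_frame() -> Frame { Frame(Vec::new()) }       // webcam capture stub
fn embed_window(_w: &[Frame]) -> (Vec<f32>, String) {   // Gemma via Cactus stub
    (vec![0.0; 768], String::new())                     // placeholder dimension
}
fn push_chunk(_embedding: Vec<f32>, _caption: String) { /* iroh send stub */ }

fn follower_loop() {
    const WINDOW: usize = 8; // frames per embedded window (assumed)
    const STRIDE: usize = 4; // how far the window slides each step (assumed)
    let mut frames: Vec<Frame> = Vec::new();
    loop {
        frames.push(capture_frame());
        if frames.len() >= WINDOW {
            let (embedding, caption) = embed_window(&frames[frames.len() - WINDOW..]);
            push_chunk(embedding, caption);
            frames.drain(..STRIDE); // slide forward, keep the overlap
        }
    }
}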
Zero-copy embedding
Gemma 4 on Cactus uses mmap and zero-copy weight loading, which is why a 4B multimodal model comfortably runs on an M-series Mac mini with headroom for the capture pipeline on the same box.
Graceful degradation
If inference falls behind real time, the follower drops frame samples first, captions second, and entire chunks last, emitting a metric every time, so the pipeline never silently loses fidelity.
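Roughly, the decision looks like the sketch below; the backlog thresholds are made-up illustration values, only the ordering is the point.

// Degradation order when inference falls behind (thresholds are illustrative).
enum Shed {
    FrameSamples, // cheapest: thin out frames inside a window
    Captions,     // next: skip the caption pass, keep the embedding
    WholeChunk,   // last resort: drop the window entirely
}

fn choose_shed(backlog_ms: u64) -> Option<Shed> {
    match backlog_ms {
        0..=250 => None,
        251..=1_000 => Some(Shed::FrameSamples),
        1_001..=3_000 => Some(Shed::Captions),
        _ => Some(Shed::WholeChunk), // a metric is emitted for every shed decision
    }
}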
Disk spool on disconnect
When the iroh stream errors, chunks are written to a bounded ring buffer on disk and drained oldest-first on reconnect, deduped by chunk_id. Followers on flaky Wi-Fi or LTE recover transparently.
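A sketch of that spool, with an in-memory deque standing in for the on-disk ring buffer:

use std::collections::VecDeque;

// Bounded spool: oldest entries are evicted when full, drained oldest-first
// on reconnect, and the leader dedupes replayed chunks by chunk_id.
struct Spool {
    max_chunks: usize,
    pending: VecDeque<(String, Vec<u8>)>, // (chunk_id, encoded chunk)
}

impl Spool {
    fn push(&mut self, chunk_id: String, encoded: Vec<u8>) {
        if self.pending.len() == self.max_chunks {
            self.pending.pop_front(); // bounded: overwrite the oldest entry
        }
        self.pending.push_back((chunk_id, encoded));
    }

    fn drain_oldest_first(&mut self) -> impl Iterator<Item = (String, Vec<u8>)> + '_ {
        self.pending.drain(..)
    }
}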
Synthetic mode
A --synthetic flag short-circuits Gemma and emits random vectors on the same cadence, so the transport layer and leader can be exercised end-to-end without spinning up the model at all.
TRANSPORT
Wire protocol over iroh
Every node-to-node link is a single iroh QUIC connection, identified by an ALPN string and carrying length-prefixed postcard-encoded frames.
The leader is the accepting side; followers dial it using a ticket it prints at startup. After the initial Hello, the stream is long-lived and
bidirectional: the follower pushes Chunk and FrameResponse frames, the leader pushes Ack and FrameRequest frames.
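The frame set maps naturally onto a serde enum; the payloads below are a sketch (the real variants likely carry more fields), and the length prefix is shown as a plain u32 header ahead of the postcard bytes.

use serde::{Deserialize, Serialize};

// Frames carried on the long-lived stream. Payloads are illustrative.
#[derive(Serialize, Deserialize)]
enum Frame {
    Hello { camera_id: String },
    Chunk { chunk_id: String, embedding: Vec<f32>, caption: String },
    Ack { chunk_id: String },
    FrameRequest { chunk_id: String },
    FrameResponse { chunk_id: String, jpeg: Vec<u8> },
}

// Length-prefixed postcard encoding: a 4-byte length header, then the payload.
fn encode(frame: &Frame) -> postcard::Result<Vec<u8>> {
    let body = postcard::to_allocvec(frame)?;
    let mut out = (body.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(&body);
    Ok(out)
}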
iroh does the unglamorous plumbing: Ed25519-authenticated TLS 1.3, NAT traversal with relay fallback, connection migration across network changes, and 0-RTT resumption. A follower moving from Wi-Fi to LTE keeps the same logical stream alive. There is no CA, no shared secret, and no port forwarding. The leader prints a dial ticket and that's the whole bootstrap.
QUERY
Natural-language retrieval
A query comes in as plain English. The leader embeds it with Gemma on Cactus in text-only mode, runs filtered ANN search across the
per-modality collections in ChromaDB, fuses the results with reciprocal rank fusion, and hands the top-K captions plus their (camera, timestamp)
metadata back to Gemma for synthesis. The final answer cites specific clips that the UI can play back on demand.
Stage 1: parse the question
Most surveillance questions come with implicit filters: "did anyone enter the lab between 3 and 5pm", "what did cam-front see in the last 30 minutes".
Before we touch ChromaDB, the leader asks Gemma to rewrite the query into a structured JSON envelope of the form {time_start_ms, time_end_ms, camera_ids, top_k}. If the user
doesn't name a window, the parser defaults to the last 30 minutes; top_k defaults to 20 and is capped at 50. Gemma's output runs through a thinking-marker
stripper (<|channel|>analysis / <|message|>) before JSON extraction, because Harmony-style reasoning will otherwise wrap the envelope.
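The envelope deserializes into a small struct; the field names are the ones above, and the defaults mirror the behavior just described:

use serde::Deserialize;

// Structured envelope produced by the Gemma rewrite step.
#[derive(Deserialize, Debug)]
struct QueryEnvelope {
    time_start_ms: Option<u64>,      // None => now minus 30 minutes
    time_end_ms: Option<u64>,        // None => now
    camera_ids: Option<Vec<String>>, // None => all cameras
    #[serde(default = "default_top_k")]
    top_k: usize,
}

fn default_top_k() -> usize { 20 }

impl QueryEnvelope {
    fn clamped(mut self) -> Self {
        self.top_k = self.top_k.min(50); // hard cap
        self
    }
}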
Stage 2: embed the query
The remaining text is embedded by the same Gemma 4 instance that captions incoming chunks, just invoked in text-only mode. The call runs on the leader's hardware; the resulting vector is L2-normalized, rejected if it is non-finite or zero-norm, and cached in a 128-entry LRU keyed on the raw query string. The cache matters more than it sounds: during iteration and demos you re-ask the same question many times, and a warm cache turns the embed step into a hash lookup.
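The normalization and sanity checks amount to a few lines; a sketch (the LRU wrapper is omitted):

// L2-normalize the query embedding, rejecting non-finite or zero-norm vectors.
fn normalize(mut v: Vec<f32>) -> Option<Vec<f32>> {
    if v.iter().any(|x| !x.is_finite()) {
        return None; // NaN / inf: refuse to search with it
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return None; // zero-norm: cosine is undefined
    }
    for x in &mut v {
        *x /= norm;
    }
    Some(v)
}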
Stage 3: modality-aware ANN search
Chunks live in two ChromaDB collections, video-clips and audio-clips, indexed under cosine distance. Splitting by modality keeps the query
embedding dimension aligned with the stored dimension and lets an audio-only question skip video entirely. The where filter is assembled from the parsed envelope:
{
"start_ts_ms": { "$lte": end_ms },
"end_ts_ms": { "$gte": start_ms },
"camera_id": { "$in": ["cam-front", "cam-lab"] }
}

Each collection is queried with n_results = top_k * 2 so we have headroom to fuse. If ChromaDB isn't reachable, the leader degrades to a brute-force cosine scan over the in-memory store. Slower, but the API contract doesn't change.
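The fallback is a plain linear scan. A sketch, assuming the in-memory store keeps (chunk_id, embedding) pairs and embeddings are already unit-length:

// Brute-force cosine fallback when ChromaDB is unreachable.
fn brute_force_top_k<'a>(
    query: &[f32],
    stored: &'a [(String, Vec<f32>)],
    top_k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = stored
        .iter()
        .map(|(id, emb)| {
            // cosine == dot product for L2-normalized vectors
            let dot: f32 = query.iter().zip(emb).map(|(a, b)| a * b).sum();
            (id.as_str(), dot)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(top_k);
    scored
}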
Stage 4: reciprocal rank fusion and caption boost
Video and audio hit-lists are merged with Reciprocal Rank Fusion (RRF_K = 60), and the RRF term is scaled by a tunable rrf_weight (default 2.0, set via LEADER_RRF_WEIGHT).
The final ranking score per chunk is:
score = modality_aware_cosine(q, chunk)
      + rrf_weight * Σ 1 / (RRF_K + rank_in_modality)
      + caption_boost

modality_aware_cosine slices the stored embedding back into its video and audio halves (video[0..video_dim] || audio[video_dim..]) so similarity is computed against whichever half matches the collection that produced the hit. caption_boost (default cap 0.15, via LEADER_CAPTION_BOOST) adds a lightweight keyword-overlap term between the query and the chunk caption (stop-words under 3 characters are ignored) so exact phrases like "red backpack" don't get drowned out by near-neighbors in vector space.
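Putting the three terms together is a one-liner once the per-modality ranks are known; a sketch, with whether ranks are 0- or 1-based left as an assumption:

// Fused score: modality-aware cosine + weighted RRF + caption boost.
const RRF_K: f32 = 60.0;

fn fused_score(cosine: f32, ranks: &[usize], rrf_weight: f32, caption_boost: f32) -> f32 {
    // `ranks` holds this chunk's rank in each modality hit-list it appeared in.
    let rrf: f32 = ranks.iter().map(|&r| 1.0 / (RRF_K + r as f32 + 1.0)).sum();
    cosine + rrf_weight * rrf + caption_boost
}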
Stage 5: synthesize the answer
The top 10 chunks are flattened into a numbered caption list of the form {idx}. [{camera_id} {start_ts_ms}ms..{end_ts_ms}ms] {caption} and passed to Gemma
with max_tokens = 256, temperature = 0.2, and a terse system prompt that requires citations in the form (#2, cam-lab).
Deliberately, only captions go in. No re-sent JPEGs. An early version that re-attached thumbnails for each citation candidate caused Gemma to ignore the question and
start describing the scene instead. Text-only synthesis was both faster and more accurate.
If fewer than top_k hits clear a score floor of 0.3, the prompt requires Gemma to answer "No, the captured footage does not show that."
rather than confabulate. If Cactus isn't loaded at all, the endpoint falls through to "Found N chunks (LLM not loaded)" so the UI still renders hits.
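Building the caption context is just string assembly; a sketch using hypothetical field names:

// Flatten the top hits into the numbered caption list handed to Gemma.
struct Hit {
    camera_id: String,
    start_ts_ms: u64,
    end_ts_ms: u64,
    caption: String,
}

fn build_context(hits: &[Hit]) -> String {
    hits.iter()
        .take(10)
        .enumerate()
        .map(|(i, h)| {
            format!("{}. [{} {}ms..{}ms] {}",
                i + 1, h.camera_id, h.start_ts_ms, h.end_ts_ms, h.caption)
        })
        .collect::<Vec<_>>()
        .join("\n")
}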
Splitting retrieve from synthesize
The HTTP surface exposes three endpoints rather than one. POST /api/search returns hits only (~100-500ms); POST /api/answer takes a list of chunk ids
and runs synthesis (~1-5s); POST /api/query does both in a single call. The UI calls search first so the grid of citations lights up immediately, then
calls answer in the background, so perceived latency is dominated by the cheap half, not the Gemma half. A typical end-to-end /api/query p95 lands under 5 seconds on a single
Mac mini against a week of 8-camera data, which was the latency bar we set going into the build.
KEY DECISIONS
What I Learned
Shipping a distributed multimodal RAG system in 24 hours forced a lot of decisions that would normally get over-engineered. A few things I'll carry forward:
On-device multimodal is genuinely ready.
Gemma 4 on Cactus produces useful video and audio embeddings on consumer hardware fast enough for soft-real-time ingest. The "send raw frames to a hosted vision model" era of video understanding is ending. The interesting systems push inference to the edge and only centralize vectors.
P2P transport is a superpower for enterprise demos.
Plugging four laptops into four different networks and watching them all connect to one leader with zero configuration (no port forwarding, no VPN, no firewall rules) is the single most compelling demo moment. It's also the feature judges and customers latched onto fastest.
The hard part is the retrieval, not the model.
Swapping the embedding cadence or the synthetic backend was easy. Getting multi-camera, time-filtered, multi-modal retrieval to produce good top-K results took the most iteration. Window size, caption quality, and fusion weights all mattered more than any individual knob on the model.
Feature flags are how you demo fast.
Having --synthetic, --no-camera, and a cactus feature flag meant the system ran end-to-end on any machine in the room within two minutes. That flexibility was worth more during the hackathon than any single piece of the pipeline.