How SafeOS Guardian works: motion-gated browser AI that doesn't burn your laptop

A few years ago, running real-time computer vision in a browser was a thought experiment. You'd see a demo at a conference, somebody'd fire up a CodePen, the laptop fan would scream, and you'd nod and move on. As of mid-2025, SafeOS Guardian ships exactly that: a Progressive Web App that runs COCO-SSD and ViT-base on a user's laptop or phone with no cloud round-trip, and the fan stays quiet. Here's how.

The two-tier pipeline

Every frame goes through a cheap screening layer. Only triggered frames go to the expensive deep-learning models. That single design choice — gating CV inference on motion + audio + pixel-change thresholds — is the difference between "browser AI is theoretically possible" and "browser AI runs all night on a phone without melting the battery."

Concretely: the screening layer is a setInterval loop that runs pixel-diff motion detection every 200 ms, plus a Web Audio AnalyserNode FFT read every 100 ms. Both are pure math on the CPU. No GPU context, no model loading, no service worker thrash. Gated frame capture runs on its own 1000 ms interval. The intervals are constants in CameraFeed.tsx at lines 73–75:

const FRAME_INTERVAL = 1000; // gated frame capture

const MOTION_INTERVAL = 200; // pixel-diff every 200ms

const AUDIO_INTERVAL = 100; // FFT every 100ms
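The pixel-diff half of the screening layer fits in a few lines. What follows is a minimal sketch, not the code in CameraFeed.tsx: the function name motionScore, the summed-RGB comparison, and the pixelDelta default of 30 are all assumptions made for illustration.

```typescript
// Hypothetical helper, not taken from CameraFeed.tsx.
// Returns the fraction of pixels (0..1) whose summed RGB difference
// between two same-size RGBA frames exceeds `pixelDelta`.
function motionScore(
  prev: Uint8ClampedArray,
  curr: Uint8ClampedArray,
  pixelDelta = 30 // assumed per-pixel sensitivity, not a SafeOS value
): number {
  if (prev.length !== curr.length || curr.length % 4 !== 0) {
    throw new Error("frames must be same-size RGBA buffers");
  }
  const pixelCount = curr.length / 4;
  if (pixelCount === 0) return 0;

  let changed = 0;
  for (let i = 0; i < curr.length; i += 4) {
    // Compare R, G, B; the alpha channel at i + 3 is ignored.
    const diff =
      Math.abs(curr[i] - prev[i]) +
      Math.abs(curr[i + 1] - prev[i + 1]) +
      Math.abs(curr[i + 2] - prev[i + 2]);
    if (diff > pixelDelta) changed++;
  }
  return changed / pixelCount;
}
```

In a browser, both buffers would typically come from ctx.getImageData(0, 0, width, height).data on a small canvas, read inside the MOTION_INTERVAL timer. Keeping the previous frame's buffer around is the only state this needs.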

The gate

Object detection only runs when the screening layer says "something happened." The gate is two lines in person-detection.ts:

// line 271-272: only run AI detection if motion was detected

if (!motionTriggered) return null;

No motion, no inference. The GPU stays in its low-power state. COCO-SSD's model.detect() only fires when something in the frame moved enough to cross the per-scenario threshold (configurable in settings — different thresholds for babies, pets, security). The pipeline can theoretically push 10–30 FPS through COCO-SSD on WebGL/WebGPU, but in practice it averages a fraction of a Hz because most frames don't pass the gate.
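Here is one way the gate and the per-scenario thresholds could fit together. It is a sketch under stated assumptions: the scenario names mirror the ones mentioned above, but the threshold numbers, the Detection shape, and the function names are hypothetical and not copied from person-detection.ts.

```typescript
// Hypothetical names and numbers; the real thresholds live in settings.
type Scenario = "baby" | "pet" | "security";

const MOTION_THRESHOLDS: Record<Scenario, number> = {
  baby: 0.02, // assumed: fraction of changed pixels needed to open the gate
  pet: 0.05,
  security: 0.01,
};

interface Detection {
  label: string;
  score: number;
}

type Detector<F> = (frame: F) => Promise<Detection[]>;

// True when the motion score crosses the scenario's threshold.
function isMotionTriggered(score: number, scenario: Scenario): boolean {
  return score >= MOTION_THRESHOLDS[scenario];
}

// Runs the expensive detector only when the gate is open.
// Returns null when the frame was skipped, mirroring the early return above.
async function detectIfTriggered<F>(
  frame: F,
  score: number,
  scenario: Scenario,
  detect: Detector<F>
): Promise<Detection[] | null> {
  if (!isMotionTriggered(score, scenario)) return null; // gate closed: no inference
  return detect(frame);
}
```

In the browser, the detect argument would wrap something like model.detect(canvas) from @tensorflow-models/coco-ssd. Passing the detector in keeps the gate testable without loading a model.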

Two models, one fallback

Primary detector: a quantized COCO-SSD via TensorFlow.js. ~5 MB on the wire, downloaded once on first model load, cached by the service worker forever after. Spots 80 object classes with bounding boxes. The detection canvas is 320×240 (model input size) regardless of the source video resolution.

Tie-breaker: a quantized Xenova/vit-base-patch16-224 via Transformers.js. ~89 MB, downloaded on demand, and only when COCO-SSD's top prediction falls below the per-scenario confidence threshold. ViT-base is heavier but better at fine-grained scene labeling, which helps with ambiguous frames ("is this person standing or fallen?").

For really hard scenes, you can configure an optional bridge to a local Ollama install on the same LAN (moondream, llava:7b, or llama3.2-vision) for richer scene reasoning. With that setup, nothing leaves your network. When even Ollama isn't enough, you can plug in your own OpenAI, Anthropic, or Gemini keys; the cloud fallback fires only on frames the local models couldn't classify confidently.
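The escalation order described above (COCO-SSD, then ViT-base, then optional Ollama, then optional cloud) can be sketched as a small decision function. Everything here is hypothetical: the tier names, the CascadeConfig shape, and the assumption that every tier is judged against the same per-scenario confidence threshold are mine, not taken from the codebase.

```typescript
// Hypothetical tier names and config shape, for illustration only.
type Tier = "coco-ssd" | "vit" | "ollama" | "cloud";

interface CascadeConfig {
  confidenceThreshold: number; // per-scenario value from settings
  ollamaEnabled: boolean; // user configured a local Ollama bridge
  cloudKeyPresent: boolean; // user supplied their own API key
}

// Tiers a frame may visit, cheapest first. Optional tiers appear only if enabled.
function availableTiers(cfg: CascadeConfig): Tier[] {
  const tiers: Tier[] = ["coco-ssd", "vit"];
  if (cfg.ollamaEnabled) tiers.push("ollama");
  if (cfg.cloudKeyPresent) tiers.push("cloud");
  return tiers;
}

// Given the tier that just ran and its top confidence, return the next tier
// to try, or null when the result is confident enough or no tier remains.
function nextTier(current: Tier, topScore: number, cfg: CascadeConfig): Tier | null {
  if (topScore >= cfg.confidenceThreshold) return null;
  const tiers = availableTiers(cfg);
  const idx = tiers.indexOf(current);
  if (idx < 0) return null; // current tier is not enabled in this config
  return tiers[idx + 1] ?? null;
}
```

The point of writing it this way is that the expensive and network-bound tiers are opt-in by construction: with both flags off, a frame can never travel further than ViT-base.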

Audio FFT, not an audio model

Cry detection, distress vocalizations, glass break, sustained silence — none of these need a model. The Web Audio API's AnalyserNode with fftSize = 256 gives 128 frequency bins, sampled every ~100 ms. Threshold the right bins and you get most of what a small audio classifier would give you, at a fraction of the cost. Baby cries cluster in the 300–600 Hz fundamental with characteristic harmonic patterns; glass breaks are mostly high-frequency transients; silence is the absence of energy across the spectrum.
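To make "threshold the right bins" concrete, here is a minimal sketch of averaging the energy in one frequency band. The helper name bandEnergy is hypothetical. One caveat worth stating: bin width is sampleRate / fftSize, so assuming a 48 kHz sample rate, fftSize = 256 gives bins about 187.5 Hz wide, and the 300–600 Hz band spans only about three of them.

```typescript
// Hypothetical helper: mean magnitude (0..255) of the FFT bins covering
// the band [loHz, hiHz]. `bins` is the array filled by
// AnalyserNode.getByteFrequencyData, whose length is fftSize / 2.
function bandEnergy(
  bins: Uint8Array,
  sampleRate: number, // AudioContext.sampleRate, e.g. 48000
  loHz: number,
  hiHz: number
): number {
  const fftSize = bins.length * 2; // frequencyBinCount is fftSize / 2
  const hzPerBin = sampleRate / fftSize;
  const lo = Math.max(0, Math.floor(loHz / hzPerBin));
  const hi = Math.min(bins.length - 1, Math.floor(hiHz / hzPerBin));
  if (hi < lo) return 0;

  let sum = 0;
  for (let i = lo; i <= hi; i++) sum += bins[i];
  return sum / (hi - lo + 1);
}
```

In the browser you would allocate new Uint8Array(analyser.frequencyBinCount) once, call analyser.getByteFrequencyData(bins) on each tick, then compare bandEnergy for a few bands against thresholds. Sustained silence falls out of the same function: the energy across the whole spectrum stays near zero for several ticks in a row.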

Lost & Found: 32 + 64 + 1 KB

The matcher doesn't use a deep model at all. Reference photos reduce to three signatures: a 32-bin color histogram, the top-5 dominant colors via k-means, and an 8×8 Sobel-derived edge grid. Total: under 1 KB per reference photo. The matcher samples the live feed at 1–2 FPS and compares each candidate frame by cosine similarity. Code: visual-fingerprint.ts.
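Two of the pieces above, the 32-bin histogram and the cosine comparison, are easy to sketch. This is an illustration only: the post doesn't say how visual-fingerprint.ts bins colors, so the 4×4×2 RGB quantization below is an assumption, and both function names are hypothetical.

```typescript
// Hypothetical helpers; visual-fingerprint.ts may bin and weight differently.

// Builds a normalized 32-bin color histogram from an RGBA buffer.
// Assumed binning: 4 levels of red x 4 levels of green x 2 levels of blue.
function colorHistogram(rgba: Uint8ClampedArray): Float32Array {
  const hist = new Float32Array(32);
  const pixelCount = Math.floor(rgba.length / 4);
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    const bin = (rgba[i] >> 6) * 8 + (rgba[i + 1] >> 6) * 2 + (rgba[i + 2] >> 7);
    hist[bin] += 1;
  }
  if (pixelCount > 0) {
    // Normalize so images of different sizes are comparable.
    for (let b = 0; b < hist.length; b++) hist[b] /= pixelCount;
  }
  return hist;
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

A matcher along these lines would compute the reference histogram once, then score each sampled frame with cosineSimilarity and combine that with the dominant-color and edge-grid scores. How those three scores are weighted is not covered here.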

Why this matters: a 32-bin histogram + an 8×8 edge grid generalizes well enough to match a dog at different angles and distances, without needing a face-embedding model you'd have to keep updated. It's cheap, transparent, and easily keeps up with the 1–2 FPS sampling rate. The full Lost & Found loop lives in monitor/page.tsx around lines 242–254.

Same philosophy as AgentOS

SafeOS isn't the only thing Frame ships under the "local-first AI" banner. AgentOS is the agent runtime side: same idea, different domain. Its memory layer is grounded in cognitive science instead of throwing a fixed-window context buffer at the LLM and hoping. It scores 85.6% on LongMemEval-S (+1.4 points over Mastra) at under $0.009 per correct retrieval with GPT-4o. AgentOS's tool-forging is gated on the same "only when needed" principle as SafeOS's CV inference: agents create Zod-validated functions on demand, then reuse them in a sandboxed V8 context. Same instinct: don't pay the cost until you need to.

The state management is boring

State is Zustand (lightweight, persist middleware writes to IndexedDB) and the UI is Next.js 14 + React 18 + Tailwind. No magic. The interesting parts of the codebase are the pipeline files in src/lib/, not the React shell.

What's broken

PWA install on iOS is still rough: Safari's installation flow isn't great, and background audio capture has well-known limits. ViT-base's 89 MB download is real; on a slow connection, the first load takes a while. In-browser visual fingerprint matching isn't as good as a proper face-embedding model would be, but the trade-off is no model to keep updated and no PII leakage. Most of these are tracked in GitHub issues.

The honest version is: SafeOS is rough, occasionally inelegant, and runs on engineering decisions I'm going to second-guess in six months. But it works, it's free, it's MIT-licensed, and it doesn't send your camera feed to a stranger's server. If you want to read about the most consequential of those decisions — the motion gate that keeps the GPU idle — the next post is just about that.

— Johnny Dunn, Frame