technical · motion-gating · tensorflow.js · on-device-ai · lessons-learned

The lie on my landing page: how SafeOS actually gates CV inference

May 13, 2026 · Johnny Dunn · 6 min read

Until last week, the SafeOS landing page said this:

"Every frame is analyzed by computer vision models running locally on your device."

Reader, that was a lie. Not a malicious one — I shipped that copy six months ago when I was excited about the model loading working at all — but a lie. Every frame is not analyzed by computer vision models. Most frames don't hit a model at all. The pipeline is motion-gated, by design, and it has to be, because running TF.js inference on every video frame would melt a phone in twenty minutes.

This post walks through the actual pipeline. It's also the post I should have written before I wrote the marketing copy.

What I claimed vs. what runs

The claim: every frame → COCO-SSD → bounding boxes → alerts.

The reality: every frame → cheap pixel-diff motion screening → if (and only if) motion crosses threshold → COCO-SSD. The motion screening runs on the CPU at 5–10 Hz, never touches the GPU, and decides whether to spin up the actual deep-learning models.
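To make "cheap" concrete, here is a minimal sketch of what CPU-side pixel-diff screening can look like on a downscaled canvas. The constants (MOTION_RATIO, PIXEL_DELTA) and the green-channel shortcut are assumptions for illustration, not SafeOS's actual code.

// Minimal pixel-diff screening sketch (illustrative constants, not
// SafeOS's real implementation). Diffs the current frame against the
// previous one on a small canvas and reports whether enough changed.
const MOTION_RATIO = 0.02; // assumed: fraction of pixels that must change
const PIXEL_DELTA = 25;    // assumed: per-pixel change that counts as "different"

let prevFrame: Uint8ClampedArray | null = null;

function motionScreen(video: HTMLVideoElement, ctx: CanvasRenderingContext2D): boolean {
  // Downscale onto the screening canvas before diffing.
  ctx.drawImage(video, 0, 0, ctx.canvas.width, ctx.canvas.height);
  const { data } = ctx.getImageData(0, 0, ctx.canvas.width, ctx.canvas.height);

  if (!prevFrame) {
    prevFrame = new Uint8ClampedArray(data); // first frame: establish baseline
    return false;
  }

  let changed = 0;
  for (let i = 0; i < data.length; i += 4) {
    // Green channel only: cheap, and a reasonable luminance proxy.
    if (Math.abs(data[i + 1] - prevFrame[i + 1]) > PIXEL_DELTA) changed++;
  }
  prevFrame = new Uint8ClampedArray(data);

  return changed / (data.length / 4) > MOTION_RATIO;
}

Run that on a 320×240 canvas every 200 ms and you are diffing about 76,800 pixels five times a second, which is trivial CPU work next to a conv net.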

The gate is two lines in person-detection.ts, at lines 271–272:

// no motion = no inference. the GPU stays asleep.
if (!motionTriggered) return null;

No motion → return null → GPU never wakes up. That single check is the difference between "cool laptop" and "turbofan" for the user.

The accurate two-tier diagram

The architecture SVG on safeos.sh shows the gate now (we shipped the diagram update along with this post). The flow:

  1. Camera, microphone, and reference photos feed into a per-frame screening pill
  2. Screening runs continuously: pixel-diff motion at 200 ms, FFT audio analysis at 100 ms, pixel-change tracking. All cheap. All on the CPU.
  3. Only when something crosses threshold does the frame get handed to TF.js for COCO-SSD detection
  4. Ambiguous detections fall through to ViT-base (Transformers.js) as a tie-breaker
  5. Confident detections feed the alert engine; everything else gets dropped on the floor (the tiered dispatch is sketched below)
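In code, the tiers look roughly like the sketch below. handleFrame, classifyWithViT, and both confidence cutoffs are names and values I am assuming for illustration; only the gate itself and the COCO-SSD call reflect what the post describes.

// Sketch of the tiered dispatch (hypothetical names and thresholds;
// the real logic lives in person-detection.ts).
import * as cocoSsd from '@tensorflow-models/coco-ssd';

// Placeholder for the Transformers.js ViT-base tie-breaker (assumed signature).
declare function classifyWithViT(
  frame: HTMLCanvasElement,
  detection: cocoSsd.DetectedObject
): Promise<cocoSsd.DetectedObject | null>;

const CONFIDENT = 0.7; // assumed cutoff: straight to the alert engine
const AMBIGUOUS = 0.4; // assumed cutoff: worth a ViT tie-break

let model: cocoSsd.ObjectDetection | null = null;

async function handleFrame(frame: HTMLCanvasElement, motionTriggered: boolean) {
  if (!motionTriggered) return null;             // tier 0: the gate

  model ??= await cocoSsd.load();                // load once, on first triggered frame
  const detections = await model.detect(frame);  // tier 1: COCO-SSD

  for (const d of detections) {
    if (d.class !== 'person') continue;          // person detection only
    if (d.score >= CONFIDENT) return d;          // confident: feed the alert engine
    if (d.score >= AMBIGUOUS) return classifyWithViT(frame, d); // tier 2: ViT
  }
  return null;                                   // everything else: dropped on the floor
}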

Why I built it this way

Three reasons, ordered by how much they actually mattered at design time.

Battery and thermal. Phones have thermal envelopes. A laptop without a fan (an M-series MacBook Air, a Chromebook, anything passive) has even less. COCO-SSD at 30 FPS sustained for 8 hours overnight melts that envelope. Gating the inference means the GPU stays in its idle state for hours at a time. The fan doesn't spin up. The phone stays in your pocket without burning your leg.

Honest signal. Pixel-diff motion + audio FFT + pixel-change counting handle 90% of the "is something happening?" question already. Adding CV inference on top of that is for the part where you want to know what happened — was that motion a person, a pet, the curtain blowing? You don't need a 5 MB model running 24/7 to answer that question. You need it running for the few seconds around an event.
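The audio half of that screening is equally small. The sketch below uses the Web Audio AnalyserNode with the fftSize of 256 listed in the numbers section; the loudness threshold is an assumed value, not SafeOS's.

// Cheap audio screening via the Web Audio API (fftSize matches the
// constant listed below; AUDIO_THRESHOLD is an assumption).
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 256; // yields 128 frequency bins

async function attachMic(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  audioCtx.createMediaStreamSource(stream).connect(analyser);
}

const AUDIO_THRESHOLD = 140; // assumed: mean bin magnitude that counts as an event

function audioTriggered(): boolean {
  const bins = new Uint8Array(analyser.frequencyBinCount); // 128 bins
  analyser.getByteFrequencyData(bins);
  const mean = bins.reduce((sum, v) => sum + v, 0) / bins.length;
  return mean > AUDIO_THRESHOLD;
}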

Same idea as agent tool-forging. The AgentOS runtime, the Frame project I work on most days, uses the same "don't do the expensive thing until you have to" principle: agents create Zod-validated tools on demand in a sandboxed V8 context, not preemptively. Gated CV inference in SafeOS is the same move, applied to a different domain.
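The same move even applies to fetching the model itself: don't pay the ~5 MB download until the first gated frame needs it. A sketch, assuming a getModel helper that isn't in the real codebase:

import * as cocoSsd from '@tensorflow-models/coco-ssd';

// Lazy, memoized model load: nothing is fetched or warmed up until the
// first frame actually passes the motion gate. (Illustrative helper.)
let modelPromise: Promise<cocoSsd.ObjectDetection> | null = null;

function getModel(): Promise<cocoSsd.ObjectDetection> {
  modelPromise ??= cocoSsd.load({ base: 'lite_mobilenet_v2' }); // quantized base
  return modelPromise;
}

Caching the promise rather than the loaded model avoids a duplicate download when two frames trigger back to back.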

The numbers

  • MOTION_INTERVAL = 200 ms (5 FPS screening)
  • AUDIO_INTERVAL = 100 ms (10 FPS FFT)
  • FRAME_INTERVAL = 1000 ms (1 Hz gated capture)
  • analyserRef.fftSize = 256 (FFT window size; 128 frequency bins)
  • Detection canvas: 320×240 (model input size)
  • COCO-SSD: ~5 MB quantized, 10–30 FPS achievable on WebGL/WebGPU when it runs
  • ViT-base: ~89 MB quantized, tie-breaker only
  • Lost & Found: 32-bin color histogram + 8×8 Sobel grid < 1 KB / photo

What I learned shipping the wrong copy

Two things. First, marketing copy and engineering reality drift apart fast when nobody re-reads them side by side. The fix isn't to be more careful — careful people still ship wrong copy — it's to re-audit the marketing against the code every time the code changes meaningfully.

Second: cheap, classical signal analysis is underrated. The thing the LLM-era AI industry undersells is that 90% of "intelligence" in a real-time pipeline is non-AI math — pixel-diff, FFT bins, motion vectors. The 10% that's actual deep learning matters a lot, but only when you need it. SafeOS's gate isn't a fancy technique. It's a boolean and an early return. It is also the single most important design decision in the whole codebase.

The corrected landing copy and the updated architecture diagram are live now. The full pipeline is open source at github.com/framersai/safeos. If you find another lie in the marketing copy, please file an issue. The honest version is always better.

— Johnny Dunn, Frame