Smart Detect Now Sees You: How Vision Tracking Took Us to 99% Accuracy
When we launched Smart Detect, it already worked well. Speech-to-text matched your words against the script. Voice activity detection knew when you stopped talking. For most lines, the system advanced smoothly — like a good scene partner who's following along.
But "most lines" isn't good enough for a self-tape.
The problem with long lines
If you've ever recorded a monologue with Smart Detect, you might have noticed something: on very long lines, the system would occasionally jump ahead before you finished. Not by a lot. Maybe a few words. But when you're in the zone, delivering an emotional speech, having the scene partner cut in early breaks the take.
The issue came down to how speech matching works. Once the system has heard enough of your line, it considers it complete. On short lines, "enough" is basically all of it. On a long monologue, the math works against you — hearing most of a lengthy speech still leaves words unspoken.
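To make the word-count failure mode concrete, here's a minimal sketch of a coverage-ratio completion check. The function name and the 0.9 threshold are hypothetical, purely for illustration; the real Smart Detect thresholds aren't published.

```python
def line_complete(matched_words: int, total_words: int,
                  threshold: float = 0.9) -> bool:
    """Treat a line as complete once a fraction of its words
    has been matched against the script.
    (Illustrative only -- not Smart Detect's actual tuning.)"""
    return matched_words / total_words >= threshold

# On a 10-word line, clearing a 0.9 threshold leaves at most
# ~1 word unspoken. On a 72-word monologue, 65 matched words
# also clear it -- with up to 7 words still unspoken.
```

That gap is exactly the "jump ahead by a few words" behavior: the same ratio that's harmless on a short line leaves a whole phrase hanging on a long one.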
We needed a signal that was fundamentally different from counting words.
Adding eyes
The answer turned out to be surprisingly literal: watch the actor's mouth.
Your webcam is already running during self-tapes. We're already capturing video. So we added a real-time face tracking system that analyzes your facial movements on the fly. When speech-to-text says "the line might be done" and voice activity detection says "the actor stopped making sound," vision tracking adds a third vote. Three independent signals, all agreeing. That's when the system advances.
Why three signals matter
Each signal has blind spots:
- Speech-to-text knows what you said but not when you truly stopped. It can decide you're done before you actually are.
- Voice activity detection knows when sound stops but not what was said. A mid-line breath registers as silence.
- Vision tracking knows your mouth closed but not why. You might pause to swallow or take a dramatic beat.
No single signal is reliable enough on its own. But together, they cover each other's weaknesses. The result is a system that's remarkably hard to fool — and remarkably good at knowing when you're actually done.
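The three-vote gate described above can be sketched in a few lines. The signal names here are made up for illustration; they stand in for whatever the real detectors report.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    # Hypothetical signal names, illustrative only.
    stt_line_done: bool   # speech-to-text: enough of the line matched
    vad_silence: bool     # voice activity detection: sound has stopped
    mouth_closed: bool    # vision tracking: lips have stopped moving

def should_advance(s: Signals) -> bool:
    """Advance only when all three independent signals agree."""
    return s.stt_line_done and s.vad_silence and s.mouth_closed

# A mid-line breath: VAD reports silence, but the words aren't
# all matched yet, so the system holds instead of advancing.
```

Requiring unanimity is what covers the blind spots: any single detector can be fooled, but each false positive gets vetoed by a signal that measures something different.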
Protecting dramatic pauses
Actors pause. That's the craft. A long pause after "I loved you" lands differently than rushing to the next line. We had to make sure vision tracking didn't punish natural performance choices.
We built graduated protections that adapt based on line length and context. Short lines behave snappily. Long monologues are given more patience. The system accounts for mid-line pauses, breath marks, and dramatic beats — requiring enough evidence from multiple signals before it'll advance on a long speech.
The details are tuned per line length, but the principle is simple: the longer the line, the more conservative the system becomes.
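As a rough sketch of that principle, grace periods can scale with line length. The word-count bands and durations below are invented for illustration; the shipped tuning is different and per-line.

```python
def hold_time_seconds(word_count: int) -> float:
    """Longer lines earn a longer grace period before the system
    will advance. (Illustrative numbers, not actual tuning.)"""
    if word_count <= 8:     # short, snappy lines
        return 0.4
    if word_count <= 25:    # medium lines
        return 0.8
    return 1.5              # monologues: maximum patience
```

The shape matters more than the numbers: a dramatic beat that would be suspicious on a five-word line is completely normal in the middle of a 72-word monologue.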
The last-line safeguard
During testing, we discovered one more edge case: the last line of a scene.
On every other line, if the system resolves slightly early, it's recoverable — the scene partner's next line plays, and you get your cue naturally. Nobody notices. But on the last line, early resolution means the recording stops. Your final words could be cut off.
Our solution: after the last line resolves, the recording keeps running for a few extra seconds. You can finish speaking, hold your final beat — which is standard practice for professional self-tapes anyway — and the recording captures everything. Hit Stop Recording whenever you're ready.
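The safeguard amounts to a recording tail after the final line resolves. Here's a minimal sketch; the function name, the injectable `stop_recording` callback, and the three-second duration are all assumptions for illustration.

```python
import time

TAIL_SECONDS = 3.0  # illustrative duration, not the shipped value

def finish_after_last_line(stop_recording,
                           tail_seconds: float = TAIL_SECONDS,
                           sleep=time.sleep):
    """After the final line resolves, keep the recording running
    for a grace period so the actor's last words and final beat
    are captured, then stop."""
    sleep(tail_seconds)  # recording continues during this window
    stop_recording()
```

In practice the actor can also stop manually at any point during the tail, which is why the UI keeps the Stop Recording button live.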
Testing across genres
We didn't ship this based on a single script. Our testing covered:
- Silver Linings Playbook — 48 lines including a 72-word monologue. The line that originally broke our system now resolves perfectly every time.
- Shameless — 37 lines of fast-paced, emotionally charged dialogue with short punchy lines.
- Five Easy Pieces — The iconic diner scene with its long, cascading orders and rapid-fire exchanges.
- Red Tails — Military drama with long speeches and emotional monologues.
Across all scripts: 100% pass rate. No premature advances. No missed lines. No stepping on the actor's words.
What's next
Smart Detect is now the most accurate it's ever been — three detection signals working together, with adaptive protections for different line lengths and scene contexts.
We're building a per-line sensitivity adjustment that will let you fine-tune timing after a recording. If a specific line advanced too quickly or lingered too long, you'll be able to dial the sensitivity for that line and re-record. Think of it as having a conversation with your scene partner about pacing — "hold a beat longer on that one."
We're also working on a session-wide sensitivity setting that controls overall detection aggressiveness: Cautious, Balanced, or Aggressive. Different scenes call for different approaches, and you should be in control.
For now, try running a self-tape with Smart Detect and see how it feels. We think you'll notice the difference — especially on those long lines that used to give the system trouble.
The goal was always to make technology invisible. To make the AI scene partner feel like a real person following along. With vision tracking, we're closer than ever.