auditionpartner.ai
February 26, 2026 · 6 min read

How We're Teaching AI to Listen Like a Scene Partner

By The AuditionPartner.ai Team
engineering · smart-detect · behind-the-scenes

You're running a self-tape at 11pm. It's take seven. Your reader — our AI scene partner — has been feeding you cues all evening. You land the final line of a gut-punch monologue, hold the beat… and the app doesn't advance. The moment is gone.

That kind of failure is exactly what Smart Detect exists to prevent.

What Smart Detect actually does

When you rehearse a scene on AuditionPartner.ai, you don't have to tap a button every time you finish a line. Smart Detect listens through your microphone and figures out when you've completed your scripted dialogue — then automatically cues your scene partner's next line.

Think of it like a really attentive stage manager: one who's following along in the script and can tell when you've delivered your lines, even if you didn't say them word-for-word.

The concept sounds simple. The reality is anything but.

Why this is harder than it sounds

Under the hood, Smart Detect uses your browser's speech-to-text (STT) engine to convert your voice into words in real time. Then it compares those words against the script to decide: did the actor finish their line?

Here's where it gets interesting. Speech-to-text is imperfect, and actors — being human — don't deliver lines like robots. The script might say:

"I don't know what you're talking about."

But what the actor actually says (and what the STT engine hears) could be:

  • "I dunno what you're talkin about"
  • "I don't know what you're... talking about"
  • "Um, I don't know what you're talking about"
  • "I do not know what you are talking about"

Every one of those should count as the line being delivered. A rigid word-for-word comparison would fail on all of them.

Building a fuzzy ear

Our first job was teaching the system to be flexible about what counts as a "match." We built a normalization pipeline that runs on both the script text and the spoken text before comparing them:

  • Contractions — "don't" and "do not" are treated as the same thing, using a dictionary of 46 common contractions
  • Numbers — "42" matches "forty two" and vice versa
  • Punctuation and case — stripped away entirely, because speech has no punctuation
  • Stage directions — things like (Beat) or (angrily) in the script are removed before matching, since actors don't speak those out loud
  • Filler words — "um," "uh," "er," and "ah" in the spoken text are ignored, because everyone uses them but nobody scripts them

After normalization, we compare individual words using fuzzy matching: we compute the edit distance (also known as Levenshtein distance), the minimum number of character insertions, deletions, or substitutions needed to turn one word into the other. "Talkin" is one edit away from "talking." Close enough.
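The word-level comparison can be sketched with the classic dynamic-programming edit distance. The one-edit tolerance below is an illustrative choice, not necessarily the product's actual threshold.

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via the classic dynamic program: the minimum
    # number of single-character edits turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]

def words_match(a: str, b: str, max_edits: int = 1) -> bool:
    # "talkin" is one edit from "talking": close enough to count.
    return edit_distance(a, b) <= max_edits
```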

The 97% problem

To make sure all this actually worked, we didn't just test a handful of examples. We built a test harness — a system that automatically generates thousands of realistic test scenarios and checks that Smart Detect handles them correctly.

Here's how it works: we took 2,755 real dialogue lines from 97 different scripts (everything from indie dramas to network procedurals). For each line, we generated 13 different "mutations" that simulate real STT behavior — filler words added, contractions expanded, words slightly garbled, sentences cut off partway through. That's 35,837 test cases.
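To give a flavor of what a mutation generator looks like, here are three mutations in the spirit of the harness. The post doesn't enumerate all 13 types, so these implementations are assumptions, not the real suite.

```python
import random

def add_filler(words: list[str], rng: random.Random) -> list[str]:
    # STT often picks up an "um" the script never had.
    i = rng.randrange(len(words) + 1)
    return words[:i] + ["um"] + words[i:]

def garble(words: list[str], rng: random.Random) -> list[str]:
    # Drop a trailing character: "talking" -> "talkin".
    i = rng.randrange(len(words))
    w = words[i]
    return words[:i] + [w[:-1] if len(w) > 1 else w] + words[i + 1:]

def truncate(words: list[str], rng: random.Random) -> list[str]:
    # Simulate the line being cut off partway through.
    return words[:max(1, len(words) // 2)]

def mutate(line: str, seed: int = 0) -> list[list[str]]:
    rng = random.Random(seed)  # seeded so harness runs are reproducible
    words = line.lower().split()
    return [m(list(words), rng) for m in (add_filler, garble, truncate)]
```

Each mutated transcript is then fed through the matcher, and the harness checks that the line is still detected as complete (or, for truncations, correctly held open).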

When we first ran the harness, Smart Detect passed 97.7% of them. That sounds pretty good — until you think about what it means in practice.

A typical scene has maybe 20 lines of dialogue for your character. At 97.7% accuracy, you'd expect a missed cue roughly every other full run-through. For a tool that actors need to trust with their late-night, high-pressure self-tape sessions? Not good enough.
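Back-of-the-envelope, assuming each line's detection is independent:

```python
lines = 20
for accuracy in (0.977, 0.985):
    expected_misses = lines * (1 - accuracy)  # missed cues per run-through
    p_any_miss = 1 - accuracy ** lines        # chance a run has >= 1 miss
    print(f"{accuracy:.1%}: {expected_misses:.2f} expected misses, "
          f"{p_any_miss:.0%} chance of at least one")
```

At 97.7%, that's about 0.46 missed cues per run-through; at 98.5%, about 0.30, or roughly one per three runs.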

Finding the best alignment

The culprit was how our matching algorithm read through the script. The original version worked like a human reading left to right: it would find the first matching word, then look for the next one after that, and so on. This works fine most of the time — but it falls apart when the same common word appears multiple times in a line.

Consider a line like "to be or not to be." Suppose the STT engine hears a stray "be" at the start. The scan matches it against the first "be" in the script, then matches the real "to" against the second "to" near the end, skipping right past "or not." When those words arrive in the spoken text, they can't find a match: their positions are "behind" where the scan is looking, and it never backs up.
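Here's the failure mode in miniature. This is a deliberately simplified model of the left-to-right scan, not the production matcher:

```python
def greedy_count(spoken: list[str], script: list[str]) -> int:
    # First-match scan: each spoken word consumes the earliest script
    # position after the previous match, and can never look behind.
    count, pos = 0, 0
    for word in spoken:
        for i in range(pos, len(script)):
            if script[i] == word:
                count, pos = count + 1, i + 1
                break
    return count

script = "to be or not to be".split()
# A stray "be" at the start sends the scan leapfrogging down the script.
spoken = "be to be or not to be".split()
print(greedy_count(spoken, script))  # 3 -- half the line goes unmatched
print(greedy_count(script, script))  # 6 -- a clean take matches fully
```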

The fix was to replace this left-to-right scan with something smarter: a technique from computer science called Longest Common Subsequence (LCS). Instead of reading word by word, LCS considers every possible alignment between the spoken text and the script simultaneously, then picks the one that matches the most words.

Think of it as the difference between a stage manager who's reading the script with their finger tracking line by line (and occasionally losing their place) versus one who has the entire script memorized and can instantly tell you how much of a line has been delivered, no matter what order the words came in.

We added fuzzy matching and compound word awareness into the LCS algorithm, so it still handles contractions, garbled words, and all the other messy realities of live speech.
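A sketch of the LCS dynamic program with fuzzy word equality folded in. The compound-word handling and tuned thresholds are omitted, and the one-edit tolerance is an illustrative assumption:

```python
def edits(a: str, b: str) -> int:
    # Compact Levenshtein distance for fuzzy word equality.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lcs_matched(spoken: list[str], script: list[str],
                max_edits: int = 1) -> int:
    # best[i][j] = most words matchable between spoken[:i] and
    # script[:j], considering every possible alignment at once.
    best = [[0] * (len(script) + 1) for _ in range(len(spoken) + 1)]
    for i, sw in enumerate(spoken, 1):
        for j, tw in enumerate(script, 1):
            if edits(sw, tw) <= max_edits:
                best[i][j] = best[i - 1][j - 1] + 1
            else:
                best[i][j] = max(best[i - 1][j], best[i][j - 1])
    return best[-1][-1]

script = "to be or not to be".split()
print(lcs_matched("be to be or not to be".split(), script))  # 6: nothing stranded
print(lcs_matched("to be or not to b".split(), script))      # 6: "b" ~ "be"
```

Unlike the greedy scan, the stray opening "be" costs nothing: LCS simply picks the alignment that matches the most words.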

The results

After the LCS rewrite:

  • Pass rate: 98.5% (up from 97.7%)
  • False positives: 0 — the system never incorrectly thinks you've finished a line when you haven't

That remaining 1.5%? Almost entirely cases where an actor has genuinely only said a small fraction of their line — situations where the system should wait for more. The algorithm is working correctly; those aren't failures, they're the system being appropriately cautious.

For a 20-line scene, 98.5% accuracy means you'd expect a missed cue roughly once every three full run-throughs, rather than every other one. And with the improvements we're building next, we expect to push that even further.

What's next

Speech recognition tells us what the actor said. But there's another signal we're not using yet: when they stopped saying it.

The next layer of Smart Detect will use voice activity detection — tracking not just words, but the silences between them. When the speech signal goes quiet after enough of the line has been delivered, that's a strong second indicator that the actor is done.

Two independent signals — text matching and silence detection — give us redundancy. Even if one system isn't sure, the other can confirm. It's the same principle behind why theaters have both a stage manager and a cue light system.
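Combining the two signals might look something like the gate below. This feature isn't built yet, so the function name and both thresholds are hypothetical placeholders, not product values:

```python
def should_advance(match_fraction: float, silence_ms: float,
                   min_match: float = 0.8,
                   min_silence_ms: float = 700) -> bool:
    # Advance the cue only when enough of the line has matched AND the
    # mic has gone quiet. Thresholds are illustrative guesses.
    return match_fraction >= min_match and silence_ms >= min_silence_ms
```

Requiring both signals is what keeps false positives at zero: a long pause mid-line doesn't advance the cue, and a fluke text match during continued speech doesn't either.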

Why we're sharing this

We believe the tools actors use should be built with the same care actors bring to their craft. Smart Detect isn't just a feature checkbox — it's the difference between a rehearsal tool you tolerate and one you trust.

We're sharing the engineering behind it because transparency matters. When you're running a self-tape at 11pm and the app just works — advancing your cues at exactly the right moment — you shouldn't have to wonder if it's reliable. You should know the work that went into making it so.

We're building AuditionPartner.ai in public, one algorithm at a time. If you're the kind of actor who's curious about the tools behind the craft, stay tuned. There's a lot more to share.