April 5, 2026 · 4 min read

Building Tape Composition: Stitching Self-Tapes Without Leaving the Browser

By The AuditionPartner.ai Team

engineering, product-update, self-tape

Every actor knows the drill. You set up your camera, record your slate — "Hi, I'm Jane Smith reading for the role of Pat" — then launch into the scene. If the scene doesn't go well, you do it again. And again. And every single time, you re-record the slate too.

It's tedious. The slate is the same every time. But traditional self-tape workflows don't separate them — you hit record, do the slate, do the scene, and whatever comes out is your take. If you want to swap the slate or pick a different scene performance, you need editing software.

We wanted to fix that.

Separating slate from scene

The first step was simple in concept: let actors record their slate as a standalone clip. Press record, introduce yourself, stop. That clip lives on its own. Then do the scene separately — as many takes as you want — without ever touching the slate again.

The UI mirrors what actors already know. Camera preview, countdown, record, review. The difference is that when you're done, you have a library of slates and a library of scene takes, each labeled with its take number. Slate Take 1. Scene Take 3. Familiar language for anyone who's been on a set.

The composition problem

Having separate clips is only useful if you can combine them. And that's where things get interesting.

Video files aren't like text files: you can't just append one to another. Different browsers record in different formats. Safari records MP4 with H.264 video by default. Chrome and Firefox default to WebM with VP8 or VP9. If an actor records their slate in one browser and their scene in another, the two clips use different containers and different codecs.
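As a rough illustration of the mismatch, here is a minimal Python sketch that splits a recording's MIME type into container and codecs. It assumes the client sends along the MIME string the browser reported for the recording. The `RecordingFormat` class and `parse_mime_type` function are hypothetical names for this example, not part of our API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingFormat:
    container: str           # e.g. "webm" or "mp4"
    codecs: tuple[str, ...]  # e.g. ("vp9", "opus"); empty if none declared


def parse_mime_type(mime_type: str) -> RecordingFormat:
    """Split a MIME type like 'video/webm;codecs=vp9,opus' into its parts."""
    media_type, _, params = mime_type.partition(";")
    container = media_type.strip().lower().split("/")[-1]

    codecs: tuple[str, ...] = ()
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "codecs":
            # The codecs list may be wrapped in double quotes.
            value = value.strip().strip('"')
            codecs = tuple(
                c.strip().lower() for c in value.split(",") if c.strip()
            )
    return RecordingFormat(container, codecs)


slate = parse_mime_type('video/mp4; codecs="avc1.42E01E, mp4a.40.2"')
scene = parse_mime_type("video/webm;codecs=vp9,opus")
same_container = slate.container == scene.container  # False for this pair
```

A MIME string is only a hint. Inspecting the file itself is more reliable before deciding how to stitch.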

We needed a system that could take any two recordings, regardless of format, and produce a single clean video.

Server-side composition with FFmpeg

The answer was FFmpeg, the Swiss Army knife of video processing. When an actor picks a slate and a scene in our compose dialog, the request goes to our API. We spin up a background task that downloads both recordings, determines whether they need re-encoding, and stitches them together.
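The "does this pair need re-encoding?" decision can be sketched roughly as follows. This is an illustrative example, not our production code. It assumes each recording has already been inspected with `ffprobe -v error -show_entries stream=codec_type,codec_name,width,height -of json <file>` and the JSON parsed into a dict. The function names, and the choice of which fields to compare, are assumptions made for the example.

```python
def stream_signature(probe: dict) -> tuple:
    """Reduce parsed ffprobe JSON to the fields compared in this sketch."""
    streams = probe["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")
    # Audio may be absent; represent that as None instead of failing.
    audio = next((s for s in streams if s["codec_type"] == "audio"), None)
    return (
        video["codec_name"],
        video["width"],
        video["height"],
        audio["codec_name"] if audio else None,
    )


def needs_reencoding(slate_probe: dict, scene_probe: dict) -> bool:
    """True when the two recordings differ in codec or resolution."""
    return stream_signature(slate_probe) != stream_signature(scene_probe)
```

A real check may need to compare more fields, such as pixel format, frame rate, and audio sample rate, before trusting a stream copy.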

If both recordings share a format (say, two WebM files from Chrome with the same codecs and resolution), we use FFmpeg's concat demuxer with stream copy. No re-encoding, near-instant, minimal server load.
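For readers who haven't used the concat demuxer, here is a minimal sketch of the fast path. The demuxer reads a small text file listing the inputs, and `-c copy` tells FFmpeg to copy the streams without re-encoding. The helper names and file paths are hypothetical. The sketch only builds the list text and the argument list, so actually running FFmpeg is left to the caller.

```python
def concat_list_text(paths: list[str]) -> str:
    """Build the list file the concat demuxer reads, one 'file' line per clip."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_copy_command(list_path: str, output_path: str) -> list[str]:
    """Arguments for a stream-copy concat: no decoding, no encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # allow absolute paths in the list file
        "-i", list_path,
        "-c", "copy",     # copy audio and video streams as-is
        output_path,
    ]
```

The caller would write `concat_list_text([...])` to a temporary file and pass the command to something like `subprocess.run(cmd, check=True)`.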

If the formats differ — an MP4 slate from Safari paired with a WebM scene from Chrome — we normalize both to a consistent format before concatenating. It takes a bit longer, but the result is a universally playable video that works everywhere.
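One way to sketch the slower path is with FFmpeg's concat filter, which decodes both inputs, brings them to a common frame size, and encodes a single output. The target values here (1280x720, 30 fps, H.264 video with AAC audio in MP4) are assumptions for illustration, not a description of our actual settings. The sketch also assumes each input has exactly one video and one audio stream.

```python
def normalize_concat_command(
    slate: str,
    scene: str,
    output: str,
    width: int = 1280,
    height: int = 720,
    fps: int = 30,
) -> list[str]:
    """Arguments that re-encode two clips into one H.264/AAC MP4."""

    def video_chain(index: int) -> str:
        # Fit inside the target frame, pad the remainder, fix the frame rate.
        return (
            f"[{index}:v]scale={width}:{height}:"
            f"force_original_aspect_ratio=decrease,"
            f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
            f"setsar=1,fps={fps}[v{index}]"
        )

    filter_graph = ";".join([
        video_chain(0),
        video_chain(1),
        # Join the two segments; each contributes one video and one audio stream.
        "[v0][0:a][v1][1:a]concat=n=2:v=1:a=1[v][a]",
    ])
    return [
        "ffmpeg", "-y",
        "-i", slate,
        "-i", scene,
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "[a]",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-movflags", "+faststart",  # move the index up front for web playback
        output,
    ]
```

Padding rather than cropping keeps the whole frame visible when the slate and scene were shot at different aspect ratios, at the cost of black bars.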

The composed video gets uploaded back to storage and sent to our video pipeline for processing. Within a minute or two, the actor has a polished MP4 ready to download and submit.

Take numbers: small detail, big difference

While building this, we realized that recording timestamps are meaningless to actors. "Apr 4, 10:46 PM" doesn't tell you which take was the good one. But "Scene Take 3" does — instantly.

We added take numbers that accumulate per project, per recording type. Your first scene is Scene Take 1, your third slate is Slate Take 3, your composed final is Composed Take 1. It's the language of filmmaking, and it makes the compose dialog actually usable — you can tell at a glance which recordings you're combining.
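The numbering rule is simple enough to sketch in a few lines. This in-memory version is only an illustration. The `TakeCounter` class is a hypothetical name, and a real system would keep the counts in a database so they survive restarts.

```python
from collections import defaultdict


class TakeCounter:
    """Hands out take numbers per (project, recording type) pair."""

    def __init__(self) -> None:
        self._counts: dict[tuple[str, str], int] = defaultdict(int)

    def next_label(self, project_id: str, recording_type: str) -> str:
        key = (project_id, recording_type)
        self._counts[key] += 1
        return f"{recording_type.capitalize()} Take {self._counts[key]}"


counter = TakeCounter()
first_scene = counter.next_label("project-a", "scene")
second_scene = counter.next_label("project-a", "scene")
first_slate = counter.next_label("project-a", "slate")
```

Because the key includes both the project and the recording type, slates, scenes, and composed tapes each count up independently within a project.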

Inline preview

The compose dialog lets you hover over any recording thumbnail and preview it right there — a tiny inline video player that lets you scrub through a few seconds to jog your memory. No need to open each recording in a separate player to figure out which take was the one where you nailed the emotional beat.

What's next

Tape composition is the foundation for something bigger. Right now, you pick a slate and a scene and get a combined video. But the architecture we built — server-side video processing with format normalization — opens the door to more sophisticated editing: trimming heads and tails, adding title cards, adjusting audio levels.

For now, though, the simple version solves the real problem: record your slate once, pair it with your best take, and submit a polished self-tape without ever opening an editing app.