Animated captions, burned-in

Render animated, per-word captions directly into the MP4 from a transcript JSON file.

Ship per-word animated captions in fifteen minutes. No SRT muxing, no caption library, no font-loading flake — captions are just DOM nodes, animated by CSS, rendered into the frame.

What you'll learn

Mapping a transcript JSON into timed <span>s
Driving the animation with data-start and data-duration
A tiny build step that goes from Whisper output to ready-to-render HTML

The result

0.00s

Per-word reveal driven entirely by CSS animation-delay.

The transcript

Whisper, Deepgram, AssemblyAI — they all emit something close to this shape:

json

{
  "words": [
    { "text": "Captions", "start": 0.10, "end": 0.42 },
    { "text": "are",      "start": 0.45, "end": 0.58 },
    { "text": "just",     "start": 0.80, "end": 1.02 },
    { "text": "DOM",      "start": 1.15, "end": 1.40 },
    { "text": "nodes",    "start": 1.50, "end": 1.78 },
    { "text": "now.",     "start": 1.85, "end": 2.20 }
  ]
}

The build step

A short script turns the transcript into spans with timing baked in:

// build-captions.mjs
import { readFileSync, writeFileSync } from "node:fs";
const { words } = JSON.parse(readFileSync("transcript.json", "utf8"));

const spans = words.map(w => {
  const duration = (w.end - w.start).toFixed(3);
  return `<span class="w"
    data-start="${w.start}"
    data-duration="${duration}"
    style="animation-delay:${w.start}s">${w.text}</span>`;
}).join("\n  ");

const html = readFileSync("template.html", "utf8").replace("{{CAPTIONS}}", spans);
writeFileSync("out.html", html);

The template is a plain document with one slot:

html

<!doctype html>
<html>
<head>
<style>
  body { margin:0; height:100vh; background:#000; color:#fff;
         font: 700 56px/1.2 ui-sans-serif, system-ui;
         display:grid; place-items:end center; padding-bottom:80px; }
  .cap { max-width:80vw; text-align:center; text-wrap: balance; }
  .w { opacity:0; display:inline-block; margin:0 6px;
       animation: pop .35s cubic-bezier(.2,.9,.2,1) both; }
  @keyframes pop {
    from { opacity:0; transform: translateY(8px) scale(.96); }
    to   { opacity:1; transform:none; }
  }
</style>
</head>
<body>
  <p class="cap">{{CAPTIONS}}</p>
</body>
</html>

Render:

bash

node build-captions.mjs
hyperframes render out.html --out captioned.mp4 --crf 18 --workers 4

If the captions sit on top of a video clip, both share the same timeline. The clip element uses data-start / data-duration for its own timing; captions use data-start for their per-word reveal. No coordination required — both read from frame-pinned time.

html

<video class="clip" src="hero.webm"
       data-start="0" data-duration="8"></video>

<p class="cap" data-start="0" data-duration="8">
  <span class="w" data-start="0.10" style="animation-delay:0.10s">Captions</span>
  ...
</p>

Styling that survives platform compression

Three things make burned-in captions readable after TikTok's compressor mangles them:

A semi-opaque shadow under each word: text-shadow: 0 2px 8px rgba(0,0,0,.6);
A minimum font size of ~48px at 1080p.
Avoid pure white on bright backgrounds — #f8fafc reads as white and bands less.

Tweak it

Swap pop for slide-up if your brand voice is calmer.
Highlight the active word by giving it a different color via :nth-child matched to playback time.
Add data-fade="out:0.2" to fade the whole caption block as the clip ends.

Animated captions, burned-in

Render animated, per-word captions directly into the MP4 from a transcript JSON file.

Ship per-word animated captions in fifteen minutes. No SRT muxing, no caption library, no font-loading flake — captions are just DOM nodes, animated by CSS, rendered into the frame.

What you'll learn

Mapping a transcript JSON into timed <span>s
Driving the animation with data-start and data-duration
A tiny build step that goes from Whisper output to ready-to-render HTML

The result

0.00s

Per-word reveal driven entirely by CSS animation-delay.

The transcript

Whisper, Deepgram, AssemblyAI — they all emit something close to this shape:

json

{
  "words": [
    { "text": "Captions", "start": 0.10, "end": 0.42 },
    { "text": "are",      "start": 0.45, "end": 0.58 },
    { "text": "just",     "start": 0.80, "end": 1.02 },
    { "text": "DOM",      "start": 1.15, "end": 1.40 },
    { "text": "nodes",    "start": 1.50, "end": 1.78 },
    { "text": "now.",     "start": 1.85, "end": 2.20 }
  ]
}

The build step

A short script turns the transcript into spans with timing baked in:

// build-captions.mjs
import { readFileSync, writeFileSync } from "node:fs";
const { words } = JSON.parse(readFileSync("transcript.json", "utf8"));

const spans = words.map(w => {
  const duration = (w.end - w.start).toFixed(3);
  return `<span class="w"
    data-start="${w.start}"
    data-duration="${duration}"
    style="animation-delay:${w.start}s">${w.text}</span>`;
}).join("\n  ");

const html = readFileSync("template.html", "utf8").replace("{{CAPTIONS}}", spans);
writeFileSync("out.html", html);

The template is a plain document with one slot:

html

<!doctype html>
<html>
<head>
<style>
  body { margin:0; height:100vh; background:#000; color:#fff;
         font: 700 56px/1.2 ui-sans-serif, system-ui;
         display:grid; place-items:end center; padding-bottom:80px; }
  .cap { max-width:80vw; text-align:center; text-wrap: balance; }
  .w { opacity:0; display:inline-block; margin:0 6px;
       animation: pop .35s cubic-bezier(.2,.9,.2,1) both; }
  @keyframes pop {
    from { opacity:0; transform: translateY(8px) scale(.96); }
    to   { opacity:1; transform:none; }
  }
</style>
</head>
<body>
  <p class="cap">{{CAPTIONS}}</p>
</body>
</html>

Render:

bash

node build-captions.mjs
hyperframes render out.html --out captioned.mp4 --crf 18 --workers 4

Pairing captions with a clip

html

<video class="clip" src="hero.webm"
       data-start="0" data-duration="8"></video>

<p class="cap" data-start="0" data-duration="8">
  <span class="w" data-start="0.10" style="animation-delay:0.10s">Captions</span>
  ...
</p>

Styling that survives platform compression

Three things make burned-in captions readable after TikTok's compressor mangles them:

A semi-opaque shadow under each word: text-shadow: 0 2px 8px rgba(0,0,0,.6);
A minimum font size of ~48px at 1080p.
Avoid pure white on bright backgrounds — #f8fafc reads as white and bands less.

Tweak it

Swap pop for slide-up if your brand voice is calmer.
Highlight the active word by giving it a different color via :nth-child matched to playback time.
Add data-fade="out:0.2" to fade the whole caption block as the clip ends.

Animated captions, burned-in

What you'll learn

The result

The transcript

The build step

Pairing captions with a clip

Styling that survives platform compression

Tweak it

Next

Animated captions, burned-in

What you'll learn

The result

The transcript

The build step

Pairing captions with a clip

Styling that survives platform compression

Tweak it

Next