How to burn subtitles into an MP4 (and why you should)
Burn-in subtitles for video using HTML and CSS — typography, timing, safe-area rules, and the export pipeline. Beats FFmpeg for design control.
Eighty-five percent of social video plays muted. Your video either has subtitles or it has nothing.
The default move is FFmpeg's subtitles filter, which reads an SRT and burns it into the frame. It works. It also looks like 2010, ignores brand typography unless you hand-tune it, and gives you only a handful of levers: the ASS style fields (font, size, colors, outline) you can pass through its force_style option. If your video has any visual identity, you want a different path.
This is the design-controlled burn-in: render subtitles as HTML inside the video template, frame-aligned to the same timeline as everything else. You get the typography you actually want, the line breaks at the right places, and full control over background blocking, kerning, and animation.
What burned-in subtitles actually need
Five things, in order of impact on legibility:
- High-contrast backing. Solid black at 65–80% opacity behind the text. Don't try to read 18pt sans-serif against arbitrary video without a backing — you'll lose 30% of viewers in low-contrast scenes.
- Bottom-third positioning, with safe-area margin. The bottom 15% of the frame, with a 4% margin from the bottom edge.
- Two lines max, ~32 characters per line. Anything longer scrolls or wraps inelegantly.
- Snap to word boundaries on timing. Subtitles that change mid-word read as broken.
- Sans-serif, ~22pt at 1080p, weight 600. Bold enough to read; not so bold it looks like a meme template.
Get those five right and your captions read on any platform without further tuning.
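As a rough sketch, the positioning and type rules above can be derived from the frame size and emitted as CSS. The function name captionLayout and the class names are hypothetical, and two choices here are assumptions rather than anything the rules spell out: "22pt at 1080p" is treated as 22 CSS pixels on a 1080-pixel short side, and the caption block is allowed the bottom 15% of the frame minus the 4% margin.

```typescript
// Hypothetical helper: turns the legibility rules into numbers and a CSS string.
interface CaptionLayout {
  fontSizePx: number;
  bottomMarginPx: number;
  maxHeightPx: number;
  css: string;
}

function captionLayout(
  frameWidth: number,
  frameHeight: number,
  backingOpacity: number = 0.7, // anywhere in the 0.65 to 0.8 range from the list
): CaptionLayout {
  // Assumption: scale type from the short side so 16:9 and 9:16 match.
  const shortSide = Math.min(frameWidth, frameHeight);
  const fontSizePx = Math.round((22 * shortSide) / 1080);
  // 4% safe-area margin from the bottom edge.
  const bottomMarginPx = Math.round(frameHeight * 0.04);
  // Captions stay inside the bottom 15% of the frame, margin included.
  const maxHeightPx = Math.round(frameHeight * 0.15) - bottomMarginPx;
  const css = [
    ".captions {",
    "  position: absolute;",
    "  left: 0;",
    "  right: 0;",
    `  bottom: ${bottomMarginPx}px;`,
    `  max-height: ${maxHeightPx}px;`,
    "  text-align: center;",
    "}",
    ".cap {",
    "  display: inline-block;",
    `  font: 600 ${fontSizePx}px/1.3 sans-serif;`,
    "  color: #fff;",
    `  background: rgba(0, 0, 0, ${backingOpacity});`,
    "  padding: 0.2em 0.6em;",
    "}",
  ].join("\n");
  return { fontSizePx, bottomMarginPx, maxHeightPx, css };
}
```

Calling captionLayout(1920, 1080) gives a 22px font and a 43px bottom margin; captionLayout(1080, 1920) keeps the 22px font and moves the margin to 77px.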
Timing: from transcript to frames
The transcript-to-subtitle pipeline:
- Transcribe with Whisper (or your provider of choice). You get word-level timestamps.
- Chunk into 2–6 word phrases. A line break every 30 characters or every 1.5 seconds, whichever comes first.
- Snap to word boundaries. Never break mid-word.
- Emit [ { text, startMs, endMs } ].
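The chunking steps above can be sketched as a single greedy pass over the word-level timestamps. This is a minimal illustration, not the pipeline's actual code: the names TimedWord, CaptionChunk, and chunkWords are hypothetical, the input shape is assumed to be what your transcriber hands back after normalization, and the 2-word lower bound is left out so the sketch only enforces the upper limits.

```typescript
// Hypothetical shapes for word-level input and chunk output.
interface TimedWord {
  text: string;
  startMs: number;
  endMs: number;
}

interface CaptionChunk {
  text: string;
  startMs: number;
  endMs: number;
}

function chunkWords(
  words: TimedWord[],
  maxChars: number = 30,
  maxDurationMs: number = 1500,
  maxWords: number = 6,
): CaptionChunk[] {
  const chunks: CaptionChunk[] = [];
  let open: CaptionChunk | null = null;
  let wordCount = 0;
  for (const word of words) {
    if (open !== null) {
      // Close the chunk if adding this word would break any limit.
      const tooLong = open.text.length + 1 + word.text.length > maxChars;
      const tooSlow = word.endMs - open.startMs > maxDurationMs;
      if (tooLong || tooSlow || wordCount >= maxWords) {
        chunks.push(open);
        open = null;
      }
    }
    if (open === null) {
      // Chunks always start and end on whole words, never mid-word.
      open = { text: word.text, startMs: word.startMs, endMs: word.endMs };
      wordCount = 1;
    } else {
      open.text += " " + word.text;
      open.endMs = word.endMs;
      wordCount += 1;
    }
  }
  if (open !== null) {
    chunks.push(open);
  }
  return chunks;
}
```

Because chunk boundaries are copied from word timestamps, the "snap to word boundaries" rule holds by construction.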
Inside the HTML template, render the active subtitle by finding the chunk whose [startMs, endMs] contains the current frame's time.
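A minimal sketch of that lookup follows. The names TimedCaption and activeCaption are hypothetical. One assumption to flag: the interval is treated as half-open, so a frame landing exactly on endMs belongs to the next chunk and two back-to-back chunks never both match.

```typescript
// Hypothetical chunk shape matching the emitted array.
interface TimedCaption {
  text: string;
  startMs: number;
  endMs: number;
}

// Returns the chunk active at timeMs, or null during silence.
// Assumption: startMs is inclusive and endMs is exclusive.
function activeCaption(chunks: TimedCaption[], timeMs: number): TimedCaption | null {
  for (const chunk of chunks) {
    if (timeMs >= chunk.startMs && timeMs < chunk.endMs) {
      return chunk;
    }
  }
  return null;
}
```

A linear scan is plenty for a caption track of a few hundred chunks; a binary search over sorted chunks would be the next step if that ever shows up in a profile.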
<div class="captions" data-frame-time="0">
  <!-- The renderer rewrites this per frame -->
  <div class="cap">The codec doesn't care about your brand</div>
</div>
Animation: in and out, not during
The two acceptable animations on a subtitle:
- In: a 120ms fade + 4px slide-up.
- Out: a 120ms fade (no slide).
Anything more elaborate (slot-machine letter reveals, word-by-word color changes) distracts from the speech the captions exist to support. Keep them quiet.
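Since the render is frame-driven, the fade and slide can be computed as a pure function of time rather than left to CSS transitions. The sketch below is one way to do that; captionPose and its field names are hypothetical, and linear easing is an assumption (swap in your own curve).

```typescript
// Hypothetical per-frame pose for one caption.
interface CaptionPose {
  opacity: number;
  translateYPx: number;
}

function captionPose(
  timeMs: number,
  startMs: number,
  endMs: number,
  fadeMs: number = 120,
  slidePx: number = 4,
): CaptionPose {
  // Outside the chunk the caption is simply hidden.
  if (timeMs < startMs || timeMs >= endMs) {
    return { opacity: 0, translateYPx: 0 };
  }
  const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));
  const fadeIn = clamp01((timeMs - startMs) / fadeMs);
  const fadeOut = clamp01((endMs - timeMs) / fadeMs);
  return {
    // Taking the minimum keeps very short chunks from exceeding full opacity.
    opacity: Math.min(fadeIn, fadeOut),
    // Slide only on the way in: starts slidePx below rest and settles at 0.
    translateYPx: slidePx * (1 - fadeIn),
  };
}
```

For a chunk running from 1000ms to 2000ms, the pose at 1060ms is half opacity with a 2px offset, and at 1940ms it is half opacity with no offset.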
Why not FFmpeg subtitles?
FFmpeg's burn-in works, and for SRT files with no design opinion, it is fine. Three reasons to skip it for produced content:
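- Typography control is shallow. You get the ASS style fields that force_style exposes: font, size, colors, outline, an opaque box. You don't get CSS-level control over kerning, line height, or backing radius.
- No per-platform variants. You will want shorter lines for 9:16 than for 16:9. The line breaks live in the SRT, so each aspect ratio needs its own re-chunked file before FFmpeg can burn it in.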
- No animation. A static subtitle that flicks on and off looks worse than one that fades.
For a one-shot recording with an SRT, use FFmpeg. For anything you'll iterate on, render captions inside the template.
Multi-language subtitles
The same template can render multiple language variants by swapping the chunk array. The geometry stays the same; only the text changes. Ten Spanish renders + ten English renders from the same source.
This is where templating beats SRT. SRT files don't ship with their own typography, so a Cyrillic or Japanese SRT renders in FFmpeg's default font (typically Arial) unless you override it. A template ships with the language-specific font (Noto Sans CJK for Japanese, etc.) baked into the CSS.
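One way to picture the swap: keep a single list of timings and pair it with a per-language list of lines and a per-language font stack. Everything named here (ChunkTiming, CaptionVariant, buildVariant, the font map) is hypothetical, and the sketch assumes translations are aligned one-to-one with the source chunks, which real translations will not always be.

```typescript
// Hypothetical shapes for shared timing and per-language output.
interface ChunkTiming {
  startMs: number;
  endMs: number;
}

interface LocalizedCaption {
  text: string;
  startMs: number;
  endMs: number;
}

interface CaptionVariant {
  lang: string;
  fontFamily: string;
  chunks: LocalizedCaption[];
}

// Assumed font stacks; use whatever your brand actually ships.
const DEFAULT_CAPTION_FONT = '"Noto Sans", sans-serif';
const CAPTION_FONT_BY_LANG = new Map<string, string>([
  ["ja", '"Noto Sans CJK JP", sans-serif'],
]);

function buildVariant(lang: string, timings: ChunkTiming[], lines: string[]): CaptionVariant {
  if (lines.length !== timings.length) {
    throw new Error(`${lang}: expected ${timings.length} lines, got ${lines.length}`);
  }
  // Geometry and timing are shared; only the text and font change.
  const chunks = timings.map((timing, i) => ({
    text: lines[i],
    startMs: timing.startMs,
    endMs: timing.endMs,
  }));
  const fontFamily = CAPTION_FONT_BY_LANG.get(lang) ?? DEFAULT_CAPTION_FONT;
  return { lang, fontFamily, chunks };
}
```

The length check is the useful part: a translation that drops or merges a line fails loudly before any frames are rendered.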
Speaker-attributed captions
For interview content, prefix each chunk with the speaker:
KIRA: We tried four codecs before AV1 came up.
INT: Why not just default to AV1 from the start?
Render with a colored label per speaker. Two-speaker dialogue is the right complexity ceiling; three or more speakers degrades quickly without a video-conference-style spatial cue.
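A small sketch of splitting the prefix from the text and assigning label colors. The function names and the two colors are hypothetical, and the parser assumes labels are uppercase and followed by a colon and a space, as in the example lines; anything else is treated as unattributed text.

```typescript
// Hypothetical parsed form of one caption line.
interface SpeakerLine {
  speaker: string | null;
  text: string;
}

function parseSpeakerLine(line: string): SpeakerLine {
  const colon = line.indexOf(": ");
  if (colon > 0) {
    const label = line.slice(0, colon);
    // Assumption: labels are short uppercase tags such as KIRA or INT.
    if (/^[A-Z][A-Z0-9 ]*$/.test(label)) {
      return { speaker: label, text: line.slice(colon + 2) };
    }
  }
  return { speaker: null, text: line };
}

// Placeholder colors; the first speaker seen gets the first one.
const FIRST_SPEAKER_COLOR = "#ffd166";
const SECOND_SPEAKER_COLOR = "#06d6a0";

function speakerColors(lines: string[]): Map<string, string> {
  const colors = new Map<string, string>();
  for (const line of lines) {
    const speaker = parseSpeakerLine(line).speaker;
    if (speaker !== null && !colors.has(speaker)) {
      // Two-speaker ceiling: everyone after the first shares the second color.
      colors.set(speaker, colors.size === 0 ? FIRST_SPEAKER_COLOR : SECOND_SPEAKER_COLOR);
    }
  }
  return colors;
}
```

A third speaker would share the second color here, which is a deliberate nudge to stay within two.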
The render pipeline
Inside the HyperFrames pipeline, captions are just another HTML element with a frame-time variable. The render loop seeks the frame-time, the active caption updates, the frame is captured. Deterministic, frame-aligned, no race conditions.
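The shape of that loop can be sketched without any renderer at all. This is a stand-in and not the HyperFrames API: renderCaptionPass and the captureFrame callback are hypothetical names, and the callback is where a real renderer would seek the page and grab the frame. The point it illustrates is that caption state depends only on the integer frame index and the frame rate.

```typescript
// Hypothetical chunk shape for the caption track.
interface ScheduledCaption {
  text: string;
  startMs: number;
  endMs: number;
}

// Frame time derives from the frame index alone, never from a wall clock.
function frameToMs(frameIndex: number, fps: number): number {
  return Math.round((frameIndex * 1000) / fps);
}

function renderCaptionPass(
  chunks: ScheduledCaption[],
  fps: number,
  totalFrames: number,
  captureFrame: (frameIndex: number, timeMs: number, captionText: string) => void,
): void {
  for (let frame = 0; frame < totalFrames; frame++) {
    const timeMs = frameToMs(frame, fps);
    let captionText = "";
    for (const chunk of chunks) {
      // Assumption: startMs inclusive, endMs exclusive.
      if (timeMs >= chunk.startMs && timeMs < chunk.endMs) {
        captionText = chunk.text;
        break;
      }
    }
    // Stand-in for "seek, update the caption, capture the frame".
    captureFrame(frame, timeMs, captionText);
  }
}
```

Running the same inputs twice yields the same sequence of captureFrame calls, which is the property the pipeline relies on.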
The same template that renders TikTok variants renders the burned-in subtitle pass — no separate tool, no SRT roundtrip.
Open the playground, paste a chunk array, see the caption track align to the seek bar.
Cite this post: Kira Tanaka. (2026, April 30). How to burn subtitles into an MP4 (and why you should). HyperFrames. https://hyperframes.video/blog/burn-subtitles-into-mp4
Kira works on the render core: headless Chromium scheduling, frame capture, and the encoder pipeline. She cares about reproducible builds and small numbers next to the word "variance."