HTML is the next video codec
Video formats describe pixels. HTML describes intent. When a browser can render frames deterministically, the document becomes the codec — and the entire pipeline changes.
Most people, when they hear the word codec, picture a black box that takes pixels in and produces a smaller pile of pixels out. H.264, AV1, and ProRes are codecs that solve one problem: compressing a stream of already-rendered images. They are the last mile of a long journey that started somewhere else, usually in After Effects or Premiere, sometimes in a game engine, occasionally in a Python script with cv2.imwrite in a loop.
We think the journey is upside down. The interesting question is not how to compress pixels. The interesting question is: what is the smallest, most expressive, most diff-able description of what should be on the screen at time t — and can we render it deterministically? Once you answer that, the codec moves up the stack. The document becomes the source of truth. The pixels are a build artifact.
For us at HyperFrames, the answer to "what is the most expressive description" is increasingly obvious: it is HTML. Or more precisely, HTML plus CSS plus a small amount of seek-friendly JavaScript. We will spend the next two thousand words unpacking why.
Pixels are the wrong primitive
Every modern video file is a sequence of pixel grids with motion vectors and a few clever tricks bolted on. That representation is excellent for the moment you want to play a video and useless for everything else. You cannot diff two MP4s. You cannot grep them. You cannot version-control them in a way that means anything. You cannot ask Claude to "make the second chart twice as tall" and get a useful answer.
The problem is that the pixel-grid representation throws away every piece of authorial intent on the way down. The fact that this red bar should grow from 0 to 60 over 1.2 seconds with an ease-out cubic? Lost. The fact that this caption is the second sentence in a three-sentence sequence? Lost. The fact that the brand color is #ff3b1f and should never drift even one byte? Lost — at least until someone re-extracts it from a frame and discovers the encoder pushed it to #ff3a1e.
If you have ever tried to make a small text change to an existing 30-second ad, you know how this story ends. You open the original project file. The project file references seventeen assets, half of which have moved. You re-render. The new MP4 is structurally identical to the old one except for one word, and you ship 38 megabytes of new pixels to convey three letters of change.
What HTML already gets right
The browser is the most-tested rendering engine in the history of software. Every typography rule, every easing curve, every blend mode, every color space, every shader is debugged across hundreds of millions of devices a day. We do not need to build a new motion graphics renderer. The good one already exists, and it ships in every laptop.
HTML and CSS, taken together, are a remarkably good language for describing how a scene should look at time t. They have layout (flex, grid, absolute), they have type (variable fonts, OpenType features, italic small caps), they have color (HSL, OKLCH, P3, gradients), they have animation (@keyframes, animation-delay, animation-fill-mode), they have compositing (filters, masks, blend modes), they have 3D transforms, they have SVG and Canvas and WebGL when you need to descend a level. They are, crucially, declarative: the document at time t is a function of the document plus t, not of accumulated mutation.
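To make "a function of the document plus t" concrete, here is a minimal sketch of the red bar from earlier, growing from 0 to 60 over 1.2 seconds with an ease-out cubic. The helper names `easeOutCubic` and `barHeightAt` are ours for illustration and are not part of any HyperFrames API.

```javascript
// Standard ease-out cubic: fast start, gentle landing. p is progress in [0, 1].
function easeOutCubic(p) {
  return 1 - Math.pow(1 - p, 3);
}

// Height of the bar at time t (seconds). Hypothetical helper: the result
// depends only on t, so frames can be computed in any order.
function barHeightAt(t, { from = 0, to = 60, duration = 1.2 } = {}) {
  const p = Math.min(Math.max(t / duration, 0), 1); // clamp progress to [0, 1]
  return from + (to - from) * easeOutCubic(p);
}
```

Because nothing is accumulated between calls, asking for frame 40 before frame 3 gives the same heights as rendering in order.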
The declarative property is the one that matters for video. A deterministic renderer needs to seek. It needs to ask the document, "what should you look like at exactly 1.234 seconds?" and get the same answer every time, regardless of frame rate or thread schedule. CSS animations, when driven by animation-delay and animation-play-state: paused, give you that. JavaScript that listens for an hf-seek event and writes computed state into the DOM gives you that. The browser gives you that.
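One way to seek a paused CSS animation is to give it a delay equal to its authored delay minus the seek time; a negative delay starts the animation partway through. The sketch below assumes the authored delay is stored in a hypothetical `data-base-delay` attribute in milliseconds. That attribute and both function names are illustrative, not HyperFrames conventions.

```javascript
// Map a global seek time onto a paused CSS animation. With
// animation-play-state: paused, a delay of (baseDelay - t) shows the frame the
// animation would have reached t milliseconds after the document started.
function seekDelayMs(baseDelayMs, seekTimeMs) {
  return baseDelayMs - seekTimeMs;
}

// Apply the seek to one element. Assumes the element's animation is declared
// paused and its authored delay lives in data-base-delay (hypothetical).
function applySeek(element, seekTimeMs) {
  const baseDelayMs = Number(element.dataset.baseDelay || 0);
  element.style.animationDelay = `${seekDelayMs(baseDelayMs, seekTimeMs)}ms`;
}
```

For an animation authored with a 200ms delay, seeking to 550ms sets the delay to -350ms, which places the animation 350ms into its active phase.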
What the codec metaphor unlocks
Once you accept that the document is the codec, four things happen in quick succession.
First, your video is suddenly a text file. Twelve kilobytes of HTML rendered at 1080p produces a hundred megabytes of MP4. The HTML is what you store, diff, review in pull requests, ship through your CI, ask an agent to modify. The MP4 is generated on demand. You no longer have a binary build artifact masquerading as a creative asset.
Second, your video gains a type system. The composition is structured: title cards have classes, captions have data attributes, charts read from JSON. You can lint it. You can statically analyze it. You can refuse to render if the duration is wrong. We ship npx hyperframes lint for exactly this reason.
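As a rough illustration of what "refuse to render if the duration is wrong" can mean, here is a small check over composition metadata. The field names (`duration`, `fps`, `width`, `height`) and the rules themselves are assumptions for this sketch; they are not the actual rules behind npx hyperframes lint.

```javascript
// Hypothetical lint pass over composition metadata. Returns a list of error
// messages; an empty list means the composition may be rendered.
function lintComposition(meta) {
  const errors = [];
  const durationOk = typeof meta.duration === "number" && meta.duration > 0;
  const fpsOk = Number.isInteger(meta.fps) && meta.fps > 0;
  if (!durationOk) errors.push("duration must be a positive number of seconds");
  if (!fpsOk) errors.push("fps must be a positive integer");
  if (durationOk && fpsOk) {
    // A duration that misses a frame boundary would leave a partial frame.
    const frames = meta.duration * meta.fps;
    if (Math.abs(frames - Math.round(frames)) > 1e-9) {
      errors.push("duration x fps must be a whole number of frames");
    }
  }
  for (const side of ["width", "height"]) {
    // Even dimensions keep 4:2:0 encoders such as libx264 with yuv420p happy.
    const value = meta[side];
    if (!Number.isInteger(value) || value <= 0 || value % 2 !== 0) {
      errors.push(`${side} must be a positive even integer`);
    }
  }
  return errors;
}
```

A render command could call this first and exit with the messages if the list is not empty.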
Third, your video becomes composable in the same way websites are. You write a <LowerThird> component once and reuse it across thirty videos. You bump a brand token and every composition rebuilds with the new color. You import a chart from a different file. None of this is novel — it is the same logic that made React win. We are just pointing it at frames instead of pages.
Fourth, and most consequentially: agents can write video. An LLM that has read a billion HTML pages is fluent in CSS keyframes — which is why our OpenAI integration lets you generate compositions directly from a function call. It is not fluent in After Effects scene graphs, because there are not a billion of those on the public internet. The shortest path from a prompt to a frame, today, runs straight through HTML.
What the browser still gets wrong
We are not going to pretend the browser was designed for this. There are real problems and they are worth naming.
The first is determinism. By default, a browser is a soft-real-time renderer: it tries to draw things at 60 frames per second, dropping frames if it cannot keep up, jittering animations when other tabs steal CPU. That is the opposite of what a frame-perfect render farm needs. The fix is to pause every animation and drive it from a single clock. HyperFrames does this with an hf-seek event: the engine sets the animation timing through document.documentElement.style.setProperty, dispatches the event, waits for requestAnimationFrame to settle, then captures the frame. The browser becomes a synchronous time machine.
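The seek step described above might look roughly like the following when run inside the page. This is a sketch under assumptions: the `--hf-time` custom property, the shape of the event's `detail`, and the choice to wait two animation frames are all ours, not confirmed engine behavior. The window is passed in as `win` so the function can be exercised with a stand-in outside a browser.

```javascript
// One seek step. `win` is the page's window object (or a stand-in in tests).
async function seekTo(win, timeSeconds) {
  const root = win.document.documentElement;
  // 1. Publish the clock so CSS can derive animation timing from it.
  //    (--hf-time is an assumed property name.)
  root.style.setProperty("--hf-time", `${timeSeconds}s`);
  // 2. Tell script-driven parts of the composition to recompute their state.
  //    (The detail shape is an assumption.)
  win.dispatchEvent(
    new win.CustomEvent("hf-seek", { detail: { time: timeSeconds } })
  );
  // 3. Wait two animation frames so style and layout updates have been applied.
  for (let i = 0; i < 2; i += 1) {
    await new Promise((resolve) => win.requestAnimationFrame(resolve));
  }
  // The caller captures the frame once the returned promise resolves.
}
```

In a real page this would be called as `await seekTo(window, 1.234)` before each capture.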
The second is fonts. Web fonts arrive over the network, and they arrive at unpredictable times. A first-frame render that fires before the font has loaded looks nothing like the second-frame render that fires after. We solve this by waiting on document.fonts.ready before the first capture, and by warning loudly when a composition references fonts not in the bundle. If you have ever shipped an ad with the wrong font because the staging environment had a different cache, you know exactly the bug we are preventing.
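A bundle check of this kind can be approximated by comparing the families a composition references against the families shipped with it. The helper names and the idea of passing in raw font-family values are assumptions for this sketch; `document.fonts.ready` is the standard promise from the CSS Font Loading API.

```javascript
// Generic CSS font families never need to be bundled.
const GENERIC_FAMILIES = new Set([
  "serif", "sans-serif", "monospace", "cursive", "fantasy", "system-ui",
]);

// Split a font-family value into named families, dropping quotes and generics.
// Naive comma split: assumes no family name itself contains a comma.
function namedFamilies(fontFamilyValue) {
  return fontFamilyValue
    .split(",")
    .map((name) => name.trim().replace(/^["']|["']$/g, ""))
    .filter((name) => name && !GENERIC_FAMILIES.has(name.toLowerCase()));
}

// Families the composition references that the bundle does not provide.
function missingFonts(fontFamilyValues, bundledFamilies) {
  const bundled = new Set(bundledFamilies.map((name) => name.toLowerCase()));
  const missing = new Set();
  for (const value of fontFamilyValues) {
    for (const name of namedFamilies(value)) {
      if (!bundled.has(name.toLowerCase())) missing.add(name);
    }
  }
  return [...missing];
}

// In the page, the first capture would additionally wait on:
//   await document.fonts.ready;
```

A non-empty result is the moment to warn loudly, before a fallback font ends up in the render.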
The third is GPU variance. Two machines running the same Chromium with the same composition can produce subtly different anti-aliasing, particularly for filters and 3D transforms. We pin the Chromium version. We pin the device-pixel-ratio. We disable the GPU compositor when bitwise determinism matters more than performance. It is not free, but it is honest. (For a side-by-side with the closest peer in this space, see how HyperFrames compares to Remotion.)
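For readers driving Chromium themselves, the three pins above could be expressed as launch options. The sketch below builds an object in the shape that Puppeteer's `launch()` accepts; using Puppeteer at all is an assumption on our part, as is the exact set of flags, and `deterministicLaunchOptions` is a hypothetical name.

```javascript
// Build launch options for a pinned, software-rendered Chromium.
// Shape follows Puppeteer's launch() options; Puppeteer itself is an assumption.
function deterministicLaunchOptions({ width, height, executablePath }) {
  return {
    headless: true,
    // Point at one pinned Chromium binary instead of whatever is installed.
    executablePath,
    args: [
      "--disable-gpu", // software rendering: slower, fewer per-GPU differences
      "--force-color-profile=srgb", // do not inherit the host display's profile
      "--hide-scrollbars",
    ],
    // deviceScaleFactor pins the device pixel ratio regardless of the host.
    defaultViewport: { width, height, deviceScaleFactor: 1 },
  };
}
```

The object would be passed straight to the launcher, so every render worker starts from the same configuration.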
The encoder is still an encoder
To be clear: we still encode to MP4 at the end. The codec metaphor does not mean we have invented a new video format the world has to play. It means the place where authorial intent lives, and the place where pixels live, are now different places. The encoder becomes a boring last step instead of an opinion-laden creative tool. We use H.264 because every device on earth plays it; we use AV1 when bandwidth matters; we use ProRes when an editor downstream wants to color-grade. The interesting layer is upstream.
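The choice between those three encoders can be reduced to a lookup from delivery target to ffmpeg arguments. The target names (`universal`, `bandwidth`, `grading`) and the quality settings are made up for this sketch; the encoder names and flags are standard ffmpeg ones, though not necessarily what HyperFrames passes.

```javascript
// Map a delivery target to ffmpeg video-encoder arguments.
const ENCODER_ARGS = {
  // Broad compatibility: H.264 with 4:2:0 chroma.
  universal: ["-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "18"],
  // Smaller files: AV1 in constant-quality mode.
  bandwidth: ["-c:v", "libaom-av1", "-crf", "30", "-b:v", "0"],
  // Editing and grading: ProRes 422 HQ, 10-bit.
  grading: ["-c:v", "prores_ks", "-profile:v", "3", "-pix_fmt", "yuv422p10le"],
};

function encoderArgs(target) {
  if (!Object.prototype.hasOwnProperty.call(ENCODER_ARGS, target)) {
    throw new Error(`unknown target: ${target}`);
  }
  return [...ENCODER_ARGS[target]]; // copy so callers cannot mutate the table
}
```

Keeping this as a table is the point: the encoder is configuration, not a place where creative decisions are made.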
What this looks like in practice
A HyperFrames composition is a single HTML file with a small amount of metadata. You can render it with one command:
npx hyperframes render hero.html \
  --out hero.mp4 \
  --duration 5 \
  --width 1920 --height 1080 --fps 60
The engine boots a headless Chromium, opens the file, waits for fonts and images, then loops over duration × fps frames, starting at frame 0. For each frame, it dispatches hf-seek, lets the browser settle, captures, and pipes to ffmpeg. The output is bit-identical across machines that share the same engine version.
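The frame loop comes down to a list of timestamps. A minimal sketch, assuming each frame's time is its index divided by the frame rate (`frameTimes` is a hypothetical helper, not an engine function):

```javascript
// Timestamps (in seconds) for every frame of a render. Each time is computed
// from the frame index rather than accumulated, so rounding error cannot drift.
function frameTimes(durationSeconds, fps) {
  const frameCount = Math.round(durationSeconds * fps);
  return Array.from({ length: frameCount }, (_, index) => index / fps);
}
```

For the command above, that is 300 timestamps: the first at 0 seconds and the last at 299/60 seconds, one frame short of the 5-second mark.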
The composition itself looks like a web page that happens to be five seconds long:
<style>
  .title {
    font-family: "Newsreader", serif;
    font-size: 96px;
    animation: rise 700ms cubic-bezier(.2,.7,.1,1) 200ms backwards;
    animation-play-state: paused;
  }
  @keyframes rise {
    from { opacity: 0; transform: translateY(24px); }
    to { opacity: 1; transform: translateY(0); }
  }
</style>
<h1 class="title">Hello, frame 0.</h1>
The engine drives animation-delay from the seek event. The composition is, structurally, the codec.
Why this is a generational shift
For most of the last forty years, computer graphics has lived in two worlds. There is the real-time world (games, 3D apps) where you write shaders and accept whatever the GPU draws. And there is the offline world (film, ads, motion graphics) where you write keyframes in a proprietary tool and wait for a render farm.
Video on the web has lived uncomfortably between them, mostly by way of the offline world: someone makes an MP4 in a desktop tool, then someone else uploads it. The "web video" pipeline has been Premiere with extra steps.
We think there is a third world emerging. It is one where compositions are documents — versioned, diffable, agent-writable, deterministic on render — and where the act of "making a video" is much closer to the act of building a static site. The codec is HTML. The renderer is the browser. The output is a sequence of pixels you ship to wherever pixels are useful.
The next decade of video will be written, not exported. The fact that this sentence sounds obvious only after you write it is the first sign that something is about to move.
If you want to start now, npx hyperframes init puts a working composition on your disk in under a minute, or you can poke at compositions directly in the browser playground. The future of video is text. Open your editor.
Cite this post
@misc{team2026html,
  author = {HyperFrames Team},
  title = {HTML is the next video codec},
  year = {2026},
  url = {https://hyperframes.video/blog/html-is-the-next-video-codec},
  note = {HyperFrames blog}
}
HyperFrames Team. (2026, April 30). HTML is the next video codec. HyperFrames. https://hyperframes.video/blog/html-is-the-next-video-codec
[HTML is the next video codec](https://hyperframes.video/blog/html-is-the-next-video-codec) — HyperFrames Team, 2026
We build the deterministic HTML-to-video pipeline at HyperFrames. We write here when we have something concrete to say.