From DOM to MP4: an annotated render
A frame-by-frame walkthrough of what happens between hyperframes render and a finished MP4. Chromium, seek events, capture, encode — the whole pipeline, with timings.
When you run hyperframes render hero.html, somewhere between sixty and seventeen hundred milliseconds later you have an MP4 on disk. I want to spend this post walking through every millisecond in that interval. The pipeline is not magic. It has six stages, each with specific failure modes, specific optimizations, and specific numbers. After three years of working on it, I can still surprise myself with what is happening between the keystroke and the file.
This post assumes you have used the CLI at least once. If you haven't, run npx hyperframes init first; it produces a working composition you can render in a few seconds. Then come back. The developer hub has the SDK reference, and the docs cover the CLI flags referenced below.
Stage 1: Composition resolution (5–40ms)
The first thing the renderer does is read your HTML file and walk it for references. Every <img src>, every <link> to a font, every <script src>. We build a manifest: every asset that will be needed at render time, and where it lives. Local files are resolved to absolute paths. Remote URLs are validated.
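The asset walk can be sketched in a few lines. This is a minimal illustration, not the renderer's actual code: collectAssets and the shape of its return value are hypothetical names, and a regex stands in for the real HTML parser a production tool would use.

```javascript
const path = require("node:path");

// Sketch only: find src/href references on <img>, <script> and <link> tags.
// collectAssets and the returned object shape are hypothetical.
function collectAssets(html, baseDir) {
  const pattern = /<(img|script|link)\b[^>]*?\b(?:src|href)\s*=\s*["']([^"']+)["']/gi;
  const assets = [];
  for (const match of html.matchAll(pattern)) {
    const tag = match[1].toLowerCase();
    const ref = match[2];
    const remote = /^https?:\/\//i.test(ref);
    assets.push({
      tag,
      ref,
      remote,
      // Local references become absolute paths; remote URLs are kept as-is.
      location: remote ? ref : path.resolve(baseDir, ref),
    });
  }
  return assets;
}
```

Each entry records what was referenced and where it lives, which is the information a preheat step would need before the page opens.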
This stage exists because of one specific failure: if Chromium opens your file and immediately starts rendering, it will render the first frame before some of those assets have loaded. The composition will look wrong, the first frame will be a flash of unstyled content, and you will be sad. The manifest lets us preheat — fetch and cache everything before the page even opens.
We also compute a content hash here. The hash is the SHA-256 of the resolved HTML plus every asset. It is the fingerprint of this exact composition. If you render twice with the same hash, our cache layer (which we will get to) shortcuts the entire pipeline. The build is reproducible because the inputs are.
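A minimal sketch of such a fingerprint, using Node's built-in crypto module. The source only says the hash covers the resolved HTML plus every asset; the sort order, the separators, and the { path, bytes } asset shape below are assumptions made for illustration.

```javascript
const { createHash } = require("node:crypto");

// assets: array of { path: string, bytes: Buffer } (hypothetical shape).
function contentHash(html, assets) {
  const hash = createHash("sha256");
  hash.update("html\0");
  hash.update(html);
  // Sort by path so the digest does not depend on discovery order.
  const sorted = [...assets].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0
  );
  for (const asset of sorted) {
    // Frame each asset with its path and length so boundaries are unambiguous.
    hash.update(`\0asset\0${asset.path}\0${asset.bytes.length}\0`);
    hash.update(asset.bytes);
  }
  return hash.digest("hex");
}
```

Sorting before hashing is the detail that matters: the same inputs must produce the same digest regardless of the order in which the walk found them.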
Stage 2: Chromium boot (300–800ms cold, 0ms warm)
This is the largest single chunk of latency on a cold render, and the one we have spent the most effort eliminating. Booting a fresh headless Chromium takes around half a second. For interactive workflows (preview, agent loops) we keep a warm pool: one or two Chromium processes idling, ready to accept a new tab. With a warm pool, the boot cost drops to "open a new page in an existing browser," which is around 20-40ms.
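The warm-pool idea, reduced to a sketch. WarmPool is a hypothetical class, and launch is an injected function assumed to return a Puppeteer-like browser object with a newPage() method; error handling and crash recovery are left out.

```javascript
// Sketch only: keep a few browsers booted and hand out tabs from them.
// launch is an injected async function returning an object with newPage().
class WarmPool {
  constructor(launch, size = 2) {
    this.launch = launch;
    this.size = size;
    this.idle = []; // promises of booted (or booting) browsers
  }

  // Start booting browsers ahead of demand.
  warm() {
    while (this.idle.length < this.size) this.idle.push(this.launch());
  }

  // Open a page in a warm browser; fall back to a cold boot if none exists.
  async acquirePage() {
    let pending = this.idle.shift();
    if (!pending) pending = this.launch();
    this.idle.push(pending); // rotate: one browser can serve many tabs
    const browser = await pending;
    return browser.newPage();
  }
}
```

Because the pool stores promises rather than finished browsers, a render that arrives while a browser is still booting waits on that boot instead of starting another one.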
The flags we boot Chromium with matter. We disable a bunch of things — GPU rasterization on platforms where it is unstable, background networking, the entire extensions system, telemetry, audio. We enable one thing — synchronous animation timing — which is custom-patched into our Chromium build for reasons I will get to.
We also pin the Chromium version. Every release of HyperFrames is locked to a specific Chromium commit. When you upgrade HyperFrames, you upgrade Chromium. When you don't, you don't. This is the only way bitwise determinism survives across machines: everyone has to be running the same renderer.
Stage 3: Page load and ready-gate (50–300ms)
The browser opens your HTML. We do not start capturing yet. Instead, we wait on a series of "ready" gates. Each gate has a name, a max timeout, and a specific signal it waits for.
- dom-ready: DOMContentLoaded fires. The HTML is parsed.
- fonts-ready: document.fonts.ready resolves. Every font referenced in font-family has loaded.
- images-ready: every <img> we found in stage 1 has fired load and called decode().
- composition-ready: an optional gate. If the composition defines window.__hfReady = () => Promise<void>, we await it. This is the escape hatch for compositions that do their own async setup: fetching JSON, initializing a Three.js scene, loading a Lottie animation.
Only when all gates resolve do we move to capture. This is the single most important detail in the entire pipeline. Skip this and you get the bug where frame 0 has fallback fonts and the rest of the video has the right fonts and your viewer sees a one-frame flicker that ruins the whole video. We hit this exactly once, in 2024, and added every guard you see here as a result.
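The gate mechanism (a name, a max timeout, a signal) can be sketched as follows. withTimeout and awaitGates are hypothetical helpers, and waiting on all gates concurrently is an assumption; the source only says that capture starts once every gate has resolved.

```javascript
// Reject if a gate's signal does not settle within its timeout.
function withTimeout(name, signal, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`gate "${name}" timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([signal, timeout]).finally(() => clearTimeout(timer));
}

// gates: array of { name, timeoutMs, wait: () => Promise } (hypothetical shape).
async function awaitGates(gates) {
  // Assumption: gates wait concurrently; capture begins only when all resolve.
  await Promise.all(
    gates.map((gate) => withTimeout(gate.name, gate.wait(), gate.timeoutMs))
  );
}
```

Putting the gate name in the error message is what turns a hung render into a diagnosable one: the failure says which signal never arrived.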
Stage 4: The seek loop (the main event)
This is where most of the wall-clock time goes, and it is the part of the system that needed the most surgery. Here is what happens, in pseudocode, once per frame:

    for (let i = 0; i < totalFrames; i++) {
      const t = i / fps; // frame index -> composition time in seconds
      // Seek: publish the time to CSS and to JavaScript listeners.
      await page.evaluate((time) => {
        document.documentElement.style.setProperty("--hf-time", `${time}s`);
        window.dispatchEvent(new CustomEvent("hf-seek", { detail: { time } }));
      }, t);
      // Wait one animation frame so the DOM settles before capture.
      await page.evaluate(() => new Promise((r) => requestAnimationFrame(() => r())));
      const buffer = await page.screenshot({ type: "png", omitBackground: false });
      encoder.write(buffer); // hand the PNG to the encoder
    }

A few things are doing a lot of work in that loop.
The --hf-time CSS variable cascades into every CSS animation in the composition. Animations are paused (animation-play-state: paused) and their effective progress is computed from the variable via animation-delay: calc(var(--hf-time) * -1). The browser's keyframe interpolator runs as normal; we just lie to it about what time it is.
The hf-seek event is the escape hatch for JavaScript-driven animations. Authors who want to drive Three.js, Canvas, GSAP, or anything else listen for this event and update their state from event.detail.time. The event is synchronous from the renderer's perspective: we wait for the next requestAnimationFrame to confirm the DOM has settled.
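On the author's side, a seek listener might look like the sketch below. stateAt, attachSeek, and the two-second slide-in are hypothetical; the only parts taken from the text are the hf-seek event name and event.detail.time. The key property is that state is a pure function of time, with no accumulated deltas.

```javascript
// Pure function of time: the same t always yields the same state.
function stateAt(time) {
  const duration = 2; // seconds; hypothetical animation length
  const progress = Math.min(Math.max(time / duration, 0), 1);
  return { x: 400 * progress, opacity: progress };
}

// target is window in the browser; apply writes the state into the scene
// (a Three.js object, a canvas draw call, a GSAP timeline seek, and so on).
function attachSeek(target, apply) {
  target.addEventListener("hf-seek", (event) => {
    apply(stateAt(event.detail.time));
  });
}
```

In a composition this would be wired up as attachSeek(window, applyToScene), where applyToScene is whatever function draws the given state. Because the listener never reads the previous frame, frames can be rendered in any order and still match.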
The screenshot itself is the slow part. Even on warm Chromium, taking a 1920×1080 PNG screenshot is around 8-15ms. Across 300 frames at 60fps (a 5-second video), that is 2.4–4.5 seconds of pure capture time. We have explored faster paths — CDP's Page.captureScreenshot with captureBeyondViewport: false is fastest, and we use it. We've also experimented with a raw framebuffer extraction that skips PNG encoding entirely, piping uncompressed pixels to ffmpeg's stdin. It is around 30% faster but more fragile. We ship it behind --unsafe-raw-pipe for users who want the speed.
What synchronous animation means
The single biggest determinism win came from patching Chromium's animation scheduler. Normally, when you change animation-delay, the browser updates the visual at the next compositor tick, which is asynchronous with respect to JavaScript. That means the screenshot we take immediately after the seek might capture the animation at its old position, not its new one.
Our patch adds a single method to CDP: Animation.flushSync. It forces the animation host to recompute every animated property immediately, on the main thread, before returning. We call it after every seek event (the pseudocode above omits the call). The cost is minor (200µs); the correctness gain is total. We have submitted this patch upstream; reception has been polite but slow.
Stage 5: Encoding (parallel with capture)
We do not wait for every frame to be captured before starting to encode. ffmpeg runs in a separate process, reading PNGs from a Unix pipe or shared memory. As soon as frame 0 arrives, encoding begins. By the time frame N-1 is captured, frame 0 is already in the muxer.
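When capture and encode overlap, the pipe between them needs backpressure so that captured frames do not pile up in memory if the encoder falls behind. A minimal sketch with Node streams; writeFrame is a hypothetical helper, and stdin stands for the stdin of a spawned ffmpeg process.

```javascript
const { once } = require("node:events");

// stdin is any Writable, e.g. the stdin of a spawned ffmpeg process.
async function writeFrame(stdin, buffer) {
  // write() returns false when the stream's internal buffer is full.
  // Waiting for "drain" keeps capture from outrunning the encoder.
  if (!stdin.write(buffer)) {
    await once(stdin, "drain");
  }
}
```

In the seek loop this would replace a bare write call with await writeFrame(ffmpeg.stdin, buffer), where ffmpeg is the child process.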
The encoder choice matters. We default to libx264 -preset medium -crf 18 for general use; the resulting H.264 is universally playable and the quality is high. For users who want smaller files, -preset slow -crf 20 shaves 30% off the file size at a 2x encode cost. For users who need AV1 we switch to libaom-av1; it is much slower but the bitrate-to-quality curve is dramatically better. For users who need high-fidelity intermediates, we offer prores_ks (Apple ProRes, which is visually lossless rather than mathematically lossless) and FFV1 (which is truly lossless).
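As an illustration, the encoder choices above can be expressed as ffmpeg argument lists. The profile names and the ffmpegArgs helper are hypothetical. The x264 settings come from the text; the AV1 quality value, the ProRes profile number, and the yuv420p pixel format are assumptions added to make the command lines complete.

```javascript
// Profile names are hypothetical; the ffmpeg options are standard ones.
const ENCODER_PROFILES = {
  // yuv420p is an assumption, added for broad player compatibility.
  default: ["-c:v", "libx264", "-preset", "medium", "-crf", "18", "-pix_fmt", "yuv420p"],
  small: ["-c:v", "libx264", "-preset", "slow", "-crf", "20", "-pix_fmt", "yuv420p"],
  // -crf 30 with -b:v 0 selects constant-quality mode; the value is an assumption.
  av1: ["-c:v", "libaom-av1", "-crf", "30", "-b:v", "0"],
  // Profile 3 is ProRes 422 HQ; the choice of profile is an assumption.
  prores: ["-c:v", "prores_ks", "-profile:v", "3"],
  ffv1: ["-c:v", "ffv1"],
};

function ffmpegArgs(profile, fps, outputPath) {
  const codec = ENCODER_PROFILES[profile];
  if (!codec) throw new Error(`unknown encoder profile: ${profile}`);
  return [
    "-f", "image2pipe",        // input is a stream of images...
    "-framerate", String(fps), // ...at the composition's frame rate...
    "-i", "-",                 // ...read from stdin
    ...codec,
    "-y", outputPath,          // overwrite the output file if it exists
  ];
}
```

The result is meant to be passed to child_process.spawn("ffmpeg", args) with stdin set to a pipe, so that frames can be streamed in as they are captured.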
What we do not do is re-encode after the fact. The captures are written to the encoder directly; the encoder is the final stage. There is no temporary "raw frames on disk" step, because that would be wasteful and slow.
Stage 6: Muxing and finalization (20–80ms)
The encoder writes a stream of compressed video data. ffmpeg muxes that into an MP4 container — adding the moov atom (which describes the video's structure), inserting any audio tracks, writing metadata. The final file lands on disk.
We also write a sidecar JSON: a manifest of what was rendered. Composition hash, engine version, Chromium version, ffmpeg version, encoder settings, total duration, total frames, render wall time. This sidecar is invaluable for debugging — when a customer says "this MP4 looks wrong," the first thing we ask for is the manifest. The manifest tells us exactly which pipeline produced it.
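A sketch of what assembling that sidecar could look like. The field names are hypothetical: the text lists what the manifest contains but not its schema.

```javascript
// Field names are hypothetical; the contents follow the list in the text.
function buildSidecar(info) {
  return {
    compositionHash: info.compositionHash,
    engineVersion: info.engineVersion,
    chromiumVersion: info.chromiumVersion,
    ffmpegVersion: info.ffmpegVersion,
    encoderSettings: info.encoderSettings,
    totalFrames: info.totalFrames,
    // Derive duration from frames and fps so the two cannot disagree.
    durationSeconds: info.totalFrames / info.fps,
    renderWallMs: info.renderWallMs,
  };
}

// Serialize with stable indentation so sidecars diff cleanly.
function serializeSidecar(info) {
  return JSON.stringify(buildSidecar(info), null, 2);
}
```

Writing the result next to the MP4 (for example hero.mp4 alongside a hero.json, a naming convention assumed here) keeps the description of the pipeline attached to its output.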
What can go wrong, in priority order
After three years, the failure modes have a clear long tail. Here are the top ones, with frequencies from our error telemetry.
- Font load timeout (4.1% of renders). User referenced a font that the network is slow to deliver. Fix: bundle the font locally.
- Composition timeout in __hfReady (1.2%). User's async setup never resolved. Usually a fetch that hangs. Fix: add a Promise.race with a timeout.
- Image 404 (0.8%). User referenced a path that doesn't exist. Fix: lint catches this before render.
- Out-of-memory in Chromium (0.3%). User created a 12-layer-deep filter graph that the rasterizer cannot fit in 4GB. Fix: simplify or render at lower resolution.
- Encoder crash (<0.1%). ffmpeg got something it could not handle. Usually a 16k-wide canvas. Fix: raise a sensible error before invoking the encoder.
We work hard to make every one of these fail at lint time, not at render time. The lint pass catches asset references, font references, suspicious DOM sizes, and missing duration metadata. By the time you run render, the only things that can fail are network and out-of-memory — and even those we trap with clear messages.
What the pipeline looks like in numbers
A representative render: 5 seconds at 1920×1080, 60fps. 300 frames. Warm Chromium.
| Stage | Time | Notes |
|---|---|---|
| Composition resolution | 12ms | Asset walk, hash, manifest |
| Chromium boot | 0ms | Warm pool |
| Page load + ready gates | 180ms | Most of this is fonts |
| Seek loop | 1380ms | ~4.6ms per frame, parallel with encode |
| Encoder finalize | 90ms | Mux, moov, sidecar |
| Total | 1662ms | |
Cold Chromium adds 500-700ms. A 30-second video at the same resolution lands around 9 seconds wall time. A 4K render at 60fps is roughly 4x slower per frame. These numbers are honest, measured on a laptop CPU, with no GPU acceleration of the rasterizer.
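The extrapolations above follow from a simple linear model over the table's numbers, sketched here. The per-frame and fixed costs are taken from the table; treating the fixed costs as independent of video length is an assumption, and estimateWallMs is a hypothetical helper.

```javascript
// Numbers come from the table; the linear model itself is an assumption.
function estimateWallMs({
  seconds,
  fps,
  perFrameMs = 4.6,        // seek loop cost per frame, warm, 1080p
  fixedMs = 12 + 180 + 90, // resolution + page load + finalize
  coldBootMs = 0,          // add 500-700 for a cold Chromium
}) {
  const frames = Math.round(seconds * fps);
  const wallMs = Math.round(coldBootMs + fixedMs + frames * perFrameMs);
  return { frames, wallMs };
}

console.log(estimateWallMs({ seconds: 5, fps: 60 }));  // { frames: 300, wallMs: 1662 }
console.log(estimateWallMs({ seconds: 30, fps: 60 })); // { frames: 1800, wallMs: 8562 }
```

The 30-second case lands at about 8.6 seconds, consistent with the "around 9 seconds" figure. Passing perFrameMs: 4.6 * 4 gives a rough figure for 4K under the stated 4x per-frame slowdown.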
The takeaway: from your keystroke to your MP4 on disk, every stage is doing specific, measurable work. None of it is magic. All of it is now boring infrastructure, which is exactly what we wanted. If you want a contrast with a different architecture, the Remotion comparison walks through the same six stages on their pipeline.
Cite this post (BibTeX, APA, Markdown):
@misc{tanaka2026from,
author = {Kira Tanaka},
title = {From DOM to MP4: an annotated render},
year = {2026},
url = {https://hyperframes.video/blog/from-dom-to-mp4},
note = {HyperFrames blog}
}
Kira Tanaka. (2026, May 5). From DOM to MP4: an annotated render. HyperFrames. https://hyperframes.video/blog/from-dom-to-mp4
[From DOM to MP4: an annotated render](https://hyperframes.video/blog/from-dom-to-mp4) — Kira Tanaka, 2026
Kira works on the render core: headless Chromium scheduling, frame capture, and the encoder pipeline. She cares about reproducible builds and small numbers next to the word "variance."