The agent's camera
What does video tooling look like when the author is a language model? A look at how HyperFrames was designed from frame one to be the easiest video pipeline an LLM can drive.
The first time I gave Claude a HyperFrames composition and asked it to "make the second title card feel more confident," it took twelve seconds. The title moved up 4 pixels, the easing curve changed from ease-out to cubic-bezier(.2,.7,.1,1), and the font weight went from 500 to 600. The video, when I rendered it, was visibly better. I had not asked for any of those specific changes. I had asked it to make a video feel more confident, and it had translated that into precise edits to a file.
I have been a filmmaker. I have been a developer advocate at a CI company. I have spent the last year watching language models slowly become very, very good at writing motion design, and I want to tell you what I have learned about how to design tools for them. Because the tools matter enormously. Most video pipelines, handed to an LLM, produce slop. HyperFrames, handed to the same LLM, produces work that ships. The difference is not the model. The difference is what we expose to it.
The author is no longer human-shaped
For thirty years, every video tool has been designed for a human sitting at a desk with a mouse. The interface is timelines and bezier curves and modal dialogs and undo stacks. The mental model is direct manipulation: you grab a thing and drag it. The author and the tool are in continuous physical conversation.
An LLM does not have hands. It cannot drag a bezier handle. It cannot watch a preview and decide to nudge a keyframe by two pixels. It can, however, read a file, understand the structure, and write a different file. The natural interface for an LLM is the source code of the composition itself. Plain text. Versionable. Diffable. Greppable.
This is why HyperFrames is HTML. We could have invented a new format. Many tools have. But every new format is a tax on every model — the model has to learn the schema, learn the gotchas, learn what is legal and what is not. HTML is free. The model already knows it. It has read a billion examples. When we expose a composition as HTML, the model arrives pre-trained. The OpenAI integration wires this up as a single function-calling tool you can drop into an existing agent.
What an LLM needs from a video tool
I have spent a lot of time watching agents author videos. The pattern of what they need is consistent across models — GPT, Claude, Gemini — and it is different from what a human needs.
A single file as the substrate. Humans tolerate project files with seventeen scattered assets. Agents need everything in one place, or they hallucinate paths. Our compositions are single HTML files with inline styles and inline scripts. Assets are referenced by URL or base64. The whole video fits in a context window.
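To make the single-file idea concrete, here is a minimal sketch of inlining an asset as a base64 data URI. It uses only the Python standard library and is not a HyperFrames API; the helper name inline_asset and the one-line HTML wrapper are hypothetical.

```python
import base64
import mimetypes


def inline_asset(filename: str, data: bytes) -> str:
    """Return a data: URI so the asset travels inside the HTML file itself."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Hypothetical usage: embed a logo directly in the composition markup.
# The bytes here are just the PNG signature, standing in for a real file.
logo_uri = inline_asset("logo.png", b"\x89PNG\r\n\x1a\n")
html = f'<img alt="logo" src="{logo_uri}">'
```

Base64 makes the payload roughly a third larger than the raw bytes, so this suits small assets like logos and icons; larger media is probably better referenced by URL so it does not eat the context window.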
Deterministic render. An LLM in an iteration loop needs stable feedback: the same file has to produce the same frames every time. If the render is noisy, the agent cannot tell whether its last edit helped, and the gradient is garbage. We made the full case earlier in "A deterministic video manifesto."
A linter, not a debugger. Humans debug by stepping through. Agents debug by re-reading the error. The richer the error, the better the next attempt. hyperframes lint produces structured errors: line numbers, expected versus actual, suggested fixes. We review every error message against one question: "what would an LLM do with this?" If the answer is "be confused," we rewrite the message.
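Here is a rough sketch of why structured errors help. The LintError fields below are my assumption, modeled on "line numbers, expected versus actual, suggested fixes"; the real hyperframes lint output may be shaped differently, and format_for_agent is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class LintError:
    # Hypothetical fields; the real hyperframes lint schema may differ.
    line: int
    message: str
    expected: str
    actual: str
    suggestion: str


def format_for_agent(errors: list[LintError]) -> str:
    """Render lint errors as compact text the model can act on directly."""
    if not errors:
        return "lint: clean"
    parts = []
    # Sort by line so the model can fix the file top to bottom.
    for err in sorted(errors, key=lambda e: e.line):
        parts.append(
            f"line {err.line}: {err.message}\n"
            f"  expected: {err.expected}\n"
            f"  actual:   {err.actual}\n"
            f"  fix:      {err.suggestion}"
        )
    return "\n".join(parts)
```

Each entry carries everything the next attempt needs: where the problem is, what was wrong, and what to write instead. Nothing has to be inferred from a stack trace.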
A preview that the agent can see. We expose hyperframes preview --json which outputs a structured description of the composition at frame intervals — text, positions, colors, durations. An agent can sample-render at 0s, 1s, 2s and read the JSON to verify what it built without having to look at pixels. (When pixels are needed, we render to PNG and pipe it back through a multimodal model. But the JSON path is the fast loop.)
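As an illustration of the fast loop, here is a small sketch of checking a brief against sampled JSON. The JSON shape is an assumption for the example, not the documented output of hyperframes preview --json, and texts_at is a hypothetical helper.

```python
import json

# Hypothetical output shape; the real preview JSON schema may differ.
SAMPLE = """
[
  {"t": 0.0, "elements": []},
  {"t": 1.0, "elements": [{"id": "title", "text": "Q3 results", "x": 96, "y": 420}]},
  {"t": 2.0, "elements": [{"id": "title", "text": "Q3 results", "x": 96, "y": 416}]}
]
"""


def texts_at(samples: list[dict], t: float) -> list[str]:
    """Return the visible text strings in the sample taken at time t (seconds)."""
    for sample in samples:
        if sample["t"] == t:
            return [el["text"] for el in sample["elements"]]
    raise KeyError(f"no sample at t={t}")


samples = json.loads(SAMPLE)
# The brief says the title should be absent at 0s and on screen by 1s.
title_ok = (
    "Q3 results" not in texts_at(samples, 0.0)
    and "Q3 results" in texts_at(samples, 1.0)
)
```

A check like this costs a few milliseconds and no vision call, which is why the JSON path is worth exhausting before falling back to pixels.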
Sub-composition reuse. Agents are excellent at composition. If you give them a <LowerThird> and a <KPICard> and a <TitleCard>, they will reach for them. If you make them write motion graphics primitives from scratch every time, they will, but the output will be worse and the iteration will be slower. We ship a registry of these primitives at hyperframes add. The agent imports them.
The shape of the agent loop
When an agent authors a video in HyperFrames, the loop looks like this, every time:
1. Read the brief.
2. Generate a draft composition HTML.
3. Run hyperframes lint. Read the errors. Fix.
4. Run hyperframes preview --frame 1500 --out probe.png. Look at the still.
5. Render the full thing. Open the MP4 (or, more usefully, watch the timeline JSON).
6. Compare against the brief. Identify the largest delta.
7. Edit one or two things. Go to 3.
What is striking about this loop is how much like a human's loop it is. A motion designer at a desk does the same thing: draft, preview, lint mentally, render a probe, compare. The reason the agent can do this is that we have exposed all the same tools the human uses, but in a form the agent can drive. Lint runs in the terminal. Preview produces a file. Render produces a file. The agent can run all of these with Bash. There is no GUI to click.
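The loop can be sketched in a few lines. This is a simplified skeleton under my own assumptions, not the HyperFrames implementation: every step is passed in as a plain function so the control flow is visible, and the names author_video, draft, lint, fix, and review are hypothetical.

```python
from typing import Callable


def author_video(
    brief: str,
    draft: Callable[[str], str],              # brief -> composition HTML
    lint: Callable[[str], list[str]],         # HTML -> error messages
    fix: Callable[[str, list[str]], str],     # HTML + feedback -> new HTML
    review: Callable[[str, str], list[str]],  # brief + HTML -> deltas, largest first
    max_rounds: int = 5,
) -> str:
    """Draft once, then lint, review, and edit until nothing is left to fix."""
    html = draft(brief)
    for _ in range(max_rounds):
        errors = lint(html)
        if errors:
            html = fix(html, errors)   # fix lint errors, then lint again
            continue
        deltas = review(brief, html)   # probe still or render, compared to the brief
        if not deltas:
            return html                # on brief: done
        html = fix(html, deltas[:1])   # address only the largest delta
    return html                        # out of rounds: return the latest attempt
```

In a real agent, lint and review would shell out to the CLI commands from the list above and read the files they produce, while draft and fix would be model calls. Capping the rounds keeps a confused agent from looping forever.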
If you have used Cursor or Claude Code to write a TypeScript project, you have used this pattern. Agents are excellent at developer loops — write, lint, run, read output, fix. The bet HyperFrames made early was: make the video loop look exactly like a developer loop. Then every existing agent already knows how to drive it.
What agents are bad at
I want to be honest about the failure modes, because they are real.
Agents are bad at taste. They will produce compositions that are technically correct, on-brief, well-eased, and somehow lifeless. The difference between a good motion designer's work and an agent's work is rarely a specific keyframe; it is the gestalt of fifty small choices, every one of which the agent had no preference about. We address this by shipping opinionated defaults. Our default eases are not browser-default; they are tuned. Our default type scale is not Tailwind-default; it is a typographer's. The agent, lacking taste, inherits ours.
Agents are bad at long-form pacing. A 5-second composition? Fine. A 60-second composition with three acts and a turn at 0:38? Hard. The model loses the thread of the larger structure when every individual frame is a local optimization. We are experimenting with composition outlines — a YAML file the agent writes first that describes the beats — and then the agent fills in HTML for each beat. Early results are encouraging but not shippable.
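To give a feel for what an outline buys you, here is a sketch of the kind of check it makes possible. The outline format is still experimental, so the Beat fields and check_outline are hypothetical, and the outline is written as Python data rather than YAML to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    name: str
    start: float  # seconds
    end: float    # seconds


def check_outline(beats: list[Beat], total: float) -> list[str]:
    """Return problems with an outline: gaps, overlaps, or wrong total length."""
    problems = []
    cursor = 0.0  # where the next beat is expected to start
    for beat in beats:
        if beat.end <= beat.start:
            problems.append(f"{beat.name}: end must be after start")
        if beat.start != cursor:
            problems.append(f"{beat.name}: starts at {beat.start}, expected {cursor}")
        cursor = beat.end
    if cursor != total:
        problems.append(f"outline ends at {cursor}, expected {total}")
    return problems


# Hypothetical 60-second outline with the turn at 0:38.
outline = [
    Beat("setup", 0.0, 20.0),
    Beat("build", 20.0, 38.0),
    Beat("turn", 38.0, 60.0),
]
```

The point is that the large structure gets validated once, up front, before the agent starts making frame-level decisions inside each beat.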
Agents are bad at brand consistency across many compositions. If you have a brand system with fifteen rules, the agent will follow ten of them in any given composition, but a different ten each time. We address this with a brand.css that the agent imports verbatim — colors, type, spacing — so the rules cannot drift.
Why the agent's camera is HTML, specifically
Could we have built this on Remotion? Manim? A custom DSL? Yes to all three, and I have shipped on all three. Here is why HTML wins for the agent case in particular.
Remotion is React, which is JavaScript, which is fine — but the agent has to learn the timeline conventions, the composition API, the frame counter. None of these are universally known. Every Remotion project I have seen Claude write has at least one off-by-one in the frame math, because the model is reasoning about a custom timeline rather than the CSS clock it already knows. (We get into the full architectural delta in the Remotion comparison.)
Manim is Python and beautiful for technical animation. But Manim is also a deep API, and the agent has to remember which Animation subclass to use for which effect. When it forgets, the output is wrong in a way that does not lint. The error mode is silent.
A custom DSL is the cleanest in theory and the worst in practice. Every DSL is a tax. The first thing the agent does with a DSL is mistranslate the brief into it. HTML has no translation step. The agent is writing in the substrate.
This is also why we chose plain CSS keyframes over a JS animation library. Libraries like GSAP and Anime.js are wonderful for humans because they hide complexity. They are bad for agents because they hide complexity. The agent cannot reason about an effect it cannot see in the source. We support GSAP for advanced use cases, but our defaults are vanilla CSS for a reason.
The next year of agentic video
Two things will happen quickly.
First, the latency from "describe a video" to "ship a video" will fall below thirty seconds for a meaningful range of briefs. We are already there for short title cards, lower thirds, social ads. The full minute-long product explainer is six months out, maybe twelve. The bottleneck is not the model; it is the toolchain.
Second, the unit of authorship will shift from "one video" to "ten thousand variants." When an agent can ship a video in thirty seconds, you do not ship one — you ship a campaign. Different copy, different lengths, different aspect ratios, different markets. We will write more about this in Render 10k variants overnight. The short version: the agent is the camera, and the camera is now cheap enough to point a thousand times.
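A quick sketch of how the variant count climbs. The axes and values are made up for illustration, and the dictionary keys are hypothetical rather than any HyperFrames input format.

```python
from itertools import product

# Hypothetical variant axes; a real campaign would define its own.
headlines = ["Ship faster", "Ship calmer"]
lengths_s = [6, 15, 30]
aspect_ratios = ["16:9", "9:16", "1:1"]
markets = ["en-US", "de-DE", "ja-JP", "pt-BR"]

# One brief per combination of headline, length, aspect ratio, and market.
variants = [
    {"headline": h, "length_s": n, "aspect": a, "market": m}
    for h, n, a, m in product(headlines, lengths_s, aspect_ratios, markets)
]

print(len(variants))  # 2 * 3 * 3 * 4 = 72
```

Four small axes already give 72 briefs. Add a few more headlines and a dozen markets and the count reaches the thousands, which is only practical when each render is cheap and unattended.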
If you are building anything that touches video — marketing, education, social, internal training, regulated communications — your team is about to grow by one. The new teammate works fast, takes direction in English, and never tires. The tool you give it is the multiplier. Make sure the tool was built with this teammate in mind.
We built HyperFrames for that teammate. Open a terminal. Run npx hyperframes init, or pair-program with the model directly in the HyperFrames playground. Ask Claude to make you something. Then come back and tell us what you saw.
Cite this post
@misc{park2026agents,
author = {Ren Park},
title = {The agent's camera},
year = {2026},
url = {https://hyperframes.video/blog/the-agents-camera},
note = {HyperFrames blog}
}
Ren Park. (2026, May 3). The agent's camera. HyperFrames. https://hyperframes.video/blog/the-agents-camera
[The agent's camera](https://hyperframes.video/blog/the-agents-camera) — Ren Park, 2026
Ren writes guides, runs workshops, and breaks the CLI on purpose so you do not have to. Previously dev rel at a CI company; before that, an actual filmmaker.