AI video generation: wire ChatGPT or Claude to an MP4 endpoint
Prompt → HTML → MP4 in one POST. The system prompt, the rendering route, and the safety net.
The pitch most "AI video generation" startups make is wrong. The interesting AI video pipeline is not "diffusion model generates frames pixel-by-pixel": those outputs are unreliable, expensive, and nondeterministic. The interesting pipeline is "LLM writes HTML, renderer turns HTML into video." The LLM generates code, which is what LLMs are actually good at.
Here is that pipeline end-to-end: prompt → HTML → MP4.
Why HTML is the right intermediate
Three reasons HTML beats pixels as the AI's output target:
- LLMs are excellent at HTML. Years of training data; small models can output coherent CSS animations.
- HTML is editable. A human can read the output and fix the one wrong color before rendering.
- HTML renders deterministically. Same HTML, same MP4. Same pixels, every time.
Diffusion-generated frames have none of these. The version you render today is not the version you render tomorrow. The output is not text and you cannot edit it. The LLM/HTML pipeline is the only one that fits an engineering workflow.
The system prompt
This is the single most important file in the pipeline. The LLM needs to know:
- The output shape (a single HTML document)
- The animation contract (addEventListener('hf-seek', ...), no setInterval)
- The variable system ({{$VAR}} placeholders)
- The aspect ratio and duration
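The variable system can be as simple as a string substitution pass that runs before rendering. A minimal sketch, assuming placeholders look exactly like {{$NAME}} with uppercase names and that values should be HTML-escaped; fillTemplate and escapeHtml are hypothetical helpers, not part of any SDK:

```typescript
// Hypothetical helper: escape a value so it is safe to drop into HTML text.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: replace every {{$NAME}} placeholder with its value.
// Unknown placeholders are left in place so a later check can flag them.
function fillTemplate(html: string, vars: Record<string, string>): string {
  return html.replace(/\{\{\$([A-Z0-9_]+)\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? escapeHtml(vars[name]) : match,
  );
}

const filled = fillTemplate('<h1>{{$TITLE}}</h1><p>{{$HOST}}</p>', {
  TITLE: 'Frame by Frame',
  HOST: 'Kira Tanaka',
});
console.log(filled); // <h1>Frame by Frame</h1><p>Kira Tanaka</p>
```

Keeping substitution outside the LLM call means one generated template can be re-rendered with new text without another round trip to the model.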
A working system prompt, abbreviated:
You generate deterministic animation templates as single HTML documents.
Constraints:
- Output a single HTML document. No external assets.
- All animation runs as a pure function of t (the playhead time in seconds).
- Listen for `hf-seek` CustomEvents on window; read e.detail.time.
- Never use setInterval, setTimeout, or requestAnimationFrame for animation.
- Use {{$VAR}} placeholders for any text the user might change.
- The animation should loop within data-duration seconds.
Example structure:
<!doctype html><html data-duration="6" data-aspect="16:9">
<head><style>...</style></head>
<body>...
<script>
function render(t) { /* mutate DOM based on t */ }
addEventListener('hf-seek', e => render(e.detail.time));
render(0);
</script>
</body></html>
Both the OpenAI and Claude integrations ship a longer version of this prompt.
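To make the "pure function of t" constraint concrete, here is a minimal sketch of the kind of logic a generated render(t) might contain. The names (clamp01, easeOutCubic, titleStateAt), the 0.8-second fade-in, and the 40px offset are illustrative assumptions, not something the prompt prescribes:

```typescript
// Clamp a number into the [0, 1] range.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Standard ease-out cubic: fast start, slow finish.
const easeOutCubic = (p: number): number => 1 - Math.pow(1 - p, 3);

interface TitleState {
  opacity: number;    // 0..1
  translateY: number; // px offset from the final position
}

// Pure function of the playhead time: no timers, no hidden state.
function titleStateAt(t: number, fadeIn = 0.8): TitleState {
  const p = easeOutCubic(clamp01(t / fadeIn));
  return { opacity: p, translateY: (1 - p) * 40 };
}

// In the browser, render(t) would apply this state to the DOM, for example:
//   el.style.opacity = String(state.opacity);
//   el.style.transform = `translateY(${state.translateY}px)`;
console.log(titleStateAt(0), titleStateAt(0.4), titleStateAt(6));
```

Because the state depends only on t, seeking to the same time twice yields the same values, which is what lets a renderer capture frames in any order and still produce the same video.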
The route handler
The full pipeline as a single Next.js Route Handler:
// app/api/ai-render/route.ts
import OpenAI from 'openai';
import { renderHtmlToMp4 } from '@hyperframes/sdk';

// SYSTEM_PROMPT, extractHtml, and validateHtml are assumed to be defined elsewhere in the project.
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: prompt },
    ],
  });

  // message.content can be null, so fall back to an empty string.
  const html = extractHtml(completion.choices[0].message.content ?? '');
  // Skip the renderer when the generated document fails validation.
  if (!validateHtml(html)) return new Response('Generation failed', { status: 422 });

  const mp4 = await renderHtmlToMp4(html, { width: 1920, height: 1080, duration: 6 });
  return new Response(mp4, { headers: { 'content-type': 'video/mp4' } });
}
extractHtml pulls the HTML out of the LLM's response (the model usually wraps it in a fenced html code block). validateHtml sanity-checks the result: it has a <!doctype>, an hf-seek listener, and a reasonable element count. If validation fails, we skip the renderer and surface the error to the user.
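Neither helper is shown here, so the following is only a sketch of what they might look like, under stated assumptions: the model wraps its answer in a Markdown code fence, "reasonable element count" means at most 500 opening tags (an arbitrary threshold), and regex checks are good enough for a first pass.

```typescript
// Assumption: the model wraps its answer in a Markdown code fence.
// The fence string is built at runtime to avoid a literal fence in this example.
const FENCE = '`'.repeat(3);
const FENCED_HTML = new RegExp(FENCE + '(?:html)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Sketch of extractHtml: prefer the fenced block, else fall back to the raw text.
function extractHtml(response: string): string {
  const match = FENCED_HTML.exec(response);
  return (match ? match[1] : response).trim();
}

// Sketch of validateHtml: cheap structural checks before paying for a render.
// The 500-element ceiling is an arbitrary illustrative threshold.
function validateHtml(html: string, maxElements = 500): boolean {
  const hasDoctype = /^<!doctype html>/i.test(html);
  const hasSeekListener = /addEventListener\(\s*['"]hf-seek['"]/.test(html);
  const elementCount = (html.match(/<[a-z][^>]*>/gi) ?? []).length;
  return hasDoctype && hasSeekListener && elementCount > 0 && elementCount <= maxElements;
}

const sampleDoc =
  "<!doctype html><html><body><script>addEventListener('hf-seek', e => {});</script></body></html>";
const sampleReply = 'Here you go:\n' + FENCE + 'html\n' + sampleDoc + '\n' + FENCE;
console.log(validateHtml(extractHtml(sampleReply))); // true
```

A regex pass like this will not catch every malformed document; a stricter version could parse the HTML properly, at the cost of an extra dependency.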
The result, end to end
What it actually looks like, prompt to pixels:
"A 6-second title card for a podcast called 'Frame by Frame', episode 42, with the host name Kira Tanaka. Dark background, orange accent for the episode number."
The safety net
LLMs hallucinate. Two safeguards I run in production:
- Validation. Reject any HTML that fails to compile, lacks hf-seek, or exceeds 50KB. Re-prompt with the validation error.
- Sandboxing. Render in an isolated worker with no network access. Even if the LLM emits malicious JS, it has nowhere to send the data.
Most "AI fails dramatically" stories are missing these two. With them, the worst case is a render that produces a boring video, not a security incident.
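The "re-prompt with the validation error" step can be sketched as a small retry loop. Most of this is an assumption for illustration: the generate callback stands in for the LLM call, checkHtml returns a list of human-readable problems, and three attempts is an arbitrary cap. Only the hf-seek and 50KB checks come from the safeguards above.

```typescript
// Stand-in for the LLM call; feedback carries the previous validation errors.
type Generate = (prompt: string, feedback?: string) => Promise<string>;

const MAX_BYTES = 50 * 1024; // the 50KB ceiling from the validation rule

// Return human-readable problems; an empty array means the HTML passed.
function checkHtml(html: string): string[] {
  const problems: string[] = [];
  if (!html.includes('hf-seek')) problems.push('missing hf-seek listener');
  if (new TextEncoder().encode(html).length > MAX_BYTES) problems.push('document exceeds 50KB');
  return problems;
}

// Call the generator, validate, and feed the errors back on failure.
async function generateWithRetry(
  generate: Generate,
  prompt: string,
  maxAttempts = 3,
): Promise<string> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const html = await generate(prompt, feedback);
    const problems = checkHtml(html);
    if (problems.length === 0) return html;
    feedback = `Previous output was rejected: ${problems.join('; ')}. Fix these and resend the full document.`;
  }
  throw new Error(`No valid HTML after ${maxAttempts} attempts. ${feedback}`);
}
```

Capping the attempts keeps a stubborn prompt from burning tokens indefinitely; once the cap is hit, the caller can fall back to the 422 response.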
What this lets you build
The interesting downstream surfaces, in order of how much they will surprise you:
- Slack bot that renders MP4 explainers from any user prompt.
- Internal tools that turn a Linear ticket into a 6-second customer-facing video.
- End-user features where the prompt is "explain my dashboard" and the output is a personalized video walkthrough.
The Agents Camera post covers the broader frame here — agents that emit video are the next interesting product surface, and the LLM-to-HTML pipeline is what makes them economical.
What still needs human review
Three things the LLM is bad at, in 2026:
- Brand consistency. It will pick "an orange" but not "your orange." Pin colors in the system prompt.
- Long durations. 6-second renders are reliable; 30-second renders drift in pacing.
- Type hierarchy. The LLM will use one big font and one small font; getting the middle tier right takes a human pass.
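Pinning colors, as the first item suggests, can be done mechanically by appending the palette to the system prompt. A minimal sketch; the palette shape, the hex values, and the withBrandPalette helper are all hypothetical:

```typescript
// Role name -> hex color, for example accent -> #ff6a00.
type BrandPalette = Record<string, string>;

// Hypothetical helper: append an explicit palette so the model stops guessing.
function withBrandPalette(systemPrompt: string, palette: BrandPalette): string {
  const lines = Object.entries(palette).map(([role, hex]) => `- ${role}: ${hex}`);
  return [
    systemPrompt.trimEnd(),
    '',
    'Brand palette (use these exact values; do not invent other colors):',
    ...lines,
  ].join('\n');
}

const pinnedPrompt = withBrandPalette('You generate deterministic animation templates.', {
  background: '#0b0b0f', // placeholder values, not a real brand
  accent: '#ff6a00',
});
console.log(pinnedPrompt);
```

A validator could then reject any generated HTML that contains a hex color outside the palette, which turns brand review from a visual check into a string check.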
Wire the LLM to draft. Have a human edit. Render. Ship. The combined loop is a 10x improvement over either pure-human or pure-LLM workflows.
See also: the OpenAI integration and the render API surface for the SDK details.
Cite this post
BibTeX:
@misc{park2026video,
  author = {Ren Park},
  title = {AI video generation: wire ChatGPT or Claude to an MP4 endpoint},
  year = {2026},
  url = {https://hyperframes.video/blog/ai-video-generation-api},
  note = {HyperFrames blog}
}
APA:
Ren Park. (2026, May 12). AI video generation: wire ChatGPT or Claude to an MP4 endpoint. HyperFrames. https://hyperframes.video/blog/ai-video-generation-api
Markdown:
[AI video generation: wire ChatGPT or Claude to an MP4 endpoint](https://hyperframes.video/blog/ai-video-generation-api), Ren Park, 2026
Ren writes guides, runs workshops, and breaks the CLI on purpose so you do not have to. Previously dev rel at a CI company; before that, an actual filmmaker.