Semantic video metadata: making MP4s discoverable in 2026
VideoObject schema, WebVTT chapters, MP4 atom metadata, and the YouTube data layer — a 2026 guide to making your videos discoverable, accessible, and machine-readable.
A confession: at HyperFrames, we shipped video output for a year before we thought seriously about metadata. The MP4 came out, the bytes were correct, the visual fidelity was right, and we called it done. Then we started actually distributing video — to social platforms, to our own site, to customers who in turn distributed it — and discovered how much unstructured information lives in and around a video file.
In 2026 this matters more than it used to. Search engines, social platforms, AI assistants, and accessibility tools all read video metadata. They read different parts of it. They read it from different places. Getting any of this right requires understanding the whole stack — from schema.org markup on the embedding page, through WebVTT tracks for captions and chapters, down into the binary metadata atoms of the MP4 itself.
This post is the field guide I wish I had read 18 months ago.
The four layers of video metadata
Before we go deep, here is the mental model. Every video on the modern web has metadata at four layers, and each is read by a different set of consumers.
- Page-level structured data (JSON-LD, VideoObject). Read by Google, Bing, social previews, AI search assistants. Lives on the HTML page that embeds the video.
- Sidecar text tracks (WebVTT for captions, chapters, descriptions). Read by browsers (<track> element), screen readers, video players, YouTube, and AI transcription/translation tools.
- Container-level metadata (MP4 atoms: udta, ©nam, meta). Read by file managers, media players, content management systems. Survives re-uploading and re-embedding more reliably than the page-level data.
- Embedded captions and burned-in text. Read by humans, and by OCR-based AI pipelines. The least structured, sometimes the most durable.
Most teams get layer 1 right (it is well-documented) and ignore the rest. That is a missed opportunity. The interesting use cases of 2026 — AI assistants summarizing your content, agentic systems indexing your video library, social platforms generating previews — read from layers 2 and 3 as much as from layer 1.
Layer 1: VideoObject schema
The baseline. Every page that embeds video should include a VideoObject JSON-LD block in the <head>. Google's documentation has the canonical reference; here is the shape we ship:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "VideoObject",
"name": "From DOM to MP4: an annotated render",
"description": "A frame-by-frame walkthrough of the HyperFrames render pipeline.",
"thumbnailUrl": "https://hyperframes.dev/blog/dom-to-mp4-thumb.jpg",
"uploadDate": "2026-05-05T13:00:00Z",
"duration": "PT4M32S",
"contentUrl": "https://hyperframes.dev/blog/dom-to-mp4.mp4",
"embedUrl": "https://hyperframes.dev/blog/from-dom-to-mp4",
"publisher": {
"@type": "Organization",
"name": "HyperFrames",
"logo": {
"@type": "ImageObject",
"url": "https://hyperframes.dev/brand/logo-square.png"
}
},
"transcript": "The transcript text...",
"hasPart": [
{
"@type": "Clip",
"name": "Stage 1: Composition resolution",
"startOffset": 0,
"endOffset": 47
},
{
"@type": "Clip",
"name": "Stage 2: Chromium boot",
"startOffset": 47,
"endOffset": 124
}
]
}
</script>
A few things to flag.
The duration field uses ISO 8601 duration format. PT4M32S = 4 minutes, 32 seconds. Browsers and search engines are picky about this; do not invent your own.
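If you store durations as plain seconds, a small helper avoids hand-writing the ISO 8601 string. This is a minimal sketch, assuming whole, non-negative seconds and videos shorter than a day; the function name is mine, not part of any library.

```python
def iso8601_duration(total_seconds: int) -> str:
    """Format a non-negative whole number of seconds as an ISO 8601 duration."""
    if total_seconds < 0:
        raise ValueError("duration must be non-negative")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = ["PT"]
    if hours:
        parts.append(f"{hours}H")
    if minutes:
        parts.append(f"{minutes}M")
    # Emit seconds when they are non-zero, or when nothing else was emitted,
    # so that a zero-length duration becomes "PT0S" rather than a bare "PT".
    if seconds or len(parts) == 1:
        parts.append(f"{seconds}S")
    return "".join(parts)


print(iso8601_duration(272))  # 4 minutes, 32 seconds
```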
The hasPart array with Clip types is how you signal chapters. Google has used these to generate the chapter-list previews in search results for several years; in 2026 they are also read by major AI assistants when they answer questions about your video.
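Writing Clip entries by hand invites off-by-one gaps between chapters, so it can help to derive them from a single list of chapter start times. The sketch below is one way to do that, under two assumptions: chapters are given as (name, start_seconds) pairs sorted by start time, and the embedding page accepts a ?t=<seconds> deep link. Google's key-moments guidance also asks for a url on each Clip, which is why the sketch adds one; the function name and the link format are illustrative.

```python
def clips_from_chapters(chapters, total_seconds, page_url):
    """Turn [(name, start_seconds), ...] into schema.org Clip dicts.

    Each clip ends where the next one starts; the last ends at total_seconds.
    """
    clips = []
    for index, (name, start) in enumerate(chapters):
        is_last = index + 1 == len(chapters)
        end = total_seconds if is_last else chapters[index + 1][1]
        if not 0 <= start < end <= total_seconds:
            raise ValueError(f"chapter {name!r} has an invalid time range")
        clips.append({
            "@type": "Clip",
            "name": name,
            "startOffset": start,
            "endOffset": end,
            # Assumption: the player on page_url honours a ?t=<seconds> link.
            "url": f"{page_url}?t={start}",
        })
    return clips


has_part = clips_from_chapters(
    [("Stage 1: Composition resolution", 0), ("Stage 2: Chromium boot", 47)],
    total_seconds=124,
    page_url="https://example.com/watch",  # placeholder URL
)
```

The returned list can be assigned to the hasPart key of the VideoObject before serialising it.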
The transcript field is read by AI search. In 2026 this is increasingly important — if a user asks an AI assistant "what does HyperFrames say about rendering pipelines?", the assistant is more likely to find your video if the transcript is in the page-level metadata than if it is buried in a sidecar track.
Layer 2: WebVTT tracks
A WebVTT (.vtt) file is plain text. It looks like this:
WEBVTT
Kind: chapters

00:00:00.000 --> 00:00:47.000
Stage 1: Composition resolution

00:00:47.000 --> 00:02:04.000
Stage 2: Chromium boot

00:02:04.000 --> 00:03:18.000
Stage 3: Page load and ready-gate

00:03:18.000 --> 00:04:32.000
Stage 4: The seek loop

The blank lines matter: WebVTT separates the header and each cue with an empty line. You attach this to a <video> element via a <track> element with kind="chapters". Browsers expose the cues through the TextTrack API, and players that surface chapters let the user jump between them; native chapter UI differs from browser to browser, so test the ones you target. Screen readers announce them; YouTube reads them if you upload the WebVTT alongside the video.
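Because the format is this regular, generating a chapters file from data is a few lines of code. Here is a minimal sketch, assuming chapters arrive as (title, start_seconds) pairs sorted by start time; the function names are illustrative.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def chapters_to_vtt(chapters, total_seconds):
    """Build a WebVTT chapters file from [(title, start_seconds), ...]."""
    lines = ["WEBVTT", ""]  # the header must be followed by a blank line
    for index, (title, start) in enumerate(chapters):
        is_last = index + 1 == len(chapters)
        end = total_seconds if is_last else chapters[index + 1][1]
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(title)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


print(chapters_to_vtt(
    [("Stage 1: Composition resolution", 0), ("Stage 2: Chromium boot", 47)],
    total_seconds=124,
))
```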
The other kind values you should care about:
- subtitles: translation of dialogue into another language. Read by humans.
- captions: verbatim dialogue plus relevant non-speech audio ("[door slams]"). Read by humans and search engines.
- descriptions: audio descriptions for visually-impaired users. Read by screen readers.
- metadata: application-specific data. Read by your own JavaScript.
The single most underused of these is descriptions. For any video where the visual carries information not present in the audio (which is most editorial video, charts, demos), audio descriptions are an accessibility win, an SEO win, and a discoverability win.
Layer 3: MP4 atoms
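This is the layer most people skip, and I think they are wrong to. The MP4 container has a structured metadata section, the udta (user data) atom, that travels with the file itself. It survives downloads, copies, and re-embedding on a different page, all of which lose the page-level data. Platforms that transcode uploads can still drop or rewrite it, so verify the ones you depend on.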
The schema is iTunes-derived (because Apple defined a lot of this in the early 2000s). The fields are four-character codes, some prefixed with ©:
- ©nam: title
- ©ART: author/artist
- ©day: date
- ©cmt: comment / description
- ©too: encoding tool
- cprt: copyright (©cpy in QuickTime .mov files)
- desc: description (ldes holds the long description)
You set these with ffmpeg via -metadata:
ffmpeg -i input.mp4 \
  -metadata title="From DOM to MP4: an annotated render" \
  -metadata artist="Kira Tanaka" \
  -metadata date="2026-05-05" \
  -metadata comment="HyperFrames blog: walkthrough of the render pipeline" \
  -metadata copyright="HyperFrames, CC BY 4.0" \
  -codec copy output.mp4
Note the key is artist, not author: that is the name ffmpeg's MP4 muxer maps to ©ART. The HyperFrames CLI sets these for you when you provide a <title> and a <meta name="author"> in your composition's HTML head. We treat them as a contract: the metadata in the MP4 should match the metadata in the source.
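If you tag files from a script rather than by hand, building the argument list in code sidesteps shell quoting problems with titles that contain quotes or colons. This is a sketch of that idea, not how the HyperFrames CLI does it; the function name is hypothetical, and the keys are ffmpeg's generic tag names (title, artist, date, comment, copyright).

```python
def ffmpeg_metadata_args(src, dst, tags):
    """Build an ffmpeg argv that copies streams and sets container tags."""
    args = ["ffmpeg", "-i", src]
    for key, value in tags.items():
        if value:  # skip empty tags instead of writing blank atoms
            args += ["-metadata", f"{key}={value}"]
    # Stream copy: the audio and video are remuxed, not re-encoded.
    args += ["-codec", "copy", dst]
    return args


args = ffmpeg_metadata_args(
    "input.mp4",
    "output.mp4",
    {"title": 'From DOM to MP4: an "annotated" render', "artist": "Kira Tanaka"},
)
# To run it: subprocess.run(args, check=True)
# Passing a list (no shell=True) means the quotes in the title need no escaping.
```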
The interesting bit for 2026: AI agents and search crawlers increasingly read MP4 metadata directly. A file that is downloaded from your site, passed between machines, and re-hosted unchanged still carries the udta metadata you set when you encoded it. Page-level JSON-LD does not survive this trip. Atom metadata does, as long as nothing along the way re-encodes the file and strips it; large platforms that transcode uploads often do, so check the ones you care about.
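To make "reading MP4 metadata directly" concrete: an MP4 is a tree of boxes, each starting with a 4-byte big-endian size and a 4-byte type, and the iTunes-style tags sit under moov > udta > meta > ilst. The sketch below walks that tree and pulls out the text tags. It is a simplified reader under stated assumptions: the whole file is in memory, the meta box is the ISO form with 4 bytes of version and flags before its children, and only UTF-8 text values are decoded. The function names are mine.

```python
import struct

# Boxes whose payload is simply a sequence of child boxes.
CONTAINERS = {b"moov", b"udta", b"trak", b"mdia", b"minf", b"stbl", b"ilst"}


def iter_boxes(data, offset=0, end=None, path=()):
    """Yield (path, payload_start, payload_end) for every box, depth first."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, kind = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:  # a 64-bit size follows the type field
            if offset + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # box runs to the end of the enclosing range
            size = end - offset
        if size < header or offset + size > end:
            break  # malformed box: stop rather than read garbage
        here = path + (kind,)
        yield here, offset + header, offset + size
        if kind in CONTAINERS or (path and path[-1] == b"ilst"):
            yield from iter_boxes(data, offset + header, offset + size, here)
        elif kind == b"meta":
            # Assumption: ISO-style full box, 4 bytes of version/flags first.
            yield from iter_boxes(data, offset + header + 4, offset + size, here)
        offset += size


def read_ilst_text_tags(data):
    """Return {b'\\xa9nam': 'title', ...} for UTF-8 text items under ilst."""
    tags = {}
    for path, start, end in iter_boxes(data):
        is_item_value = len(path) >= 3 and path[-3] == b"ilst" and path[-1] == b"data"
        if is_item_value and end - start >= 8:
            # data payload: 4-byte type indicator, 4-byte locale, then the value
            (type_indicator,) = struct.unpack_from(">I", data, start)
            if type_indicator == 1:  # 1 = UTF-8 text
                tags[path[-2]] = data[start + 8:end].decode("utf-8", errors="replace")
    return tags
```

Usage is `read_ilst_text_tags(open("output.mp4", "rb").read())`. For large files you would seek past mdat instead of loading everything, which this sketch skips for brevity.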
Chapter cues inside the MP4
A less-known trick: MP4 supports an embedded text track that the player reads as chapters. This is different from a sidecar WebVTT file — the chapter info lives inside the MP4 itself, so it survives any kind of redistribution.
The format is a small text track inside the container that players treat as a chapter list. ffmpeg can produce this from a chapters file:
;FFMETADATA1
title=From DOM to MP4
[CHAPTER]
TIMEBASE=1/1000
START=0
END=47000
title=Composition resolution
[CHAPTER]
TIMEBASE=1/1000
START=47000
END=124000
title=Chromium boot
Save as chapters.txt, then:
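ffmpeg -i input.mp4 -i chapters.txt -map_metadata 1 -map_chapters 1 -codec copy output.mp4
The explicit -map_chapters 1 makes sure the chapters come from chapters.txt even if input.mp4 already carries some. Apple's QuickTime, VLC, and mpv will read this; browser support for chapter tracks embedded in the file is uneven, so test the players you target. YouTube will read it on upload and auto-generate the chapter sidebar. This is the single most underused video distribution trick I know.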
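If your chapters already live in data, the chapters file is easy to generate. Here is a minimal sketch, assuming (name, start_seconds) pairs sorted by start time. ffmpeg's metadata format needs =, ;, #, backslash, and newline escaped with a backslash, which the helper handles; the function names are illustrative.

```python
def escape_ffmetadata(value: str) -> str:
    """Backslash-escape the characters ffmpeg's metadata format reserves."""
    out = []
    for ch in value:
        if ch in "=;#\\\n":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def chapters_to_ffmetadata(title, chapters, total_seconds):
    """Build an FFMETADATA1 file from [(name, start_seconds), ...]."""
    lines = [";FFMETADATA1", f"title={escape_ffmetadata(title)}"]
    for index, (name, start) in enumerate(chapters):
        is_last = index + 1 == len(chapters)
        end = total_seconds if is_last else chapters[index + 1][1]
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END below are in milliseconds
            f"START={round(start * 1000)}",
            f"END={round(end * 1000)}",
            f"title={escape_ffmetadata(name)}",
        ]
    return "\n".join(lines) + "\n"


print(chapters_to_ffmetadata(
    "From DOM to MP4",
    [("Composition resolution", 0), ("Chromium boot", 47)],
    total_seconds=124,
))
```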
The schema.org evolution
A few things changed in schema.org's video vocabulary recently that are worth knowing.
The transcript property on VideoObject was previously a free-text string; in late 2025 schema.org accepted a proposal to allow it to point to a Transcript typed object with structured timestamps. This is what AI assistants read now to answer "what does this video say at minute 3?"
The learningResourceType property is increasingly used by educational AI assistants to filter content. If your video is a tutorial, mark it as "Tutorial". If it is a demonstration, "Demonstration". The vocabulary is constrained; check schema.org for the current list.
The accessibilityFeature array is read by accessibility-focused search tools. Values like "captions", "audioDescription", "transcript" tell crawlers what is available. Set them.
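One way to keep accessibilityFeature honest is to derive it from the tracks you actually ship instead of typing it by hand. The sketch below maps <track> kind values to the three schema.org values named above; the mapping table and function name are my own, and other kinds (chapters, subtitles, metadata) are deliberately left unmapped.

```python
# Hypothetical mapping from <track kind="..."> to accessibilityFeature values.
KIND_TO_FEATURE = {
    "captions": "captions",
    "descriptions": "audioDescription",
}


def accessibility_features(track_kinds, has_transcript):
    """Return the accessibilityFeature list implied by what was shipped."""
    features = []
    for kind in track_kinds:
        feature = KIND_TO_FEATURE.get(kind)
        if feature and feature not in features:  # keep order, drop duplicates
            features.append(feature)
    if has_transcript:
        features.append("transcript")
    return features


print(accessibility_features(["chapters", "captions", "descriptions"], has_transcript=True))
```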
The YouTube data layer
A specific note for video that ends up on YouTube. YouTube reads the following on upload:
- The video file's udta metadata (title, description, author).
- Embedded chapter tracks (auto-populates the chapter sidebar).
- WebVTT caption files uploaded alongside.
- Any embedded subtitles inside the MP4.
If you upload an MP4 to YouTube with all of these set, YouTube auto-populates 80% of the upload form. This is a small workflow win for individual creators and a large one for teams uploading hundreds of videos per week.
The HyperFrames CLI ships a --youtube-ready flag that ensures the right atoms are written and that a .vtt sidecar is generated from any <track> elements in the source composition. It is the same data; the flag just ensures it ends up in the places YouTube expects to find it.
What we ship by default
For the curious: every MP4 the HyperFrames CLI produces in 2026 includes, by default:
- udta metadata from the composition's <title>, <meta name="author">, and <meta name="description">.
- Chapter cues from any <section> elements with data-chapter attributes.
- A meta atom with a JSON blob of the composition's content hash (used for snapshot diffing).
The CLI does not, by default, ship JSON-LD or WebVTT files — those are properties of the page that embeds the video, not of the video file itself. We provide a helper (hyperframes metadata extract) that emits a JSON-LD VideoObject from the MP4's metadata, which you can paste into your page or generate dynamically server-side.
If you want to integrate this into a Next.js or similar site, the Next.js integration covers how we wire metadata extraction into static page generation.
A short checklist
For anyone who wants a TL;DR:
- Ship JSON-LD VideoObject on every page that embeds video.
- Set the udta atoms on the MP4 itself (title, author, date, description).
- Add a WebVTT chapters track for any video longer than 60 seconds.
- Add WebVTT captions for any video with speech.
- Add a WebVTT descriptions track for any video where the visual carries information.
- Embed chapter cues inside the MP4 for redistribution durability.
- Update accessibilityFeature in your schema.org markup to reflect what you shipped.
Each of these is a small win. Together they are the difference between a video that is findable and one that is invisible.
That said: shipping metadata is the second-most-important thing you can do for video. The first is shipping the video. Get the video out the door, then come back and do the metadata, in that order. We covered the shipping side in "from DOM to MP4" and the agent-side distribution patterns in "why AI agents need deterministic rendering".
Now, go set some atoms.
Cite this post
BibTeX:
@misc{team2026semantic,
author = {HyperFrames Team},
title = {Semantic video metadata: making MP4s discoverable in 2026},
year = {2026},
url = {https://hyperframes.video/blog/semantic-video-metadata-2026},
note = {HyperFrames blog}
}
APA:
HyperFrames Team. (2026, May 17). Semantic video metadata: making MP4s discoverable in 2026. HyperFrames. https://hyperframes.video/blog/semantic-video-metadata-2026
Markdown:
[Semantic video metadata: making MP4s discoverable in 2026](https://hyperframes.video/blog/semantic-video-metadata-2026), HyperFrames Team, 2026
We build the deterministic HTML-to-video pipeline at HyperFrames. We write here when we have something concrete to say.