Semantic video metadata: making MP4s discoverable in 2026

VideoObject schema, WebVTT chapters, MP4 atom metadata, and the YouTube data layer — a 2026 guide to making your videos discoverable, accessible, and machine-readable.

HyperFrames Team

Engineering, HyperFrames

May 17, 2026·7 min read

A confession: at HyperFrames, we shipped video output for a year before we thought seriously about metadata. The MP4 came out, the bytes were correct, the visual fidelity was right, and we called it done. Then we started actually distributing video — to social platforms, to our own site, to customers who in turn distributed it — and discovered how much unstructured information lives in and around a video file.

In 2026 this matters more than it used to. Search engines, social platforms, AI assistants, and accessibility tools all read video metadata. They read different parts of it. They read it from different places. Getting any of this right requires understanding the whole stack — from schema.org markup on the embedding page, through WebVTT tracks for captions and chapters, down into the binary metadata atoms of the MP4 itself.

This post is the field guide I wish I had read 18 months ago.

The four layers of video metadata

Before we go deep, here is the mental model. Every video on the modern web has metadata at four layers, and each is read by a different set of consumers.

1. Page-level structured data (JSON-LD, VideoObject). Read by Google, Bing, social previews, AI search assistants. Lives on the HTML page that embeds the video.
2. Sidecar text tracks (WebVTT for captions, chapters, descriptions). Read by browsers (<track> element), screen readers, video players, YouTube, and AI transcription/translation tools.
3. Container-level metadata (MP4 atoms: udta, ©nam, meta). Read by file managers, media players, content management systems. Survives re-uploading and re-embedding more reliably than the page-level data.
4. Embedded captions and burned-in text. Read by humans, and by OCR-based AI pipelines. The least structured, sometimes the most durable.

Most teams get layer 1 right (it is well-documented) and ignore the rest. That is a missed opportunity. The interesting use cases of 2026 — AI assistants summarizing your content, agentic systems indexing your video library, social platforms generating previews — read from layers 2 and 3 as much as from layer 1.

Layer 1: VideoObject schema

The baseline. Every page that embeds video should include a VideoObject JSON-LD block in the <head>. Google's documentation has the canonical reference; here is the shape we ship:

html

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "From DOM to MP4: an annotated render",
  "description": "A frame-by-frame walkthrough of the HyperFrames render pipeline.",
  "thumbnailUrl": "https://hyperframes.dev/blog/dom-to-mp4-thumb.jpg",
  "uploadDate": "2026-05-05T13:00:00Z",
  "duration": "PT4M32S",
  "contentUrl": "https://hyperframes.dev/blog/dom-to-mp4.mp4",
  "embedUrl": "https://hyperframes.dev/blog/from-dom-to-mp4",
  "publisher": {
    "@type": "Organization",
    "name": "HyperFrames",
    "logo": {
      "@type": "ImageObject",
      "url": "https://hyperframes.dev/brand/logo-square.png"
    }
  },
  "transcript": "The transcript text...",
  "hasPart": [
    {
      "@type": "Clip",
      "name": "Stage 1: Composition resolution",
      "startOffset": 0,
      "endOffset": 47
    },
    {
      "@type": "Clip",
      "name": "Stage 2: Chromium boot",
      "startOffset": 47,
      "endOffset": 124
    }
  ]
}
</script>

A few things to flag.

The duration field uses ISO 8601 duration format: PT4M32S means 4 minutes, 32 seconds. Search engines and structured-data validators are picky about this; do not invent your own format.
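If the duration comes out of your pipeline as a number of seconds, a small helper can produce the ISO 8601 string. This is a minimal sketch; the function name is ours, not part of any library, and it assumes whole seconds.

```python
def iso8601_duration(total_seconds: int) -> str:
    """Format a non-negative whole number of seconds as an ISO 8601 duration."""
    if total_seconds < 0:
        raise ValueError("duration must be non-negative")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = ["PT"]
    if hours:
        parts.append(f"{hours}H")
    if minutes:
        parts.append(f"{minutes}M")
    # Emit seconds when non-zero, or when nothing else was emitted (0 -> "PT0S").
    if seconds or len(parts) == 1:
        parts.append(f"{seconds}S")
    return "".join(parts)


print(iso8601_duration(272))  # PT4M32S
```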

The hasPart array with Clip types is how you signal chapters. Google has used these to generate the chapter-list previews in search results for several years; in 2026 they are also read by major AI assistants when they answer questions about your video.
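One way to keep hasPart in sync with the video is to generate it from a single chapter list. The sketch below assumes chapters are given as (title, start in seconds) pairs sorted by start time, and that each chapter ends where the next one begins; the function name and input shape are illustrative.

```python
import json


def clips_from_chapters(chapters, total_seconds):
    """Build schema.org Clip objects from (title, start_seconds) pairs."""
    clips = []
    for index, (title, start) in enumerate(chapters):
        is_last = index == len(chapters) - 1
        # A chapter ends where the next one starts; the last ends with the video.
        end = total_seconds if is_last else chapters[index + 1][1]
        clips.append(
            {"@type": "Clip", "name": title, "startOffset": start, "endOffset": end}
        )
    return clips


chapters = [
    ("Stage 1: Composition resolution", 0),
    ("Stage 2: Chromium boot", 47),
    ("Stage 3: Page load and ready-gate", 124),
    ("Stage 4: The seek loop", 198),
]
print(json.dumps(clips_from_chapters(chapters, 272), indent=2))
```

Individual search engines may ask for more properties on each Clip than this example carries, so check their documentation before shipping.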

The transcript field is read by AI search. In 2026 this is increasingly important — if a user asks an AI assistant "what does HyperFrames say about rendering pipelines?", the assistant is more likely to find your video if the transcript is in the page-level metadata than if it is buried in a sidecar track.

Layer 2: WebVTT tracks

A WebVTT (.vtt) file is plain text. It looks like this:

vtt

WEBVTT
Kind: chapters

00:00:00.000 --> 00:00:47.000
Stage 1: Composition resolution

00:00:47.000 --> 00:02:04.000
Stage 2: Chromium boot

00:02:04.000 --> 00:03:18.000
Stage 3: Page load and ready-gate

00:03:18.000 --> 00:04:32.000
Stage 4: The seek loop

You attach this to a <video> element via a <track> element with kind="chapters". The browser exposes the cues to scripts through the TextTrack API; players that support chapters render them as a navigable list, though native browser controls vary in how much they surface; screen readers can announce them; YouTube reads them if you upload the WebVTT alongside the video.
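A chapters file like the one above is easy to generate. This is a minimal sketch under the same assumptions as the example: chapters are (title, start in seconds) pairs, and each cue ends where the next begins. The function names are ours. It leaves out any header line after WEBVTT, since the track kind is declared on the <track> element.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def chapters_to_vtt(chapters, total_seconds):
    """Render (title, start_seconds) pairs as the text of a WebVTT chapters file."""
    lines = ["WEBVTT", ""]
    for index, (title, start) in enumerate(chapters):
        has_next = index + 1 < len(chapters)
        end = chapters[index + 1][1] if has_next else total_seconds
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(title)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


print(chapters_to_vtt([("Stage 1: Composition resolution", 0),
                       ("Stage 2: Chromium boot", 47)], 124))
```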

The other kind values you should care about:

subtitles — translation of dialogue into another language. Read by humans.
captions — verbatim dialogue plus relevant non-speech audio ("[door slams]"). Read by humans and search engines.
descriptions — audio descriptions for visually-impaired users. Read by screen readers.
metadata — application-specific data. Read by your own JavaScript.

The single most underused of these is descriptions. For any video where the visual carries information not present in the audio (which is most editorial video, charts, demos), audio descriptions are an accessibility win, an SEO win, and a discoverability win.

Layer 3: MP4 atoms

This is the layer most people skip, and I think they are wrong to. The MP4 container has a structured metadata section, the udta (user data) atom, that travels with the file through copies, stream-copy remuxes, and re-embedding, all of which lose the page-level data. A platform that transcodes on upload can still drop it, so check what comes out the other side.

The schema is iTunes-derived (because Apple defined a lot of this in the early 2000s). The fields are four-character codes, some prefixed with ©:

©nam — title
©ART — author/artist
©day — date
©cmt — comment / description
©too — encoding tool
cprt — copyright (QuickTime .mov files use ©cpy instead)
desc — description (ldes carries the long description)
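To make "four-character codes" concrete: every MP4 atom (box) starts with a 4-byte big-endian size followed by a 4-byte type. The sketch below walks that structure over a tiny synthetic buffer rather than a real file. The helper names are ours, and a real parser needs per-type handling that is omitted here (in ISO files, for instance, the meta box carries extra header bytes before its children).

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:  # a 64-bit size follows the type field
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # the box runs to the end of its container
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError(f"corrupt box at offset {offset}")
        # latin-1 keeps the 0xA9 byte readable as the (c) sign in codes like ©nam.
        yield box_type.decode("latin-1"), offset + header, offset + size
        offset += size


def make_box(box_type, payload):
    """Build a box with a 32-bit size header (for the synthetic sample only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic stand-in for ftyp + moov > udta; real files carry many more boxes.
sample = make_box(b"ftyp", b"isom") + make_box(b"moov", make_box(b"udta", b"\x00" * 4))

for name, start, stop in iter_boxes(sample):
    children = [child for child, _, _ in iter_boxes(sample, start, stop)] if name == "moov" else []
    print(name, stop - start, children)
```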

You set these with ffmpeg via -metadata:

bash

# In MP4 output, ffmpeg writes ©ART from the "artist" key; an "author" key is not written.
ffmpeg -i input.mp4 \
  -metadata title="From DOM to MP4: an annotated render" \
  -metadata artist="Kira Tanaka" \
  -metadata date="2026-05-05" \
  -metadata comment="HyperFrames blog: walkthrough of the render pipeline" \
  -metadata copyright="HyperFrames, CC BY 4.0" \
  -codec copy output.mp4
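When tagging many files, it helps to build that command from data instead of by hand. A minimal sketch, assuming ffmpeg is on the PATH; the function name is ours, and the example only prints the command rather than running it.

```python
import shlex


def ffmpeg_metadata_args(source, destination, tags):
    """Build an ffmpeg argument list that rewrites container tags without re-encoding."""
    args = ["ffmpeg", "-i", source]
    for key, value in tags.items():
        args += ["-metadata", f"{key}={value}"]
    # Stream copy: only the container is rewritten, audio and video are untouched.
    args += ["-codec", "copy", destination]
    return args


args = ffmpeg_metadata_args(
    "input.mp4",
    "output.mp4",
    {"title": "From DOM to MP4: an annotated render", "date": "2026-05-05"},
)
print(shlex.join(args))
# To execute it: subprocess.run(args, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with titles that contain quotes or colons.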

The HyperFrames CLI sets these for you when you provide a <title> and a <meta name="author"> in your composition's HTML head. We treat them as a contract: the metadata in the MP4 should match the metadata in the source.

The interesting bit for 2026: AI agents and search crawlers increasingly read MP4 metadata directly. A file that is downloaded, mirrored, and re-hosted without being transcoded still carries the udta metadata you set when you encoded it. Page-level JSON-LD does not survive this trip. Atom metadata does, as long as no step in the chain re-encodes the file and drops it; large platforms often do, so verify with the ones you care about.

Chapter cues inside the MP4

A less-known trick: MP4 supports an embedded text track that the player reads as chapters. This is different from a sidecar WebVTT file: the chapter info lives inside the MP4 itself, so it travels with the file wherever the file goes.

The format is a small text track that the video track references as its chapter track (kind="chapters" is the HTML <track> equivalent, not something stored in the MP4). ffmpeg can produce this from a chapters file:

text

;FFMETADATA1
title=From DOM to MP4

[CHAPTER]
TIMEBASE=1/1000
START=0
END=47000
title=Composition resolution

[CHAPTER]
TIMEBASE=1/1000
START=47000
END=124000
title=Chromium boot

Save as chapters.txt, then:

bash

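# Global tags and chapters both come from input 1 (chapters.txt); tags already on input.mp4 are not copied.
ffmpeg -i input.mp4 -f ffmetadata -i chapters.txt -map_metadata 1 -map_chapters 1 -codec copy output.mp4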

Apple's QuickTime, VLC, and mpv will read this; browser support for embedded chapters is much less consistent, so keep the WebVTT track for web playback. YouTube will read it on upload and auto-generate the chapter sidebar. This is the single most underused video distribution trick I know.
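The chapters file can be generated from the same chapter list that feeds the JSON-LD and the WebVTT track. A minimal sketch, assuming (title, start in seconds) pairs and a 1/1000 timebase as in the example above; the function names are ours. ffmpeg's metadata format requires a backslash before =, ;, #, \ and newlines, which the helper handles.

```python
def escape_ffmetadata(value):
    """Backslash-escape the characters ffmpeg's metadata format treats as special."""
    # Escape the backslash first so the backslashes added below are not doubled.
    for char in ("\\", "=", ";", "#", "\n"):
        value = value.replace(char, "\\" + char)
    return value


def chapters_to_ffmetadata(title, chapters, total_seconds):
    """Render an FFMETADATA1 file with one [CHAPTER] section per chapter."""
    lines = [";FFMETADATA1", f"title={escape_ffmetadata(title)}"]
    for index, (name, start) in enumerate(chapters):
        has_next = index + 1 < len(chapters)
        end = chapters[index + 1][1] if has_next else total_seconds
        lines += [
            "",
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END are in milliseconds
            f"START={round(start * 1000)}",
            f"END={round(end * 1000)}",
            f"title={escape_ffmetadata(name)}",
        ]
    return "\n".join(lines) + "\n"


print(chapters_to_ffmetadata(
    "From DOM to MP4",
    [("Composition resolution", 0), ("Chromium boot", 47)],
    124,
))
```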

The schema.org evolution

A few things changed in schema.org's video vocabulary recently that are worth knowing.

The transcript property on VideoObject was previously a free-text string; in late 2025 schema.org accepted a proposal to allow it to point to a Transcript typed object with structured timestamps. This is what AI assistants read now to answer "what does this video say at minute 3?"

The learningResourceType property is increasingly used by educational AI assistants to filter content. If your video is a tutorial, mark it as "Tutorial". If it is a demonstration, "Demonstration". The vocabulary is constrained; check schema.org for the current list.

The accessibilityFeature array is read by accessibility-focused search tools. Values like "captions", "audioDescription", "transcript" tell crawlers what is available. Set them.
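To keep accessibilityFeature honest, you can derive it from the tracks you actually ship instead of typing it by hand. The mapping below is our assumption, built from the values named above: it treats a descriptions track as evidence for "audioDescription", which you may want to tighten if your descriptions are text-only.

```python
# Assumed mapping from WebVTT track kinds to accessibilityFeature values.
FEATURE_BY_TRACK_KIND = {
    "captions": "captions",
    "descriptions": "audioDescription",
}


def accessibility_features(track_kinds, has_transcript=False):
    """Return a sorted, de-duplicated accessibilityFeature list."""
    features = {
        FEATURE_BY_TRACK_KIND[kind]
        for kind in track_kinds
        if kind in FEATURE_BY_TRACK_KIND  # chapters, metadata etc. add nothing
    }
    if has_transcript:
        features.add("transcript")
    return sorted(features)


print(accessibility_features(["chapters", "captions", "descriptions"], has_transcript=True))
```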

The YouTube data layer

A specific note for video that ends up on YouTube. YouTube reads the following on upload:

The video file's udta metadata (title, description, author).
Embedded chapter tracks (auto-populates the chapter sidebar).
WebVTT caption files uploaded alongside.
Any embedded subtitles inside the MP4.

If you upload an MP4 to YouTube with all of these set, YouTube auto-populates 80% of the upload form. This is a small workflow win for individual creators and a large one for teams uploading hundreds of videos per week.

The HyperFrames CLI ships a --youtube-ready flag that ensures the right atoms are written and that a .vtt sidecar is generated from any <track> elements in the source composition. It is the same data; the flag just ensures it ends up in the places YouTube expects to find it.

What we ship by default

For the curious: every MP4 the HyperFrames CLI produces in 2026 includes, by default:

udta metadata from the composition's <title>, <meta name="author">, and <meta name="description">.
Chapter cues from any <section> elements with data-chapter attributes.
A meta atom with a JSON blob of the composition's content hash (used for snapshot diffing).
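The first item in that list boils down to reading three fields from the composition's head. The sketch below shows the idea with Python's standard html.parser; it is an illustration, not the CLI's implementation, and the class name is ours. Chapter extraction is left out because it depends on how the composition assigns timings to its sections.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <title>, <meta name="author"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") in ("author", "description"):
            self.metadata[attributes["name"]] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so append instead of assigning.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


sample_html = (
    "<html><head><title>From DOM to MP4</title>"
    '<meta name="author" content="Kira Tanaka">'
    '<meta name="description" content="Render walkthrough">'
    "</head><body></body></html>"
)
parser = HeadMetadataParser()
parser.feed(sample_html)
parser.close()
print(parser.metadata)
```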

The CLI does not, by default, ship JSON-LD or WebVTT files — those are properties of the page that embeds the video, not of the video file itself. We provide a helper (hyperframes metadata extract) that emits a JSON-LD VideoObject from the MP4's metadata, which you can paste into your page or generate dynamically server-side.
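If you are not using the helper, the same mapping can be sketched on top of ffprobe's JSON output (ffprobe -v quiet -print_format json -show_format -show_chapters file.mp4). The code below is not the HyperFrames helper; the function names, the hand-written sample dictionary, and the choice to map the comment tag to description and the date tag to uploadDate are all assumptions for illustration.

```python
import json


def iso_duration(seconds):
    """Format seconds as an ISO 8601 duration such as PT4M32S."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return "PT" + (f"{hours}H" if hours else "") + (f"{minutes}M" if minutes else "") + f"{secs}S"


def video_object_from_probe(probe, content_url):
    """Map parsed ffprobe JSON (format + chapters) to a VideoObject dict."""
    container = probe.get("format", {})
    tags = container.get("tags", {})
    video = {"@context": "https://schema.org", "@type": "VideoObject", "contentUrl": content_url}
    for tag, prop in (("title", "name"), ("comment", "description"), ("date", "uploadDate")):
        if tag in tags:
            video[prop] = tags[tag]
    if "duration" in container:
        video["duration"] = iso_duration(float(container["duration"]))
    clips = [
        {
            "@type": "Clip",
            "name": chapter.get("tags", {}).get("title", ""),
            "startOffset": round(float(chapter["start_time"])),
            "endOffset": round(float(chapter["end_time"])),
        }
        for chapter in probe.get("chapters", [])
    ]
    if clips:
        video["hasPart"] = clips
    return video


# Hand-written sample in the shape ffprobe emits; times arrive as strings.
sample_probe = {
    "format": {
        "duration": "272.000000",
        "tags": {"title": "From DOM to MP4", "comment": "Render pipeline walkthrough", "date": "2026-05-05"},
    },
    "chapters": [
        {"start_time": "0.000000", "end_time": "47.000000", "tags": {"title": "Composition resolution"}},
        {"start_time": "47.000000", "end_time": "124.000000", "tags": {"title": "Chromium boot"}},
    ],
}
video = video_object_from_probe(sample_probe, "https://hyperframes.dev/blog/dom-to-mp4.mp4")
print(json.dumps(video, indent=2))
```

Properties that only the page knows, such as thumbnailUrl, still have to be added by the embedding site.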

If you want to integrate this into a Next.js or similar site, the Next.js integration covers how we wire metadata extraction into static page generation.

A short checklist

For anyone who wants a TL;DR:

Ship JSON-LD VideoObject on every page that embeds video.
Set the udta atoms on the MP4 itself (title, author, date, description).
Add a WebVTT chapters track for any video longer than 60 seconds.
Add WebVTT captions for any video with speech.
Add a WebVTT descriptions track for any video where the visual carries information.
Embed chapter cues inside the MP4 for redistribution durability.
Update accessibilityFeature in your schema.org markup to reflect what you shipped.

Each of these is a small win. Together they are the difference between a video that is findable and one that is invisible.

That said: shipping metadata is the second-most-important thing you can do for video. The first is shipping the video. Get the video out the door, then come back and do the metadata, in that order. We covered the shipping side in "From DOM to MP4" and the agent-side distribution patterns in "Why AI agents need deterministic rendering".

Now, go set some atoms.

