Podcast audiograms with animated waveforms
A podcast audiogram generator built in HTML: a 32-bar waveform animates with per-bar phase offsets, captions appear in sync, and the episode title sits on top.
A podcast audiogram generator is what turns an audio-only medium into a scrollable social asset. The pattern is settled: an episode title on top, a square cover or color block in the middle, captions in sync underneath, and a 32-bar waveform somewhere prominent. The waveform is what makes the eye stop — it implies audio, and the implication is enough to make a static post into a video moment.
The polished version below renders the waveform from a single deterministic function. No audio file, no FFT, no library: each bar's height is a sum of phase-offset sine waves riding on a slow per-bar envelope. The result reads as "audio playing" without ever touching audio.
Why synthetic beats sampled
The first instinct when building an audiogram is to extract the audio's actual waveform — FFT the file, get amplitude buckets per frame, drive the bars. This works and is wrong for the use case.
The viewer cannot hear the audio: most social feeds autoplay muted. The visual job of the waveform is to imply that audio is happening, not to represent specific audio. A synthetic waveform with believable motion does that job and has three operational advantages:
- No audio pipeline. You don't need ffmpeg, an audio decoder, or the original audio file at render time.
- Deterministic by construction. A function of t is byte-reproducible; a sampled FFT depends on decoder version, sample rate, and bucketing.
- You control the dynamics. Real podcast audio has long quiet stretches and short loud bursts, which produce visually unreadable bar patterns. Synthetic dynamics stay in the readable range.
The synthetic waveform function
Each bar's height comes from three components: a phase based on its index, a slow envelope that breathes across the bar array, and two faster sine waves that wobble the height.
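bars.forEach((bar, i) => {
  // t is the current frame time in seconds; bars holds the bar elements.
  const phase = i * 0.42;                                  // per-bar phase offset
  const env = 0.55 + 0.45 * Math.sin(i * 0.31 + t * 0.7);  // slow envelope, 0.10 to 1.00
  const wob = Math.sin(t * 6.0 + phase) * 0.5              // two faster sines
            + Math.sin(t * 11.0 + phase * 2) * 0.3;
  const h = 12 + Math.max(0, env + wob) * 42;              // 12px floor, about 88px peak
  bar.style.height = `${h}px`;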
});
Three things make this read as "audio":
- The phase offset (i * 0.42): adjacent bars don't move in lockstep. Without phase offsets, all 32 bars move identically and the eye reads it as one big block bouncing, not as a waveform.
- The envelope: the slow sin(i * 0.31 + t * 0.7) term creates a "shape" across the bar array that drifts over time. Loud sections cluster, then move.
- The two-frequency wobble: the two terms run at 6 and 11 rad/s (about 0.95 Hz and 1.75 Hz), and their sum only repeats every 2π ≈ 6.28 seconds, which is longer than a 6-second clip, so the bars never visibly loop. A single sine looks too clean; two sines at different rates read as noise.
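To make the determinism claim easy to check, the same formula can be pulled out of the DOM loop into a pure function. This is a minimal sketch, not the template's actual code: the names barHeights and frameTime are hypothetical, and it assumes the frame time is derived from a frame index and a fixed frame rate.

```javascript
// Pure version of the bar-height formula: same constants as the loop above,
// but returning numbers instead of writing styles to the DOM.
function barHeights(t, count = 32) {
  const heights = [];
  for (let i = 0; i < count; i++) {
    const phase = i * 0.42;
    const env = 0.55 + 0.45 * Math.sin(i * 0.31 + t * 0.7);
    const wob = Math.sin(t * 6.0 + phase) * 0.5
              + Math.sin(t * 11.0 + phase * 2) * 0.3;
    heights.push(12 + Math.max(0, env + wob) * 42);
  }
  return heights;
}

// Assumption: time comes from the frame index, never from a wall clock,
// so frame N gets the same t on every render.
function frameTime(frameIndex, fps = 30) {
  return frameIndex / fps;
}

// Heights for frame 45 at 30 fps (t = 1.5 s).
const frame45 = barHeights(frameTime(45));
```

Because the output depends only on t and the bar index, calling barHeights twice with the same t gives the same array, and every height stays between 12px and 87.6px (12 + 1.8 * 42).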
Caption sync
Captions don't need lip-sync precision because the audio is muted. They need cadence sync — appear in the rough region where the speaker would be saying them, hold for a readable beat, then swap.
const captions = [
  { t: 0.2, text: "The web platform already knows how to render." },
  { t: 1.7, text: "What it can't do is render the same way twice." },
  { t: 3.3, text: "That's the problem deterministic rendering solves." },
  { t: 4.8, text: "Same HTML, same MP4, every time." }
];
// Keep the last caption whose cue time has passed.
// Before the first cue (t < 0.2) the first caption is shown.
let current = captions[0].text;
for (const c of captions) if (t >= c.t) current = c.text;
// textContent rather than innerHTML, so "<" or "&" in a transcript renders literally.
const span = document.createElement("span");
span.textContent = current;
cap.replaceChildren(span);
This is a step function: the caption is whichever entry has the largest t value not exceeding the current frame time. No fade between captions; a hard cut reads as "next line of dialog" and matches how people parse captioned video.
Aim for 1.5 to 2.0 seconds per caption and 6 to 10 words each. Shorter captions get lost; longer ones can't be read in the time available. If your transcript naturally has long sentences, break them at natural phrase boundaries ("the platform / already knows / how to render").
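The pacing guideline can be checked mechanically before rendering. The sketch below is one possible lint, not part of the template: lintCaptions and its limits object are hypothetical names, and it assumes each caption holds until the next cue or, for the last one, until the end of the clip.

```javascript
// Check caption cues against the pacing guideline above.
// Defaults come from the text; pass limits to tune them.
function lintCaptions(captions, clipDuration, limits = {}) {
  const { minHold = 1.5, maxHold = 2.0, minWords = 6, maxWords = 10 } = limits;
  const EPS = 1e-9; // tolerate float noise in subtractions such as 1.7 - 0.2
  const issues = [];
  captions.forEach((c, i) => {
    // A caption holds until the next cue, or until the clip ends.
    const end = i + 1 < captions.length ? captions[i + 1].t : clipDuration;
    const hold = end - c.t;
    const words = c.text.trim().split(/\s+/).filter(Boolean).length;
    if (hold < minHold - EPS) issues.push({ index: i, problem: "hold too short", hold });
    if (hold > maxHold + EPS) issues.push({ index: i, problem: "hold too long", hold });
    if (words < minWords) issues.push({ index: i, problem: "too few words", words });
    if (words > maxWords) issues.push({ index: i, problem: "too many words", words });
  });
  return issues;
}
```

Run against the four example captions with a 6-second clip, this reports a single issue: the last caption starts at 4.8 and so holds for only about 1.2 seconds. Whether that matters is a judgment call, since a closing line can often get away with a shorter hold.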
Data shape for an episode
For a podcast publishing weekly, the pipeline is: produce the episode, ASR-transcribe the chosen 30-second clip, group ASR output into ~4-6 captions, run the template once per clip, post the MP4. The waveform is fully synthetic so it's not part of the data — only the episode metadata and captions are.
{
  "show": "The Render Loop",
  "episode_number": 47,
  "episode_title": "Why deterministic rendering matters",
  "cover_gradient": ["#ff3b1f", "#2b66ff"],
  "duration_seconds": 6,
  "captions": [
    { "t": 0.2, "text": "The web platform already knows how to render." },
    { "t": 1.7, "text": "What it can't do is render the same way twice." },
    { "t": 3.3, "text": "That's the problem deterministic rendering solves." },
    { "t": 4.8, "text": "Same HTML, same MP4, every time." }
  ]
}
This matches the programmatic video from data pipeline. Loop over your episode list, write one HTML file per clip, render each. For a back-catalog of 100 episodes, a single CI job produces 100 MP4s in parallel; see batch personalized videos from CSV for the orchestration pattern.
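The "one HTML file per clip" step might look like the sketch below. It is illustrative only: buildClipHtml, clipFileName, the markup, and the EPISODE global are hypothetical stand-ins for whatever your real template uses. The part worth keeping is the escaping, since titles and captions come from transcripts and can contain characters that break HTML.

```javascript
// Escape text that is placed into HTML element content.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Only plain hex colors are allowed into the stylesheet.
function hexColor(c) {
  if (!/^#[0-9a-fA-F]{3,8}$/.test(c)) throw new Error(`unexpected color: ${c}`);
  return c;
}

// Build one self-contained HTML document for an episode object shaped like
// the JSON above. The markup here is a placeholder, not the real template.
function buildClipHtml(episode) {
  // "<" is escaped so caption text can never close the script tag early.
  const data = JSON.stringify(episode).replace(/</g, "\\u003c");
  const [from, to] = episode.cover_gradient.map(hexColor);
  return `<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>${escapeHtml(episode.show)} #${escapeHtml(episode.episode_number)}</title>
<style>.cover { background: linear-gradient(135deg, ${from}, ${to}); }</style>
</head>
<body>
<h1>${escapeHtml(episode.episode_title)}</h1>
<div class="cover"></div>
<div id="cap"></div>
<script>const EPISODE = ${data};</script>
</body>
</html>`;
}

// One file name per clip, for example "the-render-loop-047.html".
function clipFileName(episode) {
  const slug = episode.show.toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "");
  return `${slug}-${String(episode.episode_number).padStart(3, "0")}.html`;
}
```

In a batch job, each episode object would be passed through buildClipHtml and written to the path from clipFileName before rendering.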
Render to MP4
hyperframes render audiogram.html --out clip.mp4 --duration 6 --fps 30
For the Twitter/X and LinkedIn feeds, square 1080×1080 is the default. For Reels and TikTok, switch to 1080×1920 and bump the waveform height so it carries the vertical canvas. For Spotify Canvas (the 8-second looping artwork some podcasts use), set --duration 8. The quickstart covers installation and the full CLI surface, and deterministic rendering explains why every render of this template is byte-identical.
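When rendering from a script rather than by hand, it helps to build the argument list in one place. The helper below is a hypothetical sketch that uses only the flags shown in the command above (--out, --duration, --fps); it does not cover resolution, so check the quickstart for how canvas size is set.

```javascript
// Build the argument list for one render, using only the flags shown above.
function renderArgs(htmlFile, outFile, { duration = 6, fps = 30 } = {}) {
  return [
    "render", htmlFile,
    "--out", outFile,
    "--duration", String(duration),
    "--fps", String(fps),
  ];
}

// Spotify Canvas variant from the text: same template, 8-second duration.
const canvasArgs = renderArgs("audiogram.html", "canvas.mp4", { duration: 8 });
// In Node this array could be passed to
// child_process.spawn("hyperframes", canvasArgs).
```

Passing arguments as an array rather than a concatenated string avoids shell-quoting problems when file names contain spaces.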
FAQ
Can I drive the waveform from the actual audio?
Yes, but you have to do it ahead of time. Extract amplitude buckets server-side (ffmpeg + a 32-bin FFT works), store them as a per-frame array in the JSON, and have the seek listener look up buckets[Math.floor(t * fps)] to get that frame's values, one per bar. The visual difference from synthetic is subtle, and the operational cost is high; most teams stick with synthetic for a year before concluding the difference doesn't matter.
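A lookup along those lines could be sketched as follows. The data shape is an assumption: buckets[frame][bar] holds a precomputed amplitude between 0 and 1, and heightsFromBuckets is a hypothetical name. The pixel mapping reuses the 12px floor and 42px scale from the synthetic formula.

```javascript
// buckets[frame][bar] is a precomputed amplitude in 0..1 (assumed shape).
function heightsFromBuckets(buckets, t, fps, minPx = 12, rangePx = 42) {
  if (buckets.length === 0) return [];
  // The small epsilon guards against t * fps landing just under an integer
  // (float rounding), which would otherwise select the previous frame.
  const raw = Math.floor(t * fps + 1e-6);
  // Clamp so a t at or past the end of the clip never reads out of bounds.
  const frame = Math.min(buckets.length - 1, Math.max(0, raw));
  return buckets[frame].map(
    (amp) => minPx + Math.min(1, Math.max(0, amp)) * rangePx
  );
}
```

Because the buckets are stored data rather than decoded at render time, this path stays deterministic as long as the JSON itself does not change.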
How long should the clip be?
Six to eight seconds for a feed-native autoplay, 30-60 seconds for a dedicated "watch this clip" share. Below 6 seconds the captions don't have room to breathe; above 60 seconds the format becomes a transcript reader and the waveform is just decoration.
Should I include audio in the MP4?
If you have the audio for the captioned clip, yes — hyperframes render accepts an --audio flag that overlays a track on the rendered video. The waveform is still synthetic and visual-only; the audio plays underneath for the percentage of viewers who tap to unmute.
What's the right bar count?
24-32 bars for a square format, 36-48 for a wide format. Fewer than 20 bars look chunky and read as "loading dots" rather than waveform. More than 50 bars become a fuzz at typical render resolutions — the individual bars stop being visible.
Why a gradient cover instead of the show artwork?
The example uses a gradient because it's a generic template. For your actual show, swap in the cover art as a base64 data URI or a local file referenced from the HTML. Avoid hot-linking to a remote URL — the render pipeline is deterministic only if all assets resolve identically every time.
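Inlining the cover as a data URI takes a few lines in Node. This is a minimal sketch: toDataUri is a hypothetical helper name, and the file name in the usage comment is illustrative.

```javascript
// Encode raw image bytes as a data URI. Runs in Node, where Buffer is a global.
function toDataUri(bytes, mimeType) {
  return `data:${mimeType};base64,${Buffer.from(bytes).toString("base64")}`;
}

// Typical use (file name is illustrative):
//   const fs = require("node:fs");
//   const uri = toDataUri(fs.readFileSync("cover.png"), "image/png");
//   // then reference uri as the image src in the generated HTML
```

Base64 inflates the payload by roughly a third, so this suits cover art better than large background video or audio.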
Related
- Animated quote cards for Twitter — same caption-on-card grammar for text-only quotes
- Burn subtitles into MP4 — the technique for full-show captions, not just clips
- Animated KPI cards — count-up patterns for "listen count" overlays
Cite this post
BibTeX:
@misc{team2026podcast,
  author = {HyperFrames Team},
  title = {Podcast audiograms with animated waveforms},
  year = {2026},
  url = {https://hyperframes.video/blog/animated-podcast-audiogram},
  note = {HyperFrames blog}
}
APA:
HyperFrames Team. (2026, May 21). Podcast audiograms with animated waveforms. HyperFrames. https://hyperframes.video/blog/animated-podcast-audiogram
Markdown:
[Podcast audiograms with animated waveforms](https://hyperframes.video/blog/animated-podcast-audiogram) - HyperFrames Team, 2026
We build the deterministic HTML-to-video pipeline at HyperFrames. We write here when we have something concrete to say.