AI Engineering

Video and Audio Generation: Sora, Runway, Suno, ElevenLabs - Multimedia AI

Lesson Objectives

Evaluate current capabilities and limitations of video/audio generation in 2026
Integrate video generation API (Runway) with async jobs via BullMQ
Use ElevenLabs SFX and Suno for generating sounds and music
Design a media pipeline: upload → generate → transcode (ffmpeg) → CDN
Implement cost management: budget caps, quotas, tiered quality

Sora (February 2024) generates a 60-second photorealistic video in roughly 10 minutes. Runway Gen-3 generates a 10-second clip in about 30 seconds; Kling does it in about 15. Over the past year, the price dropped 20x. Video generation is following the same path as image generation in 2022: first "impressive demos, limited API", then a production-ready tool on every backend. A commercial that cost USD 50,000 and a week of studio work in 2023 can be generated for about USD 5 in 3 minutes in 2026.

Runway is used in Hollywood for pre-visualization - directors generate scenes before shooting, saving hundreds of thousands on storyboards
Suno generates 12 million tracks per month via diffusion models on audio spectrograms - more than all record labels worldwide
TikTok is testing AI video generation for advertisers: upload a product photo → get a ready-made feed video
Canva Video + AI generates social media videos from text descriptions - backend assembles Runway + ElevenLabs TTS into one pipeline

From Demos to Production: How Generative Media Grew Up

**Runway Gen-2 (2023)** brought text-to-video out of research papers and into a usable tool, generating short clips from a prompt. **Pika (2023)** arrived the same year with a focus on quick, accessible video clips. **OpenAI announced Sora (February 2024)**, showing up to a minute of coherent, photorealistic video and signaling that temporal consistency at length was within reach. On the audio side, **ElevenLabs (founded 2022)** pushed realistic voice synthesis into wide use, and **Suno (2023)** did the same for full music tracks with vocals. In roughly two years the field moved from impressive demos to APIs a backend engineer could call in production.

Prerequisites

Image Generation API: DALL-E, Stable Diffusion, FLUX - Generating Images

Video Generation: Sora, Runway, Pika - Industry State in 2026

In February 2024, OpenAI published Sora: up to 60 seconds of photorealistic video from text. Runway Gen-3 generates a 10-second clip in about 30 seconds; Kling from Kuaishou does it in about 15. Over the past year, the price dropped 20x. Video generation is following the same trajectory as image generation in 2022: first "impressive demos, limited API", then a production-ready tool in every backend engineer's hands.

Under the hood, all of them run **latent diffusion** for video - the same diffusion models as Stable Diffusion, extended into the temporal dimension. The key architectural challenge is **temporal consistency**: every frame must be coherent with the previous one. Not just a beautiful image, but a coherent sequence. This is exactly why video generation is 10-50x more computationally expensive than image generation.

| Provider | Max Duration | Resolution | API Access | Price | Status (2026) |
| --- | --- | --- | --- | --- | --- |
| Sora (OpenAI) | up to 60 sec | 1080p | API (limited) | USD 0.15-0.60 per video | API in early access |
| Runway Gen-3 Alpha | up to 10 sec | 1080p | REST API | ~USD 0.05 per sec | Production API |
| Pika 2.0 | up to 5 sec | 1080p | REST API | ~USD 0.04 per sec | Production API |
| Kling (Kuaishou) | up to 10 sec | 1080p | REST API | ~USD 0.03 per sec | Production API |
| Luma Dream Machine | up to 5 sec | 1080p | REST API | ~USD 0.04 per sec | Production API |

**Video generation is a resource-intensive operation.** Generating a 10-second video takes from 30 seconds to several minutes. All APIs work asynchronously: send request → get task_id → poll or receive a webhook for the result. Holding a synchronous HTTP request open for that long is not practical.

Myth: video generation is just image generation × 30fps: take an image model and generate frames one by one.

Reality: temporal consistency is a separate architectural challenge that frame-by-frame image models do not solve.

Generating frames independently produces a flickering slideshow: the character changes appearance every 3 frames, the background jumps, lighting flickers. Video models (Runway, Sora, Kling) use latent diffusion across the entire frame sequence simultaneously, applying 3D attention along the temporal axis. That is a completely different architecture - and it explains why 10 seconds of video costs 10-50x more than generating 10 individual images.

Why do video generation APIs work asynchronously rather than returning results immediately?

Video API Integration: Async Jobs, Polling, Webhooks

Integrating with a video generation API is not "call it, get a response". The pattern is different: **submit task → poll status → download result**. Runway Gen-3 accepts a request, returns a task_id, and then the backend must ask "is it ready?" every 5 seconds. Or configure a webhook - then Runway calls back when done. Webhooks are cleaner but require a public endpoint.
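The submit → poll → download loop can be sketched roughly as follows. This is a minimal illustration, not Runway's SDK: the status names, the `TaskSnapshot` shape, and the injected `fetchStatus` function are all assumptions standing in for whatever the provider's API actually returns. The 5-second interval and 60-attempt limit are taken from the text.

```typescript
// Hypothetical task states; real provider APIs use their own names.
type TaskStatus = "PENDING" | "RUNNING" | "SUCCEEDED" | "FAILED";

interface TaskSnapshot {
  status: TaskStatus;
  outputUrl?: string; // assumed to be present once status is SUCCEEDED
}

type PollDecision =
  | { kind: "done"; outputUrl: string }
  | { kind: "failed"; reason: string }
  | { kind: "wait" };

// Pure decision step: what to do after the given (1-based) poll attempt.
function decide(snapshot: TaskSnapshot, attempt: number, maxAttempts: number): PollDecision {
  if (snapshot.status === "SUCCEEDED") {
    return snapshot.outputUrl
      ? { kind: "done", outputUrl: snapshot.outputUrl }
      : { kind: "failed", reason: "succeeded without an output URL" };
  }
  if (snapshot.status === "FAILED") {
    return { kind: "failed", reason: "provider reported failure" };
  }
  if (attempt >= maxAttempts) {
    return { kind: "failed", reason: "timed out" };
  }
  return { kind: "wait" };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the task finishes. `fetchStatus` is injected so the loop does not
// depend on any particular provider SDK or HTTP client.
async function pollUntilDone(
  fetchStatus: () => Promise<TaskSnapshot>,
  intervalMs = 5000,
  maxAttempts = 60,
): Promise<string> {
  for (let attempt = 1; ; attempt++) {
    const decision = decide(await fetchStatus(), attempt, maxAttempts);
    if (decision.kind === "done") return decision.outputUrl;
    if (decision.kind === "failed") throw new Error(decision.reason);
    await sleep(intervalMs);
  }
}
```

Keeping the decision logic in a pure function makes the timeout and failure paths easy to unit-test without waiting on real timers.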

Polling inside the request handler is an anti-pattern in production: 60 iterations × 5 seconds means a request, and the worker serving it, tied up for 5 minutes. BullMQ solves this cleanly: the controller returns a jobId in milliseconds, the worker handles generation in the background, and the client periodically checks status via a separate endpoint.

**BullMQ is a good fit** for video generation jobs: it supports retries with backoff, progress tracking, automatic removal of completed jobs after a set age, and concurrency control.
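To make the division of responsibilities concrete, here is a small in-memory sketch of the job lifecycle. It deliberately does not use BullMQ: `VideoJobStore` is a hypothetical stand-in for a Redis-backed queue, and `generate` is an assumed callback that wraps the provider call. It only illustrates the three roles: the controller enqueues, the worker processes, and the status endpoint reads.

```typescript
type JobState = "waiting" | "active" | "completed" | "failed";

interface VideoJob {
  id: string;
  prompt: string;
  state: JobState;
  progress: number; // 0-100
  resultUrl?: string;
  error?: string;
}

// Assumed shape of the provider call: resolves with the URL of the finished video.
type Generate = (prompt: string, onProgress: (pct: number) => void) => Promise<string>;

// In-memory stand-in for a real queue (BullMQ + Redis in production).
class VideoJobStore {
  private jobs = new Map<string, VideoJob>();
  private nextId = 1;

  // Called by the controller: registers the job and returns its id immediately.
  enqueue(prompt: string): string {
    const id = String(this.nextId++);
    this.jobs.set(id, { id, prompt, state: "waiting", progress: 0 });
    return id;
  }

  // Called by the status endpoint: returns a copy so callers cannot mutate the job.
  getStatus(id: string): VideoJob | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  // Called by the background worker, never by the HTTP handler.
  async process(id: string, generate: Generate): Promise<void> {
    const job = this.jobs.get(id);
    if (!job || job.state !== "waiting") return;
    job.state = "active";
    try {
      job.resultUrl = await generate(job.prompt, (pct) => {
        job.progress = pct;
      });
      job.progress = 100;
      job.state = "completed";
    } catch (err) {
      job.state = "failed";
      job.error = err instanceof Error ? err.message : String(err);
    }
  }
}
```

With a real queue, `enqueue` would correspond to adding a job, `process` to the worker's processor function, and `getStatus` to looking the job up by id. Retries and backoff would come from the queue's job options rather than hand-written code.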

What architectural pattern is optimal for video generation in production NestJS?

Audio Generation: Music (Suno, Udio) and Sound Effects

Suno generates 12 million tracks per month, more than all record labels in the world combined, at about 5 cents per track. This isn't a replacement for musicians - it's a new toolkit for products that need dynamic sound: games with procedural music, video editors with auto-soundtracks, learning platforms.

AI audio generation splits into three directions: **music** (Suno, Udio - full tracks with vocals via diffusion models on audio spectrograms), **sound effects** (ElevenLabs SFX, Stability Audio - synchronous APIs for UI sounds) and **ambient/background** (background music for videos, podcasts). An important architectural detail: music APIs work asynchronously like video, while sound effect APIs are usually synchronous.

| Provider | Content Type | API Access | Max Duration | Price |
| --- | --- | --- | --- | --- |
| Suno v4 | Music with vocals | REST API | up to 4 min | ~USD 0.05 per track |
| Udio | Music with vocals | REST API | up to 2 min | ~USD 0.05 per track |
| ElevenLabs SFX | Sound effects | REST API | up to 22 sec | ~USD 0.01 per effect |
| Stability Audio | Music, SFX | REST API | up to 47 sec | ~USD 0.02 per generation |
| Meta MusicGen | Instrumental music | Self-hosted | up to 30 sec | GPU cost |
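The sync/async split and the duration limits in the table can be turned into a small routing helper. The sketch below is illustrative only: the function name and return shape are invented for this example, the limits are copied from the table, and the preference order (cheapest provider that fits) is an assumption.

```typescript
type AudioKind = "music" | "sfx";

interface AudioPlan {
  provider: string;
  mode: "sync" | "async"; // sync: call inline; async: enqueue a background job
}

// Duration limits in seconds, copied from the table above.
const ELEVENLABS_SFX_MAX_SEC = 22;
const STABILITY_AUDIO_MAX_SEC = 47;
const SUNO_MAX_SEC = 240;

// Picks a provider and call mode for a requested clip.
function planAudio(kind: AudioKind, durationSec: number): AudioPlan {
  if (durationSec <= 0) {
    throw new RangeError("durationSec must be positive");
  }
  if (kind === "sfx") {
    if (durationSec <= ELEVENLABS_SFX_MAX_SEC) {
      return { provider: "ElevenLabs SFX", mode: "sync" };
    }
    if (durationSec <= STABILITY_AUDIO_MAX_SEC) {
      return { provider: "Stability Audio", mode: "sync" };
    }
    throw new RangeError("no SFX provider in the table covers this duration");
  }
  if (durationSec <= SUNO_MAX_SEC) {
    return { provider: "Suno v4", mode: "async" };
  }
  throw new RangeError("no music provider in the table covers this duration");
}
```

For example, `planAudio("sfx", 2)` selects ElevenLabs SFX in sync mode, while a 3-minute music request is routed to Suno through the async job path.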

**AI music licensing** is a gray area in 2026. Suno and Udio allow commercial use on paid plans, but the legal situation is evolving. For production, monitor ToS updates and consider royalty-free alternatives for critical use cases.

For generating a short notification sound (2 seconds) for a mobile app, which tool is optimal?

Media Pipeline: Upload → Generate → Transcode → Deliver

The raw file from Runway or Suno is not something to hand directly to users. Runway might return an .mp4 with a codec that won't play in Safari. Suno might return a WAV without loudness normalization. Video without the `faststart` flag won't play until fully downloaded. A production media pipeline handles this uniformly: **upload** (receiving files) → **generate** (calling AI API via BullMQ) → **transcode** (ffmpeg brings everything to H.264+AAC) → **deliver** (CDN).

**movflags +faststart** is a critical parameter for web video. Without it, the MP4 index (the `moov` atom) sits at the end of the file, so the browser generally has to fetch the whole file before playback can begin. With faststart, ffmpeg moves that metadata to the beginning of the file, and video starts playing as soon as the first bytes arrive.
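A minimal sketch of the transcode step, assuming ffmpeg is installed on the worker and that the helper name `buildTranscodeArgs` is ours. It only builds the argument list; the flags themselves (`-c:v libx264`, `-c:a aac`, `-pix_fmt yuv420p`, `-movflags +faststart`) are standard ffmpeg options.

```typescript
// Builds ffmpeg arguments that normalize any input to web-friendly H.264 + AAC.
function buildTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                       // overwrite the output file if it exists
    "-i", inputPath,            // source file from the generation step
    "-c:v", "libx264",          // H.264 video
    "-pix_fmt", "yuv420p",      // pixel format with the widest player support
    "-c:a", "aac",              // AAC audio
    "-movflags", "+faststart",  // move the moov atom to the start of the file
    outputPath,
  ];
}
```

In a Node.js worker the array would typically be passed to `spawn("ffmpeg", args)` from `node:child_process`. Passing arguments as an array rather than one shell string avoids quoting problems with user-supplied file names.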

Why does the media pipeline use ffmpeg for transcoding after AI video generation?

Cost and Limits: Pricing, Rate Limits, Quality vs Speed

Media generation is the most expensive category of AI operations in absolute terms. A 10-second Runway video costs USD 0.50. A single GPT-4o LLM request - USD 0.01-0.05. A 10-50x difference. At 1000 videos per day - that's USD 500 just on generation, before storage and CDN. This is exactly why cost management here is not an "optimization" - it's mandatory architecture from day one.
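The arithmetic above is simple enough to keep as a helper next to the pricing config. The sketch below uses the per-second Runway price from the provider table (about 5 cents per second). The function name is illustrative, and prices are kept in integer cents to avoid floating-point drift.

```typescript
// Estimated daily generation spend in USD, excluding storage and CDN.
function dailyVideoCostUsd(
  videosPerDay: number,
  secondsPerVideo: number,
  centsPerSecond: number,
): number {
  return (videosPerDay * secondsPerVideo * centsPerSecond) / 100;
}

const perVideoUsd = dailyVideoCostUsd(1, 10, 5);   // 0.5: one 10-second video
const perDayUsd = dailyVideoCostUsd(1000, 10, 5);  // 500: 1000 such videos per day
```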

| Operation | Cost | Generation Time | Rate Limit (typical) |
| --- | --- | --- | --- |
| LLM request (GPT-4o) | USD 0.01-0.05 | 1-5 sec | 500-10000 RPM |
| Image gen (DALL-E 3) | USD 0.04-0.12 | 5-15 sec | 50 images/min |
| TTS (1000 characters) | USD 0.015 | 1-3 sec | 100 RPM |
| Video gen (5 sec) | USD 0.25-0.50 | 30-120 sec | 10-50/hr |
| Music gen (1 track) | USD 0.05-0.10 | 30-60 sec | 20-100/hr |
| SFX (5 sec) | USD 0.01-0.03 | 5-10 sec | 100 RPM |

**Tiered quality:** fast preview (low quality) → final version (high quality) only on user confirmation
**Caching:** hash(prompt + params) → S3 lookup. Identical prompts are never regenerated
**Pre-generation:** stock content (backgrounds, intros) generated in advance via cron
**Budget caps:** daily and monthly limits per user. Without them, one user can burn the entire budget
**Watermark for preview:** free preview with watermark, clean version requires payment
**Shorter content:** with per-second pricing, 5 seconds of video costs half as much as 10 seconds
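The caching strategy in the list above depends on building the same key for the same request. The sketch below shows one way to do that. Two assumptions are ours: parameter keys are sorted so property order does not matter, and the prompt is trimmed. The `djb2Hex` hash is only a dependency-free stand-in; a production key would more likely use a cryptographic hash such as SHA-256, and the `media-cache/` prefix is hypothetical.

```typescript
type GenerationParams = { [key: string]: string | number | boolean };

// Serializes prompt + params deterministically: keys are sorted so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function canonicalize(prompt: string, params: GenerationParams): string {
  const pairs = Object.keys(params)
    .sort()
    .map((key) => [key, params[key]]);
  return JSON.stringify([prompt.trim(), pairs]);
}

// djb2 string hash rendered as hex. Illustrative only, not collision-resistant.
function djb2Hex(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0; // hash * 33 + char, wrapped to 32 bits
  }
  return (hash >>> 0).toString(16);
}

// Object key to look up in storage before calling the generation API.
function cacheKey(prompt: string, params: GenerationParams): string {
  return "media-cache/" + djb2Hex(canonicalize(prompt, params));
}
```

The worker would check storage for `cacheKey(prompt, params)` first and call the provider only on a miss, storing the result under the same key.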

**Without budget caps, one user can spend the entire budget.** At USD 0.50 per video and 1000 requests - that's USD 500 per day. Daily and monthly limits are mandatory from day one in production.
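A per-user cap can be enforced with a check that runs before a job is enqueued. The class below is an in-memory sketch under stated assumptions: the name `BudgetGuard`, the UTC day/month bucketing, and the cents-based accounting are illustrative choices. A real service would keep the counters in Redis or a database so they survive restarts and are shared across instances.

```typescript
interface BudgetCaps {
  dailyUsd: number;
  monthlyUsd: number;
}

class BudgetGuard {
  private caps: BudgetCaps;
  // Spend in integer cents, keyed by "userId|YYYY-MM-DD" and "userId|YYYY-MM".
  private spentCents = new Map<string, number>();

  constructor(caps: BudgetCaps) {
    this.caps = caps;
  }

  // Records the charge and returns true if both caps allow it; otherwise
  // returns false and records nothing.
  tryCharge(userId: string, costUsd: number, now: Date = new Date()): boolean {
    const cents = Math.round(costUsd * 100);
    const iso = now.toISOString(); // UTC, e.g. 2026-01-15T10:00:00.000Z
    const dayKey = userId + "|" + iso.slice(0, 10);
    const monthKey = userId + "|" + iso.slice(0, 7);
    const dayTotal = (this.spentCents.get(dayKey) || 0) + cents;
    const monthTotal = (this.spentCents.get(monthKey) || 0) + cents;
    if (dayTotal > Math.round(this.caps.dailyUsd * 100)) return false;
    if (monthTotal > Math.round(this.caps.monthlyUsd * 100)) return false;
    this.spentCents.set(dayKey, dayTotal);
    this.spentCents.set(monthKey, monthTotal);
    return true;
  }
}
```

The controller would call `tryCharge` with the estimated cost before enqueueing the job and reject the request, for example with HTTP 429, when it returns false.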

An application generates video for users. What strategy best balances UX and cost?

Summary

Video generation uses latent diffusion with temporal consistency - not "images × fps", but 3D attention along the time axis
Video generation APIs (Runway, Sora, Kling) work asynchronously: submit → poll/webhook. 5-10 sec video = 30s-3min generation
Audio generation: Suno/Udio for music (diffusion on spectrograms), ElevenLabs SFX for short UI sounds, Stability Audio for ambient
Production pipeline: BullMQ for async jobs, ffmpeg for transcoding (H.264 + faststart), S3 + CDN for delivery
Cost management is critical: a 5-10 second video costs roughly USD 0.25-0.50. Budget caps and tiered quality are mandatory from day one
AI content licensing is a gray area. Monitor Suno and Udio ToS for commercial use

Questions for Reflection

What video generation use case justifies USD 0.50 per video in a specific product - and how should the unit economics be calculated?
Temporal consistency is solved at the model level. What does this imply for prompt engineering - how does a good video prompt differ from an image prompt?
If Runway goes down - how should a fallback be designed in the media pipeline? Which providers are interchangeable and which are not?

What's Next

Media generation is the final part of the Voice and Multimodal block. Next - AI Agents: systems that autonomously perform tasks, calling tools and making decisions.

AI Agents — From content generation to autonomous actions - agents use tools, including media generation
Image Generation — Foundation for video generation - the same prompting techniques for static images
Cost Management — Media generation is the most expensive AI operation. Advanced optimization strategies

Related Lessons

aie-26-image-generation - Video generation builds on image diffusion models
aie-17-agent-fundamentals - Agents invoke media generation as a tool
aie-29-cost-management - Media generation is the priciest AI operation
aie-24-text-to-speech - Audio generation reuses voice synthesis techniques
ml-33-gan - Generative models for temporal media
ml-11
