AI Engineering

Video and Audio Generation: Sora, Runway, Suno, ElevenLabs - Multimedia AI

Lesson Objectives

  • Evaluate current capabilities and limitations of video/audio generation in 2026
  • Integrate video generation API (Runway) with async jobs via BullMQ
  • Use ElevenLabs SFX and Suno for generating sounds and music
  • Design a media pipeline: upload → generate → transcode (ffmpeg) → CDN
  • Implement cost management: budget caps, quotas, tiered quality

Sora (announced February 2024) generates up to 60 seconds of photorealistic video in about 10 minutes. Runway Gen-3 generates a 10-second clip in about 30 seconds, Kling in about 15. Over the past year, the price dropped 20x. Video generation is following the same path as image generation in 2022: first "impressive demos, limited API", then a production-ready tool on every backend. A commercial that cost USD 50,000 and a week of studio work in 2023 can be generated for about USD 5 in 3 minutes in 2026.

  • Runway is used in Hollywood for pre-visualization - directors generate scenes before shooting, saving hundreds of thousands on storyboards
  • Suno generates 12 million tracks per month via diffusion models on audio spectrograms - more than all record labels worldwide
  • TikTok is testing AI video generation for advertisers: upload a product photo → get a ready-made feed video
  • Canva Video + AI generates social media videos from text descriptions - backend assembles Runway + ElevenLabs TTS into one pipeline

From Demos to Production: How Generative Media Grew Up

**Runway Gen-2 (2023)** brought text-to-video out of research papers and into a usable tool, generating short clips from a prompt. **Pika (2023)** arrived the same year with a focus on quick, accessible video clips. **OpenAI announced Sora (February 2024)**, showing up to a minute of coherent, photorealistic video and signaling that temporal consistency at length was within reach. On the audio side, **ElevenLabs (founded 2022)** pushed realistic voice synthesis into wide use, and **Suno (2023)** did the same for full music tracks with vocals. In roughly two years the field moved from impressive demos to APIs a backend engineer could call in production.

Prerequisites

  • Image Generation API: DALL-E, Stable Diffusion, FLUX - Generating Images

Video Generation: Sora, Runway, Pika - Industry State in 2026

February 2024. OpenAI publishes Sora - 60 seconds of photorealistic video from text. Runway Gen-3 generates 10 seconds in 30 seconds. Kling from Kuaishou - in 15. Over the past year, the price dropped 20x. Video generation is following the same trajectory as image generation in 2022: first "impressive demos, limited API", then a production-ready tool in every backend engineer's hands.

Under the hood, all of them run **latent diffusion** for video - the same diffusion models as Stable Diffusion, extended into the temporal dimension. The key architectural challenge is **temporal consistency**: every frame must be coherent with the previous one. Not just a beautiful image, but a coherent sequence. This is exactly why video generation is 10-50x more computationally expensive than image generation.

| Provider | Max Duration | Resolution | API Access | Price | Status (2026) |
|---|---|---|---|---|---|
| Sora (OpenAI) | up to 60 sec | 1080p | API (limited) | USD 0.15-0.60/video | API in early access |
| Runway Gen-3 Alpha | up to 10 sec | 1080p | REST API | ~USD 0.05 per sec | Production API |
| Pika 2.0 | up to 5 sec | 1080p | REST API | ~USD 0.04 per sec | Production API |
| Kling (Kuaishou) | up to 10 sec | 1080p | REST API | ~USD 0.03 per sec | Production API |
| Luma Dream Machine | up to 5 sec | 1080p | REST API | ~USD 0.04 per sec | Production API |

**Video generation is a resource-intensive operation.** Generating a 10-second video takes from 30 seconds to several minutes. All of these APIs work asynchronously: send a request → get a task_id → poll or receive a webhook for the result. Holding a synchronous HTTP request open for that long is impractical.

Myth: video generation is just image generation × 30fps - take an image model and generate frames one by one.

Reality: temporal consistency is a separate architectural challenge that requires entirely different models.

Generating frames independently produces a flickering slideshow: the character changes appearance every 3 frames, the background jumps, lighting flickers. Video models (Runway, Sora, Kling) use latent diffusion across the entire frame sequence simultaneously, applying 3D attention along the temporal axis. That is a completely different architecture - and it explains why 10 seconds of video costs 10-50x more than generating 10 individual images.

Why do video generation APIs work asynchronously rather than returning results immediately?

Video API Integration: Async Jobs, Polling, Webhooks

Integrating with a video generation API is not "call it, get a response". The pattern is different: **submit task → poll status → download result**. Runway Gen-3 accepts a request, returns a task_id, and then the backend must ask "is it ready?" every 5 seconds. Or configure a webhook - then Runway calls back when done. Webhooks are cleaner but require a public endpoint.
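
The submit → poll → download loop described above can be sketched as follows. This is a minimal illustration, not Runway's actual client: the `TaskStatus` shape, the `pollUntilDone` helper, and the option names are hypothetical, and the status fetcher is injected so the loop does not depend on any specific provider SDK.

```typescript
// Hypothetical status shape; real providers name these fields differently.
type TaskStatus =
  | { state: "pending" | "running" }
  | { state: "succeeded"; outputUrl: string }
  | { state: "failed"; error: string };

interface PollOptions {
  intervalMs: number; // delay between status checks
  maxAttempts: number; // give up after this many checks
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls fetchStatus until the task succeeds, fails, or attempts run out.
async function pollUntilDone(
  fetchStatus: () => Promise<TaskStatus>,
  { intervalMs, maxAttempts, sleep = defaultSleep }: PollOptions,
): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status.state === "succeeded") return status.outputUrl;
    if (status.state === "failed") {
      throw new Error(`Generation failed: ${status.error}`);
    }
    // Still pending or running: wait before the next check.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Timed out after ${maxAttempts} status checks`);
}
```

With `intervalMs: 5000` and `maxAttempts: 60`, this gives the same five-minute ceiling as polling every 5 seconds for 60 iterations. A loop like this belongs inside a background worker, not in a request handler.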

Polling in the main thread is an anti-pattern in production. 60 iterations × 5 seconds = 5 minutes of a blocked worker. BullMQ solves this cleanly: the controller returns a jobId in milliseconds, the worker handles generation in the background, the client periodically checks status via a separate endpoint.

**BullMQ is well suited** to video generation jobs: it supports retries with backoff, progress tracking, TTL-based cleanup of completed jobs, and concurrency control.
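
To make the retry and TTL settings concrete, here is a sketch of job options for a video generation job. The interface is a local mirror written for this example so that it compiles without the `bullmq` package; the specific values (4 attempts, 5-second base delay, 1-hour and 24-hour retention) are assumptions, and the `retryDelaysMs` helper assumes exponential backoff of the form delay × 2^(retry − 1).

```typescript
// Local mirror of the job options used here (assumed shape, not imported from bullmq).
interface RetryJobOptions {
  attempts: number; // total attempts, including the first one
  backoff: { type: "exponential"; delay: number }; // base delay in ms
  removeOnComplete: { age: number }; // seconds to keep completed jobs
  removeOnFail: { age: number }; // seconds to keep failed jobs
}

// Illustrative values only; tune them to the provider's latency and rate limits.
const videoJobOptions: RetryJobOptions = {
  attempts: 4,
  backoff: { type: "exponential", delay: 5_000 },
  removeOnComplete: { age: 3_600 },
  removeOnFail: { age: 86_400 },
};

// Delay before each retry, assuming delay * 2^(retry - 1).
function retryDelaysMs(opts: RetryJobOptions): number[] {
  const retries = opts.attempts - 1; // the first attempt is not a retry
  return Array.from({ length: retries }, (_, i) => opts.backoff.delay * 2 ** i);
}
```

Under these assumptions the three retries wait 5, 10, and 20 seconds. In a BullMQ setup, options like these would typically be passed as the third argument to `queue.add(name, data, options)`, with concurrency set on the worker.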

What architectural pattern is optimal for video generation in production NestJS?

Audio Generation: Music (Suno, Udio) and Sound Effects

Suno generates 12 million tracks per month - more than all record labels in the world combined - at about 5 cents per track. This isn't a replacement for musicians - it's a new toolkit for products that need dynamic sound: games with procedural music, video editors with auto-soundtracks, learning platforms.

AI audio generation splits into three directions: **music** (Suno, Udio - full tracks with vocals via diffusion models on audio spectrograms), **sound effects** (ElevenLabs SFX, Stability Audio - synchronous APIs for UI sounds) and **ambient/background** (background music for videos, podcasts). An important architectural detail: music APIs work asynchronously like video, while sound effect APIs are usually synchronous.

| Provider | Content Type | API Access | Duration | Price |
|---|---|---|---|---|
| Suno v4 | Music with vocals | REST API | up to 4 min | ~USD 0.05 per track |
| Udio | Music with vocals | REST API | up to 2 min | ~USD 0.05 per track |
| ElevenLabs SFX | Sound effects | REST API | up to 22 sec | ~USD 0.01 per effect |
| Stability Audio | Music, SFX | REST API | up to 47 sec | ~USD 0.02 per generation |
| Meta MusicGen | Instrumental music | Self-hosted | up to 30 sec | GPU cost |

**AI music licensing** is a gray area in 2026. Suno and Udio allow commercial use on paid plans, but the legal situation is evolving. For production, monitor ToS updates and consider royalty-free alternatives for critical use cases.

For generating a short notification sound (2 seconds) for a mobile app, which tool is optimal?

Media Pipeline: Upload → Generate → Transcode → Deliver

The raw file from Runway or Suno is not something to hand directly to users. Runway might return an .mp4 with a codec that won't play in Safari. Suno - a wav without normalization. Video without the `faststart` flag won't play until fully downloaded. A production media pipeline handles this uniformly: **upload** (receiving files) → **generate** (calling AI API via BullMQ) → **transcode** (ffmpeg brings everything to H.264+AAC) → **deliver** (CDN).

**movflags +faststart** is a critical parameter for web video. Without it, the browser must download the entire file before playback. With faststart, metadata is moved to the beginning of the file, and video starts playing immediately.
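
A minimal sketch of the transcode step: a helper that builds the ffmpeg argument list for an H.264 + AAC MP4 with `faststart`. The function name and file paths are hypothetical, and the `-pix_fmt yuv420p` flag is an addition not mentioned in the lesson, included here on the assumption that broad player compatibility is wanted.

```typescript
// Builds ffmpeg arguments for a web-friendly MP4 (H.264 video, AAC audio).
function buildTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputPath, // raw file returned by the generation API
    "-c:v", "libx264", // re-encode video to H.264
    "-pix_fmt", "yuv420p", // assumption: 4:2:0 for broad player support
    "-c:a", "aac", // re-encode audio to AAC
    "-movflags", "+faststart", // move metadata to the start of the file
    outputPath,
  ];
}
```

The resulting array can be handed to `spawn("ffmpeg", args)` from `node:child_process` inside the transcode worker. Passing arguments as an array, not a shell string, avoids quoting problems with user-derived file names.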

Why does the media pipeline use ffmpeg for transcoding after AI video generation?

Cost and Limits: Pricing, Rate Limits, Quality vs Speed

Media generation is the most expensive category of AI operations in absolute terms. A 10-second Runway video costs USD 0.50. A single GPT-4o LLM request - USD 0.01-0.05. A 10-50x difference. At 1000 videos per day - that's USD 500 just on generation, before storage and CDN. This is exactly why cost management here is not an "optimization" - it's mandatory architecture from day one.
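
The arithmetic behind these figures, as a small sketch. It uses the ~USD 0.05 per second Runway price quoted earlier in this lesson, expressed in integer cents to avoid floating-point drift; the function name and the 30-day month are assumptions for illustration.

```typescript
// Generation cost in cents: videos × seconds per video × price per second.
function generationCostCents(
  videos: number,
  secondsPerVideo: number,
  centsPerSecond: number,
): number {
  return videos * secondsPerVideo * centsPerSecond;
}

const perVideoCents = generationCostCents(1, 10, 5); // 50 cents = USD 0.50
const perDayCents = generationCostCents(1000, 10, 5); // 50,000 cents = USD 500
const perMonthCents = perDayCents * 30; // 1,500,000 cents = USD 15,000 (30-day month assumed)
```

Storage, transcoding, and CDN traffic come on top of this and are not modeled here.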

| Operation | Cost | Generation Time | Rate Limit (typical) |
|---|---|---|---|
| LLM request (GPT-4o) | USD 0.01-0.05 | 1-5 sec | 500-10000 RPM |
| Image gen (DALL-E 3) | USD 0.04-0.12 | 5-15 sec | 50 images/min |
| TTS (1000 characters) | USD 0.015 | 1-3 sec | 100 RPM |
| Video gen (5 sec) | USD 0.25-0.50 | 30-120 sec | 10-50/hr |
| Music gen (1 track) | USD 0.05-0.10 | 30-60 sec | 20-100/hr |
| SFX (5 sec) | USD 0.01-0.03 | 5-10 sec | 100 RPM |

  • **Tiered quality:** fast preview (low quality) → final version (high quality) only on user confirmation
  • **Caching:** hash(prompt + params) → S3 lookup. Identical prompts are never regenerated
  • **Pre-generation:** stock content (backgrounds, intros) generated in advance via cron
  • **Budget caps:** daily and monthly limits per user. Without them, one user can burn the entire budget
  • **Watermark for preview:** free preview with watermark, clean version requires payment
  • **Shorter content:** 5 seconds of video costs half as much as 10 seconds
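
The caching strategy from the list above, hash(prompt + params) → lookup, can be sketched as a stable cache key. The `cacheKey` name, the `GenerationParams` type, and the whitespace trimming of the prompt are assumptions for this example; the important property is that the same prompt and parameters always produce the same key, regardless of the order in which parameters were supplied.

```typescript
import { createHash } from "node:crypto";

// Hypothetical parameter bag: duration, resolution, model name, and so on.
type GenerationParams = Record<string, string | number | boolean>;

// Stable key: identical prompt + params hash to the same value in any key order.
function cacheKey(prompt: string, params: GenerationParams): string {
  const sortedParams = Object.keys(params)
    .sort()
    .map((key) => [key, params[key]]);
  const payload = JSON.stringify({ prompt: prompt.trim(), params: sortedParams });
  return createHash("sha256").update(payload).digest("hex");
}
```

The hex digest can serve as the object name for the S3 lookup: check whether the object exists before enqueueing a generation job, and enqueue only on a miss.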

**Without budget caps, one user can spend the entire budget.** At USD 0.50 per video and 1000 requests - that's USD 500 per day. Daily and monthly limits are mandatory from day one in production.
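
A minimal sketch of a per-user budget check that runs before a job is enqueued. All names (`BudgetLimits`, `Spend`, `checkBudget`) and the example limits are hypothetical; amounts are in integer cents.

```typescript
interface BudgetLimits {
  dailyCents: number;
  monthlyCents: number;
}

interface Spend {
  todayCents: number; // already spent by this user today
  monthCents: number; // already spent by this user this month
}

type BudgetDecision =
  | { allowed: true }
  | { allowed: false; reason: "daily_cap" | "monthly_cap" };

// Rejects a request if its cost would push the user over either cap.
function checkBudget(
  spend: Spend,
  costCents: number,
  limits: BudgetLimits,
): BudgetDecision {
  if (spend.todayCents + costCents > limits.dailyCents) {
    return { allowed: false, reason: "daily_cap" };
  }
  if (spend.monthCents + costCents > limits.monthlyCents) {
    return { allowed: false, reason: "monthly_cap" };
  }
  return { allowed: true };
}
```

For example, with a daily cap of 500 cents and 480 cents already spent, a 50-cent video is rejected with `daily_cap`. In a real system the check and the spend increment would need to happen atomically, so that concurrent requests cannot both slip under the cap.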

An application generates video for users. What strategy best balances UX and cost?


Summary

  • Video generation uses latent diffusion with temporal consistency - not "images × fps", but 3D attention along the time axis
  • Video generation APIs (Runway, Sora, Kling) work asynchronously: submit → poll/webhook. 5-10 sec video = 30s-3min generation
  • Audio generation: Suno/Udio for music (diffusion on spectrograms), ElevenLabs SFX for short UI sounds, Stability Audio for ambient
  • Production pipeline: BullMQ for async jobs, ffmpeg for transcoding (H.264 + faststart), S3 + CDN for delivery
  • Cost management is critical: video = USD 0.25-1.00 each. Budget caps and tiered quality are mandatory from day one
  • AI content licensing is a gray area. Monitor Suno and Udio ToS for commercial use

Questions for Reflection

  • What video generation use case justifies USD 0.50 per video in a specific product - and how should the unit economics be calculated?
  • Temporal consistency is solved at the model level. What does this imply for prompt engineering - how does a good video prompt differ from an image prompt?
  • If Runway goes down - how should a fallback be designed in the media pipeline? Which providers are interchangeable and which are not?

What's Next

Media generation is the final part of the Voice and Multimodal block. Next - AI Agents: systems that autonomously perform tasks, calling tools and making decisions.

  • AI Agents — From content generation to autonomous actions - agents use tools, including media generation
  • Image Generation — Foundation for video generation - the same prompting techniques for static images
  • Cost Management — Media generation is the most expensive AI operation. Advanced optimization strategies

Related Lessons

  • aie-26-image-generation - Video generation extends image diffusion models into the temporal dimension
  • aie-17-agent-fundamentals - Agents invoke media generation as a tool
  • aie-29-cost-management - Media generation is the priciest AI operation
  • aie-24-text-to-speech - Audio generation reuses voice synthesis techniques
  • ml-33-gan - Generative models for temporal media
  • ml-11