AI Engineering

Image Generation API: DALL-E, Stable Diffusion, FLUX - Generating Images

Lesson Objectives

Compare image generation providers (DALL-E 3, SD SDXL, FLUX.1, Midjourney) and choose the right one for the task
Integrate DALL-E 3 API: generation, saving to S3, prompt engineering
Use Stability AI API: negative prompts, image-to-image, seed for reproducibility
Implement image editing: inpainting, variations, style transfer, background removal
Design a production pipeline: moderation → cache → generate → S3/CDN → quotas

Stable Diffusion was released in August 2022 as the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. Within one week: 10 million images. This wasn't about drawing. It was about image generation becoming infrastructure. Canva added AI generation and got 2 billion images in the first year, each one an API call at USD 0.04-0.12. At that scale, pipeline architecture (caching, moderation, provider selection) determines the difference between profit and a multi-million-dollar loss.

Canva AI generates billions of images - DALL-E + Stable Diffusion + proprietary models
Shopify automatically generates backgrounds for product photos - inpainting + background removal
Adobe Firefly is built into Photoshop - generative fill (inpainting) as a native design tool
FLUX.1 (2024) - new SOTA: outperforms SDXL on quality and speed, actively used in production

From GAN to FLUX: 10 years on one screen

**2014**: Ian Goodfellow invents the GAN (Generative Adversarial Network) - the first working method for image generation through two competing networks.
**2021**: OpenAI releases DALL-E - text to image via a transformer, results are surprising.
**August 2022**: Stability AI open-sources Stable Diffusion - the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. One week later - 10 million images. The market explodes.
**2022-2023**: Midjourney v4/v5 sets a new bar for artistic quality - a Discord bot becomes a designer's tool.
**2024**: Black Forest Labs releases FLUX.1 - new SOTA, outperforming SDXL on quality at comparable speed. The diffusion process (noise to image) has definitively won over GAN.

Prerequisites

LLM API Integration: OpenAI, Anthropic, Open-Source Models

Image Generation Landscape: DALL-E 3, Stable Diffusion, Midjourney, FLUX

August 2022. Stable Diffusion goes open-source - the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. In one week - 10 million images generated. This wasn't about drawing. It was about image generation becoming infrastructure - the same way file storage or email delivery became infrastructure.

The market isn't a monopoly of one name. Three distinct niches, three different answers to the question 'what does production actually need'.

Provider	API Access	Strengths	Price per Image	Self-hosting
DALL-E 3 (OpenAI)	REST API, SDK	Text understanding, prompt adherence	USD 0.04-0.12	No
Stable Diffusion SDXL (Stability AI)	REST API, SDK	Flexibility, ControlNet, LoRA fine-tuning	USD 0.03-0.07	Yes (GPU)
Midjourney	No official API	Artistic quality, style	~USD 0.02 (subscription)	No
FLUX.1 (Black Forest Labs)	REST API	New SOTA: speed + text quality	USD 0.03-0.06	Yes (GPU)
Ideogram	REST API	Best text rendering	USD 0.04-0.08	No

**Midjourney has no official API** (as of early 2026). Integration is only possible through a Discord bot or unofficial wrapper libraries - unreliable for production. For serious projects: DALL-E 3, Stability AI, or FLUX.1.
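The comparison table can be turned into a first-pass filter. The sketch below is only an illustration: the `Provider` fields, the `shortlist` helper, and the decision to mark LoRA support only where the table's "Strengths" column mentions it are assumptions made for this example, not an official capability matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    official_api: bool
    self_hostable: bool
    lora_finetuning: bool  # simplification: taken from the "Strengths" column
    min_price_usd: float
    max_price_usd: float


# Values copied from the comparison table above.
PROVIDERS = [
    Provider("DALL-E 3", True, False, False, 0.04, 0.12),
    Provider("Stable Diffusion SDXL", True, True, True, 0.03, 0.07),
    Provider("Midjourney", False, False, False, 0.02, 0.02),
    Provider("FLUX.1", True, True, False, 0.03, 0.06),
    Provider("Ideogram", True, False, False, 0.04, 0.08),
]


def shortlist(need_api=True, need_self_hosting=False, need_lora=False):
    """Return the names of providers that satisfy every hard requirement."""
    result = []
    for p in PROVIDERS:
        if need_api and not p.official_api:
            continue
        if need_self_hosting and not p.self_hostable:
            continue
        if need_lora and not p.lora_finetuning:
            continue
        result.append(p.name)
    return result
```

Hard requirements (official API, self-hosting, fine-tuning) narrow the list first; price and output quality are then compared only among the providers that remain.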

For a production application with mass image generation (5000+/day) and the need for fine-tuning on a corporate style, which provider is optimal?

OpenAI Images API: DALL-E 3, Prompting, Sizes, Quality

DALL-E 3 does something unusual: it **rewrites the prompt** before generation. Send 'a cat' - internally it expands to something like 'A fluffy orange tabby cat sitting on a windowsill in warm afternoon light'. Not a bug. An architectural decision that goes a long way toward explaining why DALL-E 3 follows intent so well. The rewritten text comes back in the response as `revised_prompt`.

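One call, three parameters that matter: `model`, `prompt`, `size`. Only `prompt` is strictly required by the API, but `model` must be set explicitly to get DALL-E 3. Cost: USD 0.04 per image at standard quality, USD 0.08 at HD (1024x1024). The main production catch: the URL in the response lives for 60 minutes.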

**Specificity:** "A red sports car" → "A cherry red 2024 Porsche 911 on a wet mountain road at sunset, cinematic lighting"
**Style:** "digital art", "oil painting", "3D render", "watercolor", "photorealistic"
**Composition:** "close-up", "aerial view", "wide angle", "centered", "rule of thirds"
**Lighting:** "soft diffused light", "golden hour", "dramatic shadows", "neon glow"
**Exclusions:** DALL-E 3 doesn't support negative prompts - describe in text: "without text, no watermarks"
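The five prompt components above can be assembled by a small helper. This is a minimal sketch: `build_prompt` and its parameter names are hypothetical, and the "Without ..." phrasing is just one way to express exclusions in plain text for a model that has no negative-prompt parameter.

```python
def build_prompt(subject, style=None, composition=None, lighting=None, exclude=()):
    """Join a specific subject with optional style, composition and lighting hints."""
    parts = [subject.strip()]
    for extra in (style, composition, lighting):
        if extra:
            parts.append(extra.strip())
    prompt = ", ".join(parts)
    # No negative-prompt parameter exists, so exclusions go into the text itself.
    if exclude:
        prompt += ". Without " + ", without ".join(exclude) + "."
    return prompt


example = build_prompt(
    "A cherry red 2024 Porsche 911 on a wet mountain road at sunset",
    style="photorealistic",
    composition="wide angle",
    lighting="golden hour",
    exclude=["text", "watermarks"],
)
```

Keeping the components as separate arguments makes it easy to vary one of them (for example, only the lighting) while the rest of the prompt stays fixed.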

**DALL-E image URLs expire in 60 minutes.** Storing a URL in the database means a broken link for every user within an hour. Always request `b64_json` and upload to S3/CDN for permanent storage.
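A minimal sketch of the "request `b64_json`, then upload" pattern follows. It assumes `client` is an `openai.OpenAI()` instance and `s3` is a `boto3.client("s3")` instance, both passed in by the caller; the function name, the bucket argument, and the hash-based key scheme are hypothetical choices for this example.

```python
import base64
import hashlib


def generate_and_store(client, s3, bucket, prompt, size="1024x1024", quality="standard"):
    """Generate one image with DALL-E 3 and upload its bytes to S3.

    client: assumed to be an openai.OpenAI() instance.
    s3: assumed to be a boto3.client("s3") instance.
    """
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,  # DALL-E 3 generates one image per request
        response_format="b64_json",  # image bytes in the response, no expiring URL
    )
    item = response.data[0]
    image_bytes = base64.b64decode(item.b64_json)

    # Hypothetical key scheme: a content hash, so identical bytes map to one object.
    key = "generated/" + hashlib.sha256(image_bytes).hexdigest() + ".png"
    s3.put_object(Bucket=bucket, Key=key, Body=image_bytes, ContentType="image/png")

    # revised_prompt is the rewritten prompt DALL-E 3 actually used.
    return {"key": key, "revised_prompt": getattr(item, "revised_prompt", None)}
```

The database then stores the S3 key (or the CDN URL derived from it) rather than anything returned by the generation API, so nothing in the stored record expires.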

Why is it necessary to save the image to S3 when using DALL-E 3 API, rather than storing the URL from the response?

Stability AI API: Models, ControlNet, Negative Prompts

Stable Diffusion runs a **diffusion process**: starts with pure noise and iteratively denoises it, guided by CFG scale (classifier-free guidance). Each step - a small shift from random to meaningful. 20-50 steps later - the image exists.
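To make the role of the CFG scale concrete: at each denoising step, classifier-free guidance combines a noise prediction made without the prompt and one made with it. The toy sketch below shows only that combination on plain lists of numbers; `cfg_combine` is a name chosen for this example, and a real sampler applies the same formula to tensors inside the denoising loop.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the prompt-conditioned one, amplified by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 3.0]    # toy prediction with the prompt

no_guidance = cfg_combine(uncond, cond, 0.0)      # equals uncond
plain_prompt = cfg_combine(uncond, cond, 1.0)     # equals cond
strong_guidance = cfg_combine(uncond, cond, 7.5)  # pushed past cond
```

A scale of 1 reproduces the prompt-conditioned prediction, and larger values exaggerate the difference the prompt makes, which is why a high CFG scale means stricter prompt adherence and usually less variety.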

The key differentiator from DALL-E: **negative prompts**. A separate parameter that explicitly tells the model what should not appear. Pass `blurry, low quality, watermark, deformed` and those artifacts become much less likely in the output, though not impossible. DALL-E 3 has no such mechanism.

Capability	DALL-E 3	Stable Diffusion SDXL
Negative prompts	No (text only)	Yes (separate parameter)
Image-to-Image	No	Yes (strength control)
ControlNet	No	Yes (pose, edge, depth)
Seed for repeatability	No	Yes
Fine-tuning (LoRA)	No	Yes
Self-hosting	No	Yes (open-source)
Prompt understanding	Excellent (rewrites)	Good (literal)

**Seed for reproducibility:** the same prompt + seed, with the same model and generation parameters, gives the same image every time. Useful for A/B testing (change only the prompt, keep the seed fixed) and for creating series of similar images with a consistent style.
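The request body for a seeded generation with a negative prompt might look like the sketch below. It assumes the JSON shape of Stability AI's v1 text-to-image endpoint, where a negative prompt is an extra `text_prompts` entry with negative weight; newer endpoints expose it under a different field, so the field names should be checked against the current API reference. The helper names `build_sdxl_request` and `ab_pair` are hypothetical.

```python
def build_sdxl_request(prompt, negative_prompt=None, seed=0, cfg_scale=7.0, steps=30):
    """Build a JSON body (assumed v1 text-to-image shape) for one SDXL image."""
    text_prompts = [{"text": prompt, "weight": 1.0}]
    if negative_prompt:
        # Assumption: a negative weight marks this entry as a negative prompt.
        text_prompts.append({"text": negative_prompt, "weight": -1.0})
    return {
        "text_prompts": text_prompts,
        "cfg_scale": cfg_scale,
        "steps": steps,
        "seed": seed,  # assumption: 0 asks the service for a random seed
        "width": 1024,
        "height": 1024,
        "samples": 1,
    }


def ab_pair(prompt_a, prompt_b, seed, negative_prompt=None):
    """Two requests that differ only in the prompt, for A/B comparison."""
    return (
        build_sdxl_request(prompt_a, negative_prompt, seed),
        build_sdxl_request(prompt_b, negative_prompt, seed),
    )


request_a, request_b = ab_pair(
    "a lighthouse at dawn, oil painting",
    "a lighthouse at dawn, watercolor",
    seed=42,
    negative_prompt="blurry, low quality, watermark, deformed",
)
```

Because the seed, negative prompt, and every other parameter are shared, any visible difference between the two results can be attributed to the prompt change alone.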

Which Stable Diffusion parameter allows excluding unwanted elements from generation?

Image Editing: Inpainting, Outpainting, Variations, Style Transfer

Shopify replaces product photo backgrounds automatically. Adobe baked generative fill into Photoshop. Not filters - **inpainting**: replacing a selected area by mask while preserving the rest of the image context.

Four core editing operations: **inpainting** (replacing an area by mask), **outpainting** (extending beyond the frame), **variations** (creating variations of an existing image), **style transfer** (changing style while keeping structure).
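What "replacing an area by mask" means at the pixel level can be shown with a toy compositing step. This is not how a diffusion model inpaints internally (the model also uses the unmasked context to decide what to draw); it only illustrates the guarantee that pixels outside the mask are preserved. The convention that 1 means "regenerate" is an assumption of this sketch, and real APIs define their own mask format.

```python
def composite(original, generated, mask):
    """Per pixel: take `generated` where mask is 1, keep `original` where mask is 0."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[1, 1], [1, 1]]   # toy 2x2 "image"
generated = [[9, 9], [9, 9]]  # what the model produced
mask = [[0, 1], [0, 0]]       # only the top-right pixel may change

result = composite(original, generated, mask)
```

Image-to-image generation, by contrast, has no mask: every pixel may change, and the strength parameter controls how far the whole image drifts from the source.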

**DALL-E 3 does not support edit and variations.** These operations are only available through DALL-E 2. For production inpainting - Stability AI API: better quality and more control through mask, negative prompt, and seed.

How does inpainting differ from image-to-image generation?

Production: Moderation, Caching, CDN, Cost

Canva: 2 billion generations in the first year. Each one USD 0.04-0.12. At that scale, the difference between 'just call the API' and 'design the pipeline correctly' is literally millions of dollars per year.

The production pipeline: **content moderation** (block prohibited content before generation - before spending money), **cache check** (an identical prompt with identical parameters is served from storage instead of being generated again), **generate → S3** (replace the 60-minute URL with a permanent one), **quota tracking** (per-user limits by plan).
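The pipeline order can be sketched as a single function with its dependencies injected. Everything here is illustrative: `handle_request`, `cache_key`, the two exception classes, and the dict-based cache and usage counters are hypothetical stand-ins for a real moderation call, cache service, object store, and quota database.

```python
import hashlib


class PromptRejected(Exception):
    pass


class QuotaExceeded(Exception):
    pass


def cache_key(prompt, size, quality):
    # Assumption: prompts differing only in case or whitespace may share a result.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{normalized}|{size}|{quality}".encode()).hexdigest()


def handle_request(user_id, prompt, *, is_flagged, generate, store, cache, usage,
                   limit, size="1024x1024", quality="standard"):
    # 1. Moderation first: rejected prompts never reach the paid API.
    if is_flagged(prompt):
        raise PromptRejected(prompt)
    # 2. Cache: an identical request returns the stored URL at no cost.
    key = cache_key(prompt, size, quality)
    if key in cache:
        return cache[key]
    # 3. Quota is checked only when a paid generation is about to happen.
    if usage.get(user_id, 0) >= limit:
        raise QuotaExceeded(user_id)
    # 4. Generate, then persist to permanent storage and record usage.
    image_bytes = generate(prompt, size, quality)
    url = store(key, image_bytes)
    cache[key] = url
    usage[user_id] = usage.get(user_id, 0) + 1
    return url
```

One design choice worth noting: in this sketch cache hits do not count against the user's quota, since they cost nothing to serve. A product that bills per request rather than per generation would move the quota check above the cache lookup.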

Generation Volume	Cost per Month (DALL-E 3 standard, USD 0.04/image, 30 days)	Recommendation
100 images/day	~USD 120	OpenAI API, basic cache
1,000 images/day	~USD 1,200	Caching + CDN + quotas
10,000 images/day	~USD 12,000	Self-hosted SD + DALL-E for premium
100,000+ images/day	~USD 120,000	Dedicated GPU cluster, self-hosted
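The monthly figures in the table follow from simple arithmetic: images per day × 30 days × USD 0.04. A small sketch makes the calculation explicit and adds a cache hit rate, which is an assumption introduced here to show how caching changes the bill; `monthly_cost_usd` is a hypothetical helper.

```python
def monthly_cost_usd(images_per_day, price_per_image=0.04, days=30, cache_hit_rate=0.0):
    """Estimated monthly API spend; cached requests are assumed to cost nothing."""
    paid_images = images_per_day * days * (1.0 - cache_hit_rate)
    return round(paid_images * price_per_image, 2)


table_rows = [monthly_cost_usd(n) for n in (100, 1_000, 10_000, 100_000)]
with_cache = monthly_cost_usd(1_000, cache_hit_rate=0.25)
```

At HD pricing (USD 0.08) every figure doubles, which is why quality tier and cache hit rate are worth tracking per request.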

**Content moderation is mandatory.** Without prompt checking, users can generate prohibited content. OpenAI blocks some requests on its own - but that's not complete protection. Preliminary moderation via the Moderation API plus logging all prompts is the required minimum.
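A minimal sketch of the "moderate and log" minimum, assuming `client` is an `openai.OpenAI()` instance passed in by the caller. The function name, logger name, and log format are hypothetical; a production system would also decide how long prompts are retained and who can read them.

```python
import logging

logger = logging.getLogger("image_moderation")


def is_prompt_allowed(client, prompt, user_id):
    """Return True when the Moderation API does not flag the prompt.

    client: assumed to be an openai.OpenAI() instance.
    """
    result = client.moderations.create(input=prompt).results[0]
    # Log every prompt together with the decision, as an audit trail.
    logger.info("user=%s flagged=%s prompt=%r", user_id, result.flagged, prompt)
    return not result.flagged
```

The check runs before any image request is sent, so a flagged prompt costs one moderation call and no generation.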

What is the first step before generating an image from a user prompt in production?

Misconception: image generation means DALL-E

Reality: the market is diverse: Midjourney (artistic quality), SD SDXL (open-source, LoRA, ControlNet), FLUX.1 (new 2024 SOTA), Ideogram (text rendering). DALL-E 3 is one option, not a monopoly

DALL-E is the most recognizable brand, but not the best at everything. Midjourney produces artistic results DALL-E 3 can't match. SD SDXL and FLUX.1 can both be self-hosted, and SDXL is the established choice for LoRA fine-tuning, which matters at 10,000+ generations per day. FLUX.1 (2024) surpassed SDXL on key metrics. Choosing a provider is an engineering decision, not a marketing one.

Misconception: an image generation API is just prompt → image, like a text API

Reality: 60-minute URLs, content moderation, S3 pipeline, negative prompts, seed for reproducibility, CFG scale - this is a distinct engineering discipline

A beginner calls the API, gets a URL, saves it to the database. An hour later, images are broken for every user. Then learns about moderation - after the first NSFW incident. The correct architecture: moderation → cache → b64_json → S3 → CDN. That's 50 lines of code separating a toy project from production.

Summary

The market isn't a monopoly: DALL-E 3 (USD 0.04 per img) - prompt understanding, SD SDXL - LoRA/ControlNet/self-hosting, FLUX.1 - new 2024 SOTA, Midjourney - artistic quality without an API
DALL-E 3 rewrites the prompt internally - that's a feature, not a bug. URLs live 60 minutes - always use b64_json + S3
Diffusion process: noise to image in 20-50 steps. CFG scale controls prompt adherence. Negative prompts are a separate parameter in Stable Diffusion; DALL-E 3 has none
Image editing: inpainting (area replacement by mask), image-to-image (strength 0-1), style transfer, background removal - via DALL-E 2 or Stability AI
Production pipeline: moderation (first step, before spending money) → cache → generate → S3/CDN → quota. At 10,000+/day - self-hosted SD saves an order of magnitude in cost

What's Next

Images are static content. The next frontier of AI generation is video and audio: dynamic content, async pipelines, and significantly greater resource requirements.

Video and Audio Generation — From static images to video (Sora, Runway) and music (Suno) - a new level of AI generation
Multimodal AI — Vision API analyzes images, Image Generation creates them - two directions of working with visual content
Cost Management — Image generation is one of the most expensive AI operations. Cost optimization strategies

Related Lessons

aie-05-api-integration — Image APIs use the same integration patterns
aie-27-video-audio-generation — Image generation is the base for video frames
aie-25-multimodal — Vision understanding complements image generation
aie-29-cost-management — Image generation cost grows fast at scale
ml-33-gan — Earlier generative approach to synthesizing images