AI Engineering
Image Generation API: DALL-E, Stable Diffusion, FLUX - Generating Images
Lesson Objectives
- Compare image generation providers (DALL-E 3, SD SDXL, FLUX.1, Midjourney) and choose the right one for the task
- Integrate DALL-E 3 API: generation, saving to S3, prompt engineering
- Use Stability AI API: negative prompts, image-to-image, seed for reproducibility
- Implement image editing: inpainting, variations, style transfer, background removal
- Design a production pipeline: moderation → cache → generate → S3/CDN → quotas
Stable Diffusion (August 2022) went open-source as the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. In one week - 10 million images. This wasn't about drawing. It was about image generation becoming infrastructure. Canva added AI generation and got 2 billion images in the first year. Each one an API call at USD 0.04-0.12. At that scale, pipeline architecture - caching, moderation, provider selection - determines the difference between profit and a multi-million dollar loss.
- Canva AI generates billions of images - DALL-E + Stable Diffusion + proprietary models
- Shopify automatically generates backgrounds for product photos - inpainting + background removal
- Adobe Firefly is built into Photoshop - generative fill (inpainting) as a native design tool
- FLUX.1 (2024) - new SOTA: outperforms SDXL on quality at comparable speed, actively used in production
From GAN to FLUX: 10 years on one screen
- **2014**: Ian Goodfellow and colleagues introduce the GAN (Generative Adversarial Network) - image generation through two competing networks, a generator and a discriminator.
- **2021**: OpenAI releases DALL-E - text to image via a transformer; the results are surprising.
- **August 2022**: Stability AI open-sources Stable Diffusion - the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. One week later - 10 million images. The market explodes.
- **2022-2023**: Midjourney v4/v5 sets a new bar for artistic quality - a Discord bot becomes a designer's tool.
- **2024**: Black Forest Labs releases FLUX.1 - new SOTA, outperforming SDXL on quality at comparable speed.

Diffusion (noise to image) has displaced GANs as the dominant approach to image generation.
Prerequisites
Image Generation Landscape: DALL-E 3, Stable Diffusion, Midjourney, FLUX
August 2022. Stable Diffusion goes open-source - the open-weights answer to DALL-E 2, which OpenAI had unveiled that April. In one week - 10 million images generated. This wasn't about drawing. It was about image generation becoming infrastructure - the same way file storage or email delivery became infrastructure.
The market isn't a monopoly of one name. Three distinct niches, three different answers to the question 'what does production actually need'.
| Provider | API Access | Strengths | Price per Image | Self-hosting |
|---|---|---|---|---|
| DALL-E 3 (OpenAI) | REST API, SDK | Text understanding, prompt adherence | USD 0.04-0.12 | No |
| Stable Diffusion SDXL (Stability AI) | REST API, SDK | Flexibility, ControlNet, LoRA fine-tuning | USD 0.03-0.07 | Yes (GPU) |
| Midjourney | No official API | Artistic quality, style | ~USD 0.02 (subscription) | No |
| FLUX.1 (Black Forest Labs) | REST API | New SOTA: speed + text quality | USD 0.03-0.06 | Yes (GPU) |
| Ideogram | REST API | Best text rendering | USD 0.04-0.08 | No |
**Midjourney has no official API** (as of early 2026). Integration is only possible through a Discord bot or unofficial wrapper libraries - unreliable for production. For serious projects: DALL-E 3, Stability AI, or FLUX.1.
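The comparison above can be read as a filter over requirements. The sketch below is only an illustration: the `Provider` record and the `candidates` helper are hypothetical names, and the capability flags simply mirror what this lesson's table and text state, not an independent survey of each vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    official_api: bool
    self_hosting: bool
    lora_fine_tuning: bool


# Flags transcribed from the lesson's comparison table (assumption: fine-tuning
# is marked only where the lesson explicitly mentions LoRA support).
PROVIDERS = [
    Provider("DALL-E 3", official_api=True, self_hosting=False, lora_fine_tuning=False),
    Provider("Stable Diffusion SDXL", official_api=True, self_hosting=True, lora_fine_tuning=True),
    Provider("Midjourney", official_api=False, self_hosting=False, lora_fine_tuning=False),
    Provider("FLUX.1", official_api=True, self_hosting=True, lora_fine_tuning=False),
    Provider("Ideogram", official_api=True, self_hosting=False, lora_fine_tuning=False),
]


def candidates(need_api=True, need_self_hosting=False, need_fine_tuning=False):
    """Return the names of providers that satisfy every stated requirement."""
    return [
        p.name
        for p in PROVIDERS
        if (p.official_api or not need_api)
        and (p.self_hosting or not need_self_hosting)
        and (p.lora_fine_tuning or not need_fine_tuning)
    ]
```

Calling `candidates(need_self_hosting=True, need_fine_tuning=True)` narrows the list to the self-hostable, fine-tunable option, which is the shape of decision a high-volume, brand-styled product has to make.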
For a production application with mass image generation (5000+/day) and the need for fine-tuning on a corporate style, which provider is optimal?
OpenAI Images API: DALL-E 3, Prompting, Sizes, Quality
DALL-E 3 does something unusual: it **rewrites the prompt** before generation. Send 'a cat' - internally it expands to something like 'A fluffy orange tabby cat sitting on a windowsill in warm afternoon light'. Not a bug. An architectural decision that explains why DALL-E 3 handles short or vague prompts so well. The rewritten text comes back in the response as `revised_prompt`, so it can be logged and inspected.
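One call. Only `prompt` is strictly required, but set `model` and `size` explicitly in every request, because the endpoint does not default to DALL-E 3. Cost at 1024x1024: USD 0.04 for standard, USD 0.08 for HD. The main production catch - the URL from the response lives for 60 minutes.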
- **Specificity:** "A red sports car" → "A cherry red 2024 Porsche 911 on a wet mountain road at sunset, cinematic lighting"
- **Style:** "digital art", "oil painting", "3D render", "watercolor", "photorealistic"
- **Composition:** "close-up", "aerial view", "wide angle", "centered", "rule of thirds"
- **Lighting:** "soft diffused light", "golden hour", "dramatic shadows", "neon glow"
- **Exclusions:** DALL-E 3 doesn't support negative prompts - describe in text: "without text, no watermarks"
**DALL-E image URLs expire in 60 minutes.** Storing a URL in the database means a broken link for every user within an hour. Always request `b64_json` and upload to S3/CDN for permanent storage.
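A minimal sketch of that advice, using only the standard library so the request shape is visible. The field names (`model`, `prompt`, `n`, `size`, `quality`, `response_format`) and the `data[0].b64_json` / `revised_prompt` response fields follow the OpenAI Images API as commonly documented. The helper names are hypothetical, and the upload to S3/CDN is deliberately left to the caller.

```python
import base64
import json
import urllib.request

OPENAI_IMAGES_URL = "https://api.openai.com/v1/images/generations"


def build_request(prompt: str, size: str = "1024x1024", quality: str = "standard") -> dict:
    """Request body that asks for inline base64 data instead of a short-lived URL."""
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "n": 1,  # DALL-E 3 generates one image per request
        "size": size,
        "quality": quality,  # "standard" or "hd"
        "response_format": "b64_json",
    }


def decode_image(response: dict) -> tuple[bytes, str]:
    """Extract raw image bytes and the rewritten prompt from a parsed response."""
    item = response["data"][0]
    return base64.b64decode(item["b64_json"]), item.get("revised_prompt", "")


def generate(prompt: str, api_key: str) -> tuple[bytes, str]:
    """POST the request and return (image bytes, revised prompt)."""
    req = urllib.request.Request(
        OPENAI_IMAGES_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return decode_image(json.load(resp))
```

The bytes returned by `generate` are what should go to permanent storage; only the resulting S3/CDN URL belongs in the database.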
Why is it necessary to save the image to S3 when using DALL-E 3 API, rather than storing the URL from the response?
Stability AI API: Models, ControlNet, Negative Prompts
Stable Diffusion runs a **diffusion process**: starts with pure noise and iteratively denoises it, guided by CFG scale (classifier-free guidance). Each step - a small shift from random to meaningful. 20-50 steps later - the image exists.
The key differentiator from DALL-E: **negative prompts**. A separate parameter that explicitly tells the model what should not appear. With `blurry, low quality, watermark, deformed`, those artifacts become much less likely in the output. DALL-E 3 has no such mechanism.
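How CFG scale and negative prompts interact is easier to see in the guidance formula itself. The toy function below shows the usual classifier-free guidance combination on plain lists of numbers. It is a sketch of the arithmetic only, not a sampler, and the function name is hypothetical. In many open-source Stable Diffusion pipelines the negative prompt takes the place of the empty "unconditional" prompt, so the same formula pushes the prediction away from it.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    eps_uncond: noise prediction for the empty (or negative) prompt
    eps_cond:   noise prediction for the user's prompt
    scale:      CFG scale; 1.0 reproduces eps_cond, larger values exaggerate
                the difference between the two predictions
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `scale=1.0` the result equals the conditional prediction; typical values around 7 amplify whatever distinguishes the prompt from the negative or empty prompt, which is why very high CFG tends to produce oversaturated, overly literal images.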
| Capability | DALL-E 3 | Stable Diffusion SDXL |
|---|---|---|
| Negative prompts | No (text only) | Yes (separate parameter) |
| Image-to-Image | No | Yes (strength control) |
| ControlNet | No | Yes (pose, edge, depth) |
| Seed for repeatability | No | Yes |
| Fine-tuning (LoRA) | No | Yes |
| Self-hosting | No | Yes (open-source) |
| Prompt understanding | Excellent (rewrites) | Good (literal) |
**Seed for reproducibility:** the same prompt + seed = the same image every time. Useful for A/B testing (change only the prompt, keep the seed fixed) and for creating series of similar images with a consistent style.
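A sketch of a request payload that pins the seed and carries a negative prompt. The field names (`text_prompts` with a negative `weight`, `cfg_scale`, `steps`, `seed`, `samples`) follow Stability AI's v1 text-to-image REST body as an assumption; newer endpoints expose a separate `negative_prompt` field, so check the current API reference before relying on this shape. `sdxl_payload` is a hypothetical helper.

```python
import random


def sdxl_payload(prompt, negative_prompt="", seed=None, cfg_scale=7.0, steps=30):
    """Build a text-to-image body with an explicit, recorded seed."""
    if seed is None:
        # Choose the seed ourselves so it can be logged and reused later.
        seed = random.randrange(1, 2**32)
    text_prompts = [{"text": prompt, "weight": 1.0}]
    if negative_prompt:
        # Assumption: a negative weight marks this entry as a negative prompt.
        text_prompts.append({"text": negative_prompt, "weight": -1.0})
    return {
        "text_prompts": text_prompts,
        "cfg_scale": cfg_scale,
        "steps": steps,
        "seed": seed,
        "width": 1024,
        "height": 1024,
        "samples": 1,
    }


# A/B test: same seed, only the prompt differs.
variant_a = sdxl_payload("a red sports car, studio lighting", "blurry, watermark", seed=42)
variant_b = sdxl_payload("a blue sports car, studio lighting", "blurry, watermark", seed=42)
```

Reproducibility holds only when everything else is fixed as well: the same model version, sampler, step count, and image size. Store the seed next to the prompt in the generation log.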
Which Stable Diffusion parameter allows excluding unwanted elements from generation?
Image Editing: Inpainting, Outpainting, Variations, Style Transfer
Shopify replaces product photo backgrounds automatically. Adobe baked generative fill into Photoshop. Not filters - **inpainting**: replacing a selected area by mask while preserving the rest of the image context.
Four core editing operations: **inpainting** (replacing an area by mask), **outpainting** (extending beyond the frame), **variations** (creating variations of an existing image), **style transfer** (changing style while keeping structure).
**DALL-E 3 does not support edits or variations.** Within the DALL-E family, these operations are only available through DALL-E 2. For production inpainting - Stability AI API: better quality and more control through mask, negative prompt, and seed.
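The contract that separates inpainting from plain image-to-image can be shown with a per-pixel blend: inside the mask the generated pixel wins, outside it the original survives untouched. This is a toy illustration on flat grayscale lists. Real inpainting models also condition the denoising on the mask and the surrounding context, and the mask convention used here (1 = repaint, 0 = keep) is an assumption, since providers differ.

```python
def composite(original, generated, mask):
    """Blend two images of equal length; mask values lie in [0, 1].

    mask == 1 -> take the generated pixel (area being replaced)
    mask == 0 -> keep the original pixel (context that must not change)
    Fractional values feather the edge between the two.
    """
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated and mask must have the same length")
    return [round(m * g + (1 - m) * o) for o, g, m in zip(original, generated, mask)]
```

Image-to-image, by contrast, re-noises the whole picture according to `strength`, so every pixel may change; there is no region that is guaranteed to stay as it was.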
How does inpainting differ from image-to-image generation?
Production: Moderation, Caching, CDN, Cost
Canva: 2 billion generations in the first year. Each one USD 0.04-0.12. At that scale, the difference between 'just call the API' and 'design the pipeline correctly' is literally millions of dollars per year.
The production pipeline: **content moderation** (block prohibited content before generation - before spending money), **cache check** (an identical prompt with identical parameters returns the stored image instead of paying for a new generation), **generate → S3** (replace the 60-minute URL with a permanent one), **quota tracking** (per-user limits by plan).
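The stages above can be sketched as one small class. Everything provider-specific is injected as a callable, so `moderate`, `generate`, and `store` are hypothetical stand-ins for a moderation check, an image API call, and an S3/CDN upload. The in-memory dictionaries stand in for a real cache and quota store, and the quota never resets here.

```python
import hashlib
import json


class PromptRejected(Exception):
    pass


class QuotaExceeded(Exception):
    pass


def cache_key(prompt: str, params: dict) -> str:
    """Stable key: whitespace-normalized prompt plus sorted generation parameters."""
    normalized = " ".join(prompt.split())
    blob = json.dumps({"prompt": normalized, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ImagePipeline:
    def __init__(self, moderate, generate, store, daily_limit):
        self.moderate = moderate  # prompt -> True if the prompt is allowed
        self.generate = generate  # (prompt, params) -> image bytes
        self.store = store        # (key, image bytes) -> permanent URL
        self.daily_limit = daily_limit
        self.cache = {}           # cache key -> permanent URL
        self.used = {}            # user id -> paid generations so far

    def request(self, user_id: str, prompt: str, params: dict) -> str:
        # 1. Moderation first: a rejected prompt must never cost money.
        if not self.moderate(prompt):
            raise PromptRejected(prompt)
        # 2. Cache: identical prompt and parameters reuse the stored image.
        key = cache_key(prompt, params)
        if key in self.cache:
            return self.cache[key]
        # 3. Quota is checked before the paid call and counted after it succeeds.
        if self.used.get(user_id, 0) >= self.daily_limit:
            raise QuotaExceeded(user_id)
        # 4. Generate, then move the bytes to permanent storage.
        url = self.store(key, self.generate(prompt, params))
        self.cache[key] = url
        self.used[user_id] = self.used.get(user_id, 0) + 1
        return url
```

One design choice worth noting: cache hits do not consume quota in this sketch, since they cost nothing to serve. A product that bills per request rather than per generation would count them differently.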
| Generation Volume | Cost/month (DALL-E 3 at USD 0.04/image) | Recommendation |
|---|---|---|
| 100 images/day | ~USD 120 | OpenAI API, basic cache |
| 1,000 images/day | ~USD 1,200 | Caching + CDN + quotas |
| 10,000 images/day | ~USD 12,000 | Self-hosted SD + DALL-E for premium |
| 100,000+ images/day | ~USD 120,000 | Dedicated GPU cluster, self-hosted |
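The figures in the table correspond to a 30-day month at the standard price of USD 0.04 per image. A small helper makes the arithmetic explicit and shows what a cache does to the bill; `cache_hit_rate` is a hypothetical input, not a measured value.

```python
def monthly_cost(images_per_day, price_per_image=0.04, days=30, cache_hit_rate=0.0):
    """Estimated monthly spend in USD; cached requests are served for free."""
    paid_images = images_per_day * days * (1.0 - cache_hit_rate)
    return round(paid_images * price_per_image, 2)


# Reproduce the table rows.
for volume in (100, 1_000, 10_000, 100_000):
    print(f"{volume:>7} images/day -> USD {monthly_cost(volume):,.0f}/month")
```

Under the same assumptions, a 25% cache hit rate at 1,000 images/day brings the estimate from USD 1,200 down to USD 900 per month.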
**Content moderation is mandatory.** Without prompt checking, users can generate prohibited content. OpenAI blocks some requests on its own - but that's not complete protection. Preliminary moderation via the Moderation API plus logging all prompts is the required minimum.
What is the first step before generating an image from a user prompt in production?
Image generation means DALL-E
The market is diverse: Midjourney (artistic quality), SDXL (open weights, LoRA, ControlNet), FLUX.1 (new 2024 SOTA), Ideogram (text rendering). DALL-E 3 is one option, not a monopoly.
DALL-E is the most recognizable brand, but not the best at everything. Midjourney produces artistic results DALL-E 3 can't match. SDXL is the established choice for LoRA fine-tuning and self-hosting at 10,000+ generations per day. FLUX.1 (2024) surpassed SDXL on key quality metrics. Choosing a provider is an engineering decision, not a marketing one.
Image generation API is just prompt → image, like a text API
60-minute URLs, content moderation, S3 pipeline, negative prompts, seed for reproducibility, CFG scale - this is a distinct engineering discipline
A beginner calls the API, gets a URL, saves it to the database. An hour later, images are broken for every user. Then learns about moderation - after the first NSFW incident. The correct architecture: moderation → cache → b64_json → S3 → CDN. That's 50 lines of code separating a toy project from production.
Summary
- The market isn't a monopoly: DALL-E 3 (USD 0.04 per img) - prompt understanding, SD SDXL - LoRA/ControlNet/self-hosting, FLUX.1 - new 2024 SOTA, Midjourney - artistic quality without an API
- DALL-E 3 rewrites the prompt internally - that's a feature, not a bug. URLs live 60 minutes - always use b64_json + S3
- Diffusion process: noise to image in 20-50 steps. CFG scale controls prompt adherence. Negative prompts are a separate parameter in Stable Diffusion; DALL-E 3 has none
- Image editing: inpainting (area replacement by mask), image-to-image (strength 0-1), style transfer, background removal - via DALL-E 2 or Stability AI
- Production pipeline: moderation (first step, before spending money) → cache → generate → S3/CDN → quota. At 10,000+/day - self-hosted SD saves an order of magnitude in cost
What's Next
Images are static content. The next frontier of AI generation is video and audio: dynamic content, async pipelines, and significantly greater resource requirements.
- Video and Audio Generation — From static images to video (Sora, Runway) and music (Suno) - a new level of AI generation
- Multimodal AI — Vision API analyzes images, Image Generation creates them - two directions of working with visual content
- Cost Management — Image generation is one of the most expensive AI operations. Cost optimization strategies
Related Lessons
- aie-05-api-integration — Image APIs use the same integration patterns
- aie-27-video-audio-generation — Image generation is the base for video frames
- aie-25-multimodal — Vision understanding complements image generation
- aie-29-cost-management — Image generation cost grows fast at scale
- ml-33-gan — Earlier generative approach to synthesizing images