Machine Learning

MLOps Pipeline

An often-cited industry estimate holds that 87% of ML models never reach production. Whatever the exact figure, the gap between a working Jupyter notebook and a reliable production system is enormous - and MLOps narrows that gap with tools for versioning, tracking, and automation.

  • **Uber Michelangelo** - the internal MLOps platform through which all company ML models pass: from predicting arrival times to fraud detection. Without a unified tracking and deployment system, every team was reinventing the wheel
  • **Spotify** uses Kubeflow to orchestrate recommendation pipelines: hundreds of models train in parallel on data from millions of users, and experiment tracking allows rolling back to a previous model version in minutes
  • **Tesla Autopilot** - every model iteration is versioned alongside data (millions of camera frames). Without a DVC-like tool, it would be impossible to track exactly which data the model currently driving in real cars was trained on

Prerequisites

  • Cross-Validation and Overfitting Prevention

When ML met technical debt

For years teams treated a trained model as the finish line, then watched it rot in production for reasons nobody had named. In 2015 a Google team led by D. Sculley published 'Hidden Technical Debt in Machine Learning Systems' at NeurIPS, and it landed hard. Their point was blunt: the model is a tiny box in a much larger diagram. Around it sit data collection, feature extraction, configuration, serving, and monitoring, and that surrounding plumbing is where the real cost accumulates. They catalogued failure modes that practitioners knew but could not articulate: entanglement (changing anything changes everything), hidden feedback loops, undeclared data dependencies, and 'glue code' holding the system together. The paper gave the industry a shared vocabulary for why ML in production is hard. Around 2018 the term MLOps took shape, borrowing the discipline of DevOps and adapting it to systems where data, not just code, drives behavior. The CD4ML practice (Continuous Delivery for Machine Learning) extended CI/CD to cover data and model versioning, automated retraining, and validation, turning Sculley's warnings into engineering process.

MLflow

MLflow is an open-source platform from Databricks for managing the full lifecycle of ML models. It solves three key problems: **experiment tracking** - logging parameters, metrics, and artifacts for each run; **model registry** - versioning and model stages (Staging, Production, Archived); **model deployment** - a unified packaging format for deployment on any platform. MLflow is not tied to any specific ML library: it works with scikit-learn, PyTorch, TensorFlow, XGBoost, and any other framework.

The core of MLflow is the Tracking API. Every training run is recorded with full context: which hyperparameters were used (`log_param`), which metrics were obtained (`log_metric`), which files were created (`log_artifact`). Runs are grouped into experiments. All data is accessible through the `mlflow ui` web interface, where you can compare runs, filter by metrics, and build charts.
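The logging pattern described above can be sketched roughly as follows. To stay runnable without an MLflow installation, the sketch uses `InMemoryTracker`, a hypothetical stand-in that mimics only the `start_run`, `log_param`, and `log_metric` names; `log_training_run` and `best_run` are likewise illustrative helpers, and the parameter and metric values are made up.

```python
from contextlib import contextmanager


class InMemoryTracker:
    """Hypothetical stand-in that mimics three MLflow tracking names."""

    def __init__(self):
        self.runs = []  # one dict per recorded run

    @contextmanager
    def start_run(self):
        run = {"params": {}, "metrics": {}}
        self.runs.append(run)
        yield run

    def log_param(self, key, value):
        self.runs[-1]["params"][key] = value

    def log_metric(self, key, value):
        self.runs[-1]["metrics"][key] = value


def log_training_run(tracker, params, metrics):
    """Record one training run: hyperparameters first, then the metrics."""
    with tracker.start_run():
        for key, value in params.items():
            tracker.log_param(key, value)
        for key, value in metrics.items():
            tracker.log_metric(key, value)


def best_run(tracker, metric):
    """Return the stand-in's run with the highest value of `metric`."""
    return max(tracker.runs, key=lambda run: run["metrics"][metric])


tracker = InMemoryTracker()
# Illustrative values only, not real training results.
log_training_run(tracker, {"n_estimators": 100, "max_depth": 5}, {"accuracy": 0.93})
log_training_run(tracker, {"n_estimators": 300, "max_depth": 8}, {"accuracy": 0.95})
winner = best_run(tracker, "accuracy")
```

With MLflow installed, passing the `mlflow` module itself to `log_training_run` should behave the same way, since it exposes functions with these names. `best_run` only works with the stand-in; in MLflow, comparing runs is done through the UI or its search API.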

**MLflow Model Registry - managing the model lifecycle:**

After training, a model is registered in the registry with a version:

```
import mlflow

# "<run_id>" is a placeholder for the ID of the run that logged the model
mlflow.register_model(
    "runs:/<run_id>/random-forest-model",
    "IrisClassifier"
)
```

Each registered model version moves through stages:

  • **None** - just registered
  • **Staging** - undergoing testing
  • **Production** - serving real traffic
  • **Archived** - retired from production

This allows rolling back to a previous version with a single command if a new model degrades in production. Newer MLflow releases (2.9 and later) deprecate stages in favor of model version aliases, but the lifecycle idea is the same.
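To make the stage lifecycle and the rollback concrete, here is a toy in-memory registry. It is a sketch of the idea only, not the MLflow API: `ToyRegistry` and its methods are hypothetical, and automatically archiving the previous Production version is an assumption of this sketch.

```python
class ToyRegistry:
    """Hypothetical in-memory registry; not the MLflow API."""

    STAGES = ("None", "Staging", "Production", "Archived")

    def __init__(self):
        self.versions = {}  # version number -> current stage

    def register(self):
        """Add a new version in the initial 'None' stage."""
        version = len(self.versions) + 1
        self.versions[version] = "None"
        return version

    def transition(self, version, stage):
        """Move a version to a stage, keeping at most one in Production."""
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            # Assumption of this sketch: the previous Production version
            # is archived automatically.
            for other, current in self.versions.items():
                if current == "Production":
                    self.versions[other] = "Archived"
        self.versions[version] = stage

    def production_version(self):
        """Return the version currently in Production, or None."""
        for version, stage in self.versions.items():
            if stage == "Production":
                return version
        return None


registry = ToyRegistry()
v1 = registry.register()
registry.transition(v1, "Staging")
registry.transition(v1, "Production")

v2 = registry.register()
registry.transition(v2, "Staging")
registry.transition(v2, "Production")  # v1 is archived
serving_after_release = registry.production_version()

registry.transition(v1, "Production")  # rollback: v1 serves again
serving_after_rollback = registry.production_version()
```

The rollback is a single transition because every earlier version stays in the registry; nothing has to be retrained or rebuilt.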

MLflow Models provides a unified model packaging format - **MLmodel**. Regardless of whether a model was trained in scikit-learn, PyTorch, or TensorFlow, MLflow wraps it in a standard interface with a `predict()` method. This lets you deploy the model as a REST API (`mlflow models serve`), Docker container, or to cloud services (AWS SageMaker, Azure ML, Databricks) without changing code.

Which MLflow function is responsible for saving training hyperparameters for later experiment comparison?

DVC - Data Version Control

Git handles code versioning excellently but struggles with large data files. Committing a 10 GB CSV or a folder with millions of images turns a repository into an unmanageable monster. **DVC (Data Version Control)** solves this: it works *on top of* Git, adding versioning for data and models. Data is stored in remote storage (S3, GCS, Azure Blob, SSH), while only lightweight `.dvc` files with hashes - pointers to specific data versions - are committed to Git.

The DVC workflow starts with `dvc init` in a Git repository. Then `dvc add data.csv` creates a `data.csv.dvc` file - a metafile containing the md5 hash, size, and path to the data. The actual `data.csv` is added to `.gitignore`, while the `.dvc` file is committed to Git. With `dvc push`, data is uploaded to the configured remote storage. A colleague clones the repository, runs `dvc pull`, and gets the exact same data.
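The pointer idea behind a `.dvc` metafile can be illustrated with the standard library alone. This is a sketch under assumptions: `make_pointer` is a hypothetical helper, and a real `.dvc` file is a YAML document written by DVC itself and may carry more fields than the three shown here.

```python
import hashlib
import os
import tempfile


def make_pointer(path, chunk_size=1 << 20):
    """Build a small pointer record (md5, size, path) for a data file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "data.csv")
    with open(data_path, "w", newline="") as handle:
        handle.write("id,label\n1,0\n2,1\n")
    pointer_v1 = make_pointer(data_path)

    # Appending one row produces a different hash, i.e. a new data version.
    with open(data_path, "a", newline="") as handle:
        handle.write("3,1\n")
    pointer_v2 = make_pointer(data_path)
```

The pointer stays a few dozen bytes no matter how large the data file grows, which is why it is cheap to commit to Git.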

**DVC Pipelines - automating the ML process:**

DVC doesn't just version data - it can also describe ML pipelines. The `dvc.yaml` file defines stages, their dependencies, and outputs:

```
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

The command `dvc repro` runs only the stages whose dependencies have changed. If only `train.py` changed, preprocess is not re-run.
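The skip logic behind `dvc repro` can be approximated in a few lines. This is a simplified model, not DVC's implementation: files are held in a dictionary instead of on disk, the "command" of each stage is replaced by writing a value derived from its inputs, and `fingerprint` and `repro` are hypothetical names. DVC itself records comparable hashes in a `dvc.lock` file.

```python
import hashlib


def fingerprint(deps):
    """Hash names and contents of all dependencies in a stable order."""
    digest = hashlib.md5()
    for name in sorted(deps):
        digest.update(name.encode())
        digest.update(deps[name])
    return digest.hexdigest()


def repro(stages, files, lock):
    """Run stages in order, skipping those whose dependencies are unchanged.

    stages: list of (stage name, dependency file names, output file name)
    files:  dict mapping file name -> bytes content
    lock:   dict mapping stage name -> fingerprint from the previous run
    """
    executed = []
    for name, deps, out in stages:
        current = fingerprint({dep: files[dep] for dep in deps})
        if lock.get(name) == current:
            continue  # nothing this stage depends on has changed
        # Stand-in for running the stage command: output derives from inputs.
        files[out] = current.encode()
        lock[name] = current
        executed.append(name)
    return executed


stages = [
    ("preprocess", ["src/preprocess.py", "data/raw.csv"], "data/processed.csv"),
    ("train", ["src/train.py", "data/processed.csv"], "models/model.pkl"),
]
files = {
    "src/preprocess.py": b"v1",
    "data/raw.csv": b"rows",
    "src/train.py": b"v1",
}
lock = {}

first = repro(stages, files, lock)    # both stages run
second = repro(stages, files, lock)   # nothing changed, nothing runs
files["src/train.py"] = b"v2"
third = repro(stages, files, lock)    # only train re-runs
files["data/raw.csv"] = b"more rows"
fourth = repro(stages, files, lock)   # preprocess changes its output, so train follows
```

Because the output of one stage is a dependency of the next, a change upstream propagates downstream automatically, while a change that only touches a later stage leaves the earlier ones alone.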

The main advantage of DVC is **reproducibility through Git history**. Every Git commit captures a specific version of code, parameters, and data. Six months later you can check out any commit, run `dvc checkout` and `dvc repro`, and get the same results, provided that random seeds and the software environment are pinned as well. Without DVC, a typical scenario is: "the model worked a month ago, but we don't remember which data or parameters were used".

What exactly is stored in a Git repository when using DVC to version a 10 GB dataset?

Kubeflow and Orchestration

When an ML project grows from a single `train.py` script into a system of ten interconnected stages (data collection, cleaning, feature engineering, training multiple models, evaluation, A/B testing, deployment), the need for **orchestration** arises. An orchestrator manages execution order, retries failed steps, scales to a cluster, and tracks status. **Kubeflow** is one of the most widely used ML orchestrators, built on top of Kubernetes.

The Kubeflow Pipelines SDK lets you describe ML pipelines in Python. Each step is defined as a **component** - a function with explicit inputs and outputs. Each component runs in its own container on Kubernetes, and data passes between components as artifacts. From the declared inputs and outputs the SDK builds a DAG (directed acyclic graph) of dependencies, so steps run in dependency order and independent steps can run in parallel.
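How an orchestrator derives an execution order from a DAG can be shown with Python's standard `graphlib` module (Python 3.9+). This sketch does not use the Kubeflow SDK; the step names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "collect": set(),
    "clean": {"collect"},
    "features": {"clean"},
    "train_rf": {"features"},
    "train_gb": {"features"},
    "evaluate": {"train_rf", "train_gb"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(pipeline)
```

Here the two training steps land in the same wave, which is exactly the situation where an orchestrator can schedule them in parallel on separate workers.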

**When MLflow is enough vs when you need an orchestrator:**

**MLflow is sufficient when:**

  • One person or a small team
  • Linear pipeline: data -> training -> evaluation
  • Manual or cron-triggered runs
  • No requirements for cluster scaling

**An orchestrator is needed when:**

  • Complex step dependencies (DAG)
  • Automatic error handling and retries are required
  • Parallel training of multiple models
  • Scaling to a GPU cluster
  • Schedules, triggers, conditional logic
  • Audit and compliance requirements

MLflow and Kubeflow are not competitors - they complement each other. Kubeflow orchestrates steps, while MLflow logs parameters and metrics inside each step.

Orchestrator choice depends on context. If the company already has Kubernetes and a strong infrastructure team - Kubeflow. If the ML team is small and needs a quick start - Prefect or Dagster. If the main goal is experiment reproducibility rather than scaling - DVC Pipelines. The key is not to choose a tool more complex than the task requires: for a single model that retrains weekly, Kubeflow is overkill.

Regardless of the orchestrator choice, the key principles are the same: **each step is isolated** (its own dependencies, its own container), **data between steps is passed explicitly** (through artifacts, not global variables), **the pipeline is described as code** (versioned in Git), **steps are idempotent** (re-running produces the same result).

In which situation is using Kubeflow Pipelines justified rather than excessive?

Experiment Tracking and Reproducibility

ML has a **reproducibility crisis**: research shows that a significant portion of published results cannot be reproduced by other researchers. The cause is not malice, but the complexity of ML experiments. Results depend not just on code and data, but on **dozens of hidden factors**: library versions, data ordering, random seed, GPU type, CUDA driver version. Experiment tracking records all these factors, making every experiment reproducible.
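One of these hidden factors, the random seed behind data ordering, is easy to demonstrate. In this minimal sketch, `split_indices` is a hypothetical helper that shuffles with its own seeded generator instead of the global random state.

```python
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle indices with a dedicated, seeded generator and split them."""
    rng = random.Random(seed)  # local generator: no hidden global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(100, 0.2, seed=42)
train_b, test_b = split_indices(100, 0.2, seed=42)  # same seed, same split
train_c, test_c = split_indices(100, 0.2, seed=7)   # different seed, different split
```

Two runs with the same recorded seed evaluate on identical test rows, while an unrecorded seed silently changes which rows the model is scored on, and with them the reported metric.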

**Weights & Biases (wandb)** is one of the most popular experiment tracking tools. Unlike MLflow, which teams typically host themselves, wandb is offered primarily as a hosted cloud platform with rich visualization: real-time training charts, experiment comparison, automatic hardware metric logging (GPU utilization, RAM), and integrations with PyTorch, TensorFlow, and Hugging Face. Its free tier is usually enough for an individual researcher.

In practice, experiment tracking starts small: logging hyperparameters and metrics via MLflow or wandb. Over time, additions include: data versioning (DVC), environment logging (pip freeze, Docker image), artifact saving (models, charts). It's important to start **before the project becomes complex** - retroactively reconstructing experiment context is nearly impossible.
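Environment logging does not need a special tool to get started. The sketch below collects a small snapshot with the standard library; `environment_snapshot` is a hypothetical helper, and the package names passed to it are only examples that may or may not be installed.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages, seed):
    """Collect run context that is easy to forget and hard to reconstruct."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }


snapshot = environment_snapshot(["numpy", "scikit-learn"], seed=42)
# Serialized form, suitable for saving next to the model as a run artifact.
snapshot_json = json.dumps(snapshot, indent=2, sort_keys=True)
```

Saving this JSON with every run, for example as a logged artifact, gives a later reader the interpreter version, platform, seed, and key package versions without relying on memory.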

**Minimal MLOps for a solo Data Scientist:**

1. **Git** - code versioning (required from day one)
2. **DVC** - data and model versioning (add with the first dataset > 100 MB)
3. **MLflow/wandb** - experiment logging (add by the second experiment)
4. **Docker** - environment locking (add before deployment)
5. **CI/CD** - test and deployment automation (add with regular retraining)

Each next level is added as project complexity grows. No need to set up Kubeflow for a model that trains in 5 minutes on a laptop.

**Myth:** MLOps is only for large companies with dozens of ML engineers and massive infrastructure.

**Reality:** Even a solo Data Scientist benefits from data versioning and experiment tracking - it saves hours of searching for 'which parameters gave the best result' and lets you reproduce any experiment.

MLOps is a spectrum from simple (git + mlflow.log_param) to complex (Kubeflow on a cluster). A solo researcher doesn't build a Kubernetes cluster, but they can set up MLflow or wandb in 5 minutes and never again face the problem of 'which hyperparameters did I use last Tuesday'. Every hour spent searching for old results in Jupyter notebooks is an hour MLOps could have saved.

A researcher trained a model 3 months ago. Now they want to reproduce the result but get different metrics. What is the most likely cause?

Key Takeaways

  • **MLflow** - a platform for experiment tracking (log_param, log_metric, log_artifact), model registry (Staging/Production/Archived), and deployment: framework-agnostic and launched with a single command
  • **DVC** - Git for data: .dvc files with hashes are stored in Git, while actual data lives in remote storage such as S3 or GCS. `dvc repro` runs only changed pipeline stages, and `git checkout` followed by `dvc checkout` returns to any version of code + data
  • **Kubeflow and orchestration** - when a pipeline grows from a single script into a DAG of dozens of steps, an orchestrator manages order, retries, and scaling. Kubeflow for large teams, Prefect/Dagster for small ones, DVC Pipelines for individuals
  • **Experiment tracking** - the reproducibility crisis is addressed by capturing the full context: code, data, parameters, environment, hardware. The absence of systematic experiment tracking is one of the reasons so many models never reach production

Related Topics

MLOps Pipeline connects ML tools with engineering practices - from model training to its life in production:

  • Model Serving — The next step after MLOps - how to deploy a trained model to serve requests: REST API, gRPC, batch inference, and choosing between real-time and offline predictions
  • Monitoring — After deploying a model, its production behavior needs to be tracked: data drift, concept drift, metric degradation. MLOps provides the infrastructure for monitoring and automatic retraining
  • Cross-Validation — MLflow and wandb log cross-validation results as experiment metrics. DVC captures specific data splits, ensuring reproducibility of model evaluation
  • Hyperparameter Tuning — Experiment tracking is indispensable during hyperparameter tuning: every grid search or Bayesian optimization run is automatically logged, and the best configuration is found with a single query to MLflow or wandb

Questions for Reflection

  • Suppose you're working on an ML project alone. Which MLOps elements would you implement in the first week, and which would you postpone? Why?
  • A company wants to move from manual model deployment to an automated MLOps pipeline. Which tool would you start with: MLflow, DVC, or Kubeflow? What factors drive that choice?
  • How are data versioning (DVC) and experiment reproducibility related? Is versioning code in Git alone sufficient for full ML model reproducibility?

Related Lessons

  • ml-44-cross-validation — Validation is a pipeline stage
  • ml-46-model-serving — Pipelines deploy models to serving
  • ml-47-model-monitoring — Pipelines feed monitoring after deploy
  • ml-43-hyperparameters — Automated tuning runs in the pipeline
  • sd-09-message-queue — Queues orchestrate pipeline stages
  • devops-09