Big Data

Spark MLlib and Distributed ML Training

Spotify updates its recommendation model for 600 million users every few hours. One day of listening data: 150 terabytes. Without Spark, this would take weeks instead of hours. Feature engineering, training, evaluation - the full ML pipeline runs on the cluster automatically.

  • Uber: surge pricing via Spark MLlib on 20 billion trip records, retrained every 4 hours
  • LinkedIn: job recommendations via Gradient Boosted Trees on 900 million profiles
  • Airbnb: fraud detection on Spark, detecting fraudulent listings in near real time
  • Netflix: ALS matrix factorization on 300 million user-item interactions

Distributed Training: Why One Machine Is Not Enough

2015. Netflix launches a recommendation system for 50 million users. One machine: 48 hours of training. A 200-node Spark cluster: 45 minutes. A 64x difference. When the model needs to update hourly, the difference is between 'works' and 'impossible'.
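The 64x figure follows directly from the two training times. The short check below also derives a parallel efficiency, under the assumption (not stated above) that all 200 nodes took part in training.

```python
# Training times from the example above, converted to minutes.
single_machine_min = 48 * 60   # 48 hours on one machine
cluster_min = 45               # 45 minutes on the cluster
nodes = 200                    # assumption: every node participates in training

speedup = single_machine_min / cluster_min   # how many times faster the cluster is
efficiency = speedup / nodes                 # fraction of ideal linear scaling

print(f"speedup: {speedup:.0f}x, efficiency: {efficiency:.0%}")
```

An efficiency well below 100% is normal: shuffles, stragglers and coordination overhead consume part of the cluster's capacity.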

**Spark MLlib** is a machine learning library built on top of Spark that operates on distributed data. Data is stored as a DataFrame, and algorithms work on its partitions in parallel. Key algorithms: linear models (linear and logistic regression, linear SVM), trees (Random Forest, GBT), clustering (K-Means), recommendations (ALS).
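To make "algorithms work on partitions in parallel" concrete, here is a minimal plain-Python sketch of data-parallel gradient descent for a one-feature linear model. It is an illustration of the idea, not Spark's implementation: threads stand in for executors, and the names `partial_gradient` and `train` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(partition, w, b):
    """Sum of squared-error gradients over one partition of (x, y) rows."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)

def train(partitions, lr=0.2, epochs=500):
    w = b = 0.0
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for _ in range(epochs):
            # Every partition sees the current weights and computes its gradient in parallel.
            parts = list(pool.map(lambda p: partial_gradient(p, w, b), partitions))
            # Only three numbers per partition travel back, never the rows themselves.
            gw = sum(p[0] for p in parts)
            gb = sum(p[1] for p in parts)
            n = sum(p[2] for p in parts)
            w -= lr * gw / n
            b -= lr * gb / n
    return w, b

# Toy data following y = 2x + 1, split into 4 partitions.
rows = [(x / 10, 2 * (x / 10) + 1) for x in range(40)]
partitions = [rows[i::4] for i in range(4)]
w, b = train(partitions)
print(round(w, 3), round(b, 3))
```

The point of the design is that the data stays where it is; only small aggregates move, which is why the approach keeps working as the number of rows grows.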

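**Parameter Server vs All-Reduce.** Spark MLlib follows a driver-centric pattern that resembles a parameter server: the driver holds the current weights and broadcasts them, executors compute partial gradients on their partitions, and the results are aggregated back to the driver. PyTorch Distributed (for DNNs) uses all-reduce through NCCL: every worker holds a copy of the weights, and gradients are typically averaged around a ring. All-reduce is usually faster for dense DNNs; a parameter server suits sparse models such as recommenders.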
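The two aggregation patterns can be compared in a few lines of plain Python. This is a simulation under simplifying assumptions, not NCCL or Spark code: workers are list entries, "sending" is a list write, and each gradient has exactly one element (chunk) per worker.

```python
def central_average(grads):
    """Central aggregation: one node receives every gradient and averages them."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def ring_all_reduce_average(grads):
    """Ring all-reduce: each worker only sends to its right-hand neighbour."""
    n = len(grads)
    state = [list(g) for g in grads]   # each worker's local buffer

    # Phase 1, scatter-reduce: after n-1 steps worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing values first so the sends happen "simultaneously".
        sends = [((i + 1) % n, (i - step) % n, state[i][(i - step) % n]) for i in range(n)]
        for dst, chunk, value in sends:
            state[dst][chunk] += value

    # Phase 2, all-gather: completed chunks are passed around the ring.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, state[i][(i + 1 - step) % n]) for i in range(n)]
        for dst, chunk, value in sends:
            state[dst][chunk] = value

    return [[v / n for v in worker] for worker in state]

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
expected = central_average(grads)
per_worker = ring_all_reduce_average(grads)
print(expected, per_worker)
```

Both routes produce the same averaged gradient. The difference is in traffic: the central node receives all n gradients, while in the ring each worker exchanges only one chunk per step with a single neighbour.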

Why does Spark MLlib scale to terabytes of training data?

Feature Engineering on Spark: Scale Changes the Rules

On a single machine one writes `df['feature'] = np.log1p(df['raw'])`. At 10 billion rows, that line becomes a distributed transformation across 200 nodes. But that is not the main problem. The main problem is **data leakage**: normalization statistics must not be computed on the full dataset if part of it is the test set. A Spark Pipeline addresses both: its stages run as distributed transformations, and fitting the pipeline on the training split alone keeps test-set statistics out of the learned parameters.
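The leakage rule can be shown without a cluster. The sketch below uses only the standard library and synthetic data (the distribution and sizes are arbitrary assumptions): a stateless transform such as `log1p` is safe before the split, while the mean and standard deviation are learned from the training split only. In Spark the same discipline means calling `fit` on the training DataFrame and `transform` on both splits.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)
raw = [random.expovariate(1 / 50) for _ in range(1000)]   # skewed synthetic feature
feature = [math.log1p(v) for v in raw]                     # stateless: safe before the split

# Split BEFORE fitting anything that learns statistics from the data.
random.shuffle(feature)
train, test = feature[:800], feature[800:]

# "Fit" the scaler on the training split only.
mu, sigma = mean(train), pstdev(train)

def scale(values):
    """Apply the train-derived parameters; never refit on the test split."""
    return [(v - mu) / sigma for v in values]

train_scaled, test_scaled = scale(train), scale(test)
print(round(mean(train_scaled), 6), round(mean(test_scaled), 3))
```

The scaled test mean is close to zero but not exactly zero, which is the honest outcome: the test split is treated as unseen data.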

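**Imbalanced datasets on Spark.** When one class is 1% of the data, the model can reach 99% accuracy by always predicting the majority class. Solutions: undersample the majority class with `sampleBy`, or oversample the minority class with `sample(withReplacement=True, fraction=...)`. For Random Forest, per-row weights can be passed through `weightCol` (Spark 3.0 and later). SMOTE (synthetic samples) has no built-in Spark implementation; it needs a custom UDF, or the single-machine `imbalanced-learn` library applied to data small enough to collect from the cluster.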
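Two of these remedies reduce to simple arithmetic, sketched here in plain Python on a synthetic 1% positive label column. The "balanced" weighting formula and the 1:1 target ratio are illustrative assumptions; the `keep_fraction` dictionary has the same shape as the per-class `fractions` argument that `sampleBy` expects.

```python
import random
from collections import Counter

random.seed(42)
labels = [1] * 10 + [0] * 990          # 1% positive class

# Option 1: per-row weights, inversely proportional to class frequency.
counts = Counter(labels)
n, k = len(labels), len(counts)
class_weight = {c: n / (k * cnt) for c, cnt in counts.items()}
row_weights = [class_weight[y] for y in labels]     # values a weight column would hold

# Option 2: undersample the majority class, keeping every minority row.
keep_fraction = {1: 1.0, 0: counts[1] / counts[0]}  # assumed target: roughly 1:1
sampled = [y for y in labels if random.random() < keep_fraction[y]]

print(class_weight, Counter(sampled))
```

With these weights both classes contribute the same total weight to the loss, so no rows are discarded; undersampling is cheaper to train on but throws away most of the majority class.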

Why must data be split into train/test before applying StandardScaler?

ML Pipeline: From Features to Production

A **Spark ML Pipeline** is not just a sequence of steps. It is a reproducible artifact: a trained pipeline can be saved, loaded, and applied to new data with the same normalization parameters. Uber trained ML models for surge pricing through Spark Pipelines - artifacts were stored in S3, versioned like code.
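The idea of a fitted artifact can be illustrated with a plain-Python analogy: learn parameters once, persist them, and apply the loaded copy to new data. This is not the Spark API (there the corresponding calls are `write().save(path)` on the fitted model and `PipelineModel.load(path)`); the JSON layout, the `version` field and the temporary directory standing in for S3 are assumptions of the sketch.

```python
import json
import os
import tempfile
from statistics import mean, pstdev

def fit(train_values):
    """Learn normalization parameters from training data only."""
    return {"version": 1, "mean": mean(train_values), "std": pstdev(train_values)}

def transform(params, values):
    """Apply previously learned parameters to any data."""
    return [(v - params["mean"]) / params["std"] for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
fitted = fit(train)

# Persist the fitted parameters; a local temp dir stands in for object storage.
path = os.path.join(tempfile.mkdtemp(), "scaler_v1.json")
with open(path, "w") as f:
    json.dump(fitted, f)

# Later, possibly in another process: load and apply to new data.
with open(path) as f:
    loaded = json.load(f)

new_data = [14.0, 20.0]
assert transform(loaded, new_data) == transform(fitted, new_data)
```

Because the loaded artifact carries the training-time mean and standard deviation, new data is normalized exactly as the training data was, which is what makes the result reproducible.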

What does parallelism=4 mean in CrossValidator?
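One way to picture the `parallelism` setting is a thread pool with that many workers evaluating candidate hyperparameters. The sketch below is an analogy in plain Python, not CrossValidator itself: `fit_and_score` is a hypothetical stand-in with a toy score, whereas in Spark each fit is a distributed job of its own.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def fit_and_score(params):
    """Hypothetical stand-in for fitting one model and scoring it on a validation fold."""
    reg, depth = params
    return params, 1.0 - abs(reg - 0.1) - 0.01 * abs(depth - 5)   # toy score, peak at (0.1, 5)

grid = list(product([0.01, 0.1, 1.0], [3, 5, 7]))   # 9 candidate settings

# parallelism=4: at most four candidates are being fitted at any moment.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_and_score, grid))

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

Raising the value shortens the grid search only while the cluster has spare capacity; beyond that point the concurrent fits compete for the same executors.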

Model Evaluation: Metrics, Monitoring and Model Drift

**Metrics are not just AUC.** In production, what matters is probability calibration (for example, Platt scaling), per-subgroup metrics (fairness), and drift detection. A model that scored AUC 0.9 on 2023 data may drop to 0.7 on 2024 data if the distribution has shifted. This is **model drift**.
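Drift detection can be sketched with the Population Stability Index, which compares the binned distribution of a feature or score between a reference period and a newer one. This is a minimal version under stated assumptions: equal-width bins derived from the reference sample, a small floor to avoid `log(0)`, and synthetic Gaussian data in place of real model scores.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(1)
scores_2023 = [random.gauss(0.0, 1.0) for _ in range(5000)]
scores_2024_same = [random.gauss(0.0, 1.0) for _ in range(5000)]
scores_2024_shifted = [random.gauss(0.8, 1.0) for _ in range(5000)]

psi_stable = psi(scores_2023, scores_2024_same)
psi_drift = psi(scores_2023, scores_2024_shifted)
print(round(psi_stable, 4), round(psi_drift, 4))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant shift; the thresholds are conventions and should be tuned per feature.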

**Model Registry** is the next level beyond a trained artifact. MLflow (Databricks) or Vertex AI Model Registry stores model versions, metrics, parameters, and artifacts. Promotion workflow: Staging to Production to Archived. A/B testing between versions on Spark via `randomSplit` or feature flags.
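For the feature-flag style of A/B testing, a common approach is deterministic hash bucketing, sketched below. It is a different technique from `randomSplit`: instead of splitting a dataset once, it assigns each user to a model version on every request. The function name, salt and 10% share are hypothetical.

```python
import hashlib

def assign_variant(user_id, treatment_share=0.1, salt="model-v2-rollout"):
    """Deterministically route a user to the 'staging' or 'production' model version."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # uniform value in [0, 1)
    return "staging" if bucket < treatment_share else "production"

assignments = [assign_variant(uid) for uid in range(10_000)]
share = assignments.count("staging") / len(assignments)
print(f"staging share: {share:.3f}")
```

The same user always lands in the same group, so metrics per version stay comparable across sessions, and changing the salt reshuffles users for the next experiment.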

Misconception: high AUC on the test set guarantees good performance in production.

A test set from the same time period as training gives an optimistic estimate. In production, the data distribution shifts over time.

Correct evaluation: a temporal split (test data from the future relative to training) plus metric monitoring after deployment. For example, training on January-November and testing on December gives a more realistic estimate.
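A temporal split is a filter on the event date rather than a random shuffle. The sketch below uses a hypothetical event log with synthetic labels; the cutoff of 1 December mirrors the Jan-Nov / Dec example.

```python
from datetime import date

# Hypothetical event log: (event_date, label), 28 rows per month of 2023.
events = [(date(2023, m, d), (m + d) % 2) for m in range(1, 13) for d in range(1, 29)]

cutoff = date(2023, 12, 1)
train = [e for e in events if e[0] < cutoff]    # January-November
test = [e for e in events if e[0] >= cutoff]    # December: strictly after all training data

# No training row may come from the test period or later.
assert max(d for d, _ in train) < min(d for d, _ in test)
print(len(train), len(test))
```

On a Spark DataFrame the same split would be two `filter` calls on the date column instead of `randomSplit`.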

What is model drift and how can it be detected?

Key ideas

  • Spark MLlib: ML algorithms on distributed DataFrames - data on the cluster, computation is parallel
  • Pipeline: chain of Transformer and Estimator stages; the fitted artifact can be saved and reloaded
  • Data leakage: train/test split before any fitted transformations (scaler, encoder)
  • Cross-validation: parallelism=N trains up to N models simultaneously during grid search
  • Model drift: AUC degrades over time due to distribution shift - monitoring is mandatory

Related topics

Distributed ML builds on Spark core and leads naturally to feature stores.

  • Spark DataFrame and RDD — MLlib DataFrame API sits on top of Spark SQL computation
  • Spark SQL and Transformations — Feature engineering via Spark SQL parallels ML transformations
  • Feature Store — The next level - centralized feature management

Questions for reflection

  • How does Spark MLlib differ from sklearn? When should one be chosen over the other?
  • How can online learning (updating a model on a stream) be implemented on top of Spark Streaming?
  • What is Population Stability Index (PSI) and why is it better than the KS test for monitoring drift in production?

Related lessons

  • bd-04 — Spark RDD and DataFrame are the foundation of MLlib API
  • bd-05 — Spark SQL and transformations are needed for feature engineering
  • bd-06 — Flink streaming enables online feature computation
  • bd-16 — Feature Store as the next level of feature management
  • arch-15-gpu-architecture — GPU-accelerated Spark ML through RAPIDS cuML
  • ml-54-distributed-training
