Big Data
Spark MLlib and Distributed ML Training
Spotify updates its recommendation model for 600 million users every few hours. One day of listening data: 150 terabytes. Without Spark, this would take weeks instead of hours. Feature engineering, training, evaluation - the full ML pipeline runs on the cluster automatically.
- Uber: surge pricing via Spark MLlib on 20 billion trip records, retrained every 4 hours
- LinkedIn: Job Recommendations via Gradient Boosted Trees on 900 million profiles
- Airbnb: fraud detection on Spark, detecting fraudulent listings in near real time
- Netflix: ALS matrix factorization on 300 million user-item interactions
Distributed Training: Why One Machine Is Not Enough
2015. Netflix launches a recommendation system for 50 million users. One machine: 48 hours of training. A 200-node Spark cluster: 45 minutes. A 64x difference. When the model needs to update hourly, the difference is between 'works' and 'impossible'.
**Spark MLlib** is a machine learning library on top of Spark, operating on distributed data. Data is stored as a DataFrame, algorithms work on partitions in parallel. Key algorithms: linear models (LR, SVM), trees (Random Forest, GBT), clustering (K-Means), recommendations (ALS).
**Parameter Server vs All-Reduce.** Spark MLlib uses the parameter server pattern: one worker holds the weights, others send gradients. PyTorch Distributed (for DNNs) uses all-reduce through NCCL: every worker holds a copy of the weights, gradients are averaged via a ring. All-reduce is faster for DNNs; parameter server suits sparse models (recommendations).
Why does Spark MLlib scale to terabytes of training data?
Feature Engineering on Spark: Scale Changes the Rules
On a single machine one writes `df['feature'] = np.log1p(df['raw'])`. At 10 billion rows, that line becomes a distributed transformation across 200 nodes. But that is not the main problem. The main problem is **data leakage**: normalization statistics cannot be computed on the full dataset if part of it is the test set. Spark Pipeline solves both problems.
**Imbalanced datasets on Spark.** When one class is 1% of the data, the model learns to always predict 0. Solution: oversampling via `sampleBy` or undersampling the majority class. For Random Forest - `classWeightCol`. SMOTE (synthetic samples) on Spark - via custom UDF or the `imbalanced-learn` library with SparkContext.
Why must data be split into train/test before applying StandardScaler?
ML Pipeline: From Features to Production
A **Spark ML Pipeline** is not just a sequence of steps. It is a reproducible artifact: a trained pipeline can be saved, loaded, and applied to new data with the same normalization parameters. Uber trained ML models for surge pricing through Spark Pipelines - artifacts were stored in S3, versioned like code.
What does parallelism=4 mean in CrossValidator?
Model Evaluation: Metrics, Monitoring and Model Drift
**Metrics are not just AUC.** In production, what matters is: probability calibration (Platt scaling), per-subgroup metrics (fairness), and drift detection. AUC 0.9 in 2023 on 2024 data may become 0.7 if the distribution has shifted. This is **model drift**.
**Model Registry** is the next level beyond a trained artifact. MLflow (Databricks) or Vertex AI Model Registry stores model versions, metrics, parameters, and artifacts. Promotion workflow: Staging to Production to Archived. A/B testing between versions on Spark via `randomSplit` or feature flags.
High AUC on the test set guarantees good performance in production
A test set from the same time period as training gives an optimistic estimate. In production, the data distribution shifts over time.
Correct evaluation: temporal split (test from the future relative to training) plus metric monitoring after deployment. If train: Jan-Nov and test: Dec - the estimate is more realistic.
What is model drift and how can it be detected?
Key ideas
- Spark MLlib: ML algorithms on distributed DataFrames - data on the cluster, computation is parallel
- Pipeline: chain of Transformer -> Estimator; the artifact can be saved and reproduced
- Data leakage: train/test split before any fitted transformations (scaler, encoder)
- Cross-validation: parallelism=N trains N models simultaneously during grid search
- Model drift: AUC degrades over time due to distribution shift - monitoring is mandatory
Related topics
Distributed ML builds on Spark core and leads naturally to feature stores.
- Spark DataFrame and RDD — MLlib DataFrame API sits on top of Spark SQL computation
- Spark SQL and Transformations — Feature engineering via Spark SQL parallels ML transformations
- Feature Store — The next level - centralized feature management
Вопросы для размышления
- How does Spark MLlib differ from sklearn? When should one be chosen over the other?
- How can online learning (updating a model on a stream) be implemented on top of Spark Streaming?
- What is Population Stability Index (PSI) and why is it better than the KS test for monitoring drift in production?
Связанные уроки
- bd-04 — Spark RDD and DataFrame are the foundation of MLlib API
- bd-05 — Spark SQL and transformations are needed for feature engineering
- bd-06 — Flink streaming enables online feature computation
- bd-16 — Feature Store as the next level of feature management
- arch-15-gpu-architecture — GPU-accelerated Spark ML through RAPIDS cuML
- ml-54-distributed-training