Building Scalable ML Pipelines in Production
A practical guide to designing machine learning pipelines that handle millions of predictions daily. Covers feature stores, model registries, deployment strategies, and MLOps best practices.
Mostafa
Fractional CTO & Software Architect
Introduction
Building ML models is the easy part. Getting them to run reliably in production, handling millions of predictions daily, is where the real challenge begins.
In this guide, I’ll share practical lessons from deploying ML systems at scale across fintech and edtech applications.
System Architecture
Let’s start with the high-level view of a production ML pipeline:
```mermaid
flowchart LR
    A[Data Sources] --> B[Ingestion Layer]
    B --> C[Feature Store]
    C --> D[Training Pipeline]
    D --> E[Model Registry]
    E --> F[Serving Layer]
    F --> G[API Gateway]
    G --> H[Applications]

    subgraph Monitoring
        I[Metrics]
        J[Drift Detection]
        K[Alerting]
    end

    F --> I
    F --> J
    J --> K
```
Data Model
Here’s the core entity relationship for tracking ML artifacts:
```mermaid
erDiagram
    MODEL_VERSION ||--o{ PREDICTION : generates
    MODEL_VERSION {
        uuid id PK
        string model_name
        string version
        float accuracy
        float f1_score
        timestamp trained_at
        string artifact_path
    }
    PREDICTION {
        uuid id PK
        uuid model_version_id FK
        jsonb input_features
        jsonb output
        float latency_ms
        timestamp created_at
    }
    FEATURE_SET ||--|{ MODEL_VERSION : trains
    FEATURE_SET {
        uuid id PK
        string name
        jsonb schema
        timestamp created_at
    }
```
The Math Behind Model Selection
When choosing between models, we optimize for expected utility — maximizing the probability-weighted outcome across all predictions.
For classification with imbalanced classes, we use weighted cross-entropy loss, where class weights are inversely proportional to class frequency. This prevents the model from ignoring rare classes.
Why Class Weights Matter: in fraud detection with a 0.1% positive rate, a model trained without class weights learns to predict “not fraud” for everything, scoring 99.9% accuracy while being useless.
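As a concrete sketch, inverse-frequency weights can be computed with the common “balanced” heuristic, `n_samples / (n_classes * count_c)` (the helper name is illustrative):

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray) -> dict:
    """Class weights inversely proportional to class frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    # The "balanced" heuristic: n_samples / (n_classes * count_c)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 0.1% fraud rate: 999 legitimate transactions, 1 fraudulent
labels = np.array([0] * 999 + [1])
weights = inverse_frequency_weights(labels)
# The rare class ends up weighted ~1000x more heavily than the common one
```

These weights can be passed as a `class_weight` dict to scikit-learn estimators, or used to scale the per-class loss in a custom training loop.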
Feature Store Implementation
A feature store provides consistent feature computation between training and serving:
```python
from typing import Any, Dict

import redis

class FeatureStore:
    """Simple Redis-backed feature store."""

    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def get_features(self, entity_id: str) -> Dict[str, Any]:
        """Retrieve features for an entity."""
        key = f"features:{entity_id}"
        data = self.redis.hgetall(key)
        if not data:
            # Compute features on-demand and cache them
            features = self._compute_features(entity_id)
            self._store_features(entity_id, features)
            return features
        # Redis returns bytes for both keys and values
        return {k.decode(): float(v) for k, v in data.items()}

    def _compute_features(self, entity_id: str) -> Dict[str, float]:
        """Compute features from raw data."""
        # Your feature engineering logic here
        return {
            "transaction_count_7d": 42,
            "avg_transaction_amount": 150.50,
            "days_since_registration": 365,
        }

    def _store_features(self, entity_id: str, features: Dict[str, float]) -> None:
        """Cache computed features."""
        key = f"features:{entity_id}"
        self.redis.hset(key, mapping=features)
        self.redis.expire(key, self.ttl)
```
Pro Tip: Start with Redis for low-latency serving. Migrate to Feast or Tecton when you need point-in-time correctness for training data.
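To make “point-in-time correctness” concrete: when assembling a training set, each label must be joined with the feature values as they existed at label time, never later. A minimal sketch of that lookup (plain Python; the helper name is hypothetical):

```python
from bisect import bisect_right
from typing import List, Tuple

def point_in_time_value(
    history: List[Tuple[float, float]],  # (timestamp, value), sorted by timestamp
    as_of: float,
) -> float:
    """Return the latest feature value recorded at or before `as_of`.

    Using any later value would leak future information into training.
    """
    idx = bisect_right([ts for ts, _ in history], as_of)
    if idx == 0:
        raise ValueError("no feature value available before this timestamp")
    return history[idx - 1][1]

history = [(1.0, 10.0), (5.0, 20.0), (9.0, 30.0)]
point_in_time_value(history, 6.0)  # latest value at t<=6.0 is 20.0
```

Feast-style stores do exactly this join across millions of rows; a Redis cache, which only keeps the latest value, cannot.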
Model Serving Architecture
For high-throughput serving, we use a multi-tier approach:
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Cache
    participant ModelServer
    participant FeatureStore

    Client->>Gateway: POST /predict
    Gateway->>Cache: Check cache
    alt Cache hit
        Cache-->>Gateway: Cached prediction
        Gateway-->>Client: 200 OK (cached)
    else Cache miss
        Gateway->>FeatureStore: Get features
        FeatureStore-->>Gateway: Features
        Gateway->>ModelServer: Predict(features)
        ModelServer-->>Gateway: Prediction
        Gateway->>Cache: Store prediction
        Gateway-->>Client: 200 OK
    end
```
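The same flow in code, with the cache, feature store, and model stubbed as plain Python callables (all names are illustrative; a real gateway would use Redis with a TTL behind an HTTP framework):

```python
from typing import Any, Callable, Dict

def make_predict_handler(
    cache: Dict[str, float],
    get_features: Callable[[str], Dict[str, float]],
    model_predict: Callable[[Dict[str, float]], float],
) -> Callable[[str], Dict[str, Any]]:
    """Cache-aside prediction handler mirroring the sequence above."""
    def predict(entity_id: str) -> Dict[str, Any]:
        key = f"pred:{entity_id}"
        if key in cache:
            # Cache hit: skip the feature store and model entirely
            return {"prediction": cache[key], "cached": True}
        # Cache miss: fetch features, score, store the result
        features = get_features(entity_id)
        prediction = model_predict(features)
        cache[key] = prediction
        return {"prediction": prediction, "cached": False}
    return predict
```

Keying on entity ID matches the diagram; in production the cache entry needs a TTL so predictions refresh as features change.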
Monitoring for Drift
Model performance degrades over time. Here’s what to monitor:
Feature Drift (Data Distribution Changes)
Track KL divergence between training and production feature distributions. When the divergence exceeds your threshold, it’s time to retrain.
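A sketch of that check, binning both samples on shared edges and smoothing empty bins (function name and bin count are illustrative; `scipy.stats.entropy` computes KL divergence when given two distributions):

```python
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(
    reference: np.ndarray,
    current: np.ndarray,
    bins: int = 20,
) -> float:
    """KL(reference || current) over a shared histogram."""
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # Smooth so empty bins don't cause division by zero
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(entropy(p, q))
```

Identical distributions score near zero; the more the production distribution shifts, the larger the divergence, so the retraining threshold can be tuned per feature.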
Prediction Drift
Monitor the distribution of predictions:
```python
import numpy as np
from scipy import stats

def detect_drift(
    reference: np.ndarray,
    current: np.ndarray,
    threshold: float = 0.05,
) -> bool:
    """Detect distribution drift using the two-sample KS test."""
    statistic, p_value = stats.ks_2samp(reference, current)
    # A small p-value means the two samples are unlikely
    # to come from the same distribution
    return p_value < threshold
```
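For intuition, here is the same KS check on synthetic prediction scores (distribution parameters and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.3, scale=0.1, size=5_000)  # scores at training time
shifted = rng.normal(loc=0.5, scale=0.1, size=5_000)   # production scores after drift

# Identical samples: KS statistic 0, p-value 1.0 -> no drift flagged
_, p_same = stats.ks_2samp(baseline, baseline)

# Mean shifted by two standard deviations: tiny p-value -> drift flagged
_, p_shift = stats.ks_2samp(baseline, shifted)
```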
Key Takeaways
- **Feature stores are essential**: consistency between training and serving prevents subtle bugs
- **Start simple**: Redis + Flask before Kubernetes + Ray Serve
- **Monitor everything**: you can’t fix what you can’t see
- **Automate retraining**: schedule pipelines, don’t manually retrain
What’s Next?
In the next post, I’ll dive deeper into MLOps tooling: comparing MLflow vs Weights & Biases vs custom solutions.
Have questions? Reach out or comment below.