Building Scalable ML Pipelines in Production
A practical guide to designing machine learning pipelines that handle millions of predictions daily. Covers feature stores, model registries, deployment strategies, and MLOps best practices.
Mostafa
Fractional CTO & Software Architect
Introduction
Building ML models is the easy part. Getting them to run reliably in production, handling millions of predictions daily, is where the real challenge begins.
In this guide, I’ll share practical lessons from deploying ML systems at scale across fintech and edtech applications.
System Architecture
Let’s start with the high-level view of a production ML pipeline:
```mermaid
flowchart LR
    A[Data Sources] --> B[Ingestion Layer]
    B --> C[Feature Store]
    C --> D[Training Pipeline]
    D --> E[Model Registry]
    E --> F[Serving Layer]
    F --> G[API Gateway]
    G --> H[Applications]

    subgraph Monitoring
        I[Metrics]
        J[Drift Detection]
        K[Alerting]
    end

    F --> I
    F --> J
    J --> K
```
Data Model
Here’s the core entity relationship for tracking ML artifacts:
```mermaid
erDiagram
    MODEL_VERSION ||--o{ PREDICTION : generates
    MODEL_VERSION {
        uuid id PK
        string model_name
        string version
        float accuracy
        float f1_score
        timestamp trained_at
        string artifact_path
    }
    PREDICTION {
        uuid id PK
        uuid model_version_id FK
        jsonb input_features
        jsonb output
        float latency_ms
        timestamp created_at
    }
    FEATURE_SET ||--|{ MODEL_VERSION : trains
    FEATURE_SET {
        uuid id PK
        string name
        jsonb schema
        timestamp created_at
    }
```
The Math Behind Model Selection
When choosing between models, we optimize for expected utility — maximizing the probability-weighted outcome across all predictions.
For classification with imbalanced classes, we use weighted cross-entropy loss, where class weights are inversely proportional to class frequency. This prevents the model from ignoring rare classes.
Why Class Weights Matter: in fraud detection with a 0.1% positive rate, a model trained without class weights learns to predict “not fraud” for everything, scoring 99.9% accuracy while being useless.
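As a concrete sketch, inverse-frequency weights can be computed with the common “balanced” heuristic, `n_samples / (n_classes * count_c)` (the helper name is illustrative):

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray) -> dict:
    """Class weights inversely proportional to class frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    # The "balanced" heuristic: n_samples / (n_classes * count_c)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 0.1% fraud rate: 999 legitimate transactions, 1 fraudulent
labels = np.array([0] * 999 + [1])
weights = inverse_frequency_weights(labels)
# The rare class ends up weighted ~1000x more heavily than the common one
```

These weights can be passed as a `class_weight` dict to scikit-learn estimators, or used to scale the per-class loss in a custom training loop.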
Feature Store Implementation
A feature store provides consistent feature computation between training and serving:
```python
from typing import Any, Dict

import redis

class FeatureStore:
    """Simple Redis-backed feature store."""

    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.ttl = 3600 * 24  # 24 hours

    def get_features(self, entity_id: str) -> Dict[str, Any]:
        """Retrieve features for an entity."""
        key = f"features:{entity_id}"
        data = self.redis.hgetall(key)
        if not data:
            # Compute features on-demand and cache them
            features = self._compute_features(entity_id)
            self._store_features(entity_id, features)
            return features
        # Redis returns bytes for both keys and values
        return {k.decode(): float(v) for k, v in data.items()}

    def _compute_features(self, entity_id: str) -> Dict[str, float]:
        """Compute features from raw data."""
        # Your feature engineering logic here
        return {
            "transaction_count_7d": 42,
            "avg_transaction_amount": 150.50,
            "days_since_registration": 365,
        }

    def _store_features(self, entity_id: str, features: Dict[str, float]) -> None:
        """Cache computed features."""
        key = f"features:{entity_id}"
        self.redis.hset(key, mapping=features)
        self.redis.expire(key, self.ttl)
```
Pro Tip: Start with Redis for low-latency serving. Migrate to Feast or Tecton when you need point-in-time correctness for training data.
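To make “point-in-time correctness” concrete: when assembling a training set, each label must be joined with the feature values as they existed at label time, never later. A minimal sketch of that lookup (plain Python; the helper name is hypothetical):

```python
from bisect import bisect_right
from typing import List, Tuple

def point_in_time_value(
    history: List[Tuple[float, float]],  # (timestamp, value), sorted by timestamp
    as_of: float,
) -> float:
    """Return the latest feature value recorded at or before `as_of`.

    Using any later value would leak future information into training.
    """
    idx = bisect_right([ts for ts, _ in history], as_of)
    if idx == 0:
        raise ValueError("no feature value available before this timestamp")
    return history[idx - 1][1]

history = [(1.0, 10.0), (5.0, 20.0), (9.0, 30.0)]
point_in_time_value(history, 6.0)  # latest value at t<=6.0 is 20.0
```

Feast-style stores do exactly this join across millions of rows; a Redis cache, which only keeps the latest value, cannot.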
Model Serving Architecture
For high-throughput serving, we use a multi-tier approach:
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Cache
    participant ModelServer
    participant FeatureStore

    Client->>Gateway: POST /predict
    Gateway->>Cache: Check cache
    alt Cache hit
        Cache-->>Gateway: Cached prediction
        Gateway-->>Client: 200 OK (cached)
    else Cache miss
        Gateway->>FeatureStore: Get features
        FeatureStore-->>Gateway: Features
        Gateway->>ModelServer: Predict(features)
        ModelServer-->>Gateway: Prediction
        Gateway->>Cache: Store prediction
        Gateway-->>Client: 200 OK
    end
```
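The same flow in code, with the cache, feature store, and model stubbed as plain Python callables (all names are illustrative; a real gateway would use Redis with a TTL behind an HTTP framework):

```python
from typing import Any, Callable, Dict

def make_predict_handler(
    cache: Dict[str, float],
    get_features: Callable[[str], Dict[str, float]],
    model_predict: Callable[[Dict[str, float]], float],
) -> Callable[[str], Dict[str, Any]]:
    """Cache-aside prediction handler mirroring the sequence above."""
    def predict(entity_id: str) -> Dict[str, Any]:
        key = f"pred:{entity_id}"
        if key in cache:
            # Cache hit: skip the feature store and model entirely
            return {"prediction": cache[key], "cached": True}
        # Cache miss: fetch features, score, store the result
        features = get_features(entity_id)
        prediction = model_predict(features)
        cache[key] = prediction
        return {"prediction": prediction, "cached": False}
    return predict
```

Keying on entity ID matches the diagram; in production the cache entry needs a TTL so predictions refresh as features change.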
Monitoring for Drift
Model performance degrades over time. Here’s what to monitor:
Feature Drift (Data Distribution Changes)
Track KL divergence between training and production feature distributions. When the divergence exceeds your threshold, it’s time to retrain.
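A sketch of that check, binning both samples on shared edges and smoothing empty bins (function name and bin count are illustrative; `scipy.stats.entropy` computes KL divergence when given two distributions):

```python
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(
    reference: np.ndarray,
    current: np.ndarray,
    bins: int = 20,
) -> float:
    """KL(reference || current) over a shared histogram."""
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # Smooth so empty bins don't cause division by zero
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(entropy(p, q))
```

Identical distributions score near zero; the more the production distribution shifts, the larger the divergence, so the retraining threshold can be tuned per feature.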
Prediction Drift
Monitor the distribution of predictions:
```python
import numpy as np
from scipy import stats

def detect_drift(
    reference: np.ndarray,
    current: np.ndarray,
    threshold: float = 0.05,
) -> bool:
    """Detect distribution drift using the two-sample KS test."""
    statistic, p_value = stats.ks_2samp(reference, current)
    # A small p-value means the two samples are unlikely
    # to come from the same distribution
    return p_value < threshold
```
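For intuition, here is the same KS check on synthetic prediction scores (distribution parameters and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.3, scale=0.1, size=5_000)  # scores at training time
shifted = rng.normal(loc=0.5, scale=0.1, size=5_000)   # production scores after drift

# Identical samples: KS statistic 0, p-value 1.0 -> no drift flagged
_, p_same = stats.ks_2samp(baseline, baseline)

# Mean shifted by two standard deviations: tiny p-value -> drift flagged
_, p_shift = stats.ks_2samp(baseline, shifted)
```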
Key Takeaways
- **Feature stores are essential**: consistency between training and serving prevents subtle bugs
- **Start simple**: Redis + Flask before Kubernetes + Ray Serve
- **Monitor everything**: you can’t fix what you can’t see
- **Automate retraining**: schedule pipelines, don’t manually retrain
What’s Next?
In the next post, I’ll dive deeper into MLOps tooling: comparing MLflow vs Weights & Biases vs custom solutions.
Have questions? Reach out or comment below.