Data Infrastructure for AI & Experimentation at Scale

A comprehensive deep-dive into the data backbone powering ML, personalization, experimentation, and GenAI on modern streaming platforms

A complete deep-dive into Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI + Experimentation — the foundational pillars behind modern streaming and content platforms.


What Does a Data & AI Platform Power?

Understanding the bigger picture — on any large-scale streaming platform (video, music, podcasts, live content), the data infrastructure powers:

  • Identity systems — who is the user across devices
  • Device graph — connect smart TV, mobile, tablet, laptop to same person/household
  • Audience platform — build segments like "sports lovers" or "sci-fi bingers"
  • Recommendation & personalization — what to show each user on the home feed
  • Measurement & attribution — did a promotion lead to engagement or subscription?
  • Experimentation — A/B testing thousands of changes simultaneously
  • Reporting & insights — dashboards + GenAI agents for analytics
  • Ad systems — targeting, bidding, measurement for ad-supported tiers

This is Applied ML + Data Platform + Experimentation + GenAI — not pure ML research.


Table of Contents

  1. The Event Backbone — Ingestion & Streaming
  2. Machine Learning Fundamentals
  3. The Feature Platform
  4. Ads / Audience / Recommendation Systems
  5. Identity Resolution & Device Graph
  6. Data Engineering at Scale
  7. ML System Design
  8. The Experimentation Platform
  9. Gen AI / LLM / Agents
  10. SQL & Data Modeling
  11. KPIs / Objective Functions
  12. MLOps / Production ML
  13. Data Quality & Observability
  14. End-to-End Architecture

1. The Event Backbone — Ingestion & Streaming

What Needs to Be Captured

On a streaming platform, user interactions generate a staggering volume of telemetry:

| Event Type | Examples | Volume (large platform) |
| --- | --- | --- |
| Playback events | play, pause, seek, buffer, complete | ~50B/day |
| Navigation events | impression, click, scroll, search | ~30B/day |
| Engagement signals | like, save, share, add-to-list | ~2B/day |
| Device/context | device type, OS, network, geo, time | Attached to every event |
| Content metadata | genre, duration, cast, release date | Updated in batch |

Streaming Architecture — Apache Kafka at the Core

┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐
│  Client SDKs │────▶│  Kafka Clusters  │────▶│ Stream Processors  │
│  (mobile,    │     │  (partitioned by │     │ (Flink / Spark     │
│   web, TV,   │     │   user_id or     │     │  Structured        │
│   console)   │     │   device_id)     │     │  Streaming)        │
└──────────────┘     └────────┬─────────┘     └─────────┬──────────┘
                              │                         │
                              ▼                         ▼
                     ┌────────────────┐       ┌────────────────────┐
                     │  Raw Event     │       │  Derived Streams   │
                     │  Data Lake     │       │  (sessionized,     │
                     │  (S3/ADLS/GCS) │       │   enriched,        │
                     │                │       │   aggregated)      │
                     └────────────────┘       └────────────────────┘

Key Design Decisions

Schema Registry (Avro/Protobuf): Every event must conform to a registered schema. Without this, downstream consumers break constantly. Use Apache Avro with Confluent Schema Registry or Protobuf definitions checked into version control. Schema evolution rules (backward/forward compatibility) are critical.

Partitioning Strategy: Partition by user_id for user-centric analytics (session analysis, feature computation). Partition by content_id for content-centric tasks (popularity counters, trending). Some platforms maintain dual topics — one partitioned each way.
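A minimal sketch of keyed production with the confluent-kafka Python client (broker addresses and the topic name here are illustrative): keying each message by user_id makes Kafka hash all of a user's events to the same partition, which preserves per-user ordering for downstream sessionization.

import json
from confluent_kafka import Producer

# Illustrative config; the broker list and topic name are placeholders
producer = Producer({"bootstrap.servers": "kafka-1:9092,kafka-2:9092"})

def emit_event(event: dict) -> None:
    # Key by user_id so all of this user's events hash to the same partition
    producer.produce(
        "playback-events",
        key=event["user_id"].encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
    )

emit_event({"user_id": "u_abc123", "event_type": "playback.started"})
producer.flush()  # block until outstanding messages are delivered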

Exactly-Once Semantics: Kafka supports idempotent producers and transactional writes. For streaming jobs consuming from Kafka → writing to a data lake, use Flink checkpointing with exactly-once sink connectors to avoid duplicate or missing events.

Backpressure & Dead Letter Queues: When a consumer can't keep up, events must be buffered, not dropped. Dead letter queues capture malformed events for later reprocessing rather than silently discarding them.

Event Schema Design — Example

{
  "event_id": "uuid-v4",
  "event_type": "playback.started",
  "timestamp_ms": 1712419200000,
  "user_id": "u_abc123",
  "session_id": "s_xyz789",
  "content_id": "c_movie_456",
  "device": {
    "type": "smart_tv",
    "os": "tvOS",
    "app_version": "5.12.0"
  },
  "context": {
    "page": "home_feed",
    "row": "continue_watching",
    "position": 3,
    "algorithm": "collab_filter_v2",
    "experiment_id": "exp_2026_q2_reco_v3",
    "variant": "treatment_b"
  },
  "content_metadata": {
    "genre": ["sci-fi", "thriller"],
    "duration_sec": 7200,
    "release_year": 2025
  }
}

Why context.algorithm and context.experiment_id matter: They tie every interaction back to the model and experiment that produced the recommendation. Without this, you can't measure model performance or run valid A/B tests.


2. Machine Learning Fundamentals

Supervised Learning

  • Regression — predict continuous values (e.g., predicted watch time)
  • Linear Regression — fit a line using ordinary least squares. Assumes a linear relationship between features and target. Closed-form solution: β̂ = (XᵀX)⁻¹Xᵀy
  • Polynomial Regression — extend linear regression with polynomial features for non-linear relationships
  • Classification — predict categories (e.g., will user click?)
  • Logistic Regression — binary classification using the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). Despite the name, it's a classifier, not regression. Outputs probabilities, threshold at 0.5 by default.
  • Decision Trees — recursive splitting on feature thresholds to minimize impurity (Gini or entropy). Human-interpretable but prone to overfitting.
  • Random Forest — ensemble of decision trees trained on bootstrap samples (bagging). Each tree sees a random subset of features. Reduces variance while maintaining low bias. Typically 100-500 trees.
  • Gradient Boosting (XGBoost, LightGBM, CatBoost) — sequential trees where each new tree corrects the errors of the ensemble so far. Minimizes a loss function via gradient descent in function space. XGBoost uses a regularized objective: Obj = Σᵢ l(ŷᵢ, yᵢ) + Σₖ Ω(fₖ), with Ω(f) = γT + ½λ‖w‖². LightGBM is faster with histogram-based splitting and leaf-wise growth. (A minimal training sketch follows this list.)
  • Neural Networks — layers of neurons with non-linear activations (ReLU, sigmoid, tanh). Universal approximators — can learn any function given sufficient depth and width.
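As a concrete illustration of the gradient boosting bullet above, here is a minimal sketch on synthetic data using scikit-learn's GradientBoostingClassifier (production systems would typically use XGBoost or LightGBM on real impression/click logs):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for (features, clicked) pairs
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequential trees, each one fitting the residual errors of the ensemble so far
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
print(f"Validation AUC: {roc_auc_score(y_val, val_scores):.3f}")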

Unsupervised Learning

  • Clustering (KMeans) — partition data into K groups by minimizing within-cluster sum of squares. Sensitive to initialization (use KMeans++). Requires choosing K (elbow method, silhouette score).
  • DBSCAN — density-based clustering. Finds arbitrarily shaped clusters. No need to specify K. Parameters: epsilon (neighborhood radius), min_samples.
  • PCA (Principal Component Analysis) — project data onto orthogonal directions of maximum variance. Used for dimensionality reduction, visualization, and denoising. Keep components that explain >95% variance.
  • Embeddings — dense vector representations of entities (users, items, words). Learned via neural networks. Similar entities have similar vectors in embedding space.
  • Similarity search — find nearest neighbors in embedding space using cosine similarity, Euclidean distance, or dot product. Approximate Nearest Neighbor (ANN) algorithms: FAISS, ScaNN, HNSW for sub-linear search.

Important Concepts

  • Bias vs Variance tradeoff — high bias = underfitting (model too simple), high variance = overfitting (model too complex). Goal: minimize total error = bias² + variance + irreducible noise.
  • Overfitting — model memorizes training data, performs poorly on unseen data. Signs: training accuracy >> validation accuracy. Fix: more data, regularization, simpler model, early stopping, dropout.
  • Regularization
  • L1 (Lasso): adds λ Σ|wᵢ| to the loss. Encourages sparsity — some weights become exactly zero. Good for feature selection.
  • L2 (Ridge): adds λ Σwᵢ² to the loss. Shrinks weights evenly toward zero. Prevents any single feature from dominating.
  • Elastic Net: combines L1 + L2 regularization.
  • Dropout (neural networks): randomly zero out neurons during training. Acts as ensemble of sub-networks.
  • Cross Validation — evaluate model on multiple train/test splits. K-fold: split data into K parts, train on K-1, validate on 1, rotate. Stratified K-fold preserves class distribution.
  • Feature Engineering — creating useful input features from raw data. Often the #1 differentiator in applied ML. Examples: time-since-last-interaction, rolling averages, ratio features, cross features.
  • Feature Scaling — StandardScaler (zero mean, unit variance) or MinMaxScaler (0 to 1). Critical for distance-based algorithms (KNN, SVM) and gradient descent optimization.
  • Handling missing values — imputation (mean, median, mode), indicator columns ("is_missing"), or model-based (KNN imputation, iterative imputer). Never silently drop rows in production.
  • Class imbalance — when positive class is rare (e.g., 1% conversion rate). Techniques: SMOTE (synthetic oversampling), class weights in loss function, undersampling majority class, focal loss.

Metrics

| Metric | What It Measures | When to Use |
| --- | --- | --- |
| Accuracy | % of correct predictions | Balanced classes only |
| Precision | Of predicted positives, how many are actually positive | When false positives are costly (spam detection) |
| Recall | Of actual positives, how many did we catch | When false negatives are costly (fraud detection) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, need balance |
| ROC AUC | Area under ROC curve — tradeoff between TPR and FPR | Binary classification, threshold-independent |
| PR AUC | Area under Precision-Recall curve | Highly imbalanced datasets |
| Log Loss | Penalizes confident wrong predictions | CTR prediction, calibrated probabilities |
| RMSE | Root Mean Squared Error | Regression, penalizes large errors |
| MAE | Mean Absolute Error | Regression, robust to outliers |
| NDCG | Normalized Discounted Cumulative Gain | Ranking quality |
| MAP | Mean Average Precision | Information retrieval |
| MRR | Mean Reciprocal Rank | First relevant result position |
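A small sketch of computing several of these metrics with scikit-learn on toy labels (the arrays below are made up for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, log_loss, ndcg_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.65, 0.4, 0.3, 0.05, 0.8, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_prob))
print("log_loss: ", log_loss(y_true, y_prob))

# NDCG expects one row per ranked list: true relevance vs. predicted scores
relevance = np.array([[3, 2, 3, 0, 1, 2]])
scores = np.array([[0.9, 0.8, 0.4, 0.3, 0.2, 0.1]])
print("ndcg:     ", ndcg_score(relevance, scores))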

3. The Feature Platform

The Feature Engineering Problem

ML models need features. Features come from raw events. The challenge:

  • Training needs features computed over historical data (batch)
  • Serving needs features computed in real-time (online)
  • The features must be identical — otherwise you get training-serving skew, the silent killer of ML systems

Feature Store Architecture

flowchart TD
    Raw["📊 Raw Events<br/>(Kafka / Data Lake)"] --> Batch["⏱️ Batch Pipeline<br/>(Spark / dbt)"]
    Raw --> Stream["⚡ Stream Pipeline<br/>(Flink / Spark Streaming)"]

    Batch --> Offline["🗄️ Offline Store<br/>(Hive / Delta Lake / BigQuery)"]
    Stream --> Online["⚡ Online Store<br/>(Redis / DynamoDB / Bigtable)"]

    Offline --> Training["🤖 Model Training"]
    Online --> Serving["🔮 Model Serving"]

    Offline -.->|"point-in-time join"| Training
    Batch -.->|"backfill"| Online

    style Online fill:#ff9800,color:#fff
    style Offline fill:#2196f3,color:#fff

Types of Features on a Streaming Platform

User-Level Features (computed per user)

user_total_watch_hours_7d           rolling 7-day watch time
user_genre_affinity_vector          softmax over genre engagement
user_avg_session_length_30d         average session in last 30 days
user_skip_rate_7d                   fraction of content abandoned < 30s
user_search_to_play_ratio_7d       how often search leads to a play
user_time_of_day_distribution       histogram of activity by hour
user_device_preference              primary device type
user_content_completion_rate_30d    fraction of content watched to end
user_days_since_signup              account age
user_subscription_tier              free / basic / premium

Content-Level Features (computed per item)

content_total_plays_7d              popularity signal
content_avg_completion_rate         quality signal
content_genre_tags                  categorical
content_release_recency_days        freshness
content_avg_rating                  explicit quality
content_play_to_impression_ratio    CTR proxy

Cross Features (user × content interaction)

user_has_watched_same_genre_7d     — genre relevance
user_watched_same_creator          — creator affinity
user_x_content_genre_overlap       — cosine similarity of genre vectors

Point-in-Time Joins — The Most Critical Concept

When training a model, you must join features as they existed at the time of the event, not as they exist now. Otherwise you leak future information into training data.

# WRONG — uses current features for historical events
features = feature_store.get_latest("user_123")

# RIGHT — uses features as-of the event timestamp
features = feature_store.get_as_of("user_123", timestamp="2026-03-15T14:00:00Z")

Frameworks like Feast, Tecton, and Hopsworks handle point-in-time correctness automatically when you define feature views with timestamps.
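A minimal illustration of a point-in-time join using pandas merge_asof (toy data): each event row picks up the most recent feature value at or before its timestamp, never a later one.

import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "event_ts": pd.to_datetime(["2026-03-10", "2026-03-20", "2026-03-15"]),
    "label": [1, 0, 1],
}).sort_values("event_ts")

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "feature_ts": pd.to_datetime(["2026-03-01", "2026-03-15", "2026-03-01"]),
    "watch_hours_7d": [4.2, 9.1, 1.3],
}).sort_values("feature_ts")

# direction="backward": take the latest feature row with feature_ts <= event_ts
training_set = pd.merge_asof(
    events, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
print(training_set)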

Online Feature Serving — Latency Matters

Model inference at serving time must complete within 50-100ms (including feature fetch + model forward pass). This means:

  • Redis Cluster or DynamoDB for sub-5ms feature lookups (a minimal Redis lookup sketch follows this list)
  • Feature caching at the application layer for session-stable features
  • Batch precomputation of expensive features (e.g., run nightly, push to online store)
  • Feature embedding lookups — precomputed user/content embeddings stored in vector stores
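A minimal sketch of an online lookup against Redis with redis-py; the hostname and key naming convention are illustrative, and in practice the hash would be written by the batch/stream pipelines rather than inline like this:

import redis

r = redis.Redis(host="feature-store.internal", port=6379, decode_responses=True)

# Written by the batch/stream pipelines (illustrative key naming)
r.hset("features:user:u_abc123", mapping={
    "watch_hours_7d": 12.5,
    "skip_rate_7d": 0.18,
    "device_preference": "smart_tv",
})

# Read at request time; a single HGETALL keeps the fetch within a few milliseconds
user_features = r.hgetall("features:user:u_abc123")
print(user_features)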

4. Ads / Audience / Recommendation Systems

Ads System Architecture

The full flow of an ad system on a streaming platform:

graph LR
    A[User] --> B[Device]
    B --> C[Identity Resolution]
    C --> D[Audience Segment]
    D --> E[Ad Selection]
    E --> F[Auction]
    F --> G[Impression]
    G --> H[Click]
    H --> I[Conversion]
    I --> J[Attribution]
    J --> K[Reporting]

How Ad Auctions Work

When a user opens a streaming app and hits an ad break, the platform runs an auction in real-time:

  1. Bid Request — platform sends user context (anonymized) to demand-side platforms or internal ad server
  2. Bid Response — advertisers bid for the impression, specifying CPM (cost per 1000 impressions)
  3. Auction — typically a second-price auction (winner pays $0.01 above second-highest bid) or increasingly first-price auction
  4. Ad Selection — rank by eCPM = pCTR × bid × 1000 (for cost-per-click bids), also factoring in relevance and user experience (a toy worked example follows this list)
  5. Impression & Tracking — serve the ad, fire tracking pixels for viewability, completion, clicks
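A toy worked example of the selection and pricing step (pure Python, made-up numbers), using CPM bids and second-price clearing as described above:

# Toy second-price auction over CPM bids (illustrative numbers).
# For cost-per-click bids, rank by eCPM = pCTR * bid * 1000 instead.
bids = {"advertiser_a": 8.50, "advertiser_b": 12.00, "advertiser_c": 10.25}  # CPM in $

ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
winner, winning_bid = ranked[0]
second_price = ranked[1][1]

clearing_price = second_price + 0.01  # winner pays just above the runner-up
print(f"{winner} wins and pays ${clearing_price:.2f} CPM (bid ${winning_bid:.2f})")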

Models Used in Ads

  • CTR prediction (Click Through Rate) — probability user clicks on ad. Typically uses gradient boosted trees (XGBoost/LightGBM) or deep learning with embedding layers for sparse features. Training data: billions of impression-click pairs.
  • CVR prediction (Conversion Rate) — probability of purchase after click. Harder than CTR due to sparser signal and longer attribution windows.
  • Ranking models — order ads by expected value. eCPM = CTR × bid. Multi-objective ranking can also factor in user experience.
  • Lookalike modeling — given a seed audience (e.g., people who bought product X), find similar users in the broader population. Typically uses embedding similarity or supervised classification.
  • Audience segmentation — cluster users into meaningful groups using behavioral features. KMeans, DBSCAN, or learned segment embeddings.
  • Collaborative filtering — recommend based on similar users' behavior. Matrix factorization (ALS), neural collaborative filtering.
  • Content-based filtering — recommend based on item features (genre, tags, description embeddings).
  • Embeddings for users & items — learn dense vector representations via two-tower models or autoencoders. User embedding captures preferences, item embedding captures attributes.
  • Multi-armed bandits — explore vs exploit for ad/content selection. Thompson Sampling, Upper Confidence Bound (UCB), epsilon-greedy.
  • Budget pacing — spend advertiser budget evenly over campaign duration. Proportional-integral-derivative (PID) controllers are common.

Recommendation Model Architectures

Two-Tower Model (Retrieval Stage)

          User Tower                    Item Tower
    ┌─────────────────┐          ┌─────────────────┐
    │  user features  │          │  item features  │
    │  watch history  │          │  genre, tags    │
    │  demographics   │          │  popularity     │
    └────────┬────────┘          └────────┬────────┘
             │                            │
        [Dense Layers]              [Dense Layers]
             │                            │
             ▼                            ▼
      user_embedding                item_embedding
        (128-dim)                     (128-dim)
             │                            │
             └──────── dot product ───────┘
                          │
                          ▼
                  similarity score

  • Training: In-batch negatives or hard negative mining. Loss: softmax cross-entropy or sampled softmax.
  • Serving: Precompute all item embeddings → build an ANN index (FAISS, ScaNN, HNSW). At request time, compute the user embedding → ANN lookup for top-K candidates in <10ms (see the sketch after this list).
  • Scale: Retrieve ~500 candidates from millions of items in under 10ms.
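A brute-force numpy stand-in for the retrieval lookup (a production system would build an ANN index such as FAISS or ScaNN over the same precomputed item embeddings instead of scoring every item):

import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100_000, 128)).astype(np.float32)  # precomputed offline
user_embedding = rng.normal(size=128).astype(np.float32)              # computed per request

# Dot-product scores against every item; an ANN index approximates this in sub-linear time
scores = item_embeddings @ user_embedding
top_k = 500
candidates = np.argpartition(-scores, top_k)[:top_k]
candidates = candidates[np.argsort(-scores[candidates])]  # order the top-K by score
print(candidates[:10])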

Deep Ranking Model (Ranking Stage)

Takes ~500 candidates from retrieval and scores them:

Input Features:
  - User features (dense + sparse)
  - Item features (dense + sparse)
  - Cross features (user × item)
  - Context features (time, device, page)
       │
  [Embedding Layer — sparse features]
       │
  [Concatenation with dense features]
       │
  [Multi-Layer Perceptron or DCN-v2]
       │
  [Multi-Task Heads]
       ├── P(click)
       ├── P(watch > 50%)
       ├── P(complete)
       ├── P(like)
       └── P(add to list)
       │
  [Weighted combination → final score]

Multi-Task Learning is essential because optimizing for clicks alone leads to clickbait. The final score is a weighted combination of the task heads, e.g. score = w₁·P(click) + w₂·P(watch > 50%) + w₃·P(complete) + w₄·P(like) + w₅·P(add to list), with the weights set by business priorities.

Sequence Models for Watch History

User watch history is sequential — order matters:

  • Transformer-based: Self-attention over recent N items watched. Captures temporal patterns (binge-watching a series).
  • SASRec (Self-Attentive Sequential Recommendation): Causal attention mask on item sequence to predict next interaction.
  • GRU4Rec: RNN-based session recommendation. Lighter weight, still effective for session-based reco.

Content Cold Start — The Chicken-and-Egg Problem

New content has zero engagement signal. How do you recommend something nobody has watched?

| Strategy | How It Works | Limitations |
| --- | --- | --- |
| Content-based features | Use metadata (genre, cast, director, synopsis embedding) to find similar existing content | Ignores personal taste |
| Explore/exploit | Thompson Sampling or epsilon-greedy — intentionally show new content to a small % of users | Hurts short-term metrics |
| Contextual bandits | Use contextual features (user segment, time of day) to decide who sees new content | Requires fast feedback loop |
| Creator-based transfer | If a creator has a track record, use their historical performance as a prior | Only for known creators |
| Editorial boost | Curators manually promote content for initial impressions | Doesn't scale |

Best systems combine content-based warm-starting + explore/exploit for initial signals, then transition to collaborative filtering once enough data exists.
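A minimal Thompson Sampling sketch (numpy, synthetic play rates) for the explore/exploit piece: each new title keeps a Beta posterior over its play rate, and every impression goes to the title whose sampled play rate is highest.

import numpy as np

rng = np.random.default_rng(1)
true_play_rates = np.array([0.12, 0.05, 0.20])  # unknown to the system
alpha = np.ones(3)  # successes + 1 (Beta prior)
beta = np.ones(3)   # failures + 1

for _ in range(5_000):
    sampled = rng.beta(alpha, beta)      # one draw per title from its posterior
    choice = int(np.argmax(sampled))     # show the title with the highest sampled rate
    played = rng.random() < true_play_rates[choice]
    alpha[choice] += played
    beta[choice] += 1 - played

print("impressions per title:", (alpha + beta - 2).astype(int))
print("posterior mean play rate:", np.round(alpha / (alpha + beta), 3))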

Handling Bias in Recommendation Data

All training data is biased — users can only engage with what was shown.

Position Bias: Items in position 1 get more clicks regardless of relevance. Solution: train a position bias model separately and debias during training.

Selection Bias: Training data only contains items the model chose to show. Solution: inverse propensity scoring (IPS) — weight each training example by 1 / P(item was shown).

Popularity Bias: Popular items get shown more → get more engagement → appear more popular. Rich-get-richer feedback loop. Solution: diversity objectives + calibration to match user interest distribution.


5. Identity Resolution & Device Graph

The Fundamental Problem

How do you know that a smart TV, a mobile phone, a laptop, and a tablet belong to the same user or household? This is the identity resolution problem.

Methods

  • Deterministic matching — exact match on email, login, phone number, account ID. High precision, low recall. Requires authenticated sessions.
  • Probabilistic matching — statistical models using IP address, Wi-Fi network, location patterns, timing correlation. Lower precision, higher recall.
  • Graph-based identity resolution — build a graph of devices and signals, run connected components or community detection.
  • Embedding similarity — learn device/user embeddings from interaction patterns, cluster by cosine similarity.
  • Graph Neural Networks (GNN) — use message passing on the device graph for more sophisticated entity resolution.

Device Graph Structure

A graph where nodes are devices/users/households and edges represent observed relationships:

graph TD
    H[Household] --> TV[Smart TV]
    H --> M[Mobile Phone]
    H --> L[Laptop]
    H --> T[Tablet]
    TV -.->|same IP / WiFi| M
    M -.->|same login| L
    L -.->|co-occurrence| T

Key Graph Algorithms

  • Connected Components — find clusters of related devices. Simple BFS/DFS or union-find (see the sketch after this list). At scale, use graph-parallel frameworks (GraphX, Pregel) for distributed connected components.
  • PageRank — rank importance of nodes. The "primary device" in a household often has the highest PageRank.
  • Node Embeddings (Node2Vec) — random walks on the graph → skip-gram model → dense vector per node. Similar devices end up close in embedding space.
  • GraphSAGE — inductive GNN that learns node representations by aggregating features from local neighborhoods. Scales to large graphs because it samples neighborhoods rather than using the full graph.
  • Link Prediction — predict missing edges (e.g., does this phone belong to this household?). Train a classifier on node-pair features: common neighbors, Jaccard similarity, learned embeddings.
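A minimal union-find sketch for connected components over device-graph edges (in production this runs distributed over billions of edges, e.g. with GraphX; the node names below are made up):

def connected_components(edges):
    """Union-find over (node_a, node_b) edges; returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)
    return {node: find(node) for node in parent}

edges = [("tv_1", "ip_9"), ("phone_7", "ip_9"), ("laptop_3", "login_42"), ("phone_7", "login_42")]
print(connected_components(edges))  # all four devices collapse into one household cluster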

Identity Graph at Scale

At streaming platform scale (100M+ households), the identity graph has:

  • Billions of nodes (devices, IPs, accounts, cookies)
  • Hundreds of billions of edges
  • Continuous updates — new devices appear, IPs rotate, accounts merge

This requires:

  • Graph databases: Neo4j, JanusGraph, Amazon Neptune, or custom solutions on Spark GraphX
  • Incremental graph updates — don't recompute from scratch; process new edges as they arrive
  • Privacy-compliant design — hash/anonymize PII, respect opt-outs, comply with GDPR/CCPA

6. Data Engineering at Scale

Core Concepts Deep Dive

  • ETL vs ELT — Traditional ETL transforms before loading to warehouse. Modern ELT loads raw data to lake first, transforms in-place using Spark or dbt. ELT is dominant because storage is cheap, and you preserve raw data.
  • Data Lakes — store raw data in any format (S3, ADLS Gen2, GCS). Use open table formats (Delta Lake, Apache Iceberg, Apache Hudi) for ACID transactions, time travel, schema evolution on the lake.
  • Data Warehouse — structured, optimized for analytics. Columnar storage, query optimization. Snowflake, BigQuery, Redshift, Synapse, Databricks SQL.
  • Lakehouse — hybrid: data lake storage + warehouse query performance. Delta Lake on Databricks, Apache Iceberg on Trino. Eliminates the need to maintain separate lake and warehouse.
  • Batch vs Streaming — batch processes data in scheduled chunks (hourly/daily). Streaming processes events as they arrive (Kafka + Flink). Lambda architecture runs both; Kappa architecture uses streaming for everything.
  • Apache Spark — distributed compute engine. RDDs → DataFrames → Structured Streaming. Key concepts: lazy evaluation, partitioning, shuffle, catalyst optimizer, adaptive query execution (AQE in Spark 3+).
  • Apache Flink — true stream processing with event-time semantics, watermarks, exactly-once guarantees. Better than Spark Streaming for low-latency use cases.
  • Presto / Trino — distributed SQL query engine for interactive analytics on data lakes. Federated queries across multiple data sources.
  • Parquet — columnar storage format. Efficient for analytics (read only needed columns). Supports predicate pushdown, dictionary encoding, run-length encoding.
  • Feature Store — centralized repository for ML features (Feast, Tecton, Hopsworks). Ensures consistency between training and serving.
  • Apache Airflow / Dagster / Prefect — workflow orchestration. Define DAGs (directed acyclic graphs) of tasks with dependencies. Airflow is most popular but Dagster has better dev experience and asset-based paradigm.
  • Kafka — distributed event streaming. Partitioned, replicated, durable. Key concepts: topics, partitions, consumer groups, offset management, exactly-once semantics.

Data at Scale — Common Challenges

  • Data partitioning — split data by date/region for faster queries. Partition pruning avoids scanning irrelevant data.
  • Data skew — uneven data distribution across partitions. One partition with 100x more data than others. Fix: salting keys, repartitioning, broadcast joins for small tables.
  • Joins at scale — broadcast join (small table fits in memory, no shuffle), sort-merge join (Spark's default for two large tables: shuffle and sort both sides), shuffle hash join (shuffles both sides but skips the sort). A short broadcast-join and salting sketch follows this list.
  • Window functions — compute aggregates across rows related to current row. Essential for sessionization, rolling features, ranking.
  • Late-arriving data — events that arrive after the processing window has closed. Handle with watermarks (Flink), merge-on-read (Iceberg/Hudi), or reprocessing.
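A short PySpark sketch of two of the techniques above, a broadcast join and key salting; the table paths and column names are illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
events = spark.read.parquet("s3://lake/raw_events/")        # large, skewed fact table (illustrative path)
content_dim = spark.read.parquet("s3://lake/dim_content/")  # small dimension table

# Broadcast join: ship the small dimension to every executor, avoiding a shuffle of the fact table
enriched = events.join(F.broadcast(content_dim), on="content_id", how="left")

# Salting: spread a hot key (e.g. one viral title) across N buckets before a heavy aggregation
N = 16
salted = events.withColumn("salt", (F.rand() * N).cast("int"))
partial = salted.groupBy("content_id", "salt").agg(F.count("*").alias("plays"))
totals = partial.groupBy("content_id").agg(F.sum("plays").alias("plays"))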

Typical Data Pipeline

graph LR
    A[Client Events] --> B[Kafka]
    B --> C[Raw Data Lake<br/>Parquet/Iceberg]
    C --> D[Spark / dbt<br/>Transform]
    D --> E[Feature Store]
    E --> F[ML Model Training]
    D --> G[Analytics<br/>Warehouse]
    G --> H[Dashboards]
    F --> I[Model Serving]
    I --> J[Predictions API]

Sessionization — Deceptively Hard

A "session" isn't straightforward. Is it timeout-based (30 min inactivity)? Content-boundary-based (new show = new session)? Device-specific?

-- Sessionization using inactivity gap (30 minutes)
WITH events_with_gap AS (
    SELECT
        user_id,
        event_timestamp,
        LAG(event_timestamp) OVER (
            PARTITION BY user_id ORDER BY event_timestamp
        ) AS prev_timestamp,
        CASE
            WHEN EXTRACT(EPOCH FROM event_timestamp - LAG(event_timestamp)
                 OVER (PARTITION BY user_id ORDER BY event_timestamp)) > 1800
            THEN 1
            ELSE 0
        END AS new_session_flag
    FROM raw_events
)
SELECT
    user_id,
    event_timestamp,
    SUM(new_session_flag) OVER (
        PARTITION BY user_id ORDER BY event_timestamp
    ) AS session_id
FROM events_with_gap;

Real-time sessionization in Flink uses session windows with a gap timeout — but requires careful handling of late-arriving events and cross-device sessions.


7. ML System Design

System Design Framework

When designing any ML system, structure it as:

  1. Problem definition — what are we solving?
  2. Metrics — how do we measure success? (offline + online)
  3. Data sources — what data do we have?
  4. Feature engineering — what features do we build?
  5. Model choice — what algorithm fits?
  6. Training pipeline — how do we train at scale?
  7. Serving architecture — batch vs real-time?
  8. Monitoring — what do we track in production?
  9. Retraining — when and how do we update?
  10. Experimentation — how do we A/B test?

ML System Pipeline

graph TD
    A[Data Ingestion] --> B[Data Cleaning]
    B --> C[Feature Engineering]
    C --> D[Feature Store]
    D --> E[Model Training]
    E --> F[Model Evaluation]
    F --> G[Model Registry]
    G --> H{Deployment Strategy}
    H -->|Shadow| I1[Log predictions, don't serve]
    H -->|Canary| I2[5% traffic, monitor]
    H -->|A/B Test| I3[50/50 with old model]
    H -->|Full| I4[100% traffic]
    I1 --> J[Monitoring & Alerting]
    I2 --> J
    I3 --> J
    I4 --> J
    J -->|"drift / degradation"| K[Retrain Trigger]
    K --> E

Real-Time vs Batch Training

| Aspect | Batch Training | Real-Time Training |
| --- | --- | --- |
| Freshness | Hours to days stale | Minutes stale |
| Use case | Stable preferences | Trending content, viral items |
| Infrastructure | Spark + GPU cluster | Flink + parameter server |
| Complexity | Lower | Much higher |
| Typical approach | Daily/weekly retrain | Continuous embedding updates |

Most platforms use a hybrid: batch-train the full model daily/weekly, but update embedding tables in near-real-time for new content and shifting user preferences (as described in ByteDance's Monolith paper).

Model Serving Infrastructure

For a platform serving millions of concurrent users:

  • Model format: TensorFlow SavedModel, ONNX, or TorchScript
  • Serving framework: TensorFlow Serving, Triton Inference Server, BentoML, or custom gRPC services
  • Dynamic batching: Group multiple requests into one GPU batch for throughput
  • Caching: LRU cache on popular content embeddings; session-level cache for user context
  • Fallback: If model inference fails, fall back to a rule-based ranker (popularity-based)
  • Latency budget: <50ms p99 for the entire recommend-and-rank pipeline

Example Systems to Understand

  • Content recommendation system (home feed ranking)
  • Ad targeting and ranking system
  • Audience segmentation system
  • Search ranking system
  • Attribution and measurement system
  • Identity resolution system
  • Reporting insights AI agent

8. The Experimentation Platform

Why Experimentation Infrastructure Is as Important as ML

A model is only as good as your ability to measure its impact. Streaming platforms run hundreds to thousands of concurrent experiments:

  • New ranking models
  • UI layout changes
  • Content promotion strategies
  • Notification timing
  • Pricing experiments
  • Ad targeting algorithms

Core Components

flowchart LR
    Config["🎛️ Experiment Config<br/>(variants, allocation,<br/>guardrails)"] --> Assign["👤 Assignment<br/>Service"]
    Assign --> Client["📱 Client SDK<br/>(get variant)"]
    Client --> Events["📊 Events<br/>(tagged with<br/>experiment_id + variant)"]
    Events --> Pipeline["⚙️ Metrics<br/>Pipeline"]
    Pipeline --> Stats["📈 Statistical<br/>Analysis"]
    Stats --> Dashboard["📊 Experiment<br/>Dashboard"]

    style Config fill:#9c27b0,color:#fff
    style Stats fill:#4caf50,color:#fff

Assignment — Deterministic Hashing

import hashlib

def get_variant(user_id: str, experiment_id: str, num_variants: int) -> int:
    """Deterministic assignment: same user always gets same variant."""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    return hash_value % num_variants

Properties:

  • Deterministic — same user always sees same variant (no flickering)
  • Uniform — SHA-256 gives near-perfect uniform distribution
  • Independent — different experiments use different hash inputs, so assignments are uncorrelated
  • Stateless — no need to store assignments; recompute on every request

Core Statistical Concepts

  • A/B testing — compare control (A) vs treatment (B) with random assignment
  • Hypothesis testing — formulate a null hypothesis (H₀: no difference) and an alternative (H₁: treatment is better)
  • P-value — probability of observing the result (or a more extreme one) under H₀. Typically reject H₀ if p < 0.05.
  • Confidence interval — range where true value likely falls (typically 95%)
  • Statistical significance — result unlikely due to chance
  • Power analysis — determine required sample size before experiment

Sample Size Calculation

n per variant = 2 · (z₁₋α/₂ + z₁₋β)² · σ² / δ²

Where:

  • z₁₋α/₂ — z-score for the significance level (1.96 for 95%)
  • z₁₋β — z-score for power (0.84 for 80% power)
  • σ² — variance of the metric
  • δ — minimum detectable effect (MDE)

Typical MDEs for streaming platforms:

  • Engagement metrics (watch hours): ±0.5%
  • Retention metrics (day-7 retention): ±0.2%
  • Revenue metrics (ARPU): ±1.0%
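A small calculator for the formula above, using scipy for the z-scores (the example numbers are illustrative):

from scipy.stats import norm

def required_sample_size(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-sided test on a difference of means."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return int(round(n))

# Example: metric std dev of 2.0 watch hours, detect a 0.05-hour absolute change
print(required_sample_size(sigma=2.0, mde=0.05))   # roughly 25,000 users per variant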

Traffic Allocation & Mutual Exclusion

Total Traffic (100%)
├── Layer: Ranking (40%)
│   ├── Experiment A: New model v3 (50% control / 50% treatment)
│   └── Experiment B: Feature expansion (50/50)
│       Note: A and B are mutually exclusive within this layer
├── Layer: UI (30%)
│   ├── Experiment C: Card size (33/33/33)
│   └── Experiment D: Autoplay threshold (50/50)
├── Layer: Notifications (20%)
│   └── Experiment E: Send time optimization
└── Holdout (10%)
    └── No experiments — clean baseline

Within a layer, experiments are mutually exclusive. Across layers, they are orthogonal (independent). This is Google's overlapping experiments architecture.

Variance Reduction Techniques

Raw metrics have high variance (some users binge 8 hours, most watch 30 minutes):

CUPED (Controlled-experiment Using Pre-Experiment Data):

Ŷ_cuped = Y − θ(X − X̄), where X is the user's pre-experiment value of the metric and θ = cov(X, Y) / var(X). Reduces variance by 30-50%.
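A minimal CUPED adjustment in numpy, assuming each user's pre-experiment value of the same metric is available (the data here is synthetic):

import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    # Y_cuped = Y - theta * (X - mean(X)), theta = cov(X, Y) / var(X)
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(7)
x_pre = rng.gamma(2.0, 2.0, size=10_000)            # pre-experiment watch hours
y = 0.3 * x_pre + rng.normal(0, 1.0, size=10_000)   # in-experiment watch hours
y_adj = cuped_adjust(y, x_pre)
print(f"variance reduction: {1 - y_adj.var() / y.var():.0%}")   # roughly 40% on this toy data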

Stratified sampling: Bucket users by activity level, compute within-stratum estimates.

Winsorization: Cap extreme values at 99th percentile to reduce outlier influence.

Multiple Testing Correction

With hundreds of experiments, false positives are inevitable:

  • Bonferroni: test each hypothesis at α/m (conservative)
  • Benjamini-Hochberg (FDR control): rank p-values ascending and reject all hypotheses up to the largest i where p₍ᵢ₎ ≤ (i/m)·α (a short sketch follows)
  • Always Valid Sequential Testing — allows peeking at results without inflating false positives
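A short sketch of the Benjamini-Hochberg step-up procedure on a handful of p-values:

import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling FDR at alpha."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    ranked = p[order]
    thresholds = alpha * (np.arange(1, len(p) + 1) / len(p))
    below = ranked <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0  # largest i with p_(i) <= (i/m)*alpha
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))  # rejects the first two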

Guardrail Metrics

Every experiment must monitor guardrails — things that must NOT degrade:

| Guardrail | Threshold | Why |
| --- | --- | --- |
| App crash rate | +0.0% | Reliability |
| Page load latency p99 | +50ms | Performance |
| Error rate | +0.1% | Stability |
| Customer support contacts | +5% | UX quality |
| Subscription cancellation rate | +0.5% | Revenue |

If any guardrail is violated, auto-pause the experiment.

Metrics for Experiments

  • CTR — click through rate
  • Conversion rate — purchases / impressions
  • Revenue / ARPU — average revenue per user
  • ROAS — Return on Ad Spend (revenue / ad spend)
  • Engagement — time spent, interactions, completion rate
  • Retention — day-1, day-7, day-28 return rate

9. Gen AI / LLM / Agents

LLM Core Topics

  • Transformers — the architecture behind modern LLMs. The self-attention mechanism computes relevance between all token pairs: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Encoder-decoder (T5, BART), decoder-only (GPT, Llama), encoder-only (BERT).
  • Embeddings — convert text/entities to dense vectors. Sentence transformers (all-MiniLM, E5, BGE) produce fixed-size embeddings for semantic similarity.
  • Tokenization — BPE (Byte-Pair Encoding), SentencePiece, tiktoken. Subword tokenization balances vocabulary size and coverage.
  • Vector databases — store and search embeddings at scale. Pinecone, Weaviate, Milvus, Qdrant, pgvector, Chroma. Support approximate nearest neighbor (ANN) search with HNSW or IVF indexes.
  • RAG (Retrieval Augmented Generation) — retrieve relevant context from a knowledge base, inject into LLM prompt, then generate answer. Reduces hallucination and keeps knowledge current without retraining.
  • Prompt engineering — few-shot examples, chain-of-thought reasoning, system prompts, output formatting. Temperature controls randomness.
  • Fine-tuning — adapt a pre-trained LLM to a specific domain or task. LoRA (Low-Rank Adaptation) and QLoRA achieve this efficiently without full parameter updates.
  • Evaluation — BLEU, ROUGE (automated), human eval, LLM-as-judge, factual accuracy, faithfulness to retrieved context.
  • Hallucination — model generates plausible but incorrect information. Mitigate with RAG, grounding, and retrieval verification.
  • Guardrails — input/output validation, content filtering, PII detection. Libraries: Guardrails AI, NeMo Guardrails.
  • Agents — LLMs that can take actions using tools. ReAct pattern: Reason about what to do → Act using a tool → Observe the result → Repeat.
  • Tool calling / Function calling — LLM generates structured output to invoke external functions (APIs, databases, code interpreters).
  • Semantic search — search by meaning rather than keywords using embedding similarity.
  • NL→SQL systems — convert natural language questions to SQL queries. Text2SQL with LLMs + schema awareness.
  • Knowledge graphs + LLM — structured data enhancing LLM reasoning. GraphRAG combines graph traversal with retrieval.

RAG Architecture

graph TD
    A[User Question] --> B[Embedding Model]
    B --> C[Vector Search<br/>in Vector DB]
    C --> D[Top-K Relevant<br/>Documents]
    D --> E[Construct Prompt<br/>Question + Context]
    E --> F[LLM]
    F --> G[Answer]

    style C fill:#ff9800,color:#fff
    style F fill:#4caf50,color:#fff

RAG Pipeline Details

  1. Indexing Phase (offline):
     • Chunk documents (512-1024 tokens, with overlap)
     • Generate embeddings per chunk
     • Store in a vector database with metadata
  2. Query Phase (online):
     • Embed the user query
     • Retrieve top-K similar chunks (cosine similarity)
     • (Optional) Rerank retrieved chunks with a cross-encoder
     • Inject context + query into the LLM prompt
     • Generate the answer with citations (a minimal sketch follows)
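A minimal retrieval-and-prompt sketch of the query phase; embed() here is a random-vector placeholder standing in for a real sentence-embedding model (e.g. a sentence-transformers model), and the chunks and query are made up.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: swap in a real sentence-embedding model; with a real model,
    # semantically similar text maps to nearby vectors
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

chunks = [
    "Q2 ad revenue grew 14% driven by the ad-supported tier.",
    "Churn among annual subscribers fell to 2.1% in March.",
    "The new ranking model lifted quality watch hours by 1.8%.",
]
index = np.stack([embed(c) for c in chunks])           # offline indexing phase

query = "How did the ad-supported tier perform last quarter?"
scores = index @ embed(query)                          # cosine similarity (unit vectors)
top = [chunks[i] for i in np.argsort(-scores)[:2]]     # retrieve top-K chunks

prompt = "Answer using only the context below.\n\nContext:\n- " + "\n- ".join(top) + f"\n\nQuestion: {query}"
print(prompt)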

Reporting Agent Architecture

graph TD
    A[User: Natural Language Query] --> B[Agent / Orchestrator]
    B --> C{Route}
    C -->|SQL Query| D[NL-to-SQL Engine]
    C -->|Dashboard| E[Visualization Tool]
    C -->|Analysis| F[Analytics Engine]
    D --> G[Data Warehouse]
    G --> H[Results]
    H --> I[LLM Summarization]
    I --> J[Response to User]

This is increasingly common on streaming platforms — analysts ask questions in natural language and get SQL-backed answers with visualizations.


10. SQL & Data Modeling

SQL Deep Dive

  • Joins — INNER (matching rows), LEFT (all from left + matching right), RIGHT (all from right + matching left), FULL OUTER (all rows), CROSS (cartesian product). In distributed systems, join order matters enormously for performance.
  • Group by + Aggregations — SUM, COUNT, AVG, MIN, MAX. GROUP BY with HAVING for filtered aggregates.
  • Window functions — the most powerful SQL feature for analytics:
  • ROW_NUMBER() — sequential numbering within partition
  • RANK() / DENSE_RANK() — rank with ties handling
  • LAG() / LEAD() — access previous/next row
  • SUM() OVER (ORDER BY ... ROWS BETWEEN ...) — running totals
  • NTILE(n) — divide into n equal groups
  • CTEs (Common Table Expressions) — WITH clause for readable, composable queries. Recursive CTEs for hierarchical data.
  • Indexing — B-tree (default, range queries), hash (equality lookups), GiST/GIN (full-text search, arrays). Index selection is critical for query performance.
  • Query optimization — EXPLAIN ANALYZE, predicate pushdown, partition pruning, broadcast hints.

Data Modeling for Analytics

Star Schema — central fact table surrounded by dimension tables. Denormalized for query performance.

graph TD
    F[Fact: Impressions / Plays] --> D1[Dim: User]
    F --> D2[Dim: Content]
    F --> D3[Dim: Device]
    F --> D4[Dim: Time]
    F --> D5[Dim: Campaign]
    F --> D6[Dim: Geography]

  • Fact table — stores measurable events (impressions, plays, clicks, conversions) with foreign keys to dimensions and numeric measures.
  • Dimension table — descriptive attributes (user demographics, content metadata, device info). Slowly changing dimensions (SCD Type 1/2) for tracking history.
  • OLAP vs OLTP — analytics (column-oriented, aggregations, full scans) vs transactions (row-oriented, lookups, updates).

11. KPIs / Objective Functions

Choosing the right objective function for each system is critical:

| System | Objective | Why |
| --- | --- | --- |
| CTR model | Log Loss (Binary Cross-Entropy) | Penalizes confident wrong predictions, produces calibrated probabilities |
| Content ranking | NDCG (Normalized Discounted Cumulative Gain) | Measures quality of ordered list, weights top positions higher |
| Recommendation | MAP / Recall@K | Evaluates relevance in top-K results |
| Audience segmentation | Silhouette Score / Calinski-Harabasz | Measures cluster quality and separation |
| Attribution | Incrementality (causal lift) | Measures true causal impact, not just correlation |
| Budget pacing | Spend vs target deviation | Minimizes under/over-delivery |
| Identity resolution | Precision / Recall / F1 of matches | Correct device-to-user mapping accuracy |
| Engagement | Composite: quality watch hours + diversity | Prevents gaming a single metric (Goodhart's Law) |

The North Star Metric Problem

Streaming platforms have many metrics that often conflict:

| Metric | Optimizes For | Risk |
| --- | --- | --- |
| Watch hours | Engagement | Autoplay addiction, low-quality binges |
| DAU / MAU | Retention | Doesn't capture depth |
| Content starts | Discovery | Doesn't measure satisfaction |
| Completion rate | Satisfaction | Penalizes long content |
| Revenue / ARPU | Business | May sacrifice long-term engagement |

Solution: define a composite metric, e.g. NorthStar = w₁ · QualityWatchHours + w₂ · ContentDiversity, where QualityWatchHours filters out background/autoplay — only counting engaged viewing.


12. MLOps / Production ML

Core MLOps Concepts

  • Model deployment — serving predictions via API (real-time) or batch job (scheduled)
  • Batch vs real-time inference — precompute predictions for all users nightly vs compute on-demand per request
  • Feature store — consistent features for training and serving (prevents skew)
  • Model versioning — track model artifacts, hyperparameters, training data version, metrics. Tools: MLflow, Weights & Biases, Neptune.
  • Monitoring
  • Prediction quality — track accuracy/AUC on incoming data vs hold-out
  • Data drift — input feature distributions shift over time (KL divergence, PSI; a PSI sketch follows this list). Seasonal changes, new content catalog.
  • Concept drift — relationship between features and target changes. User behavior evolves.
  • Latency & throughput — p50/p95/p99 inference latency, requests per second
  • Retraining pipelines — automated model refresh on new data. Can be scheduled (daily/weekly) or triggered by drift detection.
  • Deployment strategies
  • Shadow deployment — run new model alongside old, log predictions but don't serve to users. Compare outputs.
  • Canary deployment — roll out to 1-5% of traffic first. Monitor guardrails. Gradually increase.
  • Blue-green — two identical environments. Switch traffic entirely from blue (old) to green (new).
  • A/B model testing — statistically rigorous comparison on live traffic with proper sample size.
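A minimal Population Stability Index (PSI) sketch for the data-drift check above; a common rule of thumb treats PSI above roughly 0.2 as meaningful drift, though the threshold is a judgment call.

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current feature sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-9   # widen outer edges so every value lands in a bin
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid division by zero / log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(3)
train_dist = rng.normal(0, 1, 50_000)        # feature distribution at training time
serving_dist = rng.normal(0.3, 1.2, 50_000)  # feature distribution in production this week
print(f"PSI = {psi(train_dist, serving_dist):.3f}")  # above ~0.2 would trigger a retrain alert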

MLOps Architecture

graph TD
    A[Data Pipeline] --> B[Feature Store]
    B --> C[Training Pipeline<br/>GPU Cluster]
    C --> D[Model Registry<br/>MLflow / W&B]
    D --> E[Offline Evaluation<br/>AUC, NDCG]
    E --> F{Deployment}
    F -->|Shadow| G1[Log Only]
    F -->|Canary| G2[5% Traffic]
    F -->|A/B Test| G3[Experiment]
    F -->|Full| G4[100% Traffic]
    G1 --> H[Monitoring<br/>Drift + Latency + Quality]
    G2 --> H
    G3 --> H
    G4 --> H
    H -->|"Drift Detected"| I[Auto-Retrain]
    I --> C

13. Data Quality & Observability

Why Data Quality Is the Biggest Risk

ML models are only as good as their input data. Common issues:

| Issue | Impact | Detection |
| --- | --- | --- |
| Missing events (SDK bug) | Undercounting → wrong experiment results | Volume anomaly detection |
| Duplicate events (retry storms) | Overcounting → inflated metrics | Dedup by event_id |
| Schema changes (unannounced) | Pipeline breakages | Schema registry enforcement |
| Clock skew (device time wrong) | Feature computation errors | Server-side timestamp validation |
| Bot/fraud traffic | Pollutes training data | Behavioral anomaly detection |
| Late-arriving data | Incomplete aggregations | Watermark-based processing |

Data Quality Framework

flowchart LR
    Ingest["📥 Ingestion"] --> Schema["📋 Schema<br/>Validation"]
    Schema --> Volume["📊 Volume<br/>Checks"]
    Volume --> Freshness["⏱️ Freshness<br/>SLAs"]
    Freshness --> Distribution["📈 Distribution<br/>Checks"]
    Distribution --> Alert["🚨 Alert &<br/>Circuit Breaker"]

    style Alert fill:#f44336,color:#fff

Tools: Great Expectations, dbt tests, Monte Carlo, Anomalo, Soda, or custom solutions on statistical process control.

Circuit Breakers: If input data quality drops below a threshold, automatically stop the ML training pipeline from consuming bad data. Better to serve a slightly stale model than one trained on corrupted data.

Data Lineage & Cataloging

  • Lineage tracking: From raw event → derived table → feature → model → experiment metric. Tools: OpenLineage, DataHub, Amundsen.
  • Data catalog: Searchable inventory of all datasets with ownership, freshness, schema, and usage statistics.
  • SLA monitoring: Each critical table has a freshness SLA. Breaches trigger alerts.

14. End-to-End Architecture

Putting It All Together

┌─────────────────────────────────────────────────────────────────────────┐
│                              CLIENT DEVICES                             │
│          (Smart TV, Mobile, Web, Gaming Console, Set-Top Box)           │
└─────────────┬───────────────────────────────────────────┬───────────────┘
              │ events                                     │ API calls
              ▼                                            ▼
┌─────────────────────────────┐          ┌───────────────────────────────┐
│  EVENT INGESTION            │          │  API GATEWAY                  │
│  - Schema Registry          │          │  - Experiment Assignment      │
│  - Kafka Clusters           │          │    (deterministic hash)       │
│    (multi-region)           │          │  - Recommendation API:        │
└──────────────┬──────────────┘          │    retrieve → rank →          │
               │                         │    rerank → serve             │
               ▼                         └──────────────┬────────────────┘
┌─────────────────────────────┐                         │
│  STREAM PROCESSING          │                         ▼
│  (Flink / Spark Streaming)  │          ┌───────────────────────────────┐
│  - Sessionization           │          │  ML SERVING                   │
│  - Real-time features       │          │  - Retrieval (ANN index)      │
│  - Real-time metrics        │          │  - Ranking (GPU/CPU)          │
│  - Anomaly detection        │          │  - Feature Store (Redis)      │
└──────────────┬──────────────┘          │  - Model Registry             │
               │                         └───────────────────────────────┘
               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                DATA LAKE                                │
│   Raw Events (Parquet) · Sessionized Events · Feature Tables ·          │
│   Experiment Results                                                    │
│                                                                         │
│   Storage: S3 / ADLS / GCS    Format: Delta Lake / Iceberg / Hudi       │
│   Compute: Spark / Trino / dbt                                          │
└─────────────────────────────────────────────────────────────────────────┘

Technology Choices — Practical Summary

| Component | Open Source | Managed Service |
| --- | --- | --- |
| Event streaming | Apache Kafka | Confluent Cloud, Amazon MSK, Azure Event Hubs |
| Stream processing | Apache Flink, Spark Structured Streaming | Kinesis Data Analytics, Dataflow |
| Data lake storage | HDFS, MinIO | S3, ADLS Gen2, GCS |
| Table format | Delta Lake, Apache Iceberg, Apache Hudi | Databricks, Snowflake |
| Batch compute | Apache Spark, Trino/Presto | Databricks, EMR, Synapse, BigQuery |
| Feature store | Feast, Hopsworks | Tecton, SageMaker Feature Store, Vertex AI |
| ML training | PyTorch, TensorFlow, XGBoost | SageMaker, Vertex AI, Azure ML |
| Model serving | TF Serving, Triton, BentoML | SageMaker Endpoints, Vertex AI |
| Experiment platform | Custom (most common) | Eppo, Statsig, LaunchDarkly, Optimizely |
| Data quality | Great Expectations, dbt tests, Soda | Monte Carlo, Anomalo |
| Orchestration | Apache Airflow, Dagster, Prefect | MWAA, Cloud Composer, Astronomer |
| Data catalog | DataHub, Amundsen, OpenMetadata | Collibra, Alation |

Learning Path

Phase 1: Foundations

  • ML basics — regression, classification, trees, neural networks
  • Metrics — precision, recall, F1, AUC, NDCG
  • Feature engineering techniques
  • SQL deep dive — window functions, CTEs, optimization

Phase 2: Domain Knowledge

  • Ads systems — CTR, ranking, auctions, attribution
  • Recommendation systems — collaborative & content-based filtering, two-tower, sequence models
  • Identity resolution — deterministic, probabilistic, graph-based
  • A/B testing — hypothesis testing, power analysis, CUPED

Phase 3: Infrastructure

  • Data engineering — Spark, Flink, Kafka, data lakes, Iceberg/Delta
  • ML system design patterns
  • Feature store architecture
  • MLOps & deployment strategies

Phase 4: GenAI

  • RAG architecture and implementation
  • Agents & tool calling (ReAct pattern)
  • NL to SQL
  • Embeddings & vector databases
  • LLM evaluation and guardrails

Priority Topics (Focus These First)

If limited on time, prioritize:

  1. ML basics — regression, classification, trees, metrics
  2. Feature engineering — the #1 differentiator in applied ML
  3. Recommendation systems — two-tower, ranking, cold start
  4. A/B testing & experimentation — fundamental to data-driven culture
  5. Data pipelines — Spark, Kafka, ETL, data lakes
  6. Identity resolution / device graph — unique cross-device challenge
  7. ML system design — end-to-end thinking
  8. RAG / LLM / Agents — the GenAI wave
  9. SQL & data modeling — the universal data language
  10. Metrics / KPIs — knowing what to optimize

This is a living document — I'll keep going deeper into each topic as I learn more. Stay focused while studying — try FocusTu for deep-work focus sessions. 🧘