Data Infrastructure for AI & Experimentation at Scale

A comprehensive deep-dive into the data backbone powering ML, personalization, experimentation, and GenAI on modern streaming platforms

A complete deep-dive into Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI + Experimentation — the foundational pillars behind modern streaming and content platforms.


What Does a Data & AI Platform Power?

Understanding the bigger picture — on any large-scale streaming platform (video, music, podcasts, live content), the data infrastructure powers:

  • Identity systems — who is the user across devices
  • Device graph — connect smart TV, mobile, tablet, laptop to same person/household
  • Audience platform — build segments like "sports lovers" or "sci-fi bingers"
  • Recommendation & personalization — what to show each user on the home feed
  • Measurement & attribution — did a promotion lead to engagement or subscription?
  • Experimentation — A/B testing thousands of changes simultaneously
  • Reporting & insights — dashboards + GenAI agents for analytics
  • Ad systems — targeting, bidding, measurement for ad-supported tiers

This is Applied ML + Data Platform + Experimentation + GenAI — not pure ML research.


Table of Contents

  1. The Event Backbone — Ingestion & Streaming
  2. Machine Learning Fundamentals
  3. The Feature Platform
  4. Ads / Audience / Recommendation Systems
  5. Identity Resolution & Device Graph
  6. Data Engineering at Scale
  7. ML System Design
  8. The Experimentation Platform
  9. Gen AI / LLM / Agents
  10. SQL & Data Modeling
  11. KPIs / Objective Functions
  12. MLOps / Production ML
  13. Data Quality & Observability
  14. End-to-End Architecture

1. The Event Backbone — Ingestion & Streaming

What Needs to Be Captured

On a streaming platform, user interactions generate a staggering volume of telemetry:

| Event Type | Examples | Volume (large platform) |
| --- | --- | --- |
| Playback events | play, pause, seek, buffer, complete | ~50B/day |
| Navigation events | impression, click, scroll, search | ~30B/day |
| Engagement signals | like, save, share, add-to-list | ~2B/day |
| Device/context | device type, OS, network, geo, time | Attached to every event |
| Content metadata | genre, duration, cast, release date | Updated in batch |

Streaming Architecture — Apache Kafka at the Core

┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐
│  Client SDKs │────▶│  Kafka Clusters  │────▶│ Stream Processors  │
│  (mobile,    │     │  (partitioned by │     │ (Flink / Spark     │
│   web, TV,   │     │   user_id or     │     │  Structured        │
│   console)   │     │   device_id)     │     │  Streaming)        │
└──────────────┘     └────────┬─────────┘     └─────────┬──────────┘
                              │                         │
                              ▼                         ▼
                     ┌────────────────┐       ┌────────────────────┐
                     │  Raw Event     │       │  Derived Streams   │
                     │  Data Lake     │       │  (sessionized,     │
                     │  (S3/ADLS/GCS) │       │   enriched,        │
                     │                │       │   aggregated)      │
                     └────────────────┘       └────────────────────┘

Key Design Decisions

Schema Registry (Avro/Protobuf): Every event must conform to a registered schema. Without this, downstream consumers break constantly. Use Apache Avro with Confluent Schema Registry or Protobuf definitions checked into version control. Schema evolution rules (backward/forward compatibility) are critical.

Partitioning Strategy: Partition by user_id for user-centric analytics (session analysis, feature computation). Partition by content_id for content-centric tasks (popularity counters, trending). Some platforms maintain dual topics — one partitioned each way.
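A minimal sketch of keyed production with the confluent-kafka Python client (broker addresses and the topic name here are illustrative): keying each message by user_id makes Kafka hash all of a user's events to the same partition, which preserves per-user ordering for downstream sessionization.

import json
from confluent_kafka import Producer

# Illustrative config; the broker list and topic name are placeholders
producer = Producer({"bootstrap.servers": "kafka-1:9092,kafka-2:9092"})

def emit_event(event: dict) -> None:
    # Key by user_id so all of this user's events hash to the same partition
    producer.produce(
        "playback-events",
        key=event["user_id"].encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
    )

emit_event({"user_id": "u_abc123", "event_type": "playback.started"})
producer.flush()  # block until outstanding messages are delivered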

Exactly-Once Semantics: Kafka supports idempotent producers and transactional writes. For streaming jobs consuming from Kafka → writing to a data lake, use Flink checkpointing with exactly-once sink connectors to avoid duplicate or missing events.

Backpressure & Dead Letter Queues: When a consumer can't keep up, events must be buffered, not dropped. Dead letter queues capture malformed events for later reprocessing rather than silently discarding them.

Event Schema Design — Example

{
  "event_id": "uuid-v4",
  "event_type": "playback.started",
  "timestamp_ms": 1712419200000,
  "user_id": "u_abc123",
  "session_id": "s_xyz789",
  "content_id": "c_movie_456",
  "device": {
    "type": "smart_tv",
    "os": "tvOS",
    "app_version": "5.12.0"
  },
  "context": {
    "page": "home_feed",
    "row": "continue_watching",
    "position": 3,
    "algorithm": "collab_filter_v2",
    "experiment_id": "exp_2026_q2_reco_v3",
    "variant": "treatment_b"
  },
  "content_metadata": {
    "genre": ["sci-fi", "thriller"],
    "duration_sec": 7200,
    "release_year": 2025
  }
}

Why context.algorithm and context.experiment_id matter: They tie every interaction back to the model and experiment that produced the recommendation. Without this, you can't measure model performance or run valid A/B tests.


2. Machine Learning Fundamentals

Supervised Learning

  • Regression — predict continuous values (e.g., predicted watch time)
  • Linear Regression — fit a line using ordinary least squares. Assumes a linear relationship between features and target. Closed-form solution: β̂ = (XᵀX)⁻¹Xᵀy
  • Polynomial Regression — extend linear regression with polynomial features for non-linear relationships
  • Classification — predict categories (e.g., will user click?)
  • Logistic Regression — binary classification using the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). Despite the name, it's a classifier, not regression. Outputs probabilities, threshold at 0.5 by default.
  • Decision Trees — recursive splitting on feature thresholds to minimize impurity (Gini or entropy). Human-interpretable but prone to overfitting.
  • Random Forest — ensemble of decision trees trained on bootstrap samples (bagging). Each tree sees a random subset of features. Reduces variance while maintaining low bias. Typically 100-500 trees.
  • Gradient Boosting (XGBoost, LightGBM, CatBoost) — sequential trees where each new tree corrects the errors of the ensemble so far. Minimizes a loss function via gradient descent in function space. XGBoost uses a regularized objective: Obj = Σᵢ l(ŷᵢ, yᵢ) + Σₖ Ω(fₖ), with Ω(f) = γT + ½λ‖w‖². LightGBM is faster with histogram-based splitting and leaf-wise growth. (A minimal training sketch follows this list.)
  • Neural Networks — layers of neurons with non-linear activations (ReLU, sigmoid, tanh). Universal approximators — can learn any function given sufficient depth and width.
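As a concrete illustration of the gradient boosting bullet above, here is a minimal sketch on synthetic data using scikit-learn's GradientBoostingClassifier (production systems would typically use XGBoost or LightGBM on real impression/click logs):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for (features, clicked) pairs
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Sequential trees, each one fitting the residual errors of the ensemble so far
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
print(f"Validation AUC: {roc_auc_score(y_val, val_scores):.3f}")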

Unsupervised Learning

  • Clustering (KMeans) — partition data into K groups by minimizing within-cluster sum of squares. Sensitive to initialization (use KMeans++). Requires choosing K (elbow method, silhouette score).
  • DBSCAN — density-based clustering. Finds arbitrarily shaped clusters. No need to specify K. Parameters: epsilon (neighborhood radius), min_samples.
  • PCA (Principal Component Analysis) — project data onto orthogonal directions of maximum variance. Used for dimensionality reduction, visualization, and denoising. Keep components that explain >95% variance.
  • Embeddings — dense vector representations of entities (users, items, words). Learned via neural networks. Similar entities have similar vectors in embedding space.
  • Similarity search — find nearest neighbors in embedding space using cosine similarity, Euclidean distance, or dot product. Approximate Nearest Neighbor (ANN) algorithms: FAISS, ScaNN, HNSW for sub-linear search.

Important Concepts

  • Bias vs Variance tradeoff — high bias = underfitting (model too simple), high variance = overfitting (model too complex). Goal: minimize total error = bias² + variance + irreducible noise.
  • Overfitting — model memorizes training data, performs poorly on unseen data. Signs: training accuracy >> validation accuracy. Fix: more data, regularization, simpler model, early stopping, dropout.
  • Regularization
  • L1 (Lasso): adds λ Σ|wᵢ| to the loss. Encourages sparsity — some weights become exactly zero. Good for feature selection.
  • L2 (Ridge): adds λ Σwᵢ² to the loss. Shrinks weights evenly toward zero. Prevents any single feature from dominating.
  • Elastic Net: combines L1 + L2 regularization.
  • Dropout (neural networks): randomly zero out neurons during training. Acts as ensemble of sub-networks.
  • Cross Validation — evaluate model on multiple train/test splits. K-fold: split data into K parts, train on K-1, validate on 1, rotate. Stratified K-fold preserves class distribution.
  • Feature Engineering — creating useful input features from raw data. Often the #1 differentiator in applied ML. Examples: time-since-last-interaction, rolling averages, ratio features, cross features.
  • Feature Scaling — StandardScaler (zero mean, unit variance) or MinMaxScaler (0 to 1). Critical for distance-based algorithms (KNN, SVM) and gradient descent optimization.
  • Handling missing values — imputation (mean, median, mode), indicator columns ("is_missing"), or model-based (KNN imputation, iterative imputer). Never silently drop rows in production.
  • Class imbalance — when positive class is rare (e.g., 1% conversion rate). Techniques: SMOTE (synthetic oversampling), class weights in loss function, undersampling majority class, focal loss.

Metrics

| Metric | What It Measures | When to Use |
| --- | --- | --- |
| Accuracy | % of correct predictions | Balanced classes only |
| Precision | Of predicted positives, how many are actually positive | When false positives are costly (spam detection) |
| Recall | Of actual positives, how many did we catch | When false negatives are costly (fraud detection) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, need balance |
| ROC AUC | Area under ROC curve — tradeoff between TPR and FPR | Binary classification, threshold-independent |
| PR AUC | Area under Precision-Recall curve | Highly imbalanced datasets |
| Log Loss | Penalizes confident wrong predictions | CTR prediction, calibrated probabilities |
| RMSE | Root Mean Squared Error | Regression, penalizes large errors |
| MAE | Mean Absolute Error | Regression, robust to outliers |
| NDCG | Normalized Discounted Cumulative Gain | Ranking quality |
| MAP | Mean Average Precision | Information retrieval |
| MRR | Mean Reciprocal Rank | First relevant result position |
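A small sketch of computing several of these metrics with scikit-learn on toy labels (the arrays below are made up for illustration):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, log_loss, ndcg_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.65, 0.4, 0.3, 0.05, 0.8, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_prob))
print("log_loss: ", log_loss(y_true, y_prob))

# NDCG expects one row per ranked list: true relevance vs. predicted scores
relevance = np.array([[3, 2, 3, 0, 1, 2]])
scores = np.array([[0.9, 0.8, 0.4, 0.3, 0.2, 0.1]])
print("ndcg:     ", ndcg_score(relevance, scores))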

3. The Feature Platform

The Feature Engineering Problem

ML models need features. Features come from raw events. The challenge:

  • Training needs features computed over historical data (batch)
  • Serving needs features computed in real-time (online)
  • The features must be identical — otherwise you get training-serving skew, the silent killer of ML systems

Feature Store Architecture

flowchart TD
    Raw["📊 Raw Events<br/>(Kafka / Data Lake)"] --> Batch["⏱️ Batch Pipeline<br/>(Spark / dbt)"]
    Raw --> Stream["⚡ Stream Pipeline<br/>(Flink / Spark Streaming)"]

    Batch --> Offline["🗄️ Offline Store<br/>(Hive / Delta Lake / BigQuery)"]
    Stream --> Online["⚡ Online Store<br/>(Redis / DynamoDB / Bigtable)"]

    Offline --> Training["🤖 Model Training"]
    Online --> Serving["🔮 Model Serving"]

    Offline -.->|"point-in-time join"| Training
    Batch -.->|"backfill"| Online

    style Online fill:#ff9800,color:#fff
    style Offline fill:#2196f3,color:#fff

Types of Features on a Streaming Platform

User-Level Features (computed per user)

user_total_watch_hours_7d           rolling 7-day watch time
user_genre_affinity_vector          softmax over genre engagement
user_avg_session_length_30d         average session in last 30 days
user_skip_rate_7d                   fraction of content abandoned < 30s
user_search_to_play_ratio_7d       how often search leads to a play
user_time_of_day_distribution       histogram of activity by hour
user_device_preference              primary device type
user_content_completion_rate_30d    fraction of content watched to end
user_days_since_signup              account age
user_subscription_tier              free / basic / premium

Content-Level Features (computed per item)

content_total_plays_7d              popularity signal
content_avg_completion_rate         quality signal
content_genre_tags                  categorical
content_release_recency_days        freshness
content_avg_rating                  explicit quality
content_play_to_impression_ratio    CTR proxy

Cross Features (user × content interaction)

user_has_watched_same_genre_7d     — genre relevance
user_watched_same_creator          — creator affinity
user_x_content_genre_overlap       — cosine similarity of genre vectors

Point-in-Time Joins — The Most Critical Concept

When training a model, you must join features as they existed at the time of the event, not as they exist now. Otherwise you leak future information into training data.

# WRONG — uses current features for historical events
features = feature_store.get_latest("user_123")

# RIGHT — uses features as-of the event timestamp
features = feature_store.get_as_of("user_123", timestamp="2026-03-15T14:00:00Z")

Frameworks like Feast, Tecton, and Hopsworks handle point-in-time correctness automatically when you define feature views with timestamps.
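A minimal illustration of a point-in-time join using pandas merge_asof (toy data): each event row picks up the most recent feature value at or before its timestamp, never a later one.

import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "event_ts": pd.to_datetime(["2026-03-10", "2026-03-20", "2026-03-15"]),
    "label": [1, 0, 1],
}).sort_values("event_ts")

features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "feature_ts": pd.to_datetime(["2026-03-01", "2026-03-15", "2026-03-01"]),
    "watch_hours_7d": [4.2, 9.1, 1.3],
}).sort_values("feature_ts")

# direction="backward": take the latest feature row with feature_ts <= event_ts
training_set = pd.merge_asof(
    events, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
print(training_set)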

Online Feature Serving — Latency Matters

Model inference at serving time must complete within 50-100ms (including feature fetch + model forward pass). This means:

  • Redis Cluster or DynamoDB for sub-5ms feature lookups (a minimal Redis lookup sketch follows this list)
  • Feature caching at the application layer for session-stable features
  • Batch precomputation of expensive features (e.g., run nightly, push to online store)
  • Feature embedding lookups — precomputed user/content embeddings stored in vector stores
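A minimal sketch of an online lookup against Redis with redis-py; the hostname and key naming convention are illustrative, and in practice the hash would be written by the batch/stream pipelines rather than inline like this:

import redis

r = redis.Redis(host="feature-store.internal", port=6379, decode_responses=True)

# Written by the batch/stream pipelines (illustrative key naming)
r.hset("features:user:u_abc123", mapping={
    "watch_hours_7d": 12.5,
    "skip_rate_7d": 0.18,
    "device_preference": "smart_tv",
})

# Read at request time; a single HGETALL keeps the fetch within a few milliseconds
user_features = r.hgetall("features:user:u_abc123")
print(user_features)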

4. Ads / Audience / Recommendation Systems

Ads System Architecture

The full flow of an ad system on a streaming platform:

graph LR
    A[User] --> B[Device]
    B --> C[Identity Resolution]
    C --> D[Audience Segment]
    D --> E[Ad Selection]
    E --> F[Auction]
    F --> G[Impression]
    G --> H[Click]
    H --> I[Conversion]
    I --> J[Attribution]
    J --> K[Reporting]

How Ad Auctions Work

When a user opens a streaming app and hits an ad break, the platform runs an auction in real-time:

  1. Bid Request — platform sends user context (anonymized) to demand-side platforms or internal ad server
  2. Bid Response — advertisers bid for the impression, specifying CPM (cost per 1000 impressions)
  3. Auction — typically a second-price auction (winner pays $0.01 above second-highest bid) or increasingly first-price auction
  4. Ad Selection — rank by eCPM = pCTR × bid × 1000 (for cost-per-click bids), also factoring in relevance and user experience (a toy worked example follows this list)
  5. Impression & Tracking — serve the ad, fire tracking pixels for viewability, completion, clicks
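A toy worked example of the selection and pricing step (pure Python, made-up numbers), using CPM bids and second-price clearing as described above:

# Toy second-price auction over CPM bids (illustrative numbers).
# For cost-per-click bids, rank by eCPM = pCTR * bid * 1000 instead.
bids = {"advertiser_a": 8.50, "advertiser_b": 12.00, "advertiser_c": 10.25}  # CPM in $

ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
winner, winning_bid = ranked[0]
second_price = ranked[1][1]

clearing_price = second_price + 0.01  # winner pays just above the runner-up
print(f"{winner} wins and pays ${clearing_price:.2f} CPM (bid ${winning_bid:.2f})")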

Models Used in Ads

  • CTR prediction (Click Through Rate) — probability user clicks on ad. Typically uses gradient boosted trees (XGBoost/LightGBM) or deep learning with embedding layers for sparse features. Training data: billions of impression-click pairs.
  • CVR prediction (Conversion Rate) — probability of purchase after click. Harder than CTR due to sparser signal and longer attribution windows.
  • Ranking models — order ads by expected value. eCPM = CTR × bid. Multi-objective ranking can also factor in user experience.
  • Lookalike modeling — given a seed audience (e.g., people who bought product X), find similar users in the broader population. Typically uses embedding similarity or supervised classification.
  • Audience segmentation — cluster users into meaningful groups using behavioral features. KMeans, DBSCAN, or learned segment embeddings.
  • Collaborative filtering — recommend based on similar users' behavior. Matrix factorization (ALS), neural collaborative filtering.
  • Content-based filtering — recommend based on item features (genre, tags, description embeddings).
  • Embeddings for users & items — learn dense vector representations via two-tower models or autoencoders. User embedding captures preferences, item embedding captures attributes.
  • Multi-armed bandits — explore vs exploit for ad/content selection. Thompson Sampling, Upper Confidence Bound (UCB), epsilon-greedy.
  • Budget pacing — spend advertiser budget evenly over campaign duration. Proportional-integral-derivative (PID) controllers are common.

Recommendation Model Architectures

Two-Tower Model (Retrieval Stage)

          User Tower                    Item Tower
    ┌─────────────────┐          ┌─────────────────┐
    │  user features  │          │  item features  │
    │  watch history  │          │  genre, tags    │
    │  demographics   │          │  popularity     │
    └────────┬────────┘          └────────┬────────┘
             │                            │
        [Dense Layers]              [Dense Layers]
             │                            │
             ▼                            ▼
      user_embedding                item_embedding
        (128-dim)                     (128-dim)
             │                            │
             └──────── dot product ───────┘
                          │
                          ▼
                  similarity score

  • Training: In-batch negatives or hard negative mining. Loss: softmax cross-entropy or sampled softmax.
  • Serving: Precompute all item embeddings → build an ANN index (FAISS, ScaNN, HNSW). At request time, compute the user embedding → ANN lookup for top-K candidates in <10ms (see the sketch after this list).
  • Scale: Retrieve ~500 candidates from millions of items in under 10ms.
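A brute-force numpy stand-in for the retrieval lookup (a production system would build an ANN index such as FAISS or ScaNN over the same precomputed item embeddings instead of scoring every item):

import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100_000, 128)).astype(np.float32)  # precomputed offline
user_embedding = rng.normal(size=128).astype(np.float32)              # computed per request

# Dot-product scores against every item; an ANN index approximates this in sub-linear time
scores = item_embeddings @ user_embedding
top_k = 500
candidates = np.argpartition(-scores, top_k)[:top_k]
candidates = candidates[np.argsort(-scores[candidates])]  # order the top-K by score
print(candidates[:10])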

Deep Ranking Model (Ranking Stage)

Takes ~500 candidates from retrieval and scores them:

Input Features:
  - User features (dense + sparse)
  - Item features (dense + sparse)
  - Cross features (user × item)
  - Context features (time, device, page)
       │
  [Embedding Layer — sparse features]
       │
  [Concatenation with dense features]
       │
  [Multi-Layer Perceptron or DCN-v2]
       │
  [Multi-Task Heads]
       ├── P(click)
       ├── P(watch > 50%)
       ├── P(complete)
       ├── P(like)
       └── P(add to list)
       │
  [Weighted combination → final score]

Multi-Task Learning is essential because optimizing for clicks alone leads to clickbait. The final score is a weighted combination of the task heads, e.g. score = w₁·P(click) + w₂·P(watch > 50%) + w₃·P(complete) + w₄·P(like) + w₅·P(add to list), with the weights set by business priorities.

Sequence Models for Watch History

User watch history is sequential — order matters:

  • Transformer-based: Self-attention over recent N items watched. Captures temporal patterns (binge-watching a series).
  • SASRec (Self-Attentive Sequential Recommendation): Causal attention mask on item sequence to predict next interaction.
  • GRU4Rec: RNN-based session recommendation. Lighter weight, still effective for session-based reco.

Content Cold Start — The Chicken-and-Egg Problem

New content has zero engagement signal. How do you recommend something nobody has watched?

| Strategy | How It Works | Limitations |
| --- | --- | --- |
| Content-based features | Use metadata (genre, cast, director, synopsis embedding) to find similar existing content | Ignores personal taste |
| Explore/exploit | Thompson Sampling or epsilon-greedy — intentionally show new content to a small % of users | Hurts short-term metrics |
| Contextual bandits | Use contextual features (user segment, time of day) to decide who sees new content | Requires fast feedback loop |
| Creator-based transfer | If a creator has a track record, use their historical performance as a prior | Only for known creators |
| Editorial boost | Curators manually promote content for initial impressions | Doesn't scale |

Best systems combine content-based warm-starting + explore/exploit for initial signals, then transition to collaborative filtering once enough data exists.
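A minimal Thompson Sampling sketch (numpy, synthetic play rates) for the explore/exploit piece: each new title keeps a Beta posterior over its play rate, and every impression goes to the title whose sampled play rate is highest.

import numpy as np

rng = np.random.default_rng(1)
true_play_rates = np.array([0.12, 0.05, 0.20])  # unknown to the system
alpha = np.ones(3)  # successes + 1 (Beta prior)
beta = np.ones(3)   # failures + 1

for _ in range(5_000):
    sampled = rng.beta(alpha, beta)      # one draw per title from its posterior
    choice = int(np.argmax(sampled))     # show the title with the highest sampled rate
    played = rng.random() < true_play_rates[choice]
    alpha[choice] += played
    beta[choice] += 1 - played

print("impressions per title:", (alpha + beta - 2).astype(int))
print("posterior mean play rate:", np.round(alpha / (alpha + beta), 3))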

Handling Bias in Recommendation Data

All training data is biased — users can only engage with what was shown.

Position Bias: Items in position 1 get more clicks regardless of relevance. Solution: train a position bias model separately and debias during training.

Selection Bias: Training data only contains items the model chose to show. Solution: inverse propensity scoring (IPS) — weight each training example by 1 / P(item was shown).

Popularity Bias: Popular items get shown more → get more engagement → appear more popular. Rich-get-richer feedback loop. Solution: diversity objectives + calibration to match user interest distribution.


5. Identity Resolution & Device Graph

The Fundamental Problem

How do you know that a smart TV, a mobile phone, a laptop, and a tablet belong to the same user or household? This is the identity resolution problem.

Methods

  • Deterministic matching — exact match on email, login, phone number, account ID. High precision, low recall. Requires authenticated sessions.
  • Probabilistic matching — statistical models using IP address, Wi-Fi network, location patterns, timing correlation. Lower precision, higher recall.
  • Graph-based identity resolution — build a graph of devices and signals, run connected components or community detection.
  • Embedding similarity — learn device/user embeddings from interaction patterns, cluster by cosine similarity.
  • Graph Neural Networks (GNN) — use message passing on the device graph for more sophisticated entity resolution.

Device Graph Structure

A graph where nodes are devices/users/households and edges represent observed relationships:

graph TD
    H[Household] --> TV[Smart TV]
    H --> M[Mobile Phone]
    H --> L[Laptop]
    H --> T[Tablet]
    TV -.->|same IP / WiFi| M
    M -.->|same login| L
    L -.->|co-occurrence| T

Key Graph Algorithms

  • Connected Components — find clusters of related devices. Simple BFS/DFS or union-find (see the sketch after this list). At scale, use graph-parallel frameworks (GraphX, Pregel) for distributed connected components.
  • PageRank — rank importance of nodes. The "primary device" in a household often has the highest PageRank.
  • Node Embeddings (Node2Vec) — random walks on the graph → skip-gram model → dense vector per node. Similar devices end up close in embedding space.
  • GraphSAGE — inductive GNN that learns node representations by aggregating features from local neighborhoods. Scales to large graphs because it samples neighborhoods rather than using the full graph.
  • Link Prediction — predict missing edges (e.g., does this phone belong to this household?). Train a classifier on node-pair features: common neighbors, Jaccard similarity, learned embeddings.
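A minimal union-find sketch for connected components over device-graph edges (in production this runs distributed over billions of edges, e.g. with GraphX; the node names below are made up):

def connected_components(edges):
    """Union-find over (node_a, node_b) edges; returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in edges:
        union(a, b)
    return {node: find(node) for node in parent}

edges = [("tv_1", "ip_9"), ("phone_7", "ip_9"), ("laptop_3", "login_42"), ("phone_7", "login_42")]
print(connected_components(edges))  # all four devices collapse into one household cluster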

Identity Graph at Scale

At streaming platform scale (100M+ households), the identity graph has:

  • Billions of nodes (devices, IPs, accounts, cookies)
  • Hundreds of billions of edges
  • Continuous updates — new devices appear, IPs rotate, accounts merge

This requires:

  • Graph databases: Neo4j, JanusGraph, Amazon Neptune, or custom solutions on Spark GraphX
  • Incremental graph updates — don't recompute from scratch; process new edges as they arrive
  • Privacy-compliant design — hash/anonymize PII, respect opt-outs, comply with GDPR/CCPA

6. Data Engineering at Scale

Core Concepts Deep Dive

  • ETL vs ELT — Traditional ETL transforms before loading to warehouse. Modern ELT loads raw data to lake first, transforms in-place using Spark or dbt. ELT is dominant because storage is cheap, and you preserve raw data.
  • Data Lakes — store raw data in any format (S3, ADLS Gen2, GCS). Use open table formats (Delta Lake, Apache Iceberg, Apache Hudi) for ACID transactions, time travel, schema evolution on the lake.
  • Data Warehouse — structured, optimized for analytics. Columnar storage, query optimization. Snowflake, BigQuery, Redshift, Synapse, Databricks SQL.
  • Lakehouse — hybrid: data lake storage + warehouse query performance. Delta Lake on Databricks, Apache Iceberg on Trino. Eliminates the need to maintain separate lake and warehouse.
  • Batch vs Streaming — batch processes data in scheduled chunks (hourly/daily). Streaming processes events as they arrive (Kafka + Flink). Lambda architecture runs both; Kappa architecture uses streaming for everything.
  • Apache Spark — distributed compute engine. RDDs → DataFrames → Structured Streaming. Key concepts: lazy evaluation, partitioning, shuffle, catalyst optimizer, adaptive query execution (AQE in Spark 3+).
  • Apache Flink — true stream processing with event-time semantics, watermarks, exactly-once guarantees. Better than Spark Streaming for low-latency use cases.
  • Presto / Trino — distributed SQL query engine for interactive analytics on data lakes. Federated queries across multiple data sources.
  • Parquet — columnar storage format. Efficient for analytics (read only needed columns). Supports predicate pushdown, dictionary encoding, run-length encoding.
  • Feature Store — centralized repository for ML features (Feast, Tecton, Hopsworks). Ensures consistency between training and serving.
  • Apache Airflow / Dagster / Prefect — workflow orchestration. Define DAGs (directed acyclic graphs) of tasks with dependencies. Airflow is most popular but Dagster has better dev experience and asset-based paradigm.
  • Kafka — distributed event streaming. Partitioned, replicated, durable. Key concepts: topics, partitions, consumer groups, offset management, exactly-once semantics.

Data at Scale — Common Challenges

  • Data partitioning — split data by date/region for faster queries. Partition pruning avoids scanning irrelevant data.
  • Data skew — uneven data distribution across partitions. One partition with 100x more data than others. Fix: salting keys, repartitioning, broadcast joins for small tables.
  • Joins at scale — broadcast join (small table fits in memory, no shuffle), sort-merge join (Spark's default for two large tables: shuffle and sort both sides), shuffle hash join (shuffles both sides but skips the sort). A short broadcast-join and salting sketch follows this list.
  • Window functions — compute aggregates across rows related to current row. Essential for sessionization, rolling features, ranking.
  • Late-arriving data — events that arrive after the processing window has closed. Handle with watermarks (Flink), merge-on-read (Iceberg/Hudi), or reprocessing.
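A short PySpark sketch of two of the techniques above, a broadcast join and key salting; the table paths and column names are illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()
events = spark.read.parquet("s3://lake/raw_events/")        # large, skewed fact table (illustrative path)
content_dim = spark.read.parquet("s3://lake/dim_content/")  # small dimension table

# Broadcast join: ship the small dimension to every executor, avoiding a shuffle of the fact table
enriched = events.join(F.broadcast(content_dim), on="content_id", how="left")

# Salting: spread a hot key (e.g. one viral title) across N buckets before a heavy aggregation
N = 16
salted = events.withColumn("salt", (F.rand() * N).cast("int"))
partial = salted.groupBy("content_id", "salt").agg(F.count("*").alias("plays"))
totals = partial.groupBy("content_id").agg(F.sum("plays").alias("plays"))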

Typical Data Pipeline

graph LR
    A[Client Events] --> B[Kafka]
    B --> C[Raw Data Lake<br/>Parquet/Iceberg]
    C --> D[Spark / dbt<br/>Transform]
    D --> E[Feature Store]
    E --> F[ML Model Training]
    D --> G[Analytics<br/>Warehouse]
    G --> H[Dashboards]
    F --> I[Model Serving]
    I --> J[Predictions API]

Sessionization — Deceptively Hard

A "session" isn't straightforward. Is it timeout-based (30 min inactivity)? Content-boundary-based (new show = new session)? Device-specific?

-- Sessionization using inactivity gap (30 minutes)
WITH events_with_gap AS (
    SELECT
        user_id,
        event_timestamp,
        LAG(event_timestamp) OVER (
            PARTITION BY user_id ORDER BY event_timestamp
        ) AS prev_timestamp,
        CASE
            WHEN EXTRACT(EPOCH FROM event_timestamp - LAG(event_timestamp)
                 OVER (PARTITION BY user_id ORDER BY event_timestamp)) > 1800
            THEN 1
            ELSE 0
        END AS new_session_flag
    FROM raw_events
)
SELECT
    user_id,
    event_timestamp,
    SUM(new_session_flag) OVER (
        PARTITION BY user_id ORDER BY event_timestamp
    ) AS session_id
FROM events_with_gap;

Real-time sessionization in Flink uses session windows with a gap timeout — but requires careful handling of late-arriving events and cross-device sessions.


7. ML System Design

System Design Framework

When designing any ML system, structure it as:

  1. Problem definition — what are we solving?
  2. Metrics — how do we measure success? (offline + online)
  3. Data sources — what data do we have?
  4. Feature engineering — what features do we build?
  5. Model choice — what algorithm fits?
  6. Training pipeline — how do we train at scale?
  7. Serving architecture — batch vs real-time?
  8. Monitoring — what do we track in production?
  9. Retraining — when and how do we update?
  10. Experimentation — how do we A/B test?

ML System Pipeline

graph TD
    A[Data Ingestion] --> B[Data Cleaning]
    B --> C[Feature Engineering]
    C --> D[Feature Store]
    D --> E[Model Training]
    E --> F[Model Evaluation]
    F --> G[Model Registry]
    G --> H{Deployment Strategy}
    H -->|Shadow| I1[Log predictions, don't serve]
    H -->|Canary| I2[5% traffic, monitor]
    H -->|A/B Test| I3[50/50 with old model]
    H -->|Full| I4[100% traffic]
    I1 --> J[Monitoring & Alerting]
    I2 --> J
    I3 --> J
    I4 --> J
    J -->|"drift / degradation"| K[Retrain Trigger]
    K --> E

Real-Time vs Batch Training

| Aspect | Batch Training | Real-Time Training |
| --- | --- | --- |
| Freshness | Hours to days stale | Minutes stale |
| Use case | Stable preferences | Trending content, viral items |
| Infrastructure | Spark + GPU cluster | Flink + parameter server |
| Complexity | Lower | Much higher |
| Typical approach | Daily/weekly retrain | Continuous embedding updates |

Most platforms use a hybrid: batch-train the full model daily/weekly, but update embedding tables in near-real-time for new content and shifting user preferences (as described in ByteDance's Monolith paper).

Model Serving Infrastructure

For a platform serving millions of concurrent users:

  • Model format: TensorFlow SavedModel, ONNX, or TorchScript
  • Serving framework: TensorFlow Serving, Triton Inference Server, BentoML, or custom gRPC services
  • Dynamic batching: Group multiple requests into one GPU batch for throughput
  • Caching: LRU cache on popular content embeddings; session-level cache for user context
  • Fallback: If model inference fails, fall back to a rule-based ranker (popularity-based)
  • Latency budget: <50ms p99 for the entire recommend-and-rank pipeline

Example Systems to Understand

  • Content recommendation system (home feed ranking)
  • Ad targeting and ranking system
  • Audience segmentation system
  • Search ranking system
  • Attribution and measurement system
  • Identity resolution system
  • Reporting insights AI agent

8. The Experimentation Platform

Why Experimentation Infrastructure Is as Important as ML

A model is only as good as your ability to measure its impact. Streaming platforms run hundreds to thousands of concurrent experiments:

  • New ranking models
  • UI layout changes
  • Content promotion strategies
  • Notification timing
  • Pricing experiments
  • Ad targeting algorithms

Core Components

flowchart LR
    Config["🎛️ Experiment Config<br/>(variants, allocation,<br/>guardrails)"] --> Assign["👤 Assignment<br/>Service"]
    Assign --> Client["📱 Client SDK<br/>(get variant)"]
    Client --> Events["📊 Events<br/>(tagged with<br/>experiment_id + variant)"]
    Events --> Pipeline["⚙️ Metrics<br/>Pipeline"]
    Pipeline --> Stats["📈 Statistical<br/>Analysis"]
    Stats --> Dashboard["📊 Experiment<br/>Dashboard"]

    style Config fill:#9c27b0,color:#fff
    style Stats fill:#4caf50,color:#fff

Assignment — Deterministic Hashing

import hashlib

def get_variant(user_id: str, experiment_id: str, num_variants: int) -> int:
    """Deterministic assignment: same user always gets same variant."""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    return hash_value % num_variants

Properties:

  • Deterministic — same user always sees same variant (no flickering)
  • Uniform — SHA-256 gives near-perfect uniform distribution
  • Independent — different experiments use different hash inputs, so assignments are uncorrelated
  • Stateless — no need to store assignments; recompute on every request

Core Statistical Concepts

  • A/B testing — compare control (A) vs treatment (B) with random assignment
  • Hypothesis testing — formulate a null hypothesis (H₀: no difference) and an alternative (H₁: treatment is better)
  • P-value — probability of observing the result (or a more extreme one) under H₀. Typically reject H₀ if p < 0.05.
  • Confidence interval — range where true value likely falls (typically 95%)
  • Statistical significance — result unlikely due to chance
  • Power analysis — determine required sample size before experiment

Sample Size Calculation

n per variant = 2 · (z₁₋α/₂ + z₁₋β)² · σ² / δ²

Where:

  • z₁₋α/₂ — z-score for the significance level (1.96 for 95%)
  • z₁₋β — z-score for power (0.84 for 80% power)
  • σ² — variance of the metric
  • δ — minimum detectable effect (MDE)

Typical MDEs for streaming platforms:

  • Engagement metrics (watch hours): ±0.5%
  • Retention metrics (day-7 retention): ±0.2%
  • Revenue metrics (ARPU): ±1.0%
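A small calculator for the formula above, using scipy for the z-scores (the example numbers are illustrative):

from scipy.stats import norm

def required_sample_size(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-sided test on a difference of means."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return int(round(n))

# Example: metric std dev of 2.0 watch hours, detect a 0.05-hour absolute change
print(required_sample_size(sigma=2.0, mde=0.05))   # roughly 25,000 users per variant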

Traffic Allocation & Mutual Exclusion

Total Traffic (100%)
├── Layer: Ranking (40%)
│   ├── Experiment A: New model v3 (50% control / 50% treatment)
│   └── Experiment B: Feature expansion (50/50)
│       Note: A and B are mutually exclusive within this layer
├── Layer: UI (30%)
│   ├── Experiment C: Card size (33/33/33)
│   └── Experiment D: Autoplay threshold (50/50)
├── Layer: Notifications (20%)
│   └── Experiment E: Send time optimization
└── Holdout (10%)
    └── No experiments — clean baseline

Within a layer, experiments are mutually exclusive. Across layers, they are orthogonal (independent). This is Google's overlapping experiments architecture.

Variance Reduction Techniques

Raw metrics have high variance (some users binge 8 hours, most watch 30 minutes):

CUPED (Controlled-experiment Using Pre-Experiment Data):

Ŷ_cuped = Y − θ(X − X̄), where X is the user's pre-experiment value of the metric and θ = cov(X, Y) / var(X). Reduces variance by 30-50%.
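A minimal CUPED adjustment in numpy, assuming each user's pre-experiment value of the same metric is available (the data here is synthetic):

import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    # Y_cuped = Y - theta * (X - mean(X)), theta = cov(X, Y) / var(X)
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(7)
x_pre = rng.gamma(2.0, 2.0, size=10_000)            # pre-experiment watch hours
y = 0.3 * x_pre + rng.normal(0, 1.0, size=10_000)   # in-experiment watch hours
y_adj = cuped_adjust(y, x_pre)
print(f"variance reduction: {1 - y_adj.var() / y.var():.0%}")   # roughly 40% on this toy data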

Stratified sampling: Bucket users by activity level, compute within-stratum estimates.

Winsorization: Cap extreme values at 99th percentile to reduce outlier influence.

Multiple Testing Correction

With hundreds of experiments, false positives are inevitable:

  • Bonferroni: test each hypothesis at α/m (conservative)
  • Benjamini-Hochberg (FDR control): rank p-values ascending and reject all hypotheses up to the largest i where p₍ᵢ₎ ≤ (i/m)·α (a short sketch follows)
  • Always Valid Sequential Testing — allows peeking at results without inflating false positives
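A short sketch of the Benjamini-Hochberg step-up procedure on a handful of p-values:

import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses, controlling FDR at alpha."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    ranked = p[order]
    thresholds = alpha * (np.arange(1, len(p) + 1) / len(p))
    below = ranked <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0  # largest i with p_(i) <= (i/m)*alpha
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))  # rejects the first two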

Guardrail Metrics

Every experiment must monitor guardrails — things that must NOT degrade:

| Guardrail | Threshold | Why |
| --- | --- | --- |
| App crash rate | +0.0% | Reliability |
| Page load latency p99 | +50ms | Performance |
| Error rate | +0.1% | Stability |
| Customer support contacts | +5% | UX quality |
| Subscription cancellation rate | +0.5% | Revenue |

If any guardrail is violated, auto-pause the experiment.

Metrics for Experiments

  • CTR — click through rate
  • Conversion rate — purchases / impressions
  • Revenue / ARPU — average revenue per user
  • ROAS — Return on Ad Spend (revenue / ad spend)
  • Engagement — time spent, interactions, completion rate
  • Retention — day-1, day-7, day-28 return rate

9. Gen AI / LLM / Agents

LLM Core Topics

  • Transformers — the architecture behind modern LLMs. The self-attention mechanism computes relevance between all token pairs: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Encoder-decoder (T5, BART), decoder-only (GPT, Llama), encoder-only (BERT).
  • Embeddings — convert text/entities to dense vectors. Sentence transformers (all-MiniLM, E5, BGE) produce fixed-size embeddings for semantic similarity.
  • Tokenization — BPE (Byte-Pair Encoding), SentencePiece, tiktoken. Subword tokenization balances vocabulary size and coverage.
  • Vector databases — store and search embeddings at scale. Pinecone, Weaviate, Milvus, Qdrant, pgvector, Chroma. Support approximate nearest neighbor (ANN) search with HNSW or IVF indexes.
  • RAG (Retrieval Augmented Generation) — retrieve relevant context from a knowledge base, inject into LLM prompt, then generate answer. Reduces hallucination and keeps knowledge current without retraining.
  • Prompt engineering — few-shot examples, chain-of-thought reasoning, system prompts, output formatting. Temperature controls randomness.
  • Fine-tuning — adapt a pre-trained LLM to a specific domain or task. LoRA (Low-Rank Adaptation) and QLoRA achieve this efficiently without full parameter updates.
  • Evaluation — BLEU, ROUGE (automated), human eval, LLM-as-judge, factual accuracy, faithfulness to retrieved context.
  • Hallucination — model generates plausible but incorrect information. Mitigate with RAG, grounding, and retrieval verification.
  • Guardrails — input/output validation, content filtering, PII detection. Libraries: Guardrails AI, NeMo Guardrails.
  • Agents — LLMs that can take actions using tools. ReAct pattern: Reason about what to do → Act using a tool → Observe the result → Repeat.
  • Tool calling / Function calling — LLM generates structured output to invoke external functions (APIs, databases, code interpreters).
  • Semantic search — search by meaning rather than keywords using embedding similarity.
  • NL→SQL systems — convert natural language questions to SQL queries. Text2SQL with LLMs + schema awareness.
  • Knowledge graphs + LLM — structured data enhancing LLM reasoning. GraphRAG combines graph traversal with retrieval.

RAG Architecture

graph TD
    A[User Question] --> B[Embedding Model]
    B --> C[Vector Search<br/>in Vector DB]
    C --> D[Top-K Relevant<br/>Documents]
    D --> E[Construct Prompt<br/>Question + Context]
    E --> F[LLM]
    F --> G[Answer]

    style C fill:#ff9800,color:#fff
    style F fill:#4caf50,color:#fff

RAG Pipeline Details

  1. Indexing Phase (offline):
     • Chunk documents (512-1024 tokens, with overlap)
     • Generate embeddings per chunk
     • Store in a vector database with metadata
  2. Query Phase (online):
     • Embed the user query
     • Retrieve top-K similar chunks (cosine similarity)
     • (Optional) Rerank retrieved chunks with a cross-encoder
     • Inject context + query into the LLM prompt
     • Generate the answer with citations (a minimal sketch follows)
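A minimal retrieval-and-prompt sketch of the query phase; embed() here is a random-vector placeholder standing in for a real sentence-embedding model (e.g. a sentence-transformers model), and the chunks and query are made up.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: swap in a real sentence-embedding model; with a real model,
    # semantically similar text maps to nearby vectors
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

chunks = [
    "Q2 ad revenue grew 14% driven by the ad-supported tier.",
    "Churn among annual subscribers fell to 2.1% in March.",
    "The new ranking model lifted quality watch hours by 1.8%.",
]
index = np.stack([embed(c) for c in chunks])           # offline indexing phase

query = "How did the ad-supported tier perform last quarter?"
scores = index @ embed(query)                          # cosine similarity (unit vectors)
top = [chunks[i] for i in np.argsort(-scores)[:2]]     # retrieve top-K chunks

prompt = "Answer using only the context below.\n\nContext:\n- " + "\n- ".join(top) + f"\n\nQuestion: {query}"
print(prompt)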

Reporting Agent Architecture

graph TD
    A[User: Natural Language Query] --> B[Agent / Orchestrator]
    B --> C{Route}
    C -->|SQL Query| D[NL-to-SQL Engine]
    C -->|Dashboard| E[Visualization Tool]
    C -->|Analysis| F[Analytics Engine]
    D --> G[Data Warehouse]
    G --> H[Results]
    H --> I[LLM Summarization]
    I --> J[Response to User]

This is increasingly common on streaming platforms — analysts ask questions in natural language and get SQL-backed answers with visualizations.


10. SQL & Data Modeling

SQL Deep Dive

  • Joins — INNER (matching rows), LEFT (all from left + matching right), RIGHT (all from right + matching left), FULL OUTER (all rows), CROSS (cartesian product). In distributed systems, join order matters enormously for performance.
  • Group by + Aggregations — SUM, COUNT, AVG, MIN, MAX. GROUP BY with HAVING for filtered aggregates.
  • Window functions — the most powerful SQL feature for analytics:
  • ROW_NUMBER() — sequential numbering within partition
  • RANK() / DENSE_RANK() — rank with ties handling
  • LAG() / LEAD() — access previous/next row
  • SUM() OVER (ORDER BY ... ROWS BETWEEN ...) — running totals
  • NTILE(n) — divide into n equal groups
  • CTEs (Common Table Expressions) — WITH clause for readable, composable queries. Recursive CTEs for hierarchical data.
  • Indexing — B-tree (default, range queries), hash (equality lookups), GiST/GIN (full-text search, arrays). Index selection is critical for query performance.
  • Query optimization — EXPLAIN ANALYZE, predicate pushdown, partition pruning, broadcast hints.

Data Modeling for Analytics

Star Schema — central fact table surrounded by dimension tables. Denormalized for query performance.

graph TD
    F[Fact: Impressions / Plays] --> D1[Dim: User]
    F --> D2[Dim: Content]
    F --> D3[Dim: Device]
    F --> D4[Dim: Time]
    F --> D5[Dim: Campaign]
    F --> D6[Dim: Geography]

  • Fact table — stores measurable events (impressions, plays, clicks, conversions) with foreign keys to dimensions and numeric measures.
  • Dimension table — descriptive attributes (user demographics, content metadata, device info). Slowly changing dimensions (SCD Type 1/2) for tracking history.
  • OLAP vs OLTP — analytics (column-oriented, aggregations, full scans) vs transactions (row-oriented, lookups, updates).

11. KPIs / Objective Functions

Choosing the right objective function for each system is critical:

| System | Objective | Why |
| --- | --- | --- |
| CTR model | Log Loss (Binary Cross-Entropy) | Penalizes confident wrong predictions, produces calibrated probabilities |
| Content ranking | NDCG (Normalized Discounted Cumulative Gain) | Measures quality of ordered list, weights top positions higher |
| Recommendation | MAP / Recall@K | Evaluates relevance in top-K results |
| Audience segmentation | Silhouette Score / Calinski-Harabasz | Measures cluster quality and separation |
| Attribution | Incrementality (causal lift) | Measures true causal impact, not just correlation |
| Budget pacing | Spend vs target deviation | Minimizes under/over-delivery |
| Identity resolution | Precision / Recall / F1 of matches | Correct device-to-user mapping accuracy |
| Engagement | Composite: quality watch hours + diversity | Prevents gaming a single metric (Goodhart's Law) |

The North Star Metric Problem

Streaming platforms have many metrics that often conflict:

| Metric | Optimizes For | Risk |
| --- | --- | --- |
| Watch hours | Engagement | Autoplay addiction, low-quality binges |
| DAU / MAU | Retention | Doesn't capture depth |
| Content starts | Discovery | Doesn't measure satisfaction |
| Completion rate | Satisfaction | Penalizes long content |
| Revenue / ARPU | Business | May sacrifice long-term engagement |

Solution: define a composite metric, e.g. NorthStar = w₁ · QualityWatchHours + w₂ · ContentDiversity, where QualityWatchHours filters out background/autoplay — only counting engaged viewing.


12. MLOps / Production ML

Core MLOps Concepts

  • Model deployment — serving predictions via API (real-time) or batch job (scheduled)
  • Batch vs real-time inference — precompute predictions for all users nightly vs compute on-demand per request
  • Feature store — consistent features for training and serving (prevents skew)
  • Model versioning — track model artifacts, hyperparameters, training data version, metrics. Tools: MLflow, Weights & Biases, Neptune.
  • Monitoring
  • Prediction quality — track accuracy/AUC on incoming data vs hold-out
  • Data drift — input feature distributions shift over time (KL divergence, PSI; a PSI sketch follows this list). Seasonal changes, new content catalog.
  • Concept drift — relationship between features and target changes. User behavior evolves.
  • Latency & throughput — p50/p95/p99 inference latency, requests per second
  • Retraining pipelines — automated model refresh on new data. Can be scheduled (daily/weekly) or triggered by drift detection.
  • Deployment strategies
  • Shadow deployment — run new model alongside old, log predictions but don't serve to users. Compare outputs.
  • Canary deployment — roll out to 1-5% of traffic first. Monitor guardrails. Gradually increase.
  • Blue-green — two identical environments. Switch traffic entirely from blue (old) to green (new).
  • A/B model testing — statistically rigorous comparison on live traffic with proper sample size.
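A minimal Population Stability Index (PSI) sketch for the data-drift check above; a common rule of thumb treats PSI above roughly 0.2 as meaningful drift, though the threshold is a judgment call.

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current feature sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-9   # widen outer edges so every value lands in a bin
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid division by zero / log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(3)
train_dist = rng.normal(0, 1, 50_000)        # feature distribution at training time
serving_dist = rng.normal(0.3, 1.2, 50_000)  # feature distribution in production this week
print(f"PSI = {psi(train_dist, serving_dist):.3f}")  # above ~0.2 would trigger a retrain alert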

MLOps Architecture

graph TD
    A[Data Pipeline] --> B[Feature Store]
    B --> C[Training Pipeline<br/>GPU Cluster]
    C --> D[Model Registry<br/>MLflow / W&B]
    D --> E[Offline Evaluation<br/>AUC, NDCG]
    E --> F{Deployment}
    F -->|Shadow| G1[Log Only]
    F -->|Canary| G2[5% Traffic]
    F -->|A/B Test| G3[Experiment]
    F -->|Full| G4[100% Traffic]
    G1 --> H[Monitoring<br/>Drift + Latency + Quality]
    G2 --> H
    G3 --> H
    G4 --> H
    H -->|"Drift Detected"| I[Auto-Retrain]
    I --> C

13. Data Quality & Observability

Why Data Quality Is the Biggest Risk

ML models are only as good as their input data. Common issues:

| Issue | Impact | Detection |
| --- | --- | --- |
| Missing events (SDK bug) | Undercounting → wrong experiment results | Volume anomaly detection |
| Duplicate events (retry storms) | Overcounting → inflated metrics | Dedup by event_id |
| Schema changes (unannounced) | Pipeline breakages | Schema registry enforcement |
| Clock skew (device time wrong) | Feature computation errors | Server-side timestamp validation |
| Bot/fraud traffic | Pollutes training data | Behavioral anomaly detection |
| Late-arriving data | Incomplete aggregations | Watermark-based processing |

Data Quality Framework

flowchart LR
    Ingest["📥 Ingestion"] --> Schema["📋 Schema<br/>Validation"]
    Schema --> Volume["📊 Volume<br/>Checks"]
    Volume --> Freshness["⏱️ Freshness<br/>SLAs"]
    Freshness --> Distribution["📈 Distribution<br/>Checks"]
    Distribution --> Alert["🚨 Alert &<br/>Circuit Breaker"]

    style Alert fill:#f44336,color:#fff

Tools: Great Expectations, dbt tests, Monte Carlo, Anomalo, Soda, or custom solutions on statistical process control.

Circuit Breakers: If input data quality drops below a threshold, automatically stop the ML training pipeline from consuming bad data. Better to serve a slightly stale model than one trained on corrupted data.

Data Lineage & Cataloging

  • Lineage tracking: From raw event → derived table → feature → model → experiment metric. Tools: OpenLineage, DataHub, Amundsen.
  • Data catalog: Searchable inventory of all datasets with ownership, freshness, schema, and usage statistics.
  • SLA monitoring: Each critical table has a freshness SLA. Breaches trigger alerts.

14. End-to-End Architecture

Putting It All Together

┌─────────────────────────────────────────────────────────────────────────┐
│                              CLIENT DEVICES                             │
│          (Smart TV, Mobile, Web, Gaming Console, Set-Top Box)           │
└─────────────┬───────────────────────────────────────────┬───────────────┘
              │ events                                     │ API calls
              ▼                                            ▼
┌─────────────────────────────┐          ┌───────────────────────────────┐
│  EVENT INGESTION            │          │  API GATEWAY                  │
│  - Schema Registry          │          │  - Experiment Assignment      │
│  - Kafka Clusters           │          │    (deterministic hash)       │
│    (multi-region)           │          │  - Recommendation API:        │
└──────────────┬──────────────┘          │    retrieve → rank →          │
               │                         │    rerank → serve             │
               ▼                         └──────────────┬────────────────┘
┌─────────────────────────────┐                         │
│  STREAM PROCESSING          │                         ▼
│  (Flink / Spark Streaming)  │          ┌───────────────────────────────┐
│  - Sessionization           │          │  ML SERVING                   │
│  - Real-time features       │          │  - Retrieval (ANN index)      │
│  - Real-time metrics        │          │  - Ranking (GPU/CPU)          │
│  - Anomaly detection        │          │  - Feature Store (Redis)      │
└──────────────┬──────────────┘          │  - Model Registry             │
               │                         └───────────────────────────────┘
               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                DATA LAKE                                │
│   Raw Events (Parquet) · Sessionized Events · Feature Tables ·          │
│   Experiment Results                                                    │
│                                                                         │
│   Storage: S3 / ADLS / GCS    Format: Delta Lake / Iceberg / Hudi       │
│   Compute: Spark / Trino / dbt                                          │
└─────────────────────────────────────────────────────────────────────────┘

Technology Choices — Practical Summary

| Component | Open Source | Managed Service |
| --- | --- | --- |
| Event streaming | Apache Kafka | Confluent Cloud, Amazon MSK, Azure Event Hubs |
| Stream processing | Apache Flink, Spark Structured Streaming | Kinesis Data Analytics, Dataflow |
| Data lake storage | HDFS, MinIO | S3, ADLS Gen2, GCS |
| Table format | Delta Lake, Apache Iceberg, Apache Hudi | Databricks, Snowflake |
| Batch compute | Apache Spark, Trino/Presto | Databricks, EMR, Synapse, BigQuery |
| Feature store | Feast, Hopsworks | Tecton, SageMaker Feature Store, Vertex AI |
| ML training | PyTorch, TensorFlow, XGBoost | SageMaker, Vertex AI, Azure ML |
| Model serving | TF Serving, Triton, BentoML | SageMaker Endpoints, Vertex AI |
| Experiment platform | Custom (most common) | Eppo, Statsig, LaunchDarkly, Optimizely |
| Data quality | Great Expectations, dbt tests, Soda | Monte Carlo, Anomalo |
| Orchestration | Apache Airflow, Dagster, Prefect | MWAA, Cloud Composer, Astronomer |
| Data catalog | DataHub, Amundsen, OpenMetadata | Collibra, Alation |

Learning Path

Phase 1: Foundations

  • ML basics — regression, classification, trees, neural networks
  • Metrics — precision, recall, F1, AUC, NDCG
  • Feature engineering techniques
  • SQL deep dive — window functions, CTEs, optimization

Phase 2: Domain Knowledge

  • Ads systems — CTR, ranking, auctions, attribution
  • Recommendation systems — collaborative & content-based filtering, two-tower, sequence models
  • Identity resolution — deterministic, probabilistic, graph-based
  • A/B testing — hypothesis testing, power analysis, CUPED

Phase 3: Infrastructure

  • Data engineering — Spark, Flink, Kafka, data lakes, Iceberg/Delta
  • ML system design patterns
  • Feature store architecture
  • MLOps & deployment strategies

Phase 4: GenAI

  • RAG architecture and implementation
  • Agents & tool calling (ReAct pattern)
  • NL to SQL
  • Embeddings & vector databases
  • LLM evaluation and guardrails

Priority Topics (Focus These First)

If limited on time, prioritize:

  1. ML basics — regression, classification, trees, metrics
  2. Feature engineering — the #1 differentiator in applied ML
  3. Recommendation systems — two-tower, ranking, cold start
  4. A/B testing & experimentation — fundamental to data-driven culture
  5. Data pipelines — Spark, Kafka, ETL, data lakes
  6. Identity resolution / device graph — unique cross-device challenge
  7. ML system design — end-to-end thinking
  8. RAG / LLM / Agents — the GenAI wave
  9. SQL & data modeling — the universal data language
  10. Metrics / KPIs — knowing what to optimize

This is a living document — I'll keep going deeper into each topic as I learn more. Stay focused while studying — try FocusTu for deep-work focus sessions. 🧘