Diving Deep into Monolith, ByteDance's Recommendation System
Resources
- Paper : https://arxiv.org/pdf/2209.07663
- Repo : https://github.com/dinesh-coderepo/tiktok
- NotebookLM discussion on the paper: Audio (copied to Google Drive)
- NotebookLM private link: NotebookLM Knowledge base
- Diving deep using an LLM: Information dump
- New NotebookLM notebook, with added material on embeddings and other details: Recording
Background & Prerequisites — What You Need to Know Before Reading This Blog
The Monolith paper sits at the intersection of recommendation systems, real-time ML serving, and large-scale distributed systems. Below are the foundational topics you need to understand.
1. Recommendation System Fundamentals
Why: Monolith is a production recommendation system — you need to understand the landscape it operates in.
- Collaborative Filtering (CF) — User-based and item-based CF. Matrix factorization (SVD, ALS, NMF) decomposes user-item interaction matrices into latent factors; sparse data is the core challenge. (A toy factorization sketch follows this list.)
- Content-Based Filtering — Uses item features (video tags, duration, category) and user features (demographics, watch history) to recommend. No cold-start problem for items, but limited diversity.
- Deep Learning for Reco — Neural collaborative filtering (NCF), two-tower models (user encoder + item encoder), attention mechanisms (transformers) for sequential recommendation.
- Ranking vs Retrieval — Two-stage systems: retrieval (candidate generation from millions → hundreds using approximate nearest neighbors), then ranking (score hundreds → top-K with a more complex model).
- Multi-task learning — Predicting multiple objectives simultaneously: click-through rate (CTR), watch time, likes, shares, follows. Monolith handles these jointly.
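To make matrix factorization concrete, here is a minimal toy sketch (my own example, not from the paper): plain gradient descent on a dense 3×4 interaction matrix, factorized into rank-2 user and item factors.

```python
import numpy as np

# Toy user-item interaction matrix (1 = positive interaction, 0 = unobserved)
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1]], dtype=np.float32)

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors

lr = 0.05
for _ in range(1000):
    err = R - U @ V.T        # reconstruction error on every cell
    U += lr * err @ V        # gradient step for user factors
    V += lr * err.T @ U      # gradient step for item factors

print(np.round(U @ V.T, 2))  # predicted user-item affinities
```

Real systems factorize only the observed cells of a billion-scale sparse matrix (via ALS or SGD), but the latent-factor idea is exactly this.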
2. Embedding Tables & Feature Engineering
Why: Monolith's key contribution is around real-time embedding table updates — you need to understand what embedding tables are.
- Sparse features — Categorical features with high cardinality: user_id (billions), video_id (millions), hashtags, device_type. Represented as one-hot or multi-hot vectors.
- Embedding lookup — Each sparse feature ID maps to a dense vector (e.g., 64 or 128 dimensions) in an embedding table. The table is a matrix of shape (num_unique_ids × embedding_dim).
- Collision & hashing — With billions of IDs, full embedding tables are impractical. Hash-based tricks (feature hashing, quotient-remainder trick) reduce memory at the cost of collisions (see the sketch after this list).
- Dense features — Numerical features like watch duration, scroll speed, time of day. Typically normalized and concatenated with embeddings.
- Feature interaction — Cross-features (user_id × video_category), factorization machines, and DeepFM combine sparse and dense signals.
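A minimal sketch of a hashed embedding lookup (illustrative names and sizes, not Monolith's API), showing why collisions are unavoidable once IDs outnumber buckets:

```python
import numpy as np

NUM_BUCKETS, DIM = 1000, 8
table = np.random.randn(NUM_BUCKETS, DIM).astype(np.float32)

def hashed_lookup(feature_id: int) -> np.ndarray:
    # Modulo hashing bounds memory, but distinct IDs can share a row,
    # so unrelated users/videos end up with the same embedding.
    return table[hash(feature_id) % NUM_BUCKETS]

# Pigeonhole: 5000 IDs into 1000 rows means every row is shared by ~5 IDs
rows = {hash(i) % NUM_BUCKETS for i in range(5000)}
print(f"5000 IDs mapped into {len(rows)} rows: collisions everywhere")
```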
3. Real-Time Training vs Batch Training
Why: Monolith's core innovation is enabling real-time (online) training for recommendation models, as opposed to traditional batch training.
- Batch training — Collect user interactions over hours/days → train model offline → deploy updated model. Introduces staleness: the model doesn't reflect recent trends or viral content.
- Online/real-time training — Continuously update model parameters as new interactions stream in. Captures trending content, breaking news, and shifting user preferences immediately (a toy contrast is sketched after this list).
- Challenges of online training — Parameter consistency (multiple workers updating simultaneously), feature distribution shift, catastrophic forgetting, training-serving skew.
- Why it matters for TikTok — Short-video feeds are highly dynamic. A video can go viral in minutes. Batch-trained models miss these signals; real-time training captures them.
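A schematic contrast (my own toy code, not Monolith's): an online trainer applies one SGD step per event as it arrives, so the very next prediction already reflects it, whereas a batch job would buffer the same events for hours before any weights change.

```python
import numpy as np

w = np.zeros(4)  # toy CTR model: logistic regression weights

def online_step(x: np.ndarray, y: float, lr: float = 0.05) -> None:
    """One per-event update: the model reflects this interaction immediately."""
    global w
    p = 1.0 / (1.0 + np.exp(-w @ x))  # predicted click probability
    w -= lr * (p - y) * x             # gradient of log-loss for this event

# Events are consumed as they stream in, not hours later from a log dump
rng = np.random.default_rng(0)
stream = [(rng.normal(size=4), float(rng.random() > 0.5)) for _ in range(100)]
for x, y in stream:
    online_step(x, y)
```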
4. Parameter Server Architecture
Why: Monolith builds on the parameter server paradigm for distributed ML training.
- What is a parameter server — A distributed system where model parameters are sharded across server nodes. Workers pull parameters, compute gradients on data shards, and push gradients back. The server aggregates and updates. (A minimal sketch follows this list.)
- PS vs AllReduce — AllReduce (used for dense models like CNNs) synchronizes all workers. PS is better for sparse models (recommendation) because workers only need a subset of parameters (the embeddings for IDs in their data batch).
- Synchronous vs Asynchronous updates — Sync: all workers finish before any update (consistent but slow). Async: workers update independently (faster but introduces gradient staleness). Monolith uses async.
- Existing PS frameworks — TensorFlow's ParameterServerStrategy, BytePS, PServer. Monolith extends TensorFlow's PS with a custom collisionless hash table.
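A minimal pull/compute/push cycle with key-sharded sparse parameters (purely illustrative; not the actual TensorFlow or BytePS API):

```python
import numpy as np

class ParamShard:
    """One server node holding a shard of the sparse parameters."""
    def __init__(self, dim: int):
        self.dim = dim
        self.params: dict[int, np.ndarray] = {}

    def pull(self, keys: list[int]) -> dict[int, np.ndarray]:
        # Lazily create rows so the table grows with the observed ID space
        return {k: self.params.setdefault(k, np.zeros(self.dim)) for k in keys}

    def push(self, grads: dict[int, np.ndarray], lr: float = 0.01) -> None:
        for k, g in grads.items():
            self.params[k] -= lr * g  # async apply: no barrier across workers

shards = [ParamShard(dim=8) for _ in range(4)]

def shard_of(key: int) -> ParamShard:
    return shards[key % len(shards)]  # key-based sharding across servers

# A worker touches only the keys in its minibatch (the sparse access pattern)
batch_keys = [3, 17, 42]
pulled = {k: shard_of(k).pull([k])[k] for k in batch_keys}
fake_grads = {k: 0.1 * np.ones(8) for k in batch_keys}  # stand-in gradients
for k, g in fake_grads.items():
    shard_of(k).push({k: g})
```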
5. Monolith's Key Contributions
Why: Understanding what makes the paper novel.
- Collisionless hash table — Unlike standard approaches that use fixed-size embedding tables with hash collisions, Monolith uses a dynamic, collision-free hash table. New IDs get their own embedding; old/inactive IDs are expired. Reduces quality loss from collisions.
- Real-time training pipeline — Integrates online training with serving. The model in production is continuously updated with fresh user interactions. The paper shows this significantly improves CTR.
- Fault tolerance — Mechanisms for checkpointing embedding tables and recovering from worker/server failures without losing training progress.
- Expiry and filtering — Old embeddings (for IDs not seen recently) are evicted — critical for memory management when dealing with billions of IDs.
6. Evaluation & Metrics for Recommendation Systems
Why: Understanding how Monolith's improvements are measured.
- AUC (Area Under ROC Curve) — Standard metric for CTR prediction. Measures the model's ability to rank positive interactions above negative ones.
- Log-loss (Cross-entropy) — Measures prediction confidence. Lower is better. (Both metrics are computed in the snippet after this list.)
- Online A/B testing — The only reliable way to evaluate recommendation changes. The paper reports online metrics (engagement, CTR) from production experiments.
- Offline vs Online gap — Offline AUC improvements don't always translate to online gains. Real-time training helps close this gap.
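Computing both offline metrics is a two-liner with scikit-learn (toy labels and scores, just to pin down the definitions):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # clicked / not clicked
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])

print(f"AUC:      {roc_auc_score(y_true, y_pred):.3f}")  # ranking quality
print(f"Log-loss: {log_loss(y_true, y_pred):.3f}")       # calibration/confidence
```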
7. TensorFlow Internals (Relevant Portions)
Why: Monolith is built as an extension to TensorFlow — understanding TF's architecture helps comprehend the system design.
- TF Variables & Embedding layers — tf.nn.embedding_lookup, tf.keras.layers.Embedding. Standard TF uses fixed-size Variables for embedding tables (see the snippet after this list).
- TF Serving — The model serving infrastructure. Monolith modifies this to support real-time parameter updates.
- SavedModel format — How TF exports models. Monolith needs to checkpoint dynamic hash tables, not just fixed tensors.
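For reference, the standard fixed-size TF idiom that Monolith's hash table replaces (sizes here are arbitrary):

```python
import tensorflow as tf

# Standard TF: the table size is fixed up front, so billions of raw IDs must
# be hashed/moduloed into this range before lookup; that is where collisions
# come from.
NUM_IDS, DIM = 100_000, 64
table = tf.Variable(tf.random.normal([NUM_IDS, DIM], stddev=0.01))

ids = tf.constant([42, 7, 99_999])               # already-hashed feature IDs
embeddings = tf.nn.embedding_lookup(table, ids)  # shape: (3, 64)
```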
Plan to do an exercise implementing this approach for recommendation.
Learning Plan from Grok
I reached out to Grok for a comprehensive learning roadmap on this topic. For a detailed plan and structured guidance, check out the following link: Grok
TODO / Remaining Work
- [ ] Summarize the Monolith paper section by section
- [ ] Add architecture diagrams (Mermaid) for the Monolith system
- [ ] Explain the collisionless hash table with a visual example
- [ ] Compare Monolith's approach with traditional batch recommendation systems
- [ ] Implement a simplified version of the embedding table with expiry
- [ ] Document the real-time training pipeline with a sequence diagram
- [ ] Add evaluation results from the paper with analysis
- [ ] Connect findings to the recommendation-example blog post
- [ ] Change status from `workinprogress` to `published`
The Monolith Paper — A Readable Summary
The Core Problem
TikTok's feed is hyper-dynamic: a video can go viral in minutes, and user preferences shift continuously. Traditional batch-trained recommendation systems — retrained nightly or hourly — are fundamentally too slow for this. By the time a new video has enough interactions to be in the training set, the viral moment is over.
On top of that, the candidate space is astronomically large: billions of user IDs, billions of video IDs, hashtag IDs, device IDs, etc. Standard embedding tables of shape (num_ids, embedding_dim) are either impossibly large or forced to use hashing tricks that cause collisions — two different IDs sharing one embedding — which hurts model quality.
Monolith's contribution is a system that solves both problems at once:
- Collisionless hash table — a dynamic, sparse, expiring embedding store.
- Real-time online training — model parameters update within seconds of user interactions.
The Collisionless Hash Table (Simplified Python Mock)
Standard approach (collision-prone):
```python
import numpy as np

user_id = 123456789                  # any high-cardinality raw ID
# Fixed table, modulo hash → collisions are guaranteed
table = np.zeros((1_000_000, 64))    # 1M rows
emb = table[hash(user_id) % 1_000_000]
```
Monolith's approach (conceptually):
```python
import numpy as np

class CollisionlessEmbedding:
    def __init__(self, dim: int, ttl_seconds: int = 7 * 86400):
        self.dim = dim
        self.ttl = ttl_seconds
        self.table: dict[int, np.ndarray] = {}      # id → embedding vector
        self.last_seen: dict[int, float] = {}       # id → unix timestamp

    def lookup(self, ids: list[int], now: float) -> np.ndarray:
        out = np.zeros((len(ids), self.dim), dtype=np.float32)
        for i, k in enumerate(ids):
            if k not in self.table:
                # new ID — allocate a fresh embedding (e.g., Xavier init)
                self.table[k] = np.random.randn(self.dim).astype(np.float32) * 0.01
            self.last_seen[k] = now   # refresh TTL on every lookup
            out[i] = self.table[k]
        return out

    def update(self, ids: list[int], grads: np.ndarray, lr: float = 1e-3):
        # Plain SGD on only the rows touched by this batch
        for i, k in enumerate(ids):
            self.table[k] -= lr * grads[i]

    def expire(self, now: float):
        # Evict any ID not seen within the TTL window
        stale = [k for k, ts in self.last_seen.items() if now - ts > self.ttl]
        for k in stale:
            del self.table[k]
            del self.last_seen[k]
```
In production, Monolith implements this as a sharded, thread-safe C++ extension to TensorFlow — but the algorithm is the same: on-demand allocation per real ID, TTL-based eviction, no collisions.
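To see the lifecycle end to end, a quick usage pass over the mock above (assumes the class and the numpy import are in scope):

```python
import time

emb = CollisionlessEmbedding(dim=64)

now = time.time()
vecs = emb.lookup([101, 202, 101], now)  # 101 and 202 each get their own row
grads = 0.1 * np.ones_like(vecs)         # stand-in gradients from the model
emb.update([101, 202, 101], grads)

emb.expire(now + 8 * 86400)              # 8 days later: both IDs are stale
print(len(emb.table))                    # 0 (evicted by TTL)
```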
The Real-Time Training Pipeline
```mermaid
sequenceDiagram
    participant User
    participant Serving as Serving Model<br/>(TF Serving)
    participant Kafka
    participant Trainer as Online Trainer
    participant PS as Parameter Server<br/>(collisionless hash table)
    User->>Serving: Request feed
    Serving->>PS: Pull embeddings for user + candidates
    PS-->>Serving: Embeddings
    Serving-->>User: Ranked videos
    User->>Kafka: Interaction event<br/>(watched 80%, liked)
    Kafka->>Trainer: Stream of events
    Trainer->>PS: Pull current embeddings
    Trainer->>Trainer: Compute gradients
    Trainer->>PS: Push gradient updates
    PS-->>Serving: (periodic) sync updated weights
```
Critically, Serving and Trainer share the same parameter server but are decoupled processes. Model weights are synced to serving on a short interval (seconds), not redeployed.
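One way to picture that periodic sync (a sketch under my own assumptions, not Monolith's implementation): the trainer records which embedding rows it touched, and a background loop ships only those rows to serving every few seconds. `push_to_serving` is a hypothetical callback.

```python
import threading
import time

dirty_keys: set[int] = set()  # embedding rows touched since the last sync
lock = threading.Lock()

def on_gradient_push(keys: list[int]) -> None:
    """Called by the trainer after each update; records what changed."""
    with lock:
        dirty_keys.update(keys)

def sync_loop(push_to_serving, interval_s: float = 10.0) -> None:
    """Ship only the changed rows to serving replicas, never a full redeploy."""
    while True:
        time.sleep(interval_s)
        with lock:
            batch = set(dirty_keys)
            dirty_keys.clear()
        if batch:
            push_to_serving(batch)  # hypothetical incremental-sync callback
```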
Key Evaluation Findings from the Paper
- Collisionless table vs hash-collision baseline: up to +1% AUC on CTR prediction — huge at TikTok's scale (each 0.1% AUC is worth millions in revenue).
- Online training vs daily batch: +14% engagement in online A/B testing on the TikTok For You feed. Offline metrics alone understated the gains.
- Fault tolerance: workers/servers can fail and recover via periodic checkpointing without losing training progress or requiring full restart.
- Memory management: TTL-based expiry keeps the embedding table size bounded despite unbounded ID space — a practical necessity for a system that cannot simply keep growing.
Where This Connects to the Rest of the Blog
- Recommendation example shows the simpler lookalike-modeling problem with tiny data. Monolith is what that system looks like when it scales to billions of users.
- The collisionless hash table + real-time training pattern is now used (in various forms) by Meta (ScalableEmbedding), Kuaishou (Persia), and ByteDance internally across multiple products.
When every TODO above is ticked and you've implemented and benchmarked a toy version of the collisionless table, flip this post to status: published.