System Design Mastery Hub

#01

Load Balancing

Infra Performance Medium

Distributing traffic across multiple servers to maximize throughput, minimize latency, and avoid overload. Covers algorithms: round-robin, least connections, IP hashing.
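The three algorithms named above can be sketched in a few lines each. This is a minimal in-process illustration, not a production balancer: the class and function names (`RoundRobinBalancer`, `LeastConnectionsBalancer`, `pick_by_ip_hash`) are hypothetical, and health checks, weights, and thread safety are left out.

```python
import hashlib
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Sends each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by insertion order of the servers.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


def pick_by_ip_hash(client_ip, servers):
    """Maps a client IP to the same server as long as the server list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

Note that plain modulo hashing remaps most clients when the server list changes, which is the problem consistent hashing (#19) addresses.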

Resources

Practice Problems

Design a load balancer for a video streaming service (Netflix-like)
Implement a consistent hashing-based load balancer that handles node failures
Design URL shortener with load balancing across 3 regions

#02

SQL vs NoSQL

Databases Medium

Relational vs non-relational databases — when to choose each, trade-offs in consistency, scalability, and schema flexibility.

Resources

Practice Problems

Design a social graph — justify SQL vs NoSQL choice
Design a product catalog for e-commerce: when does NoSQL win?
Schema design for a multi-tenant SaaS app

#03

Idempotency

APIs Distributed Medium

Designing operations that can be safely retried without causing unintended side effects. Critical for payment systems, message queues, and distributed APIs.
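One common way to make retries safe is an idempotency key: the client sends the same key on every retry, and the server returns the stored result instead of repeating the side effect. The sketch below assumes an in-memory store and a hypothetical `IdempotentProcessor` wrapper; a real service would persist keys in a shared database and expire them.

```python
import threading


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key and replays the stored result."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # idempotency_key -> result (in-memory assumption)
        self._lock = threading.Lock()

    def process(self, idempotency_key, request):
        # Holding the lock while the handler runs keeps the sketch simple,
        # at the cost of serializing all requests.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._handler(request)
            self._results[idempotency_key] = result
            return result
```

A retried call such as `processor.process("order-42", {"amount": 10})` then charges once and returns the same response every time.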

Resources

Practice Problems

Design idempotent payment processing — handle duplicate charges
Build a retry mechanism for a distributed order service
Design an email notification system with effectively-once delivery (at-least-once delivery plus deduplication)

#04

Message Queues

Infra Distributed Medium

Async communication via Kafka, RabbitMQ, SQS. Topics: pub/sub, dead letter queues, ordering guarantees, consumer groups, and backpressure.

Resources

Practice Problems

Design a notification system using Kafka (email, SMS, push)
Design an order processing pipeline with guaranteed delivery
Build a job queue for a distributed task scheduler

#05

CAP Theorem

Distributed Databases Hard

Consistency, Availability, and Partition Tolerance: when a network partition occurs, a system must give up either consistency or availability. Understanding CP vs AP systems and the PACELC extension.

Resources

Practice Problems

Design a banking system — justify your CAP trade-offs
Compare DynamoDB (AP) vs HBase (CP) for a shopping cart
Design a distributed key-value store, explain partition handling

#06

APIs

APIs Foundational

API design principles: versioning, pagination, error handling, authentication patterns, rate limiting, and documentation best practices.

Resources

Practice Problems

Design the Twitter API — tweets, follows, timeline endpoints
Design a paginated search API for a 1B-item catalog
Design a public API with versioning strategy (v1 → v2 migration)

#07

Batch vs Stream Processing

Infra Performance Hard

MapReduce, Spark, Flink, Kafka Streams. Lambda vs Kappa architecture. When to process data in bulk vs real-time event streams.

Resources

Practice Problems

Design a real-time fraud detection system (stream processing)
Design a nightly analytics pipeline for 100M daily events
Design YouTube's view count system (eventual vs real-time)

#08

Caching Strategies

Performance Infra Medium

Cache-aside, write-through, write-behind, read-through. Where to cache: client, CDN, app server, DB. Cache invalidation and stampede problems.
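Cache-aside is the easiest of these patterns to show in code: the application checks the cache first, loads from the database on a miss, and deletes the entry when the underlying data changes. The sketch below is an assumption-laden illustration; `CacheAside`, `load_from_db`, and the injectable `clock` are hypothetical names, and it does nothing to prevent a stampede when many callers miss at once.

```python
import time


class CacheAside:
    """Read path: cache hit, else load from the source and populate the cache."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit
        value = self._load(key)  # miss or expired: go to the source of truth
        self._cache[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        # Call after writing to the database so the next read reloads.
        self._cache.pop(key, None)
```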

Resources

Practice Problems

Design Instagram's feed cache — what to cache, when to invalidate
Handle cache stampede in a high-traffic flash sale
Design a multi-layer cache (L1/L2/L3) for a search engine

#09

Webhooks

APIs Architecture Easy

Event-driven HTTP callbacks. Delivery guarantees, retry logic, signature verification, fan-out patterns, and webhooks vs polling vs SSE.
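Signature verification usually means the sender computes an HMAC over the request body with a shared secret and the receiver recomputes and compares it. The scheme below (HMAC-SHA256 over `timestamp.body`, with a replay window) is an illustrative assumption, not any specific provider's format; `sign_webhook` and `verify_webhook` are hypothetical names.

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, timestamp: int, body: bytes) -> str:
    """Returns a hex HMAC-SHA256 over '<timestamp>.<body>'."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, body, signature, now, tolerance_seconds=300):
    """Rejects stale timestamps (replay protection) and mismatched signatures."""
    if abs(now - timestamp) > tolerance_seconds:
        return False
    expected = sign_webhook(secret, timestamp, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The receiver must verify against the raw request bytes, since re-serializing parsed JSON can change the body and break the signature.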

Resources

Practice Problems

Design a webhook delivery system with retries and failure handling
Design GitHub Actions triggers — webhook fan-out to 100k subscribers
Build webhook signature verification for a payment provider

#10

Availability

Distributed Infra Medium

SLAs, the "nines" (99.9% vs 99.99%), redundancy, failover, health checks, circuit breakers, and designing for high availability across regions.
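The arithmetic behind the "nines" is short enough to compute directly. Assuming a 365-day year and independent failures, 99.9% allows about 8.76 hours of downtime per year and 99.99% about 52.6 minutes; a request that needs five services in series, each at 99.9%, sees roughly 99.5% availability. The function names below are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(availability):
    """Yearly downtime budget for an availability given as a fraction (e.g. 0.999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities):
    """All components must be up: multiply the availabilities."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def parallel_availability(availabilities):
    """Redundant replicas: the system is down only if every replica is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down
```

The independence assumption is optimistic: replicas that share a region, a deploy pipeline, or a dependency tend to fail together.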

Resources

Practice Problems

Design a 99.99% available global payment API
Design multi-region failover for a ride-sharing app
Calculate availability for a system with 5 dependent services

#11

Data Sharding & Partitioning

Databases Distributed Hard

Horizontal partitioning by key, range, or hash. Hotspot problems, cross-shard queries, rebalancing, and the difference between sharding and partitioning.

Resources

Practice Problems

Design sharding strategy for a 10TB user database
Handle a hot shard in a social media timeline system
Design cross-shard transaction handling for a banking app

#12

Bloom Filters

Performance Distributed Hard

Probabilistic data structure for membership tests. Space-efficient "maybe yes, definitely no" filter. Used in databases, CDNs, and duplicate detection.
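A minimal sketch makes the "maybe yes, definitely no" behavior concrete: adding an item sets k bits, and a lookup reports a possible member only if all k bits are set. The sizes here (1024 bits, 3 hashes) and the use of seeded SHA-256 as the hash family are illustrative assumptions; real deployments size the filter from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Items cannot be removed from this basic form, because clearing a bit could erase evidence of other items that share it.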

Resources

Practice Problems

Use a Bloom filter to reduce DB lookups for non-existent users
Design duplicate URL detection for a web crawler (Googlebot-scale)
Implement Bloom filter for "safe browsing" malicious URL detection

#13

Stateful vs Stateless Architecture

Architecture Infra Medium

Stateless services scale horizontally easily; stateful ones need sticky sessions or external state stores. Implications for microservices, k8s, and auth.

Resources

Practice Problems

Refactor a stateful session service to be horizontally scalable
Design stateless auth with JWTs replacing server-side sessions
Design a real-time game server — where must state live?

#14

Algorithms in Distributed Systems

Distributed Hard

Paxos, Raft consensus, gossip protocols, vector clocks, two-phase commit (2PC), Lamport timestamps, and leader election algorithms.
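Of the items listed, vector clocks are the most compact to illustrate. Each node keeps a counter per node; comparing two clocks tells you whether one event happened before the other or whether they are concurrent (a conflict to resolve). The dict-based representation and function names below are assumptions made for the sketch.

```python
def increment(clock, node):
    """Returns a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum, applied when a node receives another node's clock."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a, b):
    """Returns 'equal', 'before', 'after', or 'concurrent' for clock a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```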

Resources

Practice Problems

Design a leader election service using Raft for a distributed DB
Implement vector clocks for conflict resolution in a distributed KV store
Design failure detection using gossip protocol

#15

API Gateways

APIs Infra Medium

Single entry point for microservices. Handles auth, rate limiting, routing, SSL termination, request aggregation, and observability (Kong, AWS API GW).

Resources

Practice Problems

Design an API gateway for a microservices e-commerce platform
Add rate limiting + JWT auth to an existing gateway
Design request aggregation to reduce mobile client round-trips

#16

Proxy vs Reverse Proxy

Infra Easy

Forward proxy (client anonymity, filtering) vs reverse proxy (server protection, load balancing, caching). NGINX, HAProxy, and Envoy use cases.

Resources

Practice Problems

Configure NGINX as a reverse proxy with SSL termination
Design a corporate forward proxy with content filtering
Compare: API Gateway vs reverse proxy — when to use each?

#17

Sharding (Deep Dive)

Databases Distributed Hard

Advanced: dynamic sharding, directory-based sharding, resharding without downtime, celebrity/hotspot problem, and global vs local indexes.

Resources

Practice Problems

Shard Discord's message store for 1T+ messages
Redesign Twitter's tweet storage with sharding by user ID
Design zero-downtime resharding for a growing startup DB

#18

Long Polling vs WebSockets

APIs Architecture Medium

Real-time communication: polling, long-polling, SSE, WebSockets. Trade-offs in connection overhead, latency, scalability, and browser support.

Resources

Practice Problems

Design WhatsApp's real-time messaging (WebSockets vs long polling)
Design live sports score updates for 10M concurrent users
Design a collaborative doc editor (Google Docs) — real-time sync

#19

Consistent Hashing

Distributed Performance Hard

Hash ring for distributing keys across nodes with minimal remapping when nodes join/leave. Virtual nodes for balance. Used in Cassandra, Dynamo-style stores, and Memcached client libraries.
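The ring can be sketched as a sorted list of (position, node) pairs, where each physical node contributes many virtual positions and a key belongs to the first position clockwise from its hash. `HashRing`, the `vnodes` default of 100, and the SHA-256-based `_position` helper are illustrative choices, not taken from any of the systems named above.

```python
import bisect
import hashlib


def _position(key):
    """Maps a string to a point on a 64-bit ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_position(key),))
        return self._ring[idx % len(self._ring)][1]
```

Removing a node only reassigns the keys that node owned; every other key keeps its owner, which is the "minimal remapping" property.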

Resources

Practice Problems

Design a distributed cache using consistent hashing (like Memcached)
Handle node failure and rebalancing in a Dynamo-style KV store
Design CDN server selection with consistent hashing

#20

gRPC, tRPC, GraphQL, or REST

APIs Medium

Choosing between API styles: REST for simplicity, GraphQL for flexible queries, gRPC for performance, tRPC for type-safe full-stack TypeScript.

Resources

Practice Problems

Design a GitHub-like API — justify REST vs GraphQL choice
Build inter-service communication for microservices with gRPC
Design a mobile app API — optimize for bandwidth with GraphQL

#21

Caching (Systems Level)

Performance Infra Medium

Redis vs Memcached. Cache tiers (CPU L1-L3, app cache, distributed cache). Cache hit ratio, TTL, cold start, and thundering herd problem at scale.

Resources

Practice Problems

Design a leaderboard using Redis sorted sets
Design distributed session storage with Redis cluster
Handle thundering herd on cache miss for a viral post

#22

Scaling

Infra Architecture Medium

Vertical (scale-up) vs horizontal (scale-out) scaling. Auto-scaling policies, database read replicas, stateless service scaling, and cost implications.

Resources

Practice Problems

Scale a monolith to 10M users — what breaks first?
Design auto-scaling for a flash sale that gets 100x traffic spike
Scale Twitter to handle the Super Bowl second-by-second

#23

Cache Eviction Policies

Performance Medium

LRU, LFU, FIFO, MRU, Random, TTL-based. When to use each, implementation complexity, and how Redis approximates LRU and LFU under the hood.
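As a reference point for implementation complexity, an exact LRU cache with O(1) `get` and `put` fits in a few lines when built on Python's `collections.OrderedDict`. This is one possible sketch; the `LRUCache` name and the `default` parameter are assumptions, and a from-scratch version would pair a hash map with a doubly linked list.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, most recent last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```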

Resources

Practice Problems

Implement LRU cache with O(1) get/put (LeetCode #146)
Implement LFU cache (LeetCode #460)
Design a CDN cache eviction policy for video segments

#24

Databases in System Design

Databases Medium

Choosing the right DB: OLTP vs OLAP, columnar stores, time-series DBs, graph DBs, full-text search. Replication, leader/follower, read replicas.

Resources

Practice Problems

Choose databases for Uber: trips, drivers, payments, analytics
Design database replication for 99.99% read availability
Design a time-series DB for IoT sensor data (1M writes/sec)

#25

JWTs

APIs Architecture Easy

JSON Web Tokens for stateless auth. Header/payload/signature structure, signing algorithms (HS256 vs RS256), token refresh patterns, and revocation challenges.

Resources

Practice Problems

Design JWT-based auth with refresh token rotation
Handle JWT revocation without a token blacklist at scale
Design SSO across microservices using JWTs

#26

Services in System Design

Architecture Medium

Microservices vs monolith vs SOA. Service decomposition, inter-service communication, service discovery, circuit breakers, and service mesh (Istio).
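A circuit breaker wraps calls to a dependency and fails fast once that dependency looks unhealthy, giving it time to recover. The sketch below assumes a simple consecutive-failure threshold and a fixed reset timeout; `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` are hypothetical names, and the half-open state lets a single trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, allow this call as a trial.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result
```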

Resources

Practice Problems

Decompose an e-commerce monolith into microservices
Design service discovery for 50+ microservices
Implement circuit breaker pattern for a payment service

#27

Concurrency vs Parallelism

Performance Architecture Medium

Concurrency = dealing with multiple things at once; Parallelism = doing multiple things at once. Threads, goroutines, event loops, locks, deadlocks, race conditions.

Resources

Practice Problems

Design a thread-safe rate limiter (TokenBucket with mutex)
Design concurrent image processing pipeline without race conditions
Implement optimistic vs pessimistic locking for a booking system

#28

CDC (Change Data Capture)

Databases Distributed Hard

Tracking and propagating DB changes in real-time using WAL/binlog. Tools: Debezium, Kafka Connect. Powers data sync, search indexing, cache invalidation.

Resources

Practice Problems

Sync PostgreSQL changes to Elasticsearch in real-time using CDC
Invalidate cache on DB writes using CDC pipeline
Design an audit log system using CDC (zero-impact on app code)

#29

ACID Transactions

Databases Distributed Hard

Atomicity, Consistency, Isolation, Durability. Isolation levels (Read Uncommitted → Serializable), phantom reads, dirty reads, 2PL, MVCC, and BASE vs ACID.
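Atomicity is the easiest of the four properties to demonstrate: either both sides of a transfer are applied or neither is. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, `make_bank`, and `transfer` are hypothetical, and it says nothing about isolation between concurrent connections.

```python
import sqlite3


def make_bank():
    """Creates an in-memory database with two example accounts."""
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
        conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
    return conn


def transfer(conn, src, dst, amount):
    # The connection context manager commits on success and rolls back
    # if any statement or check inside the block raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A failed transfer leaves both rows untouched, even though the two `UPDATE` statements had already run inside the transaction.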

Resources

Practice Problems

Design a bank transfer — ensure atomicity across 2 account rows
Identify and fix phantom read in a ticket reservation system
Design distributed transactions with Saga pattern (no 2PC)

#30

CDN

Infra Performance Easy

Content Delivery Networks — edge caching, PoP servers, anycast routing, push vs pull CDN, cache purging, and using CDN for dynamic content.

Resources

Practice Problems

Design a CDN strategy for Netflix video delivery
Design cache invalidation when a user updates their profile photo
Optimize a global e-commerce site with CDN for 50ms P99 latency

#31

Sync vs Async

Architecture Distributed Medium

Synchronous (blocking) vs asynchronous (non-blocking) communication. Callbacks, promises, async/await, event-driven architectures, and temporal coupling.

Resources

Practice Problems

Convert a sync order processing API to async with callbacks
Design an image resizing service using async queues
Design email sending — sync vs async trade-offs in checkout flow

#32

Rate Limiting Algorithms

APIs Infra Hard

Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, Sliding Window Counter. Distributed rate limiting with Redis. Choosing the right algorithm.
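Token bucket is a reasonable first algorithm to implement: tokens refill at a steady rate up to a capacity, and each request spends one, so short bursts are allowed while the long-run rate stays bounded. The single-process sketch below uses hypothetical names and an injectable `clock`; it is not thread-safe, and a distributed version would keep the token count and timestamp in a shared store such as Redis and update them atomically.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Returns True and spends tokens if the request fits, else False."""
        now = self._clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```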

Resources

Practice Problems

Design rate limiter for Twitter API (user + IP + global limits)
Implement distributed token bucket rate limiter using Redis
Design DDoS protection layer with adaptive rate limiting

#33

REST

APIs Foundational

RESTful constraints: statelessness, uniform interface, HATEOAS, resource naming, HTTP methods/status codes, and REST maturity model (Richardson).

Resources

Practice Problems

Design RESTful API for a blog (CRUD + pagination + filtering)
Design proper HTTP status codes for a payment API error taxonomy
Design a HATEOAS-compliant API for a workflow engine

#34

gRPC vs REST Trade-offs

APIs Performance Medium

Protocol Buffers vs JSON, HTTP/2 vs HTTP/1.1, bidirectional streaming, code generation, browser support limitations, and performance benchmarks.

Resources

Practice Problems

Choose REST vs gRPC for: public API, internal microservices, mobile
Design a real-time bidirectional chat with gRPC streaming
Migrate a REST internal API to gRPC — justify the decision

#35

Fault Tolerance

Distributed Infra Hard

Designing for failure: circuit breakers, bulkheads, retries with exponential backoff, timeouts, graceful degradation, chaos engineering, and disaster recovery.
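Retries with exponential backoff can be sketched as a small wrapper. The version below doubles the delay cap on each attempt and sleeps for a random fraction of that cap ("full jitter") so that many clients do not retry in lockstep. The function name, defaults, and the injectable `sleep` and `rng` parameters are assumptions; it also retries on every exception, whereas real code should retry only errors known to be transient and only for idempotent operations.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Calls operation() until it succeeds or the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)
```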

Resources

Practice Problems

Design circuit breaker for payment service with graceful degradation
Design chaos engineering test plan for a ride-sharing app
Design disaster recovery with RPO < 1min and RTO < 5min