// system design mastery

Lock In & Level Up

Your complete resource guide for all 35 system design topics — with curated learning materials and practice problems.

35 Topics
100+ Resources
70+ Practice Problems
#01
Load Balancing
Infra Performance Medium

Distributing traffic across multiple servers to maximize throughput, minimize latency, and avoid overload. Covers algorithms: round-robin, least connections, IP hashing.

  • Design a load balancer for a video streaming service (Netflix-like)
  • Implement a consistent hashing-based load balancer that handles node failures
  • Design URL shortener with load balancing across 3 regions
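
The three algorithms named in this card can be sketched in a few lines. This is a minimal in-memory illustration, not a production balancer: the class names, the `ip_hash_pick` helper, and the server labels are all hypothetical, and health checks and weights are left out.

```python
import hashlib
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest open connections."""

    def __init__(self, servers):
        # Hypothetical bookkeeping: open connection count per server.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # On a tie, min() returns the first server in insertion order.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


def ip_hash_pick(servers, client_ip):
    """Maps a client IP to a stable server while the server list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

Note that plain modulo hashing remaps most clients when the server list changes, which is the problem consistent hashing (#19) is meant to reduce.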
#02
SQL vs NoSQL
Databases Medium

Relational vs non-relational databases — when to choose each, trade-offs in consistency, scalability, and schema flexibility.

  • Design a social graph — justify SQL vs NoSQL choice
  • Design a product catalog for e-commerce: when does NoSQL win?
  • Schema design for a multi-tenant SaaS app
#03
Idempotency
APIs Distributed Medium

Designing operations that can be safely retried without causing unintended side effects. Critical for payment systems, message queues, and distributed APIs.

  • Design idempotent payment processing — handle duplicate charges
  • Build a retry mechanism for a distributed order service
  • Design an email notification system that achieves effectively-once delivery (at-least-once delivery plus deduplication)
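
One common way to make a retried operation safe is an idempotency key supplied by the client. The sketch below assumes an in-memory store and a hypothetical `PaymentService` class; a real system would more likely keep the keys in a durable store with a uniqueness constraint and also check that the retried request body matches the original.

```python
import threading


class PaymentService:
    """Hypothetical service that charges at most once per idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self._lock = threading.Lock()
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        with self._lock:
            # A retry with a known key replays the stored response
            # instead of charging the account again.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self.charges.append((account, amount_cents))
            result = {"charge_id": len(self.charges), "status": "succeeded"}
            self._results[idempotency_key] = result
            return result
```

Calling `charge` twice with the same key returns the same response and records a single charge; a new key produces a new charge.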
#04
Message Queues
Infra Distributed Medium

Async communication via Kafka, RabbitMQ, SQS. Topics: pub/sub, dead letter queues, ordering guarantees, consumer groups, and backpressure.

  • Design a notification system using Kafka (email, SMS, push)
  • Design an order processing pipeline with guaranteed delivery
  • Build a job queue for a distributed task scheduler
#05
CAP Theorem
Distributed Databases Hard

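Consistency, Availability, and Partition Tolerance: when a network partition occurs, a system must choose between consistency and availability, so it cannot guarantee all three at once. Understanding CP vs AP systems and the PACELC extension.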

  • Design a banking system — justify your CAP trade-offs
  • Compare DynamoDB (AP) vs HBase (CP) for a shopping cart
  • Design a distributed key-value store, explain partition handling
#06
APIs
APIs Foundational

API design principles: versioning, pagination, error handling, authentication patterns, rate limiting, and documentation best practices.

  • Design the Twitter API — tweets, follows, timeline endpoints
  • Design a paginated search API for a 1B-item catalog
  • Design a public API with versioning strategy (v1 → v2 migration)
#07
Batch vs Stream Processing
Infra Performance Hard

MapReduce, Spark, Flink, Kafka Streams. Lambda vs Kappa architecture. When to process data in bulk vs real-time event streams.

  • Design a real-time fraud detection system (stream processing)
  • Design a nightly analytics pipeline for 100M daily events
  • Design YouTube's view count system (eventual vs real-time)
#08
Caching Strategies
Performance Infra Medium

Cache-aside, write-through, write-behind, read-through. Where to cache: client, CDN, app server, DB. Cache invalidation and stampede problems.

  • Design Instagram's feed cache — what to cache, when to invalidate
  • Handle cache stampede in a high-traffic flash sale
  • Design a multi-layer cache (L1/L2/L3) for a search engine
#09
Webhooks
APIs Architecture Easy

Event-driven HTTP callbacks. Delivery guarantees, retry logic, signature verification, fan-out patterns, and webhooks vs polling vs SSE.

  • Design a webhook delivery system with retries and failure handling
  • Design GitHub Actions triggers — webhook fan-out to 100k subscribers
  • Build webhook signature verification for a payment provider
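
Signature verification usually comes down to recomputing an HMAC over the raw request body and comparing it in constant time. The sketch below is one possible scheme, not any specific provider's format: the `timestamp.body` message layout, the function names, and the 300-second tolerance are assumptions made for illustration.

```python
import hashlib
import hmac
import time


def sign_payload(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over an assumed '<timestamp>.<body>' message."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret, timestamp, body, signature, now=None,
                   tolerance_seconds=300):
    """Receiver side: reject stale timestamps, then compare signatures."""
    current = time.time() if now is None else now
    # An old timestamp suggests a replayed request.
    if abs(current - timestamp) > tolerance_seconds:
        return False
    expected = sign_payload(secret, timestamp, body)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected.encode("ascii"),
                               signature.encode("utf-8"))
```

The body must be verified as the exact bytes received; re-serializing parsed JSON before hashing generally breaks the comparison.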
#10
Availability
Distributed Infra Medium

SLAs, the "nines" (99.9% vs 99.99%), redundancy, failover, health checks, circuit breakers, and designing for high availability across regions.

  • Design a 99.99% available global payment API
  • Design multi-region failover for a ride-sharing app
  • Calculate availability for a system with 5 dependent services
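
The last problem above is mostly arithmetic. Assuming failures are independent, services that all must be up multiply their availabilities, while redundant replicas multiply their failure probabilities. The function names here are illustrative.

```python
import math


def serial_availability(availabilities):
    """All dependencies must be up: multiply the individual availabilities."""
    return math.prod(availabilities)


def parallel_availability(availability, replicas):
    """At least one of N independent replicas must be up."""
    return 1 - (1 - availability) ** replicas


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes, using a 365-day year."""
    return (1 - availability) * 365 * 24 * 60
```

Under these assumptions, five dependent services at 99.9% each yield roughly 99.5% overall, and 99.99% corresponds to about 52.6 minutes of downtime per year. Real failures are often correlated, so these figures are better read as an optimistic estimate.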
#11
Data Sharding & Partitioning
Databases Distributed Hard

Horizontal partitioning by key, range, or hash. Hotspot problems, cross-shard queries, rebalancing, and the difference between sharding and partitioning.

  • Design sharding strategy for a 10TB user database
  • Handle a hot shard in a social media timeline system
  • Design cross-shard transaction handling for a banking app
#12
Bloom Filters
Performance Distributed Hard

Probabilistic data structure for membership tests. Space-efficient "maybe yes, definitely no" filter. Used in databases, CDNs, and duplicate detection.

  • Use a Bloom filter to reduce DB lookups for non-existent users
  • Design duplicate URL detection for a web crawler (Googlebot-scale)
  • Implement Bloom filter for "safe browsing" malicious URL detection
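
A minimal Bloom filter sketch follows, using the standard sizing formulas for the bit count and the number of hash functions. Deriving the hash positions from one SHA-256 digest by double hashing is an implementation choice made here for brevity; the class and method names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = h1 + i * h2 (mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

In the first problem above, a `False` answer lets the service skip the database lookup entirely, while a `True` answer still requires the real lookup.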
#13
Stateful vs Stateless Architecture
Architecture Infra Medium

Stateless services scale horizontally easily; stateful ones need sticky sessions or external state stores. Implications for microservices, k8s, and auth.

  • Refactor a stateful session service to be horizontally scalable
  • Design stateless auth with JWTs replacing server-side sessions
  • Design a real-time game server — where must state live?
#14
Algorithms in Distributed Systems
Distributed Hard

Paxos, Raft consensus, gossip protocols, vector clocks, two-phase commit (2PC), Lamport timestamps, and leader election algorithms.

  • Design a leader election service using Raft for a distributed DB
  • Implement vector clocks for conflict resolution in a distributed KV store
  • Design failure detection using gossip protocol
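
Vector clocks can be illustrated with plain dictionaries that map a node ID to a counter. This is a sketch of the comparison rules only; the function names and the string results are assumptions, and how a store resolves concurrent versions is left open.

```python
def increment(clock, node):
    """Returns a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum, applied when a node receives another clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Returns 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

Two versions that compare as `concurrent` were written without knowledge of each other, which is the case a key-value store has to resolve (for example by keeping both as siblings).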
#15
API Gateways
APIs Infra Medium

Single entry point for microservices. Handles auth, rate limiting, routing, SSL termination, request aggregation, and observability (Kong, AWS API GW).

  • Design an API gateway for a microservices e-commerce platform
  • Add rate limiting + JWT auth to an existing gateway
  • Design request aggregation to reduce mobile client round-trips
#16
Proxy vs Reverse Proxy
Infra Easy

Forward proxy (client anonymity, filtering) vs reverse proxy (server protection, load balancing, caching). NGINX, HAProxy, and Envoy use cases.

  • Configure NGINX as a reverse proxy with SSL termination
  • Design a corporate forward proxy with content filtering
  • Compare: API Gateway vs reverse proxy — when to use each?
#17
Sharding (Deep Dive)
Databases Distributed Hard

Advanced: dynamic sharding, directory-based sharding, resharding without downtime, celebrity/hotspot problem, and global vs local indexes.

  • Shard Discord's message store for 1T+ messages
  • Redesign Twitter's tweet storage with sharding by user ID
  • Design zero-downtime resharding for a growing startup DB
#18
Long Polling vs WebSockets
APIs Architecture Medium

Real-time communication: polling, long-polling, SSE, WebSockets. Trade-offs in connection overhead, latency, scalability, and browser support.

  • Design WhatsApp's real-time messaging (WebSockets vs long polling)
  • Design live sports score updates for 10M concurrent users
  • Design a collaborative doc editor (Google Docs) — real-time sync
#19
Consistent Hashing
Distributed Performance Hard

Hash ring for distributing keys across nodes with minimal remapping when nodes join/leave. Virtual nodes for balance. Used in Cassandra, DynamoDB, Memcached.

  • Design a distributed cache using consistent hashing (like Memcached)
  • Handle node failure and rebalancing in a Dynamo-style KV store
  • Design CDN server selection with consistent hashing
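
The hash ring with virtual nodes described above can be sketched as a sorted list of points. The `HashRing` class, the `node#i` virtual-node naming, and the default of 100 virtual nodes are illustrative assumptions, not taken from any particular system.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node):
        # Each physical node owns many points on the ring for better balance.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(point, n) for point, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is removed, only the keys it owned move to other nodes; every other key keeps its owner, which is the minimal-remapping property the card refers to.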
#20
gRPC, tRPC, GraphQL, or REST
APIs Medium

Choosing between communication protocols: REST for simplicity, GraphQL for flexible queries, gRPC for performance, tRPC for type-safe TS full-stack.

  • Design a GitHub-like API — justify REST vs GraphQL choice
  • Build inter-service communication for microservices with gRPC
  • Design a mobile app API — optimize for bandwidth with GraphQL
#21
Caching (Systems Level)
Performance Infra Medium

Redis vs Memcached. Cache tiers (CPU L1-L3, app cache, distributed cache). Cache hit ratio, TTL, cold start, and thundering herd problem at scale.

  • Design a leaderboard using Redis sorted sets
  • Design distributed session storage with Redis cluster
  • Handle thundering herd on cache miss for a viral post
#22
Scaling
Infra Architecture Medium

Vertical (scale-up) vs horizontal (scale-out) scaling. Auto-scaling policies, database read replicas, stateless service scaling, and cost implications.

  • Scale a monolith to 10M users — what breaks first?
  • Design auto-scaling for a flash sale that gets 100x traffic spike
  • Scale Twitter to handle the Super Bowl second-by-second
#23
Cache Eviction Policies
Performance Medium

LRU, LFU, FIFO, MRU, Random, TTL-based. When to use each, implementation complexity, and how Redis implements these under the hood.

  • Implement LRU cache with O(1) get/put (LeetCode #146)
  • Implement LFU cache (LeetCode #460)
  • Design a CDN cache eviction policy for video segments
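
For the LRU problem above, one compact approach in Python relies on `collections.OrderedDict`, which keeps insertion order and can move a key to the end in O(1). This is a sketch of one option; interview settings often expect a hand-written hash map plus doubly linked list instead.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```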
#24
Databases in System Design
Databases Medium

Choosing the right DB: OLTP vs OLAP, columnar stores, time-series DBs, graph DBs, full-text search. Replication, leader/follower, read replicas.

  • Choose databases for Uber: trips, drivers, payments, analytics
  • Design database replication for 99.99% read availability
  • Design a time-series DB for IoT sensor data (1M writes/sec)
#25
JWTs
APIs Architecture Easy

JSON Web Tokens for stateless auth. Header/payload/signature structure, signing algorithms (HS256 vs RS256), token refresh patterns, and revocation challenges.

  • Design JWT-based auth with refresh token rotation
  • Handle JWT revocation without a token blacklist at scale
  • Design SSO across microservices using JWTs
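
The header/payload/signature structure can be shown with the standard library alone. The sketch below covers only HS256 and the `exp` claim, with hypothetical function names; production code would normally use a maintained JWT library and validate further claims such as issuer and audience.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload))
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes, now=None) -> dict:
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm rather than trusting whatever the header claims.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    exp = payload.get("exp")
    if exp is not None and (time.time() if now is None else now) >= exp:
        raise ValueError("token expired")
    return payload
```

The payload is only base64url-encoded, not encrypted, so anything placed in it is readable by whoever holds the token.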
#26
Services in System Design
Architecture Medium

Microservices vs monolith vs SOA. Service decomposition, inter-service communication, service discovery, circuit breakers, and service mesh (Istio).

  • Decompose an e-commerce monolith into microservices
  • Design service discovery for 50+ microservices
  • Implement circuit breaker pattern for a payment service
#27
Concurrency vs Parallelism
Performance Architecture Medium

Concurrency = dealing with multiple things at once; Parallelism = doing multiple things at once. Threads, goroutines, event loops, locks, deadlocks, race conditions.

  • Design a thread-safe rate limiter (TokenBucket with mutex)
  • Design concurrent image processing pipeline without race conditions
  • Implement optimistic vs pessimistic locking for a booking system
#28
CDC (Change Data Capture)
Databases Distributed Hard

Tracking and propagating DB changes in real-time using WAL/binlog. Tools: Debezium, Kafka Connect. Powers data sync, search indexing, cache invalidation.

  • Sync PostgreSQL changes to Elasticsearch in real-time using CDC
  • Invalidate cache on DB writes using CDC pipeline
  • Design an audit log system using CDC (zero-impact on app code)
#29
ACID Transactions
Databases Distributed Hard

Atomicity, Consistency, Isolation, Durability. Isolation levels (Read Uncommitted → Serializable), phantom reads, dirty reads, two-phase locking (2PL), MVCC, and BASE vs ACID.

  • Design a bank transfer — ensure atomicity across 2 account rows
  • Identify and fix phantom read in a ticket reservation system
  • Design distributed transactions with Saga pattern (no 2PC)
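
Atomicity for the bank transfer problem can be demonstrated on a single node with SQLite from the standard library. The `accounts` schema and the helper names below are assumptions for illustration; the sketch shows single-database atomicity only and does not address the distributed cases.

```python
import sqlite3


def open_demo_db():
    """Creates a hypothetical in-memory schema with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    # The connection context manager commits on success and rolls back
    # if an exception escapes, so both updates apply or neither does.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?", (amount, src, amount))
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")
```

If the credit step fails after the debit has run, the rollback restores the source balance, so money is never removed from one row without being added to the other.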
#30
CDN
Infra Performance Easy

Content Delivery Networks — edge caching, PoP servers, anycast routing, push vs pull CDN, cache purging, and using CDN for dynamic content.

  • Design a CDN strategy for Netflix video delivery
  • Design cache invalidation when a user updates their profile photo
  • Optimize a global e-commerce site with CDN for 50ms P99 latency
#31
Sync vs Async
Architecture Distributed Medium

Synchronous (blocking) vs asynchronous (non-blocking) communication. Callbacks, promises, async/await, event-driven architectures, and temporal coupling.

  • Convert a sync order processing API to async with callbacks
  • Design an image resizing service using async queues
  • Design email sending — sync vs async trade-offs in checkout flow
#32
Rate Limiting Algorithms
APIs Infra Hard

Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, Sliding Window Counter. Distributed rate limiting with Redis. Choosing the right algorithm.

  • Design rate limiter for Twitter API (user + IP + global limits)
  • Implement distributed token bucket rate limiter using Redis
  • Design DDoS protection layer with adaptive rate limiting
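
Of the algorithms listed, the token bucket is probably the easiest to sketch: tokens refill at a steady rate up to a cap, and each request spends one. The version below is a single-process illustration guarded by a mutex; the class name and the injectable `clock` parameter are assumptions, and a distributed variant would keep the same state in a shared store such as Redis.

```python
import threading
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock            # injectable so tests can control time
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def allow(self, cost=1):
        with self._lock:
            now = self._clock()
            # Refill lazily based on elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity,
                               self._tokens + elapsed * self.refill_per_second)
            self._last = now
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False
```

The capacity sets how large a burst is tolerated, while the refill rate sets the sustained request rate.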
#33
REST
APIs Foundational

RESTful constraints: statelessness, uniform interface, HATEOAS, resource naming, HTTP methods/status codes, and REST maturity model (Richardson).

  • Design RESTful API for a blog (CRUD + pagination + filtering)
  • Design proper HTTP status codes for a payment API error taxonomy
  • Design a HATEOAS-compliant API for a workflow engine
#34
gRPC vs REST Trade-offs
APIs Performance Medium

Protocol Buffers vs JSON, HTTP/2 vs HTTP/1.1, bidirectional streaming, code generation, browser support limitations, and performance benchmarks.

  • Choose REST vs gRPC for: public API, internal microservices, mobile
  • Design a real-time bidirectional chat with gRPC streaming
  • Migrate a REST internal API to gRPC — justify the decision
#35
Fault Tolerance
Distributed Infra Hard

Designing for failure: circuit breakers, bulkheads, retries with exponential backoff, timeouts, graceful degradation, chaos engineering, and disaster recovery.

  • Design circuit breaker for payment service with graceful degradation
  • Design chaos engineering test plan for a ride-sharing app
  • Design disaster recovery with RPO < 1min and RTO < 5min
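
Two of the patterns named in this card, the circuit breaker and exponential backoff, can be sketched briefly. The class, the state names, and the default thresholds are illustrative assumptions; the sketch is not thread-safe, and the backoff helper omits the random jitter that real retry loops usually add.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def backoff_delays(attempts, base=0.1, factor=2.0, cap=5.0):
    """Delay before each retry: base * factor**i, capped at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

While the circuit is open, callers can fall back to a degraded response (for example a cached value or a queued request) instead of waiting on a failing dependency.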