System Design & DSA Masterclass Advanced +250 XP

HLD Fundamentals

Scaling Foundations: Vertical vs Horizontal

When an application experiences high traffic loads, you must scale its infrastructure to prevent bottlenecks. There are two primary strategies:

  • Vertical Scaling (Scale-Up): Adding more power (CPU, RAM, SSDs) to a single server. It is easy to implement but hits hard physical hardware limits and introduces a Single Point of Failure (SPOF).
  • Horizontal Scaling (Scale-Out): Adding more standard servers to the resource pool. It requires a **Load Balancer** to distribute traffic, but offers infinite scaling capabilities and high fault tolerance.

The CAP Theorem & PACELC Theorem Tradeoffs

Formulated by Eric Brewer, the **CAP Theorem** states that a distributed data store can simultaneously provide at most two of the following three guarantees when a network partition occurs:

  • Consistency (C): Every read receives the most recent write or an error.
  • Availability (A): Every non-failing node returns a non-error response (without guarantee of containing the latest write).
  • Partition Tolerance (P): The system continues to operate despite arbitrary network partition splits.
💡 Critical Interview Insight: Since networks are physically imperfect, partitions **will** happen. Therefore, in distributed databases, you must choose between **Consistency (CP)** or **Availability (AP)**. You cannot choose CA.

PACELC Theorem Extension

PACELC extends CAP by describing tradeoffs even when there is no network partition:

  • If there is a Partition (P): Choose between Availability (A) or Consistency (C).
  • Else (E) (Normal state): Choose between Latency (L) or Consistency (C).

High Availability Metrics: SLA vs SLO vs SLI

When designing large-scale distributed architectures, defining reliability metrics is essential for service delivery:

  • SLA (Service Level Agreement): A formal contract between a service provider and users specifying expected reliability, with financial/contractual penalties if unmet.
  • SLO (Service Level Objective): Internal reliability targets agreed upon by engineering and product teams (e.g. 'API requests should have a latency less than 200ms 99.9% of the time').
  • SLI (Service Level Indicator): Direct, real-time measurements indicating compliance with SLOs (e.g. 'The actual latency for GET requests is 142ms over the past hour').
📊 Downtime Calculations (The Power of Nines):
99% Availability (Two Nines): 3.65 days of downtime per year.
99.9% Availability (Three Nines): 8.77 hours of downtime per year.
99.999% Availability (Five Nines): 5.26 minutes of downtime per year.

NewSQL & Distributed Consensus (Raft / Paxos)

Traditional SQL scales vertically, while NoSQL scales horizontally by discarding strict ACID constraints. **NewSQL** (Google Spanner, CockroachDB) bridges this gap, providing **horizontal scaling with absolute ACID guarantees**.

To coordinate transactions and agree on system states across multiple geographical nodes, NewSQL utilizes **Distributed Consensus Algorithms**:

  • Raft: A highly understandable leader-based consensus protocol. Replicates logs across a majority of nodes. Handles leader election dynamically when the active leader fails.
  • Paxos: A classic, mathematically proven consensus protocol using proposers, acceptors, and learners to secure agreements over state transitions.

Database Sharding & Consistent Hashing

When database tables exceed billions of records, a single database node becomes sluggish. **Sharding** partitions tables horizontally across separate database machines.

To allocate queries to correct shards dynamically without resetting all indices when a node is added/removed, we use **Consistent Hashing**. Nodes and keys are mapped onto a circular hash ring, minimizing data migrations when nodes scale.

Distributed Caching Strategies

Caching stores high-frequency data in lightning-fast RAM to bypass slow disk reads. Common invalidation policies include:

  • Write-Through: Data is written to the cache and the backing database simultaneously. Keeps data consistent but adds write latency.
  • Write-Back (Write-Behind): Data is written to the cache immediately, and synced asynchronously to the database later. Low latency but risks data loss during failures.
  • Cache-Aside (Lazy Loading): App checks the cache. On miss, it queries the database, updates the cache, and returns data.

Messaging & Event-Driven: Queue vs Event Streaming

Asynchronous communications decouple workloads. Choose the correct pattern for your system:

  • Message Queues (RabbitMQ, SQS): Traditional point-to-point queues. A message is delivered to **one** consumer, parsed, and deleted. Best for job worker pools.
  • Event Streaming (Kafka, Kinesis): Append-only logs. Messages are retained on disk and can be read/replayed by **multiple independent consumer groups** at their own offsets. Best for real-time analytics.
  • Dead Letter Queues (DLQ): A secondary queue where messages that fail parsing/processing are automatically rerouted for administrative auditing, preventing head-of-line blocking.
  • Exactly-Once vs At-Least-Once: To guarantee exactly-once delivery, systems combine **idempotency checks** on consumers with retry tokens and deduplication keys.

Microservices consistency: Sagas & CQRS / Event Sourcing

In microservice architectures, maintaining transactions across separate database boundaries is challenging. We avoid two-phase commits (2PC) because they are blocking and slow, preferring:

  • Saga Pattern: A distributed transaction is broken down into a series of local transactions. Each microservice executes its step. If a step fails, the Saga orchestrator/coordinator triggers **compensating transactions** (rollbacks) in reverse order.
  • CQRS (Command Query Responsibility Segregation): Segregates the write database (Command) from the read database (Query), optimizing read performance.
  • Event Sourcing: Instead of storing only the active state of an entity, we store the **entire sequence of state-changing events** in an append-only event store, allowing absolute auditability and replaying system state to any point in history.