HLD Fundamentals
AI Learning Mentor
Generative insights & diagnostic help
Scaling Foundations: Vertical vs Horizontal
When an application experiences high traffic loads, you must scale its infrastructure to prevent bottlenecks. There are two primary strategies:
- Vertical Scaling (Scale-Up): Adding more power (CPU, RAM, SSDs) to a single server. It is easy to implement but hits hard physical hardware limits and introduces a Single Point of Failure (SPOF).
- Horizontal Scaling (Scale-Out): Adding more standard servers to the resource pool. It requires a **Load Balancer** to distribute traffic, but offers infinite scaling capabilities and high fault tolerance.
The CAP Theorem & PACELC Theorem Tradeoffs
Formulated by Eric Brewer, the **CAP Theorem** states that a distributed data store can simultaneously provide at most two of the following three guarantees when a network partition occurs:
- Consistency (C): Every read receives the most recent write or an error.
- Availability (A): Every non-failing node returns a non-error response (without guarantee of containing the latest write).
- Partition Tolerance (P): The system continues to operate despite arbitrary network partition splits.
PACELC Theorem Extension
PACELC extends CAP by describing tradeoffs even when there is no network partition:
- If there is a Partition (P): Choose between Availability (A) or Consistency (C).
- Else (E) (Normal state): Choose between Latency (L) or Consistency (C).
High Availability Metrics: SLA vs SLO vs SLI
When designing large-scale distributed architectures, defining reliability metrics is essential for service delivery:
- SLA (Service Level Agreement): A formal contract between a service provider and users specifying expected reliability, with financial/contractual penalties if unmet.
- SLO (Service Level Objective): Internal reliability targets agreed upon by engineering and product teams (e.g. 'API requests should have a latency less than 200ms 99.9% of the time').
- SLI (Service Level Indicator): Direct, real-time measurements indicating compliance with SLOs (e.g. 'The actual latency for GET requests is 142ms over the past hour').
• 99% Availability (Two Nines): 3.65 days of downtime per year.
• 99.9% Availability (Three Nines): 8.77 hours of downtime per year.
• 99.999% Availability (Five Nines): 5.26 minutes of downtime per year.
NewSQL & Distributed Consensus (Raft / Paxos)
Traditional SQL scales vertically, while NoSQL scales horizontally by discarding strict ACID constraints. **NewSQL** (Google Spanner, CockroachDB) bridges this gap, providing **horizontal scaling with absolute ACID guarantees**.
To coordinate transactions and agree on system states across multiple geographical nodes, NewSQL utilizes **Distributed Consensus Algorithms**:
- Raft: A highly understandable leader-based consensus protocol. Replicates logs across a majority of nodes. Handles leader election dynamically when the active leader fails.
- Paxos: A classic, mathematically proven consensus protocol using proposers, acceptors, and learners to secure agreements over state transitions.
Database Sharding & Consistent Hashing
When database tables exceed billions of records, a single database node becomes sluggish. **Sharding** partitions tables horizontally across separate database machines.
To allocate queries to correct shards dynamically without resetting all indices when a node is added/removed, we use **Consistent Hashing**. Nodes and keys are mapped onto a circular hash ring, minimizing data migrations when nodes scale.
Distributed Caching Strategies
Caching stores high-frequency data in lightning-fast RAM to bypass slow disk reads. Common invalidation policies include:
- Write-Through: Data is written to the cache and the backing database simultaneously. Keeps data consistent but adds write latency.
- Write-Back (Write-Behind): Data is written to the cache immediately, and synced asynchronously to the database later. Low latency but risks data loss during failures.
- Cache-Aside (Lazy Loading): App checks the cache. On miss, it queries the database, updates the cache, and returns data.
Messaging & Event-Driven: Queue vs Event Streaming
Asynchronous communications decouple workloads. Choose the correct pattern for your system:
- Message Queues (RabbitMQ, SQS): Traditional point-to-point queues. A message is delivered to **one** consumer, parsed, and deleted. Best for job worker pools.
- Event Streaming (Kafka, Kinesis): Append-only logs. Messages are retained on disk and can be read/replayed by **multiple independent consumer groups** at their own offsets. Best for real-time analytics.
- Dead Letter Queues (DLQ): A secondary queue where messages that fail parsing/processing are automatically rerouted for administrative auditing, preventing head-of-line blocking.
- Exactly-Once vs At-Least-Once: To guarantee exactly-once delivery, systems combine **idempotency checks** on consumers with retry tokens and deduplication keys.
Microservices consistency: Sagas & CQRS / Event Sourcing
In microservice architectures, maintaining transactions across separate database boundaries is challenging. We avoid two-phase commits (2PC) because they are blocking and slow, preferring:
- Saga Pattern: A distributed transaction is broken down into a series of local transactions. Each microservice executes its step. If a step fails, the Saga orchestrator/coordinator triggers **compensating transactions** (rollbacks) in reverse order.
- CQRS (Command Query Responsibility Segregation): Segregates the write database (Command) from the read database (Query), optimizing read performance.
- Event Sourcing: Instead of storing only the active state of an entity, we store the **entire sequence of state-changing events** in an append-only event store, allowing absolute auditability and replaying system state to any point in history.