Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.5 — Scaling Concepts

Interview Questions: Scaling Concepts

Model answers for vertical vs horizontal scaling, load balancers, and stateless design.

How to use this material (instructions)

  1. Read lessons in order -- README.md, then 6.5.a -> 6.5.c.
  2. Practice out loud -- definition -> example -> pitfall.
  3. Pair with exercises -- 6.5-Exercise-Questions.md.
  4. Quick review -- 6.5-Quick-Revision.md.

Beginner (Q1--Q4)

Q1. What is the difference between vertical and horizontal scaling?

Why interviewers ask: Tests your understanding of the two fundamental scaling strategies -- this is the foundation for every scaling conversation in system design interviews.

Model answer:

Vertical scaling (scale up) means upgrading a single machine: more CPU, more RAM, faster disks. It is simple -- no code changes needed -- but has a hard ceiling (you cannot buy an infinitely large machine) and creates a single point of failure. Cost grows superlinearly: a machine with 2x the capacity typically costs 2.5-3x more.

Horizontal scaling (scale out) means adding more machines running the same application, with a load balancer distributing traffic across them. Cost grows linearly (2x machines = ~2x cost), there is no hard ceiling, and you get fault tolerance (if one server dies, others continue). The trade-off is complexity: your application must be stateless (no in-memory sessions, no local file storage), and you need load balancers, health checks, and deployment coordination.

Most production systems use both: scale the application tier horizontally (many small instances behind a load balancer) and scale the database tier vertically (bigger RDS instance) until the database needs read replicas or sharding.


Q2. What does a load balancer do and why is it necessary?

Why interviewers ask: Load balancers are ubiquitous in production architectures. This tests whether you understand the fundamentals of traffic distribution.

Model answer:

A load balancer sits between clients and backend servers. It performs four core functions:

  1. Traffic distribution -- spreads incoming requests across a pool of healthy servers so no single server is overwhelmed.
  2. Health checking -- periodically pings each server (e.g., GET /health). If a server fails, the LB removes it from the pool automatically.
  3. SSL termination -- handles HTTPS encryption/decryption so backend servers deal with plain HTTP, simplifying certificate management.
  4. High availability -- if one backend server dies, remaining servers continue handling traffic. Users may not notice the failure.

It is necessary for horizontal scaling because you need something to decide which server handles each request. Without a load balancer, clients would need to know about individual server IPs and handle failover themselves -- which is impractical.

Common load balancers include AWS ALB (Layer 7, HTTP-aware), AWS NLB (Layer 4, TCP-level), Nginx (open-source, highly configurable), and HAProxy (open-source, excellent for raw TCP).
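
The health-checking in point 2 above can be sketched as a small handler the load balancer would poll. This is a sketch, not a fixed contract -- the dependency names and the idea of pinging each one are assumptions, but the shape (200 when healthy, 503 otherwise, so the LB can eject the instance) is the standard pattern:

```javascript
// Health-check handler a load balancer could poll (e.g., GET /health).
// Returns 200 only when every critical dependency responds.
async function healthCheck(deps) {
  const checks = {};
  for (const [name, ping] of Object.entries(deps)) {
    try {
      await ping();             // e.g., SELECT 1 against the DB, PING to Redis
      checks[name] = 'ok';
    } catch {
      checks[name] = 'failing';
    }
  }
  const healthy = Object.values(checks).every((s) => s === 'ok');
  return { status: healthy ? 200 : 503, checks };
}

// Wiring it into Express might look like:
// app.get('/health', async (req, res) => {
//   const result = await healthCheck({ db: () => pool.query('SELECT 1') });
//   res.status(result.status).json(result.checks);
// });
```

Returning 503 (rather than crashing or timing out) is what lets the LB remove the instance from the pool quickly and cleanly.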


Q3. What does "stateless" mean and why does it matter for scaling?

Why interviewers ask: Statelessness is the single most important design principle for horizontally scalable web applications. Interviewers want to know if you understand why.

Model answer:

A stateless server stores no per-client data (sessions, caches, uploaded files) in its own memory or local filesystem between requests. Every request contains everything needed to process it -- typically via a JWT token in the Authorization header or a session ID that maps to data in an external store like Redis.

This matters because it means any server instance can handle any request. The load balancer can use simple round-robin or least-connections without worrying about sending a user to "their" server. This unlocks:

  • Horizontal scaling -- add or remove instances freely.
  • Zero-downtime deployments -- replace servers one by one.
  • Auto-scaling -- spin up instances during peaks, terminate during valleys.
  • Fault tolerance -- a dead server loses no user state.

The common mistake is thinking stateless means "no state at all." State still exists -- it is just stored externally: sessions in Redis, files in S3, caches in Redis/Memcached, and jobs in a distributed queue like Bull.
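
To make "state still exists, it is just stored externally" concrete, here is a toy sketch in which a plain Map stands in for Redis. The names are illustrative; the point is that any "instance" can resolve any session because none of them holds the data locally:

```javascript
// External session store shared by all instances (a Map standing in for Redis).
const sessionStore = new Map();

function createSession(userId) {
  const sid = `sess:${userId}:${Date.now()}`;
  sessionStore.set(sid, { userId, createdAt: Date.now() });
  return sid;
}

// Any instance can handle the request: it only needs the shared store,
// not a memory of which server created the session.
function handleRequest(instanceName, sid) {
  const session = sessionStore.get(sid);
  if (!session) return { status: 401, body: 'unknown session' };
  return { status: 200, body: `served by ${instanceName} for user ${session.userId}` };
}
```

With a real Redis behind the same interface, the load balancer is free to send consecutive requests from one user to different servers.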


Q4. What is the difference between Layer 4 and Layer 7 load balancing?

Why interviewers ask: Tests depth of networking knowledge beyond "it distributes traffic." The distinction affects your architecture choices.

Model answer:

Layer 4 (Transport) load balancers operate on TCP/UDP packets. They see IP addresses and port numbers but cannot inspect HTTP headers, URLs, cookies, or request bodies. They are extremely fast (minimal per-packet processing) and are used for raw TCP traffic like database connections, gaming servers, or any non-HTTP protocol. AWS NLB operates at Layer 4.

Layer 7 (Application) load balancers operate on HTTP/HTTPS requests. They can see the full request -- URL path, headers, cookies, body -- and make routing decisions based on this information. For example, routing /api/* requests to API servers and /static/* to a CDN origin. They can also perform SSL termination, inject headers like X-Forwarded-For, and enable features like sticky sessions via cookies. AWS ALB operates at Layer 7.

For most web applications and APIs, you want Layer 7 (ALB) because you need URL-based routing, SSL termination, and health checks that verify HTTP responses. You choose Layer 4 (NLB) when you need ultra-low latency, static IPs, or non-HTTP protocols.


Intermediate (Q5--Q8)

Q5. Explain the different load balancing algorithms and when you would use each.

Why interviewers ask: Shows you understand that "load balancing" is not one-size-fits-all -- different workloads need different strategies.

Model answer:

Round Robin: Requests go to servers in sequence (1, 2, 3, 1, 2, 3...). Simple and even distribution when servers are identical and requests are uniform. Weakness: ignores server load -- a server handling a slow query still gets the next request.

Weighted Round Robin: Same as round robin, but servers with more capacity receive proportionally more requests. Useful when you have mixed instance sizes (e.g., an m5.xlarge with weight 4 gets 4x the traffic of a t3.small with weight 1).

Least Connections: Each new request goes to the server with the fewest active connections. Adapts automatically to slow requests -- a server stuck processing a file upload naturally gets fewer new requests. Best default for most web applications.

IP Hash: The client's IP is hashed to deterministically route them to the same server. Provides sticky-session-like behavior without cookies. Problem: uneven distribution when many clients share an IP (corporate NAT), and adding/removing servers causes mass rehashing.

Least Response Time: Routes to the server with the lowest average latency combined with fewest connections. Most "intelligent" but requires constant latency measurement and can cause thundering herd toward a suddenly fast server.

My default recommendation for most APIs is least connections -- it handles variable request durations gracefully without the operational complexity of response-time tracking.
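
The first and the recommended algorithm can be sketched in a few lines each (server names and connection counts are invented for illustration):

```javascript
// Round robin: rotate through the pool, ignoring current load.
function makeRoundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Least connections: pick the server with the fewest in-flight requests
// (ties broken by list order).
function pickLeastConnections(servers) {
  return servers.reduce((best, s) => (s.active < best.active ? s : best));
}

const pool = [
  { name: 'web-1', active: 3 },
  { name: 'web-2', active: 0 }, // idle -- least-connections should pick this
  { name: 'web-3', active: 5 },
];

const next = makeRoundRobin(pool);
```

Note how round robin would still hand `web-3` its turn despite its five in-flight requests, while least connections routes around it automatically.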


Q6. How would you scale a Node.js application from 1 to 10,000 requests per second?

Why interviewers ask: Tests practical scaling knowledge specific to the Node.js ecosystem, including understanding the single-threaded limitation.

Model answer:

Node.js runs on a single thread by default, so a single process cannot utilise multiple CPU cores. Here is the scaling progression:

Stage 1 (1-500 rps): Single process. Optimise code, add database indexes, implement caching. A well-written Express app on a 2-core machine handles 500+ rps for typical API endpoints.

Stage 2 (500-2000 rps): Cluster mode. Use the Node.js cluster module or PM2 in cluster mode to fork one worker per CPU core. On a 4-core machine, this roughly quadruples your throughput. PM2 also gives you zero-downtime reloads and process monitoring.

Stage 3 (2000-5000 rps): Multiple machines. Deploy 3-5 instances behind an ALB or Nginx load balancer. The application must be stateless -- sessions in Redis, files in S3. Auto-scaling rules based on CPU utilization (scale out at 70%, scale in at 30%).

Stage 4 (5000-10000+ rps): Kubernetes or ECS. Containerise the app, deploy to Kubernetes with a HorizontalPodAutoscaler. Use 10-50 pods across multiple availability zones. The database likely needs read replicas at this point. Add a Redis caching layer to reduce database load.

Throughout all stages: use connection pooling for the database, implement caching aggressively, profile bottlenecks with clinic.js or 0x, and ensure the database scales alongside the application tier.
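
Stage 2's cluster mode is often configured declaratively. A sketch of a PM2 ecosystem file -- the app name, script path, and memory limit are illustrative, not prescribed:

```javascript
// ecosystem.config.js -- PM2 cluster-mode configuration sketch.
// instances: 'max' forks one worker per CPU core; exec_mode: 'cluster'
// lets the workers share the listening port.
const config = {
  apps: [{
    name: 'api',
    script: './server.js',
    instances: 'max',           // one worker per core
    exec_mode: 'cluster',       // share the port across workers
    max_memory_restart: '512M', // recycle a worker that grows past 512 MB
  }],
};

module.exports = config; // PM2 reads this export
```

Started with `pm2 start ecosystem.config.js`, this gives you the Stage 2 throughput multiplier plus zero-downtime reloads (`pm2 reload api`).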


Q7. What are sticky sessions, why are they problematic, and what is the alternative?

Why interviewers ask: Sticky sessions are a common band-aid that reveals whether a candidate understands stateless design.

Model answer:

Sticky sessions (session affinity) configure the load balancer to route all requests from a given client to the same backend server, typically via a cookie (e.g., AWSALB). They exist because the application is stateful -- it stores session data in server memory, so the client must always reach "their" server.

Problems:

  1. Uneven load distribution -- popular servers accumulate sessions while new servers sit idle.
  2. Session loss on failure -- if the server dies, all sessions pinned to it are lost. Users are logged out and lose in-progress work.
  3. Cannot scale in safely -- removing a server means terminating its users' active sessions.
  4. No spot instances -- you cannot use cheap spot instances that may be reclaimed at any time.
  5. Rolling deploys are risky -- you must drain sessions before killing old servers during deployment.

The alternative is stateless design: store sessions in Redis (shared by all servers) or use JWT tokens (session data lives in the token itself). With stateless servers, the load balancer can use any algorithm, any server can handle any request, and servers can be added, removed, or replaced at will.


Q8. How do you handle file uploads in a stateless, horizontally-scaled architecture?

Why interviewers ask: File uploads are the most commonly forgotten stateful pattern. Interviewers want to see if you think beyond session management.

Model answer:

The problem: in a stateful architecture, uploaded files are saved to the server's local disk (multer({ dest: './uploads' })). In a multi-instance deployment, a file uploaded to Server A is not available on Server B or C.

Solution 1: Upload through the server to S3. The server receives the file (in memory using multer.memoryStorage()), streams it to S3, and stores the S3 key in the database. Any server can then generate a URL to the file. This is simple but the server becomes a bottleneck for large files.

Solution 2 (preferred): Pre-signed URL for direct client-to-S3 upload. The server generates a pre-signed S3 PUT URL (valid for 5-15 minutes) and returns it to the client. The client uploads directly to S3, bypassing the server entirely. Then the client notifies the server with the S3 key. This eliminates the file from the server process completely -- it never touches the server's memory or disk.

// Generate a pre-signed upload URL (AWS SDK v3:
// @aws-sdk/client-s3 + @aws-sdk/s3-request-presigner)
const url = await getSignedUrl(s3, new PutObjectCommand({
  Bucket: 'my-bucket',
  Key: `uploads/${userId}/${fileName}`, // fileName: client-supplied name -- sanitise before use
  ContentType: 'image/jpeg',
}), { expiresIn: 300 }); // URL valid for 5 minutes

For serving files, use CloudFront in front of S3 for caching and global distribution, or generate pre-signed GET URLs for private files.


Advanced (Q9--Q11)

Q9. Design a scaling strategy for an application that receives 100 rps normally but 10,000 rps during flash sales (100x spike for 30 minutes).

Why interviewers ask: Tests your ability to handle extreme traffic variability -- a real-world challenge for e-commerce, ticket sales, and event-driven applications.

Model answer:

This is a burst scaling problem. The key insight is that auto-scaling alone is too slow -- spinning up instances takes 1-3 minutes, and the spike is immediate.

Pre-warming: Before a known flash sale, pre-scale to a higher baseline. If you normally run 5 instances, pre-scale to 50 before the sale starts. Use scheduled scaling policies (e.g., cron(45 8 * * ? *) to scale up at 08:45 UTC, 15 minutes before a 09:00 sale).

Architecture for the spike:

  1. CDN for static assets -- CloudFront serves product images, CSS, and JS from edge caches. This eliminates 80%+ of requests from hitting your servers.
  2. API Gateway with throttling -- AWS API Gateway or a custom rate limiter to shed excess load gracefully (return 429 rather than crashing).
  3. Queue-based decoupling -- checkout requests go into SQS. Backend workers process them at a sustainable rate. The user sees "Your order is being processed" instead of a timeout.
  4. Read-through cache -- Redis caches product catalog data. 95% of flash sale traffic is reads (browsing products), so cache hit rate should be >99%.
  5. Database read replicas -- scale reads across multiple replicas. Only the checkout write path hits the primary.
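
The graceful shedding in point 2 is essentially a token bucket: admit requests while tokens remain, return 429 otherwise. A sketch, with invented capacity and refill numbers and an injectable clock:

```javascript
// Token bucket: `capacity` is the burst size; the bucket refills at
// `ratePerSec` tokens per second.
function makeTokenBucket(capacity, ratePerSec, now = Date.now) {
  let tokens = capacity;
  let last = now();
  return function tryAcquire() {
    const t = now();
    tokens = Math.min(capacity, tokens + ((t - last) / 1000) * ratePerSec);
    last = t;
    if (tokens >= 1) { tokens -= 1; return true; } // admit the request
    return false;                                  // caller responds 429
  };
}
```

In an Express middleware, a `false` result would translate to `res.status(429).set('Retry-After', '1').end()` -- shedding load deliberately instead of letting the whole tier collapse.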

Auto-scaling configuration:

  • Minimum: 50 instances (pre-warmed)
  • Maximum: 200 instances
  • Scale-out: CPU > 60% for 30 seconds (aggressive)
  • Scale-in: CPU < 20% for 10 minutes (conservative, avoid flapping)
  • Use target tracking on request count per instance (e.g., target 100 rps per instance)

Post-sale: Scheduled scale-in 2 hours after the sale ends. Auto-scaling handles the gradual ramp-down.


Q10. Explain how database scaling differs from application scaling. Walk through the progression from a single database to a sharded architecture.

Why interviewers ask: Database scaling is the hardest part of system design and the most common bottleneck. This question separates candidates who have dealt with real scale.

Model answer:

Application scaling is straightforward: make the app stateless, add instances behind a load balancer. Database scaling is fundamentally harder because databases hold state and must guarantee consistency.

Stage 1: Vertical scaling (single instance). Start with a managed database (RDS PostgreSQL, MongoDB Atlas). Scale by upgrading instance size: more CPU, more RAM, more IOPS. This handles most workloads up to thousands of queries per second. Optimise queries, add indexes, use connection pooling (PgBouncer).

Stage 2: Read replicas. Most applications are read-heavy (80-95% reads). Add 1-3 read replicas. Route SELECT queries to replicas and writes to the primary. This multiplies read capacity by N. Caveat: replication lag (10-100ms) means a user may not immediately see their own write. Solution: "read your own writes" -- route reads-after-writes to the primary for the same user.
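
The "read your own writes" routing can be sketched as: remember each user's last write time and pin that user's reads to the primary until the replication lag window has safely passed. The 1-second window here is an invented assumption; in practice it would be tuned to observed lag:

```javascript
// Route reads to a replica unless the user wrote recently enough that
// replication lag could make them miss their own write.
function makeReadRouter({ lagWindowMs = 1000, now = Date.now } = {}) {
  const lastWrite = new Map(); // userId -> timestamp of last write

  return {
    recordWrite(userId) { lastWrite.set(userId, now()); },
    targetForRead(userId) {
      const t = lastWrite.get(userId);
      return t !== undefined && now() - t < lagWindowMs ? 'primary' : 'replica';
    },
  };
}
```

Only the small slice of reads that immediately follow a write hits the primary, so the replicas still absorb almost all read traffic.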

Stage 3: Caching layer. Add Redis in front of the database. Cache frequently-read data with TTL. A good cache strategy can reduce database load by 90%+ and often eliminates the need for sharding entirely.

Stage 4: Sharding (last resort). Split data across multiple database instances based on a shard key (e.g., user_id % N). Each shard holds a subset of the data. This scales writes linearly but introduces enormous complexity:

  • Cross-shard queries are expensive (must query all shards and aggregate).
  • Cross-shard transactions are extremely difficult (distributed transactions, saga pattern).
  • Re-sharding (adding more shards) requires data migration and is operationally risky.
  • JOIN operations across shards are not possible in traditional RDBMS.
  • Uneven shard sizes (hot shards) require monitoring and rebalancing.
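
The re-sharding risk in particular is easy to see in code. With modulo sharding, growing the cluster remaps most keys, which is why re-sharding means a bulk data migration (shard counts here are illustrative):

```javascript
// Modulo sharding: user_id % N picks the shard.
const shardFor = (userId, shardCount) => userId % shardCount;

// The re-sharding problem: going from `from` shards to `to` shards
// moves every key whose modulo result changes.
function movedFraction(keys, from, to) {
  const moved = keys.filter((k) => shardFor(k, from) !== shardFor(k, to));
  return moved.length / keys.length;
}
```

Growing from 4 to 5 shards relocates roughly 80% of keys under plain modulo hashing -- this is the motivation for consistent hashing schemes, which move only ~1/N of the keys instead.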

My recommendation: exhaust vertical scaling + read replicas + caching before considering sharding. Many companies serving millions of users never need to shard because caching eliminates enough database load.


Q11. You inherit a monolithic Express application that stores sessions in memory, saves files to local disk, and uses setInterval for background jobs. It currently runs on one server and needs to support 10x growth. Walk through your migration plan.

Why interviewers ask: Tests practical migration skills and the ability to prioritise work. The most valuable scaling advice is often about what order to do things in.

Model answer:

This is a stateful-to-stateless migration. I would approach it in phases, with each phase delivering measurable value before moving to the next.

Phase 1: Externalize state (1-2 weeks).

  • Sessions: Replace MemoryStore with connect-redis. Install Redis (AWS ElastiCache or a managed Redis). No application route changes needed -- just swap the session store configuration.
  • File uploads: Replace multer({ dest: './uploads' }) with S3 uploads. Either upload through the server using multer.memoryStorage() + S3 PutObject, or switch to pre-signed URLs for direct upload. Update any code that reads files from local disk to read from S3.
  • Background jobs: Replace setInterval with a Bull queue backed by Redis. This ensures the job runs exactly once across all servers (not N times for N servers).

Phase 2: Load balancer + second instance (1 week).

  • Set up an ALB (or Nginx) in front of two instances.
  • Disable sticky sessions -- verify the app works with round-robin.
  • Run integration tests that hit the LB endpoint repeatedly and confirm sessions, uploads, and background jobs all work regardless of which instance handles the request.

Phase 3: Auto-scaling (1 week).

  • Create an Auto Scaling Group with min=2, max=10.
  • Configure target tracking: maintain 70% average CPU utilization.
  • Load test with artillery or k6 to verify auto-scaling triggers correctly.
  • Set up CloudWatch alarms for response latency and error rate.

Phase 4: Database optimization (ongoing).

  • Add connection pooling (PgBouncer or application-level pooling).
  • Add a Redis caching layer for frequently-read data.
  • If reads are the bottleneck, add a read replica.
  • Monitor slow queries and add indexes.

Phase 5: Production hardening.

  • Health check endpoints (liveness + readiness).
  • Graceful shutdown handling (drain connections before exiting).
  • Distributed rate limiting (Redis-based).
  • Structured logging with correlation IDs for request tracing across instances.
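
Phase 5's graceful shutdown -- stop accepting new connections, let in-flight requests drain, then exit -- can be sketched as follows. The 10-second drain timeout is an assumption; it should be shorter than the orchestrator's own kill timeout:

```javascript
// Graceful shutdown: on SIGTERM (sent by the LB/orchestrator before
// terminating the instance), stop accepting connections, drain, then exit.
function setupGracefulShutdown(server, { timeoutMs = 10000, exit = process.exit } = {}) {
  const shutdown = () => {
    server.close(() => exit(0));                  // fires once in-flight requests finish
    setTimeout(() => exit(1), timeoutMs).unref(); // force-exit if draining stalls
  };
  process.on('SIGTERM', shutdown);
  return shutdown; // exposed so it can also be triggered directly
}
```

Combined with a readiness endpoint that starts returning 503 as soon as shutdown begins, this lets the load balancer drain traffic away before the process exits -- the mechanism behind zero-downtime rolling deploys.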

Total timeline: 4-6 weeks to go from a single stateful server to a horizontally-scaled, auto-scaling, fault-tolerant deployment. The most critical step is Phase 1 -- everything else depends on the app being stateless.


Quick-fire

| # | Question | One-line answer |
|---|----------|-----------------|
| 1 | Scale up or scale out -- which is simpler? | Scale up -- no code changes, just a bigger machine |
| 2 | What must your app be for horizontal scaling? | Stateless -- no in-memory sessions, no local files |
| 3 | ALB or NLB for an Express API? | ALB -- Layer 7, HTTP-aware, path-based routing |
| 4 | Best default LB algorithm? | Least connections -- adapts to variable request durations |
| 5 | Where should sessions go in a stateless app? | Redis (shared external store) or JWT (client-side token) |
