1. System Overview
An e-commerce platform like Amazon must handle millions of
concurrent users, serve product catalogs in milliseconds, process payments
reliably, and scale elastically with traffic spikes. The architecture is built
around microservices, asynchronous communication, aggressive caching, and
global content delivery.
Key non-functional requirements to address in interviews:
• High availability — 99.99% uptime (< 1 hour downtime/year)
• Low latency — product pages < 200ms, search < 100ms
• Scalability — handle 10x traffic spikes (sales events like Prime Day)
• Consistency — order and payment data must never be lost
• Security — PCI-DSS compliance for payments, secure auth tokens
2. Architecture Components
The system is organized into distinct horizontal layers. Each
layer has a single responsibility and communicates through well-defined
interfaces.
| Layer / Component | Role / Responsibility | Examples / Tech |
| --- | --- | --- |
| CDN | Serves static assets (images, JS, CSS) from edge nodes close to users, offloading 70–80% of requests from the origin. | AWS CloudFront, Akamai, Cloudflare |
| Load Balancer | Distributes incoming HTTP requests evenly across server instances. Performs health checks and removes unhealthy nodes. | AWS ALB, NGINX, HAProxy |
| API Gateway | Single entry point for all client requests. Handles SSL termination, rate limiting, authentication header validation, and routing. | Kong, AWS API Gateway, NGINX |
| Auth Service | Issues and validates JWT / OAuth2 tokens. Stores session tokens in Redis. Handles login, logout, and token refresh. | JWT, OAuth2, Redis sessions |
| User Service | Manages user profiles, saved addresses, preferences, and account history. | PostgreSQL, REST API |
| Product Service | Stores and serves the product catalog — titles, descriptions, images, pricing, and inventory levels. | MongoDB, Elasticsearch sync |
| Order Service | Handles the shopping cart, checkout flow, order creation, and order status tracking. | PostgreSQL, ACID transactions |
| Payment Service | Processes payments via third-party gateways. Handles retries, refunds, and transaction records. | Stripe, Razorpay, PCI-DSS |
| Message Queue | Decouples services via async events (order placed, payment completed). Ensures reliability even if consumers are down. | Apache Kafka, RabbitMQ |
| Notification Service | Consumes queue events and sends emails, SMS, and push notifications to users. | SendGrid, Twilio, Firebase FCM |
| Cache (Redis) | Stores frequently read data in memory — product pages, sessions, cart, rate-limit counters. Reduces DB load. | Redis, Memcached |
| Search (Elasticsearch) | Powers the product search box with full-text search, filters, autocomplete, and ranking. | Elasticsearch, OpenSearch |
| Primary Database | Source of truth for transactional data — users, orders, payments. Requires ACID guarantees. | PostgreSQL, MySQL |
| Read Replica | Copy of the primary DB that handles read-heavy queries — browsing, reports, analytics — without impacting writes. | PostgreSQL Streaming Replication |
| Object Storage | Stores binary files — product images, receipts, exports. Highly durable and cheap at scale. | AWS S3, Google Cloud Storage |
| Monitoring | Collects metrics, logs, and distributed traces. Fires alerts on anomalies. | Prometheus, Grafana, Datadog |
3. Request Flow Walkthrough
Understanding how a user request travels through the system
end-to-end is a common interview expectation. Here is the full lifecycle of a
product page load:
1. The user's browser requests the product page URL.
2. The CDN intercepts the request. If a cached version of the static assets exists at the edge node, it returns them instantly.
3. The dynamic API request reaches the Load Balancer, which routes it to a healthy API server instance.
4. The API Gateway validates the JWT auth token with the Auth Service and checks rate limits.
5. The request is routed to the Product Service.
6. The Product Service checks the Redis cache first. On a cache hit, data is returned in < 2ms. On a cache miss, it queries MongoDB and populates the cache.
7. The response is sent back through the gateway to the client.
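Step 6's cache-aside read can be sketched in a few lines of Python. This is a minimal in-process stand-in — the `cache` and `db` dicts and the `get_product` helper are illustrative names standing in for Redis and MongoDB, not the platform's actual API:

```python
import time

# Illustrative stand-ins for Redis and the product database.
cache = {}          # key -> (value, expires_at)
db = {"sku-42": {"title": "Wireless Mouse", "price": 29.99}}
TTL_SECONDS = 1800  # 30-minute TTL for product data

def get_product(product_id):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    entry = cache.get(product_id)
    if entry is not None and entry[1] > time.time():
        return entry[0]                     # cache hit: served from memory
    product = db[product_id]                # cache miss: query the database
    cache[product_id] = (product, time.time() + TTL_SECONDS)
    return product

first = get_product("sku-42")   # miss -> DB read, cache populated
second = get_product("sku-42")  # hit  -> served from cache
```

The same pattern applies at every cache tier in the table in section 5; only the key scheme and TTL change.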
4. Authentication & Authorization
JWT Token Flow
• The user submits credentials (email + password) to the Auth Service via the API Gateway.
• The Auth Service validates the credentials against the User DB, then issues a signed JWT (JSON Web Token).
• The JWT contains the user ID, role, and expiry timestamp. It is signed with a secret key (HS256 or RS256).
• The client stores the JWT in memory or an HttpOnly cookie (never localStorage, for security).
• Every subsequent request includes the JWT in the Authorization: Bearer <token> header.
• The API Gateway or Auth Service validates the signature and expiry on each request — no DB lookup needed.
• Refresh tokens (long-lived) are stored in Redis and used to issue new JWTs when the short-lived access tokens expire.
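The issue/verify cycle above can be sketched with the standard library alone. This is a teaching sketch, not a production implementation — a real service would use a vetted JWT library, and `SECRET` here is an illustrative value:

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-secret"  # illustrative; in production, load from a secrets manager

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def issue_jwt(user_id: str, role: str, ttl: int = 900) -> str:
    """Build header.payload.signature, HMAC-SHA256 signed (HS256)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(
        {"sub": user_id, "role": role, "exp": int(time.time()) + ttl}).encode())
    sig = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                          hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_jwt(token: str) -> dict:
    """Check the signature and expiry — no database lookup required."""
    header, payload, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                               hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")   # tampered token
    claims = json.loads(b64url_decode(payload))
    if claims["exp"] < time.time():
        raise ValueError("token expired")   # enforce the short TTL
    return claims

claims = verify_jwt(issue_jwt("user-123", "customer"))
```

Note that the payload is only encoded, not encrypted — anyone can read it, so it must never carry secrets.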
OAuth2 / Social Login
• The user clicks 'Login with Google'. The app redirects to Google's OAuth2 authorization server.
• Google returns an authorization code to your callback URL.
• Your Auth Service exchanges the code for an access token and fetches the user's profile.
• A local JWT is issued and the session proceeds as normal.
5. Caching Strategy
Redis is the primary cache. The key design decision is what to
cache and for how long (TTL — Time To Live).
| Cached Data | Why Cache It | TTL |
| --- | --- | --- |
| Product pages | Read millions of times, updated rarely. Cache the full serialized response. | 10–30 minutes |
| User sessions / JWT | Avoid DB lookups on every request. Redis lookup is < 1ms. | Matches token expiry |
| Shopping cart | Needs fast read/write. Redis Hash per user ID. | 7 days or session |
| Rate limit counters | Count requests per IP per window. Redis INCR + EXPIRE. | 1–60 seconds |
| Search results | Cache popular search queries — e.g. 'iPhone 15'. | 5 minutes |
| Homepage banners | Same for all users. Cache aggressively. | 1–6 hours |
Cache Invalidation
• Write-through: update the DB and cache together on every write.
• TTL expiry: let stale data expire naturally for non-critical content.
• Event-driven: when a product is updated, publish an event that invalidates the cache key.
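The event-driven strategy can be sketched as a tiny in-process pub/sub. The `publish`/`subscribe` helpers and the cache key scheme are illustrative stand-ins for a real message broker and Redis:

```python
# In-process stand-ins for Redis and an event bus.
cache = {"product:42": {"title": "Mouse", "price": 29.99}}
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event):
    for handler in subscribers:
        handler(event)

# The cache layer subscribes to product-update events...
def invalidate_on_update(event):
    if event["event"] == "product.updated":
        cache.pop(f"product:{event['productId']}", None)  # delete stale key

subscribe(invalidate_on_update)

# ...so a write in the Product Service evicts the stale entry immediately,
# instead of waiting for the TTL to lapse.
publish({"event": "product.updated", "productId": 42})
```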
6. Message Queue & Async Processing
Synchronous processing of every side-effect (sending email,
updating inventory, logging analytics) on the same thread as the order request
would make checkout slow and brittle. Instead, services publish events to a
queue and workers consume them independently.
Order Placed Event Flow
• The Order Service publishes: { event: 'order.placed', orderId, userId, items, total }
• The Notification Service consumes it → sends the order confirmation email
• The Inventory Worker consumes it → decrements the stock count
• The Analytics Worker consumes it → logs the sale event
• The Payment Service publishes: { event: 'payment.completed', orderId }
• The Order Service consumes it → marks the order as confirmed
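The fan-out above can be sketched with an in-process queue. The `topic` deque and the worker bodies are illustrative stand-ins for a Kafka topic and its consumer groups:

```python
from collections import deque

# Stand-in for a Kafka topic: events persist until consumers drain them.
topic = deque()
stock = {"sku-1": 5}
emails_sent, sales_logged = [], []

def publish(event):
    topic.append(event)          # the request path returns immediately

def run_consumers():
    """Each event fans out to every interested worker, as in the flow above."""
    while topic:
        event = topic.popleft()
        if event["event"] == "order.placed":
            emails_sent.append(event["orderId"])     # Notification Service
            for sku, qty in event["items"]:          # Inventory Worker
                stock[sku] -= qty
            sales_logged.append(event["total"])      # Analytics Worker

publish({"event": "order.placed", "orderId": "o-1",
         "userId": "u-7", "items": [("sku-1", 2)], "total": 59.98})
run_consumers()   # workers run independently of the checkout request
```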
Why Kafka over RabbitMQ?
• Kafka retains messages for days — consumers can replay events if a worker crashes.
• Kafka scales to millions of messages/second with partitioning.
• RabbitMQ is simpler and better for low-volume, complex routing patterns.
7. Database Design
SQL vs NoSQL Choice
Use the right tool for each data type:
| Datastore | Why It Fits | Used For |
| --- | --- | --- |
| PostgreSQL (SQL) | Users, orders, payments — structured, relational, requires ACID transactions. | Orders, Payments, Users |
| MongoDB (NoSQL) | Product catalog — each product type has different attributes. Flexible JSON schema. | Product Catalog |
| Redis (in-memory) | Sessions, cache, counters — microsecond reads, no persistence needed. | Sessions, Cart, Cache |
| Elasticsearch | Full-text search, relevance ranking, filters — SQL LIKE queries are too slow. | Product Search |
| S3 / Object Storage | Binary blobs — images, PDFs, exports. Cheap, durable, not relational. | Images, Files |
Scaling the Primary Database
• Vertical scaling: upgrade server RAM/CPU first (simplest).
• Read replicas: add one or more replicas for read traffic. Write only to the primary.
• Connection pooling: use PgBouncer to pool thousands of app connections into fewer DB connections.
• Sharding: partition data across multiple DB servers by user ID or region (complex, a last resort).
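Hash-based shard routing — the core of the sharding step — can be sketched as follows. The shard names and modulo scheme are illustrative; production systems often prefer consistent hashing so that adding a shard does not remap most keys:

```python
import hashlib

SHARDS = ["orders-db-0", "orders-db-1", "orders-db-2", "orders-db-3"]

def shard_for(user_id: str) -> str:
    """Route a user's rows to one shard by hashing the shard key.

    A stable hash (not Python's per-process-randomized hash()) keeps
    routing consistent across servers and restarts."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same user always lands on the same shard, so all of their
# orders can be read and written without cross-shard queries.
assert shard_for("user-123") == shard_for("user-123")
```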
8. Interview Questions & Answers
The following questions are commonly asked in system design
interviews for senior and mid-level engineering roles. Each answer provides a
direct, interview-ready response.
Architecture & Basics
Q: Why use microservices instead of a monolith for an e-commerce app?
A: Microservices allow independent deployment, scaling, and failure isolation. If the Payment Service goes down, users can still browse products. Teams can work in parallel on separate services. The downside is operational complexity — more services to monitor and deploy.
• Independent scaling: the Product Service can scale 10x without scaling payments.
• Fault isolation: one service failing doesn't cascade.
• Tech flexibility: use Python for ML, Java for orders, Node for notifications.
Q: What is an API Gateway and why is it needed?
A: An API Gateway is the single front door for all client requests. Without it, clients would need to know the address of every microservice. It centralizes cross-cutting concerns so individual services don't have to implement them.
• SSL termination — decrypt HTTPS once at the gateway.
• Rate limiting — block abusive clients before they reach services.
• Authentication — validate JWT tokens centrally.
• Request routing — forward /products to the Product Service, /orders to the Order Service.
• Request/response transformation — add headers, reshape payloads.
Caching & Performance
Q: What would you cache in an e-commerce system, and why?
A: Cache data that is read far more often than it is written, and where slightly stale data is acceptable. The goal is to serve the majority of requests from memory without touching the database.
• Product details — millions of reads, few writes.
• User sessions and JWT tokens — avoid a DB hit on every request.
• Homepage and category pages — same for all users, very high traffic.
• Search results for popular queries.
• Shopping cart data — fast reads and writes, low latency needed.
Q: How do you handle cache invalidation?
A: Cache invalidation is one of the hardest problems. The main strategies are: TTL-based expiry (simplest, may serve slightly stale data), write-through invalidation (update cache and DB together on writes), and event-driven invalidation (a product update event triggers deletion of the cache key). For e-commerce, TTL works well for product pages (10–30 min) because minor staleness is acceptable. For cart and session data, write-through is used to keep data consistent.
Authentication & Security
Q: How does JWT authentication work? What are its advantages?
A: When a user logs in, the Auth Service creates a JWT — a base64url-encoded JSON payload containing the user ID, role, and expiry — signed with a secret key. On every subsequent request, the client sends this token in the Authorization header. The server verifies the signature without making a DB call, which makes JWT stateless and horizontally scalable.
• Stateless — no server-side session store needed.
• Scalable — any server can verify the token with the shared secret.
• Self-contained — carries user metadata (role, ID) without a DB lookup.
• Downside: tokens can't be invalidated before expiry — use short TTLs (15 min) and refresh tokens stored in Redis.
Q: How would you prevent someone from buying 1000 units instantly (bot protection)?
A: Multiple layers of defense are needed:
• Rate limiting at the API Gateway — limit requests per IP per second.
• CAPTCHA on checkout for suspicious patterns.
• Redis-based inventory locking — use atomic DECR to prevent overselling.
• Purchase limits per account, enforced at the Order Service.
• Bot detection via browser fingerprinting and behavior analysis.
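The first layer — fixed-window rate limiting in the Redis INCR + EXPIRE style — can be sketched with a plain dict standing in for Redis (the limits and key scheme are illustrative):

```python
import time

WINDOW_SECONDS = 1
MAX_REQUESTS = 5
counters = {}   # (ip, window) -> request count; Redis would expire old keys

def allow_request(ip, now=None):
    """Count requests per IP per window; reject once the limit is hit."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_SECONDS)    # all requests in the same second
    key = (ip, window)                     # share one counter (Redis: INCR)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= MAX_REQUESTS

results = [allow_request("203.0.113.9", now=1000.0) for _ in range(7)]
# First 5 requests in the window pass; the 6th and 7th are rejected.
```

A fixed window allows brief bursts at window boundaries; sliding-window or token-bucket variants smooth this out at the cost of slightly more state.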
Scalability & Reliability
Q: How would you design the system to handle a 10x traffic spike during a sale?
A: Auto-scaling is the key pattern. The approach is to make the system stateless so any instance can handle any request, then scale horizontally on demand.
• The CDN absorbs 70–80% of traffic (static assets, cached pages).
• The load balancer auto-scales EC2/container instances when CPU > 70%.
• Read replicas handle the surge in browse/search traffic.
• The Redis cache absorbs product page reads — the DB sees minimal load.
• The message queue buffers order processing — no requests are dropped, just delayed.
• Pre-warm caches before the sale event starts.
Q: What happens if the Payment Service goes down during checkout?
A: The order is not lost. The system uses reliable messaging and idempotency to handle failures gracefully.
• The order is saved in PENDING state in the Order DB before payment is attempted.
• The payment request is published to Kafka — if the service is down, Kafka retains the message.
• When the Payment Service recovers, it consumes the message and processes the payment.
• An idempotency key (the unique order ID) prevents double charging on retries.
• The user sees an 'order processing' status and is notified by email when the payment completes.
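The idempotency-key step can be sketched as follows. The `processed` set stands in for a persistent store of handled order IDs (in practice a DB unique constraint or Redis SETNX); names are illustrative:

```python
# Idempotency sketch: the order ID acts as an idempotency key so that a
# redelivered payment message (e.g. after a consumer crash) charges at
# most once.
processed = set()
charges = []

def process_payment(event):
    if event["orderId"] in processed:   # duplicate delivery: silently skip
        return
    processed.add(event["orderId"])     # record BEFORE side effects
    charges.append(event["amount"])     # charge the card exactly once

msg = {"orderId": "o-42", "amount": 19.99}
process_payment(msg)
process_payment(msg)   # Kafka redelivery after a consumer restart
```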
Q: How do you prevent overselling inventory when two users buy the last item simultaneously?
A: This is a classic race condition. Solutions from simplest to most scalable:
• Database-level atomic update: UPDATE inventory SET stock = stock - 1 WHERE stock > 0 AND product_id = X. Only one transaction succeeds.
• Redis atomic decrement: DECR product:stock:X returns the new value — if it is negative, reject the purchase and roll back the decrement.
• Distributed locking with Redis SETNX: acquire a lock on the product ID before updating stock.
• The Saga pattern for distributed transactions across the Order and Inventory services.
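The first option — the atomic conditional UPDATE — can be demonstrated with SQLite (the table and product names are illustrative; the same statement works in PostgreSQL or MySQL):

```python
import sqlite3

# The WHERE clause makes the database arbitrate the race: with one unit
# left, only one of two concurrent buyers can match "stock > 0".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('sku-9', 1)")

def try_buy(product_id):
    cur = conn.execute(
        "UPDATE inventory SET stock = stock - 1 "
        "WHERE product_id = ? AND stock > 0", (product_id,))
    conn.commit()
    return cur.rowcount == 1   # 1 row updated = the purchase succeeded

first, second = try_buy("sku-9"), try_buy("sku-9")
# first buyer wins; the second UPDATE matches no rows and fails cleanly.
```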
Data & Search
Q: Why use Elasticsearch for search instead of a SQL LIKE query?
A: SQL LIKE '%query%' requires a full table scan — O(n), since a leading wildcard defeats ordinary indexes. It doesn't handle typos, synonyms, relevance ranking, or faceted filters. Elasticsearch is purpose-built for text search.
• An inverted index enables near-constant-time lookups for any term.
• Typo tolerance with fuzzy matching (e.g. 'iphne' finds 'iPhone').
• Relevance scoring — the most relevant results appear first.
• Faceted filters — filter by price, brand, and rating simultaneously.
• Autocomplete — suggest queries as the user types.
• Near real-time — product updates sync to Elasticsearch within seconds via change data capture.
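The inverted index at the heart of this can be sketched in a few lines over a toy catalog (real engines layer tokenization, stemming, and relevance scoring on top of this structure):

```python
from collections import defaultdict

# Toy catalog: product ID -> title.
products = {
    1: "apple iphone 15 pro",
    2: "iphone 15 case",
    3: "samsung galaxy s24",
}

# Inverted index: map each term to the set of documents containing it.
index = defaultdict(set)
for pid, title in products.items():
    for term in title.split():
        index[term].add(pid)

# A term lookup is a single hash access, independent of catalog size —
# unlike LIKE '%iphone%', which must scan every row.
hits = index["iphone"]
```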
Q: How would you design the database schema for orders?
A: Orders require strong consistency and ACID transactions. Use a relational DB with normalized tables.
• orders table: order_id (PK), user_id (FK), status, total_amount, created_at.
• order_items table: item_id (PK), order_id (FK), product_id, quantity, unit_price.
• payments table: payment_id (PK), order_id (FK), amount, gateway, status, transaction_ref.
• Indexes on user_id (for order history), status (for admin queues), and created_at (for reporting).
• Use foreign key constraints to ensure referential integrity.
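The schema above can be sketched in SQL (SQLite syntax for brevity; a production deployment would use PostgreSQL types such as NUMERIC for money and TIMESTAMPTZ for dates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce referential integrity
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    status       TEXT NOT NULL DEFAULT 'PENDING',
    total_amount REAL NOT NULL,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE order_items (
    item_id    INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    unit_price REAL NOT NULL
);
CREATE TABLE payments (
    payment_id      INTEGER PRIMARY KEY,
    order_id        INTEGER NOT NULL REFERENCES orders(order_id),
    amount          REAL NOT NULL,
    gateway         TEXT NOT NULL,
    status          TEXT NOT NULL,
    transaction_ref TEXT
);
-- Indexes matching the access patterns listed above.
CREATE INDEX idx_orders_user    ON orders(user_id);
CREATE INDEX idx_orders_status  ON orders(status);
CREATE INDEX idx_orders_created ON orders(created_at);
""")

# New orders start PENDING, matching the payment-failure flow above.
conn.execute("INSERT INTO orders (user_id, total_amount) VALUES (7, 59.98)")
conn.execute(
    "INSERT INTO order_items (order_id, product_id, quantity, unit_price) "
    "VALUES (1, 101, 2, 29.99)")
conn.commit()
```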
Message Queues & Async
Q: Why use a message queue instead of direct service-to-service calls?
A: Direct synchronous calls create tight coupling. If the Notification Service is slow or down, the Order Service blocks and the user's checkout slows down. A queue decouples the producer from the consumer.
• Reliability: messages are durably stored — if a consumer crashes, messages are not lost.
• Backpressure: consumers process at their own pace without slowing producers.
• Retry logic: failed message processing is retried automatically.
• Fan-out: one order event can trigger multiple consumers (email, inventory, analytics) in parallel.
Monitoring & Observability
Q: How would you monitor this system in production?
A: Observability is built on three pillars: metrics, logs, and traces.
• Metrics (Prometheus + Grafana): request rate, error rate, latency (P50/P95/P99), cache hit rate, queue lag.
• Logs (ELK Stack): structured JSON logs from each service, searchable by order ID, user ID, and trace ID.
• Distributed tracing (Jaeger / Datadog APM): trace a single request across all microservices to pinpoint bottlenecks.
• Alerting: PagerDuty alerts when the error rate > 1%, P99 latency > 2s, or payment failures spike.
• Dashboards: separate dashboards per service and a global health dashboard.
