Sunday, 12 October 2025

High-Level Design (HLD): How to Design an E-Commerce Application

 

1. System Overview

An e-commerce platform like Amazon must handle millions of concurrent users, serve product catalogs in milliseconds, process payments reliably, and scale elastically with traffic spikes. The architecture is built around microservices, asynchronous communication, aggressive caching, and global content delivery.

 

Key non-functional requirements to address in interviews:

        High availability — 99.99% uptime (< 1 hour downtime/year)

        Low latency — product pages < 200ms, search < 100ms

        Scalability — handle 10x traffic spikes (sales events like Prime Day)

        Consistency — order and payment data must never be lost

        Security — PCI-DSS compliance for payments, secure auth tokens



 

2. Architecture Components

The system is organized into distinct horizontal layers. Each layer has a single responsibility and communicates through well-defined interfaces.

 

Layer / Component

Role / Responsibility

Examples / Tech

CDN

Serves static assets (images, JS, CSS) from edge nodes close to users, offloading 70–80% of requests from the origin.

AWS CloudFront, Akamai, Cloudflare

Load Balancer

Distributes incoming HTTP requests evenly across server instances. Performs health checks and removes unhealthy nodes.

AWS ALB, NGINX, HAProxy

API Gateway

Single entry point for all client requests. Handles SSL termination, rate limiting, authentication header validation, and routing.

Kong, AWS API Gateway, NGINX

Auth Service

Issues and validates JWT / OAuth2 tokens. Stores session tokens in Redis. Handles login, logout, token refresh.

JWT, OAuth2, Redis sessions

User Service

Manages user profiles, saved addresses, preferences, and account history.

PostgreSQL, REST API

Product Service

Stores and serves the product catalog — titles, descriptions, images, pricing, and inventory levels.

MongoDB, Elasticsearch sync

Order Service

Handles shopping cart, checkout flow, order creation, and order status tracking.

PostgreSQL, ACID transactions

Payment Service

Processes payments via third-party gateways. Handles retries, refunds, and transaction records.

Stripe, Razorpay, PCI-DSS

Message Queue

Decouples services via async events (order placed, payment completed). Ensures reliability even if consumers are down.

Apache Kafka, RabbitMQ

Notification Service

Consumes queue events and sends emails, SMS, and push notifications to users.

SendGrid, Twilio, Firebase FCM

Cache (Redis)

Stores frequently read data in memory — product pages, sessions, cart, rate limit counters. Reduces DB load.

Redis, Memcached

Search (Elasticsearch)

Powers the product search box with full-text search, filters, autocomplete, and ranking.

Elasticsearch, OpenSearch

Primary Database

Source of truth for transactional data — users, orders, payments. Requires ACID guarantees.

PostgreSQL, MySQL

Read Replica

Copy of primary DB that handles read-heavy queries — browsing, reports, analytics — without impacting writes.

PostgreSQL Streaming Replication

Object Storage

Stores binary files — product images, receipts, exports. Highly durable and cheap at scale.

AWS S3, Google Cloud Storage

Monitoring

Collects metrics, logs, and distributed traces. Fires alerts on anomalies.

Prometheus, Grafana, Datadog

 

3. Request Flow Walkthrough

Understanding how a user request travels through the system end-to-end is a common interview expectation. Here is the full lifecycle of a product page load:

 

1.     User's browser requests the product page URL.

2.     CDN intercepts the request. If the static assets are cached at the edge node, they are returned instantly without touching the origin.

3.     Dynamic API request reaches the Load Balancer, which routes to a healthy API server instance.

4.     API Gateway validates the JWT auth token with the Auth Service and checks rate limits.

5.     Request is routed to the Product Service.

6.     Product Service checks Redis cache first. On a cache hit, data is returned in < 2ms. On a cache miss, it queries MongoDB and populates the cache.

7.     Response is sent back through the gateway to the client.
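Step 6 is the classic cache-aside pattern. The sketch below illustrates it with plain Python dicts standing in for Redis and MongoDB (the `cache`, `product_db`, and `get_product` names are illustrative, not part of any real API):

```python
import time

# In-memory stand-ins for Redis and MongoDB (illustrative only).
cache = {}                                    # key -> (value, expires_at)
product_db = {"sku-123": {"title": "Wireless Mouse", "price": 24.99}}

def get_product(product_id, ttl=1800):
    """Cache-aside read: check the cache first; on a miss, read the DB and populate."""
    key = f"product:{product_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                       # cache hit -> served from memory
    product = product_db[product_id]          # cache miss -> database read
    cache[key] = (product, time.time() + ttl) # populate with a TTL
    return product
```

The first call for a product pays the DB cost; every call within the TTL window is served from memory.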

 

4. Authentication & Authorization

JWT Token Flow

        User submits credentials (email + password) to the Auth Service via API Gateway.

        Auth Service validates credentials against the User DB, then issues a signed JWT (JSON Web Token).

        JWT contains: user ID, role, expiry timestamp. It is signed with a secret key (HS256 or RS256).

        Client stores the JWT in memory or an HttpOnly cookie (never localStorage for security).

        Every subsequent request includes the JWT in the Authorization: Bearer <token> header.

        API Gateway or Auth Service validates the signature and expiry on each request — no DB lookup needed.

        Refresh tokens (long-lived) are stored in Redis and used to issue new JWTs when they expire.
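The sign-and-verify cycle above can be sketched with the standard library alone. This is a minimal HS256 illustration, not a production implementation (real systems should use a vetted library such as PyJWT; the `SECRET` value here is a placeholder):

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-secret"  # placeholder; real deployments load this from a secrets manager

def _b64url(data: bytes) -> str:
    # JWT uses unpadded Base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_jwt(user_id: str, role: str, ttl: int = 900) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(
        {"sub": user_id, "role": role, "exp": int(time.time()) + ttl}).encode())
    signature = _b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                                 hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"

def verify_jwt(token: str) -> dict:
    """Recompute the signature and check expiry -- no DB lookup needed."""
    header, payload, signature = token.split(".")
    expected = _b64url(hmac.new(SECRET, f"{header}.{payload}".encode(),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Because verification only recomputes an HMAC, any server holding the shared secret can validate requests, which is exactly what makes JWT horizontally scalable.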

 

OAuth2 / Social Login

        User clicks 'Login with Google'. App redirects to Google's OAuth2 authorization server.

        Google returns an authorization code to your callback URL.

        Your Auth Service exchanges the code for an access token and fetches the user's profile.

        A local JWT is issued and the session proceeds as normal.

 

5. Caching Strategy

Redis is the primary cache. The key design decision is what to cache and for how long (TTL — Time To Live).

 

What to Cache

Why It Belongs in Redis

TTL

Product pages

Read millions of times, updated rarely. Cache the full serialized response.

TTL: 10–30 minutes

User sessions / JWT

Avoid DB lookups on every request. Redis lookup is < 1ms.

TTL: matches token expiry

Shopping cart

Needs fast read/write. Redis Hash per user ID.

TTL: 7 days or session

Rate limit counters

Count requests per IP per window. Redis INCR + EXPIRE.

TTL: 1–60 seconds

Search results

Cache popular search queries — e.g. 'iPhone 15'.

TTL: 5 minutes

Homepage banners

Same for all users. Cache aggressively.

TTL: 1–6 hours

 

Cache Invalidation

        Write-through: update DB and cache together on every write.

        TTL expiry: let stale data expire naturally for non-critical content.

        Event-driven: when a product is updated, publish an event that invalidates the cache key.
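A minimal sketch of the write-through and event-driven strategies, using a plain dict as a stand-in for Redis (TTL expiry is simply a matter of setting keys with an EXPIRE, as in the cache-aside example earlier; all names here are hypothetical):

```python
# In-memory stand-ins for Redis and the primary DB (illustrative only).
cache = {}
db = {"sku-1": {"price": 10}}

def write_through(product_id, data):
    """Write-through: update the DB and the cache together on every write."""
    db[product_id] = data
    cache[f"product:{product_id}"] = data

def on_product_updated(event):
    """Event-driven: a product-update event deletes the stale cache key,
    so the next read repopulates it from the DB."""
    cache.pop(f"product:{event['product_id']}", None)
```

Write-through keeps the cache always fresh at the cost of extra work per write; event-driven invalidation lets many services share one cache without coordinating writes.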

 

6. Message Queue & Async Processing

Synchronous processing of every side-effect (sending email, updating inventory, logging analytics) on the same thread as the order request would make checkout slow and brittle. Instead, services publish events to a queue and workers consume them independently.

 

Order Placed Event Flow

        Order Service publishes: { event: 'order.placed', orderId, userId, items, total }

        Notification Service consumes it → sends order confirmation email

        Inventory Worker consumes it → decrements stock count

        Analytics Worker consumes it → logs the sale event

        Payment Service publishes: { event: 'payment.completed', orderId }

        Order Service consumes it → marks order as confirmed
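The fan-out above can be sketched with a tiny in-process pub/sub standing in for Kafka topics (the `subscribe`/`publish` helpers and handler names are illustrative, not a real Kafka API):

```python
from collections import defaultdict

# Minimal in-process pub/sub: topic -> list of consumer callbacks.
subscribers = defaultdict(list)
stock = {"sku-1": 5}
sent = []

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    # Every consumer of the topic receives the same event (fan-out).
    for handler in subscribers[topic]:
        handler(event)

def send_confirmation(event):      # Notification Service
    sent.append(("email", event["orderId"]))

def decrement_stock(event):        # Inventory Worker
    for item in event["items"]:
        stock[item] -= 1

subscribe("order.placed", send_confirmation)
subscribe("order.placed", decrement_stock)
publish("order.placed", {"orderId": "o1", "items": ["sku-1"]})
```

With a real broker the consumers would run in separate processes and the broker would retain the event until each consumer group acknowledges it.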

 

Why Kafka over RabbitMQ?

        Kafka retains messages for days — consumers can replay events if a worker crashes.

        Kafka scales to millions of messages/second with partitioning.

        RabbitMQ is simpler and better for low-volume, complex routing patterns.

 

7. Database Design

SQL vs NoSQL Choice

Use the right tool for each data type:

 

Data Store

Use Case

Stores

PostgreSQL (SQL)

Users, orders, payments — structured, relational, requires ACID transactions.

Orders, Payments, Users

MongoDB (NoSQL)

Product catalog — each product type has different attributes. Flexible JSON schema.

Product Catalog

Redis (In-memory)

Sessions, cache, counters — microsecond reads, no persistence needed.

Sessions, Cart, Cache

Elasticsearch

Full-text search, relevance ranking, filters — SQL LIKE queries are too slow.

Product Search

S3 / Object Storage

Binary blobs — images, PDFs, exports. Cheap, durable, not relational.

Images, Files

 

Scaling the Primary Database

        Vertical scaling: upgrade server RAM/CPU first (simplest).

        Read replicas: add one or more replicas for read traffic. Write only to primary.

        Connection pooling: use PgBouncer to pool thousands of app connections into fewer DB connections.

        Sharding: partition data across multiple DB servers by user ID or region (complex, last resort).
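Sharding by user ID usually means hashing the key to pick a shard. A toy routing function, assuming four hypothetical shard names (real systems tend to use consistent hashing so that adding a shard does not remap every key):

```python
import hashlib

# Hypothetical shard names; in practice these would be connection strings.
NUM_SHARDS = 4
SHARDS = [f"orders-db-{i}" for i in range(NUM_SHARDS)]

def shard_for(user_id: str) -> str:
    """Deterministically route a user's rows to one shard.
    md5 (not Python's hash()) so the mapping is stable across processes."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % NUM_SHARDS]
```

All of a user's orders land on the same shard, so order-history queries never need to fan out across servers.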

 

 

8. Interview Questions & Answers

The following questions are commonly asked in system design interviews for senior and mid-level engineering roles. Each answer provides a direct, interview-ready response.

 

Architecture & Basics

Q: Why use microservices instead of a monolith for an e-commerce app?

A: Microservices allow independent deployment, scaling, and failure isolation. If the Payment Service goes down, users can still browse products. Teams can work in parallel on separate services. The downside is operational complexity — more services to monitor and deploy.

        Independent scaling: product service can scale 10x without scaling payments.

        Fault isolation: one service failing doesn't cascade.

        Tech flexibility: use Python for ML, Java for orders, Node for notifications.

 

Q: What is an API Gateway and why is it needed?

A: An API Gateway is the single front door for all client requests. Without it, clients would need to know the address of every microservice. It centralizes cross-cutting concerns so individual services don't have to implement them.

        SSL termination — decrypt HTTPS once at the gateway.

        Rate limiting — block abusive clients before they reach services.

        Authentication — validate JWT tokens centrally.

        Request routing — forward /products to Product Service, /orders to Order Service.

        Request/response transformation — add headers, reshape payloads.

 

Caching & Performance

Q: What would you cache in an e-commerce system, and why?

A: Cache data that is read far more often than it is written, and where slightly stale data is acceptable. The goal is to serve the majority of requests from memory without touching the database.

        Product details — millions of reads, few writes.

        User sessions and JWT tokens — avoid DB on every request.

        Homepage and category pages — same for all users, very high traffic.

        Search results for popular queries.

        Shopping cart data — fast reads and writes, low latency needed.

 

Q: How do you handle cache invalidation?

A: Cache invalidation is one of the hardest problems. The main strategies are: TTL-based expiry (simplest, may serve slightly stale data), write-through invalidation (update cache and DB together on writes), and event-driven invalidation (a product update event triggers deletion of the cache key). For e-commerce, TTL works well for product pages (10-30 min) because minor staleness is acceptable. For cart and session data, write-through is used to keep data consistent.

 

Authentication & Security

Q: How does JWT authentication work? What are its advantages?

A: When a user logs in, the Auth Service creates a JWT — a Base64url-encoded header and JSON payload containing the user ID, role, and expiry — signed with a secret key. On every subsequent request, the client sends this token in the Authorization header. The server verifies the signature without making a DB call, which makes JWT auth stateless and horizontally scalable.

        Stateless — no server-side session store needed.

        Scalable — any server can verify the token with the shared secret.

        Self-contained — carries user metadata (role, ID) without a DB lookup.

        Downside: tokens can't be invalidated before expiry — use short TTLs (15 min) and refresh tokens stored in Redis.

 

Q: How would you prevent someone from buying 1000 units instantly (bot protection)?

A: Multiple layers of defense are needed:

        Rate limiting at the API Gateway — limit requests per IP per second.

        CAPTCHA on checkout for suspicious patterns.

        Redis-based inventory locking — use atomic DECR to prevent overselling.

        Purchase limits per account enforced at the Order Service.

        Bot detection via browser fingerprinting and behavior analysis.
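The first layer, the Redis INCR + EXPIRE fixed-window counter, fits in a few lines. Here is an in-memory sketch of the pattern (a dict stands in for Redis; in Redis itself the INCR would be atomic across servers):

```python
import time

# (client, window_start) -> request count; a stand-in for Redis keys.
counters = {}

def allow_request(client_ip, limit=10, window=1):
    """Fixed-window rate limit: at most `limit` requests per `window` seconds.
    In Redis: INCR rate:{ip}:{window_start}; EXPIRE it after `window` seconds."""
    window_start = int(time.time() // window)
    key = (client_ip, window_start)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit
```

Requests past the limit in the current window are rejected; when the window rolls over, the counter starts fresh (Redis's EXPIRE handles the cleanup).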

 

Scalability & Reliability

Q: How would you design the system to handle a 10x traffic spike during a sale?

A: Auto-scaling is the key pattern. The approach is to make the system stateless so any instance can handle any request, then scale horizontally on demand.

        CDN absorbs 70-80% of traffic (static assets, cached pages).

        Load balancer auto-scales EC2/container instances when CPU > 70%.

        Read replicas handle the surge in browse/search traffic.

        Redis cache absorbs product page reads — DB sees minimal load.

        Message queue buffers order processing — no requests are dropped, just delayed.

        Pre-warm caches before the sale event starts.

 

Q: What happens if the Payment Service goes down during checkout?

A: The order is not lost. The system uses reliable messaging and idempotency to handle failures gracefully.

        Order is saved in PENDING state in the Order DB before payment is attempted.

        Payment request is published to Kafka — if the service is down, Kafka retains the message.

        When Payment Service recovers, it consumes the message and processes the payment.

        Idempotency key (unique order ID) prevents double charging on retries.

        User sees an 'order processing' status and is notified by email when payment completes.
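The idempotency-key step is worth sketching, since retried Kafka messages make duplicate delivery a certainty, not an edge case. A toy version, with a dict standing in for what would really be a DB table with a unique constraint on the order ID:

```python
# order_id -> charge result; in production, a payments table with a
# UNIQUE constraint on order_id plays this role.
processed = {}

def charge(order_id, amount):
    """Process a payment at most once per order, no matter how many retries arrive."""
    if order_id in processed:          # duplicate/retried message: return prior result
        return processed[order_id]
    result = {"order_id": order_id, "amount": amount, "status": "captured"}
    processed[order_id] = result       # record result before acking the message
    return result
```

A retried message finds the order already recorded and returns the original result instead of charging the card again.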

 

Q: How do you prevent overselling inventory when two users buy the last item simultaneously?

A: This is a classic race condition. Solutions from simplest to most scalable:

        Database-level atomic update: UPDATE inventory SET stock = stock - 1 WHERE stock > 0 AND product_id = X. Only one transaction succeeds.

        Redis atomic decrement: DECR product:stock:X returns the new value — if negative, reject the purchase and rollback.

        Distributed locking with Redis SETNX: acquire a lock on the product ID before updating stock.

        Saga pattern for distributed transactions across Order and Inventory services.
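Option 1, the conditional UPDATE, can be demonstrated end to end with SQLite (used here purely as a stand-in for PostgreSQL; the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")   # exactly one unit left

def try_purchase(product_id):
    """Atomic conditional decrement: the WHERE clause guards against overselling."""
    cur = conn.execute(
        "UPDATE inventory SET stock = stock - 1 "
        "WHERE product_id = ? AND stock > 0", (product_id,))
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated means the stock was already gone

first = try_purchase("sku-1")    # succeeds, stock goes 1 -> 0
second = try_purchase("sku-1")   # fails: WHERE stock > 0 matches no row
```

Because the check and the decrement happen in one statement, the database serializes them: stock can never go negative, no application-level lock required.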

 

Data & Search

Q: Why use Elasticsearch for search instead of a SQL LIKE query?

A: SQL LIKE '%query%' requires a full table scan — O(n) with no index support. It doesn't handle typos, synonyms, relevance ranking, or faceted filters. Elasticsearch is purpose-built for text search.

Inverted index enables near-constant-time lookups for any term.

        Typo tolerance with fuzzy matching (e.g. 'iphne' finds 'iPhone').

        Relevance scoring — most relevant results appear first.

        Faceted filters — filter by price, brand, rating simultaneously.

        Autocomplete — suggest queries as the user types.

        Near real-time — product updates sync to Elasticsearch within seconds via change data capture.
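The inverted index idea is simple enough to show in miniature. This toy version maps each term to the set of documents containing it, so a query never scans the full corpus (document IDs and text are made up for illustration):

```python
from collections import defaultdict

# Toy inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
docs = {1: "apple iphone 15 pro", 2: "samsung galaxy case", 3: "iphone 15 case"}

for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term):
    """Look up matching documents directly from the index -- no table scan."""
    return sorted(index[term.lower()])
```

Elasticsearch layers tokenization, fuzzy matching, and relevance scoring on top of this same core structure.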

 

Q: How would you design the database schema for orders?

A: Orders require strong consistency and ACID transactions. Use a relational DB with normalized tables.

        orders table: order_id (PK), user_id (FK), status, total_amount, created_at.

        order_items table: item_id (PK), order_id (FK), product_id, quantity, unit_price.

        payments table: payment_id (PK), order_id (FK), amount, gateway, status, transaction_ref.

        Indexes on: user_id (for order history), status (for admin queues), created_at (for reporting).

        Use foreign key constraints to ensure referential integrity.
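The schema above can be written out as DDL. This sketch uses SQLite for portability (types are simplified; in PostgreSQL you would use SERIAL/NUMERIC and TIMESTAMPTZ instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    status       TEXT NOT NULL DEFAULT 'PENDING',
    total_amount REAL NOT NULL,
    created_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE order_items (
    item_id    INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL,
    unit_price REAL NOT NULL
);
CREATE TABLE payments (
    payment_id      INTEGER PRIMARY KEY,
    order_id        INTEGER NOT NULL REFERENCES orders(order_id),
    amount          REAL NOT NULL,
    gateway         TEXT,
    status          TEXT NOT NULL,
    transaction_ref TEXT
);
-- Indexes for order history, admin queues, and reporting
CREATE INDEX idx_orders_user    ON orders(user_id);
CREATE INDEX idx_orders_status  ON orders(status);
CREATE INDEX idx_orders_created ON orders(created_at);
""")
```

One order row, many order_items rows, and one or more payments rows per order — the normalized shape that lets a single ACID transaction create the whole order atomically.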

 

Message Queues & Async

Q: Why use a message queue instead of direct service-to-service calls?

A: Direct synchronous calls create tight coupling. If the Notification Service is slow or down, the Order Service blocks and the user's checkout slows down. A queue decouples the producer from the consumer.

        Reliability: messages are durably stored — if a consumer crashes, messages are not lost.

        Backpressure: consumers process at their own pace without slowing producers.

        Retry logic: failed message processing is retried automatically.

        Fan-out: one order event can trigger multiple consumers (email, inventory, analytics) in parallel.

 

Monitoring & Observability

Q: How would you monitor this system in production?

A: Observability is built on three pillars: metrics, logs, and traces.

        Metrics (Prometheus + Grafana): request rate, error rate, latency (P50/P95/P99), cache hit rate, queue lag.

        Logs (ELK Stack): structured JSON logs from each service, searchable by order ID, user ID, trace ID.

        Distributed tracing (Jaeger / Datadog APM): trace a single request across all microservices to pinpoint bottlenecks.

        Alerting: PagerDuty alerts when error rate > 1%, P99 latency > 2s, or payment failures spike.

        Dashboards: separate dashboards per service and a global health dashboard.