System Design - The Complete Course
Table of contents
Getting Started
Chapter I
IP
OSI Model
TCP and UDP
Domain Name System (DNS)
Load Balancing
Clustering
Caching
Content Delivery Network (CDN)
Proxy
Availability
Scalability
Storage
Chapter II
N-tier architecture
Message Brokers
Message Queues
Publish-Subscribe
Enterprise Service Bus (ESB)
Monoliths and Microservices
Event-Driven Architecture (EDA)
Event Sourcing
Command and Query Responsibility Segregation (CQRS)
API Gateway
REST, GraphQL, gRPC
Long polling, WebSockets, Server-Sent Events (SSE)
Chapter IV
Appendix
Next Steps
References
IP
An IP address is a unique address that identifies a device on the
internet or a local network. IP stands for "Internet Protocol", which is
the set of rules governing the format of data sent via the internet or
local network.
Versions
IPv4
The original Internet Protocol is IPv4 which uses a 32-bit numeric dot-
decimal notation that only allows for around 4 billion IP addresses.
Initially, it was more than enough but as internet adoption grew we
needed something better.
Example: 102.22.192.181
IPv6
IPv6 is the newer, 128-bit version of the Internet Protocol. Its addresses are written in hexadecimal notation and provide a vastly larger address space than IPv4.
Example: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
Types
Public
Private
Static
A static IP address does not change and is one that was manually
created, as opposed to having been assigned. These addresses are
usually more expensive but are more reliable.
Example: They are usually used for important things like reliable geo-
location services, remote access, server hosting, etc.
Dynamic
A dynamic IP address changes from time to time and is not always the
same. It has been assigned by a Dynamic Host Configuration Protocol
(DHCP) server. Dynamic IP addresses are the most common type of
internet protocol addresses. They are cheaper to deploy and allow us
to reuse IP addresses within a network as needed.
Example: They are more commonly used for consumer equipment and
personal use.
OSI Model
The OSI Model is a logical and conceptual model that defines network
communication used by systems open to interconnection and
communication with other systems. The Open System Interconnection
(OSI Model) also defines a logical network and effectively describes
computer packet transfer by using various layers of protocols.
Layers
The seven abstraction layers of the OSI model can be defined as
follows, from top to bottom:
Application
This is the only layer that directly interacts with data from the user.
Software applications like web browsers and email clients rely on the
application layer to initiate communication. But it should be made clear
that client software applications are not part of the application layer,
rather the application layer is responsible for the protocols and data
manipulation that the software relies on to present meaningful data to
the user. Application layer protocols include HTTP as well as SMTP.
Presentation
The presentation layer is also called the Translation layer. The data
from the application layer is extracted here and manipulated as per the
required format to transmit over the network. The functions of the
presentation layer are translation, encryption/decryption, and
compression.
Session
Transport
Network
The network layer is responsible for facilitating data transfer between
two different networks. The network layer breaks up segments from
the transport layer into smaller units, called packets, on the sender's
device, and reassembles these packets on the receiving device. The
network layer also finds the best physical path for the data to reach its
destination; this is known as routing. If the two devices communicating
are on the same network, then the network layer is unnecessary.
Data Link
The data link layer is very similar to the network layer, except the data
link layer facilitates data transfer between two devices on the same
network. The data link layer takes packets from the network layer and
breaks them into smaller pieces called frames.
Physical
This layer includes the physical equipment involved in the data transfer,
such as the cables and switches. This is also the layer where the data
gets converted into a bit stream, which is a string of 1s and 0s. The
physical layer of both devices must also agree on a signal convention
so that the 1s can be distinguished from the 0s on both devices.
TCP
UDP
TCP vs UDP
TCP provides ordered delivery of data from user to server (and vice
versa), whereas UDP is not dedicated to end-to-end communications,
nor does it check the readiness of the receiver.
Domain Name System (DNS)
Once the IP address has been resolved, the client should be able to
request content from the resolved IP address. For example, the
resolved IP may return a webpage to be rendered in the browser.
Server types
Now, let's look at the four key groups of servers that make up the DNS
infrastructure.
DNS Resolver
TLD nameserver
A TLD nameserver maintains information for all the domain names that
share a common domain extension, such as .com, .net, or whatever
comes after the last dot in a URL.
Query Types
Recursive
Iterative
Non-recursive
Record Types
DNS records (aka zone files) are instructions that live in authoritative
DNS servers and provide information about a domain including what IP
address is associated with that domain and how to handle requests for
that domain.
There are more record types but for now, let's look at some of the most
commonly used ones:
Subdomains
DNS Zones
DNS Caching
Reverse DNS
A reverse DNS lookup is a DNS query for the domain name associated
with a given IP address. This accomplishes the opposite of the more
commonly used forward DNS lookup, in which the DNS system is
queried to return an IP address. The process of reverse resolving an IP
address uses PTR records. If the server does not have a PTR record, it
cannot resolve a reverse lookup.
Note: Reverse DNS lookups are not universally adopted as they are not
critical to the normal function of the internet.
Examples
Route53
Cloudflare DNS
Google Cloud DNS
Azure DNS
NS1
Load Balancing
Load balancing lets us distribute incoming network traffic across
multiple resources ensuring high availability and reliability by sending
requests only to resources that are online. This provides the flexibility
to add or subtract resources as demand dictates.
But why?
Modern high-traffic websites must serve hundreds of thousands, if not
millions, of concurrent requests from users or clients. To cost-
effectively scale to meet these high volumes, modern computing best
practice generally requires adding more servers.
A load balancer can sit in front of the servers and route client requests
across all servers capable of fulfilling those requests in a manner that
maximizes speed and capacity utilization. This ensures that no single
server is overworked, which could degrade performance. If a single
server goes down, the load balancer redirects traffic to the remaining
online servers. When a new server is added to the server group, the
load balancer automatically starts sending requests to it.
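To make this concrete, here is a minimal round-robin routing sketch in Python. The server addresses are hypothetical, and a real load balancer would also health-check servers and skip the offline ones:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distributes incoming requests across a pool of servers in turn."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._pool = cycle(self.servers)

    def route(self, request):
        # Pick the next server in rotation; a real balancer would also
        # health-check servers and skip the ones that are offline.
        server = next(self._pool)
        return server, request

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for i in range(5):
    server, _ = balancer.route(f"request-{i}")
    print(server)  # cycles 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1, ...
```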
Workload distribution
Layers
Network layer
This is the load balancer that works at the network's transport layer,
also known as layer 4. This performs routing based on networking
information such as IP addresses and is not able to perform content-
based routing. These are often dedicated hardware devices that can
operate at high speed.
Application layer
This is the load balancer that operates at the application layer, also
known as layer 7. Load balancers can read requests in their entirety
and perform content-based routing. This allows the management of
load based on a full understanding of traffic.
Types
Software
Software load balancers usually are easier to deploy than hardware
versions. They also tend to be more cost-effective and flexible, and
they are used in conjunction with software development environments.
The software approach gives us the flexibility of configuring the load
balancer to our environment's specific needs. The boost in flexibility
may come at the cost of having to do more work to set up the load
balancer. Compared to hardware versions, which offer more of a
closed-box approach, software balancers give us more freedom to
make changes and upgrades.
Software load balancers are widely used and are available either as
installable solutions that require configuration and management or as a
managed cloud service.
Hardware
As the name implies, a hardware load balancer relies on physical, on-
premises hardware to distribute application and network traffic. These
devices can handle a large volume of traffic but often carry a hefty
price tag and are fairly limited in terms of flexibility.
DNS
DNS load balancing is the practice of configuring a domain in the
Domain Name System (DNS) such that client requests to the domain
are distributed across a group of server machines.
Routing Algorithms
Advantages
Scalability
Redundancy
Flexibility
Efficiency
If failure detection is in place and the active load balancer fails, a
passive load balancer can take over, which makes our system more
fault-tolerant.
Features
Examples
Clustering
At a high level, a computer cluster is a group of two or more
computers, or nodes, that run in parallel to achieve a common goal.
This allows workloads consisting of a high number of individual,
parallelizable tasks to be distributed among the nodes in the cluster. As
a result, these tasks can leverage the combined memory and
processing power of each computer to increase overall performance.
Typically, at least one node is designated as the leader node and acts
as the entry point to the cluster. The leader node may be responsible
for delegating incoming work to the other nodes and, if necessary,
aggregating the results and returning a response to the user.
Types
Active-Active
Active-Passive
Like the active-active cluster configuration, an active-passive cluster
also consists of at least two nodes. However, as the name active-
passive implies, not all nodes are going to be active. For example, in
the case of two nodes, if the first node is already active, then the
second node must be passive or on standby.
Advantages
High availability
Scalability
Performance
Cost-effective
Load balancing shares some common traits with clustering, but they
are different processes. Clustering provides redundancy and boosts
capacity and availability. Servers in a cluster are aware of each other
and work together toward a common purpose. But with load balancing,
servers are not aware of each other. Instead, they react to the requests
they receive from the load balancer.
We can employ load balancing in conjunction with clustering but it also
is applicable in cases involving independent servers that share a
common purpose such as to run a website, business application, web
service, or some other IT resource.
Challenges
This becomes even more complicated if the nodes in the cluster are
not homogeneous. Resource utilization for each node must also be
closely monitored, and logs should be aggregated to ensure that the
software is behaving correctly.
Examples
Caching
"There are only two hard things in Computer Science: cache
invalidation and naming things." - Phil Karlton
A cache's primary purpose is to increase data retrieval performance by
reducing the need to access the underlying slower storage layer.
Trading off capacity for speed, a cache typically stores a subset of
data transiently, in contrast to databases whose data is usually
complete and durable.
No matter whether the cache is read or written, it's done one block at a
time. Each block also has a tag that includes the location where the
data was stored in the cache. When data is requested from the cache,
a search occurs through the tags to find the specific content that's
needed in level one (L1) of the memory. If the correct data isn't found,
more searches are conducted in L2.
If the data isn't found there, searches are continued in L3, then L4, and
so on until it has been found, then, it's read and loaded. If the data isn't
found in the cache at all, then it's written into it for quick retrieval the
next time.
Cache hit and Cache miss
Cache hit
A hot cache is an instance where data was read from the memory at
the fastest possible rate. This happens when the data is retrieved from
L1.
A cold cache is the slowest possible rate for data to be read, though,
it's still successful so it's still considered a cache hit. The data is just
found lower in the memory hierarchy such as in L3, or lower.
A warm cache is used to describe data that's found in L2 or L3. It's not
as fast as a hot cache, but it's still faster than a cold cache. Generally,
calling a cache warm is used to express that it's slower and closer to a
cold cache than a hot one.
Cache miss
A cache miss refers to the instance when the memory is searched and
the data isn't found. When this happens, the content is transferred and
written into the cache.
Cache Invalidation
Write-through cache
Write-around cache
Con: It increases cache misses because the cache system has to read
the information from the database in case of a cache miss. As a result,
this can lead to higher read latency for applications that write and
re-read the information quickly. Reads happen from slower back-end
storage and experience higher latency.
Write-back cache
Where the write is only done to the caching layer and the write is
confirmed as soon as the write to the cache completes. The cache
then asynchronously syncs this write to the database.
Pro: This would lead to reduced latency and high throughput for write-
intensive applications.
Con: There is a risk of data loss in case the caching layer crashes. We
can improve this by having more than one replica acknowledging the
write in the cache.
Eviction policies
First In First Out (FIFO): The cache evicts the first block accessed
first without any regard to how often or how many times it was
accessed before.
Last In First Out (LIFO): The cache evicts the block accessed most
recently first without any regard to how often or how many times it
was accessed before.
Least Recently Used (LRU): Discards the least recently used items
first.
Most Recently Used (MRU): Discards, in contrast to LRU, the most
recently used items first.
Least Frequently Used (LFU): Counts how often an item is needed.
Those that are used least often are discarded first.
Random Replacement (RR): Randomly selects a candidate item
and discards it to make space when necessary.
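To illustrate one of these policies, here is a minimal LRU cache sketch in Python using the standard library's OrderedDict; the capacity and keys are arbitrary:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # cache miss
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```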
Distributed Cache
Global Cache
As the name suggests, we will have a single shared cache that all the
application nodes will use. When the requested data is not found in the
global cache, it's the responsibility of the cache to find out the missing
piece of data from the underlying data store.
Use cases
Database Caching
Content Delivery Network (CDN)
Domain Name System (DNS) Caching
API Caching
Let's also look at some scenarios where we should not use cache:
Caching isn't helpful when it takes just as long to access the cache
as it does to access the primary data store.
Caching doesn't work as well when requests have low repetition
(higher randomness), because caching performance comes from
repeated memory access patterns.
Caching isn't helpful when the data changes frequently, as the
cached version gets out of sync, and the primary data store must
be accessed every time.
Advantages
Improves performance
Reduce latency
Reduce load on the database
Reduce network cost
Increase Read Throughput
Examples
Redis
Memcached
Amazon Elasticache
Aerospike
In a CDN, the origin server contains the original versions of the content
while the edge servers are numerous and distributed across various
locations around the world.
To minimize the distance between the visitors and the website's server,
a CDN stores a cached version of its content in multiple geographical
locations known as edge locations. Each edge location contains a
number of caching servers responsible for content delivery to visitors
within its proximity.
Once the static assets are cached on all the CDN servers for a
particular location, all subsequent website visitor requests for static
assets will be delivered from these edge servers instead of the origin,
thus reducing origin load and improving scalability.
Types
Push CDNs
Push CDNs receive new content whenever changes occur on the
server. We take full responsibility for providing content, uploading
directly to the CDN, and rewriting URLs to point to the CDN. We can
configure when content expires and when it is updated. Content is
uploaded only when it is new or changed, minimizing traffic, but
maximizing storage.
Sites with a small amount of traffic or sites with content that isn't often
updated work well with push CDNs. Content is placed on the CDNs
once, instead of being re-pulled at regular intervals.
Pull CDNs
Disadvantages
As we all know good things come with extra costs, so let's discuss
some disadvantages of CDNs:
Examples
Amazon CloudFront
Google Cloud CDN
Cloudflare CDN
Fastly
Proxy
A proxy server is an intermediary piece of hardware/software sitting
between the client and the backend server. It receives requests from
clients and relays them to the origin servers. Typically, proxies are used
to filter requests, log requests, or sometimes transform requests (by
adding/removing headers, encrypting/decrypting, or compression).
Types
Forward Proxy
A forward proxy, often called a proxy, proxy server, or web proxy is a
server that sits in front of a group of client machines. When those
computers make requests to sites and services on the internet, the
proxy server intercepts those requests and then communicates with
web servers on behalf of those clients, like a middleman.
Advantages
Although proxies provide the benefits of anonymity, they can still track
our personal information. Setup and maintenance of a proxy server can
be costly and requires configurations.
Reverse Proxy
Improved security
Caching
SSL encryption
Load balancing
Scalability and flexibility
Examples
Nginx
HAProxy
Traefik
Envoy
Availability
Availability is the time a system remains operational to perform its
required function in a specific period. It is a simple measure of the
percentage of time that a system, service, or machine remains
operational under normal conditions.
Availability = Uptime / (Uptime + Downtime)
Sequence
Overall availability decreases when two components are in sequence.
For example, if both Foo and Bar each had 99.9% availability, their total
availability in sequence would be 99.8%.
Parallel
Overall availability increases when two components are in parallel.
For example, if both Foo and Bar each had 99.9% availability, their total
availability in parallel would be 99.9999%.
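Assuming independent failures, these two cases can be computed as follows, where Foo and Bar are the two components from the examples above:

In sequence: Availability (Total) = Availability (Foo) × Availability (Bar)

In parallel: Availability (Total) = 1 − (1 − Availability (Foo)) × (1 − Availability (Bar))

For the examples above: 0.999 × 0.999 ≈ 0.998 (99.8%) in sequence, and 1 − (0.001 × 0.001) = 0.999999 (99.9999%) in parallel.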
Availability vs Reliability
Both high availability and fault tolerance apply to methods for providing
high uptime levels. However, they accomplish the objective differently.
Scalability
Scalability is the measure of how well a system responds to changes
by adding or removing resources to meet demands.
Vertical scaling
Advantages
Simple to implement
Easier to manage
Data consistent
Disadvantages
Risk of high downtime
Harder to upgrade
Can be a single point of failure
Horizontal scaling
Advantages
Increased redundancy
Better fault tolerance
Flexible and efficient
Easier to upgrade
Disadvantages
Increases complexity
Data inconsistency
Increased load on downstream services
Storage
Storage is a mechanism that enables a system to retain data, either
temporarily or permanently. This topic is mostly skipped over in the
context of system design, however, it is important to have a basic
understanding of some common types of storage techniques that can
help us fine-tune our storage components. Let's discuss some
important storage concepts:
RAID
There are different RAID levels, however, and not all have the goal of
providing redundancy. Let's discuss some commonly used RAID levels:
RAID 0: Also known as striping, data is split evenly across all the
drives in the array.
RAID 1: Also known as mirroring, at least two drives contain an
exact copy of a set of data. If a drive fails, the others will still work.
RAID 5: Striping with parity. Requires the use of at least 3 drives,
striping the data across multiple drives like RAID 0, but also has a
parity distributed across the drives.
RAID 6: Striping with double parity. RAID 6 is like RAID 5, but the
parity data are written to two drives.
RAID 10: Combines striping plus mirroring from RAID 0 and RAID 1.
It provides security by mirroring all data on secondary drives while
using striping across each set of drives to speed up data transfers.
Comparison

| Features | RAID 0 | RAID 1 |
| --- | --- | --- |
| Minimum Disks | 2 | 2 |
| Read Performance | High | High |
| Write Performance | High | Medium |
| Cost | Low | High |
| Fault Tolerance | None | Single-drive failure |
Volumes
File storage
File storage is a solution to store data as files and present it to its final
users as a hierarchical directory structure. The main advantage is to
provide a user-friendly solution to store and retrieve files. To locate a
file in file storage, the complete path of the file is required. It is
economical and easily structured and is usually found on hard drives,
which means that files appear exactly the same for the user as they do
on the hard drive.
Block storage
Block storage divides data into blocks (chunks) and stores them as
separate pieces. Each block of data is given a unique identifier, which
allows a storage system to place the smaller pieces of data wherever it
is most convenient.
Object Storage
Example: Amazon S3, Azure Blob Storage, Google Cloud Storage, etc.
NAS
HDFS
What is a Database?
What is DBMS?
Components
Here are some common components found across different databases:
Schema
Table
Column
Row
Types
Document
Key-value
Graph
Timeseries
Wide column
Multi-model
SQL and NoSQL databases are broad topics and will be discussed
separately in SQL databases and NoSQL databases. Learn how they
compare to each other in SQL vs NoSQL databases.
Challenges
SQL databases
A SQL (or relational) database is a collection of data items with pre-
defined relationships between them. These items are organized as a
set of tables with columns and rows. Tables are used to hold
information about the objects to be represented in the database. Each
column in a table holds a certain kind of data and a field stores the
actual value of an attribute. The rows in the table represent a collection
of related values of one object or entity.
Materialized views
N+1 query problem
The N+1 query problem happens when the data access layer executes
N additional SQL statements to fetch the same data that could have
been retrieved when executing the primary SQL query. The larger the
value of N, the more queries will be executed, and the larger the
performance impact.
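As a hedged illustration of the pattern, here is a self-contained Python sketch using SQLite; the posts/comments schema is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'first'), (2, 'second');
    INSERT INTO comments VALUES (1, 1, 'nice'), (2, 1, 'great'), (3, 2, 'ok');
""")

# N+1 pattern: one query for the posts, then one query per post.
posts = conn.execute("SELECT id, title FROM posts").fetchall()   # 1 query
for post_id, _ in posts:
    conn.execute("SELECT body FROM comments WHERE post_id = ?",
                 (post_id,)).fetchall()                          # N queries

# Fix: retrieve the same data with a single joined query.
rows = conn.execute("""
    SELECT p.id, p.title, c.body
    FROM posts p LEFT JOIN comments c ON c.post_id = p.id
""").fetchall()                                                  # 1 query
print(len(rows))
```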
Advantages
Disadvantages
Expensive to maintain
Difficult schema evolution
Performance hits (join, denormalization, etc.)
Difficult to scale due to poor horizontal scalability
Examples
PostgreSQL
MySQL
MariaDB
Amazon Aurora
NoSQL databases
NoSQL is a broad category that includes any database that doesn't use
SQL as its primary data access language. These types of databases
are also sometimes referred to as non-relational databases. Unlike in
relational databases, data in a NoSQL database doesn't have to
conform to a pre-defined schema. NoSQL databases follow the BASE
consistency model.
Document
Advantages
Disadvantages
Schemaless
Non-relational
Examples
MongoDB
Amazon DocumentDB
CouchDB
Key-value
Advantages
Disadvantages
Basic CRUD
Values can't be filtered
Lacks indexing and scanning capabilities
Not optimized for complex queries
Examples
Redis
Memcached
Amazon DynamoDB
Aerospike
Graph
The graph relates the data items in the store to a collection of nodes
and edges, the edges representing the relationships between the
nodes. The relationships allow data in the store to be linked together
directly and, in many cases, retrieved with one operation.
Advantages
Query speed
Agile and flexible
Explicit data representation
Disadvantages
Complex
No standardized query language
Use cases
Fraud detection
Recommendation engines
Social networks
Network mapping
Examples
Neo4j
ArangoDB
Amazon Neptune
JanusGraph
Time series
Advantages
Use cases
IoT data
Metrics analysis
Application monitoring
Understand financial trends
Examples
InfluxDB
Apache Druid
Wide column
Advantages
Highly scalable, can handle petabytes of data
Ideal for real-time big data applications
Disadvantages
Expensive
Increased write time
Use cases
Business analytics
Attribute-based data storage
Examples
BigTable
Apache Cassandra
ScyllaDB
Multi-model
Multi-model databases combine different database models (i.e.
relational, graph, key-value, document, etc.) into a single, integrated
backend. This means they can accommodate various data types,
indexes, queries, and store data in more than one model.
Advantages
Flexibility
Suitable for complex projects
Data consistent
Disadvantages
Complex
Less mature
Examples
ArangoDB
Azure Cosmos DB
Couchbase
High-level differences
Storage
SQL stores data in tables, where each row represents an entity and
each column represents a data point about that entity.
Schema
In SQL, each record conforms to a fixed schema, meaning the columns
must be decided and chosen before data entry and each row must
have data for each column. The schema can be altered later, but it
involves modifying the database using migrations.
Whereas in NoSQL, schemas are dynamic. Columns can be added on
the fly, and each row (or equivalent) doesn't have to contain data for
each column.
Querying
SQL databases use SQL (structured query language) for defining and
manipulating the data, which is very powerful.
Scalability
Reliability
The vast majority of relational databases are ACID compliant. So, when
it comes to data reliability and a safe guarantee of performing
transactions, SQL databases are still the better bet.
Reasons
As always we should always pick the technology that fits the
requirements better. So, let's look at some reasons for picking SQL or
NoSQL based database:
For SQL
For NoSQL
Database Replication
Replication is a process that involves sharing information to ensure
consistency between redundant resources such as multiple databases,
to improve reliability, fault-tolerance, or accessibility.
Master-Slave Replication
The master serves reads and writes, replicating writes to one or more
slaves, which serve only reads. Slaves can also replicate additional
slaves in a tree-like fashion. If the master goes offline, the system can
continue to operate in read-only mode until a slave is promoted to a
master or a new master is provisioned.
Advantages
Disadvantages
Master-Master Replication
Both masters serve reads/writes and coordinate with each other. If
either master goes down, the system can continue to operate with both
reads and writes.
Advantages
Applications can read from both masters.
Distributes write load across both master nodes.
Simple, automatic, and quick failover.
Disadvantages
Synchronous vs Asynchronous
replication
The primary difference between synchronous and asynchronous
replication is how the data is written to the replica. In synchronous
replication, data is written to primary storage and the replica
simultaneously. As such, the primary copy and the replica should
always remain synchronized.
Indexes
Indexes are well known when it comes to databases, they are used to
improve the speed of data retrieval operations on the data store. An
index makes the trade-offs of increased storage overhead, and slower
writes (since we not only have to write the data but also have to update
the index) for the benefit of faster reads. Indexes are used to quickly
locate data without having to examine every row in a database table.
Indexes can be created using one or more columns of a database
table, providing the basis for both rapid random lookups and efficient
access to ordered records.
Dense Index
In a dense index, an index record is created for every row of the table.
Records can be located directly as each record of the index holds the
search key value and the pointer to the actual record.
Sparse Index
In a sparse index, index records are created only for some of the records.
Sparse indexes require less maintenance than dense indexes at write-
time since they only contain a subset of the values. This lighter
maintenance burden means that inserts, updates, and deletes will be
faster. Having fewer entries also means that the index will use less
memory. Finding data is slower since a scan across the page typically
follows the binary search. Sparse indexes are also optional when
working with ordered data.
Terms
Keys
Super key: Set of all keys that can uniquely identify all the rows present
in a table.
Alternate key: Keys that are not primary keys are known as alternate
keys.
Anomalies
Example
| ID | Name | Role | Team |
| --- | --- | --- | --- |
| 1 | Peter | Software Engineer | A |
| 2 | Brian | DevOps Engineer | B |
| 3 | Hailey | Product Manager | C |
| 4 | Hailey | Product Manager | C |
| 5 | Steve | Frontend Engineer | D |
Let's imagine, we hired a new person "John" but they might not be
assigned a team immediately. This will cause an insertion anomaly as
the team attribute is not yet present.
Next, let's say Hailey from Team C got promoted, to reflect that change
in the database, we will need to update 2 rows to maintain consistency
which can cause an update anomaly.
Finally, we would like to remove Team B but to do that we will also need
to remove additional information such as name and role, this is an
example of a deletion anomaly.
Normalization
1NF
For a table to be in the first normal form (1NF), it should follow the
following rules:
2NF
For a table to be in the second normal form (2NF), it should follow the
following rules:
3NF
For a table to be in the third normal form (3NF), it should follow the
following rules:
BCNF
There are more normal forms such as 4NF, 5NF, and 6NF, but we
won't discuss them here. Check out this amazing video that goes into
detail.
Advantages
Disadvantages
Denormalization
Advantages
Disadvantages
Below are some disadvantages of denormalization:
ACID
Atomic
Consistent
On the completion of a transaction, the database is structurally sound.
Isolated
Durable
Once the transaction has been completed and the writes and updates
have been written to the disk, it will remain in the system even if a
system failure occurs.
BASE
BASE properties are much looser than ACID guarantees, but there isn't
a direct one-for-one mapping between the two consistency models.
Let us understand these terms:
Basic Availability
Soft-state
Stores don't have to be write-consistent, nor do different replicas have
to be mutually consistent all the time.
Eventual consistency
CAP Theorem
CAP theorem states that a distributed system can deliver only two of
the three desired characteristics Consistency, Availability, and Partition
tolerance (CAP).
Consistency
Consistency means that all clients see the same data at the same time,
no matter which node they connect to. For this to happen, whenever
data is written to one node, it must be instantly forwarded or replicated
across all the nodes in the system before the write is deemed
"successful".
Availability
Availability means that any client making a request for data gets a
response, even if one or more nodes are down.
Partition tolerance
Partition tolerance means the system continues to work despite
message loss or partial failure. A system that is partition-tolerant can
sustain any amount of network failure that doesn't result in a failure of
the entire network. Data is sufficiently replicated across combinations
of nodes and networks to keep the system up through intermittent
outages.
Consistency-Availability Tradeoff
CA database
AP database
PACELC Theorem
The PACELC theorem is an extension of the CAP theorem. The CAP
theorem states that in the case of network partitioning (P) in a
distributed system, one has to choose between Availability (A) and
Consistency (C).
Transactions
A transaction is a series of database operations that are considered to
be a "single unit of work". The operations in a transaction either all
succeed, or they all fail. In this way, the notion of a transaction
supports data integrity when part of a system fails. Not all databases
choose to support ACID transactions, usually because they are
prioritizing other optimizations that are hard or theoretically impossible
to implement together.
States
Active
In this state, the transaction is being executed. This is the initial state of
every transaction.
Partially Committed
When a transaction executes its final operation, it is said to be in a
partially committed state.
Committed
If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the
database system.
Failed
Aborted
If any of the checks fail and the transaction has reached a failed state,
then the recovery manager rolls back all its write operations on the
database to bring the database back to its original state where it was
prior to the execution of the transaction. Transactions in this state are
aborted.
The database recovery module can select one of the two operations
after a transaction aborts:
Terminated
If there isn't any roll-back or the transaction comes from the committed
state, then the system is consistent and ready for a new transaction
and the old transaction is terminated.
Distributed Transactions
A distributed transaction is a set of operations on data that is
performed across two or more databases. It is typically coordinated
across separate nodes connected by a network, but may also span
multiple databases on a single server.
In other words, all the nodes must commit, or all must abort and the
entire transaction rolls back. This is why we need distributed
transactions.
Two-phase commit
Phases
Prepare phase
Commit phase
If all participants respond to the coordinator that they are prepared,
then the coordinator asks all the nodes to commit the transaction. If a
failure occurs, the transaction will be rolled back.
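Here is a toy sketch of the two-phase commit flow in Python. The participants are simulated in-process; in reality, the coordinator and participants communicate over a network:

```python
class Participant:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def prepare(self) -> bool:
        # Phase 1: vote yes only if we can guarantee a later commit.
        return self.healthy

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")

def two_phase_commit(participants) -> bool:
    # Prepare phase: every participant must vote yes.
    if all(p.prepare() for p in participants):
        for p in participants:   # Commit phase
            p.commit()
        return True
    for p in participants:       # Any 'no' vote aborts the whole transaction.
        p.rollback()
    return False

two_phase_commit([Participant("orders-db"), Participant("payments-db")])
```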
Problems
Three-phase commit
Phases
Prepare phase
Pre-commit phase
Coordinator issues the pre-commit message and all the participating
nodes must acknowledge it. If a participant fails to receive this
message in time, then the transaction is aborted.
Commit phase
If a participant node reaches this phase, it means that every
participant has completed the prepare phase, so its completion is
guaranteed.
Every phase can now time out, avoiding indefinite waits.
Sagas
Coordination
There are two common implementation approaches:
Choreography: Each local transaction publishes domain events that
trigger local transactions in other services.
Orchestration: An orchestrator tells the participants what local
transactions to execute.
Problems
Sharding
Before we discuss sharding, let's talk about data partitioning:
Data Partitioning
Methods
There are many different ways one could use to decide how to break
up an application database into multiple smaller DBs. Below are three
of the most popular methods used by various large-scale applications:
Vertical Partitioning
What is sharding?
Hash-Based
List-Based
Range Based
Composite
Advantages
Disadvantages
Here are some reasons where sharding might be the right choice:
Consistent Hashing
Let's first understand the problem we're trying to solve.
In a traditional hash-based setup, a request is routed with a formula
such as:

node = hash(key) mod N

Where,
key: Request ID or IP.
N: Number of nodes.

The problem with this is that whenever a node is added or removed,
nearly all keys have to be remapped. Consistent hashing minimizes
this; on average, the number of keys that need to be remapped is only:

R = K/N

Where,
R: Number of keys to be remapped.
K: Total number of keys.
N: Number of nodes.
The output of the hash function is a range, let's say 0...m-1, which we
can represent on our hash ring. We hash the requests and distribute
them on the ring depending on the output. Similarly, we also hash the
nodes and distribute them on the same ring.

position = hash(key)

Where,
key: Request/Node ID or IP.
Now, when the request comes in we can simply route it to the closest
node in a clockwise (can be counterclockwise as well) manner. This
means that if a new node is added or removed, we can use the nearest
node and only a fraction of the requests need to be re-routed.
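A minimal sketch of such a hash ring in Python; the node names are illustrative, and real systems typically place many virtual nodes per physical node (discussed next):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a key (request or node ID/IP) onto the ring 0..2^32 - 1.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Sorted ring positions alongside their owning nodes.
        self._ring = sorted((_hash(node), node) for node in nodes)

    def route(self, key: str) -> str:
        position = _hash(key)
        # Find the first node clockwise from the key's position,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect(self._ring, (position, ""))
        if index == len(self._ring):
            index = 0
        return self._ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.route("user-42"))  # the same key always maps to the same node
```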
Virtual Nodes
As VNodes help spread the load more evenly across the physical
nodes on the cluster by dividing the hash ranges into smaller
subranges, this speeds up the re-balancing process after adding or
removing nodes. It also helps us reduce the probability of hotspots.
Data replication
The replication factor is the number of nodes that will receive the copy
of the same data. In eventually consistent systems, this is done
asynchronously.
Advantages
Disadvantages
Increases complexity.
Cascading failures.
Load distribution can still be uneven.
Key management can be expensive when nodes transiently fail.
Examples
Characteristics
Advantages
Disadvantages
N-tier architecture
N-tier architecture divides an application into logical layers and
physical tiers. Layers are a way to separate responsibilities and
manage dependencies. Each layer has a specific responsibility. A
higher layer can use services in a lower layer, but not the other way
around.
In a closed layer architecture, a layer can only call the next layer
immediately down.
In an open layer architecture, a layer can call any of the layers
below it.
3-Tier architecture
2-Tier architecture
Advantages
Disadvantages
Message Brokers
A message broker is a software that enables applications, systems, and
services to communicate with each other and exchange information.
The message broker does this by translating messages between formal
messaging protocols. This allows interdependent services to "talk" with
one another directly, even if they were written in different languages or
implemented on different platforms.
Models
Examples
NATS
Apache Kafka
RabbitMQ
ActiveMQ
Message Queues
A message queue is a form of service-to-service communication that
facilitates asynchronous communication. It asynchronously receives
messages from producers and sends them to consumers.
Working
Messages are stored in the queue until they are processed and
deleted. Each message is processed only once by a single consumer.
Here's how it works:
Advantages
Features
FIFO (First-In-First-Out) Queues
In these queues, the oldest (or first) entry, sometimes called the "head"
of the queue, is processed first.
At-Least-Once Delivery
Exactly-Once Delivery
Dead-letter Queues
A dead-letter queue is a queue to which other queues can send
messages that can't be processed successfully. This makes it easy to
set them aside for further inspection without blocking the queue
processing or spending CPU cycles on a message that might never be
consumed successfully.
Ordering
Poison-pill Messages
Poison pills are special messages that can be received, but not
processed. They are a mechanism used in order to signal a consumer
to end its work so it is no longer waiting for new inputs, and are similar
to closing a socket in a client/server model.
Security
Task Queues
Task queues receive tasks and their related data, run them, then
deliver their results. They can support scheduling and can be used to
run computationally intensive jobs in the background.
Backpressure
If queues start to grow significantly, the queue size can become larger
than memory, resulting in cache misses, disk reads, and even slower
performance. Backpressure can help by limiting the queue size,
thereby maintaining a high throughput rate and good response times
for jobs already in the queue. Once the queue fills up, clients get a
server busy or HTTP 503 status code to try again later. Clients can
retry the request at a later time, perhaps with exponential backoff
strategy.
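A minimal sketch of backpressure with a bounded queue in Python; the queue size and job names are arbitrary:

```python
import queue

# A bounded queue applies backpressure by rejecting work once full.
jobs = queue.Queue(maxsize=1000)

def submit(job) -> bool:
    try:
        jobs.put_nowait(job)
        return True   # accepted
    except queue.Full:
        return False  # caller should respond with "server busy" / HTTP 503

for i in range(1500):
    if not submit(f"job-{i}"):
        # In a web service, this is where we'd return HTTP 503 so the
        # client can back off and retry, e.g. with exponential backoff.
        break
```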
Examples
Amazon SQS
RabbitMQ
ActiveMQ
ZeroMQ
Publish-Subscribe
Similar to a message queue, publish-subscribe is also a form of
service-to-service communication that facilitates asynchronous
communication. In a pub/sub model, any message published to a topic
is pushed immediately to all the subscribers of the topic.
Working
Unlike message queues, which batch messages until they are
retrieved, message topics transfer messages with little or no queuing
and push them out immediately to all subscribers. Here's how it works:
Advantages
Features
Push Delivery
Fanout
Filtering
Security
Examples
Amazon SNS
Google Pub/Sub
Disadvantages
Examples
Monoliths
Disadvantages
Microservices
Services are responsible for persisting their own data or external state
(database per service). This differs from the traditional model, where a
separate data layer handles data persistence.
Characteristics
Advantages
Disadvantages
Best practices
Pitfalls
So, you might be wondering: monoliths seem like a bad idea to begin
with, why would anyone use them?
Well, it depends. While each approach has its own advantages and
disadvantages, it is advised to start with a monolith when building a
new system. It is important to understand that microservices are not a
silver bullet; instead, they solve an organizational problem.
Microservices architecture is about your organizational priorities and
team as much as it's about technology.
What is an event?
Patterns
Sagas
Publish-Subscribe
Event Sourcing
Command and Query Responsibility Segregation (CQRS)
Note: Each of these methods is discussed separately.
Advantages
Challenges
Guaranteed delivery.
Error handling is difficult.
Event-driven systems are complex in general.
Exactly once, in-order processing of events.
Use cases
NATS
Apache Kafka
Amazon EventBridge
Amazon SNS
Google PubSub
Event Sourcing
Instead of storing just the current state of the data in a domain, use an
append-only store to record the full series of actions taken on that
data. The store acts as the system of record and can be used to
materialize the domain objects.
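As a hedged illustration, here is a minimal event-sourced aggregate in Python: a hypothetical bank account whose balance is derived by replaying an append-only event log:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str
    amount: int

@dataclass
class Account:
    # Append-only store: the events are the system of record.
    events: list = field(default_factory=list)

    def deposit(self, amount: int):
        self.events.append(Event("deposited", amount))

    def withdraw(self, amount: int):
        self.events.append(Event("withdrew", amount))

    @property
    def balance(self) -> int:
        # The current state is materialized by replaying every event.
        total = 0
        for event in self.events:
            total += event.amount if event.kind == "deposited" else -event.amount
        return total

account = Account()
account.deposit(100)
account.withdraw(30)
print(account.balance)  # 70, derived purely from the event log
```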
Advantages
Disadvantages
Command and Query Responsibility Segregation (CQRS)
The CQRS pattern is often used along with the Event Sourcing pattern.
CQRS-based systems use separate read and write data models, each
tailored to relevant tasks and often located in physically separate
stores.
When used with the Event Sourcing pattern, the store of events is the
write model and is the official source of information. The read model of
a CQRS-based system provides materialized views of the data,
typically as highly denormalized views.
Advantages
Disadvantages
Use cases
API Gateway
The API Gateway is an API management tool that sits between a client
and a collection of backend services. It is a single entry point into a
system that encapsulates the internal system architecture and provides
an API that is tailored to each client. It also has other responsibilities
such as authentication, monitoring, load balancing, caching, throttling,
logging, etc.
Features
Advantages
Backend For Frontend (BFF) pattern
The primary function of the backend for frontend pattern is to get
the required data from the appropriate service, format the data, and
send it to the frontend.
Examples
What's an API?
Before we even get into API technologies, let's first understand what is
an API.
REST
Concepts
Constraints
HTTP Verbs
For example, HTTP 200 means that the request was successful.
Advantages
Disadvantages
Over-fetching of data.
Sometimes multiple round trips to the server are required.
Use cases
REST APIs are pretty much used universally and are the default
standard for designing APIs. Overall REST APIs are quite flexible and
can fit almost all scenarios.
Example
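A hedged sketch of what consuming a typical REST API might look like in Python; the base URL and the users resource are hypothetical:

```python
import requests

BASE_URL = "https://api.example.com"

# Create a resource.
response = requests.post(f"{BASE_URL}/users", json={"name": "Karan"})
user = response.json()

# Retrieve, update, and delete it using the uniform HTTP verbs.
requests.get(f"{BASE_URL}/users/{user['id']}")
requests.put(f"{BASE_URL}/users/{user['id']}", json={"name": "Karan P."})
requests.delete(f"{BASE_URL}/users/{user['id']}")
```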
GraphQL
Concepts
Queries
Resolvers
Advantages
Disadvantages
Example
Here's a GraphQL schema that defines a User type and a Query type.
```graphql
type Query {
  getUser: User
}

type User {
  id: ID
  name: String
  city: String
  state: String
}
```
Using the above schema, the client can request the required fields
easily without having to fetch the entire resource or guess what the API
might return.
```graphql
{
  getUser {
    id
    name
    city
  }
}
```
```json
{
  "getUser": {
    "id": 123,
    "name": "Karan",
    "city": "San Francisco"
  }
}
```
gRPC
Concepts
Protocol buffers
Advantages
Disadvantages
Use cases
```protobuf
service HelloService {
  rpc SayHello (HelloRequest) returns (HelloResponse);
}

message HelloRequest {
  string greeting = 1;
}

message HelloResponse {
  string reply = 1;
}
```
Now that we know how these API designing techniques work, let's
compare them based on the following parameters:
Long polling
In Long polling, the server does not close the connection once it
receives a request from the client. Instead, the server responds only if
any new message is available or a timeout threshold is reached.
Once the client receives a response, it immediately sends a new
request to the server to have a new pending connection to send data to
the client, and the operation is repeated. With this approach, the server
emulates a real-time server push feature.
Working
Advantages
Disadvantages
A major downside of long polling is that it is usually not scalable. Below
are some of the other reasons:
WebSockets
Advantages
Disadvantages
Working
Advantages
Geohashing
For example: 9q8yy9.
Use cases
Examples
MySQL
Redis
Amazon DynamoDB
Google Cloud Firestore
Quadtrees
Types of Quadtrees
Use cases
Circuit breaker
The circuit breaker is a design pattern used to detect failures and
encapsulates the logic of preventing a failure from constantly recurring
during maintenance, temporary external system failure, or unexpected
system difficulties.
The basic idea behind the circuit breaker is very simple. We wrap a
protected function call in a circuit breaker object, which monitors for
failures. Once the failures reach a certain threshold, the circuit breaker
trips, and all further calls to the circuit breaker return with an error,
without the protected call being made at all. Usually, we'll also want
some kind of monitor alert if the circuit breaker trips.
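A minimal sketch of this idea in Python; the failure threshold and reset timeout are arbitrary:

```python
import time

class CircuitBreaker:
    """Trips to 'open' after repeated failures, then retries after a timeout."""

    def __init__(self, threshold=3, reset_timeout=30):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result
```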
States
Closed
When everything is normal, the circuit breaker remains closed, and all
requests pass through to the services as normal. If the number of
failures increases beyond the threshold, the circuit breaker trips and
goes into an open state.
Open
Half-open
Rate Limiting
Rate limiting refers to preventing the frequency of an operation from
exceeding a defined limit. In large-scale systems, rate limiting is
commonly used to protect underlying services and resources. Rate
limiting is generally used as a defensive mechanism in distributed
systems, so that shared resources can maintain availability. It also
protects our APIs from unintended or malicious overuse by limiting the
number of requests that can reach our API in a given period of time.
Algorithms
There are various algorithms for API rate limiting, each with its
advantages and disadvantages. Let's briefly discuss some of these
algorithms:
Leaky Bucket
Token Bucket
Fixed Window
Sliding Log
Sliding Log rate-limiting involves tracking a time-stamped log for each
request. The system stores these logs in a time-sorted hash set or
table. It also discards logs with timestamps beyond a threshold. When
a new request comes in, we calculate the sum of logs to determine the
request rate. If the request would exceed the threshold rate, then it is
held.
Sliding Window
Sliding Window is a hybrid approach that combines the fixed window
algorithm's low processing cost and the sliding log's improved
boundary conditions. Like the fixed window algorithm, we track a
counter for each fixed window. Next, we account for a weighted value
of the previous window's request rate based on the current timestamp
to smooth out bursts of traffic.
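As an illustration of one of these algorithms, here is a minimal token bucket sketch in Python; the capacity and refill rate are arbitrary:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend a token for this request
            return True
        return False          # rate limit exceeded

limiter = TokenBucket(capacity=5, refill_rate=1)  # 5-burst, 1 request/second
print([limiter.allow() for _ in range(7)])  # first 5 True, then False
```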
Inconsistencies
The simplest way to solve this problem is to use sticky sessions in our
load balancers so that each consumer gets sent to exactly one node
but this causes a lack of fault tolerance and scaling problems. Another
approach might be to use a centralized data store like Redis but this
will increase latency and cause race conditions.
Race Conditions
Service Discovery
Service discovery is the detection of services within a computer
network. Service Discovery Protocol (SDP) is a networking standard
that accomplishes the detection of networks by identifying resources.
Implementations
Client-side discovery
In this approach, the client obtains the location of another service by
querying a service registry which is responsible for managing and
storing the network locations of all the services.
Server-side discovery
Service Registry
A service registry is basically a database containing the network
locations of service instances to which the clients can reach out. A
Service Registry must be highly available and up-to-date.
Service Registration
Self-Registration
Third-party Registration
Service mesh
etcd
Consul
Apache Thrift
Apache Zookeeper
SLAs, SLOs, and SLIs allow companies to define, track and monitor the
promises made for a service to its users. Together, SLAs, SLOs, and
SLIs should help teams generate more user trust in their services with
an added emphasis on continuous improvement to incident
management and response processes.
SLA
SLI
Disaster recovery
Disaster recovery (DR) is a process of regaining access and
functionality of the infrastructure after events like a natural disaster,
cyber attack, or even business disruptions.
Terms
RTO
Recovery Time Objective (RTO) is the maximum acceptable delay
between the interruption of service and restoration of service. This
determines what is considered an acceptable time window when
service is unavailable.
RPO
Strategies
A variety of disaster recovery (DR) strategies can be part of a disaster
recovery plan.
Back-up
This is the simplest type of disaster recovery and involves storing data
off-site or on a removable drive.
Cold Site
Hot site
A hot site maintains up-to-date copies of data at all times. Hot sites are
time-consuming to set up and more expensive than cold sites, but they
dramatically reduce downtime.
Virtual Machines (VM)
VMs are isolated from the rest of the system, and multiple VMs can
exist on a single piece of hardware, like a server. They can be moved
between host servers depending on the demand or to use resources
more efficiently.
What is a Hypervisor?
Containers
Separation of responsibility
Workload portability
Application isolation
Agile development
Efficient operations
Virtualization vs Containerization
In traditional virtualization, a hypervisor virtualizes physical hardware.
The result is that each virtual machine contains a guest OS, a virtual
copy of the hardware that the OS requires to run, and an application
and its associated libraries and dependencies.
OAuth 2.0
Concepts
The OAuth 2.0 protocol defines the following entities:
Disadvantages
OpenID Connect
Concepts
Both OAuth 2.0 and OIDC are easy to implement and are JSON based,
which is supported by most web and mobile applications. However, the
OpenID Connect (OIDC) specification is more strict than that of basic
OAuth.
The user credentials and other identifying information are stored and
managed by a centralized system called Identity Provider (IdP). The
Identity Provider is a trusted system that provides access to other
websites and applications.
Single Sign-On (SSO) based authentication systems are commonly
used in enterprise environments where employees require access to
multiple applications of their organizations.
Components
Service Provider
Identity Broker
There are many differences between SAML, OAuth, and OIDC. SAML
uses XML to pass messages, while OAuth and OIDC use JSON. OAuth
provides a simpler experience, while SAML is geared towards
enterprise security.
Advantages
Disadvantages
Examples
Okta
Google
Auth0
OneLogin
SSL
SSL stands for Secure Sockets Layer, and it refers to a protocol for
encrypting and securing communications that take place on the
internet. It was first developed in 1995 but has since been deprecated
in favor of TLS (Transport Layer Security).
TLS
mTLS
mTLS helps ensure that the traffic is secure and trusted in both
directions between a client and server. This provides an additional
layer of security for users who log in to an organization's network or
applications. It also verifies connections with client devices that do not
follow a login process, such as Internet of Things (IoT) devices.
Requirements clarifications
Functional requirements
These are the requirements that the end user specifically demands as
basic functionalities that the system should offer. All these
functionalities need to be necessarily incorporated into the system as
part of the contract.
For example:
"What are the features that we need to design for this system?"
"What are the edge cases we need to consider, if any, in our
design?"
Non-functional requirements
These are the quality constraints that the system must satisfy
according to the project contract. The priority or extent to which these
factors are implemented varies from one project to another. They are
also called non-behavioral requirements. For example, portability,
maintainability, reliability, scalability, security, etc.
For example:
"What is the desired scale that this system will need to handle?"
"What is the read/write ratio of our system?"
"How many requests per second?"
"How much storage will be needed?"
Once we have the estimations, we can start with defining the database
schema. Doing so in the early stages of the interview would help us to
understand the data flow which is the core of every system. In this
step, we basically define all the entities and relationships between
them.
API design
Next, we can start designing APIs for the system. These APIs will help
us define the expectations from the system explicitly. We don't have to
write any code, just a simple interface defining the API requirements
such as parameters, functions, classes, types, entities, etc.
For example:
Now that we have established our data model and API design, it's time
to identify system components (such as Load Balancers, API Gateway,
etc.) that are needed to solve our problem and draft the first design of
our system.
Detailed design
Now it's time to go into detail about the major components of the
system we designed. As always discuss with the interviewer which
component may need further improvements.
URL Shortener
Let's design a URL shortener, similar to services like Bitly and TinyURL.
A URL shortener service creates an alias or a short URL for a long URL.
Users are redirected to the original URL when they visit these short
links.
For example, the following long URL can be changed to a shorter URL.
Requirements
Functional requirements
Given a URL, our service should generate a shorter and unique
alias for it.
Users should be redirected to the original URL when they visit the
short link.
Links should expire after a default timespan.
Non-functional requirements
Extended requirements
Note: Make sure to check any scale or traffic related assumptions with
your interviewer.
Traffic
100 million requests per month translate into 40 requests per second.
100 million / (30 days × 24 hrs × 3600 seconds) = ~40 URLs/second
And with a 100:1 read/write ratio, the number of redirections will be:

100 × 40 URLs/second = ~4K requests/second
Bandwidth
Storage
Cache
For caching, we will follow the classic Pareto principle also known as
the 80/20 rule. This means that 80% of the requests are for 20% of the
data, so we can cache around 20% of our requests.
High-level estimate
| Type | Estimate |
| --- | --- |
| Writes (New URLs) | 40/s |
| Reads (Redirection) | 4K/s |
| Bandwidth (Incoming) | 20 KB/s |
| Bandwidth (Outgoing) | 2 MB/s |
| Storage (10 years) | 6 TB |
| Memory (Caching) | ~35 GB/day |
Next, we will focus on the data model design. Here is our database
schema:
Initially, we can get started with just two tables:
users
urls
API design
This API should create a new short URL in our system given an original
URL.
Parameters
Returns
Get URL
This API should retrieve the original URL from a given short URL.
Parameters
Returns
Delete URL
This API should delete a given short URL from our system.
Parameters
Returns
As you must've noticed, we're using an API key to prevent abuse of our
services. Using this API key we can limit the users to a certain number
of requests per second or minute. This is quite a standard practice for
developer APIs and should cover our extended requirement.
High-level design
URL Encoding
Base62 Approach
In this approach, we can encode the original URL using Base62 which
consists of the capital letters A-Z, the lower case letters a-z, and the
numbers 0-9.
Number of URLs = 62^N

Where,
N: Number of characters in the generated key.
This is the simplest solution here, but it does not guarantee non-
duplicate or collision-resistant keys.
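A minimal sketch of Base62 encoding in Python, assuming one common alphabet ordering (0-9, a-z, A-Z):

```python
import string

# 62 characters: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(number: int) -> str:
    if number == 0:
        return ALPHABET[0]
    encoded = []
    while number > 0:
        number, remainder = divmod(number, 62)
        encoded.append(ALPHABET[remainder])
    return "".join(reversed(encoded))

# A database ID (or hash) becomes a short, URL-safe key.
print(base62_encode(1234567890))  # '1ly7vk'
```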
MD5 Approach
However, this creates a new issue for us, which is duplication and
collision. We can try to re-compute the hash until we find a unique one
but that will increase the overhead of our systems. It's better to look for
more scalable approaches.
Counter Approach
In this approach, we will start with a single server which will maintain
the count of the keys generated. Once our service receives a request,
it can reach out to the counter which returns a unique number and
increments the counter. When the next request comes the counter
again returns the unique number and this goes on.
The problem with this approach is that it can quickly become a single
point of failure. And if we run multiple instances of the counter, we can
have collisions as it's essentially a distributed system.
Once the key is used, we can mark it in the database to make sure we
don't reuse it, however, if there are multiple server instances reading
data concurrently, two or more servers might try to use the same key.
The easiest way to solve this would be to store keys in two tables. As
soon as a key is used, we move it to a separate table with appropriate
locking in place. Also, to improve reads, we can keep some of the keys
in memory.
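To make the two-table idea concrete, here is a minimal sketch of key allocation, assuming a generic SQL client; the Database/Tx helpers and table names are illustrative, not a specific library's API:

```typescript
// Minimal sketch: move a key from the unused pool to the used table
// inside one transaction so two API servers can never claim the same key.
interface Tx {
  queryOne(sql: string): Promise<string>;
  query(sql: string, params: unknown[]): Promise<void>;
}
interface Database {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

async function allocateKey(db: Database): Promise<string> {
  return db.transaction(async (tx) => {
    // Lock a single unused key so concurrent allocations skip it.
    const key = await tx.queryOne(
      "SELECT key FROM unused_keys LIMIT 1 FOR UPDATE SKIP LOCKED"
    );
    await tx.query("DELETE FROM unused_keys WHERE key = $1", [key]);
    await tx.query("INSERT INTO used_keys (key) VALUES ($1)", [key]);
    return key;
  });
}
```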
While 390 GB seems like a lot for this simple use case, it is important to
remember this is for the entirety of our service lifetime and the size of
the keys database would not increase like our main database.
Caching
Now, let's talk about caching. As per our estimations, we will require
around ~35 GB of memory per day to cache 20% of the incoming
requests to our services. For this use case, we can use Redis or
Memcached servers alongside our API server.
Design
Now that we have identified some core components, let's do the first
draft of our system design.
Here's how it works:
1. When a user creates a new URL, our API server requests a new
unique key from the Key Generation Service (KGS).
2. Key Generation Service provides a unique key to the API server
and marks the key as used.
3. API server writes the new URL entry to the database and cache.
4. Our service returns an HTTP 201 (Created) response to the user.
Accessing a URL
Data Partitioning
Hash-Based Partitioning
List-Based Partitioning
Range Based Partitioning
Composite Partitioning
The above approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
Database cleanup
Active cleanup
Passive cleanup
For passive cleanup, we can remove the entry when a user tries to access an expired link, as sketched below. This ensures a lazy cleanup of our database and cache.
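A minimal in-memory sketch of the lazy expiration check (in the real system this would hit the database and cache instead of a Map):

```typescript
interface URLEntry { originalURL: string; expiration: Date; }
const urls = new Map<string, URLEntry>();

function resolve(hash: string): string | null {
  const entry = urls.get(hash);
  if (!entry) return null;
  if (entry.expiration < new Date()) {
    urls.delete(hash); // lazily remove the expired entry on access
    return null;       // caller can respond with 404 or 410 Gone
  }
  return entry.originalURL; // caller issues the HTTP redirect
}
```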
Cache
Least Recently Used (LRU) can be a good policy for our system. In this
policy, we discard the least recently used key first.
Whenever there is a cache miss, our servers can hit the database
directly and update the cache with the new entries.
Security
We can also use an API Gateway as they can support capabilities like
authorization, rate limiting, and load balancing out of the box.
WhatsApp
Let's design a WhatsApp-like instant messaging service, similar to services like WhatsApp, Facebook Messenger, and WeChat.
What is WhatsApp?
Requirements
Functional requirements
Non-functional requirements
Extended requirements
Traffic
2 billion requests per day translate into 24K requests per second.
$$\frac{2 \text{ billion}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 24\text{K requests/second}$$
Storage
Bandwidth
$$\frac{10.2 \text{ TB}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 120 \text{ MB/second}$$
High-level estimate
| Type | Estimate |
| --- | --- |
| Daily active users (DAU) | 50 million |
| Requests per second (RPS) | 24K/s |
| Storage (per day) | ~10.2 TB |
| Storage (10 years) | ~38 PB |
| Bandwidth | ~120 MB/s |
Data model design
users
messages
As the name suggests, this table will store messages with properties
such as type (text, image, video, etc.), content , and timestamps for
message delivery. The message will also have a corresponding chatID
or groupID .
chats
This table basically represents a private chat between two users and
can contain multiple messages.
users_chats
This table maps users and chats as multiple users can have multiple
chats (N:M relationship) and vice versa.
groups
users_groups
This table maps users and groups as multiple users can be a part of
multiple groups (N:M relationship) and vice versa.
While our data model seems quite relational, we don't necessarily need
to store everything in a single database, as this can limit our scalability
and quickly become a bottleneck.
API design
Get all chats or groups
This API will get all chats or groups for a given userID.
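Given the parameters and return type below, a plausible signature (the name is an assumption):

getAll(userID: UUID): Chat[] | Group[]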
Parameters
User ID ( UUID ): ID of the current user.
Returns
Result ( Chat[] | Group[] ): All the chats and groups the user is a part of.
Get messages
Get all messages for a user given the channelID (chat or group id).
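A plausible signature (names are assumptions):

getMessages(userID: UUID, channelID: UUID): Message[]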
Parameters
Returns
Send message
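A plausible signature (names are assumptions):

sendMessage(userID: UUID, channelID: UUID, message: Message): boolean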
Parameters
Message ( Message ): The message (text, image, video, etc.) that the user
wants to send.
Returns
Parameters
Returns
High-level design
Architecture
We will be using microservices architecture since it will make it easier
to horizontally scale and decouple our services. Each service will have
ownership of its own data model. Let's try to divide our system into
some core services.
User Service
Chat Service
The chat service will use WebSockets to establish connections with the client and handle chat and group message-related functionality. We can also use a cache to keep track of all the active connections, sort of like sessions, which will help us determine whether a user is online or not.
Notification Service
This service will simply send push notifications to the users. It will be
discussed in detail separately.
Presence Service
The presence service will keep track of the last seen status of all users.
It will be discussed in detail separately.
Media service
This service will handle the media (images, videos, files, etc.) uploads.
It will be discussed in detail separately.
Note: Learn more about REST, GraphQL, gRPC and how they compare
with each other.
Real-time messaging
Pull model
Push model
The client opens a long-lived connection with the server and once new
data is available it will be pushed to the client. We can use WebSockets
or Server-Sent Events (SSE) for this.
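As an illustration, a minimal sketch of the push model with WebSockets, assuming the popular ws package and a userID query parameter (both assumptions):

```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });
const connections = new Map<string, WebSocket>(); // userID -> live socket

wss.on("connection", (socket, request) => {
  // Identify the user, e.g. from a query parameter (auth omitted here).
  const userID = new URL(request.url ?? "/", "http://localhost").searchParams.get("userID");
  if (!userID) return socket.close();
  connections.set(userID, socket); // the user is now "online"
  socket.on("close", () => connections.delete(userID));
});

// Called by the chat service when a new message arrives for a user.
function push(recipientID: string, payload: object): boolean {
  const socket = connections.get(recipientID);
  if (!socket || socket.readyState !== WebSocket.OPEN) {
    return false; // offline: hand off to the notification service instead
  }
  socket.send(JSON.stringify(payload));
  return true;
}
```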
Last seen
To implement the last seen functionality, we can use a heartbeat
mechanism, where the client can periodically ping the servers
indicating its liveness. Since this needs to be as low overhead as
possible, we can store the last active timestamp in the cache as
follows:
| Key | Value |
| --- | --- |
| User A | 2022-07-01T14:32:50 |
| User B | 2022-07-05T05:10:35 |
| User C | 2022-07-10T04:33:25 |
This will give us the last time the user was active. This functionality will
be handled by the presence service combined with Redis or
Memcached as our cache.
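A sketch of what the heartbeat handler could look like, assuming the node-redis client; the key naming scheme and the 30-second TTL are assumptions:

```typescript
import { createClient } from "redis";

const redis = createClient();
await redis.connect();

const LAST_SEEN_TTL = 30; // seconds of inactivity before "offline"

async function onHeartbeat(userID: string): Promise<void> {
  // Overwrite the last-active timestamp; the TTL keeps the cache small.
  await redis.set(`last_seen:${userID}`, new Date().toISOString(), { EX: LAST_SEEN_TTL });
}

async function isOnline(userID: string): Promise<boolean> {
  // If the key expired, the user missed their heartbeats and is offline.
  return (await redis.exists(`last_seen:${userID}`)) === 1;
}
```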
Another way to implement this is to track the latest action of the user; once the last activity crosses a certain threshold, such as "user hasn't performed any action in the last 30 seconds", we can show the user as offline, with last seen set to the last recorded timestamp. This is more of a lazy update approach and might benefit us over the heartbeat mechanism in certain cases.
Notifications
If the recipient is not active, the chat service will add an event to a
message queue with additional metadata such as the client's device
platform which will be used to route the notification to the correct
platform later on.
The notification service will then consume the event from the message
queue and forward the request to Firebase Cloud Messaging (FCM) or
Apple Push Notification Service (APNS) based on the client's device
platform (Android, iOS, web, etc). We can also add support for email
and SMS.
Read receipts
Handling read receipts can be tricky. For this use case, we can wait for some sort of acknowledgment (ACK) from the client to determine if the message was delivered and update the corresponding deliveredAt field. Similarly, we will mark the message as seen once the user opens the chat and update the corresponding seenAt timestamp field.
Design
Now that we have identified some core components, let's do the first
draft of our system design.
Detailed design
Data Partitioning
Hash-Based Partitioning
List-Based Partitioning
Range Based Partitioning
Composite Partitioning
The above approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
Caching
In a messaging application, we have to be careful about using cache, as our users expect the latest data. But since many users will be requesting the same messages, especially in a group chat, we can cache older messages to prevent usage spikes on our resources.
Some group chats can have thousands of messages, and sending all of them over the network would be really inefficient. To improve efficiency, we can add pagination to our system APIs, as sketched below. This decision will also be helpful for users with limited network bandwidth, as they won't have to retrieve old messages unless requested.
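A paginated variant of the messages endpoint might look like this (the limit and cursor parameters are assumptions):

getMessages(userID: UUID, channelID: UUID, limit: int, nextCursor?: string): Message[]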
We can use solutions like Redis or Memcached and cache 20% of the
daily traffic but what kind of cache eviction policy would best fit our
needs?
Least Recently Used (LRU) can be a good policy for our system. In this
policy, we discard the least recently used key first.
Whenever there is a cache miss, our servers can hit the database
directly and update the cache with the new entries.
As we know, most of our storage space will be used for storing media
files such as images, videos, or other files. Our media service will be
handling both access and storage of the user media files.
But where can we store files at scale? Well, object storage is what
we're looking for. Object stores break data files up into pieces called
objects. It then stores those objects in a single repository, which can
be spread out across multiple networked systems. We can also use
distributed file storage such as HDFS or GlusterFS.
Fun fact: WhatsApp deletes media on its servers once it has been downloaded by the user.
We can use object stores like Amazon S3, Azure Blob Storage, or
Google Cloud Storage for this use case.
API gateway
We can use services like Amazon API Gateway or Azure API Management for this use case.
Twitter
Let's design a Twitter-like social media service, similar to services like Facebook, Instagram, etc.
What is Twitter?
Twitter is a social media service where users can read or post short
messages (up to 280 characters) called tweets. It is available on the
web and mobile platforms such as Android and iOS.
Requirements
Functional requirements
Should be able to post new tweets (can be text, image, video, etc.).
Should be able to follow other users.
Should have a newsfeed feature consisting of tweets from the
people the user is following.
Should be able to search tweets.
Non-Functional requirements
Extended requirements
Traffic
1 billion requests per day translate into 12K requests per second.
$$\frac{1 \text{ billion}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 12\text{K requests/second}$$
Storage
Bandwidth
$$\frac{5.1 \text{ TB}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 60 \text{ MB/second}$$
High-level estimate
| Type | Estimate |
| --- | --- |
| Daily active users (DAU) | 100 million |
| Requests per second (RPS) | 12K/s |
| Storage (per day) | ~5.1 TB |
| Storage (10 years) | ~19 PB |
| Bandwidth | ~60 MB/s |
Data model design
users
This table will contain a user's information such as name , email , dob ,
and other details.
tweets
As the name suggests, this table will store tweets and their properties
such as type (text, image, video, etc.), content , etc. We will also store
the corresponding userID .
favorites
This table maps tweets with users for the favorite tweets functionality
in our application.
followers
This table maps the followers and followees as users can follow each
other (N:M relationship).
feeds
feeds_tweets
While our data model seems quite relational, we don't necessarily need
to store everything in a single database, as this can limit our scalability
and quickly become a bottleneck.
API design
Post a tweet
This API will allow the user to post a tweet on the platform.
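A plausible signature (the name and the optional media parameter are assumptions):

postTweet(userID: UUID, content: string, mediaURL?: string): boolean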
Parameters
User ID ( UUID ): ID of the user.
Returns
Follow or unfollow a user
This API will allow the user to follow or unfollow another user.
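A plausible signature (names are assumptions):

follow(followerID: UUID, followeeID: UUID): boolean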
Parameters
Returns
Get newsfeed
This API will return all the tweets to be shown within a given newsfeed.
getNewsfeed(userID: UUID): Tweet[]
Parameters
Returns
High-level design
Architecture
User Service
Newsfeed Service
Tweet Service
The tweet service will handle tweet-related use cases such as posting
a tweet, favorites, etc.
Search Service
Media service
This service will handle the media (images, videos, files, etc.) uploads.
It will be discussed in detail separately.
Notification Service
Analytics Service
This service will be used for metrics and analytics use cases.
Note: Learn more about REST, GraphQL, gRPC and how they compare
with each other.
Newsfeed
Generation
Let's assume we want to generate the feed for user A, we will perform
the following steps:
1. Retrieve the IDs of all the users and entities (hashtags, topics, etc.)
user A follows.
2. Fetch the relevant tweets for each of the retrieved IDs.
3. Use a ranking algorithm to rank the tweets based on parameters
such as relevance, time, engagement, etc.
4. Return the ranked tweets data to the client in a paginated manner.
Publishing
Publishing is the step where the feed data is pushed to each specific user. This can be quite a heavy operation, as a user may have millions of friends or followers. To deal with this, we have three different approaches:
The downside of this approach is that the users will not be able to view
recent feeds unless they "pull" the data from the server, which will
increase the number of read operations on the server.
Hybrid Model
A third approach is a hybrid model between the pull and push model. It
combines the beneficial features of the above two models and tries to
provide a balanced approach between the two.
The hybrid model allows only users with a lesser number of followers to use the push model; for users with a higher number of followers, such as celebrities, the pull model will be used.
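A sketch of the hybrid fan-out decision; the follower threshold and helper functions are assumptions for illustration:

```typescript
interface User { id: string; followerCount: number; }
interface Tweet { id: string; authorID: string; }

declare function getFollowerIDs(userID: string): string[];
declare function appendToFeed(userID: string, tweetID: string): void;

const PUSH_FOLLOWER_LIMIT = 10_000;

function onNewTweet(author: User, tweet: Tweet): void {
  if (author.followerCount < PUSH_FOLLOWER_LIMIT) {
    // Push model: fan out the tweet to every follower's precomputed feed.
    for (const followerID of getFollowerIDs(author.id)) {
      appendToFeed(followerID, tweet.id);
    }
  }
  // Celebrity accounts skip fan-out; their tweets are pulled and merged
  // into followers' feeds at read time instead.
}
```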
Ranking Algorithm
For example, Facebook used to rank feed items with its EdgeRank algorithm:

$$\text{Rank} = \text{Affinity} \times \text{Weight} \times \text{Decay}$$

Where,
Decay : the measure of how long ago the edge was created. The older the edge, the lower the value of decay and, eventually, the rank.
Retweets
| id | userID | type |
| --- | --- | --- |
| ad34-291a-45f6-b36c | 7a2c-62c4-4dc8-b1bb | text |
Search
Notifications
Push notifications are an integral part of any social media platform. We
can use a message queue or a message broker such as Apache Kafka
with the notification service to dispatch requests to Firebase Cloud
Messaging (FCM) or Apple Push Notification Service (APNS) which
will handle the delivery of the push notifications to user devices.
Detailed design
Data Partitioning
Hash-Based Partitioning
List-Based Partitioning
Range Based Partitioning
Composite Partitioning
The above approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
Mutual friends
For mutual friends, we can build a social graph for every user. Each node in the graph will represent a user, and a directional edge will represent followers and followees. After that, we can traverse the followers of a user to find and suggest mutual friends. This would require a graph database such as Neo4j or ArangoDB.
Caching
In a social media application, we have to be careful about using cache, as our users expect the latest data. So, to prevent usage spikes on our resources, we can cache the top 20% of the tweets.
We can use solutions like Redis or Memcached and cache 20% of the
daily traffic but what kind of cache eviction policy would best fit our
needs?
Least Recently Used (LRU) can be a good policy for our system. In this
policy, we discard the least recently used key first.
As we know, most of our storage space will be used for storing media
files such as images, videos, or other files. Our media service will be
handling both access and storage of the user media files.
But where can we store files at scale? Well, object storage is what
we're looking for. Object stores break data files up into pieces called
objects. It then stores those objects in a single repository, which can
be spread out across multiple networked systems. We can also use
distributed file storage such as HDFS or GlusterFS.
Netflix
Let's design a Netflix-like video streaming service, similar to services like Amazon Prime Video, Disney Plus, Hulu, YouTube, Vimeo, etc.
What is Netflix?
Requirements
Functional requirements
Non-Functional requirements
High availability with minimal latency.
High reliability, no uploads should be lost.
The system should be scalable and efficient.
Extended requirements
Traffic
Assuming a 20:1 ratio between videos watched and videos uploaded, about 50 million videos will be uploaded every day:

$$\frac{1}{20} \times 1 \text{ billion} = 50 \text{ million/day}$$
1 billion requests per day translate into 12K requests per second.
$$\frac{1 \text{ billion}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 12\text{K requests/second}$$
Storage
If we assume each video is 100 MB on average, we will require about 5
PB of storage every day.
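Given the roughly 50 million daily uploads estimated above:

$$50 \text{ million} \times 100 \text{ MB} = 5 \text{ PB/day}$$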
Bandwidth
$$\frac{5 \text{ PB}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 58 \text{ GB/second}$$
High-level estimate
| Type | Estimate |
| --- | --- |
| Daily active users (DAU) | 200 million |
| Requests per second (RPS) | 12K/s |
| Storage (per day) | ~5 PB |
| Storage (10 years) | ~18,250 PB |
| Bandwidth | ~58 GB/s |
Data model design
users
This table will contain a user's information such as name , email , dob ,
and other details.
videos
As the name suggests, this table will store videos and their properties
such as title , streamURL , tags , etc. We will also store the
corresponding userID .
tags
views
comments
This table stores all the comments received on a video (like YouTube).
What kind of database should we use?
While our data model seems quite relational, we don't necessarily need
to store everything in a single database, as this can limit our scalability
and quickly become a bottleneck.
API design
Upload a video
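A plausible signature (names and the Stream&lt;byte&gt; pseudo-type are assumptions):

uploadVideo(title: string, description: string, data: Stream&lt;byte&gt;, tags?: string[]): boolean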
Parameters
Returns
Result ( boolean ): Represents whether the operation was successful or
not.
Streaming a video
This API allows our users to stream a video with the preferred codec
and resolution.
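A plausible signature (names and the Tuple&lt;int&gt;/VideoStream pseudo-types are assumptions):

streamVideo(videoID: UUID, codec: string, resolution: Tuple&lt;int&gt;, offset?: int): VideoStream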
Parameters
Offset ( int ): Offset of the video stream in seconds to stream data from
any point in the video (optional).
Returns
Search for a video
This API will enable our users to search for a video based on its title or tags.
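A plausible signature (names are assumptions):

searchVideo(query: string, nextPage?: string): Video[]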
Parameters
Next Page ( string ): Token for the next page; this can be used for pagination (optional).
Returns
Videos ( Video[] ): All the videos available for a particular search query.
Add a comment
This API will allow our users to post a comment on a video (like
YouTube).
Parameters
Returns
High-level design
Architecture
We will be using microservices architecture since it will make it easier
to horizontally scale and decouple our services. Each service will have
ownership of its own data model. Let's try to divide our system into
some core services.
User Service
Stream Service
Search Service
Media service
This service will handle the video uploads and processing. It will be
discussed in detail separately.
Analytics Service
This service will be used for metrics and analytics use cases.
Video processing
File Chunker
This is the first step of our processing pipeline. File chunking is the
process of splitting a file into smaller pieces called chunks. It can help
us eliminate duplicate copies of repeating data on storage, and
reduces the amount of data sent over the network by only selecting
changed chunks.
Usually, a video file can be split into equal-size chunks based on timestamps, but Netflix instead splits chunks based on scenes. This slight variation becomes a huge factor for a better user experience: whenever the client requests a chunk from the server, there is a lower chance of interruption, as a complete scene will be retrieved.
Content Filter
This step checks if the video adheres to the content policy of the platform. This can be pre-approved in the case of Netflix, as per the content rating of the media, or strictly enforced, as on YouTube.
Transcoder
In this step, the media is transcoded into a compressed encoding format optimized for the target devices, which results in a smaller file size. Standalone solutions such as FFmpeg or cloud-based solutions like AWS Elemental MediaConvert can be used to implement this step of the pipeline.
Quality Conversion
This is the last step of the processing pipeline and as the name
suggests, this step handles the conversion of the transcoded media
from the previous step into different resolutions such as 4K, 1440p,
1080p, 720p, etc.
This allows us to fetch the desired quality of the video as per the user's
request, and once the media file finishes processing, it will be
uploaded to a distributed file storage such as HDFS, GlusterFS, or an
object storage such as Amazon S3 for later retrieval during streaming.
Video streaming
Video streaming is a challenging task from both the client and server
perspectives. Moreover, internet connection speeds vary quite a lot
between different users. To make sure users don't re-fetch the same
content, we can use a Content Delivery Network (CDN).
Netflix takes this a step further with its Open Connect program. In this
approach, they partner with thousands of Internet Service Providers
(ISPs) to localize their traffic and deliver their content more efficiently.
Lastly, for playing the video from where the user left off (part of our
extended requirements), we can simply use the offset property we
stored in the views table to retrieve the scene chunk at that particular
timestamp and resume the playback for the user.
Searching
Sharing
Sharing content is an important part of any platform, for this, we can
have some sort of URL shortener service in place that can generate
short URLs for the users to share.
Detailed design
Data Partitioning
To scale out our databases we will need to partition our data.
Horizontal partitioning (aka Sharding) can be a good first step. We can
use partitions schemes such as:
Hash-Based Partitioning
List-Based Partitioning
Range Based Partitioning
Composite Partitioning
The above approaches can still cause uneven data and load
distribution, we can solve this using Consistent hashing.
Geo-blocking
Recommendations
Netflix uses a machine learning model that relies on the user's viewing history to predict what the user might like to watch next; an algorithm like Collaborative Filtering can be used here.
However, Netflix (like YouTube) uses its own algorithm called Netflix
Recommendation Engine which can track several data points such as:
Caching
In a streaming platform, caching is important. We have to be able to
cache as much static media content as possible to improve user
experience. We can use solutions like Redis or Memcached but what
kind of cache eviction policy would best fit our needs?
Least Recently Used (LRU) can be a good policy for our system. In this
policy, we discard the least recently used key first.
Whenever there is a cache miss, our servers can hit the database
directly and update the cache with the new entries.
Most of our storage space will be used for storing media files such as thumbnails and videos. Per our discussion earlier, the media service will handle both the upload and processing of media files.
Uber
Let's design an Uber-like ride-hailing service, similar to services like Lyft, OLA Cabs, etc.
What is Uber?
Requirements
Functional requirements
We will design our system for two types of users: Customers and
Drivers.
Customers
Customers should be able to see all the cabs in the vicinity with an
ETA and pricing information.
Customers should be able to book a cab to a destination.
Customers should be able to see the location of the driver.
Drivers
Non-Functional requirements
High reliability.
High availability with minimal latency.
The system should be scalable and efficient.
Extended requirements
Traffic
Let us assume we have 100 million daily active users (DAU) with 1
million drivers and on average our platform enables 10 million rides
daily.
If each user performs around 10 actions per day on average, that gives us 1 billion requests per day, which translates into 12K requests per second.
$$\frac{1 \text{ billion}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 12\text{K requests/second}$$
Storage
If we assume each message on average is 400 bytes, we will require
about 400 GB of database storage every day.
Bandwidth
$$\frac{400 \text{ GB}}{24 \text{ hrs} \times 3600 \text{ seconds}} \approx 5 \text{ MB/second}$$
High-level estimate
| Type | Estimate |
| --- | --- |
| Daily active users (DAU) | 100 million |
| Requests per second (RPS) | 12K/s |
| Storage (per day) | ~400 GB |
| Storage (10 years) | ~1.4 PB |
| Bandwidth | ~5 MB/s |
Data model design
customers
drivers
This table will contain a driver's information such as name , email , dob , and other details.
trips
This table represents the trip taken by the customer and stores data
such as source , destination , and status of the trip.
cabs
This table stores data such as the registration number, and type (like
Uber Go, Uber XL, etc.) of the cab that the driver will be driving.
ratings
As the name suggests, this table stores the rating and feedback for the
trip.
payments
While our data model seems quite relational, we don't necessarily need
to store everything in a single database, as this can limit our scalability
and quickly become a bottleneck.
API design
Request a Ride
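A plausible signature (names and the Tuple&lt;float&gt; pseudo-type are assumptions):

requestRide(customerID: UUID, source: Tuple&lt;float&gt;, destination: Tuple&lt;float&gt;): Ride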
Parameters
Returns
Parameters
Returns
Returns
Using this API, a driver will be able to start and end the trip.
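A plausible signature (names are assumptions):

startOrEndTrip(driverID: UUID, tripID: UUID): boolean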
Parameters
Returns
Returns
High-level design
Architecture
Customer Service
Driver Service
Ride Service
This service will be responsible for ride matching and quadtree
aggregation. It will be discussed in detail separately.
Trip Service
Payment Service
Notification Service
This service will simply send push notifications to the users. It will be
discussed in detail separately.
Analytics Service
This service will be used for metrics and analytics use cases.
Note: Learn more about REST, GraphQL, gRPC and how they compare
with each other.
Location Tracking
How do we efficiently send and receive live location data from the
client (customers and drivers) to our backend? We have two different
options:
Pull model
Push model
The client opens a long-lived connection with the server and once new
data is available it will be pushed to the client. We can use WebSockets
or Server-Sent Events (SSE) for this.
Ride Matching
SQL
SELECT * FROM locations WHERE lat BETWEEN X-R AND X+R AND long BETWEEN Y-R AND Y+R
Geohashing
Quadtrees
A Quadtree is a tree data structure in which each internal node has
exactly four children. They are often used to partition a two-
dimensional space by recursively subdividing it into four quadrants or
regions. Each child or leaf node stores spatial information. Quadtrees
are the two-dimensional analog of Octrees which are used to partition
three-dimensional space.
A quadtree seems perfect for our use case: we can update the quadtree every time we receive a new location update from the driver. To reduce the load on the quadtree servers, we can use an in-memory datastore such as Redis to cache the latest updates, as sketched below. And with the application of mapping algorithms such as the Hilbert curve, we can perform efficient range queries to find nearby drivers for the customer.
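A sketch of the caching side, assuming the node-redis client and its geospatial commands; in the full design the quadtree servers remain the source of truth:

```typescript
import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// Called on every location update pushed by a driver's app.
async function updateDriverLocation(driverID: string, longitude: number, latitude: number) {
  await redis.geoAdd("drivers", { longitude, latitude, member: driverID });
}

// Find candidate drivers within radiusKm of the customer's location.
async function findNearbyDrivers(longitude: number, latitude: number, radiusKm: number) {
  return redis.geoSearch("drivers", { longitude, latitude }, { radius: radiusKm, unit: "km" });
}
```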
What about race conditions?
For more details, learn how surge pricing works with Uber.
Payments
Notifications
Detailed design
Data Partitioning
Caching
Least Recently Used (LRU) can be a good policy for our system. In this
policy, we discard the least recently used key first.
Whenever there is a cache miss, our servers can hit the database
directly and update the cache with the new entries.
Next Steps
Congratulations, you've finished the course!
Now that you know the fundamentals of System Design, here are some
additional resources:
Microsoft Engineering
Google Research Blog
Netflix Tech Blog
AWS Blog
Facebook Engineering
Uber Engineering Blog
Airbnb Engineering
GitHub Engineering Blog
Intel Software Blog
LinkedIn Engineering
Paypal Developer Blog
Twitter Engineering
Last but not least, volunteer for new projects at your company, and
learn from senior engineers and architects to further improve your
system design skills.
I hope this course was a great learning experience. I would love to hear
feedback from you.
References
Here are the resources that were referenced while creating this course.
All the diagrams were made using Excalidraw and are available here.