Unit 2 Handouts
Unit 2
NoSQL
Introduction to NoSQL
• NoSQL is a type of database management system (DBMS) that is
designed to handle and store large volumes of unstructured and
semi-structured data.
6/7/2024
Key-Value Stores
• This is the first category of NoSQL database. Key-value stores have a simple data model, which allows clients to put entries into a map/dictionary and request the value for a key. In key-value storage, each key has to be unique to provide non-ambiguous identification of values.
• A value, which can be basically any piece of data or information, is stored with a key that identifies its location.
• In fact, this is a design concept that exists in every piece of programming as an array or map object. The difference here is that it's stored persistently in a database management system.
• Keys are mapped to (possibly) more complex values (e.g., lists).
• Keys can be stored in a hash table and can be distributed easily.
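The put/get interface described above can be sketched in a few lines of Python. This is an illustrative in-memory stand-in, not the API of any particular product; real stores add persistence, replication, and distribution on top of exactly this shape.

```python
# Minimal sketch of the key-value model: unique keys, opaque values,
# hash-table lookup. Values can be any piece of data (here, a dict).
class KeyValueStore:
    def __init__(self):
        self._data = {}              # keys are unique; O(1) hash lookup

    def put(self, key, value):
        self._data[key] = value      # overwrite silently if key exists

    def get(self, key):
        return self._data.get(key)   # None if absent

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1", {"name": "Alice", "interests": ["chess", "football"]})
print(store.get("user:1")["name"])   # Alice
```

Because the store never interprets the value, any richer query (e.g., "all users interested in chess") must be handled by the application, which is the fundamental trade-off of this model.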
Column Family Stores (e.g., HBase and Vertica)
• Columnar databases are a hybrid of RDBMSs and key-value stores.
• Values are stored in groups of zero or more columns in column order (as opposed to row order).
The structure of a column-store database:
• Column-store databases use a concept called a keyspace. A keyspace is like a schema in the relational model: it contains all the column families (like tables in the relational model).
• A column family consists of multiple rows, and each row contains its own set of columns.
• Each row can contain a different number of columns, and the columns don't have to match the columns in the other rows (i.e., they can have different column names, data types, etc.).
• Each column is related to its row; it doesn't span all rows as in a relational database. Each column contains a name/value pair, along with a timestamp.
Here's how each row is constructed:
• Row key: each row has a unique key, which is a unique identifier for that row.
• Column: each column contains a name, a value, and a timestamp.
• Name: the name of the name/value pair.
• Value: the value of the name/value pair.
• Timestamp: the date and time the data was inserted. This can be used to determine the most recent version of the data.
Figure: A column-family store containing 3 rows, representing customer information in a column-family structure.
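The row structure described above can be sketched as plain data: a keyspace holds column families, each row is addressed by a row key, and each column is a (name, value, timestamp) triple. The names and data below are illustrative, not taken from any particular database.

```python
# Sketch of the column-family row layout: rows in the same family may
# carry different columns, and timestamps pick the most recent version.
import time

def make_column(name, value):
    return {"name": name, "value": value, "timestamp": time.time()}

keyspace = {
    "customers": {                                       # a column family
        "row-1": [make_column("name", "Alice"), make_column("age", 30)],
        "row-2": [make_column("name", "Bob")],           # fewer columns: fine
    }
}

def latest(row_key, name):
    """Most recent value for a column name in the 'customers' family."""
    columns = keyspace["customers"][row_key]
    matches = [c for c in columns if c["name"] == name]
    return max(matches, key=lambda c: c["timestamp"])["value"]

print(latest("row-1", "name"))   # Alice
```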
Graph Databases
• Data are represented as vertices and edges.
Example: We have a social network in which five friends are all connected. These friends are Anay, Bhagya, Chaitanya, Dilip, and Erica. A graph database that stores their personal information keeps one vertex per user, each carrying attributes such as an id, a name, and an age, with "friend of" relationship edges connecting the users.
Figure: a graph with user vertices (Id: 1, Name: Alice, Age: 18; Id: 2, Name: Bob, Age: 22), a group vertex (Id: 3, Name: Chess, Type: Group), and "friend of" relationship edges.
In a relational design, another table is required to capture the friendship/relationship between users/friends.
Assume that our social network has a feature that allows every user to see the personal information of his/her friends. So, if Chaitanya were requesting information, she would need information about Anay, Bhagya, Dilip, and Erica. Approaching this problem the traditional way (with a relational database), we must first identify Chaitanya's id in the Users table, then look for all tuples in the friendship table where the user_id is 3. The resulting relation lists her friends' ids, which must then be matched back against the Users table.
Graph databases organize data into node-and-edge graphs; they work best for data that has complex relationship structures.

Aggregate Data Models
• The term aggregate means a collection of objects that we treat as a unit. An aggregate is a collection of data that we interact with as a unit.
• These units of data, or aggregates, form the boundaries for ACID operations.
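Returning to the graph example: the friend lookup for Chaitanya can be sketched as a direct adjacency-list traversal, with no join step. This is a minimal sketch of the idea; real graph databases expose it through query languages rather than raw dictionaries.

```python
# Sketch of the social-network graph: vertices carry user attributes,
# "friend of" edges are adjacency lists. Ids follow the handout's
# example, where Chaitanya has id 3.
users = {
    1: {"name": "Anay"},
    2: {"name": "Bhagya"},
    3: {"name": "Chaitanya"},
    4: {"name": "Dilip"},
    5: {"name": "Erica"},
}

# Undirected "friend of" edges: everyone is connected to Chaitanya.
friends = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3], 5: [3]}

def friend_names(user_id):
    """Follow edges from a vertex; one hop replaces the relational join."""
    return [users[f]["name"] for f in friends[user_id]]

print(friend_names(3))   # ['Anay', 'Bhagya', 'Dilip', 'Erica']
```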
Using the above data model, an example Customer and Order would look like this:

// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"street": "XYZ", "city": "Trichy", "state": "TamilNadu", "postcode": 620012}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft", "billingAddress": {"city": "Chicago"}}]
      }
    ]
  }
}

In this model, we have two main aggregates: customer and order. The black-diamond composition marker in UML is used to show how data fit into the aggregation structure. The customer contains a list of billing addresses; the order contains a list of order items, a shipping address, and the order payment.
How does a schemaless database work?
• In schemaless databases, information is stored in JSON-style documents which can have varying sets of fields, with different data types for each field. So, a collection could look like this:

{ name: "abc", age: 30, interest: "football" }
{ name: "xyz", age: 25 }

What are the benefits of using a schemaless database?
• Greater flexibility over data types
• No pre-defined database schemas
• No data truncation
• Suitable for real-time analytics functions
• Enhanced scalability and flexibility
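The two documents above can be stored side by side even though their fields differ. The sketch below uses a plain Python list as a stand-in for a document collection; it is illustrative only, not the API of any specific document store.

```python
# Sketch of a schemaless collection: documents with different field
# sets coexist, and queries must tolerate absent fields.
collection = []

def insert(doc):
    collection.append(doc)     # no schema check, no truncation

insert({"name": "abc", "age": 30, "interest": "football"})
insert({"name": "xyz", "age": 25})            # missing 'interest' is fine

# dict.get returns None for absent fields instead of raising an error:
fans = [d["name"] for d in collection if d.get("interest") == "football"]
print(fans)   # ['abc']
```

The flexibility is real but shifts responsibility to the application: every reader must decide what a missing field means.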
Sharding
Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
* Sharding is the process of splitting a large dataset into many small portions which are placed on different machines. Each portion is known as a 'shard'.
* Each shard has the same database schema as the original database.
* Data is distributed such that each row appears in exactly one shard.
* The combined data from all shards is the same as the original database.
* Sharding helps in balancing out the load between servers; for example, if we have five servers, each one has to handle only 20% of the load.
* NoSQL databases are designed to support automatic distribution of data and queries across multiple servers located in different geographic regions. This permits rapid, automatic, and transparent replacement of data without any disruption.
Figure: Sharding puts different data on separate nodes, each of which does its own reads and writes.

We have to ensure that data that are accessed together are clumped together on the same node to provide the best data access. Aggregate orientation helps in achieving this: aggregates combine data that are commonly accessed together, so aggregates can be used as the unit of distribution.
Factors that can help in arranging the data on the nodes so as to improve performance:
• Placing data close to where it's being accessed.
• Keeping the load even. This means that the aggregates must be evenly distributed across the nodes such that all nodes get equal amounts of the load.
• In some cases, it is useful to put aggregates together if they may be read in sequence. Bigtable, for example, keeps its rows in lexicographic order.

Replication takes the same data and copies it over multiple nodes. Replication comes in two forms: master-slave replication and peer-to-peer replication.

Auto-sharding:
• If sharding is done as part of application logic, it complicates the programming model, as application code needs to ensure that queries are distributed across the various shards.
• Furthermore, to rebalance the sharding, the application code must be changed and the data must be migrated.
• To overcome these problems, many NoSQL databases offer auto-sharding.
• With auto-sharding, the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard. This can make it much easier to use sharding in an application.
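The "each row appears in exactly one shard" property can be sketched with hash-based routing: a key deterministically maps to one of N shards. This fixed-modulo scheme is the simplest possible form, assumed here for illustration; production systems typically use consistent hashing so that rebalancing moves less data.

```python
# Sketch of hash-based sharding: five in-memory dicts stand in for
# five servers, and every key routes to exactly one of them.
import hashlib

NUM_SHARDS = 5
shards = [dict() for _ in range(NUM_SHARDS)]   # one dict per server

def shard_for(key):
    # md5 gives a stable hash across processes (unlike Python's hash()).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("customer:1", {"name": "Martin"})
print(get("customer:1"))   # {'name': 'Martin'}
```

Note the weakness auto-sharding addresses: if this routing lives in application code, changing NUM_SHARDS changes where keys land, forcing a manual data migration.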
d) Sharding does not improve resilience when used alone. Although the data is on different nodes, a node failure makes that shard's data unavailable. The resilience benefit it provides is that only the users of the data on that shard will suffer; still, it is not good to have a database with part of its data missing.

Master-Slave Replication
• Master-slave replication helps with read scalability but doesn't help scalability of writes. It provides resilience against failure of a slave, but not of the master. The master is a bottleneck and a single point of failure.
Figure: Data is replicated from master to slaves. The master services all writes; reads may come from either master or slaves.

Peer-to-Peer Replication
• A peer-to-peer replication cluster offers tolerance to node failures without losing access to data.
• We can easily add nodes to improve performance.
• Peer-to-peer replication (see figure) solves these problems by not having a master.
• All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't prevent access to the data store.
• The biggest complication is consistency. When two people attempt to update the same record at the same time, a write-write conflict occurs.
• Inconsistencies on read lead to problems, but at least they are relatively transient. Inconsistent writes are forever.
• Two solutions to the write-write conflict:
  1. We can ensure that whenever we write data, the replicas coordinate to ensure that a conflict is avoided. This can give us as strong a guarantee as master-slave replication, but at the cost of network traffic to coordinate the writes. We don't need all the replicas to agree on the write, just a majority.
  2. We can decide to cope with an inconsistent write using some policy. Here we trade off consistency for availability.
Figure: Peer-to-peer replication has all nodes applying reads and writes to all the data.
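The majority coordination in solution 1 can be sketched as a quorum write: the write commits only if a majority of replicas acknowledge it, so any two committed writes overlap in at least one replica where the conflict can be detected. Replicas are stubbed as in-memory dicts; which replicas are reachable is passed in explicitly for illustration.

```python
# Sketch of majority (quorum) writes over five stubbed replicas.
REPLICAS = [dict() for _ in range(5)]
MAJORITY = len(REPLICAS) // 2 + 1      # 3 of 5

def quorum_write(key, value, up):
    """Write to the reachable replicas ('up' = their indices); commit
    only if a majority acknowledged. A real system must also repair or
    roll back writes left on a rejected minority."""
    acks = 0
    for i in up:
        REPLICAS[i][key] = value
        acks += 1
    return acks >= MAJORITY

print(quorum_write("room:101", "booked", up=[0, 1, 2]))   # True  (3 >= 3)
print(quorum_write("room:101", "free", up=[3, 4]))        # False (2 < 3)
```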
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination of them.
Relational databases offer strong consistency, whereas NoSQL systems mostly provide eventual consistency.

Various forms of consistency:
1. Update consistency (write-write conflict)
   Solutions: 1. Pessimistic approach (write locks) 2. Optimistic approach (conditional updates; save both updates and record the conflict)
2. Read consistency (read-write conflict)
   Solution: eventual consistency (allows a certain degree of inconsistency between replicas)
   Problem: session consistency (inconsistencies must not occur within a user's own writes within a session)
   Solutions: 1. Sticky sessions 2. Version stamps

1. Update Consistency (write-write conflict):
Martin and Pramod are looking at the company website and notice that the phone number is outdated. They both have update access, so they both go in at the same time to update the number. We'll assume they update it slightly differently, because each uses a slightly different format. This issue is called a write-write conflict: two people updating the same data item at the same time.
When the writes reach the server, the server will serialize them, i.e., decide to apply one, then the other. Let's assume it uses alphabetical order and picks Martin's update first, then Pramod's. Without any concurrency control, Martin's update would be applied and immediately overwritten by Pramod's. In this case Martin's is a lost update. This is a failure of consistency.
• We refer to this type of consistency as logical consistency: ensuring that different data items make sense together.
• To avoid a logically inconsistent read-write conflict, relational databases support the notion of transactions.
• If Martin wraps his two writes in a transaction, the system guarantees that Pramod will either read both data items before the update or both after the update.
• Lack of transactions applies to only some NoSQL databases, in particular the aggregate-oriented ones. In contrast, graph databases tend to support ACID transactions just the same as relational databases.
• Aggregate-oriented databases support atomic updates, but only within a single aggregate. This means that we will have logical consistency within an aggregate but not between aggregates. So in the example, we could avoid inconsistency if the order, the delivery charge, and the line items are all part of a single order aggregate.

Replication Consistency
• Ensuring that the same data item has the same value when read from different replicas is called replication consistency.
• Example of a breach of replication consistency: Let's imagine there's one last hotel room for a desirable event. The hotel reservation system runs on many nodes. Martin and Cindy are considering this room, but they are discussing it on the phone because Martin is in London and Cindy is in Boston. Meanwhile Pramod, who is in Mumbai, goes and books that last room. That updates the replicated room availability, but the update gets to Boston quicker than it gets to London. When Martin and Cindy open their browsers to see if the room is available, Cindy sees it booked and Martin sees it free. This is replication inconsistency (Figure: an example of replication inconsistency).
• Eventually, the updates will propagate fully, and Martin will see the room is booked. Therefore this situation is generally referred to as eventually consistent, meaning that at any time nodes may have replication inconsistencies but, if there are no further updates, eventually all nodes will be updated to the same value.
• The length of time an inconsistency is present is called the inconsistency window. A NoSQL system may have a quite short inconsistency window; the inconsistency window for Amazon's SimpleDB service is usually less than a second.
Session Consistency
Inconsistency windows can be problematic when users get inconsistencies with their own writes.
Consider the example of posting comments on a blog entry. Inconsistency windows of even a few minutes can't be tolerated while people are typing in their latest thoughts. Systems handle the load of such sites by running on a cluster and load-balancing incoming requests to different nodes. Therein lies a danger: we may post a message using one node, then refresh our browser, but the refresh goes to a different node which hasn't received our post yet, and it looks like our post was lost.
In situations like this, we can tolerate reasonably long inconsistency windows, but we need read-your-writes consistency, which means that once we've made an update, we're guaranteed to continue seeing that update.
One way to get this is to provide session consistency: within a user's session, it is necessary to provide read-your-writes consistency.
Two techniques to provide session consistency:
i) Sticky sessions: a sticky session is a session that's tied to one node (this is also called session affinity). A sticky session allows us to ensure that as long as we keep read-your-writes consistency on a node, we'll get it for sessions too. The downside is that sticky sessions reduce the ability of the load balancer to do its job.
ii) Version stamps: every interaction with the data store includes the latest version stamp seen by a session. The server node must then ensure that it has the updates that include that version stamp before responding to a request.

Relaxing Consistency
Sometimes we have to sacrifice consistency. It is not possible to design a system that avoids inconsistencies without making sacrifices in other characteristics of the system. As a result, we often have to trade off consistency for something else, like availability and partition tolerance.
The CAP Theorem: The basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, we can only get two.
• Consistency: all people see the same data at the same time.
• Availability: if we can communicate with a node in the cluster, we should be able to read and write data.
• Partition tolerance: the cluster can survive communication breakages that separate the cluster into partitions that are unable to communicate with each other.
The CAP theorem states that if we get a network partition, we have to trade off availability (A) of data versus consistency (C). Very large systems will partition at some point. That leaves either C or A to choose from (a traditional DBMS prefers C over A and P). In almost all cases, for systems that use distribution models we would choose A over C.
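The version-stamp technique in (ii) can be sketched as follows: the client echoes the highest stamp it has seen, and a lagging replica declines to serve the read (returning None here) rather than show the session stale data. The class and method names are illustrative.

```python
# Sketch of session consistency via version stamps (technique ii).
class Replica:
    def __init__(self):
        self.value, self.stamp = None, 0

    def write(self, value):
        self.stamp += 1
        self.value = value
        return self.stamp            # the session remembers this stamp

    def read(self, min_stamp):
        if self.stamp < min_stamp:
            return None              # lagging replica: refuse stale read
        return self.value

node_a, node_b = Replica(), Replica()
seen = node_a.write("my comment")    # post lands on node A (stamp 1)
print(node_a.read(seen))             # 'my comment' (read-your-writes holds)
print(node_b.read(seen))             # None (node B hasn't caught up yet)
```

In the blog example, the refresh that hits node B would be retried or redirected instead of silently showing the post as lost.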
• Availability + Partition tolerance forfeits Consistency.
• Consistency + Partition tolerance entails that one side of the partition must act as if it is unavailable, thus forfeiting Availability.
• Consistency + Availability is only possible if there is no network partition, thereby forfeiting Partition tolerance.
When horizontally scaling databases to thousands of machines, the likelihood of a node or network failure increases tremendously. Therefore, in order to have strong guarantees on availability and partition tolerance, such systems had to sacrifice "strict" consistency (as implied by the CAP theorem).
DATA MODEL