Unit 2 Handouts

6/7/2024

Unit 2
NoSQL
Introduction to NoSQL
• NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data.

• The term NoSQL originally referred to "non-SQL" or "non-relational" databases, but the term has evolved to mean "not only SQL," as NoSQL databases have expanded to include a wide range of different database architectures and data models.

• NoSQL databases use flexible data models that can adapt to changes in data structures, unlike traditional relational databases that use tables with pre-defined schemas to store data.

• NoSQL databases are capable of scaling horizontally to handle growing amounts of data.

Why are NoSQL Databases Interesting? / Why should we use NoSQL? / When to use NoSQL?
• Application development productivity: A lot of application development effort is spent on mapping data between in-memory data structures and a relational database. A NoSQL database may provide a data model that better fits the application's needs, thus simplifying that interaction and resulting in less code to write, debug, and evolve.
• Large data: Organizations are finding it valuable to capture more data and process it more quickly; they are finding it expensive, if even possible, to do so with relational databases.
• Analytics: Well suited to performing analytical queries.
• Scalability
• Massive write performance
• Flexible data model and flexible datatypes
• Schema migration: Schemalessness makes it easier to deal with schema migrations without so much worrying.
• Write availability: Writes need to succeed no matter what.
• Easier maintainability, administration, and operations: This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them.
• No single point of failure
• Generally available parallel computing
• Programmer ease of use

Benefits of NoSQL
"Not only SQL" (NoSQL) databases were designed to fill the gaps left by relational databases. Consider the core characteristics of a NoSQL database:
• Schema-less/dynamic schema with no complex relationships
• Distributed, replicating data to avoid a single point of failure
• Flexible storage of both unstructured and semi-structured data
• Highly scalable no matter how much data is entered


Types/Categories of NoSQL Databases
NoSQL databases are generally classified into four main categories:
1. Key-value stores: These databases store data as key-value pairs, and are optimized for simple and fast read/write operations. E.g., Amazon DynamoDB, BerkeleyDB, Aerospike, Couchbase, Riak, Memcached.
2. Document databases: These databases store data as semi-structured documents, such as JSON or XML, and can be queried using document-oriented query languages. E.g., MongoDB, CouchDB, Elasticsearch, DynamoDB.
3. Column-family stores: These databases store data as column families, which are sets of columns that are treated as a single entity. They are optimized for fast and efficient querying of large amounts of data. E.g., HBase, Cassandra, Vertica, Bigtable.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle complex relationships between data. E.g., Neo4j, FlockDB, VertexDB, ArangoDB.

Key-Value Stores
• This is the first category of NoSQL database. Key-value stores have a simple data model, which allows clients to put a value and request the value per key. In key-value storage, each key has to be unique to provide non-ambiguous identification of values.
• A value, which can be basically any piece of data or information, is stored with a key that identifies its location.
• In fact, this is a design concept that exists in every piece of programming as an array or map object. The difference here is that it is stored persistently in a database management system.
• Keys are mapped to (possibly) more complex values (e.g., lists).
• Keys can be stored in a hash table and can be distributed easily.
• Such stores typically support regular CRUD (create, read, update, and delete) operations.
• No joins and aggregate functions.
• E.g., Amazon DynamoDB and Apache Cassandra.
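The CRUD operations described above can be sketched with an in-memory, dictionary-backed store. This is a simplified stand-in for a real key-value database such as DynamoDB; the class and method names are illustrative, not any product's API:

```python
class KeyValueStore:
    """A minimal in-memory key-value store supporting CRUD operations."""

    def __init__(self):
        self._data = {}  # keys hash to values, as in a hash table

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key {key!r} already exists")  # keys must be unique
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)  # returns None if the key is absent

    def update(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.create("user:1", {"name": "Anay", "age": 21})  # the value can be any object
store.update("user:1", {"name": "Anay", "age": 22})
print(store.read("user:1"))
store.delete("user:1")
```

Note that there are no joins or aggregate functions here: the store can only look values up by key, which is exactly what keeps it simple and easy to distribute.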

Document Stores
• Documents are stored in some standard format or encoding (e.g., XML, JSON, PDF, or Office documents).
• These are typically referred to as Binary Large Objects (BLOBs).
• Documents can be indexed.
• This allows document stores to outperform traditional file systems.
• E.g., MongoDB and CouchDB.

Relational table vs. document store: in a relational table with four columns, it would be necessary to alter the table schema if we wanted a fifth column, or if we wanted to change the maximum length of the name column, or if we wanted to allow nulls in date-of-birth. But because document databases are schema-free, they aren't subject to these constraints. This makes them ideal when we have a rapidly evolving schema, as is usually the case in software development today.
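The schema-free property can be sketched as a toy document collection in plain Python. This is illustrative only; a real document database such as MongoDB adds persistence, indexing, and a richer query language:

```python
# A toy document collection: schema-free, so documents may have different fields.
people = [
    {"name": "abc", "age": 30, "interest": "football"},
    {"name": "xyz", "age": 25},                         # no 'interest' field: no schema change needed
    {"name": "pqr", "age": 41, "date_of_birth": None},  # nulls allowed without altering a schema
]

def find(collection, **criteria):
    """Return documents whose fields match all given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

print(find(people, age=25))  # [{'name': 'xyz', 'age': 25}]
```

Adding a new field to one document required no ALTER TABLE; documents that lack the field simply don't match queries on it.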


Column-Family Stores
• Columnar databases are a hybrid of RDBMSs and key-value stores.
• Values are stored in groups of zero or more columns in column-order (as opposed to row-order).
• The structure of a column-store database: column-store databases use a concept called a keyspace. A keyspace is like a schema in the relational model. The keyspace contains all the column families (like tables in the relational model).
• E.g., HBase and Vertica.

A column family consists of multiple rows:
• Each row contains its own set of columns.
• Each row can contain a different number of columns, and the columns don't have to match the columns in the other rows (i.e., they can have different column names, data types, etc.).
• Each column is related to its row. It doesn't span all rows like in a relational database. Each column contains a name/value pair, along with a timestamp.

Here's how each row is constructed:
• Row key: each row has a unique key, which is a unique identifier for that row.
• Column: each column contains a name, a value, and a timestamp.
• Name: this is the name of the name/value pair.
• Value: this is the value of the name/value pair.
• Timestamp: this provides the date and time that the data was inserted. It can be used to determine the most recent version of the data.

(Figures: a column-family store containing 3 rows; representing customer information in a column-family structure.)

Some DBMSs expand on the column-family concept to provide extra functionality/storage ability. For example, Cassandra has the concept of composite columns, which allow you to nest objects inside a column.
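The keyspace → column family → row key → column structure described above can be sketched with nested dictionaries. This is a toy illustration of the layout only, with assumed names; a real column-family store adds disk storage, distribution, and typed columns:

```python
import time

# Toy column-family layout: keyspace -> column family -> row key -> columns.
# Each column is a name/value pair plus a timestamp, and rows need not share columns.
keyspace = {
    "customers": {                       # a column family (like a table)
        "row-1": {"name": ("Anay", time.time()),
                  "city": ("Trichy", time.time())},
        "row-2": {"name": ("Erica", time.time()),
                  "email": ("e@example.com", time.time())},  # different columns than row-1
    }
}

def put(cf, row_key, column, value):
    """Insert or overwrite one column; the timestamp marks the latest version."""
    keyspace[cf].setdefault(row_key, {})[column] = (value, time.time())

def get(cf, row_key, column):
    value, _ts = keyspace[cf][row_key][column]
    return value

put("customers", "row-1", "city", "Chennai")
print(get("customers", "row-1", "city"))  # Chennai
```

Notice that row-1 and row-2 have different column sets, and each column value carries its own timestamp, just as described above.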

Wide Column Stores/Super Column Family

Graph Databases
• In a graph database, each node is a record and each arc is a relationship between two nodes.
• Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.
• Graph databases offer high performance for data models with complex relationships, such as a social network.
• Many graph databases can only be accessed with REST APIs.
• E.g., Neo4j, FlockDB, ArangoDB, VertexDB.
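The node/arc model can be sketched in plain Python: nodes carry properties and edges are (from, relation, to) triples. This is an illustrative toy, not a real graph database API such as Neo4j's:

```python
# Nodes with properties; edges as (source, relation, target) triples.
nodes = {
    1: {"name": "Anay"}, 2: {"name": "Bhagya"}, 3: {"name": "Chaitanya"},
    4: {"name": "Dilip"}, 5: {"name": "Erica"},
}
edges = [(1, "friend_of", 3), (2, "friend_of", 3),
         (3, "friend_of", 4), (3, "friend_of", 5)]

def friends_of(node_id):
    """Follow 'friend_of' edges in both directions; no join tables required."""
    out = {dst for src, rel, dst in edges if rel == "friend_of" and src == node_id}
    inc = {src for src, rel, dst in edges if rel == "friend_of" and dst == node_id}
    return sorted(nodes[n]["name"] for n in out | inc)

print(friends_of(3))  # ['Anay', 'Bhagya', 'Dilip', 'Erica']
```

The traversal walks edges directly rather than joining tables, which is what gives graph databases their performance advantage on relationship-heavy queries.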


Graph Databases (example)
• Data are represented as vertices and edges.
(Figure: vertices such as Id: 1, Name: Alice, Age: 18; Id: 2, Name: Bob, Age: 22; Id: 3, Name: Chess, Type: Group; connected by "friend of" relationship edges.)

Example: We have a social network in which five friends are all connected. These friends are Anay, Bhagya, Chaitanya, Dilip, and Erica. A graph database that stores their personal information holds each person as a node and each friendship as an edge.

Assume that our social network has a feature that allows every user to see the personal information of his/her friends. So, if Chaitanya were requesting information, she would need information about Anay, Bhagya, Dilip, and Erica. Approaching this problem the traditional way (relational database), we must first identify Chaitanya's id in the users table. Another table is required to capture the friendship/relationship between users/friends. We would then look for all tuples in the friendship table where the user_id is 3 (Chaitanya's id); the resulting relation lists her friends.

Graph databases organize data into node and edge graphs; they work best for data that has complex relationship structures.

Aggregate Data Models
• The term aggregate means a collection of objects that we treat as a unit. An aggregate is a collection of data that we interact with as a unit.
• These units of data, or aggregates, form the boundaries for ACID operations.

Using the above data model, an example customer and order would look like this:

In this model, we have two main aggregates: customer and order. The black-diamond composition marker in UML is used to show how data fit into the aggregation structure. The customer contains a list of billing addresses; the order contains a list of order items, a shipping address, and payments. The payment itself contains a billing address for that payment.

// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"street": "XYZ", "city": "Trichy", "State": "TamilNadu", "Postcode": 620012}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [{"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [{"ccinfo": "1000-1000-1000-1000", "txnId": "abelif879rft", "billingAddress": {"city": "Chicago"}}]
      }
    ]
  }
}
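The customer aggregate can be sketched as a single unit of storage: the whole nested document is saved and loaded in one operation. This is a minimal Python illustration; the key naming scheme ("customer:1") is an assumption for the example:

```python
import json

# The customer aggregate, held and manipulated as a single unit. Updating
# anything inside it (an order, an address) is one write of the whole
# aggregate, which is why the aggregate is the natural unit of distribution.
customer = {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Trichy"}],
    "orders": [
        {"id": 99,
         "orderItems": [{"productId": 27, "price": 32.45}],
         "shippingAddress": [{"city": "Chicago"}]}
    ],
}

def save(store, aggregate):
    # One key, one aggregate: the whole document is written in a single operation.
    store[f"customer:{aggregate['id']}"] = json.dumps(aggregate)

store = {}
save(store, customer)
loaded = json.loads(store["customer:1"])
print(loaded["orders"][0]["orderItems"][0]["price"])  # 32.45
```

Everything inside the aggregate boundary travels together, so intra-aggregate updates are atomic while cross-aggregate updates must be coordinated by the application.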


The diagram has two aggregates, Customer and Order:
• The link between them represents the relationship between the aggregates.
• The diamond shows how data fit into the aggregate structure.
• Customer contains a list of billing addresses.
• Payment also contains the billing address.
• The address appears three times, and it is copied each time.
• This fits the domain where we don't want a shipping or billing address to change after the fact.

Consequences of aggregate orientation:
• Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships. They support atomic manipulation of a single aggregate at a time. This means that if we need to manipulate multiple aggregates in an atomic way, we have to manage that ourselves in the application code.
• The reason for aggregate orientation is that it helps greatly with running on a cluster, which as you'll remember is the killer argument for the rise of NoSQL. If we're running on a cluster, then by explicitly including aggregates we give the database important information about which bits of data will be manipulated together, and thus should live on the same node.
• Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.

Schemaless Databases
Traditional relational databases are well-defined, using a schema to describe every functional element, including tables, rows, views, indexes, and relationships. In a SQL database, the schema is enforced by the Relational Database Management System (RDBMS) whenever data is written to disk.

But in order to work, data needs to be heavily formatted and shaped to fit into the table structure. This means sacrificing any undefined details during the save, or storing valuable information outside the database entirely.

A schemaless database, like MongoDB, does not have these constraints. Each document is created with a partial schema to aid retrieval. Any formal schema is applied in the code of applications; this layer of abstraction protects the raw data in the NoSQL database and allows for rapid transformation as needs change.

Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. Using the right tools in the form of a schemaless database can unlock the value of all structured and unstructured data types.

How does a schemaless database work?
In schemaless databases, information is stored in JSON-style documents which can have varying sets of fields with different data types for each field. So, a collection could look like this:

{ "name": "abc", "age": 30, "interest": "football" }
{ "name": "xyz", "age": 25 }

What are the benefits of using a schemaless database?
• Greater flexibility over data types
• No pre-defined database schemas
• No data truncation
• Suitable for real-time analytics functions
• Enhanced scalability and flexibility

Materialized Views
• A view is like a table, but it is defined by computation over the base tables. When we access a view, the database computes the data in the view, a form of encapsulation.
• Views provide a mechanism to hide from the client whether data is derived data or base data.
• Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations. Materialized views are stored on disk.

A view is a virtual table that is based on the result of a SELECT query. It does not store the data itself; instead, it provides a way to represent the result of a query as if it were a table. A materialized view, on the other hand, is a physical copy or snapshot of the result set of a query.
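The map-reduce style computation of a materialized view can be sketched as follows. This is an illustrative example under assumed data; the function and field names are made up for the sketch:

```python
from collections import defaultdict

# Base data: order aggregates.
orders = [
    {"id": 99,  "city": "Chicago", "amount": 32.45},
    {"id": 100, "city": "Chicago", "amount": 10.00},
    {"id": 101, "city": "Trichy",  "amount": 5.50},
]

def materialize_revenue_by_city(base):
    """Map each order to a (city, amount) pair, then reduce by summing per city."""
    view = defaultdict(float)
    for order in base:                          # "map" step: emit (city, amount)
        view[order["city"]] += order["amount"]  # "reduce" step: sum per key
    return dict(view)

# The materialized view is computed once and stored; reads hit this stored
# copy instead of recomputing over the base data (unlike a plain view, which
# is recomputed on every access).
revenue_by_city = materialize_revenue_by_city(orders)
print(revenue_by_city)
```

The trade-off matches the table below: reads of the stored copy are fast, but the view must be refreshed (manually, by triggers, or periodically) when the base data changes.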


View vs. Materialized View
• A view is never stored; it is only displayed. A materialized view is stored on disk.
• A view is a virtual table formed from one or more base tables or views. A materialized view is a physical copy of the base table.
• If the main table is dropped, the view becomes inaccessible. Even if the main table is dropped, the materialized view remains accessible.
• A view is recomputed each time the virtual table (view) is used. A materialized view has to be updated manually or using triggers.
• Views: slow processing. Materialized views: fast processing.
• Views do not require storage space. Materialized views utilize storage space.

Scaling Traditional Databases
Traditional RDBMSs can be scaled either:
• Vertically (or up)
  - Can be achieved by hardware upgrades (e.g., faster CPU, more memory, or larger disk)
  - Limited by the amount of CPU, RAM, and disk that can be configured on a single machine
• Horizontally (or out)
  - Can be achieved by adding more machines
  - Requires database sharding and replication
  - Limited by the read-to-write ratio and communication overhead

Distribution Models
• The primary driver of interest in NoSQL has been its ability to run databases on a large cluster.
• As data volumes increase, it becomes more difficult and expensive to scale up (expensive to buy a bigger server to run the database on).
• A more appealing option is to scale out, i.e., run the database on a cluster of servers.
• Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
• There are two paths to data distribution: replication and sharding.
• Replication takes the same data and copies it over multiple nodes.
• Sharding puts different data on different nodes.
• We can use either or both of them.
• Replication comes in two forms: master-slave replication and peer-to-peer replication.

Sharding
• Often, a data store is busy because different people are accessing different parts of the dataset.
• In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers, a technique that's called sharding.
• This allows larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity.
(Figure: Sharding puts different data on separate nodes, each of which does its own reads and writes.)
• Sharding is the process of splitting a large dataset into many small portions which are placed on different machines. Each portion is known as a 'shard'.
• Each shard has the same database schema as the original database.
• Data is distributed such that each row appears in exactly one shard.
• The combined data from all shards is the same as the original database.
• Sharding helps in balancing out the load between servers; for example, if we have five servers, each one has to handle only 20% of the load.
• NoSQL databases are designed to support automatic distribution of data and queries across multiple servers located in different geographic regions. This permits rapid, automatic, and transparent replacement of data without any disruption.
• We have to ensure that data that are accessed together are clumped together on the same node to provide the best data access. Aggregate orientation helps in achieving this: aggregates combine data that are commonly accessed together, so aggregates can be used as the unit of distribution.

Factors that can help in arranging the data on the nodes so as to improve performance:
• Placing data close to where it's being accessed.
• Keeping the load even. This means the aggregates must be evenly distributed across the nodes such that all get equal amounts of the load.
• In some cases it is useful to put aggregates together if they may be read in sequence; for example, Bigtable keeps its rows in lexicographic order.

Auto-sharding:
• If sharding is done as part of application logic, it complicates the programming model, as application code needs to ensure that queries are distributed across the various shards.
• Furthermore, to rebalance the sharding, the application code must be changed and the data must be migrated.
• To overcome these problems, many NoSQL databases offer auto-sharding. With auto-sharding, the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard. This can make it much easier to use sharding in an application.
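The shard-routing logic that auto-sharding hides can be sketched as hash-based placement: each key deterministically maps to one shard, so every row lives in exactly one place and load spreads across servers. This is an illustrative toy; real auto-sharding also handles rebalancing and replication:

```python
import hashlib

NUM_SHARDS = 5
shards = [dict() for _ in range(NUM_SHARDS)]  # each shard: same schema, different rows

def shard_for(key):
    """Hash the key to pick a shard; md5 gives a stable, even spread."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value   # each row lands in exactly one shard

def get(key):
    return shards[shard_for(key)].get(key)  # the router finds the right shard

for i in range(100):
    put(f"user:{i}", {"id": i})
print([len(s) for s in shards])  # roughly 20 rows per shard across 5 servers
```

With five shards, each server handles roughly 20% of the rows, matching the load-balancing point above; an auto-sharding database performs this routing internally so application code never sees it.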


Master-Slave Replication
With master-slave distribution, we replicate data across multiple nodes. One node is designated as the master, or primary. The other nodes are slaves, or secondaries. A replication process synchronizes the slaves with the master.
(Figure: Data is replicated from master to slaves. The master services all writes; reads may come from either master or slaves.)

Notes on sharding vs. replication:
• While replication can improve read performance, it does not improve performance for applications that have a lot of writes. Sharding, however, can improve both read and write performance.
• Sharding does not improve resilience when used alone. Although the data is on different nodes, a node failure makes that shard's data unavailable. The resilience benefit it provides is that only the users of the data on that shard will suffer; nevertheless, it is not good to have a database with part of its data missing.

Master-Slave Replication (cont'd)
• Master-slave replication helps with read scalability but doesn't help scalability of writes. It provides resilience against failure of a slave, but not of the master. The master is a bottleneck and a single point of failure.

Peer-to-Peer Replication
• Peer-to-peer replication (see figure) solves these problems by not having a master. All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn't prevent access to the data store.
• A peer-to-peer replication cluster offers tolerance of node failures without losing access to data.
• We can easily add nodes to improve performance.
• The biggest complication is consistency. When two people attempt to update the same record at the same time, a write-write conflict occurs.
• Inconsistencies on read lead to problems, but at least they are relatively transient. Inconsistent writes are forever.
• Two solutions to the write-write conflict:
  1. We can ensure that whenever we write data, the replicas coordinate to ensure that a conflict is avoided. This can give us as strong a guarantee as master-slave replication, but at the cost of network traffic to coordinate the writes. We don't need all the replicas to agree on the write, just a majority.
  2. We can decide to cope with an inconsistent write by applying some policy. We can trade off consistency for availability.
(Figure: Peer-to-peer replication has all nodes applying reads and writes to all the data.)
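The "just a majority" coordination mentioned in solution 1 can be sketched as a quorum write: a write succeeds only if more than half of the replicas acknowledge it, so two conflicting writes can never both obtain a majority. This is an illustrative sketch only; real systems add version vectors and failure handling:

```python
class Replica:
    """One peer in a peer-to-peer replication cluster."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False          # an unreachable replica cannot acknowledge
        self.data[key] = value
        return True

def quorum_write(replicas, key, value):
    """Succeed only if a majority of replicas acknowledge the write."""
    acks = sum(r.write(key, value) for r in replicas)
    return acks > len(replicas) // 2   # majority, not unanimity, is enough

replicas = [Replica(), Replica(), Replica(up=False)]  # one node is down
print(quorum_write(replicas, "room:101", "booked"))   # True: 2 of 3 acknowledged
```

Because any two majorities of the same cluster overlap in at least one node, the overlapping node sees both writes and a conflict can be detected and ordered rather than silently lost.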


Combining Sharding and Replication
• Replication and sharding are strategies that can be combined. If we use both master-slave replication and sharding (Figure a), this means that we have multiple masters, but each data item only has a single master. A node may be a master for some data and a slave for others, or nodes may be dedicated to master or slave duties.
• Using peer-to-peer replication and sharding together (Figure b) is a common strategy for column-family databases. In a scenario like this we might have tens or hundreds of nodes in a cluster with data sharded over them. Peer-to-peer replication normally has a replication factor of 3, i.e., each shard is present on three nodes. If a node fails, the shards on that node will be rebuilt on the other nodes.
(Figure a: using master-slave replication together with sharding. Figure b: using peer-to-peer replication together with sharding.)

Difference between sharding and replication: replication copies the same data onto multiple nodes, whereas sharding puts different pieces of data onto different nodes.

Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, and triggers, in any combination. Relational databases offer strong consistency, whereas NoSQL systems mostly provide eventual consistency.

Various forms of consistency:
1. Update consistency (write-write conflict)
   Solutions: (a) pessimistic approach (write locks); (b) optimistic approach (conditional updates, or saving both updates and merging them).
2. Read consistency (read-write conflict)
   Solution: eventual consistency (allows a certain degree of inconsistency between replicas).
   Related problem: session consistency (inconsistencies must not occur within a user's own writes within a session).
   Solutions: (a) sticky sessions; (b) version stamps.

1. Update Consistency (write-write conflict)
Martin and Pramod are looking at the company website and notice that the phone number is out of date. They both have update access, so they both go in at the same time to update the number. We'll assume they update it slightly differently, because each uses a slightly different format. This issue is called a write-write conflict: two people updating the same data item at the same time.

When the writes reach the server, the server will serialize them: decide to apply one, then the other. Let's assume it uses alphabetical order and picks Martin's update first, then Pramod's. Without any concurrency control, Martin's update would be applied and immediately overwritten by Pramod's. In this case Martin's is a lost update. This is a failure of consistency.
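The optimistic approach (a conditional update) can be sketched as follows: each record carries a version, and an update succeeds only if the writer saw the current version. This is an illustrative sketch of the technique, not any specific product's API:

```python
# Optimistic concurrency control via a conditional update (compare-and-set).
record = {"phone": "555-1234", "version": 1}

def conditional_update(rec, new_value, expected_version):
    """Apply the update only if no one else has updated since we read."""
    if rec["version"] != expected_version:
        return False              # someone else wrote first: conflict detected
    rec["phone"] = new_value
    rec["version"] += 1
    return True

# Martin and Pramod both read version 1, then try to update concurrently.
print(conditional_update(record, "555-9999", 1))  # True: Martin's write applies
print(conditional_update(record, "555-0000", 1))  # False: Pramod must re-read; no lost update
```

Instead of Pramod silently overwriting Martin (the lost update above), the second write is rejected and Pramod can re-read the current value and decide what to do.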


• We refer to this type of consistency as logical consistency: ensuring that different data items make sense together.
• To avoid a logically inconsistent read-write conflict, relational databases support the notion of transactions. If Martin wraps his two writes in a transaction, the system guarantees that Pramod will either read both data items before the update or both after the update.
• Lack of transactions applies to only some NoSQL databases, in particular the aggregate-oriented ones. In contrast, graph databases tend to support ACID transactions just the same as relational databases.
• Aggregate-oriented databases support atomic updates, but only within a single aggregate. This means that we will have logical consistency within an aggregate but not between aggregates. So in the example, we could avoid inconsistency if the order, the delivery charge, and the line items are all part of a single order aggregate.
• The length of time an inconsistency is present is called the inconsistency window. A NoSQL system may have a quite short inconsistency window; the inconsistency window for Amazon's SimpleDB service is usually less than a second.

Replication Consistency
• Ensuring that the same data item has the same value when read from different replicas is called replication consistency.
• Example of a breach of replication consistency (replication inconsistency): Let's imagine there's one last hotel room for a desirable event. The hotel reservation system runs on many nodes. Martin and Cindy are considering this room, but they are discussing it on the phone because Martin is in London and Cindy is in Boston. Meanwhile Pramod, who is in Mumbai, goes and books that last room. That updates the replicated room availability, but the update gets to Boston quicker than it gets to London. When Martin and Cindy open their browsers to see if the room is available, Cindy sees it booked and Martin sees it free. This is replication inconsistency (see figure).
(Figure: An example of replication inconsistency.)
• Eventually, the updates will propagate fully, and Martin will see the room is fully booked. Therefore this situation is generally referred to as eventually consistent, meaning that at any time nodes may have replication inconsistencies but, if there are no further updates, eventually all nodes will be updated to the same value.

Session Consistency
• Inconsistency windows can be problematic when users get inconsistencies with their own writes.
• Consider the example of posting comments on a blog entry. Inconsistency windows of even a few minutes can't be tolerated while people are typing in their latest thoughts.
• Systems handle the load of such sites by running on a cluster and load-balancing incoming requests to different nodes. Therein lies a danger: we may post a message using one node, then refresh our browser, but the refresh goes to a different node which hasn't received our post yet, and it looks like our post was lost.
• In situations like this, we can tolerate reasonably long inconsistency windows, but we need read-your-writes consistency, which means that once we've made an update, we're guaranteed to continue seeing that update.
• One way to get this is to provide session consistency: within a user's session we provide read-your-writes consistency.

Two techniques to provide session consistency:
i) Sticky session: a sticky session is a session that's tied to one node (this is also called session affinity). A sticky session allows us to ensure that as long as we have read-your-writes consistency on a node, we'll get it for sessions too. The downside is that sticky sessions reduce the ability of the load balancer to do its job.
ii) Version stamps: every interaction with the data store includes the latest version stamp seen by a session. The server node must then ensure that it has the updates that include that version stamp before responding to a request.

Relaxing Consistency
Sometimes we have to sacrifice consistency. It is not possible to design a system that avoids inconsistencies without making sacrifices in other characteristics of the system. As a result, we often have to trade off consistency for something else, like availability and partition tolerance.

The CAP Theorem: the basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, we can only get two.
• Consistency: all people see the same data at the same time.
• Availability: if we can communicate with a node in the cluster, we should be able to read and write data.
• Partition tolerance: the cluster can survive communication breakages that separate the cluster into partitions that are unable to communicate with each other.

The CAP theorem states that if we get a network partition, we have to trade off availability (A) of data versus consistency (C). Very large systems will partition at some point. That leaves either C or A to choose from (a traditional DBMS prefers C over A and P). In almost all cases, for systems that use distribution models, we would choose A over C.
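The version-stamp technique can be sketched as follows: the session remembers the newest stamp it has seen, and a replica may serve the read only if it has caught up to that stamp. This is an illustrative sketch of the idea, not a product API:

```python
class Node:
    """One replica; its version counter advances with each write it has seen."""

    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value):
        self.version += 1
        self.value = value
        return self.version       # the version stamp is handed back to the session

    def read(self, min_version):
        if self.version < min_version:
            raise RuntimeError("replica is stale; retry elsewhere")
        return self.value

node_a, node_b = Node(), Node()   # two replicas behind a load balancer
session_stamp = node_a.write("my new comment")

print(node_a.read(session_stamp))  # node_a has the write: read succeeds
try:
    node_b.read(session_stamp)     # node_b hasn't replicated the write yet
except RuntimeError as err:
    print(err)                     # stale replica refuses, preserving read-your-writes
```

Unlike sticky sessions, the session is free to hit any node; a node that hasn't caught up simply declines to answer until it has the required version.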

The CAP Theorem (Cont'd)
Let us assume two nodes on opposite sides of a network partition:
• Availability + Partition tolerance forfeits Consistency.
• Consistency + Partition tolerance entails that one side of the partition must act as if it is unavailable, thus forfeiting Availability.
• Consistency + Availability is only possible if there is no network partition, thereby forfeiting Partition tolerance.

Large-Scale Databases
• When companies such as Google and Amazon were designing large-scale databases, 24/7 availability was a key concern; a few minutes of downtime means lost revenue.
• When horizontally scaling databases to thousands of machines, the likelihood of a node or network failure increases tremendously.
• Therefore, in order to have strong guarantees on Availability and Partition tolerance, they had to sacrifice "strict" Consistency (implied by the CAP theorem).


Trading Off Consistency
• Maintaining consistency should balance the strictness of consistency against availability/scalability.
• Good-enough consistency depends on the application.

Eventual Consistency
• A database is termed eventually consistent if all replicas will gradually become consistent in the absence of updates.

The BASE Properties
• The CAP theorem proves that it is impossible to guarantee strict Consistency and Availability while being able to tolerate network partitions.
• This resulted in databases with relaxed ACID guarantees.
• In particular, such databases apply the BASE properties:
  - Basically Available: the system guarantees Availability.
  - Soft State: the state of the system may change over time.
  - Eventual Consistency: the system will eventually become consistent.

Cassandra
Apache Cassandra is an open-source, distributed and decentralized storage system (database) for managing very large amounts of structured data spread out across the world.

Some notable points about Apache Cassandra:
• It provides a highly available service with no single point of failure.
• It is scalable, fault-tolerant, and consistent.
• It is a key-value as well as a column-oriented database.
• Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
• Created at Facebook (and later taken over by Apache), it differs sharply from relational database management systems.
• Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra
• Elastic scalability: Cassandra is highly scalable; it allows us to add more hardware to accommodate more customers and more data as required.
• Always-on architecture: Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance: Cassandra is linearly scalable, i.e., throughput increases as the number of nodes in the cluster increases. Therefore it maintains a quick response time.
• Flexible data storage: Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to data structures.
• Easy data distribution: Cassandra provides the flexibility to distribute data where we need it by replicating data across multiple data centers.
• Transaction support: Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

Cassandra Architecture
A Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. For failure handling, every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.
(Figure: data replication in Cassandra.)

Components of Cassandra
The key components of Cassandra are as follows:
• Cluster: a cluster is a component that contains one or more data centers.
• Data center: a collection of related nodes.
• Node: the place where data is stored.
• Commit log: the commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table: a mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable: a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter: quick, nondeterministic algorithms for testing whether an element is a member of a set; a special kind of cache. Bloom filters are consulted on every query.


Data Model
(Figure: the Cassandra data model, in which a keyspace contains column families, and each column family contains rows of columns.)

Cassandra Query Language
• Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables.
• Programmers use cqlsh, a prompt to work with CQL, or separate application-language drivers.
• Clients approach any of the nodes for their read/write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
• Every write activity of a node is captured in the commit log written on that node. Later the data is captured and stored in the mem-table. Whenever the mem-table is full, data is written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
• During read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the appropriate SSTable that holds the required data.
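The write and read paths described above can be sketched as a toy model: every write goes to the commit log (crash recovery) and the mem-table; when the mem-table reaches a threshold, it is flushed to an immutable SSTable. This is illustrative only; the names and the threshold are assumptions, and real Cassandra adds partitioning, replication, Bloom filters, and compaction:

```python
MEMTABLE_LIMIT = 3

commit_log = []   # append-only crash-recovery log
memtable = {}     # memory-resident structure holding the most recent values
sstables = []     # immutable on-disk files (modelled here as flushed dicts)

def write(key, value):
    commit_log.append((key, value))      # 1. recorded first for crash recovery
    memtable[key] = value                # 2. stored in the mem-table
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(dict(memtable))  # 3. flush the full mem-table to an SSTable
        memtable.clear()

def read(key):
    if key in memtable:                  # check the newest data first
        return memtable[key]
    for table in reversed(sstables):     # then newest SSTable to oldest
        if key in table:
            return table[key]
    return None

for i in range(4):
    write(f"k{i}", i)
print(len(sstables), read("k0"), read("k3"))  # one SSTable flushed; k0 and k3 both readable
```

After four writes, the first three have been flushed to an SSTable while the fourth still sits in the mem-table, yet reads find both, mirroring the mem-table-then-SSTable read path described above.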
