NOSQL Databases
1. Introduction
Data models are abstractions that help organize the information conveyed by the data in
databases. They organize multiple kinds of related information. A customer management data
model could capture information about customers’ names, addresses, orders, and payment
histories. Clinical databases could include information such as patients’ names, ages, genders,
current prescriptions, past surgeries, allergies, and other medically relevant details. Data
structures are well-defined data storage structures that are implemented using elements of
underlying hardware, particularly random access memory and persistent data storage, such as
hard drives and flash devices. For example, an integer variable in a programming language may
be implemented as a set of four contiguous bytes, or 32 bits. Data structures offer a higher level
of organization so you do not have to think in low-level terms of memory addresses and
machine-level operations on those addresses. Data models serve a similar purpose.
The elements of data models vary with the type of database. Relational databases are organized
around tables. Tables are used to store information about entities, such as customers, patients,
orders, and surgeries. Entities have attributes that capture information about particular entities.
Attributes include names, ages, shipping addresses, and so forth.
1.1. Key-Value Databases
Value: The definition of value with respect to key-value databases is so amorphous that it is
almost not useful. A value is an object, typically a set of bytes, that has been associated with a
key. Values can be integers, floating-point numbers, strings of characters, binary large objects
(BLOBs), semistructured constructs such as JSON objects, images, audio, and just about any
other data type you can represent as a series of bytes.
A value, which can be essentially any piece of data, is stored with a key that
identifies its location. A few important points:
● There is no connection between the values stored in a key-value database.
● Keys are unique to ensure that there is no ambiguity when searching for a specific value.
● Lookups are simpler than in relational databases, which may need to join several tables to
retrieve the information. A key-value database instead looks up the value associated with
a specific key.
A namespace is a collection of key-value pairs. A namespace could be an entire key-value
database. The essential characteristic of a namespace is that it is a collection of key-value
pairs with no duplicate keys. It is permissible to have duplicate
values in a namespace. Namespaces are helpful when multiple applications use a key-value
database. Developers of different applications should not have to coordinate their key-naming
strategy unless they are sharing data.
Essential Features
•Simplicity
•Speed
•Scalability
Simplicity
In key-value databases, you work with a very simple data model that resembles a dictionary.
The syntax for manipulating data is simple. Regardless of the type of operation, you specify a
namespace and a key to indicate which key-value pair you want to act on. The type of
operation depends on the call you make. There are three operations performed on a key-value
store: put, get, and delete.
● put adds a new key-value pair to the table or updates the value if the key is already present.
An update means replacing an existing value with a new one.
● get returns the value for a given key.
● delete removes a key and its value from the table.
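The three operations, together with namespaces, can be sketched with a small in-memory class. This is only an illustration and not the API of any particular product: the class name KeyValueStore, the namespace names, and the key format are assumptions made for the example.

```python
class KeyValueStore:
    """Toy in-memory store: each namespace is a separate dictionary."""

    def __init__(self):
        self._namespaces = {}

    def put(self, namespace, key, value):
        # Add a new pair, or replace the value if the key already exists.
        self._namespaces.setdefault(namespace, {})[key] = value

    def get(self, namespace, key):
        # Return the value for the key, or None if the key is absent.
        return self._namespaces.get(namespace, {}).get(key)

    def delete(self, namespace, key):
        # Remove the key and its value; deleting a missing key is a no-op.
        self._namespaces.get(namespace, {}).pop(key, None)


store = KeyValueStore()
store.put("shop", "customer:187693", "Kiera Brown")
store.put("game", "customer:187693", "player-one")  # same key, other namespace
store.put("shop", "customer:187693", "K. Brown")    # update replaces the value
store.delete("game", "customer:187693")

shop_value = store.get("shop", "customer:187693")
game_value = store.get("game", "customer:187693")   # None after the delete
```

Because each namespace has its own dictionary, the two applications can use the same key without coordinating their key-naming strategy.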
Typelessness: values are, generally speaking, BLOBs, so you can store anything you want. It’s up
to the application to determine what type of data is being used, such as an integer, string, JSON,
XML file, or even binary data like an image. This feature is especially useful when the data type
changes or you need to support two or more data types for the same attribute.
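A minimal sketch of typelessness, assuming a plain dictionary as a stand-in for the database: the store only ever sees bytes, and the application decides how to encode and decode them. The key names and the choice of a 4-byte big-endian integer are assumptions for the example.

```python
import json

store = {}  # stand-in for a key-value database: key -> raw bytes

# The application encodes each value; the store does not know the types.
store["visits"] = (42).to_bytes(4, "big")  # integer as 4 bytes
store["profile"] = json.dumps(
    {"name": "Hui Li", "loyalty_level": "Gold"}
).encode("utf-8")                          # JSON document as bytes

# The application is also responsible for decoding with the right type.
visits = int.from_bytes(store["visits"], "big")
profile = json.loads(store["profile"].decode("utf-8"))
```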
Speed
There is no need for complicated query-resolution logic. Every query directly specifies the key,
and always exactly one key. The only job is to find the value corresponding to it. Supported
by internal design features that optimize performance, key-value databases deliver high
throughput for applications with data-intensive operations.
Scalability
With key-value databases there are no relational dependencies, and all read and write
requests are independent of one another, which makes these databases a natural fit for scaling across nodes.
Schemaless
Each key-value pair is independent, and the structure of the values can vary across different records.
Consistency
• Key-value databases often prioritize availability and partition tolerance over strong consistency, so after a write
operation it may take some time for the data to be replicated and become consistent across all nodes.
• Since a value may already have been replicated to other nodes when a conflicting update arrives, there are two ways of
resolving update conflicts:
  • either the newest write wins and older writes lose, or
  • both (all) values are returned, allowing the client to resolve the conflict.
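The two conflict-resolution strategies can be sketched as follows. This is a simplified illustration: it assumes each replica reports its version of a value as a (timestamp, value) pair, and the function names are made up for the example.

```python
def last_write_wins(versions):
    """Keep only the value with the newest timestamp; older writes are dropped."""
    return max(versions, key=lambda version: version[0])[1]


def return_siblings(versions):
    """Return every conflicting value, oldest first, for the client to resolve."""
    return [value for _, value in sorted(versions, key=lambda version: version[0])]


# Two replicas accepted different writes for the same key.
versions = [(1005, "cart: 2 books"), (1001, "cart: 1 book")]

winner = last_write_wins(versions)
siblings = return_siblings(versions)
```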
When to use?
1. Session management: storing session attributes in online applications (for example, managing the sessions of
individual players in a multiplayer online game).
2. In-memory data caching: an effective cache mechanism for frequently accessed but rarely
updated data.
3. Implementing blockchain-based solutions: the key is the hash value and the value is the
corresponding block.
4. Real-time data access: fast in-memory access, useful for many small continuous reads
and writes.
5. Storing basic information: for example, storing URLs as keys and the corresponding websites as values.
When not to use?
Multioperation Transactions: If you’re saving multiple keys and there is a failure to save any one
of them, and you want to revert or roll back the rest of the operations, key-value stores are not
the best solution.
Query by Data: If you need to search the keys based on something found in the value part of the
key-value pairs, then key-value stores are not going to perform well for you.
Operations by Sets: Since operations are limited to one key at a time, there is no way to operate
upon multiple keys at the same time. If you need to operate upon multiple keys, you have to
handle this from the client side.
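The last two limitations can be made concrete with a short sketch. A dictionary stands in for the store, and the helper names multi_get and find_keys_by_value are hypothetical: the first issues one lookup per key from the client side, and the second shows that searching by value means examining every pair.

```python
store = {
    "cust:1:city": "Boston",
    "cust:2:city": "Vancouver",
    "cust:3:city": "Boston",
}


def multi_get(store, keys):
    # Operations by sets: the client loops and issues one get per key.
    return {key: store.get(key) for key in keys}


def find_keys_by_value(store, wanted):
    # Query by data: with no index on values, every pair must be scanned.
    return [key for key, value in store.items() if value == wanted]


cities = multi_get(store, ["cust:1:city", "cust:2:city"])
boston_keys = find_keys_by_value(store, "Boston")
```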
1.2. Document Oriented Databases
A document is a set of key-value pairs. Keys are represented as strings of characters. Values may
be basic data types (such as numbers, strings, and Booleans) or structures (such as arrays and
objects). Documents contain both structure information and data. The name in a name-value pair
indicates an attribute and the value in a name-value pair is the data assigned to that attribute.
JSON and XML are two formats commonly used to define documents.
These documents are self-describing, hierarchical tree data structures which can consist of maps,
collections, and scalar values. The documents stored are similar to each other but do not have to
be exactly the same. Document databases store documents in the value part of the key-value
store; think about document databases as key-value stores where the value is examinable.
Documents, like relational tables, organize multiple attributes in a single object. This allows
database developers to more easily implement common requirements, such as returning all
attributes of an entity based on a filter applied to one of the attributes. For example, in one step
you could filter a list of customer documents to identify those whose last purchase was at least
six months ago and return their IDs, names, and addresses. If you were using a key-value
database, you would need to query all last purchase dates, generate a list of unique identifiers
associated with those customers whose last purchase date is at least six months old, and then
query for the names and addresses associated with each identifier in the list.
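The difference between the two approaches can be sketched in a few lines. The cutoff date, the key format cust:&lt;id&gt;:&lt;attribute&gt;, and the use of plain Python lists and dictionaries in place of real databases are all assumptions made for illustration.

```python
from datetime import date, datetime

customers = [
    {"customer_id": 187693, "name": "Kiera Brown", "last_order": "06/27/2014"},
    {"customer_id": 187694, "name": "Bob Brown", "last_order": "05/12/2014"},
    {"customer_id": 290981, "name": "Lucas Lambert", "last_order": "02/14/2014"},
]


def parse(text):
    return datetime.strptime(text, "%m/%d/%Y").date()


cutoff = date(2014, 6, 1)  # assumed: six months before the query date

# Document style: filter on one attribute and return other attributes in one step.
inactive = [
    {"customer_id": c["customer_id"], "name": c["name"]}
    for c in customers
    if parse(c["last_order"]) <= cutoff
]

# Key-value style: the same data spread over independent keys.
kv = {}
for c in customers:
    kv[f"cust:{c['customer_id']}:last_order"] = c["last_order"]
    kv[f"cust:{c['customer_id']}:name"] = c["name"]

# Step 1: read all last-order dates and collect the matching identifiers.
ids = [
    key.split(":")[1]
    for key, value in kv.items()
    if key.endswith(":last_order") and parse(value) <= cutoff
]
# Step 2: issue one more lookup per identifier to fetch the name.
names = [kv[f"cust:{customer_id}:name"] for customer_id in ids]
```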
Documents are generally grouped into collections of similar documents. One of the key parts of
modeling document databases is deciding how you will organize your documents into
collections. Documents in the same collection do not need to have identical structures, but they
should share some common structures, as in the following four customer documents:
[
  {
    "customer_id": 187693,
    "name": "Kiera Brown",
    "address": {
      "street": "1232 Sandy Blvd.",
      "city": "Vancouver",
      "state": "WA",
      "zip": "99121"
    },
    "first_order": "01/15/2013",
    "last_order": "06/27/2014"
  },
  {
    "customer_id": 187694,
    "name": "Bob Brown",
    "address": {
      "street": "1232 Sandy Blvd.",
      "city": "Vancouver",
      "state": "WA",
      "zip": "99121"
    },
    "first_order": "02/25/2013",
    "last_order": "05/12/2014"
  },
  {
    "customer_id": 179336,
    "name": "Hui Li",
    "address": {
      "street": "4904 Main St.",
      "city": "St Louis",
      "state": "MO",
      "zip": "99121"
    },
    "first_order": "05/29/2012",
    "last_order": "08/31/2014",
    "loyalty_level": "Gold",
    "large_purchase_discount": 0.05,
    "large_purchase_amount": 250.00
  },
  {
    "customer_id": 290981,
    "name": "Lucas Lambert",
    "address": {
      "street": "974 Circle Dr.",
      "city": "Boston",
      "state": "MA",
      "zip": "02150"
    },
    "first_order": "02/14/2014",
    "last_order": "02/14/2014",
    "number_of_orders": 1,
    "number_of_returns": 1
  }
]
The first two documents have the same structure, while the third and fourth documents have
additional attributes. The third document contains three new fields: loyalty_level,
large_purchase_discount, and large_purchase_amount. These are used to indicate this person is
considered a valued customer who should receive a 5% discount on all orders over $250. (The
currency type is implicit.) The fourth document has two other new fields, number_of_orders and
number_of_returns. In this case, it appears that the customer made one purchase on February 14,
2014, and returned it. One of the advantages of document databases is that they provide
flexibility when it comes to the structure of documents.
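Application code that reads such documents has to tolerate missing fields. The sketch below applies the large-purchase discount only when a document carries the optional attributes; the function name order_total and the trimmed-down documents are assumptions for the example.

```python
def order_total(customer, amount):
    """Apply the customer's discount if the document defines one."""
    # Optional fields: fall back to "no discount" when they are absent.
    discount = customer.get("large_purchase_discount", 0.0)
    threshold = customer.get("large_purchase_amount")
    if threshold is not None and amount > threshold:
        return round(amount * (1 - discount), 2)
    return amount


kiera = {"customer_id": 187693, "name": "Kiera Brown"}
hui = {
    "customer_id": 179336,
    "name": "Hui Li",
    "loyalty_level": "Gold",
    "large_purchase_discount": 0.05,
    "large_purchase_amount": 250.00,
}

kiera_total = order_total(kiera, 300.00)  # no discount fields, full price
hui_total = order_total(hui, 300.00)      # over $250, so 5% off
hui_small = order_total(hui, 200.00)      # under the threshold, full price
```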
Suitable Use Cases
Event Logging: Applications have different event logging needs; within the enterprise, there are
many different applications that want to log events. Document databases can store all these
different types of events and can act as a central data store for event storage. This is especially
true when the type of data being captured by the events keeps changing. Events can be sharded
by the name of the application where the event originated or by the type of event such as
order_processed or customer_logged.
Web Analytics or Real-Time Analytics: Document databases can store data for real-time
analytics. Since parts of a document can be updated, it is easy to store page views or
unique visitors, and new metrics can be added without schema changes.
E-Commerce Applications: E-commerce applications often need a flexible schema for
products and orders, as well as the ability to evolve their data models without expensive database
refactoring or data migration.
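For the web analytics use case, a partial update and a metric added later might look like the sketch below. A Python dictionary stands in for a stored document, and the field names page_views and unique_visitors are assumptions for the example.

```python
page_doc = {"page": "/home", "page_views": 0}


def record_view(doc, visitor_id):
    # Partial update: only the counter changes, the rest of the document stays.
    doc["page_views"] += 1
    # A new metric appears the first time it is needed, with no schema change.
    visitors = doc.setdefault("unique_visitors", [])
    if visitor_id not in visitors:
        visitors.append(visitor_id)


for visitor in ["v1", "v2", "v1"]:
    record_view(page_doc, visitor)
```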
1.3. Column Family Databases
In 2006, Google published a paper entitled “BigTable: A Distributed Storage System for
Structured Data.” The paper described a new type of database, the column family database. It
stores data with keys mapped to values, and the values are grouped into multiple column
families, each column family being a map of data. Column family databases store data in column
families as rows that have many columns associated with a row key. Column families are groups
of related data items that are frequently accessed together. Column families for a single row may
or may not be near each other when stored on disk, but columns within a column family are kept
together.
Cassandra is one of the popular column family databases. It can be described as fast and easily
scalable, with write operations spread across the cluster. The cluster does not have a master
node, so any read or write can be handled by any node in the cluster.
In BigTable, a data value is indexed by its row identifier, column name, and time stamp. The
row identifier is analogous to a primary key in a relational database. It uniquely identifies a row.
Remember, a single row can have multiple column families. The time stamp orders versions of
the column value. When a new value is written to a BigTable database, the old value is not
overwritten. Instead, a new value is added along with a time stamp. The time stamp allows
applications to determine the latest version of a column value.
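The indexing scheme can be sketched as a map from (row identifier, column name) to a list of time-stamped versions. This is only a toy model of the idea, not BigTable's actual storage format; the function names and the integer time stamps are assumptions.

```python
table = {}  # (row_id, column_name) -> list of (timestamp, value)


def write(row_id, column, value, timestamp):
    # A write never overwrites: it appends a new time-stamped version.
    table.setdefault((row_id, column), []).append((timestamp, value))


def read_latest(row_id, column):
    # The time stamp orders the versions; the newest one is the current value.
    versions = table.get((row_id, column), [])
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


write("cust:187693", "address:city", "Portland", timestamp=100)
write("cust:187693", "address:city", "Lincoln", timestamp=200)

current_city = read_latest("cust:187693", "address:city")
version_count = len(table[("cust:187693", "address:city")])  # old value is kept
```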
Column families store columns together in persistent storage, making it more likely that reading
a single data block can satisfy a query. As you read a
set of columns, you will be able to read all the columns needed or none of them. There are no
partial results allowed with atomic operations (Atomic Read). If you update several columns of
a row, atomic writes guarantee that the writes to all columns will succeed or
they will all fail. You will never be left with partially written data. For example, if a customer
moves from Portland, Oregon, to Lincoln, Nebraska, and you update the customer’s address, you
would never find a case in which the city changes from Portland to Lincoln but the state does not
change from Oregon to Nebraska.
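One simple way to picture the all-or-nothing behavior is to prepare the new row on the side and publish it in a single step. The sketch below is an in-memory illustration only; real systems rely on mechanisms such as logging and locking, and the None check merely stands in for any failure that happens partway through an update.

```python
def atomic_write(table, row_id, updates):
    # Build the new row on the side; the stored row is untouched so far.
    new_row = dict(table[row_id])
    for column, value in updates.items():
        if value is None:
            raise ValueError(f"invalid value for column {column!r}")
        new_row[column] = value
    # Publish every change at once by swapping in the finished row.
    table[row_id] = new_row


customers = {"cust:1": {"city": "Portland", "state": "Oregon"}}

# A failing update leaves both columns unchanged.
try:
    atomic_write(customers, "cust:1", {"city": "Lincoln", "state": None})
except ValueError:
    pass
after_failure = dict(customers["cust:1"])

# A successful update changes both columns together.
atomic_write(customers, "cust:1", {"city": "Lincoln", "state": "Nebraska"})
after_success = dict(customers["cust:1"])
```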
BigTable maintains rows in sorted order. This makes it straightforward to perform range
queries. Sales transactions, for example, may be ordered by date. When a user needs to
retrieve a list of sales transactions for the past week, the data can be retrieved without
sorting a large transaction table or using a secondary index that maintains date order.
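Why sorted rows make range queries cheap can be shown with a binary search over sorted row keys. The key format (an ISO date followed by a transaction id) is an assumption chosen so that lexicographic order matches date order.

```python
import bisect

# Row keys kept in sorted order, as the storage layer would maintain them.
row_keys = sorted([
    "2014-08-27#t3",
    "2014-08-20#t1",
    "2014-09-02#t4",
    "2014-08-25#t2",
])


def range_scan(keys, start, end):
    """Return keys with start <= key < end without scanning the whole list."""
    low = bisect.bisect_left(keys, start)
    high = bisect.bisect_left(keys, end)
    return keys[low:high]


last_week = range_scan(row_keys, "2014-08-24", "2014-09-01")
```

The two binary searches locate the boundaries of the range, and everything between them is already in date order.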
Apache Cassandra, like Apache HBase, is designed for high availability, scalability, and
consistency. Cassandra takes a different architectural approach than HBase. Rather than
use a hierarchical structure with fixed functions per server, Cassandra uses a peer-to-peer
model.
1.4. Graph Databases
Graphs are mathematical objects that consist of two parts: vertices and edges.
Vertices represent things, such as:
• Cities
• Employees in a company
• Proteins
• Junctions in a water line
• Organisms in an ecosystem
• Train stations
What these things have in common is that they have relationships to other things, often in the same category.
The links or connections between entities are represented by edges.
Graphs and Network Modeling
Geographic locations are modeled as vertices. These could be cities, towns, or intersections of
highways. Vertices have properties, like names, latitudes, and longitudes. In the case of towns
and cities, they have populations and size measured in square miles or kilometers. Highways and
railways are modeled as edges between two vertices. They also have properties, such as length,
year built, and maximum speed.
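A small network of this kind can be sketched with plain Python structures. The town names and all property values below are made up for the example, and the neighbors helper is hypothetical rather than part of any graph database API.

```python
# Vertices: name -> properties (values are invented for illustration).
vertices = {
    "Town A": {"population": 12000},
    "Town B": {"population": 48000},
    "Town C": {"population": 5000},
}

# Edges: (vertex, vertex, properties). Each edge connects two vertices.
edges = [
    ("Town A", "Town B", {"type": "highway", "length_km": 40, "max_speed_kmh": 100}),
    ("Town B", "Town C", {"type": "railway", "length_km": 25, "max_speed_kmh": 160}),
    ("Town A", "Town C", {"type": "highway", "length_km": 70, "max_speed_kmh": 90}),
]


def neighbors(vertex, edge_type=None):
    """Vertices directly connected to `vertex`, optionally by one edge type."""
    result = []
    for source, target, props in edges:
        if edge_type is not None and props["type"] != edge_type:
            continue
        if source == vertex:
            result.append(target)
        elif target == vertex:
            result.append(source)
    return result


all_from_a = neighbors("Town A")
rail_from_c = neighbors("Town C", edge_type="railway")
```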
In a graph that models students and courses as vertices, the edges between students and courses
allow users to quickly query all the courses a particular student is enrolled in.
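As a sketch of that lookup, the enrollment edges can be grouped by student so that the query becomes a direct access. The student and course names are invented for the example.

```python
from collections import defaultdict

# Each edge links a student vertex to a course vertex.
enrolled_in = [
    ("alice", "Databases"),
    ("alice", "Statistics"),
    ("bob", "Databases"),
]

# Group edges by their starting vertex so they can be followed directly.
courses_by_student = defaultdict(list)
for student, course in enrolled_in:
    courses_by_student[student].append(course)

alice_courses = courses_by_student["alice"]
```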
Simplified Modeling
Multiple Relations Between Entities: Using multiple types of edges allows database designers to
readily model multiple relations between entities. This is particularly useful when modeling
transportation options between entities. For example, a transportation company might want to
consider road, rail, and air transportation between cities (see Figure 12.11). Each has different
properties.
In exchange for improved read and write performance, you may lose other features of relational
databases, such as immediate consistency and ACID transactions (although this is not always the
case). Queries have driven the design of data models because queries describe how data will be
used. Queries are also a good starting point for understanding how well various NoSQL
databases will meet your needs. Other factors are,
Key-value databases are well suited to applications that have frequent small reads and writes
along with simple data models. The values stored in key-value databases may be simple scalar
values, such as integers or Booleans, but they may also be structured data types, such as lists and
JSON structures. Key-value databases generally have simple query facilities that allow you to
look up a value by its key. Some key-value databases support search features that provide for
somewhat more flexibility. Developers can use tricks, such as enumerated keys, to implement
range queries, but these databases usually lack the query capabilities of document, column
family, and graph databases. Key-value databases are used in a wide range of applications, such
as the following:
• Caching data from relational databases to improve performance
• Tracking transient attributes in a web application, such as a shopping cart
• Storing configuration and user data information for mobile applications
• Storing large objects, such as images and audio files
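The enumerated-keys trick mentioned above can be sketched as follows: because the store can only look up exact keys, the client generates every key in the range and fetches each one. The key format sales:&lt;date&gt; and the helper names are assumptions for the example.

```python
from datetime import date, timedelta

store = {
    "sales:2014-08-25": 3,
    "sales:2014-08-26": 5,
    "sales:2014-08-28": 2,
}


def keys_for_range(prefix, start, end):
    # Enumerate one key per day from start to end, inclusive.
    day = start
    while day <= end:
        yield f"{prefix}:{day.isoformat()}"
        day += timedelta(days=1)


def range_get(store, prefix, start, end):
    # One exact-key lookup per enumerated key; days with no data are skipped.
    return {
        key: store[key]
        for key in keys_for_range(prefix, start, end)
        if key in store
    }


result = range_get(store, "sales", date(2014, 8, 25), date(2014, 8, 27))
```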
Document databases are designed for flexibility. If an application requires the ability to store
varying attributes along with large amounts of data, then document databases are a good option.
For example, to represent products in a relational database, a modeler may use a table for
common attributes and additional tables for each subtype of product to store attributes used only
in the subtype of product. Document databases can handle this situation easily.
Document databases provide for embedded documents, which are useful for denormalizing.
Instead of storing data in different tables, data that is frequently queried together is stored
together in the same document. Document databases improve on the query capabilities of key-
value databases with indexing and the ability to filter documents based on attributes in the
document. Document databases are probably the most popular of the NoSQL databases because
of their flexibility, performance, and ease of use. These databases are well suited to a number of
use cases, including:
• Back-end support for websites with high volumes of reads and writes
• Managing data types with variable attributes, such as products
• Tracking variable types of metadata
• Applications that use JSON data structures
• Applications benefiting from denormalization by embedding structures within structures
Document databases are also available from cloud services such as Microsoft Azure Document
and Cloudant’s database.
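Denormalizing by embedding, as described above, can be sketched by turning three table-like structures into one document. All identifiers, SKUs, and the helper name build_order_document are made up for the example.

```python
# Normalized, table-like data: answering a query means combining three places.
customers = {187693: {"name": "Kiera Brown", "city": "Vancouver"}}
orders = {5001: {"customer_id": 187693, "order_date": "06/27/2014"}}
order_items = [
    {"order_id": 5001, "sku": "BK-101", "quantity": 2},
    {"order_id": 5001, "sku": "PN-007", "quantity": 1},
]


def build_order_document(order_id):
    """Embed the customer and the line items inside a single order document."""
    order = orders[order_id]
    return {
        "order_id": order_id,
        "order_date": order["order_date"],
        "customer": dict(customers[order["customer_id"]]),
        "items": [
            {"sku": item["sku"], "quantity": item["quantity"]}
            for item in order_items
            if item["order_id"] == order_id
        ],
    }


order_doc = build_order_document(5001)
# Everything needed for the order is now available from one document.
total_quantity = sum(item["quantity"] for item in order_doc["items"])
```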
Column family databases are designed for large volumes of data, read and write performance,
and high availability. Google introduced BigTable to address the needs of its services. Facebook
developed Cassandra to back its Inbox Search service. These database management systems run
on clusters of multiple servers. If the data is small enough to run with a single server, then a
column family database is probably more than you need—consider a document or key-value
database instead.
(Google. 2014, March 20. “Cassandra Hits One Million Writes Per Second on Google Compute
Engine.” http://googlecloudplatform.blogspot.com/2014/03/cassandra-hits-one-million-writes-per-second-on-googlecompute-engine.html)
With this configuration, the Cassandra cluster reached one million writes per second with 95%
completing in under 23 milliseconds. When one-third of the nodes were lost, the one million
writes were sustained but with higher latency.
Several areas can use this kind of Big Data processing capability, such as
Key-value, document, and column family databases are well suited to a wide range of
applications. Graph databases, however, are best suited to a particular type of problem.
Problem domains that lend themselves to representations as networks of connected entities are
well suited for graph databases. One way to assess the usefulness of a graph database is to
determine if instances of entities have relations to other instances of entities. For example, two
orders in an e-commerce application probably have no connection to each other. They might be
ordered by the same customer, but that is a shared attribute, not a connection. Similarly, a game
player’s configuration and game state have little to do with other game players’ configurations.
Entities like these are readily modeled with key-value, document, or relational databases.
Now consider examples mentioned in the discussion of graph databases, such as highways
connecting cities, proteins interacting with other proteins, and employees working with other
employees. In all of these cases, there is some type of connection, link, or direct relationship
between two instances of entities. These are the types of problem domains that are well suited to
graph databases. Other examples of these types of problem domains include
• Network and IT infrastructure management
• Identity and access management
• Business process management
• Recommending products and services
• Social networking
When there is a need to model explicit relations between entities and rapidly traverse paths
between entities, then graph databases are a good database option. Large-scale graph processing,
such as with large social networks, may actually use column family databases for storage and
retrieval. Graph operations are built on top of the database management system. The Titan graph
database and analysis platform takes this approach.
Key-value, document, column family, and graph databases meet different types of needs. Unlike
relational databases that essentially displaced their predecessors, these NoSQL databases will
continue to coexist with each other and relational databases because there is a growing need for
different types of applications with varying requirements and competing demands.
Modern data management infrastructure is responsible for a wider range of applications and data
types than ever before. Mobile devices generate large volumes of data about users’ behaviors and
location. The instrumentation of cars, appliances, and other devices, referred to as the Internet of
Things (IoT), is another potential data source. With so many changes in the scope and size of
data and applications, additional database management techniques are needed.
Relational databases will continue to support transaction processing systems and business
intelligence applications. Decades of work with transaction processing systems and data
warehouses has led to best practices and design principles that continue to meet the needs of
businesses, governments, and other organizations. At the same time, these organizations are
adapting to technologies that did not exist when the relational model was first formulated.
Customer-facing web applications, mobile services, and Big Data analytics might work well with
relational databases, but in some cases they do not.
The current technology landscape requires a variety of database technologies. Just as there is no
best programming language, there is no best database management system. There are database
systems better suited to some problems than others, and the job of developers and designers is to
find the best database for the requirements at hand.