Introduction To NoSQL
NoSQL encompasses a wide variety of different database technologies that were developed in
response to the demands presented in building modern applications:
Developers are working with applications that create massive volumes of new, rapidly
changing data types — structured, semi-structured, unstructured and polymorphic data.
Long gone is the twelve-to-eighteen month waterfall development cycle. Now small
teams work in agile sprints, iterating quickly and pushing code every week or two, some
even multiple times every day.
Applications that once served a finite audience are now delivered as services that must be
always-on, accessible from many different devices and scaled globally to millions of
users.
Organizations are now turning to scale-out architectures using open source software,
commodity servers and cloud computing instead of large monolithic servers and storage
infrastructure.
History of NoSQL
In the early 1970s, flat file systems were used. Data were stored in flat files, and the biggest
problem was that each company implemented its own file format; there were no
standards. Storing data in files and retrieving it again was very difficult because there was
no standard way to store data.
Then the relational database was proposed by E. F. Codd, and these databases answered the
question of having no standard way to store data. Later, however, relational databases ran into a
problem of their own: they could not easily handle big data. This created the need for databases
that could handle such workloads, and NoSQL databases were developed.
RDBMS is sufficient to store and manipulate all the structured data efficiently but in
today’s world the velocity and nature of data used/generated over the Internet is
growing exponentially. As we can often see in areas like social media, the data used has
no specific structure boundary. This makes unavoidable the need to handle unstructured
data which is non-relational and schema-less in nature. For RDBMS it becomes a real
challenge to provide the cost effective and fast Create, Read, Update and Delete
(CRUD) operation as it has to deal with the overhead of joins and maintaining
relationships amongst various data.
Therefore a new mechanism is required to deal with such data in an easy and efficient
way. This is where NoSQL comes into the picture to handle unstructured BIG data in an
efficient way to provide maximum business value and customer satisfaction.
NoSQL databases typically do not use SQL as their primary query language; instead they provide
access by means of Application Programming Interfaces (APIs).
The reason behind such a big switch or in other words the advantages of NoSQL are the
following:
High scalability
Distributed Computing
Lower cost
Schema flexibility
Un/semi-structured data
No complex relationships
Big Users
Many organizations such as Facebook, Google, Yahoo and Twitter have millions of users, but the
number of active users is not constant. In other words, sometimes millions of users are active
and sometimes only a few thousand are. The number of users is constantly
changing. Supporting large numbers of concurrent users is important, but because app usage
requirements are hard to predict, it's just as important to dynamically support rapidly growing
(or shrinking) numbers of concurrent users.
Because the number of active users varies so much, we need a database technology that scales
more easily. With a traditional relational database, dynamic scalability is hard to achieve.
It is also important that the performance of the application is maintained while scaling.
NoSQL databases are designed for this purpose.
Big Data
Big Data is one of the key forces driving the growth and popularity of NoSQL for businesses.
Due to the explosive growth in internet usage, large volumes of data are generated continuously. This data
is generated by computers, mobiles, social apps and machine-to-machine communication. Let
us see a simple example. A commercial flight generates approximately 10 GB of data per hour
during its travel. According to an IDC estimate, the size of the world's digital data was
4.4 zettabytes (4.4 trillion gigabytes) in 2013 and was projected to reach 44 zettabytes by 2020.
So developers want a highly flexible solution that can handle big data. This is difficult with
schema-based relational databases, so NoSQL is a good fit for handling large,
schema-less data.
Today the continuous availability of data is very important. Downtime of even a
few seconds can cause a huge loss in business and damage a company's reputation. The best
way to avoid this is to use a distributed approach. In a distributed approach we remove the
dependency on a single machine and spread the data and load
across several machines. If one or more database servers or "nodes" go down then the other
nodes in the system are able to continue with operations without data loss. NoSQL databases
work on a distributed approach, so a NoSQL database is able to provide continuous
availability whether in a single location, across data centers or in the cloud.
Dynamic Schema
Relational database systems require a schema to be defined before inserting any data. For
example, if we want to insert information about an employee such as name, salary and age,
then we must first define three columns in the table along with their data types. This
approach is not well suited to present-day applications, because we don't know what type of
data will come from the user end or how much. If we later need to change
the schema of the database, it becomes difficult and creates extra work for the
developer. If the database is very large, the change can also cause system downtime. There's also no
way, using a relational database, to effectively address data that's completely unstructured or
unknown in advance.
NoSQL databases are different from relational databases. They are schema-less,
so we are not required to define a schema up front. We can make changes in the database without
worrying about service interruptions. In other words, NoSQL makes development faster.
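The employee example above can be sketched in a few lines of Python. This is only an illustration: the `employees` list, the sample records and the `field_names` helper are hypothetical stand-ins for a schema-less store, not the API of any particular NoSQL product.

```python
# Hypothetical in-memory "collection": a list of dicts standing in for a
# schema-less store. No column definitions are declared up front.
employees = []

# The first record uses the three fields from the example above.
employees.append({"name": "Asha", "salary": 52000, "age": 31})

# A later record adds a field that did not exist before. No schema change
# is needed because each record carries its own structure.
employees.append({"name": "Ravi", "salary": 61000, "age": 40,
                  "skills": ["python", "sql"]})


def field_names(records):
    """Return the sorted set of every field seen across all records."""
    names = set()
    for record in records:
        names.update(record)  # iterating a dict yields its keys
    return sorted(names)


print(field_names(employees))  # ['age', 'name', 'salary', 'skills']
```

The older record is left untouched when the new field appears, which is the property the text describes as changing the database without service interruption.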
Integrated Caching
Some products provide a caching tier for relational database systems for reading the data. So
it increases only the read performance. But these products don't provide any caching for
writes. So if our application is predominately read-only then we can use a distributed cache
but if our application is either predominately write or read-write then we cannot use a
distributed cache.
Many NoSQL databases have an integrated caching capability for both reads and writes. They keep
frequently used data in system memory as much as possible, which eliminates the need for a separate
caching layer that must be maintained.
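As a rough sketch of what a cache serving both reads and writes can look like, the toy class below keeps recently used values in memory and writes every update through to a dictionary that stands in for disk. The class name `CachedStore`, its methods and the least-recently-used eviction policy are assumptions made for illustration; real NoSQL products implement caching in their own ways.

```python
from collections import OrderedDict


class CachedStore:
    """Toy store whose in-memory cache serves both reads and writes."""

    def __init__(self, cache_size=2):
        self.disk = {}              # stands in for durable storage
        self.cache = OrderedDict()  # most recently used keys sit at the end
        self.cache_size = cache_size
        self.disk_reads = 0         # counts reads that missed the cache

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used key

    def put(self, key, value):
        self.disk[key] = value      # write-through: persist the value
        self._remember(key, value)  # and keep the fresh value in memory

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        self.disk_reads += 1        # cache miss: fall back to "disk"
        value = self.disk[key]
        self._remember(key, value)
        return value


store = CachedStore(cache_size=2)
store.put("a", 1)
store.put("b", 2)
print(store.get("a"), store.disk_reads)  # 1 0
```

Because a write also refreshes the cache, a value that was just written can be read back without touching the slower storage layer.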
Cloud Computing
Today nearly every new application uses cloud storage, either directly or indirectly. This cloud may
be public, private or hybrid. Most cloud applications use a three-tier internet architecture. In
this architecture the application is accessed through a web browser or a mobile application. A
load balancer is responsible for incoming traffic. Load balancing uses a scaling-out approach
to handle the incoming traffic: we add a new commodity server
when traffic increases. Relational databases, however, typically use a scaling-up approach instead of
scaling out. This makes them a poor fit for applications that require easy and dynamic
scalability.
Because NoSQL uses a scaling-out approach and relational databases use a scaling-up
approach, NoSQL is a better fit with the highly distributed nature of the three-tier internet
architecture.
Scale Up
This is also known as vertical scaling. In a scale-up approach we add resources
within the same logical unit to increase capacity, for example by adding a CPU to a
single server, adding memory, or attaching an external storage device to increase
storage capacity. Relational databases mainly use a scale-up approach. This
approach increases the size of the server, and such large servers become highly
complex and expensive. It also has the big disadvantage that if the server fails
then the entire system stops.
Scale Out
This is also known as horizontal scaling. In this approach we add a new node (server)
to the system so that the load is distributed over all servers. NoSQL
databases use a scale-out approach, and the mechanism is simple.
The system starts with one or more nodes. If, say, 10,000
new users connect to an application, another server is added. NoSQL uses a
cluster of standard physical or virtual servers to store data and support database
operations. When a new server (node) joins a cluster, data and database
operations are spread across the entire cluster.
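The idea of spreading data across a growing cluster can be sketched as follows. This is a deliberately simple model, assuming hash-modulo placement and a hypothetical `ShardedStore` class in which each node is a plain dictionary; production systems usually rely on schemes such as consistent hashing or range partitioning so that adding a node moves far fewer keys.

```python
import hashlib


def shard_for(key, node_count):
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


class ShardedStore:
    """Toy cluster: each node is a dict, and keys are spread by hash."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def put(self, key, value):
        self.nodes[shard_for(key, len(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[shard_for(key, len(self.nodes))][key]

    def add_node(self):
        """Add one node and redistribute the existing keys over the cluster."""
        items = [item for node in self.nodes for item in node.items()]
        self.nodes = [{} for _ in range(len(self.nodes) + 1)]
        for key, value in items:
            self.put(key, value)


cluster = ShardedStore(node_count=3)
for i in range(100):
    cluster.put(f"user:{i}", {"id": i})

cluster.add_node()  # scale out from 3 to 4 nodes
print([len(node) for node in cluster.nodes])  # keys held by each node
```

After `add_node()` every key is still reachable through `get`, but the records are now spread over four nodes instead of three.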
Replication
Data replication is the concept of keeping copies of data on several machines within a system,
often geo-distributed, preferably through a non-interactive, reliable process. In traditional
RDBMS databases, implementing replication is often a struggle because these systems were not
originally developed with horizontal scaling in mind. Most NoSQL databases support automatic
replication. In other words, we get high availability of data and disaster recovery without
adding any external applications.
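A minimal sketch of replication, under simplifying assumptions: the hypothetical `ReplicatedStore` below copies every write to all replicas that are currently up and reads from the first replica that is available. It does not model catching up a replica that missed writes while it was down, which real systems must handle.

```python
class ReplicatedStore:
    """Toy replicated store: every write is copied to each live replica."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.up = [True] * replica_count  # availability flag per replica

    def put(self, key, value):
        for index, replica in enumerate(self.replicas):
            if self.up[index]:
                replica[key] = value

    def get(self, key):
        # Read from the first replica that is still available.
        for index, replica in enumerate(self.replicas):
            if self.up[index]:
                return replica[key]
        raise RuntimeError("no replica available")


store = ReplicatedStore(replica_count=3)
store.put("order:1", "paid")

store.up[0] = False          # simulate the first server going down
print(store.get("order:1"))  # paid
```

The read still succeeds after one server fails because the other replicas hold the same data.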
Auto Sharding
A main advantage of NoSQL is that the data is spread across servers without affecting the
performance of the application. Servers can be added or removed without application
downtime. A well-configured NoSQL database is designed to stay online continuously; in
other words, it can provide 24x7 service.
Sharding in relational databases can reduce the capacity to perform complex queries, whereas
NoSQL databases are designed to keep their query capabilities even when the system contains hundreds of
servers.
Data generated by the Internet of Things (IoT) is mainly semi-structured or unstructured, and that
poses a challenge for relational databases because they work on a fixed schema and structured
data.
To overcome these problems, developers use a NoSQL database to store the data and
improve performance.
Features of NoSQL
Four core features of NoSQL, shown in the following list, apply to most NoSQL databases.
The list compares NoSQL to traditional relational DBMS:
Schema agnostic (non-believer): A database schema is the description of all possible
data and data structures in a relational database. With a NoSQL database, a schema isn’t
required, giving you the freedom to store information without doing up‐front schema
design.
Nonrelational: Relations in a database establish connections between tables of data. For
example, a list of transaction details can be connected to a separate list of delivery details.
With a NoSQL database, this information is stored as an aggregate — a single record with
everything about the transaction, including the delivery address.
Commodity hardware: Some databases are designed to operate best (or only) with
specialized storage and processing hardware. With a NoSQL database, cheap off‐the‐shelf
servers can be used. Adding more of these cheap servers allows NoSQL databases to scale
to handle more data.
Highly distributable: Distributed databases can store and process a set of information on
more than one device. With a NoSQL database, a cluster of servers can be used to hold a
single large database.
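The "nonrelational" point in the list above can be made concrete with a small sketch. The table names, field names and sample values are hypothetical; the only aim is to contrast a join across two lists with a single aggregate record.

```python
# Relational style: two separate "tables" linked by order_id.
transactions = [{"order_id": 1, "item": "keyboard", "amount": 45.0}]
deliveries = [{"order_id": 1, "address": "12 Example Street"}]


def join_order(order_id):
    """Combine the matching rows from both tables, as a join would."""
    txn = next(t for t in transactions if t["order_id"] == order_id)
    dlv = next(d for d in deliveries if d["order_id"] == order_id)
    return {**txn, "address": dlv["address"]}


# Aggregate style: one record holds everything about the transaction,
# including the delivery address, so no join is needed to read it back.
orders = {
    1: {
        "item": "keyboard",
        "amount": 45.0,
        "delivery": {"address": "12 Example Street"},
    }
}

print(join_order(1)["address"])
print(orders[1]["delivery"]["address"])
```

Both lookups return the same address; the aggregate version reads one record by key instead of matching rows from two places.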
Advantages of NoSQL
The advantages of NoSQL include being able to handle:
Large volumes of structured, semi-structured, and unstructured data
Agile sprints, quick iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Geographically distributed scale-out architecture instead of expensive, monolithic
architecture.
It provides fast performance
Provides horizontal scalability
Currently open-source
Does not use the Relational Model
Schema-less database
Running well on clusters
Designed for a cloud
Today, companies leverage NoSQL databases for a growing number of use cases. NoSQL
databases also tend to be open-source, which means a relatively low-cost way of
developing, implementing and sharing software. NoSQL databases also support the following features:
Dynamic Schemas
Auto-sharding
Replication
Integrated Caching
Types of NoSQL
The NoSQL databases currently being used can be grouped into four broad categories:
1. Key-value databases
2. Column-based databases
3. Document-based databases
4. Graph-based databases
1. Key-value databases
Data is stored as key-value pairs, and values are retrieved by providing keys. Each entry consists of
two parts: a string that represents the key, and the actual data, which is referred to as the
value. The user can search for or delete data using the key.
The key is like a primary key: it cannot be duplicated.
These stores are similar to hash tables in which the keys are used as indexes, which can make
simple lookups faster than in an RDBMS. Because key-value stores are represented as a hash map, they’re
powerful for basic Create-Read-Update-Delete operations, and these databases typically scale
quite well and shard easily across any number of nodes.
Key-value data stores are an efficient and powerful model. They’re great when quick
performance is required and the data are not connected. They have a simple application
programming interface (API), and they allow the user to store data in a schema-less
manner.
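The model described above maps closely onto a hash map. The sketch below wraps a Python dictionary in a hypothetical `KeyValueStore` class with put, get and delete operations; the class and the session keys are illustrative only and do not correspond to any specific product's API.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dict (a hash map)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique: writing an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key and report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


sessions = KeyValueStore()
sessions.put("session:abc123", {"user": "maria", "cart": ["book"]})
sessions.put("session:abc123", {"user": "maria", "cart": ["book", "pen"]})

print(sessions.get("session:abc123")["cart"])  # ['book', 'pen']
print(sessions.delete("session:abc123"))       # True
print(sessions.get("session:abc123"))          # None
```

Every operation goes through the key, which is why this model is fast for simple lookups and awkward for queries that relate many values to each other.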
Limitations
They are not meant for complex queries that connect multiple pieces of data, and
are suited to single-key operations only. When there are many-to-many relationships in the
data, a key-value store is likely to exhibit poor performance. Another weakness of key-value
data stores is the lack of a schema, which makes it much more difficult to create custom views of
the data.
Use Cases
Key-value databases can be used when you need quick performance for basic
Create-Read-Update-Delete operations and your data is not connected. For example:
Storing and retrieving a user's session information for a Web application.
Storing user profiles, preferences and favorite products within an application.
Storing a user's shopping cart data for online stores or marketplaces.
2. Column-based databases
A column family is a group of rows, each with a unique key or identifier and one or
more columns. The columns are grouped together in families because they are often
accessed together.
A column database stores its data in such a manner that it can be aggregated rapidly with less
I/O activity. It offers very high performance and a highly scalable architecture, and it is well
suited to data warehousing, data mining and analytics applications.
Use Cases
Some example use cases for a Column-Family database include event logging and blogs,
similar to document databases, but the data would be stored in a different fashion.
For enterprise logging, every application can write to its own set of columns and have each
row key formatted in such a way to promote easy lookup based on application and timestamp.
Counters are a unique use case. You may come across applications that need an easy way to
count or increment as events occur. Some Column-Family databases, like Cassandra, have
special column types that allow for simple counters. In addition, columns can have a time-to-
live parameter, making them useful for data with an end date, like trial periods or ad timing.
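The logging and counter use cases can be sketched with a toy model of a column family. Everything here is an assumption for illustration: the `ColumnFamily` class, the `log_row_key` helper and the `app#timestamp` key format are hypothetical and are not Cassandra's API, and time-to-live is not modeled.

```python
from collections import defaultdict


def log_row_key(app, timestamp):
    """Build a row key that sorts by application, then by timestamp."""
    return f"{app}#{timestamp}"


class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, row_key, columns):
        # Each row may hold a different set of columns.
        self.rows[row_key].update(columns)

    def increment(self, row_key, column, amount=1):
        """Counter-style column: add to the current value (default 0)."""
        row = self.rows[row_key]
        row[column] = row.get(column, 0) + amount
        return row[column]

    def scan_prefix(self, prefix):
        """Return (row_key, columns) pairs whose key starts with prefix."""
        return [(key, self.rows[key])
                for key in sorted(self.rows) if key.startswith(prefix)]


logs = ColumnFamily()
logs.insert(log_row_key("billing", "2024-01-01T10:05:00"), {"level": "ERROR"})
logs.insert(log_row_key("billing", "2024-01-01T10:00:00"), {"level": "INFO"})
logs.insert(log_row_key("search", "2024-01-01T10:01:00"), {"level": "INFO"})

for key, columns in logs.scan_prefix("billing#"):
    print(key, columns)

print(logs.increment("page:home", "views"))  # 1
print(logs.increment("page:home", "views"))  # 2
```

Because the row key starts with the application name followed by an ISO-style timestamp, a prefix scan returns one application's log entries in time order.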
3. Document-based Database
In Document-based databases data are stored as documents and organized as a collection of
documents. The documents are flexible; each document can have any number of fields. These
are designed for storing, retrieving and managing document-oriented information, also known
as semi-structured data. Document stores offer great performance and horizontal scalability
options.
The documents use standard formats such as XML, PDF, JSON and BSON. In a relational
database, every record in a table has the same fields, and unused
fields are kept empty; in a document store, each document may have similar as
well as dissimilar data. Documents in the database are addressed using a unique key that
represents that document. These keys may be a simple string or a string that refers to a URI or
path.
Document stores are slightly more complex than key-value stores: the value stored under each
key is a document that itself contains key-value pairs, so entries are also known as
key-document pairs.
Document-oriented databases should be used for applications in which data need not be
stored in a table with uniformly sized fields, but instead has to be stored as a document
with its own characteristics. Document stores should be avoided if the data has a
lot of relations and requires heavy normalization.
Use Cases
The first example would be for event logging for an application or process. Each instance would
constitute a new document or aggregate, containing all the information corresponding to the
event.
Another would be online blogging. Each user would be represented as a document; each post a
document; and each comment, like, or action would be a document. All documents would contain
information about the type of data, such as username, post content, or timestamp when the
document was created.
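The blogging example might look like the following sketch. The collection is a plain dictionary of documents addressed by unique keys; the keys, field names and the `find` helper are hypothetical and only illustrate that documents in one collection can carry different fields.

```python
# Hypothetical collection: unique key -> document (a dict with any fields).
collection = {
    "user:1": {"type": "user", "username": "maria"},
    "post:10": {
        "type": "post",
        "username": "maria",
        "content": "Hello, world",
        "created": "2024-01-05T09:00:00",
    },
    "comment:77": {
        "type": "comment",
        "username": "li",
        "post": "post:10",
        "content": "Nice post",
        "created": "2024-01-05T09:30:00",
    },
}


def find(documents, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in documents.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


print(len(find(collection, username="maria")))        # 2
print(find(collection, type="comment")[0]["content"])  # Nice post
```

A document that lacks a queried field is simply skipped by `find`, so no empty placeholder fields are needed.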
More generally speaking, document stores work well with working datasets for Web and mobile
applications. They were designed with the internet in mind – think JSON, RESTful API, and
unstructured data.
Limitations
Many document stores have limited or no support for transactions that operate over multiple
documents, and a relational database may be a better choice in this instance.
Document databases may not be the right choice if you find yourself forcing your data into an
aggregate-oriented design.
4. Graph-based Database
These databases apply graph theory from computer science to storing and retrieving data.
They focus on the interconnectivity of different pieces of data. A graph database stores
data in the form of a graph consisting of nodes
and edges, where nodes act as the objects and edges act as the relationships between the
objects. Nodes can also carry properties. Many graph databases use a technique called
index-free adjacency, meaning every node holds a direct pointer to its
adjacent nodes. Millions of records can be navigated using this technique.
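Index-free adjacency can be illustrated with a small sketch in which each node object holds direct references to its neighbors, so a traversal follows references instead of consulting a separate index. The `Node` class, the `friends_of_friends` function and the sample names are hypothetical.

```python
class Node:
    """Graph node that holds direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []

    def connect(self, other):
        # An undirected edge: each node points straight at the other.
        self.neighbors.append(other)
        other.neighbors.append(self)


def friends_of_friends(person):
    """Names reachable in exactly two hops that are not direct friends."""
    direct = set(person.neighbors)
    found = set()
    for friend in person.neighbors:
        for candidate in friend.neighbors:
            if candidate is not person and candidate not in direct:
                found.add(candidate)
    return sorted(node.name for node in found)


alice, bob, carol, dave, erin = (Node(n) for n in
                                 ["alice", "bob", "carol", "dave", "erin"])
alice.connect(bob)
alice.connect(dave)
bob.connect(carol)
dave.connect(carol)
carol.connect(erin)

print(friends_of_friends(alice))  # ['carol']
```

The cost of the traversal depends on how many edges are followed, not on the total number of nodes stored, which is the appeal of this layout for highly connected data.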
Graph databases provide schema-less and efficient storage of semi-structured data. Many graph
databases are ACID compliant and offer rollback support. Graph databases can be very
powerful when your data is highly connected.
Graph databases can be used for a variety of applications like social networking applications,
recommendation software, bioinformatics, content management, security and access control,
network and cloud management etc. However, sharding is very difficult to achieve in graph
databases, and they are difficult to cluster. Neo4j is one of the notable graph
databases.
Use Cases
Graph-based databases are used to store information about networks, such as social
connections. Social networking sites can benefit by quickly locating friends, friends of
friends, likes, and so on. Routing, spatial, and map applications may use graphs to easily
model their data for finding close locations or building shortest routes for directions.
Lastly, recommendation engines can leverage the close relationships and links between
products to easily give other options to their customers.
Limitations
Graph Databases are not a good fit for when you’re looking for some of the advantages
offered by the other NoSQL variations. When an application needs to scale horizontally,
you’re going to quickly reach the limitations associated with these types of data stores.
Another general negative surfaces when trying to update all or a subset of nodes with a given
parameter. These types of operations can prove to be difficult.
Describe the factors affecting return on investment for using locally hosted
database vs. database-as-a-service
First consider the types of questions you need to ask your database and how long you are willing to
wait for answers. If you have a web or mobile application that requires interactive responses, then you will
want to use a database that aims to be an operational data store. NoSQL databases may be a good
choice. If your application requires data warehousing for batch analytics, then often a relational database or
Hadoop-based technology would be a better fit.
Second, it’s important to consider how big your data will get and how many concurrent connections
you expect. If you need a really scalable solution, don’t completely know your capacity requirements up
front, or need something that scales as your application grows, then a NoSQL database might be a good
choice. Also consider whether or not you need the database to scale horizontally. If your applications are
running in the cloud, then you need your database solution to be compatible with the underlying
architecture. Many NoSQL databases offer horizontal scalability that fits well with cloud architectures.
Data durability is an important consideration based on your application requirements. Some databases offer the
ability to store your data in memory for faster access. However, with this approach, there is an increased
risk of losing the data when a server crashes. If data durability is paramount, then choose a database that
writes the data immediately to disk.
Next, consider your consistency and transactional requirements. Relational databases provide strong
consistency and transactional rollback capabilities, and would be a good choice if you have a use case that
requires these traits.
Other considerations relate to your availability, replication, and geo-location
requirements. Many NoSQL databases operate inherently in a cluster, and therefore can meet severe high
availability requirements. Data replication is an important feature to achieve disaster recovery objectives by
storing the data in additional data centers, and allow for syncing to application clients for offline access. A
few, but not all, NoSQL databases are built to handle these complex replication scenarios while avoiding
data corruption.
Flexible schemas are a common trait amongst many NoSQL databases. If you require a
flexible schema for rapid development where your data model may change over time, then you will often
want to go with a NoSQL database for your application. Many of them require no database downtime while
making schema changes, making development easier and faster.
It is important to assess the skill sets of
those developing the application, and administering the database and servers. Make sure you choose a
technology that fits with your existing resources before bringing it on-premise in your environment.
Think about whether or not your database layer can integrate easily with your application layer. For
web and mobile applications that use JSON, it makes sense to use a NoSQL database that also uses
JSON. However, if the business intelligence tools (or reporting dashboard) are expecting to consume
rows/columns, then a relational datastore might work better for you.
A few final considerations for choosing
the best database for your application are around where to host it and how it is being managed. It is
important to understand all of the components to make sure you end up with the simplest and most cost
effective option. In a traditional do-it-yourself scenario, you will own the setup of the underlying hardware
and operating system, installation and configuration of the chosen database management system, overall
administration including patching and support, and of course how the application data is designed.
In comparison, using a fully managed database-as-a-service is really meant to eliminate the complexity and
risk of DIY, and help development teams get to market faster, scale more smoothly and massively, and
provide better performance and availability for end users.
This contrasts a little bit with hosted database
solutions. A hosted database solution means the provider is choosing what hardware your database runs
on, and they’re provisioning it for you. So at the end of the day, they’re handing the administrative keys
over to you and it’s really up to your team of database administrators to keep things scaling and running
smoothly. That can end up being a distraction for developers and result in overhead costs that most
companies don’t want to bear these days. The only concern for users of database-as-a-service is the
design and development of their product! Guaranteed uptime, availability, and scalability are all the result of
a fully-managed service!
You can mitigate risk by offloading database administration and data layer management issues from your
development team. And you ensure that your developers need only concern themselves with what really
matters: developing better applications for your customers.