BDA Unit 5

The document discusses the evolution of data storage solutions from RDBMS to Hadoop and HBase, highlighting the limitations of Hadoop's batch processing and the advantages of HBase for random access to large datasets. It explains the architecture and data model of HBase, its features, and its applications in handling big data. Additionally, it covers the benefits of NoSQL databases, including scalability, flexibility, and high availability, as well as various types of NoSQL databases like document-based and key-value stores.


UNIT V

Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance. After the advent of big data, companies realized the benefits of processing big data and started opting for solutions such as Hadoop.
Hadoop uses a distributed file system to store big data and MapReduce to process it. Hadoop excels at storing and processing huge volumes of data in various formats: structured, semi-structured, and even unstructured.

Limitations of Hadoop

Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means the entire dataset must be scanned even for the simplest of jobs.

A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially. At this point, a new solution was needed that could reach any point of data in a single unit of time (random access).

Nov 2006: Google released the paper on BigTable.
Feb 2007: The initial HBase prototype was created as a Hadoop contribution.
Oct 2007: The first usable HBase was released along with Hadoop 0.15.0.
Jan 2008: HBase became a subproject of Hadoop.
Oct 2008: HBase 0.18.1 was released.
Jan 2009: HBase 0.19.0 was released.
Sept 2009: HBase 0.20.0 was released.
May 2010: HBase became an Apache top-level project.

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase's data model is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in
the Hadoop File System.

One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.

HBase and HDFS


1. HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups. HBase provides fast lookups for large tables.
3. HDFS provides high-latency batch processing. HBase provides low-latency access to single rows from billions of records (random access).
4. HDFS provides only sequential access to data. HBase internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase
HBase is a column-oriented database, and the tables in it are sorted by row key. The table schema defines only column families, which hold key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk. Each cell value in the table has a timestamp. In short, in an HBase:
1. Table is a collection of rows.
2. Row is a collection of column families.
3. Column family is a collection of columns.
4. Column is a collection of key value pairs.
Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)
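This logical layout (table → row → column family → column → value) can be sketched with nested Python dicts. The table, row key, and column names below are made-up illustrations, not HBase client code:

```python
# Sketch of HBase's logical data model using plain Python dicts.
# Hypothetical table with two column families, "customer" and "sales".
table = {
    "row-001": {                       # rowid
        "customer": {                  # column family
            "name": "Alice",           # column -> value
            "city": "Pune",
        },
        "sales": {
            "product": "Laptop",
            "amount": "55000",
        },
    },
}

# A cell is addressed by (rowkey, column family, column qualifier):
value = table["row-001"]["customer"]["name"]
print(value)  # Alice
```

Note how rows in the same table need not share columns: a second row could carry a different set of columns inside the same families, which is what makes the schema "define only column families".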

Column Oriented and Row Oriented


Column-oriented databases are those that store data tables as sections of columns of data rather than as rows of data. In short, they have column families.
Row-Oriented Database: suitable for Online Transaction Processing (OLTP); such databases are designed for a small number of rows and columns.

Column-Oriented Database: suitable for Online Analytical Processing (OLAP); such databases are designed for huge tables.

[Figure: column families in a column-oriented database]

HBase and RDBMS


1. HBase is schema-less; it doesn't have the concept of fixed columns and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
2. HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
3. HBase has no transactions. An RDBMS is transactional.
4. HBase holds de-normalized data. An RDBMS holds normalized data.
5. HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
Features of HBase
 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.

Where to Use HBase
 Apache HBase is used to provide random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase runs on top of Hadoop and HDFS.
Applications of HBase
 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase Architecture
 The HBase Physical Architecture consists of servers in a Master-Slave relationship as
shown below. Typically, the HBase cluster has one Master node, called HMaster and
multiple Region Servers called HRegionServer. Each Region Server contains multiple
Regions – HRegions.
 Just like in a Relational Database, data in HBase is stored in Tables and these Tables are
stored in Regions. When a Table becomes too big, the Table is partitioned into multiple
Regions. These Regions are assigned to Region Servers across the cluster. Each Region
Server hosts roughly the same number of Regions.
https://netwoven.com/data-engineering-and-analytics/data-engineering/hbase-overview-of-
architecture-and-data-model/
The HMaster in HBase is responsible for:
 Performing Administration
 Managing and Monitoring the Cluster
 Assigning Regions to the Region Servers
 Controlling the Load Balancing and Failover
On the other hand, HRegionServers perform the following work:
 Hosting and managing Regions
 Splitting the Regions automatically
 Handling the read/write requests
 Communicating with the Clients directly

Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each
Region in turn is made up of a MemStore and multiple StoreFiles (HFile). The data lives in these
StoreFiles in the form of Column Families (explained below). The MemStore holds in-memory
modifications to the Store (data).
The mapping of Regions to Region Servers is kept in a system table called .META. When trying to read or write data in HBase, clients read the required Region information from the .META. table and communicate directly with the appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).
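The start-key (inclusive) / end-key (exclusive) lookup can be sketched in Python. The region boundaries and server names below are hypothetical; a real client caches the .META. mapping rather than scanning it per request:

```python
import bisect

# Hypothetical region map, sorted by start key. An empty start key means
# "from the beginning"; an empty end key means "to the end of the table".
regions = [
    ("",  "g", "regionserver-1"),   # rows [ "", "g" )
    ("g", "p", "regionserver-2"),   # rows [ "g", "p" )
    ("p", "",  "regionserver-3"),   # rows [ "p", ... )
]

def find_region(rowkey: str) -> str:
    """Return the server hosting the region whose range contains rowkey."""
    starts = [start for start, _end, _srv in regions]
    # Rightmost region whose start key is <= rowkey.
    idx = bisect.bisect_right(starts, rowkey) - 1
    start, end, server = regions[idx]
    assert start <= rowkey and (end == "" or rowkey < end)
    return server

print(find_region("apple"))   # regionserver-1
print(find_region("grape"))   # regionserver-2
print(find_region("zebra"))   # regionserver-3
```

Because rows are kept sorted by key, locating the owning region is a single binary search over the region boundaries.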
HBase Data Model
The Data Model in HBase is designed to accommodate semi-structured data that could vary in
field size, data type and columns. Additionally, the layout of the data model makes it easier to
partition the data and distribute it across the cluster. The Data Model in HBase is made of
different logical components such as Tables, Rows, Column Families, Columns, Cells and
Versions.
https://netwoven.com/data-engineering-and-analytics/data-engineering/hbase-overview-of-
architecture-and-data-model/
Tables – HBase Tables are logical collections of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of a Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage, to which certain HBase features like compression are applied. Hence it's important that proper care be taken when designing the Column Families in a table.
The table above shows Customer and Sales Column Families. The Customer Column Family is made up of 2 columns – Name and City – whereas the Sales Column Family is made up of 2 columns – Product and Amount.

Columns – A Column Family is made up of one or more columns. A Column is identified by a Column Qualifier, which consists of the Column Family name concatenated with the Column name using a colon – for example: columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and
the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is
always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the
timestamp. The number of versions of data retained in a column family is configurable and this
value by default is 3.
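The retention rule can be illustrated with a small sketch: a cell keeps its values keyed by timestamp, and once more than the configured number of versions exist, the oldest are evicted. The constant and values below are illustrative, mirroring HBase's default of 3 versions:

```python
# Sketch: a versioned cell that retains at most MAX_VERSIONS values,
# mimicking HBase's default of 3 versions per column family.
MAX_VERSIONS = 3

def put(cell_versions, timestamp, value):
    """Write one version into the cell, then evict versions beyond the limit."""
    cell_versions[timestamp] = value
    # Drop everything except the MAX_VERSIONS newest timestamps.
    for ts in sorted(cell_versions)[:-MAX_VERSIONS]:
        del cell_versions[ts]
    return cell_versions

cell = {}
for ts, val in [(1, "v1"), (2, "v2"), (3, "v3"), (4, "v4")]:
    put(cell, ts, val)

print(sorted(cell))      # [2, 3, 4] -- the version at timestamp 1 was evicted
latest = cell[max(cell)]
print(latest)            # v4
```

Reading a cell without specifying a timestamp returns the newest version, as the last two lines show.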

Using NoSQL Databases for Big Data Storage and Retrieval

The Importance of Big Data Storage and Retrieval


In today’s digital age, data has become the lifeblood of businesses. Companies are generating
massive amounts of data every day, from customer transactions to social media interactions. This
data holds valuable insights that can drive business decisions and help organizations stay ahead of
the competition.

However, the challenge lies in storing and analyzing this vast amount of data in a scalable and
efficient way. Traditional relational databases have been the go-to solution for storing structured
data for decades.
While these databases are reliable and offer a clear schema, they tend to struggle when it comes to
handling large volumes of unstructured or semi-structured data such as text, images, videos or
sensor logs. These limitations have driven the need for new database technologies that can handle
big data more effectively.
NoSQL Databases and Their Benefits
One such technology is NoSQL (Not Only SQL) databases, which provide an alternative approach to storing unstructured or semi-structured big data. Unlike traditional relational databases that use a fixed schema defined before any data is stored, NoSQL databases allow you to store unstructured or semi-structured big data without predefined table structures. NoSQL databases are designed to be highly scalable, so they can handle vast amounts of unstructured or semi-structured big data with ease.
Additionally, these databases offer flexible storage models that allow you to store different types
of information in various formats such as JSON documents, key-value pairs or column-family
stores. Another advantage of NoSQL databases is their high availability due to their distributed
architecture which allows them to maintain uptime even during hardware failures by replicating
and distributing copies across servers.
NoSQL databases offer significant advantages over traditional relational databases when it comes
to handling large volumes of unstructured or semi-structured big-data efficiently. In the next
sections, we’ll dive into the different types of NoSQL databases, their use cases, and best
practices for implementing them in your organization.
Advantages of Using NoSQL Databases for Big Data Storage
Scalability: ability to handle large amounts of data with ease
One of the most significant advantages of using a NoSQL database for big data storage is its
scalability. Traditional relational databases are designed to work on single servers, which can lead
to performance issues when handling large amounts of data. In contrast, NoSQL databases are
designed to scale horizontally across multiple servers, meaning that they can handle vast amounts
of data without any issues.

The ability to scale horizontally makes NoSQL databases ideal for applications that need to store
and process large volumes of data, such as social media platforms and e-commerce sites. By
allowing for seamless scalability, businesses can easily adapt their database infrastructure as their
needs change over time.
Flexibility: ability to store data in various formats without predefined schema
NoSQL databases offer a great deal more flexibility than traditional relational databases. Unlike
relational databases that require users to define a schema before storing any data, NoSQL
databases allow users to store data in various formats without predefined schema.
This flexibility means that businesses can store virtually any type of data in the database without
worrying about the need for costly and time-consuming schema changes. For example, document-
based NoSQL databases like MongoDB allow users to store and retrieve JSON documents
seamlessly.
High availability: ability to maintain uptime even during hardware failures
Another significant advantage of using a NoSQL database for big data storage is its high
availability. Traditional relational databases often experience downtime during hardware failures
or maintenance windows because they run on single servers that cannot tolerate outages or
failures. In contrast, most NoSQL databases are designed with high availability in mind.
They offer features like automatic failover and replication that help ensure that, even in the event of a hardware failure or maintenance window, downtime is minimized. This high level of
availability makes NoSQL databases perfect for applications that require uninterrupted access to
data, such as online banking platforms or healthcare information systems.
Conclusion
NoSQL databases offer many advantages over traditional relational databases when it comes to
big data storage and retrieval. Their ability to scale horizontally, their flexibility, and their high
availability make them ideal for businesses that need to store and process large amounts of data
reliably.
Whether you’re running a social media platform, an e-commerce site, or a healthcare information
system, switching to a NoSQL database can help ensure your infrastructure is scalable, flexible,
and highly available. With the right approach to implementation and management, NoSQL
databases can help businesses unlock the full potential of their big data assets.

Types of NoSQL Databases for Big Data Storage


NoSQL databases are designed to handle unstructured, semi-structured, and structured data
without the need for a predefined schema. They have become increasingly popular in recent years
because of their scalability, flexibility, and high availability. There are different types of NoSQL
databases available in the market that can be used for big data storage.
Document-based databases (e.g. MongoDB)
Document-based databases are designed to store data in the form of documents instead of tables
with rows and columns. These documents can be JSON or BSON format, which allows them to
store nested data structures easily.
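For instance, a record that would span several joined tables in an RDBMS fits in one nested document. A sketch using Python's json module; the field names and values are made up for illustration:

```python
import json

# Hypothetical social-media post stored as one nested document.
doc_text = """
{
  "post_id": "p42",
  "author": {"name": "Alice", "followers": 1200},
  "tags": ["bigdata", "nosql"],
  "comments": [
    {"user": "Bob", "text": "Nice post"},
    {"user": "Carol", "text": "+1"}
  ]
}
"""
post = json.loads(doc_text)

# Nested fields are reachable directly, without any joins:
print(post["author"]["name"])   # Alice
print(len(post["comments"]))    # 2
print("nosql" in post["tags"])  # True
```

The nested author and comments would otherwise be separate tables linked by foreign keys; the document keeps them together in the shape the application uses.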
MongoDB is one such database that is widely used for this purpose. MongoDB offers a lot of
benefits when it comes to big data storage.
One notable feature is its ability to scale horizontally by adding more nodes to a cluster without
any downtime. Another advantage is its rich query language that allows users to perform complex
queries on large datasets quickly and efficiently.
Benefits of Document-based Databases
One benefit of using document-based databases is their ability to handle unstructured data
effectively. This makes them a great choice for applications that deal with social media content or
user-generated content where the structure of the data might change over time. Another benefit is
their support for nested arrays and objects, which allows them to store hierarchical structures
easily without having to normalize them into multiple tables.

Drawbacks of Document-based Databases
One drawback of document-based databases is their lack of support for transactions across
multiple documents. This means that if one document fails during an operation, it might leave
other documents in an inconsistent state.
Another drawback is their limited support for joins between collections/documents within the database system itself; clients must perform these operations externally, which can add overhead and complexity at the application layer.
Key-value stores (e.g. Redis)
Key-value stores are designed to store data in the form of key-value pairs. The keys are unique
identifiers for the data, and the values can be anything from simple strings to complex objects.
Redis is a popular key-value store that is widely used in big data storage. Redis is known for its
high performance and scalability, making it a great choice for applications that require fast read
and write operations on large datasets.
Benefits of Key-value Stores
One benefit of using key-value stores is their ability to handle high write loads efficiently. They
can scale horizontally by adding more nodes to a cluster without any downtime. Another benefit
is their support for atomic operations on individual keys, which ensures consistency across
multiple clients accessing the same set of keys.
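The key-value model itself is simple enough to sketch as a thin wrapper over a map. The class below is a toy loosely modeled on Redis's GET/SET/INCR commands (the names and locking scheme are illustrative, not actual Redis client code):

```python
import threading

class KVStore:
    """Minimal in-memory key-value store with an atomic-style increment."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # one lock stands in for per-key atomicity

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def incr(self, key):
        # Read-modify-write under one lock, so concurrent callers
        # never observe a torn update for this key.
        with self._lock:
            self._data[key] = int(self._data.get(key, 0)) + 1
            return self._data[key]

store = KVStore()
store.set("user:1:name", "Alice")
store.incr("page:home:views")
store.incr("page:home:views")
print(store.get("user:1:name"))      # Alice
print(store.get("page:home:views"))  # 2
```

The colon-delimited key names mirror a common Redis convention for namespacing related keys.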
Drawbacks of Key-Value Stores
One drawback of key-value stores is their limited ability to perform complex queries on large
datasets. Since they do not have any predefined schema or indexes, performing complex queries
requires clients to scan through all the keys/values in the database, which can be slow and
inefficient. Another drawback is their lack of support for joins between different sets of data
within the database system itself.
Column-family stores (e.g. Cassandra)
Column-family stores are designed to store data in columns instead of rows like traditional
relational databases. They are an excellent choice when dealing with semi-structured or structured
data that require fast read/write operations at scale. Cassandra is a widely used column-family
store in big data storage.

Cassandra offers a lot of benefits when it comes to big data storage, especially when dealing with
large-scale distributed systems. One notable feature is its support for linear scalability, allowing
users to add more nodes easily as needed without any downtime.
Benefits of Column-Family Stores
One benefit of using column-family stores is their ability to handle large datasets with ease. They
can store data in a distributed fashion, allowing for fast read/write operations even when dealing
with large-scale datasets. Another benefit is their support for column-level indexes, which allows
users to perform complex queries on specific columns quickly and efficiently.
Drawbacks of Column-family Stores
One drawback of column-family stores is their limited support for joins between different tables
within the database system itself. This requires clients to perform these operations externally,
which can add overheads and complexity on an application layer. Another drawback is their lack
of support for transactions across multiple columns, which makes it difficult to ensure consistency
across different sets of data within the database system itself.
Use Cases for NoSQL Databases in Big Data Storage and Retrieval
Social Media Analytics: Storing and Analyzing Vast Amounts of User-Generated Content
Social media platforms generate an enormous amount of data daily. From status updates, tweets,
shares, likes, comments to photos and videos, social media platforms store a wide range of user-
generated content.
NoSQL databases are capable of storing such large amounts of unstructured data in various
formats with ease. Analyzing this vast amount of data can be a daunting task for traditional
databases due to their rigid structure.
However, using NoSQL databases like MongoDB can help users analyze social media data more
efficiently by querying multiple collections at once. It allows businesses to gain valuable insights
from user behavior patterns that can help them improve their marketing strategies.
For instance, Twitter built a distributed graph store called FlockDB, backed by sharded MySQL, to store followers' relationships and real-time counts for millions of users without compromising performance or availability.

IoT Applications: Managing and Analyzing Sensor Data from Connected Devices
The Internet of Things (IoT) is another use case that benefits greatly from NoSQL databases. IoT
devices generate huge volumes of sensor data such as temperatures, humidity levels, motion
detection readings among others.
NoSQL databases like Apache Cassandra are ideal for handling high volume writes and reads in
real-time with low latency. They provide the scalability needed to handle the increasing number
of IoT devices being connected while maintaining high availability.
One example is Philips Lighting’s Hue smart lighting system which uses Redis as its primary
datastore for storing sensor readings from connected light bulbs across the globe. The system
enables granular control over each connected light bulb through an API while providing real-time
feedback via the dashboard.
E-commerce Platforms: Handling Large Volumes of Transactional Data
E-commerce platforms are another area NoSQL databases have proven to be useful. E-commerce
businesses generate a vast amount of transactional data daily, including orders, product
information, and customer details. NoSQL databases like Couchbase are ideal for handling this
type of data thanks to their flexible schema design.
They provide the scalability to handle high volumes of reads and writes with low latency while
ensuring high availability. For instance, Walmart uses Cassandra as its primary datastore for
handling product catalog data that is updated in real-time.
The system allows for fast retrieval of product information by customers while providing real-time inventory counts. Using NoSQL databases for big data storage and retrieval is becoming more popular due to their flexibility, scalability, and ability to handle unstructured data.
Social media analytics, IoT applications, and e-commerce platforms are three use cases where
NoSQL databases can deliver excellent results. By choosing the right type of database based on
specific requirements along with proper security measures in place, businesses can tap into the
power of NoSQL databases for big data storage and retrieval.
Best Practices for Implementing NoSQL Databases for Big Data Storage
Choosing the Right Type of Database Based on Specific Use Case Requirements
When it comes to implementing NoSQL databases for big data storage, choosing the right type of
database is crucial. The three main types of NoSQL databases are document-based, key-value
stores, and column-family stores.

Each type has its own unique strengths and weaknesses that make them better suited for specific
use cases. Document-based databases like MongoDB are ideal for storing unstructured data such
as text documents or social media posts.
Key-value stores like Redis are great for storing simple data structures such as user profiles or
session data. Column-family stores like Cassandra are best suited for handling large amounts of
time-series data such as IoT sensor readings or financial transactions.
It’s important to carefully consider your use case requirements before selecting a NoSQL
database. You should also evaluate factors like scalability, flexibility, performance, and ease of
maintenance when making your decision.
Designing a Scalable Architecture That Can Handle Future Growth
One of the biggest advantages of using NoSQL databases is their ability to scale horizontally to
handle massive amounts of data. However, designing a scalable architecture that can handle future
growth requires careful planning and consideration.
One approach is to use a sharding strategy where data is distributed across multiple servers based
on predefined criteria such as geographic location or user ID. This helps distribute the workload
and improve performance while allowing you to easily add new servers as needed.
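Sharding by user ID can be sketched as a hash of the key modulo the server count (the server names are hypothetical):

```python
import hashlib

# Hypothetical shard layout: route each user's data to one of N servers
# by hashing the user ID, so the workload spreads evenly across servers.
SERVERS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for(user_id: str) -> str:
    """Map a user ID deterministically to one server."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# The same user always maps to the same server:
print(shard_for("user-42") == shard_for("user-42"))  # True
print(shard_for("user-42") in SERVERS)               # True
```

One caveat of this simple modulo scheme: adding a server remaps most keys. Production systems usually prefer consistent hashing, which moves only a fraction of keys when the server list changes.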
Another approach is to use a replication strategy where data is duplicated across multiple servers
in different locations in case one server fails. This helps ensure high availability and minimize
downtime while also improving response times by allowing users to access their nearest replica.
Ensuring Proper Security Measures Are in Place to Protect Sensitive Data
With big data comes big responsibility when it comes to data security. It’s important to ensure
proper security measures are in place to protect sensitive data from unauthorized access, theft, or
manipulation.
One approach is to implement strict access control policies that limit who can access sensitive
data and what actions they can perform on it. This can be done through user authentication and
authorization protocols such as OAuth or SAML.
Another approach is to use encryption techniques like SSL/TLS for securing data in transit and
AES for securing data at rest. This helps prevent eavesdropping and interception of sensitive data
by cybercriminals.

Regular security audits and vulnerability scans should be performed to identify potential
weaknesses or vulnerabilities that could be exploited by attackers. This helps ensure your NoSQL
database remains secure over time.
Conclusion
Implementing NoSQL databases for big data storage requires careful consideration of factors like
use case requirements, scalability, performance, and security. By choosing the right type of
database based on your specific needs, designing a scalable architecture that can handle future
growth, and ensuring proper security measures are in place to protect sensitive data, you can reap
the benefits of NoSQL databases while minimizing risks associated with storing large amounts of
valuable information.
Challenges and Limitations of Using NoSQL Databases for Big Data Storage and Retrieval
Scalability Challenges
One of the primary challenges with using NoSQL databases for big data storage is ensuring
scalability. While NoSQL databases are designed to be scalable, there are still challenges that
need to be addressed.
One major challenge is the need to continually add new nodes as data grows. This requires a lot of
planning and coordination to ensure that the system remains cohesive and continues to function
well.
Complexity Limitations
Another limitation of using NoSQL databases for big data storage is their complexity. While they
are highly flexible, this also means that they can be more difficult to work with than traditional
SQL databases. Developers must understand how different types of NoSQL databases work, as
well as which type is best suited for their particular use case.
Data Consistency Challenges
Data consistency can also be a challenge when using NoSQL databases for big data storage.
Unlike SQL databases, which have strong consistency guarantees, many NoSQL databases only
offer eventual consistency. This means that it may take some time for updates made in one part of
the database to propagate throughout the entire system.
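Eventual consistency can be illustrated with a toy two-replica simulation (pure Python; the replica structure and replication queue are made up for illustration):

```python
# Toy simulation of eventual consistency: a write lands on one replica
# and only reaches the other when the (simulated) replication step runs.
primary = {}
replica = {}
pending = []  # replication queue

def write(key, value):
    primary[key] = value
    pending.append((key, value))  # replicated later, not immediately

def replicate():
    """Drain the queue, bringing the replica up to date."""
    while pending:
        key, value = pending.pop(0)
        replica[key] = value

write("balance", 100)
print(replica.get("balance"))  # None -- the replica is stale for a while
replicate()
print(replica.get("balance"))  # 100  -- the replicas converge eventually
```

A client reading from the replica in the window between `write` and `replicate` sees stale data; that window is exactly what "eventual" refers to.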

Security Limitations
Another limitation of using NoSQL databases is security concerns. Many popular NoSQL
databases lack built-in security features and may require additional customization or integration
with third-party tools in order to provide robust security capabilities.
Conclusion
Overall, while there are certainly some challenges and limitations associated with using NoSQL
databases for big data storage, these drawbacks must be weighed against the significant benefits
provided by these tools. With their ability to handle large amounts of unstructured data quickly
and easily, they are an essential component in any modern big data architecture.
As technology continues to evolve rapidly in this space, we can expect to see continued
innovation and refinement of NoSQL databases, as well as the emergence of new tools and
techniques for managing big data. Ultimately, the key to success with these tools will lie in
understanding the strengths and limitations of each type of NoSQL database and crafting a system
that is tailored to your specific needs.

NoSQL Data Models: Key-Value Stores, Column-Based Stores, Graph-Based Stores, Document-Based Stores

There are nearly a dozen types of database. Some of the more commonly used categories of
database include:

Hierarchical Databases
Developed in the 1960s, the hierarchical database looks similar to a family tree. A single object (the “parent”) has one or more objects beneath it (the “children”). No child can have more than one parent. In exchange for the rigid and complex navigation of the parent-child structure, the hierarchical database offers high performance, with easy access and quick querying. The Windows Registry is one example of this system.
Relational Databases
Relational databases are a system designed in the 1970s. This database commonly uses
Structured Query Language (SQL) for operations like creating, reading, updating, and deleting
(CRUD) data.

This database stores data in discrete tables, which can be joined together by fields known as
foreign keys. For example, you might have a User table that contains data about your users, and
join the users table to a Purchases table, which contains data about the purchases the users have
made. MySQL, Microsoft SQL Server, and Oracle are examples.
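That Users/Purchases foreign-key join can be sketched with Python's built-in sqlite3 module (the table and column names here are illustrative, echoing the example above):

```python
import sqlite3

# In-memory relational sketch: Purchases.user_id is a foreign key into Users.id.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Purchases (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES Users(id),
        product TEXT
    );
    INSERT INTO Users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Purchases VALUES (1, 1, 'Laptop'), (2, 1, 'Mouse'), (3, 2, 'Desk');
""")

# Join the two tables through the foreign key:
rows = con.execute("""
    SELECT Users.name, Purchases.product
    FROM Users JOIN Purchases ON Purchases.user_id = Users.id
    ORDER BY Purchases.id
""").fetchall()
print(rows)  # [('Alice', 'Laptop'), ('Alice', 'Mouse'), ('Bob', 'Desk')]
```

The foreign key keeps each purchase attached to exactly one user while the user's details are stored only once — the normalization that document stores deliberately trade away.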
Non-Relational Databases
Non-relational management systems are commonly referred to as NoSQL databases. This type of database matured as modern web applications grew increasingly complex, and its varieties have proliferated over the last decade. Examples include MongoDB and Redis.
Object-oriented databases
Object-oriented databases store and manage objects on a database server's disk. They are unique because associations between objects can persist, which makes object-oriented programming and the querying of data across complex relationships fast and powerful. One example of an object-oriented database is MongoDB Realm, where the query language constructs native objects through your chosen SDK. Object-oriented programming remains one of the most popular programming paradigms.
All about NoSQL
NoSQL is an umbrella term for any alternative to traditional SQL databases. Sometimes, when we
say NoSQL management systems, we mean any database that doesn't use a relational model. NoSQL
databases use data models structured differently from the row-and-column tables used with an
RDBMS.

NoSQL databases are different from each other. There are four kinds of this database: document
databases, key-value stores, column-oriented databases, and graph databases.
Document databases
A Document Data Model differs from other data models because it stores data in JSON, BSON, or
XML documents. In this model, documents can be nested within other documents, and particular
elements can be indexed to run queries faster. Documents are stored and retrieved in a form
close to the data objects used in applications, which means very little translation is required
to use the data in an application. JSON is commonly used both to store the data and to express
queries against it.

So in the document data model, each document is a set of key-value pairs, as in the example
below:

{
  "Name" : "Yashodhra",
  "Address" : "Near Patel Nagar",
  "Email" : "yahoo123@yahoo.com",
  "Contact" : "12345"
}

Working of Document Data Model:

This is a semi-structured data model: each record and the data associated with it are stored in
a single document, which means the model is not completely unstructured. The key point is that
the unit of storage is the document.

Features:
 Document Type Model: Because data is stored in documents rather than tables or graphs, it is
easy to map documents to objects in many programming languages.
 Flexible Schema: The schema is flexible; not all documents in a collection need to have the
same fields.
 Distributed and Resilient: Document data models are highly distributed, which enables
horizontal scaling and distribution of data.
 Manageable Query Language: The query language allows developers to perform CRUD (Create,
Read, Update, Delete) operations on the data model.
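As a rough illustration of these features, here is a toy in-memory "collection" in Python that supports flexible-schema documents and CRUD operations. Real document stores such as MongoDB add indexing, persistence, and a richer query language; all names below are invented for the example:

```python
# A toy in-memory document collection illustrating the flexible-schema,
# key-value-pair nature of the document model.
class Collection:
    def __init__(self):
        self._docs = {}      # _id -> document (a plain dict)
        self._next_id = 1

    def create(self, doc):
        doc = dict(doc, _id=self._next_id)
        self._docs[doc["_id"]] = doc
        self._next_id += 1
        return doc["_id"]

    def read(self, query):
        # Return every document whose fields match the query dict.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in query.items())]

    def update(self, _id, changes):
        self._docs[_id].update(changes)

    def delete(self, _id):
        del self._docs[_id]

people = Collection()
pid = people.create({"Name": "Yashodhra", "Contact": "12345"})
# Documents in one collection need not share the same fields:
people.create({"Name": "Ravi", "Email": "ravi@example.com"})
people.update(pid, {"Address": "Near Patel Nagar"})
print(people.read({"Name": "Yashodhra"}))
```

Note how the two documents carry different fields, yet both live in the same collection and are queried the same way.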
Examples of Document Data Models :
 Amazon DocumentDB
 MongoDB
 Cosmos DB
 ArangoDB
 Couchbase Server

 CouchDB
Advantages:
 Schema-less: These databases are very good at retaining existing data at massive volumes
because there are no restrictions on the format or structure of the stored data.
 Faster creation and maintenance of documents: It is very simple to create a document, and
maintenance requires almost no effort.
 Open formats: The build process is simple, using open formats such as JSON and XML.
 Built-in versioning: As documents grow in size they may also grow in complexity; built-in
versioning decreases conflicts.
Disadvantages:
 Weak Atomicity: Document databases often lack support for multi-document ACID transactions.
A change involving two collections requires two separate queries, one per collection, which
breaks atomicity requirements.
 Consistency Check Limitations: One can search collections and documents that are not
connected to an author collection, but doing so may hurt database performance.
 Security: Many web applications lack adequate security, which can result in leakage of
sensitive data, so attention to web app vulnerabilities is essential.
Applications of Document Data Model :
 Content Management: These data models are widely used in video streaming platforms, blogs,
and similar services, because each item is stored as a single document and the database is
easier to maintain as the service evolves over time.
 Book Database: They are useful for book databases because the model lets us nest related
data within a document.
 Catalog: They are well suited to storing and reading catalog files, thanks to fast reads even
when a catalog has thousands of attributes.
 Analytics Platform: These data models are also heavily used in analytics platforms.

Key-value stores
This is the simplest type of NoSQL database. Every element is stored as a key-value pair
consisting of an attribute name ("key") and a value. This database is like an RDBMS with two
columns: the attribute name (such as "state") and the value (such as "Alaska").

Use cases for NoSQL databases include shopping carts, user preferences, and user profiles.
A key-value database is a type of non-relational database, also known as NoSQL database, that
uses a simple key-value method to store data. It stores data as a collection of key-value pairs in
which a key serves as a unique identifier. Both keys and values can be anything, ranging from
simple objects to complex compound objects. Key-value databases (or key-value stores) are
highly partitionable and allow horizontal scaling at a level that other types of databases cannot
achieve.
What are the advantages of key-value databases?
Traditional relational databases (SQL databases) store data in the form of tables containing rows
and columns. They enforce a rigid structure on data and are not optimal for every use case. On
the other hand, key-value databases are NoSQL databases. They allow flexible database schemas
and improved performance at scale for certain use cases. The advantages of key-value stores
include:

Scalability
As every user query requires data interaction, databases can often become a bottleneck in
application performance. Several strategies to solve the issue, such as replication and sharding,
add complexity to the application code. Many key-value databases provide built-in support for
advanced scaling features. They scale horizontally and automatically distribute data across
servers to reduce bottlenecks at a single server.
Ease of use
Key-value databases follow the object-oriented paradigm that allows developers to map real-
world objects directly to software objects. Several programming languages, such as Java, also
follow the same paradigm. Instead of mapping their code objects to multiple underlying tables,
engineers can create key-value pairs that match their code objects. This makes key-value stores
more intuitive for developers to use.

Performance
Key-value databases process constant read-write operations with low-overhead server calls.
Improved latency and reduced response time give better performance at scale. They are based on
simple, single-table structures rather than multiple interrelated tables. Unlike relational
databases, key-value databases don't have to perform resource-intensive table joins, which makes
them much faster.
What are the use cases of key-value databases?
You can use key-value database systems as the primary database for your application or to
handle niche requirements. We give some example key-value database use cases below.
Session management
A session-oriented application, such as a web application, starts a session when a user logs in to
an application and is active until the user logs out or the session times out. During this period, the
application stores all user session attributes either in the main memory or in a database. User
session data may include profile information, messages, personalized data and themes,
recommendations, targeted promotions, and discounts.

Each user session has a unique identifier. Session data is never queried by anything other than a
primary key, so a fast key-value store is a better fit for session data. In general, key-value
databases may provide smaller per-page overhead than relational databases.
Shopping cart
An e-commerce website may receive an enormous number of orders per second during the holiday
shopping season. A key-value database can handle the scaling of large amounts of data and
extremely high volumes of state changes, while also servicing millions of simultaneous users
through distributed processing and storage. Key-value stores also have built-in redundancy,
which can handle the loss of storage nodes.
Metadata storage engine
Your key-value store can act as an underlying storage layer for higher levels of data access. For
example, you can scale throughput and concurrency for media and entertainment workloads such
as real-time video streaming and interactive content. You can also build out your game platform
with player data, session history, and leaderboards for millions of concurrent users.

Caching
You can use a key-value database for storing data temporarily for faster retrieval. For example,
social media applications can store frequently accessed data like news feed content. In-memory
data caching systems also use key-value stores to accelerate application responses.
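A minimal sketch of this caching pattern, assuming a simple lazy-expiry design: real caches such as Redis implement time-to-live (TTL) server-side and at far larger scale, and the key names here are invented:

```python
import time

# A minimal TTL cache over a plain dict. Expiry is checked lazily on read,
# a common simplification; production caches also evict in the background.
class TTLCache:
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.monotonic() >= expires_at:
            del self._data[key]   # expired: evict lazily
            return None
        return value

feed = TTLCache()
feed.set("user:42:newsfeed", ["post1", "post2"], ttl_seconds=0.05)
print(feed.get("user:42:newsfeed"))  # ['post1', 'post2']
time.sleep(0.06)
print(feed.get("user:42:newsfeed"))  # None (expired)
```

The key acts as the lookup question ("user 42's newsfeed") and the value as the cached answer, exactly the key-value access pattern described above.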
How do key-value databases work?
Key-value databases work by organizing all data as a set of key-value pairs. You can think of the
key as a question and the value as the answer to the question. In the example below, the primary
key is a composite of two keys, Product ID and Type. The Product ID is the partition key, which
determines the partition in which the item will be stored. The Type is the sort key, which
determines the order in which items are stored on disk. The combination of the partition key
and the sort key forms a unique primary key, which maps to a single value in the database.

In this example, the data object book has attributes like title, author, and publishing date. Every
book data object has a key called BookID. You can directly link the BookID and associated book
object in the key-value store. In addition, you can retrieve data by looking up the BookID in the
table. Also, each item has its own schema, making key-value stores highly flexible for storing
data of varying structures.
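The partition-key/sort-key scheme above can be sketched in a few lines of Python. The hashing scheme, the number of partitions, and the item names are all illustrative assumptions, not how any particular database implements this internally:

```python
# Sketch of a key-value table with a composite primary key, loosely modeled
# on the partition-key/sort-key scheme described above. The hash of the
# partition key picks a "server"; items within a partition are kept ordered
# by the sort key.
NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]

def put(partition_key, sort_key, value):
    p = hash(partition_key) % NUM_PARTITIONS
    partitions[p][(partition_key, sort_key)] = value

def get(partition_key, sort_key):
    p = hash(partition_key) % NUM_PARTITIONS
    return partitions[p].get((partition_key, sort_key))

def query(partition_key):
    # All items sharing one partition key, ordered by sort key.
    p = hash(partition_key) % NUM_PARTITIONS
    return sorted((k[1], v) for k, v in partitions[p].items()
                  if k[0] == partition_key)

put("Book-101", "Hardcover", {"title": "Graph Theory", "author": "Bondy"})
put("Book-101", "Paperback", {"title": "Graph Theory", "author": "Bondy"})
print(get("Book-101", "Paperback")["title"])
print([t for t, _ in query("Book-101")])
```

Because both items share the partition key "Book-101", they land on the same partition, and the sort key keeps them in a defined order for range queries.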

What are the features of key-value databases?
Depending on the solution you choose, your key-value store can provide several additional
features as listed below.
Support for complex data types
Key-value stores support basic data types like integers and text. However, many of them can
also support more complex objects like arrays, nested dictionaries, images, videos, and
semi-structured data. Giving the database more information about your data leaves room for
more storage and query performance optimization.
No need for table joins
Key-value databases don't need to perform any resource-intensive table joins. Their flexibility
accommodates all the needed information in a single table. This is one of the reasons key-value
stores perform so well.
Sorted keys
A key-value store can sort keys so that data is stored systematically, which also serves as a
basis for partitioning. For example, keys may be sorted:

 Alphabetically or numerically

 Chronologically

 By data size

Consider a key-value store that uses the customer's email address as the unique key. Email
addresses can be sorted alphabetically, so all data for A-J email lists are stored on server 1, K-S
on server 2, and so on.
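The alphabetical partitioning described above amounts to range-based routing on sorted keys. A minimal sketch, with the A-J / K-S / T-Z boundaries as assumptions from the example:

```python
import bisect

# Range partitioning over sorted keys: boundary letters route each email
# address to a server, as in the A-J / K-S example above.
BOUNDARIES = ["k", "t"]   # server 0: a-j, server 1: k-s, server 2: t-z

def server_for(email):
    # bisect_right finds which boundary range the first letter falls into.
    return bisect.bisect_right(BOUNDARIES, email[0].lower())

print(server_for("bob@example.com"))    # 0  ('b' falls in a-j)
print(server_for("maria@example.com"))  # 1  ('m' falls in k-s)
print(server_for("zoe@example.com"))    # 2  ('z' falls in t-z)
```

Because the keys are sorted, each server holds a contiguous range, which makes range scans within one server cheap.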
Secondary key support
Some key-value stores allow you to define two or more different keys or secondary indexes to
access the same data. For example, you can store customer data keyed by email address and also
by phone number.
Replication
Many key-value stores offer built-in replication support by automatically copying data across
multiple storage nodes. This helps with auto-recovery from disasters; you still have your data in
case of server failure.

Partitioning
Partitioning is how you distribute data across nodes. Many key-value databases provide default
partitioning options. Some also give you the option to define input parameters for your partitions.
For example, you could partition numerical keys into groups of 1000. Advanced key-value
databases also provide automatic support for distributing your key-value database across
multiple geographical locations. This improves application availability and reliability because
you can respond to queries close to the user's location.
ACID support
Atomicity, Consistency, Isolation, and Durability (ACID) are database properties that ensure
data accuracy and reliability in all circumstances. For instance, if you are making multiple
changes to your data in a sequence, atomicity requires that all changes go through in order. If one
change fails, everything fails.

Advanced key-value databases provide native, server-side support for ACID. This simplifies the
developer experience of making coordinated, all-or-nothing changes to multiple items both
within and across tables. With transaction support, developers can extend the scale, performance,
and enterprise benefits to a broader set of mission-critical workloads.
What are the limitations of key-value databases?
Key-value databases do require some trade-offs, as with any kind of technology choice.
Absence of complex queries
As key-value databases don't support complex queries, developers must work around this in the
code. Data operations are mainly through simple query language terms like get, put, and delete.
There are limitations to how much you can filter and sort data before accessing it.
Schema mismanagement
Key-value store design does not enforce a schema on developers. Anyone can modify the
schema in the database program. Development teams have to plan the data model systematically
to avoid long-term problems. The lack of a tight schema also means that the application is
responsible for the proper interpretation of the data it consumes, often referred to as 'schema on
read'.

How can AWS support your key-value database requirements?
Amazon DynamoDB is one of the most popular key-value databases designed to run high-
performance applications at any scale. It's a fully managed, multi-region, multi-active
database that provides features like:

 Limitless scalability, including scale-to-zero, with consistent single-digit millisecond latency.

 Serverless with no version upgrades, no maintenance windows, and no servers or software to
manage.

 Designed for 99.999% availability, with DynamoDB Global Tables providing active-active
replication so you can build globally distributed applications with local read performance.

 Highly secure and reliable with default encryption at rest, point-in time recovery, on-demand
backup and restore, and more.

 Easy to use, with integrations with many AWS services including Amazon DynamoDB Accelerator
(DAX) compatibility, bulk import/export from Amazon S3, Amazon Kinesis Data Streams, Amazon
CloudWatch, and more.

Column-oriented databases
The Columnar Data Model is an important NoSQL model. NoSQL databases differ from SQL databases
because they use a data model with a different structure than the row-and-column table model of
relational database management systems (RDBMS). NoSQL databases use a flexible schema model
designed to scale horizontally across many servers, which suits large volumes of data.
Columnar Data Model of NoSQL :
While a relational database stores data in rows and reads it row by row, a column store is
organized as a set of columns. So if someone wants to run analytics on a small number of
columns, those columns can be read directly without wasting memory on unwanted data. Values in
a column are of the same type and therefore benefit from more efficient compression, which
makes reads faster. Examples of the Columnar Data Model: Cassandra and Apache HBase.

Working of Columnar Data Model:

The Columnar Data Model organizes information into columns instead of rows, while still
representing the same logical tables found in relational databases. This type of data model is
much more flexible, as expected of a NoSQL database. The example below helps in understanding
the Columnar Data Model:

Row-Oriented Table:

S.No.  Name       Course  Branch       ID
01.    Tanmay     B-Tech  Computer     2
02.    Abhishek   B-Tech  Electronics  5
03.    Samriddha  B-Tech  IT           7
04.    Aditi      B-Tech  E & TC       8

Column-Oriented Tables:

S.No.  Name       ID
01.    Tanmay     2
02.    Abhishek   5
03.    Samriddha  7
04.    Aditi      8

S.No.  Course  ID
01.    B-Tech  2
02.    B-Tech  5
03.    B-Tech  7
04.    B-Tech  8

S.No.  Branch       ID
01.    Computer     2
02.    Electronics  5
03.    IT           7
04.    E & TC       8

Columnar Data Model uses the concept of keyspace, which is like a schema in relational
models.
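The row-oriented table above can be sketched column-wise in Python to show why single-column aggregations are cheap. The layout is purely illustrative, not how any particular column store arranges bytes on disk:

```python
# The row-oriented table above, stored column-wise: one Python list per
# column. An aggregation that touches a single column reads only that list.
columns = {
    "Name":   ["Tanmay", "Abhishek", "Samriddha", "Aditi"],
    "Course": ["B-Tech", "B-Tech", "B-Tech", "B-Tech"],
    "Branch": ["Computer", "Electronics", "IT", "E & TC"],
    "ID":     [2, 5, 7, 8],
}

# Count B-Tech enrolments without touching Name, Branch, or ID:
btech = sum(1 for c in columns["Course"] if c == "B-Tech")
print(btech)  # 4

# Reconstructing a full row (here row index 2) needs one lookup per column,
# which is one reason row-at-a-time OLTP workloads fit this model poorly.
row2 = {col: vals[2] for col, vals in columns.items()}
print(row2["Name"])  # Samriddha
```

Note also that a column of identical values like "Course" compresses extremely well (e.g. run-length encoding would store it as a single entry plus a count).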
Advantages of Columnar Data Model :
 Well structured: Since these data models compress well, storage is very structured and well
organized.
 Flexibility: There is a large amount of flexibility, as columns do not have to look like each
other; new and different columns can be added without disrupting the whole database.
 Aggregation queries are fast: Aggregation queries are quick because most of the needed
information is stored in a single column. An example would be adding up the total number of
students enrolled in one year.
 Scalability: It can be spread across large clusters of machines, even numbering in the
thousands.
 Load Times: Load times are excellent; a row table can easily be loaded in a few seconds.
Disadvantages of Columnar Data Model:
 Designing an indexing schema: Designing an effective, working schema is difficult and very
time-consuming.
 Suboptimal data loading: Incremental data loading is suboptimal and should be avoided, though
this may not be an issue for some users.
 Security vulnerabilities: If security is a priority, note that the columnar data model lacks
inbuilt security features; in that case, one should look into relational databases.
 Online Transaction Processing (OLTP): OLTP applications are not a good fit for columnar data
models because of the way data is stored.
Applications of Columnar Data Model:
 The Columnar Data Model is widely used in various blogging platforms.
 It is used in content management systems like WordPress, Joomla, etc.
 It is used in systems that maintain counters.
 It is used in systems that require heavy write requests.
 It is used in services that have expiring usage.

Graph databases

A graph database is a systematic collection of data that emphasizes the relationships between the
different data entities. The NoSQL database uses mathematical graph theory to show data
connections. Unlike relational databases, which store data in rigid table structures, graph
databases store data as a network of entities and relationships. As a result, these databases often
provide better performance and flexibility as they are more suited for modeling real-world
scenarios.

What is a graph?
The term “graph” comes from the field of mathematics. A graph contains a collection of nodes
and edges.
Nodes
Nodes are vertices that store the data objects. Each node can have an unlimited number and types
of relationships.
Edges
Edges represent relationships between nodes. For example, edges can describe parent-child
relationships, actions, or ownership. They can represent both one-to-many and many-to-many
relationships. An edge always has a start node, end node, type, and direction.
Properties
Each node has properties or attributes that describe it. In some cases, edges have properties as
well. Graphs with properties are also called property graphs.
Graph example
The following property graph shows an example of a social network graph. Given the people
(nodes) and their relationships (edges), you can find out who the "friends of friends" of a
particular person are—for example, the friends of Howard's friends.
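A sketch of the friends-of-friends query over such a property graph, using a plain adjacency dict; every name besides Howard is invented for the example:

```python
# The social-graph example above as an adjacency structure: each node maps
# to the set of its "friend" edges (symmetric relationships).
friends = {
    "Howard": {"Alice", "Bob"},
    "Alice":  {"Howard", "Carol"},
    "Bob":    {"Howard", "Dan"},
    "Carol":  {"Alice"},
    "Dan":    {"Bob"},
}

def friends_of_friends(person):
    result = set()
    for friend in friends.get(person, ()):
        result |= friends[friend]
    # Exclude the person themselves and their direct friends.
    return result - friends[person] - {person}

print(friends_of_friends("Howard"))  # {'Carol', 'Dan'}
```

In a graph database this traversal follows stored edges directly, with no join computation; the dict lookup here plays the role of following an edge.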

What are the use cases of graph databases?


Graph databases have advantages for use cases such as social networking, recommendation
engines, and fraud detection when used to create relationships between data and quickly query
these relationships.
Fraud detection
Graph databases are capable of sophisticated fraud prevention. For example, you can use
relationships in graph databases to process financial transactions in near-real time. With fast

graph queries, you can detect that a potential purchaser is using the same email address and
credit card included in a known fraud case. Graph databases can also help you detect fraud
through relationship patterns, such as multiple people associated with a personal email address or
multiple people sharing the same IP address but residing in different physical locations.
Recommendation engines
The graph model is a good choice for applications that provide recommendations. You can store
graph relationships between information categories such as customer interests, friends, and
purchase history. You can use a highly available graph database to make product
recommendations to a user based on which products are purchased by others who have similar
interests and purchase histories. You can also identify people who have a mutual friend but don’t
yet know each other and then make a friendship recommendation.
Route optimization
Route optimization problems involve analyzing a dataset and finding values that best suit a
particular scenario. For example, you can use a graph database to find the following:

 The shortest route from point A to B on a map by considering various paths.

 The right employee for a particular shift by analyzing varied availabilities, locations, and skills.

 The optimum machinery for operations by considering parameters like cost and life of the
equipment.

Graph queries can analyze these situations much faster because they can count and compare the
number of links between two nodes.
Pattern discovery
Graph databases are well suited for discovering complex relationships and hidden patterns in
data. For instance, a social media company uses a graph database to distinguish between bot
accounts and real accounts. It analyzes account activity to discover connections between account
interactions and bot activity.
Knowledge management
Graph databases offer techniques for data integration, linked data, and information sharing. They
represent complex metadata or domain concepts in a standardized format and provide rich
semantics for natural language processing. You can also use these databases for knowledge

graphs and master data management. For example, machine learning algorithms distinguish
between the Amazon rainforest and the Amazon brand using graph models.
What are the advantages of graph databases?
A graph database is custom-built to manage highly connected data. As the connectedness and
volume of modern data increase, graph databases present an opportunity to utilize and analyze
the data cost-effectively. Here are the three main advantages of graph analytics.
Flexibility
The schema and structure of graph models can change with your applications. Data analysts can
add or modify existing graph structures without impacting existing functions. There is no
requirement to model domains in advance.
Performance
Relational database models become less optimal as the volume and depth of relationships
increase. This results in data duplication and redundancy—multiple tables need processing to
discover query results. In contrast, graph database performance improves by several orders of
magnitude when querying relationships. Performance stays constant even when graph data
volume increases.

Efficiency
Graph queries are shorter and more efficient at generating the same reports compared to
relational databases. Graph technologies take advantage of linked nodes. Traversing the joins or
relationships is a very fast process, as the relationships between nodes are not calculated at query
times but are persisted in the database.
How do graph analytics and graph databases work?
Graph databases work using a standardized query language and graph algorithms.
Graph query languages
Graph query languages are used to interact with a graph database. Similar to SQL, the language
has features to add, edit, and query data. However, these languages take advantage of the
underlying graph structures to process complex queries efficiently. They provide an interface so
you can ask questions like:

 Number of hops between nodes

 Longest path/shortest path/optimal paths

 Value of nodes

Apache TinkerPop Gremlin, SPARQL, and openCypher are popular graph query languages.
Graph algorithms
Graph algorithms are operations that analyze relationships and behaviors in interconnected data.
For instance, they explore the distance and paths between nodes or analyze incoming edges and
neighbor nodes to generate reports. The algorithms can identify common patterns, anomalies,
communities, and paths that connect the data elements. Some examples of graph algorithms
include:
Clustering
Applications like image processing, statistics, and data mining use clustering to group nodes
based on common characteristics. Clustering can optimize for both intra-cluster similarity and
inter-cluster difference.
Partitioning
You can partition or cut graphs at the node with the fewest edges. Applications such as network
testing use partitioning to find weak spots in the network.
Search
Graph searches or traversals can be one of two types—breadth-first or depth-first. Breadth-first
search moves from one node to the other across the graph. It is useful in optimal path discovery.
Depth-first search moves along a single branch to find all relations of a particular node.
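As a sketch of breadth-first traversal for optimal path discovery, here is a fewest-hops shortest-path search over a small adjacency dict; the graph itself is illustrative:

```python
from collections import deque

# Breadth-first search: explores nodes level by level, so the first path
# that reaches the goal is a shortest (fewest-hops) path.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(start, goal):
    queue = deque([[start]])   # queue of partial paths
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                # goal unreachable

print(shortest_path("A", "E"))  # ['A', 'B', 'D', 'E']
```

A depth-first variant would instead follow one branch to its end before backtracking (e.g. by using a stack rather than a queue), which suits finding all relations of a particular node rather than the optimal path.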
When are graph databases not suitable?
A dedicated graph database provides the most value for highly connected datasets and any
analyses that require searching for hidden and apparent relationships. If this doesn’t fit your use
case, other database types may be better suited.

For example, imagine a scenario where you need to record product inventory by item. You only
need to store details like item name and available units. Since you don’t need to retain additional
information, the columns on the table will not change. Due to the tabular nature, a relational
database is better suited for such unrelated data.

It is also important not to use graph databases simply as key-value stores. A lookup result from a
known key does not maximize the function of what graph databases were created to do.
How can AWS support your graph database requirements?
Amazon Neptune is a purpose-built, high-performance graph database engine optimized for
storing billions of relationships and querying the graph with milliseconds latency. Neptune
supports the popular graph models—property graph and W3C's Resource Description
Framework (RDF). It also supports respective query languages—Apache TinkerPop Gremlin and
SPARQL—to allow you to build queries that efficiently navigate highly connected datasets. The
top features of Neptune include:

 Serverless—enabling you to instantly scale graph workloads in fine-grained increments and
save up to 90% on database costs vs. provisioning for peak capacity.

 Highly available—including Amazon Neptune Global Database for globally distributed
applications supporting fast local read performance.

 Decoupled storage and compute so you can increase read performance with up to 15 read
replicas that share the same underlying storage, without having to perform writes at the replica
nodes.

 Highly reliable and durable with fault-tolerant and self-healing storage, point-in-time recovery,
continuous backups, and more. Amazon Neptune makes your data durable across three AZs
within a Region by replicating new writes six ways while you only pay for one copy.

 Highly secure with default encryption at rest, network isolation, and advanced auditing,
while providing the ability to control resource-level permissions with fine-grained access.

 Broad compliance coverage, from FedRAMP (Moderate and High) to SOC (1, 2, and 3), along with
HIPAA eligibility.

 Fully managed, so you no longer need to worry about database management tasks such as
hardware provisioning, software patching, setup, configuration, or backups.
