Unit - IV_Notes
As the world becomes more digital, the amount of data to process increases day by day. Conventional database management systems became unable to handle and query such data efficiently, and these limitations led to the development of new solutions such as HBase, the main topic of this unit.
In this unit we cover the basics of HBase and the functionality of its major components.
Overview
History
The first HBase prototype was created as a Hadoop contribution in February 2007.
The first usable HBase was released in October 2007, along with Hadoop 0.15.0.
Definition
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop
Distributed File System (HDFS).
By storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query, rather than scanning and discarding unwanted data in rows. Query performance is therefore increased for certain workloads.
For example, a common method of storing a table is to serialize each row of data, one row after another. A column-oriented database instead serializes all of the values of a column together, then the values of the next column, and so on.
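As an illustrative sketch (the original example figures are not reproduced here), consider a small employee table:

ID   Name    Salary
1    Smith   40000
2    Jones   50000

Row-oriented layout (each row stored contiguously):
1,Smith,40000 ; 2,Jones,50000
Column-oriented layout (each column stored contiguously):
1,2 ; Smith,Jones ; 40000,50000

A query such as the average salary only has to read the Salary block in the column-oriented layout, instead of scanning and discarding the other fields of every row.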
Features
Scalability
HBase is horizontally scalable: capacity is increased by adding more machines (nodes) to the cluster, which is also called scaling out. Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a single machine. Scaling vertically, also called scaling up, usually requires downtime while the new resources are being added and has limits that are defined by the hardware.
What you need to consider while choosing horizontal scalability over vertical scalability: “Scaling horizontally has both advantages and disadvantages. For example, adding inexpensive commodity computers to a cluster might seem to be a cost-effective solution at first glance, but it’s important for the administrator to know whether the licensing costs for those additional servers, the additional operations cost of powering and cooling, and the larger footprint they will occupy in the data center truly make scaling horizontally a better choice than scaling vertically.”
MemStore: a write cache that stores new data that has not yet been written to disk. There is one MemStore per column family (per region).
An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family of a table for a given region.
The Write-Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
Consistency
Consistency in database systems refers to the requirement that any given database transaction
must change affected data only in allowed ways. Any data written to the database must be valid
according to all defined rules, including constraints, cascades, triggers, and any combination
thereof.
Write transactions in HBase are always performed under a strong consistency model, which guarantees that transactions are ordered and replayed in the same order by all copies of the data. Under timeline consistency, get and scan requests can be answered from data that may be stale.
The java client API for HBase is used to perform CRUD operations on HBase tables. HBase is
written in Java and has a Java Native API. Therefore it provides programmatic access to Data
Manipulation Language (DML).
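As an illustrative sketch of this API (the table name customer and column family info are assumptions, not part of HBase itself), a basic create/read/delete sequence looks roughly like this:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customer"))) {

            // Create/update: write one cell into the (assumed) 'info' column family
            Put put = new Put(Bytes.toBytes("mfowler"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Boston"));
            table.put(put);

            // Read: fetch the cell back by row key
            Get get = new Get(Bytes.toBytes("mfowler"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));

            // Delete: remove the whole row
            table.delete(new Delete(Bytes.toBytes("mfowler")));
        }
    }
}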
Block cache
HBase supports a block cache to improve read performance. When performing a scan, if the block cache is enabled and there is room remaining, data blocks read from StoreFiles on HDFS are cached in the region server’s Java heap space, so that subsequent accesses to data in the same block can be served from the cached copy. The block cache helps reduce disk I/O when retrieving data.
Block cache is configurable at table’s column family level. Different column families can have
different cache priorities or even disable the block cache. Applications leverage this cache
mechanism to fit different data sizes and access patterns.
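As a rough sketch of this configurability, assuming the HBase 2.x Java admin API (the table events and column family raw are hypothetical names), the block cache can be disabled for a column family, and an individual scan can also be told not to use the cache:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockCacheExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Disable the block cache for a column family whose data is rarely re-read.
            ColumnFamilyDescriptor rawLogs = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("raw"))
                    .setBlockCacheEnabled(false)
                    .build();
            admin.modifyColumnFamily(TableName.valueOf("events"), rawLogs);

            // A one-off full scan can skip the cache so it does not evict hot blocks.
            Scan scan = new Scan();
            scan.setCacheBlocks(false);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}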
Bloom filter
A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is designed
to predict whether a given element is a member of a set of data. A positive result from a Bloom
filter is not always accurate, but a negative result is guaranteed to be accurate. Bloom filters are
designed to be “accurate enough” for sets of data which are so large that conventional hashing
mechanisms would be impractical.
In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number
of disk reads for a given Get operation to only the StoreFiles likely to contain the desired Row.
The potential performance gain increases with the number of parallel reads.
Bloom filters help reduce the number of I/O operations in the following situation: the StoreFiles are all from one column family and have a similar spread in row keys, although only a few actually hold an update to a specific row. The block index has a spread across the entire row key range and therefore always reports positive for containing a random row. Without a Bloom filter, the region server would need to load every block to check whether the block actually contains a cell of the row.
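A minimal sketch of enabling a row-level Bloom filter on a column family via the HBase 2.x Java admin API (the table sales and column family orders are hypothetical; in recent HBase versions ROW Bloom filters are typically enabled by default):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterConfigExample {
    static void enableRowBloomFilter(Admin admin) throws java.io.IOException {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("orders"))
                .setBloomFilterType(BloomType.ROW)   // BloomType.ROWCOL also hashes the column qualifier
                .build();
        admin.modifyColumnFamily(TableName.valueOf("sales"), cf);
    }
}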
HBase vs RDBMS
HBase Architecture
Row key: the reference of the row; it is used to make the search for a record faster.
Column families: combinations of sets of columns. Data belonging to the same column family can be accessed together in a single seek, allowing faster processing.
Cell: the storage area of the data. Each cell is identified by a row key and a column qualifier.
Region Server: serves data for reads and writes. Region servers run on the different machines of the Hadoop cluster. Each region server hosts one or more regions, an HLog (the write-ahead log), and the stores (MemStore and HFiles) for those regions.
To manage this system, ZooKeeper and the HMaster work together. The active HMaster sends heartbeats to ZooKeeper, which tracks its status, and to guarantee fault tolerance there is an inactive (standby) HMaster that acts as a backup.
Region servers also send heartbeat signals to ZooKeeper to report their status (ready for read and write operations).
In this section we discuss what happens when a client reads or writes data in HBase.
There is a special HBase catalog table called the META table, which holds the locations of the regions in the cluster. A read proceeds as follows:
1. The client asks ZooKeeper for the region server that hosts the META table.
2. The client queries that META server to get the region server responsible for the row key it wants to access.
3. The client caches this information along with the META table location.
4. Finally, the client contacts the region server holding the row key, and that region server returns the requested row or rows.
(Figure: the META table)
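The lookup described above is handled transparently by the HBase client library. As a rough sketch (the table name customer is an assumption), the RegionLocator API can show which region server currently serves a given row key:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookupExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("customer"))) {
            // The client library performs the META lookup and caches the result.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("mfowler"));
            System.out.println("Row 'mfowler' is served by " + location.getServerName());
        }
    }
}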
The following steps occur when a client issues a write command:
1. Write the data to the write-ahead log (WAL). HBase always has the WAL to look into if any error occurs while writing data.
2. Once the data is written to the WAL, it is copied to the MemStore.
3. Once the data is placed in the MemStore, the client receives an acknowledgement (ACK).
4. When the MemStore reaches its threshold, it flushes (commits) the data into an HFile.
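As a small sketch of how the WAL participates in this write path from the client's point of view (the table handle, the column family cf, and the row key are assumptions), a Put's durability setting controls step 1:

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalWriteExample {
    static void writeWithWal(Table table) throws java.io.IOException {
        Put put = new Put(Bytes.toBytes("row-1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        // Step 1 above: the edit is recorded in the WAL before the MemStore.
        // SKIP_WAL would trade durability for speed: the edit is lost if the
        // region server crashes before the MemStore is flushed.
        put.setDurability(Durability.SYNC_WAL);
        table.put(put);   // the client receives its ACK once the WAL and MemStore hold the edit
    }
}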
Applications of HBase
Medical
In the medical field, HBase is used to store genome sequences and run MapReduce on them, and to store the disease histories of people or of an area.
Sports
E-Commerce
HBase is used to record and store logs about customers' search histories, and to perform analytics and then target advertisements for better business.
There are many popular companies using HBase, some of them are:
1. Mozilla
2. Facebook
3. Infolinks
Infolinks is an in-text ad provider company. It uses HBase to process advertisement selection and user events for its In-Text ad network. Moreover, to optimize ad selection, it uses the reports that HBase generates as feedback for its production system.
4. Twitter
A company like Twitter also runs HBase across its entire Hadoop cluster. For them, HBase offers a distributed, read/write backup of all MySQL tables in their production backend. That helps engineers run MapReduce jobs over the data while maintaining the ability to apply periodic row updates.
5. Yahoo!
One of the most famous companies, Yahoo!, also uses HBase. There, HBase helps store document fingerprints in order to detect near-duplicates.
Column-family stores, such as Cassandra [Cassandra], HBase [Hbase], Hypertable [Hypertable], and
Amazon SimpleDB [Amazon SimpleDB], allow you to store data with keys mapped to values and the
values grouped into multiple column families, each column family being a map of data.
A column has a key (such as firstName) and a value (such as Martin), and has a timestamp attached to it. A row is a collection of columns attached or linked to a key; a collection of similar rows makes a column family. When the columns in a column family are simple columns, the column family is known as a standard column family.
//column family
{
  //row
  "pramod-sadalage" : {
    firstName: "Pramod",
    lastName: "Sadalage",
    lastVisit: "2012/12/12"
  }
  //row
  "martin-fowler" : {
    firstName: "Martin",
    lastName: "Fowler",
    location: "Boston"
  }
}
Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not have to have the same columns, and columns can be added to any row at any time without having to add them to other rows. We have the pramod-sadalage row and the martin-fowler row with different columns; both rows are part of the same column family.
When a column consists of a map of columns, then we have a super column. A super column
consists of a name and a value which is a map of columns. Think of a super column as a container of
columns.
{
  name: "book:978-0767905923",
  value: {
    author: "Mitch Albon",
    title: "Tuesdays with Morrie",
    isbn: "978-0767905923"
  }
}
When we use super columns to create a column family, we get a super column family.
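A rough illustrative sketch of a super column family (not the book's original figure; the address super column and its values are made up for illustration):

//super column family
{
  //row
  "martin-fowler" : {
    //super column
    "book:978-0767905923" : {
      author: "Mitch Albon",
      title: "Tuesdays with Morrie",
      isbn: "978-0767905923"
    },
    //super column
    "address" : {
      city: "Boston",
      country: "USA"
    }
  }
}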
Super column families are good for keeping related data together, but when some of the columns are not needed most of the time, those columns are still fetched and deserialized by Cassandra, which may not be optimal.
Cassandra puts the standard and super column families into keyspaces. A keyspace is similar to a
database in RDBMS where all column families related to the application are stored. Keyspaces have
to be created so that column families can be assigned to them:
create keyspace ecommerce
10.2.1. Consistency
When a write is received by Cassandra, the data is first recorded in a commit log, then written to an
in-memory structure known as memtable. A write operation is considered successful once it’s
written to the commit log and the memtable. Writes are batched in memory and periodically written
out to structures known as SSTable. SSTables are not written to again after they are flushed; if there
are changes to the data, a new SSTable is written. Unused SSTables are reclaimed by compaction.
Let’s look at the read operation to see how consistency settings affect it. If we have a consistency
setting of ONE as the default for all read operations, then when a read request is made, Cassandra
returns the data from the first replica, even if the data is stale. If the data is stale, subsequent reads
will get the latest (newest) data; this process is known as read repair. The low consistency level is
good to use when you do not care if you get stale data and/or if you have high read performance
requirements.
Similarly, if you are doing writes, Cassandra would write to one node’s commit log and return a
response to the client. The consistency of ONE is good if you have very high write performance
requirements and also do not mind if some writes are lost, which may happen if the node goes down
before the write is replicated to other nodes.
With the Hector client library, we can instead set the default consistency level to QUORUM for both reads and writes:
ConfigurableConsistencyLevel quorum = new ConfigurableConsistencyLevel();
quorum.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
quorum.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
Using the QUORUM consistency setting for both read and write operations ensures that a majority of the nodes respond to the read and that the column with the newest timestamp is returned to the client, while the replicas that do not have the newest data are repaired via the read repair operation. During write operations, the QUORUM consistency setting means that the write has to propagate to a majority of the nodes before it is considered successful and the client is notified.
Using ALL as the consistency level means that all nodes have to respond to reads or writes, which makes the cluster intolerant of faults: even when one node is down, the write or read is blocked and reported as a failure. It is therefore up to the system designers to tune the consistency levels as the application requirements change. Within the same application, there may be different requirements of consistency; they can also change per operation. For example, showing review comments for a product has different consistency requirements than reading the status of the last order placed by the customer.
During keyspace creation, we can configure how many replicas of the data we need to store. This number determines the replication factor of the data. If you have a replication factor of 3, the data is copied to three nodes. When writing and reading data with Cassandra, if you specify a consistency value of 2 for both reads and writes, then R + W is greater than the replication factor (2 + 2 > 3), which gives you better consistency during writes and reads.
We can run the node repair command for the keyspace and force Cassandra to compare every key
it’s responsible for with the rest of the replicas. As this operation is expensive, we can also just
repair a specific column family or a list of column families:
repair ecommerce
While a node is down, the data that was supposed to be stored by that node is handed off to other
nodes. As the node comes back online, the changes made to the data are handed back to the node. This
technique is known as hinted handoff. Hinted handoff allows for faster restore of failed nodes.
10.2.2. Transactions
Cassandra does not have transactions in the traditional sense—where we could start multiple writes
and then decide if we want to commit the changes or not. In Cassandra, a write is atomic at the row
level, which means inserting or updating columns for a given row key will be treated as a single
write and will either succeed or fail. Writes are first written to commit logs and memtables, and are
only considered good when the write to commit log and memtable was successful. If a node goes
down, the commit log is used to apply changes to the node, just like the redo log in Oracle.
You can use external transaction libraries, such as ZooKeeper [ZooKeeper], to synchronize your
writes and reads. There are also libraries such as Cages [Cages] that allow you to wrap your
transactions over ZooKeeper.
10.2.3. Availability
Cassandra is by design highly available, since there is no master in the cluster and every node is a
peer in the cluster. The availability of a cluster can be increased by reducing the consistency level of
the requests. Availability is governed by the (R + W) > N formula (“Quorums,” p. 57) where W is the
minimum number of nodes where the write must be successfully written, R is the minimum number of
nodes that must respond successfully to a read, and N is the number of nodes participating in the
replication of data. You can tune the availability by changing the R and W values for a fixed value of N.
In a 10-node Cassandra cluster with a replication factor for the keyspace set to 3 (N = 3), if we set
R = 2 and W = 2, then we have (2 + 2) > 3. In this scenario, when one node goes down,
availability is not affected much, as the data can be retrieved from the other two nodes. If W = 2 and
R = 1, when two nodes are down the cluster is not available for write but we can still read.
Similarly, if R = 2 and W = 1, we can write but the cluster is not available for read. With the R + W
> N equation, you are making conscious decisions about consistency tradeoffs.
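As a minimal illustration of this tuning (a hypothetical helper, not part of any Cassandra API), the quorum condition can be checked directly:

public class QuorumCheck {
    // Reads are guaranteed to see the latest successful write when the read set
    // and the write set must overlap in at least one replica, i.e. when R + W > N.
    static boolean isStronglyConsistent(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;                                           // replication factor
        System.out.println(isStronglyConsistent(2, 2, n));   // true  -> consistent reads
        System.out.println(isStronglyConsistent(1, 1, n));   // false -> possibly stale reads
    }
}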
You should set up your keyspaces and read/write operations based on your needs—higher
availability for write or higher availability for read.
10.2.4. Query Features
When designing the data model in Cassandra, it is advised to make the columns and column families
optimized for reading the data, as it does not have a rich query language; as data is inserted in the
column families, data in each row is sorted by column names. If we have a column that is retrieved
much more often than other columns, it’s better performance-wise to use that value for the row key
instead.
10.2.4.1. Basic Queries
Basic queries that can be run using a Cassandra client include the GET, SET, and DEL. Before starting
to query for data, we have to issue the keyspace command use ecommerce;. This ensures that all of
our queries are run against the keyspace that we put our data into. Before starting to use the column
family in the keyspace, we have to define the column family.
CREATE COLUMN FAMILY Customer
WITH comparator = UTF8Type
AND key_validation_class = UTF8Type
AND column_metadata = [
  {column_name: city, validation_class: UTF8Type},
  {column_name: name, validation_class: UTF8Type},
  {column_name: web, validation_class: UTF8Type}
];
We have a column family named Customer with name, city, and web columns, and we are
inserting data in the column family with a Cassandra client.
SET Customer['mfowler']['city']='Boston';
SET Customer['mfowler']['name']='Martin Fowler';
SET Customer['mfowler']['web']='www.martinfowler.com';
Using the Hector [Hector] Java client, we can insert the same data in the column family.
// 'cassandra' (a template factory), 'key' (the row key), and 'values' (a map of
// column names to values) are assumed to be set up elsewhere.
ColumnFamilyTemplate<String, String> template =
    cassandra.getColumnFamilyTemplate();
ColumnFamilyUpdater<String, String> updater = template.createUpdater(key);
for (String name : values.keySet()) {
  updater.setString(name, values.get(name));
}
try {
  template.update(updater);   // writes all the queued column updates for the row
} catch (HectorException e) {
  handleException(e);
}
We can read the data back using the GET command. There are multiple ways to get the data; we can
get the whole column family.
GET Customer['mfowler'];
We can even get just the column we are interested in from the column family.
GET Customer['mfowler']['web'];
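The same read can be done with the Hector Java client; a rough sketch, reusing the template object from the Hector insert example shown earlier (the template is assumed to target the Customer column family):

ColumnFamilyResult<String, String> result = template.queryColumns("mfowler");
String web = result.getString("web");   // value of the 'web' column for row key 'mfowler'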
Getting the specific column we need is more efficient, as only the data we care about is returned—
which saves lots of data movement, especially when the column family has a large number of
columns. Updating the data is the same as using the SET command for the column that needs to be set
to the new value. Using the DEL command, we can delete either a column or the entire column family.
DEL Customer['mfowler']['city'];
DEL Customer['mfowler'];
Cassandra also supports secondary indexes on column values. These indexes are implemented as bit-mapped indexes and perform well for low-cardinality column values.
10.2.4.3. Cassandra Query Language (CQL)
Cassandra has a query language that supports SQL-like commands, known as Cassandra Query
Language (CQL). We can use the CQL commands to create a column family.
CREATE COLUMNFAMILY Customer (
  KEY varchar PRIMARY KEY,
  name varchar,
  city varchar,
  web varchar);
We can read data using the SELECT command. Here we read all the columns:
SELECT * FROM Customer
Indexes on columns are created using the CREATE INDEX command, and can then be used to query the data:
SELECT name,web FROM Customer WHERE city='Boston'
CQL has many more features for querying data, but it does not have all the features that SQL has.
CQL does not allow joins or subqueries, and its where clauses are typically simple.
10.2.5. Scaling
Scaling an existing Cassandra cluster is a matter of adding more nodes. As no single node is a master,
when we add nodes to the cluster we are improving the capacity of the cluster to support more writes
and reads. This type of horizontal scaling allows you to have maximum uptime, as the cluster keeps
serving requests from the clients while new nodes are being added to the cluster.
10.3. Suitable Use Cases
Let’s discuss some of the problems where column-family databases are a good fit.
10.3.1. Event Logging
Column-family databases with their ability to store any data structures are a great choice to store
event information, such as application state or errors encountered by the application. Within the
enterprise, all applications can write their events to Cassandra with their own columns and the
rowkey of the form appname:timestamp. Since we can scale writes, Cassandra would work ideally
for an event logging system (Figure 10.2).
Column-family databases are also a good fit for counters, for example counting visits to the pages of a web application. Once a counter column family is created, you can have arbitrary columns for each page visited within the web application for every user; the visit counters are then incremented with the INCR commands shown after the sketch below.
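A minimal sketch of how such a counter column family might be defined, following the cassandra-cli style used earlier (the definition is an assumption; the name visit_counter matches the commands below):

CREATE COLUMN FAMILY visit_counter
WITH default_validation_class = CounterColumnType
AND key_validation_class = UTF8Type
AND comparator = UTF8Type;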
INCR visit_counter['mfowler'][home] BY 1;
INCR visit_counter['mfowler'][products] BY 1;
INCR visit_counter['mfowler'][contactus] BY 1;