
HBASE

Vinod Kumar S,
ESDM Team,
CDAC, Hyderabad
HBase
◼ HBase (Hadoop Database) is a NoSQL database widely used
for storage in Big Data systems.
◼ It can run on top of Hadoop (HDFS) in distributed mode or
in standalone mode.
◼ It is designed to host large tables with billions of rows and
potentially millions of columns, running across a cluster of
commodity hardware.
◼ HBase offers real-time query capabilities with the speed of a
key/value store, plus offline or batch processing via
MapReduce.
◼ HBase lets you query individual records as well as derive
aggregate analytic reports across massive amounts of data.
◼ To deliver timely search over the web, Google introduced the
following technologies:
❑ Google File System: a scalable distributed file system for
large, distributed, data-intensive applications
❑ MapReduce: a programming model and an associated
implementation for processing and generating large data
sets
Single Coordinating Software
◼ Google File System → HDFS
◼ MapReduce → MapReduce
Hadoop
◼ HDFS: a file system to manage the storage of data
◼ MapReduce: a framework to process data across multiple
servers in parallel
◼ Hadoop is a big data processing framework
◼ Hadoop is not a database
◼ A database goes beyond storing and processing data:
❑ it provides many other data-management features to the user,
❑ which Hadoop does not.
Requirements of databases
◼ Structured: rows and columns
◼ Random access: update one row at a time
◼ Low latency: very fast read/write/update operations
◼ ACID compliant: ensures data integrity
Limitations of Hadoop
◼ Unstructured data – data in HDFS has no schema, e.g. text
files, log files, audio files, video files
❑ A basic structure exists for some file types, e.g. CSV, XML,
JSON
❑ Hadoop enforces no constraints on these
◼ No random access – cannot create, access, and modify
individual records in a file
❑ MapReduce parses entire files to extract information
◼ High latency – not suited for real-time processing where a
user waits for data to be retrieved
❑ Batch processing with long-running jobs
◼ Not ACID compliant – HDFS is file storage and provides no
guarantees for data integrity
◼ To deliver timely search over the web, Google introduced the
following technologies:
❑ Google File System: a scalable distributed file system for
large, distributed, data-intensive applications
❑ MapReduce: a programming model and an associated
implementation for processing and generating large data
sets
❑ Bigtable: a distributed storage system for managing
structured data that is designed to scale to a very large size:
petabytes of data across thousands of commodity servers
◼ In 2007, Mike Cafarella released code for an open-source
Bigtable implementation that he called HBase.
◼ It is now used by companies such as Facebook, Twitter, and Adobe.
Bigtable → HBase
◼ Google's stack: Bigtable on top of Google File System and MapReduce
◼ Hadoop's stack: HBase on top of HDFS and MapReduce
◼ HBase is a distributed database management system that
runs on top of Hadoop.
HBase
◼ Distributed: stores data in HDFS
◼ Scalable: capacity is directly proportional to the
number of nodes in the cluster
◼ Fault tolerant: based on Hadoop
HBase
◼ Structured: a loose data structure
◼ Low latency: real-time access using row-based indices
called row keys
◼ Random access: row keys allow access and updates to one
record
◼ Somewhat ACID compliant: some transactions have
ACID properties
◼ Batch processing using MapReduce
◼ Real-time processing using row keys
Properties of HBase
◼ Columnar store
◼ Denormalized storage
◼ Only CRUD operations
◼ ACID at the row level
◼ Each row has a unique row key and its own set of columns
◼ Column families are fixed by the schema; the columns within them are not
◼ If an attribute does not exist for a row, that cell is simply empty.
Every column is an attribute of a particular record.
◼ A traditional database is a two-dimensional model: you have to
specify two dimensions, the unique identifier of the row and the
specific column, in order to identify a single cell.
Columnar Storage
(a sequence of figure slides illustrating columnar storage)
Advantages of columnar store
◼ Sparse tables: no wastage of space when
storing data
◼ Dynamic attributes: update attributes
dynamically without changing the storage
structure
◼ For an RDBMS, adding a column requires
schema changes
◼ RDBMS
❑ Structural changes are hard to do in an RDBMS
❑ Empty cells appear when data is not applicable to certain rows
❑ Empty cells occupy space
◼ Columnar storage
❑ Dynamically add new attributes as rows in the storage table
(see the sketch below)
❑ No wastage of space with empty cells!
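
As an illustrative sketch (table and attribute names are made up), a
columnar store can hold each attribute as its own (row key, column, value)
entry, so adding an attribute means adding an entry, not altering a schema:

row key   column           value
emp1      personal:name    Mike
emp1      work:dept        Sales
emp2      personal:name    Ana
emp2      work:phone       555-0100   <- new attribute, no schema change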
RDBMS Minimize Redundancy
◼ Employee Details
◼ Employee Subordinates
◼ Employee Address
◼ Employees are referenced only by IDs everywhere else
◼ Data is made more granular by splitting it across multiple tables
◼ Normalization
❑ Optimizes storage
Denormalized Storage
◼ A distributed system has plenty of storage
◼ Optimize the number of disk seeks instead
◼ Store everything related to an employee in
the same table
◼ Read a single record to get all details about
an employee in one read operation
Denormalized Storage
◼ HBase allows complex data types like arrays and
structs within a single cell (a shell sketch follows)
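
A hedged shell sketch of denormalized storage (table, column-family, and
column names are illustrative): subordinates and address live in the
employee's own row, so a single read returns everything about the employee.

hbase> put 'employees', 'emp1', 'details:name', 'Mike'
hbase> put 'employees', 'emp1', 'subordinates:1', 'emp7'
hbase> put 'employees', 'emp1', 'address:city', 'Hyderabad'
hbase> get 'employees', 'emp1'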
Traditional Databases and SQL
◼ Joins: combining information across tables
using keys
◼ Group By: grouping and aggregating data for
the groups
◼ Order By: sorting rows by a certain column
HBase CRUD Operations
◼ HBase does not support SQL
◼ Only a limited set of operations is allowed in
HBase:
◼ Create, Read, Update, Delete (see the shell sketch below)
◼ No operations involving multiple tables
◼ No indexes on tables
◼ No constraints
❑ This is why all details need to be self-contained in
one row
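
As a hedged sketch, the four CRUD operations map onto standard shell
commands like this (table, row, and column names are illustrative):

hbase> put 'census', 'row1', 'personal:name', 'Mike'      # Create
hbase> get 'census', 'row1'                               # Read
hbase> put 'census', 'row1', 'personal:name', 'Michael'   # Update = put on an existing cell
hbase> delete 'census', 'row1', 'personal:name'           # Delete one cell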
HBase
◼ ACID at the row level
◼ Updates to a single row are atomic:
◼ all columns in a row are updated, or none are
◼ Updates to multiple rows are not atomic,
◼ even if the update is on the same column in
multiple rows (see the sketch below)
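
A minimal Java sketch of row-level atomicity, assuming a Table handle named
table obtained from the client API shown at the end of these slides (it uses
org.apache.hadoop.hbase.client.Put and org.apache.hadoop.hbase.util.Bytes;
row and column names are illustrative):

// One Put carrying several columns of the same row is applied atomically.
Put put = new Put(Bytes.toBytes("emp1"));
put.addColumn(Bytes.toBytes("work"), Bytes.toBytes("dept"), Bytes.toBytes("Sales"));
put.addColumn(Bytes.toBytes("work"), Bytes.toBytes("title"), Bytes.toBytes("Manager"));
table.put(put);  // all-or-nothing for row 'emp1'
// Two separate Puts to two different rows carry no combined guarantee.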
Traditional RDBMS vs. HBase
◼ Traditional RDBMS
❑ Data arranged in rows and columns
❑ Supports SQL
❑ Complex queries such as grouping, aggregates, joins, etc.
❑ Normalized storage to minimize redundancy and optimize space
❑ ACID compliant
◼ HBase
❑ Data arranged in a column-wise manner
❑ NoSQL database
❑ Only basic operations such as create, read, update, and delete
❑ Denormalized storage to minimize disk seeks
❑ ACID compliant at the row level
HBase has a 4-dimensional data model
◼ Row key
◼ Column family
◼ Column
◼ Timestamp
◼ Row key
❑ Uniquely identifies a row
❑ Can be primitives, structures, arrays
❑ Represented internally as a byte array
❑ Sorted in ascending order
◼ Column family
❑ All rows have the same set of column families
❑ Each column family is stored in a separate data file
❑ Set up at schema-definition time
❑ Can have different columns for each row
◼ Column
❑ Columns are units within a column family
❑ New columns can be added on the fly
❑ ColumnFamily:ColumnName = Work:Department
◼ Timestamp
❑ Used as the version number for the values stored
in a column
❑ The value for any version can be accessed (see the example below)
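
A hedged example of addressing a single cell along all four dimensions
(table, row key, and column names are illustrative; COLUMN and VERSIONS
are standard shell options):

hbase> get 'census', 'row1', {COLUMN => 'personal:name', VERSIONS => 3}

This returns up to three timestamped versions of the personal:name cell in
row 'row1', subject to how many versions the column family retains.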
Insert and update data using the put
command
◼ Row key
◼ Insert data one cell at a time
◼ Use the column-family prefix with every column
qualifier

put 'census', 1, 'personal:name', 'Mike Jones'

scan 'census'

◼ This command returns the table's rows and reports how many
rows were retrieved.
Put

put '<HBase_table_name>', '<row_key>', '<colfamily:colname>', '<value>'

SQL>   select * from tablename
hbase> scan 'tablename'

SQL>   select colname from tablename
hbase> scan 'tablename', {COLUMNS => ['columnfamily:column']}

SQL>   select * from tablename limit 1
hbase> scan 'tablename', {LIMIT => 1}
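
One more mapping in the same spirit, a hedged sketch using the standard
STARTROW and STOPROW scan options (the row-key values are illustrative;
STARTROW is inclusive, STOPROW is exclusive):

SQL>   select * from tablename where id >= 'r1' and id < 'r5'
hbase> scan 'tablename', {STARTROW => 'r1', STOPROW => 'r5'}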
HBase Shell commands
◼ General commands
◼ DDL commands
◼ DML commands
◼ Other commands
General commands
◼ status – shows the cluster status
❑ hbase> status 'simple'
❑ hbase> status 'summary'
❑ hbase> status 'detailed'
◼ table_help – help on table-reference
commands: scan, put, get, disable, drop, etc.
◼ version – displays the HBase version
◼ whoami – shows the current HBase user
◼ list
◼ help
DDL Commands
◼ alter, alter_async, alter_status
◼ create, describe
◼ disable, disable_all
◼ drop, drop_all
◼ enable, enable_all
◼ exists
◼ get_table, is_disabled, is_enabled, list,
locate_region, show_filters
◼ To learn more about any command: hbase> help 'ddl'
(a short walk-through follows)
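
A hedged walk-through of the most common DDL commands (table and
column-family names are illustrative):

hbase> create 'census', 'personal', 'work'   # table with two column families
hbase> describe 'census'
hbase> disable 'census'                      # a table must be disabled before dropping
hbase> drop 'census'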
DML Commands
◼ count, delete, deleteall
◼ get, get_counter, get_splits, incr
◼ put
◼ scan
◼ truncate, truncate_preserve, append
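
A hedged sketch of everyday DML commands (table, row, and column names
are illustrative):

hbase> put 'census', 'row1', 'personal:name', 'Mike Jones'
hbase> get 'census', 'row1'
hbase> incr 'census', 'row1', 'personal:visits', 1
hbase> deleteall 'census', 'row1'   # delete all cells in a row
hbase> count 'census'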
Security Commands
◼ grant: grant users specific rights
❑ permissions is zero or more letters from the set "RWXCA":
❑ READ ('R'), WRITE ('W'), EXEC ('X'), CREATE ('C'), ADMIN ('A')
❑ e.g.: hbase> grant 'bobsmith', 'RWXCA'
❑ hbase> grant '@admins', 'RWXCA'
❑ hbase> grant 'bobsmith', 'RWXCA', '@ns1'
◼ revoke: revoke a user's access rights (see the examples below)
◼ user_permission: show all permissions for a
particular user
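
A hedged pair of examples for the other two commands (the user and table
names are illustrative):

hbase> revoke 'bobsmith'
hbase> user_permission 'census'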
Filters
◼ Filters are used to get a subset of the scan
results.
◼ Instead of scanning the entire dataset, they return
a subset closer to what we need, in less time.
◼ Use filters with the scan or get commands.
◼ To list the available filters:
❑ hbase> show_filters
◼ FirstKeyOnlyFilter
❑ This filter doesn't take any arguments. It returns
the first key-value from every row.

Syntax: FirstKeyOnlyFilter()
Ex: scan 'tablename', {FILTER => "FirstKeyOnlyFilter()"}

◼ KeyOnlyFilter
❑ This filter doesn't take any arguments. It returns
only the key part of every key-value.

Syntax: KeyOnlyFilter()
Ex: scan 'tablename', {FILTER => "KeyOnlyFilter()"}
◼ ColumnPrefixFilter
❑ This filter takes one argument, a column prefix. It returns only
those key-values present in a column that starts with the
specified column prefix. The column prefix must be of the form:
qualifier.

ColumnPrefixFilter('<column_prefix>')
Example: ColumnPrefixFilter('Col')

◼ MultipleColumnPrefixFilter
❑ This filter takes a list of column prefixes. It returns key-values that
are present in a column that starts with any of the specified
column prefixes. Each of the column prefixes must be of the form:
qualifier.
MultipleColumnPrefixFilter('<column_prefix>',
'<column_prefix>', …, '<column_prefix>')
Example: MultipleColumnPrefixFilter('Col1', 'Col2')
◼ ValueFilter
❑ This filter takes a compare operator and a comparator. It compares each
value with the comparator using the compare operator, and if the
comparison returns true, it returns that key-value.

ValueFilter(<compareOp>, '<value_comparator>')
Example: ValueFilter(!=, 'binary:Nick')

◼ PrefixFilter
❑ This filter takes one argument, a prefix of a row key. It returns
only those key-values present in rows that start with the
specified row prefix.

PrefixFilter('<row_prefix>')
Example: PrefixFilter('Row')
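
Filters can also be combined within a single scan. A hedged example (the
table name and row prefix are illustrative; AND is part of the standard
shell filter language):

hbase> scan 'census', {FILTER => "PrefixFilter('row1') AND KeyOnlyFilter()"}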
HBase Architecture
◼ HBase has three major components:
❑ The Client library,
❑ A Master server, and
❑ Region servers (Region servers can be added or
removed as per requirement)
HBase Architecture
◼ The master server:
❑ Assigns regions to the RegionServers with the
help of Apache ZooKeeper.
❑ Handles load balancing of the regions across region
servers: it unloads busy servers and shifts their
regions to less-occupied servers.
❑ Maintains the state of the cluster by negotiating the
load balancing.
❑ Is responsible for schema changes and other metadata
operations, such as the creation of tables and column
families.
HBase Architecture
◼ Regions
❑ Regions are nothing but tables that are split up and spread across
the region servers.
◼ Region server
❑ The region servers:
◼ Communicate with the client and handle data-related
operations.
◼ Handle read and write requests for all the regions under them.
◼ Decide the size of each region by following the region-size
thresholds.
◼ Looking deeper into a region server, it contains
regions and stores.
HBase Architecture
◼ The store contains the memstore and HFiles.
◼ The memstore is just like a cache memory:
anything entered into HBase is stored here
initially.
◼ Later, the data is transferred and saved in
HFiles as blocks, and the memstore is
flushed.
Tables, Regions and RegionServers
◼ Conceptually, a table is a
collection of rows and
columns. In HBase, tables
are physically stored in
partitions called regions.
◼ In HBase, tables are
automatically split into
regions.
◼ These regions are handled
by the RegionServers.
◼ RegionServers are nothing
but slave nodes.
◼ Every region is served by
exactly one region server,
which in turn serves the
stored values directly to
clients.
◼ HBase depends on ZooKeeper
❑ ZooKeeper is a centralized
service for maintaining
configuration information and
naming, providing distributed
synchronization, etc.
◼ By default HBase manages the
ZooKeeper instance
❑ e.g., it starts and stops
ZooKeeper
◼ HMaster and HRegionServers
register themselves with
ZooKeeper
Big Picture
◼ HBase has three types of
servers in a master-slave type of
architecture.
◼ The HBase Master process (HMaster)
handles region assignment and DDL
(create, delete tables) operations.
◼ Region servers serve data for
reads and writes. When
accessing data, clients
communicate with HBase
RegionServers directly.
◼ ZooKeeper, which HBase uses as
its coordination service, maintains
the live cluster state and provides
server-failure notification.
◼ The Hadoop DataNode stores the
data that the RegionServer is
managing.
◼ HMaster and HRegionServers
register themselves with
ZooKeeper.
◼ HBase tables are divided horizontally by row-key range into "regions."
A region contains all rows in the table between the region's start key and
end key. Regions are assigned to the nodes in the cluster, called
"region servers," and these serve data for reads and writes. A region
server can serve about 1,000 regions.
◼ The master:
❑ Coordinates the region servers
❑ Assigns regions on startup; re-assigns regions for recovery or
load balancing
❑ Monitors all RegionServer instances in the cluster (listens for
notifications from ZooKeeper)
❑ Handles admin functions: the interface for creating, deleting,
and updating tables
◼ HBase uses ZooKeeper as a distributed coordination service to
maintain server state in the cluster. ZooKeeper tracks which
servers are alive and available, and provides server-failure
notification.
◼ The HBase catalog table, called the META
table, holds the location of the regions in
the cluster. ZooKeeper stores the location
of the META table (not the table itself).
◼ When a client reads from or writes to HBase
for the first time:
❑ The client gets the region server that
hosts the META table from ZooKeeper.
❑ The client queries the .META. server to
get the region server corresponding to the
row key it wants to access. The client
caches this information along with the
META table location.
❑ It gets the row from the corresponding
region server.
❑ For future reads, the client uses the cache
to retrieve the META location and
previously read row keys. Over time, it
does not need to query the META table,
unless there is a miss because a region
has moved; then it re-queries and
updates the cache.
◼ The META table is an HBase table that keeps a list of all
regions in the system.
◼ The .META. table is like a B-tree.
◼ The .META. table structure is as follows:
❑ Key: region start key, region id
❑ Values: RegionServer
A RegionServer runs on a DataNode and
has the following components:
◼ HFiles store the rows as sorted
KeyValues on disk.
◼ MemStore: the write cache. It
stores new data that has not yet
been written to disk. It is sorted
before being written to disk. There is
one MemStore per column family per
region.
◼ WAL: the Write-Ahead Log is a file on
the distributed file system. The WAL
is used to store new data that hasn't
yet been persisted to permanent
storage; it is used for recovery in the
case of failure.
◼ BlockCache: the read cache. It
stores frequently read data in
memory. Least-recently-used data
is evicted when it is full.
Write Operation
◼ First, data is written to a commit log, called
the WAL (write-ahead log)
◼ Then data is moved into memory, into a
structure called the memstore
◼ When the size of the memstore exceeds a
given threshold, it is flushed to an HFile on
disk
HBase Write Steps (1)
◼ When the client issues
a put request, the first
step is to write the data
to the write-ahead log,
the WAL:
❑ Edits are appended
to the end of the
WAL file, which is
stored on disk.
❑ The WAL is used to
recover not-yet-
persisted data in
case a server
crashes.
HBase Write Steps (2)
◼ Once the data is written
to the WAL, it is placed in
the MemStore. Then the
put-request
acknowledgement
returns to the client.
◼ There is one MemStore
per column family.
HBase MemStore
◼ When the MemStore
accumulates enough data,
the entire sorted set is
written to a new HFile in
HDFS.
◼ HBase uses multiple
HFiles per column family;
these contain the actual
cells, or KeyValue
instances.
◼ These files are created
over time, as the KeyValue
edits sorted in the
MemStores are flushed to
disk as files.
ZooKeeper
❑ ZooKeeper is an open-source project that provides
services like maintaining configuration information,
providing distributed synchronization, etc.
❑ ZooKeeper has ephemeral nodes representing different
region servers. Master servers use these nodes to
discover available servers.
❑ In addition to availability, the nodes are also used to
track server failures or network partitions.
❑ Clients locate region servers via ZooKeeper,
then communicate with them directly.
❑ In pseudo-distributed and standalone modes, HBase
itself takes care of ZooKeeper.
◼ Region Server
❑ Each region is served by exactly one Region
Server
❑ Region servers can serve multiple regions
❑ The number of region servers and their sizes
depend on the capability of a single region server
Automatic Sharding
◼ Tables are dynamically distributed by the
system to different region servers when they
become too large.
◼ Splitting and serving regions can be thought
of as auto-sharding.
◼ Scalability and load balancing are handled
using regions. Regions are contiguous ranges
of rows stored together.
Automatic Sharding
◼ Region
❑ This is the basic unit of scalability and load balancing
❑ Regions are contiguous ranges of rows stored together; they
are the equivalent of range partitions in a sharded RDBMS
❑ Regions are dynamically split by the system when they
become too large
❑ Regions can also be merged to reduce the number of storage
files
◼ Regions in practice
❑ Initially, there is one region
❑ The system monitors region size: if a threshold is reached, SPLIT
❑ Regions are split in two at the middle key
❑ This creates two regions of roughly equal size
Thanks for your attention!!
◼ HBase is a distributed, column-oriented database built on top of
the Hadoop file system.
◼ Horizontal scaling
❑ Example: if a cluster expands from 10 to 20
RegionServers, it doubles in both storage capacity
and processing capacity
◼ Quick random access to huge amounts of structured data
HBase
◼ Column-oriented database
◼ HBase has denormalized storage
◼ One disk seek retrieves all of a row's data
◼ HBase only allows CRUD operations
◼ It leverages the fault tolerance provided by
the Hadoop File System (HDFS).
◼ It is a part of the Hadoop ecosystem and
provides random real-time read/write access
to data in the Hadoop File System.
Java API to work with HBase
◼ Connect to and access HBase
◼ Create, delete, or manipulate data and tables
◼ Instantiate a configuration object:

Configuration conf = HBaseConfiguration.create();

◼ Establish a connection to HBase:

Connection connection = ConnectionFactory.createConnection(conf);

◼ Use an administration object to manipulate tables:

Admin admin = connection.getAdmin();

◼ Use a Table instance to manipulate data within a table:

Table table = connection.getTable(TableName.valueOf("census"));

◼ Use a table descriptor when defining a new table:

HTableDescriptor tableName = new HTableDescriptor(TableName.valueOf("census"));
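
Putting these calls together, a minimal end-to-end sketch, assuming an
HBase 1.x client on the classpath and a reachable cluster; the table name
('census'), column family ('personal'), and values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Create the table with one column family if it does not exist yet
            TableName name = TableName.valueOf("census");
            if (!admin.tableExists(name)) {
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("personal"));
                admin.createTable(desc);
            }
            // Insert one cell and read it back
            try (Table table = connection.getTable(name)) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                              Bytes.toBytes("Mike Jones"));
                table.put(put);
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
            }
        }
    }
}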
