NO SQL3 Columnstore
• Column family stores use row and column identifiers as general-purpose keys for data lookup.
• They typically lack typed columns, secondary indexes, triggers, and query languages.
• Almost all column family stores have been heavily influenced by the original Google Bigtable paper.
• HBase, Hypertable, and Cassandra are good examples of systems that have Bigtable-like interfaces, although their implementations vary.
Column Store
• A column store database stores all values of a table column at the same location on disk, in the same way a row store keeps each row's data together.
• Column stores are used in many OLAP systems because their strength is rapid calculation of column aggregates.
• The key structure in column family stores makes use of the row ID and column name, but also has two additional attributes.
• In addition to the column name, a column family is used to group similar column names together.
• The addition of a timestamp to the key allows each cell in the table to store multiple versions of a value over time.
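The four-part key described above (row ID, column family, column name, timestamp) can be sketched as a small in-memory table. This is only an illustration: the class name `VersionedTable`, its methods, and the sample row key are hypothetical and do not correspond to any real client API.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory table keyed by (row_id, family, column) plus a timestamp."""

    def __init__(self):
        # (row_id, family, column) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, row_id, family, column, value, timestamp):
        # Writing with a new timestamp adds a version instead of overwriting.
        self.cells[(row_id, family, column)][timestamp] = value

    def get(self, row_id, family, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.cells.get((row_id, family, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = VersionedTable()
table.put("com.example.www", "contents", "html", "<html>v1</html>", timestamp=1)
table.put("com.example.www", "contents", "html", "<html>v2</html>", timestamp=5)

latest = table.get("com.example.www", "contents", "html")
as_of_3 = table.get("com.example.www", "contents", "html", timestamp=3)
```

Here `latest` is the version written at timestamp 5, while `as_of_3` falls back to the version written at timestamp 1, since that is the newest one not later than 3.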
Benefits of column family systems
• The column family approach of using a row ID and column name as a lookup key is a flexible way to store data and gives you higher scalability and availability.
  • At their core, column family systems are noted for their scalable nature: as you add more data to your system, your investment goes into the new nodes added to the computing cluster.
  • By building a system that scales on distributed networks, you gain the ability to replicate data on multiple nodes in a network.
• Saves you time and hassle when adding new data to your system.
  • A key feature of the column family store is that you don’t need to fully design your data model before you begin inserting data.
  • Your groupings of column families should be known in advance, but row IDs and column names can be created at any time.
• Since column family systems don’t rely on joins, they tend to scale well on distributed systems. They have automatic failover built in to detect failing nodes, and algorithms to identify corrupt data.
• They leverage advanced hashing and indexing tools such as Bloom filters to perform probabilistic analysis on large data sets. The larger the dataset, the more these tools pay off.
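To make the Bloom filter idea concrete, here is a minimal sketch of one. It answers "is this key possibly present?" without storing the keys themselves: a "no" is always correct, a "yes" may occasionally be a false positive. The class, its parameters (`num_bits`, `num_hashes`), and the salted SHA-256 hashing are choices made for this illustration, not the scheme of any particular database.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_id in ["row-1", "row-2", "row-3"]:
    bf.add(row_id)

present = bf.might_contain("row-2")
```

A store can consult such a filter before touching disk: if `might_contain` returns `False`, the file holding that filter cannot contain the row, so the read can be skipped.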
Drawbacks of column family systems
• They may not be appropriate for small datasets.
• You usually need at least five processors to justify a column family cluster, since many systems are designed to store data on three different nodes for replication.
• Column family systems also don’t support standard SQL queries for real-time data access.
• They may have higher-level query languages, but these are often used to generate batch MapReduce jobs.
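The three-node replication mentioned above can be illustrated with a simple hash ring, which is one common placement scheme and is used here only as an assumption for the sketch. Real systems differ: Cassandra places replicas using consistent hashing, while HBase relies on the replication of its underlying file system. The function `replicas_for` and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_right


def _hash(value):
    """Map a string to a position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(row_id, nodes, replication_factor=3):
    """Pick `replication_factor` distinct nodes by walking a hash ring."""
    if len(nodes) < replication_factor:
        raise ValueError("not enough nodes for the requested replication factor")
    ring = sorted((_hash(node), node) for node in nodes)
    # Start at the first node clockwise from the row's position on the ring.
    start = bisect_right([h for h, _ in ring], _hash(row_id))
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("user:42", nodes)
```

With only two nodes the call raises `ValueError`, which mirrors the point of the slide: a cluster smaller than the replication factor cannot hold a full set of copies.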
Comparison
• Key
  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup
• Column Family
  • Has a name (string)
  • Contains one or more related columns
• Column
  • Belongs to one column family
  • Included inside the row
  • familyName:columnName
• Version Number
  • Unique within each key
  • By default, the system’s timestamp
  • Data type is Long
• Value (Cell)
  • Byte array
[Figure: example table with callouts “Column named ‘apache.com’” and “Version number for each row”]
Notes on Data Model
• An HBase schema consists of several tables
• Each table consists of a set of Column Families
• Columns are not part of the schema
• HBase has Dynamic Columns
• Because column names are encoded inside the cells
• Different cells can have different columns
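The notes above can be sketched in a few lines: the schema fixes only the column families, while columns appear the moment they are written, addressed as `familyName:columnName`. This models the idea with plain dictionaries and is not the HBase client API; `SCHEMA`, `put`, the table name, and the sample rows are all hypothetical.

```python
# Hypothetical schema: a table declares only its column families.
SCHEMA = {"webtable": {"anchor", "contents"}}

# row key -> {"family:qualifier": value}; columns are created on write.
rows = {}


def put(table, row_key, column, value):
    """Store a value under a familyName:columnName column of a row."""
    family, _, qualifier = column.partition(":")
    if family not in SCHEMA[table]:
        raise KeyError(f"column family {family!r} is not in the schema")
    if not qualifier:
        raise ValueError("expected familyName:columnName")
    rows.setdefault(row_key, {})[column] = value


put("webtable", "com.example.www", "contents:html", b"<html>...</html>")
put("webtable", "com.example.www", "anchor:apache.com", b"APACHE")
put("webtable", "org.example", "anchor:example.net", b"Example")

# Each row ends up with its own set of columns.
columns_by_row = {key: sorted(cells) for key, cells in rows.items()}
```

The two rows share the same column families but hold different columns, and writing to an undeclared family such as `metrics:hits` is rejected because families, unlike columns, are part of the schema.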