Cassandra Notes

Cassandra is a distributed database that stores huge datasets across commodity servers in a way that maintains high availability and no single point of failure. It uses a decentralized architecture with no master node and provides tunable consistency. Data is distributed across nodes through consistent hashing and replicated for fault tolerance. The main interfaces to Cassandra are CQL, a SQL-like language, and Thrift.

Uploaded by

Amit Shah

http://wiki.apache.org/cassandra/ArticlesAndPresentations
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html

Info from website: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis


================================================
Written in: Java
Main point: Store huge datasets in "almost" SQL
License: Apache
Protocol: CQL3 & Thrift
- CQL3 is very similar to SQL, but with some limitations that come from the
scalability (most notably: no JOINs, no aggregate functions.)
- CQL3 is now the official interface. Don't look at Thrift, unless you're working
on a legacy app. This way, you can live without understanding ColumnFamilies,
SuperColumns, etc.
- Querying by key, or key range (secondary indices are also available)
- Tunable trade-offs for distribution and replication (N, R, W)
- Data can have expiration (set on INSERT).
- Writes can be much faster than reads (when reads are disk-bound)
- Map/reduce possible with Apache Hadoop
- All nodes are similar, as opposed to Hadoop/HBase
- Very good and reliable cross-datacenter replication
- Distributed counter datatype.
- You can write triggers in Java.
Best used: When you need to store data so huge that it doesn't fit on a single
server, but still want a friendly, familiar interface to it.
For example: Web analytics, to count hits by hour, by browser, by IP, etc.
Transaction logging. Data collection from huge sensor arrays.
================================================
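
The N, R, W trade-off listed above can be sketched in a few lines of Python. This is only an illustration, not Cassandra code: the function names are made up for these notes, and the only rule applied is the usual overlap condition that a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """Return True when every read replica set overlaps every write replica set.

    n: number of replicas of each row (replication factor)
    r: replicas that must answer a read
    w: replicas that must acknowledge a write
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    n = 3
    # Quorum reads plus quorum writes overlap in at least one replica.
    print(read_sees_latest_write(n, quorum(n), quorum(n)))
    # Reading and writing a single replica is fastest, but reads may be stale.
    print(read_sees_latest_write(n, 1, 1))
```

Lower R or W buys latency and availability; raising them buys consistency.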

A) Cassandra Architecture: -

Cassandra
- A distributed database.
- There is no master-slave concept and each node is equal.
- A cluster can easily be across more than one data center.

Snitch
- It is how the nodes in a cluster know about the topology of the cluster.
- Types: Dynamic Snitching, SimpleSnitch, RackInferringSnitch, PropertyFileSnitch,
GossipingPropertyFileSnitch, Ec2Snitch, Ec2MultiRegionSnitch

Gossip (Internal communication)


- It is how the nodes in a cluster communicate with each other.
- Every one second, each node communicates with up to three other nodes, exchanging
information about itself and all other nodes that it has information about.
Note: For External communication, such as from an application to C* database,
CQL(Cassandra Query Language) or Thrift are used.
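
The gossip exchange described above can be simulated with a toy model. This is a rough sketch under simplifying assumptions: each node's state is reduced to a single version number, every round stands for one second, and all names are invented for these notes. The real protocol exchanges digests and richer state.

```python
import random


def merge(mine: dict, theirs: dict) -> dict:
    """Keep, for every known node, the entry with the highest version number."""
    merged = dict(mine)
    for node, version in theirs.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_round(views: dict, rng: random.Random, fanout: int = 3) -> None:
    """One round: each node exchanges its view with up to `fanout` other nodes."""
    nodes = list(views)
    for node in nodes:
        views[node][node] += 1  # bump this node's own heartbeat version
        others = [n for n in nodes if n != node]
        for peer in rng.sample(others, min(fanout, len(others))):
            combined = merge(views[node], views[peer])
            views[node] = dict(combined)
            views[peer] = dict(combined)


def rounds_until_full_view(node_count: int, seed: int = 0, max_rounds: int = 100) -> int:
    """Return the round in which every node knows every node, or -1 if never."""
    rng = random.Random(seed)
    # At start every node only knows about itself.
    views = {f"node{i}": {f"node{i}": 0} for i in range(node_count)}
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(len(view) == node_count for view in views.values()):
            return round_no
    return -1


if __name__ == "__main__":
    print(rounds_until_full_view(8))
```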

Data Distribution
- It is done through consistent hashing, to strive for even distribution of data
across the nodes in cluster.
- Rather than all rows of a table existing on only one node, the rows are
distributed across the nodes in cluster, in an attempt to evenly spread out the
load of the table's data.
- To distribute the rows across the nodes, a partitioner is used. The partitioner
uses an algorithm to determine which node a given row of data will go to
- The default partitioner in Cassandra is Murmur3
Murmur3: It hashes the partition key (the first column of the primary key) to
produce a token between -2^63 and 2^63 - 1.
Calculate the token ranges: -
<The Python one-liner below calculates the tokens for 4 nodes; replace 4 with the
actual number of nodes in your env.>
$ python -c 'print([str((2**64 // 4) * i - 2**63) for i in range(4)])'
['-9223372036854775808', '-4611686018427387904', '0', '4611686018427387904']
OR
Use a Murmur3 calculator
- Each node in a cluster is assigned one token range (or multiple ranges with
virtual nodes).
e.g.: Each node is responsible for the token range between the endpoint of the
previous node (exclusive) and its own endpoint (inclusive).
The endpoint of each node is defined below.
NodeA: -100
NodeB: 0
NodeC: 51
NodeD: 100
-> NodeA stores values greater than 100 and values less than or equal to -100
(its range wraps around the ring)
-> NodeB stores values from -99 to 0
-> NodeC stores values from 1 to 51
-> NodeD stores values from 52 to 100
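
The NodeA..NodeD example can be checked with a small lookup function. This is a sketch that reuses the four illustrative endpoints above; the function name is made up, and a real cluster uses tokens across the full Murmur3 range.

```python
from bisect import bisect_left

# Endpoints from the example above, sorted by token.
RING = [(-100, "NodeA"), (0, "NodeB"), (51, "NodeC"), (100, "NodeD")]
ENDPOINTS = [token for token, _ in RING]


def owner(token: int) -> str:
    """Return the node whose range (previous endpoint, own endpoint] holds token."""
    index = bisect_left(ENDPOINTS, token)
    # Tokens above the last endpoint wrap around to the first node.
    return RING[index % len(RING)][1]


if __name__ == "__main__":
    for token in (-500, -100, -99, 0, 1, 51, 52, 100, 101):
        print(token, owner(token))
```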

Replication Factor:
- It must be specified whenever a database (keyspace) is defined.
- It specifies how many copies of the data there will be within a given
database.
- Although 1 can be specified, it is common to specify 2, 3, or more so that if a
node goes down, there is at least one other replica of the data and the data
is not lost with the down node.
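
A rough sketch of why a replication factor above 1 protects against a down node. It assumes the simplest placement rule (as in SimpleStrategy): the first replica goes to the node that owns the token, and the remaining replicas go to the next nodes clockwise on the ring. The ring tokens and function names are illustrative only, and racks and data centers are ignored.

```python
from bisect import bisect_left

# Illustrative four-node ring, sorted by token.
RING = [(-100, "NodeA"), (0, "NodeB"), (51, "NodeC"), (100, "NodeD")]


def replicas(token: int, replication_factor: int) -> list:
    """Nodes holding a copy of the row with this token, walking the ring clockwise."""
    if not 1 <= replication_factor <= len(RING):
        raise ValueError("replication factor must be between 1 and the node count")
    endpoints = [t for t, _ in RING]
    first = bisect_left(endpoints, token) % len(RING)
    return [RING[(first + i) % len(RING)][1] for i in range(replication_factor)]


def surviving_replicas(token: int, replication_factor: int, down_node: str) -> list:
    """Replicas that can still serve the row when one node is down."""
    return [node for node in replicas(token, replication_factor) if node != down_node]


if __name__ == "__main__":
    print(replicas(10, 3))
    print(surviving_replicas(10, 3, "NodeC"))  # two copies are still reachable
    print(surviving_replicas(10, 1, "NodeC"))  # no copy is reachable
```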

Virtual Nodes:
- They are an alternative way to assign token ranges to nodes, and "Virtual Nodes"
are now the default in Cassandra.
- With Virtual Nodes, instead of a node being responsible for just one token range,
it is responsible for many small token ranges (by default, 256 of them)
- Virtual Nodes allow for assigning a high number of ranges to a powerful
computer(e.g. 512) and a lower number of ranges (e.g. 128) to a less powerful
computer
- Virtual Nodes (aka vnodes) were created to make it easier to add new nodes to a
cluster while keeping the cluster balanced
- When a new node is added, it receives many small token range slices from the
existing nodes, to maintain a balanced cluster
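
The effect of giving a powerful computer more ranges can be sketched as follows. This is a simulation under stated assumptions, not how Cassandra allocates tokens internally: each node simply draws its tokens at random (with a fixed seed), and the node names and function names are invented for the example.

```python
import random

RING_SIZE = 2**64
MIN_TOKEN = -(2**63)


def assign_vnodes(num_tokens_by_node: dict, seed: int = 42) -> list:
    """Give each node its number of random tokens; return sorted (token, node) pairs."""
    rng = random.Random(seed)
    ring = []
    for node, num_tokens in num_tokens_by_node.items():
        for _ in range(num_tokens):
            ring.append((rng.randrange(MIN_TOKEN, MIN_TOKEN + RING_SIZE), node))
    return sorted(ring)


def ownership(ring: list) -> dict:
    """Count how many token values of the ring each node is responsible for."""
    owned = {}
    for i, (token, node) in enumerate(ring):
        previous = ring[i - 1][0]  # for i == 0 this wraps to the last token
        owned[node] = owned.get(node, 0) + (token - previous) % RING_SIZE
    return owned


if __name__ == "__main__":
    ring = assign_vnodes({"powerful": 512, "default": 256, "small": 128})
    for node, size in sorted(ownership(ring).items()):
        print(f"{node}: {size / RING_SIZE:.1%} of the ring")
```

With many small ranges per node, each node's share of the ring ends up roughly proportional to its number of tokens.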

===================================================================================
=========================================================================
B) Installing and Configuring

Installation: -
- http://www.planetcassandra.org/cassandra/
- Cassandra is installed in whichever directory you unzip the download into.

Configuration: -
- Go inside conf directory to see configuration files.
(/Users/ashah/cassandra/dsc-cassandra-3.0.0/conf)
- cassandra.yaml is main configuration file.
File permission: -
<if you have modified cassandra.yaml as per below then create those directories
and give permission>
- sudo mkdir /var/lib/cassandra
- sudo mkdir /var/log/cassandra
- sudo chown -R $USER:$GROUP /var/lib/cassandra
- sudo chown -R $USER:$GROUP /var/log/cassandra

Starting/Stopping Cassandra: -

Way 1)
-> Start
<for now it is via root user>
- $pwd
o/p:/Users/ashah/cassandra/dsc-cassandra-3.0.0
- bin/cassandra

-> Stop
- ps aux | grep cass
- kill <pid>

Way 2)
- start: bin/cassandra -f
- stop: Ctrl + C (in the terminal where it is running in the foreground)

Checking Status: -
- bin/nodetool status
- bin/nodetool info [-h <host>]
- bin/nodetool ring

Accessing the Cassandra system.log File


- Location: /Users/ashah/cassandra/dsc-cassandra-3.0.0/logs
- The log files are system.log and debug.log.
- Current version:: The log file directory is set in:
/Users/ashah/cassandra/dsc-cassandra-3.0.0/conf/logback.xml
- Earlier versions:: The log file directory is set in:
/Users/ashah/cassandra/dsc-cassandra-<x>/conf/log4j-server.properties

===================================================================================
=========================================================================
C) Communicating with Cassandra

Understanding ways to communicate with Cassandra: -


- CQL (Cassandra Query Language) is a SQL-like query language for communicating
with Cassandra, created to make it easy for people familiar with SQL to work with
Cassandra.
e.g.: select home_id, datetime, event, code_used from activity;
* CQL commands are not case-sensitive.
* Although CQL looks similar to SQL, it does not offer all of the options that
SQL does, due to the distributed nature of the C* database.
- Thrift is a low-level API, currently still supported in Cassandra (support may be
phased out in a future release of C*). It existed before CQL.
- For administrative activities, such as cluster monitoring and management tasks,
tools built on JMX (Java Management Extensions) are commonly used.

CQLSH: -
- bin/cqlsh
- cqlsh> HELP
- cqlsh> help create_keyspace
- A semicolon (";") is optional for cqlsh commands but mandatory for CQL commands.

===================================================================================
========================================================================
D) Creating a database

Understanding a Cassandra Database: -


- In C*, a database is defined as a keyspace -> Within keyspace tables can be
defined.
- Check existing keyspaces: -
cqlsh> describe keyspaces;
- To see inside keyspace: -
cqlsh> describe keyspace <name>;

Defining a keyspace: -
- A keyspace name is case sensitive only if you put it inside double quotes;
otherwise it is folded to lower case.
e.g.: a) CREATE KEYSPACE "Test" :: This will be created as Test.
b) CREATE KEYSPACE Test :: This will be created as test.
- A keyspace can be defined through the CREATE KEYSPACE command.
->
CREATE KEYSPACE vehicle_tracker WITH REPLICATION =
{'class':'NetworkTopologyStrategy', 'dc1':3, 'dc2':2};
<'dc1':3 means data center 1 holds 3 replicas of the data; in the same way, data
center 2 holds 2 replicas of the data>
->
CREATE KEYSPACE vehicle_tracker WITH REPLICATION = {'class':'SimpleStrategy',
'replication_factor':1};
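
The naming rule and the two replication styles above can be sketched as plain string handling in Python. These helpers are hypothetical, written only for these notes; they do not talk to Cassandra and they do not validate or escape names.

```python
def stored_name(identifier: str) -> str:
    """Name Cassandra ends up with: quoted names keep case, others are lower-cased."""
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        return identifier[1:-1]
    return identifier.lower()


def create_keyspace(identifier: str, replication: dict) -> str:
    """Render a CREATE KEYSPACE statement from a replication map."""
    options = ", ".join(
        f"'{key}':'{value}'" if isinstance(value, str) else f"'{key}':{value}"
        for key, value in replication.items()
    )
    return f"CREATE KEYSPACE {identifier} WITH REPLICATION = {{{options}}};"


if __name__ == "__main__":
    print(stored_name('"Test"'), stored_name("Test"))
    print(create_keyspace(
        "vehicle_tracker",
        {"class": "NetworkTopologyStrategy", "dc1": 3, "dc2": 2},
    ))
    print(create_keyspace(
        "vehicle_tracker",
        {"class": "SimpleStrategy", "replication_factor": 1},
    ))
```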

Deleting a keyspace: -
- DROP KEYSPACE vehicle_tracker;

Working inside a keyspace: -


- USE <keyspace_name>

===================================================================================
=========================================================================
E) Creating a Table

Creating/dropping a Table: -
- CREATE TABLE activity
(home_id text, datetime timestamp, event text, code_used text, PRIMARY
KEY (home_id, datetime)) WITH CLUSTERING ORDER BY (datetime DESC);
- DROP TABLE activity;

Defining Columns And Data Type: -


- Data types: ascii, bigint, blob, boolean, counter, decimal, double, float, inet,
int, list, map, set, text, timestamp, uuid, timeuuid, varchar, varint

Defining a primary key: -


- Same as in other databases: the primary key uniquely identifies a row.

Recognizing a partition key: -


- The partition key is hashed by the partitioner to determine which node in the
cluster will store the partition.
- For a single-column primary key, that column is the partition key.
- For a compound primary key, the first column listed in the primary key defines
the partition key.
-> Internally, all of the CQL rows that have the same partition key value are
stored in the same partition (identified by the partition key, aka RowKey).

Specifying a descending clustering order


- A table can be defined to store its data in ascending (default) or descending
order
e.g.: WITH CLUSTERING ORDER BY (datetime DESC)
- Specifying descending causes writes to take a little longer, as cells are
inserted at the start of a partition rather than added at the end, but it improves
read performance when descending order is needed by an application
- Once clustering order is defined, changing the clustering order of a table is not
an option.
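
Why descending order helps a "latest events first" application can be sketched with a sorted list. This is an in-memory illustration only; the function names and the sample events are assumptions made for these notes.

```python
from datetime import datetime


def ts(text: str) -> datetime:
    """Parse timestamps written like the ones in the INSERT example."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def store_partition(rows: list, descending: bool = True) -> list:
    """Keep the rows of one partition sorted by the clustering column (datetime)."""
    return sorted(rows, key=lambda row: row["datetime"], reverse=descending)


def latest(partition_desc: list, limit: int) -> list:
    """With descending order, the newest rows are simply the first ones."""
    return partition_desc[:limit]


if __name__ == "__main__":
    # Sample events for a single home_id (made-up data).
    rows = [
        {"datetime": ts("2014-05-21 07:32:16"), "event": "alarm set"},
        {"datetime": ts("2014-05-22 09:44:23"), "event": "alarm set"},
        {"datetime": ts("2014-05-21 18:30:33"), "event": "alarm turned off"},
    ]
    for row in latest(store_partition(rows), 2):
        print(row["datetime"], row["event"])
```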

===================================================================================
=========================================================================
F) Inserting Data

Understanding Ways to Write Data


- INSERT INTO (CQL command)
- COPY command
- sstableloader tool (bulk loading)

Using the INSERT INTO command


- Same as other DB Insert command.
e.g. : INSERT INTO activity (home_id, datetime , event, code_used) VALUES
('H01474777', '2014-05-21 07:32:16', 'alarm set', '5599');

Using the COPY command


- The COPY command can be used to import data (COPY FROM) from a .csv file.
e.g.: COPY activity (home_id, datetime , event, code_used) FROM
'/Users/ashah/events.csv' WITH header = true AND delimiter = '|';

- The COPY command can be used to export data (COPY TO) to a .csv file.
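
A file in the shape expected by the COPY FROM example (header row, '|' delimiter) could be produced with Python's csv module. This is a sketch: the helper name is made up, and the single data row is the one from the INSERT example.

```python
import csv
import io

COLUMNS = ["home_id", "datetime", "event", "code_used"]


def to_pipe_delimited_csv(rows: list) -> str:
    """Render rows as CSV text with a header line and '|' as the delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    text = to_pipe_delimited_csv(
        [("H01474777", "2014-05-21 07:32:16", "alarm set", "5599")]
    )
    print(text, end="")
    # To import it, write the text to a file such as events.csv and point COPY FROM at it.
```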

How Data is stored in C*


- Internally, a partition key value (in Thrift, referred to as a row key value) is
what makes an internal storage row unique.

How Data is stored on Disk


- When data is written to a table in Cassandra, it goes to both a commit log on
disk (for replay in case of node failure) and to an in-memory structure called a
memtable.
- Once the memtable for a table is full, it is flushed to disk as an SSTable.
- For each table on each node there is a memtable.
- The SSTables for a table are stored on disk, in the location specified in the
cassandra.yaml file.
- To see the contents of an SSTable, sstable2json can be used. (It looks obsolete
in 3.0.)
- To flush the content to disk use below command.
-> bin/nodetool flush home_security
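
The write path above (commit log, in-memory table, flush to an SSTable) can be modelled with a toy class. It is a simplified sketch: everything is kept in Python lists and dicts, the class and attribute names are invented, "full" just means a fixed number of keys, and the commit log is never trimmed after a flush.

```python
class TableStore:
    """Toy model of one table on one node."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory table (memtable)
        self.sstables = []     # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value) -> None:
        self.commit_log.append((key, value))  # first: durable log entry
        self.memtable[key] = value            # then: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the in-memory table out as a new sorted SSTable and empty it."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check memory first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


if __name__ == "__main__":
    store = TableStore(memtable_limit=2)
    store.write("H01474777", "alarm set")
    store.write("H00999943", "alarm set")         # reaches the limit, triggers a flush
    store.write("H01474777", "alarm turned off")  # newer value stays in memory
    print(len(store.sstables), store.read("H01474777"))
```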

===================================================================================
=========================================================================
G) Modelling Data

===================================================================================
=========================================================================
H) Creating an application

===================================================================================
=========================================================================
