Apache Cassandra Database - Instaclustr
Apache Cassandra is an open source, non-relational (NoSQL) database that enables continuous availability,
tremendous scale, and data distribution across multiple data centers and cloud availability zones.
Simply put, Cassandra provides a highly reliable data storage engine for applications requiring immense scale.
The open source version of the Cassandra database is used by some of the largest technology companies in the
world to run mission-critical applications. It is widely known that the largest deployment of the open source
version of the Cassandra database is at Apple. Netflix is also a very large user of open source Apache
Cassandra—the foundation for big data. It is estimated that Cassandra is deployed by over 50% of the Fortune
500 companies.
To learn more about open source technologies and the benefits of open source Cassandra, view our
webinar “Power of the Open Source”. The webinar is a great resource for understanding the pitfalls of
proprietary technologies.
Our CPO Ben Slater provides an understanding of where Cassandra fits in the NoSQL world as
well as Cassandra’s ecosystem.
In the blog post “Surveying the Cassandra-compatible database landscape”, Ben Slater, CPO, Instaclustr
shares details on a range of Cassandra-compatible offerings available in the market.
Watch the YouTube video Cassandra Serving Netflix @ Scale – Vinay Chella, Netflix to see how Cassandra
serves Netflix at several million operations per second with multiple nines of availability, across a
deployment of 250+ clusters, 10,000+ nodes, and 3+ PB of data. Download our white paper “How to Maximize
Availability With Apache Cassandra” to learn the architectural, infrastructure, and application-level
strategies you can apply to your Cassandra deployment.
In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes
communicating with each other via a distributed, scalable protocol. Writes are distributed among nodes using a
hash function and reads are channeled onto specific nodes.
Cassandra stores data by dividing the data evenly around its cluster of nodes. Each node is responsible for part
of the data. The act of distributing data across nodes is referred to as data partitioning.
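The partitioning idea described above can be illustrated with a small token ring. This is a minimal sketch, not Cassandra's implementation: the `TokenRing` class, the node names, and the number of virtual nodes are all hypothetical, and MD5 is used only as a convenient stand-in hash (Cassandra's default partitioner is Murmur3-based).

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hash ring: each node owns several token ranges."""

    def __init__(self, nodes, vnodes=8):
        # Give every node `vnodes` positions on the ring so data spreads evenly.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # Stand-in hash function (assumption): first 8 bytes of MD5 as an integer.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key):
        # The first ring position clockwise from the key's token owns the key.
        token = self._token(partition_key)
        index = bisect.bisect_right(self._tokens, token) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c"])
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", ring.owner(key))
```

Because the owner is derived purely from the hash of the partition key, any node can work out where a row lives without consulting a master.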
Cassandra Architecture
Cassandra has a built-for-scale architecture, meaning that it can handle large amounts of data and
millions of concurrent users or operations per second, even across multiple data centers, as easily as it can
manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes to an
existing cluster without having to take it down first. Unlike master-slave or sharded systems, Cassandra
has no single point of failure and is therefore capable of offering true continuous availability and uptime.
The key components of the Cassandra architecture include the following terms and concepts:
Download our white paper Apache Cassandra vs DynamoDB to understand the differences and identify the
technology you should adopt for your unique use case.
We have an abundance of resources on our support portal to help you with creating your cluster.
Download our white paper on Avoiding the Pitfall and Challenges of Cassandra Implementation to identify
mistakes made while implementing Cassandra as a Big Data technology.
Cassandra CQL
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the
database (keyspace) as a container of tables. CQL is a typed language and supports a rich set of data types,
including native types, collection types, user-defined types, tuple types, and custom types.
Programmers work with CQL either through cqlsh, an interactive prompt, or through separate application
language drivers. Read our support article to understand how cqlsh can be used to connect to clusters in
Instaclustr, and the blog Consulting Cassandra: Second Contact with the Monolith (CQLSH).
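As a rough illustration of the keyspace-as-container model and the type families mentioned above, the sketch below holds a few CQL statements as Python strings and sends them through anything that exposes an `execute()` method. The `demo` keyspace, the `address` type, the `users` table, and the `RecordingSession` stand-in are all hypothetical names invented for this example; the replication settings are placeholders, not a recommendation.

```python
# CQL statements showing a keyspace, a user-defined type, and a table that
# mixes native (uuid, text), collection (set), UDT, and tuple columns.
SCHEMA_STATEMENTS = [
    """
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """,
    """
    CREATE TYPE IF NOT EXISTS demo.address (street text, city text)
    """,
    """
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id uuid PRIMARY KEY,
        name text,
        emails set<text>,
        home frozen<address>,
        last_login tuple<timestamp, inet>
    )
    """,
]


class RecordingSession:
    """Stand-in for a driver session: records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, statement, parameters=None):
        self.statements.append((statement, parameters))


def apply_schema(session):
    """Send each schema statement through the session, in order."""
    for statement in SCHEMA_STATEMENTS:
        session.execute(statement)


if __name__ == "__main__":
    # With the DataStax Python driver installed, a real session could be
    # obtained instead (assumes a node reachable on 127.0.0.1):
    #   from cassandra.cluster import Cluster
    #   session = Cluster(["127.0.0.1"]).connect()
    session = RecordingSession()
    apply_schema(session)
    for statement, _ in session.statements:
        print(statement.strip().splitlines()[0])
```

The same statements can be pasted into cqlsh one at a time; the Python wrapper is only there to show how an application driver would submit them.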
Cassandra Migration
Planning to migrate to Cassandra? You need to keep a few things in mind, which include knowing when to
consider migration, how to prepare your application, as well as having an understanding of migration
approaches. The presentation on migrating to Apache Cassandra by our CPO, Ben Slater, is a great resource if you
are considering migrating your cluster to Cassandra.
Download the presentation, Introduction to Managing Apache Cassandra. This presentation by Brooke Thorley,
VP Technical Operations and Customer Services, Instaclustr provides an introduction to managing Apache
Cassandra. If you are new to Cassandra, this presentation will help clear any doubts as you learn tricks used by
experts in managing Cassandra. Using Cassandra, but dealing with high severity incidents in unknown
environments in a Cassandra cluster? You may find the presentation Apache Cassandra consulting and
firefighting useful.
One of the advantages of deploying Spark with Instaclustr is that it is a co-located data engine: it sits right
where your operational database resides, with no need to extract, transform, and load data into a new
environment. Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the
data; Spark worker nodes are co-located with Cassandra and do the data processing. Spark is a batch-
processing system, designed to deal with large amounts of data. When a job arrives, the Spark workers load
data into memory, spilling to disk if necessary.
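To make the division of labour concrete, here is a hedged sketch of how a Spark job might read a Cassandra table through the Spark Cassandra Connector's DataFrame source and aggregate it. It assumes a `SparkSession` started with the connector available; the `demo` keyspace, the `metrics` table, and its `host` and `value` columns are hypothetical.

```python
# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace, table):
    """Build the reader options that identify a single Cassandra table."""
    return {"keyspace": keyspace, "table": table}


def load_cassandra_table(spark, keyspace, table):
    """Return a DataFrame backed by a Cassandra table.

    `spark` is expected to be a pyspark.sql.SparkSession with the
    connector on its classpath (an assumption of this sketch).
    """
    reader = spark.read.format(CASSANDRA_FORMAT)
    return reader.options(**cassandra_options(keyspace, table)).load()


def average_value_by_host(spark):
    """Batch job: average the hypothetical `value` column per `host`."""
    metrics = load_cassandra_table(spark, "demo", "metrics")
    return metrics.groupBy("host").avg("value")
```

Nothing in the job refers to where the data physically lives; with workers running on the Cassandra nodes, the connector can schedule reads close to the data.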
A blog post by our CPO Ben Slater outlines some of the solution patterns where it makes sense to use Spark
Streaming alongside Cassandra.
Ben Bromhead, CTO, Instaclustr takes an in-depth look at how Spark and Cassandra can be used together in his
presentation “Processing 200K Transactions per Second with Apache Spark and Apache Cassandra”.
Our tutorial on getting started with Instaclustr Spark and Cassandra is a good starting point to learn how to
provision a cluster using Spark, Cassandra, and more.
Our technology evangelist, Paul Brebner, wrote an introductory “2001 Space Odyssey themed” series on using
Cassandra, Spark, and Zeppelin for Big Data Predictive Analytics (Machine Learning over Instaclustr’s
Instametrics Cassandra cluster monitoring data):
Third Contact with a Monolith – Long Range Sensor Scan (using materialized views for summary statistics)
Third Contact with a Monolith – Beam Me Down Scotty (linear regression)
Third Contact with a Monolith – In the Pod (SPARK, MLLib, RDDs, Decision Trees)
Fourth Contact with a Monolith – DataFrames, ML Pipelines and Scala
Behind the Scenes – creating the wide table
Using a data notebook (Zeppelin) for data analytics with Cassandra and Spark
The final blog in the series covers Spark Streaming: Apache Spark Structured Streaming with DataFrames.
In the early days after releasing the Cassandra + Spark managed service offering, we had
opportunities to dig deeper into using the Cassandra connector for Spark, both with our
own Instametrics application and while assisting customers with developing and troubleshooting. During this
process, we learnt a few key lessons about how to get the best out of the Cassandra connector for Spark.
Check out the 5 easy tips.
Cassandra on AWS
Cassandra on AWS EBS Infrastructure
Traditionally it was believed that Cassandra and AWS EBS don’t mix. However, with the release of the latest
generation EBS-optimized instances, this belief has changed, and we now know people have had success using
these nodes to run Cassandra. In his blog post, Ben answers many questions around Cassandra on AWS EBS
infrastructure and the Cost of Cassandra on AWS.
VPC Peering
A VPC peering connection is a networking connection between two VPCs that enables you to route traffic
between them privately. Instaclustr supports VPC peering as a mechanism for connecting directly to your
Instaclustr managed cluster. VPC Peering allows you to access your cluster via private IP and results in a much
more secure network setup. View our support page on using VPC Peering.
Cassandra on Azure
Download the presentation “Tips and Tricks of Cassandra on Azure” to learn more about how to get started
with Cassandra on Azure—from production stage, through the first 6 months.
Cassandra, however, forms only one part of the data layer; a range of other core open source technologies can
be effectively integrated with it to provide a more complete data layer solution. DBaaS is moving beyond the
database itself to include the data layer components that interact with the database, such as integrated data
software and related infrastructure.
The Instaclustr Managed Platform provides an integrated data layer with the following complementary open
source technologies.
The “Pick‘n’Mix: Cassandra, Spark, Zeppelin, Elassandra, Kibana, and Kafka” blog looks at possible ways of
using these technologies together.
Managed Cassandra Database
Cassandra is the database of choice for scalable, highly available, reliable, and high-performance applications.
Instaclustr Managed Service for Apache Cassandra gets you up and running quickly, and is the most reliable
way to run Cassandra for your application. We are so confident in the performance of our clusters that we
include latency and performance guarantees in our contracted SLAs. You can enjoy our hosted and fully
managed Apache Cassandra on AWS, Azure, GCP, IBM Cloud, or in your own private data center with 24×7
support.
Apache Lucene: The Cassandra Lucene Index plugin extends Cassandra’s native secondary index to provide
comprehensive search functionality through multivariable, geospatial, and bi-temporal search capabilities.
Cassandra Lucene Index resides right where your operational database resides, so there is no need to extract,
transform, and load data into a new environment.
Apache Zeppelin: Apache Zeppelin provides a notebook user interface to allow interactive development and
execution of code against both Cassandra and Spark, along with data visualization capabilities. Zeppelin gives
you an interactive analytics environment to start querying data in your Cassandra database or running complex
analytics using Apache Spark as soon as your cluster is provisioned. This blog covers Using a data notebook
(Zeppelin) for data analytics with Cassandra and Spark.
Our second white paper “The Unmatchable ROI of Managed Cassandra Service” will take you through the 3 key
points you need to consider when deciding between building your own Cassandra competency center or
outsourcing to an expert Cassandra service provider.
Cassandra Consulting
We have extensive experience in Apache Cassandra Consulting helping our customers develop and deploy high
performance and continually available solutions.
We offer a wide range of Consulting Service Packages that help you take advantage of our expertise in open
source and be guided by our team of experts.
We provide support for all Cassandra database use cases as well as complementary open source technologies
across various industries. We have gained a wealth of experience helping new companies disrupt and mature
companies transform their businesses.
Following a certification process across several critical variables, enterprises can build applications with even
greater confidence.
The Certification framework provides increased assurance that specific releases of Apache Cassandra have
been tested for a range of functional, performance, and integration properties prior to being enabled on the
Instaclustr Managed Platform.