Introduction To NoSQL
An aggregate is a collection of data that we interact with as a unit. Aggregates form the
boundaries for ACID operations with the database. Key-value, document, and column-family
databases can all be seen as forms of aggregate-oriented database. Aggregates make it easier for
the database to manage data storage over clusters.
Aggregate-oriented databases work best when most data interaction is done with the same
aggregate; aggregate-ignorant databases are better when interactions use data organized in many
different formations. Aggregate-oriented databases make inter-aggregate relationships more
difficult to handle than intra-aggregate relationships. They often compute materialized views to
provide data organized differently from their primary aggregates. This is often done with map-
reduce computations.
The impedance mismatch is the difference between the relational model and in-memory data
structures. The relational data model organizes data into a structure of tables and rows, or more
properly, relations and tuples. In the relational model, a tuple is a set of name-value pairs and a
relation is a set of tuples. All operations in SQL consume and return relations, which leads to the
mathematically elegant relational algebra.
This foundation on relations provides a certain elegance and simplicity, but it also introduces
limitations. In particular, the values in a relational tuple have to be simple; they cannot contain
any structure, such as a nested record or a list. This limitation isn't true for in-memory data
structures, which can take on much richer structures than relations. As a result, if you want to use
a richer in-memory data structure, you have to translate it to a relational representation to store it
on disk. Hence the impedance mismatch: two different representations that require translation.
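As a minimal illustration (with invented names) of both the aggregate idea and the impedance mismatch, the Scala sketch below keeps an order and its line items as one nested value, the way an aggregate-oriented store would, and then shows the flattening into plain tuples that a relational schema requires.

// Hypothetical order aggregate: the nested structure is one unit of interaction.
case class LineItem(productId: String, quantity: Int, price: BigDecimal)
case class Order(orderId: String, customer: String, items: List[LineItem])

object ImpedanceMismatchDemo extends App {
  // In memory (or in an aggregate-oriented store) the whole order is one value.
  val order = Order("o-1", "alice", List(
    LineItem("p-42", 2, BigDecimal(9.99)),
    LineItem("p-77", 1, BigDecimal(3.50))
  ))

  // To store it relationally, the nested list must be translated into flat
  // tuples (rows) in a child table keyed by the order id.
  val orderRow     = (order.orderId, order.customer)
  val lineItemRows = order.items.map(i => (order.orderId, i.productId, i.quantity, i.price))

  println(orderRow)
  lineItemRows.foreach(println)
}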
Replication takes the same data and copies it over multiple nodes. Sharding puts different
data on different nodes.
Sharding is particularly valuable for performance because it can improve both read and
write performance. Using replication, particularly with caching, can greatly improve read
performance but does little for applications that have a lot of writes. Sharding provides a
way to scale horizontally.
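A minimal sketch of the idea (node names and the hash-based routing rule are assumptions for illustration): each key is assigned to one shard, so reads and writes for different keys land on different nodes, whereas replication would copy the same data to every node.

// Illustrative shard routing: hash the key and pick one of N nodes.
object ShardRouter extends App {
  val nodes = Vector("node-a", "node-b", "node-c")   // hypothetical node names

  def shardFor(key: String): String =
    nodes(java.lang.Math.floorMod(key.hashCode, nodes.size))

  // Different keys spread across different nodes (horizontal scaling).
  Seq("user-1", "user-2", "user-3").foreach { k =>
    println(s"$k -> ${shardFor(k)}")
  }
}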
Software-as-a-service systems in general do not provide an SQL-like store, so people have
become more interested in NoSQL-type stores. Much of the take-off can be related to this
history. Scaling took some new ideas at Google, and everyone else follows suit because
this is the only solution they know to the scaling problem right now. Hence, you are willing to
rework everything around Google's distributed database idea because it is the only way to
scale beyond a certain size.
Architecture
Big data applications are generally viewed from four perspectives: volume, velocity, variety and
veracity. NoSQL applications, in contrast, are driven by the inability of a current application to
scale efficiently. Though volume and velocity are important, NoSQL also focuses on variability
and agility.
NoSQL is often used to store big data. NoSQL stores provide simpler scalability and improved
performance relative to traditional RDBMSs. They help the big data movement in a big way by
storing unstructured data and providing a means to query it as per requirements. There are
different kinds of NoSQL data stores, which are useful for different kinds of applications. While
evaluating a particular NoSQL solution, one should look at requirements such as automatic
scalability, data loss, payment model, etc.
Features
When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational model is
not designed to address:
Key-value store: A simple data storage system that uses a key to access a value.
Examples - Redis, Riak, DynamoDB, Memcached
Column family store: A sparse matrix system that uses a row and a column as keys.
Examples - HBase, Cassandra, Bigtable
Graph store: For relationship-intensive problems. Examples - Neo4j, InfiniteGraph
Document store: Storing hierarchical data structures directly in the database. Examples -
MongoDB, CouchDB, MarkLogic
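As a purely illustrative sketch of the key-value model (the class and method names below are invented, not any vendor's client API), a key-value store exposes little more than put, get and delete against an opaque value:

import scala.collection.concurrent.TrieMap

// A toy in-memory key-value store: the value is opaque to the store itself.
class KeyValueStore[V] {
  private val data = TrieMap.empty[String, V]

  def put(key: String, value: V): Unit = data.update(key, value)
  def get(key: String): Option[V]      = data.get(key)
  def delete(key: String): Unit        = data.remove(key)
}

object KeyValueStoreDemo extends App {
  val store = new KeyValueStore[String]
  store.put("session:42", """{"user":"alice","cart":["p-1","p-2"]}""")
  println(store.get("session:42"))
}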
Scala
Scala is an object-functional programming and scripting language for general software
applications, designed to express solutions in a concise manner.
Scala source code is intended to be compiled to Java bytecode, so that the resulting executable
code runs on a Java virtual machine. Scala provides language interoperability with Java, so that
libraries written in either language may be referenced directly in Scala or Java code. Like Java,
Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming
language. Unlike Java, Scala has many features of functional programming languages like
Scheme, Standard ML and Haskell, including currying, immutability, lazy evaluation, and
pattern matching. It also has an advanced type system supporting algebraic data types,
covariance and contravariance, higher-order types (but not higher-rank types), and anonymous
types. Other features of Scala not present in Java include operator overloading, optional
parameters, named parameters, and raw strings. Conversely, a feature of Java not in Scala is
checked exceptions, which has proved controversial.
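A short sketch of some of these functional features (all names below are illustrative): currying, immutability with case classes, pattern matching over an algebraic data type, and lazy evaluation.

object FunctionalFeatures extends App {
  // Currying: a function applied one argument list at a time.
  def add(x: Int)(y: Int): Int = x + y
  val addFive: Int => Int = add(5)
  println(addFive(3))                       // 8

  // An algebraic data type with immutable case classes and pattern matching.
  sealed trait Shape
  case class Circle(r: Double)          extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
  println(area(Rect(2, 3)))                 // 6.0

  // Lazy evaluation: the value is computed only on first use.
  lazy val expensive = { println("computing..."); 42 }
  println(expensive)
}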
Advantages Of Scala
Arrays use regular generics, while in other languages generics are bolted on as an
afterthought and are completely separate from, but have overlapping behaviors with, arrays.
Scala has an immutable "val" as a first-class language feature. The "val" of Scala is similar
to a Java final variable: contents may mutate, but the top-level reference is immutable.
Scala lets "if" blocks, "for-yield" loops and code in braces return a value. This is
preferable, and eliminates the need for a separate ternary operator (see the sketch after this list).
Scala has singleton objects rather than the classic statics of C++/Java/C#. It is a cleaner
solution.
Persistent immutable collections are the default and built into the standard library.
It has native tuples and concise code.
It has no boilerplate code.
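The sketch below (illustrative names only) shows a few of these points in code: "if" and "for-yield" as value-returning expressions, a singleton object in place of statics, and a native tuple.

object ScalaAdvantagesDemo extends App {
  val n = 7

  // 'if' is an expression, so no separate ternary operator is needed.
  val parity = if (n % 2 == 0) "even" else "odd"

  // 'for ... yield' also returns a value (here, a new immutable collection).
  val squares = for (i <- 1 to 5) yield i * i

  // A native tuple, destructured into two vals.
  val (name, age) = ("alice", 30)

  println(s"$n is $parity, squares = $squares, $name is $age")
}

// A singleton object instead of a class full of static members.
object Config {
  val appName = "demo-app"
}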
Variables
Values and variables are the two forms of bindings in Scala. A value is constant and cannot
be changed once assigned; it is immutable. A regular variable, on the other hand, is
mutable, and you can change its value.
Scala Literals
Integer literals
Floating point literals
Boolean literals
Symbol literals
Character literals
String literals
Multi-Line strings
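The sketch below gives one example of each literal form listed above (the values are arbitrary).

object LiteralsDemo extends App {
  val i: Int        = 42                 // integer literal
  val hex: Int      = 0xFF               // integer literal in hexadecimal
  val d: Double     = 3.14               // floating point literal
  val flag: Boolean = true               // boolean literal
  val sym: Symbol   = Symbol("status")   // symbol literal
  val c: Char       = 'A'                // character literal
  val s: String     = "hello"            // string literal
  val multi: String =
    """line one
      |line two""".stripMargin           // multi-line string literal
  println(s"$i $hex $d $flag $sym $c $s\n$multi")
}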
Apache Spark
Apache Spark is an open-source distributed general-purpose cluster-computing framework.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.
Spark Ecosystem
Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like
clustering, regression, classification, etc.
Spark Streaming - This library is used to process real-time streaming data.
Spark GraphX - Spark API for graph-parallel computations with basic operators like
joinVertices, subgraph, aggregateMessages, etc.
Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI
tools.
Transformation
Transformations are functions executed on demand to produce a new RDD. Transformations are
evaluated lazily and only run when an action follows them. Some examples of transformations
include map, filter and reduceByKey.
Action
Actions are the results of RDD computations or transformations. After an action is performed,
the data from the RDD moves back to the local machine. Some examples of actions include
reduce, collect, first, and take.
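A minimal RDD sketch under assumed local-mode settings: the map, filter and reduceByKey transformations build up a new RDD lazily, and nothing runs until the collect action is called (which also illustrates the lazy evaluation described next).

import org.apache.spark.sql.SparkSession

object RddDemo extends App {
  val spark = SparkSession.builder()
    .appName("rdd-demo")
    .master("local[*]")                  // local mode, for illustration only
    .getOrCreate()
  val sc = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "scala", "spark", "nosql"))

  val counts = words
    .filter(_.nonEmpty)                  // transformation
    .map(w => (w, 1))                    // transformation
    .reduceByKey(_ + _)                  // transformation

  // The action triggers the computation and brings results to the driver.
  counts.collect().foreach(println)

  spark.stop()
}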
Lazy Evaluation
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate
on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget,
but it does nothing unless asked for the final result.
Executor: The worker processes that run the individual tasks of a Spark job.
Cluster Manager: A pluggable component in Spark, to launch Executors and Drivers. The
cluster manager allows Spark to run on top of other external managers like Apache Mesos or
YARN.
Apache SSL
SSL (Secure Socket Layer) data transport requires encryption, and many governments have
restrictions upon the import, export, and use of encryption technology. If Apache included SSL
in the base package, its distribution would involve all sorts of legal and bureaucratic issues, and
it would no longer be freely available. Also, some of the technology required to talk to current
clients using SSL is patented by RSA Data Security, which restricts its use without a license.
Spark SQL:
Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL
provides various APIs that provide information about the structure of the data and the
computation being performed on that data. You can use SQL as well as Dataset APIs to interact
with Spark SQL.
DataFrame:
A DataFrame is a Dataset organized into named columns. A DataFrame is equivalent to a
Relational Database Table. DataFrames can be created from a variety of sources such as
structured data files, external databases, Hive tables and Resilient Distributed Datasets.
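A small sketch (column names, table name and values are invented) of creating a DataFrame from an in-memory collection and querying it both through the Dataset API and through SQL over a temporary view:

import org.apache.spark.sql.SparkSession

object SparkSqlDemo extends App {
  val spark = SparkSession.builder()
    .appName("spark-sql-demo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // A DataFrame from an in-memory collection; it could equally come from
  // structured data files, Hive tables or an external database.
  val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

  // Dataset/DataFrame API
  people.filter($"age" > 26).show()

  // SQL over the same data via a temporary view
  people.createOrReplaceTempView("people")
  spark.sql("SELECT name FROM people WHERE age > 26").show()

  spark.stop()
}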
Shark
Most data users know only SQL and are not good at programming. Shark is a tool,
developed for people who are from a database background, to access Scala MLlib capabilities
through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering
compatibility with the Hive metastore, queries and data.
MLlib
MLlib is a library provided in Apache Spark for machine learning. It provides tools for common
machine learning algorithms, featurization, pipelines, persistence and utilities for statistics, data
handling, etc. Apache Spark MLlib provides ML Pipelines, which are chains of algorithms
combined into a single workflow. An ML Pipeline consists of the following key components.
DataFrame - The Apache Spark ML API uses DataFrames provided in the Spark SQL library to
hold a variety of data types such as text, feature vectors, labels and predictions.
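A sketch of an ML Pipeline as a single workflow, chaining a Tokenizer and HashingTF feature stage with a LogisticRegression estimator (the toy data and column names are invented for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineDemo extends App {
  val spark = SparkSession.builder().appName("pipeline-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val training = Seq(
    ("spark is great", 1.0),
    ("slow and buggy", 0.0)
  ).toDF("text", "label")

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr        = new LogisticRegression().setMaxIter(10)

  // The pipeline itself is an estimator; fit() runs each stage in order.
  val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

  model.transform(Seq(("spark is fast", 0.0)).toDF("text", "label"))
       .select("text", "prediction").show()

  spark.stop()
}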
Different algorithms:
Extraction algorithms
Spark MLlib machine learning library provides the following feature extraction algorithms.
TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature extraction
algorithm that determines the importance of a term to a document.
Transformation algorithms
StopWordsRemover - Stop words remover takes a sequence of strings as input and removes all
stop words from the input. Stop words are words that occur frequently in a document but carry
little importance.
Discrete Cosine Transform - The discrete cosine transformation transforms a sequence in the
time domain to another sequence in the frequency domain.
Imputer - The Imputer transformer completes missing values in a dataset, either using the mean
or the median of the columns in which the missing values are located.
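Small sketches of two of the transformers above, StopWordsRemover and Imputer (the data and column names are made up for illustration):

import org.apache.spark.ml.feature.{Imputer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object TransformerDemo extends App {
  val spark = SparkSession.builder().appName("transformer-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Remove common stop words from a tokenized column.
  val docs = Seq((0, Seq("the", "quick", "brown", "fox"))).toDF("id", "words")
  new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    .transform(docs).show(false)

  // Fill missing values with the column mean (Double.NaN marks the gaps).
  val nums = Seq((1.0, 4.0), (Double.NaN, 6.0), (3.0, Double.NaN)).toDF("a", "b")
  new Imputer().setInputCols(Array("a", "b")).setOutputCols(Array("a_out", "b_out"))
    .fit(nums).transform(nums).show()

  spark.stop()
}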
Selection algorithms:
VectorSlicer - VectorSlicer is a selection algorithm that takes a feature vector as input and
outputs a new feature vector that is a subarray of the original features.
Locality Sensitive Hashing - LSH is a feature selection algorithm that hashes data points into
buckets, so that the data points which are close to each other are in the same buckets with high
probability, while data points that are far away from each other are very likely in different
buckets. Locality Sensitive Hashing is used in clustering, approximate nearest neighbor search
and outlier detection with large datasets.
Classification Algorithm:
Decision tree classifier - Decision trees are a popular family of classification and regression
methods.
Random forest classifier - Random forests are a popular family of classification and regression
methods.
Linear support vector machine - A support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks.
Regression algorithms
Linear Regression - Linear regression models the label as a linear combination of the features
and is one of the most commonly used regression methods.
Decision Tree Regression - Decision trees are a popular family of classification and regression
methods.
Random Forest Regression - Random forests are a popular family of classification and
regression methods.
Gradient-boosted tree regression - Gradient-boosted trees (GBTs) are a popular regression
method using ensembles of decision trees
Survival Regression - Spark MLlib implements the Accelerated failure time (AFT) model
which is a parametric survival regression model for censored data.
Clustering algorithm
K-means - k-means is one of the most commonly used clustering algorithms that clusters the
data points into a predefined number of clusters.
Bisecting k-means - Bisecting k-means is a kind of hierarchical clustering using a divisive (or
"top-down") approach: all observations start in one cluster, and splits are performed recursively
as one moves down the hierarchy.
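A minimal k-means sketch using Spark MLlib's DataFrame-based API (the 2-D points below are toy values): the points are clustered into a predefined number of clusters, k = 2.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansDemo extends App {
  val spark = SparkSession.builder().appName("kmeans-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val points = Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
  ).map(Tuple1.apply).toDF("features")

  // Cluster into k = 2 clusters and print the learned centers.
  val model = new KMeans().setK(2).setSeed(1L).fit(points)
  model.clusterCenters.foreach(println)

  spark.stop()
}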
Filtering Algorithm:
Collaborative filtering is mostly used for recommender systems. Spark MLlib implements the
following collaborative filtering algorithms.
Explicit vs. implicit feedback - The standard approach to matrix factorization based
collaborative filtering treats the entries in the user-item matrix as explicit preferences given by
the user to the item, for example, users giving ratings to movies.
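Spark MLlib's matrix factorization implementation is ALS (alternating least squares). The sketch below fits ALS on a few invented explicit ratings; setImplicitPrefs(true) would switch it to the implicit-feedback variant.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsDemo extends App {
  val spark = SparkSession.builder().appName("als-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Explicit feedback: users rating items directly.
  val ratings = Seq(
    (0, 10, 4.0), (0, 11, 1.0),
    (1, 10, 5.0), (1, 12, 2.0)
  ).toDF("userId", "itemId", "rating")

  val als = new ALS()
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .setRank(5).setMaxIter(5)

  val model = als.fit(ratings)
  model.recommendForAllUsers(2).show(false)

  spark.stop()
}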
Spark Streaming is a library provided in Apache Spark for processing live data streams; it is
scalable, high-throughput and fault-tolerant. Spark Streaming can ingest data from
multiple sources such as Kafka, Flume, Kinesis or TCP sockets; and process this data using
complex algorithms provided in the Spark API including algorithms provided in the Spark MLlib
and GraphX libraries. Processed data can be pushed to live dashboards, file systems and
databases.
Apache Spark Streaming component receives live data streams from input sources such as
Kafka, Flume, Kinesis etc. and divides them into batches. The Spark engine processes these
input batches and produces the final stream of results in batches.
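A minimal Spark Streaming sketch: read lines from a TCP socket, split them into words, and count each 5-second batch (the host, port and batch interval are placeholders).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo extends App {
  val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches

  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()   // could instead be pushed to a dashboard, file system or database

  ssc.start()
  ssc.awaitTermination()
}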
DStream:
DStreams, or discretized streams, are high-level abstractions provided in Spark Streaming that
represent a continuous stream of data. DStreams can either be created from input sources such
as Kafka, Flume or Kinesis, or by applying high-level operations on existing DStreams.
Apache GraphX:
Apache Spark GraphX is a component library provided in the Apache Spark ecosystem that
seamlessly works with both graphs as well as with collections.
GraphX implements a variety of graph algorithms and provides a flexible API to utilize the
algorithms.
This kind of NoSQL database fits best for problems that traverse a connected set of nodes whose
edges satisfy a given predicate, starting from a given node. A classic example is a social
networking site.
Apache Spark GraphX provides the following types of operators - Property operators, Structural
operators and Join operators.
Property Operators - Property operators modify the vertex or edge properties using a user-
defined map function and produce a new graph.
Structural Operators - Structural operators operate on the structure of an input graph and
produce a new graph.
Join Operators - Join operators add data to graphs and produce a new graph.
Join operators join data from external collections (RDDs) with graphs. Apache Spark GraphX
provides the following join operators.
joinVertices() - The joinVertices() operator joins the input RDD data with vertices and returns a
new graph. The vertex properties are obtained by applying the user defined map() function to the
result of the joined vertices. Vertices without a matching value in the RDD retain their original
value.
outerJoinVertices() - The outerJoinVertices() operator joins the input RDD data with vertices
and returns a new graph. The vertex properties are obtained by applying the user-defined map()
function to all vertices, including ones that are not present in the input RDD.
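A sketch of joinVertices() on a tiny made-up graph: follower counts from an external RDD are attached to matching vertices, and vertices without a match keep their original value.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object JoinVerticesDemo extends App {
  val spark = SparkSession.builder().appName("graphx-demo").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
  val graph    = Graph(vertices, edges)

  // External collection: follower counts for some of the vertices.
  val followers = sc.parallelize(Seq((1L, 10), (2L, 3)))

  // Vertices with a match become "name (n followers)"; vertex 3 keeps "carol".
  val joined = graph.joinVertices(followers) { (_, name, n) => s"$name ($n followers)" }
  joined.vertices.collect().foreach(println)

  spark.stop()
}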
Cons:
Because of the high degree of interconnectedness between nodes, graph databases are
generally not suitable for network partitioning.
Graph databases don't scale out well.
Apache Storm:
Storm UI is used in monitoring the topology. The Storm UI provides information about errors
happening in tasks and fine-grained stats on the throughput and latency performance of each
component of each running topology.
Benefits:
Real fast: It can process around a million messages of 100 bytes per second per node.
Fault Tolerant: It detects faults automatically and restarts the functional attributes.
Reliable: It guarantees that each unit of data will be executed at least once or exactly once.
Field grouping:
Field grouping in Storm uses a mod hash function to decide which task a tuple is sent to, so that
tuples with the same field value are always processed by the same task and in the correct order.
No cache is required for this, so there is no time-out or limit on known field values.
The stream is partitioned by the fields specified in the grouping. For example, if the stream is
grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task,
but tuples with different "user-id"s may go to different tasks.
Installation files
Apache is a Web (HTTP) server, not an application server. The base package does not include
any such functionality. The PHP project and the mod_perl project allow you to work with
databases from within the Apache environment.
srm.conf :- This is the default file for the ResourceConfig directive in httpd.conf. It is processed
after httpd.conf but before access.conf.
access.conf :- This is the default file for the AccessConfig directive in httpd.conf. It is processed
after httpd.conf and srm.conf.
1. Perform real-time anomaly detection on known patterns of activities and use learned
patterns from prior modeling and simulations.
2. Correlate transaction data with other streams (chat, email, etc.) in a cost-effective parallel
processing environment.
3. Reduce query time from hours to minutes on large volumes of data.
4. Build a single platform for operational applications and analytics that reduces total cost
of ownership (TCO).
Order routing : Order routing is the process by which an order goes from the end user to an
exchange. An order may go directly to the exchange from the customer, or it may go first to a
broker who then routes the order to the exchange.
Pricing : Pricing is the process whereby a business sets the price at which it will sell its products
and services, and may be part of the business's marketing plan.
Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of
data and allows passage of messages from one endpoint to another. Kafka is designed to allow a
single cluster to serve as the central data backbone for a large organization. It can be elastically
and transparently expanded without downtime. Data streams are partitioned and spread over a
cluster of machines to allow data streams larger than the capability of any single machine and to
allow clusters of coordinated consumers.
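A minimal producer sketch for Kafka's publish side (the broker address, topic name and message are placeholders): messages with the same key are routed to the same partition of the topic.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaProducerDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Publish one message keyed by user id to the (hypothetical) "orders" topic.
  producer.send(new ProducerRecord[String, String]("orders", "user-1", """{"item":"p-42"}"""))
  producer.close()
}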
Apache Storm: It is a real-time message processing system, and you can edit or manipulate data
in real time. Storm pulls the data from Kafka and applies the required manipulation. It makes
it easy to reliably process unbounded streams of data, doing for real-time processing what
Hadoop did for batch processing. Storm is simple, can be used with any programming language,
and is a lot of fun to use.