
Introduction to NoSQL

Topics to be covered:

Aggregate-oriented databases

An aggregate is a collection of data that we interact with as a unit. Aggregates form the
boundaries for ACID operations with the database. Key-value, document, and column-family
databases can all be seen as forms of aggregate-oriented database. Aggregates make it easier for
the database to manage data storage over clusters.

Aggregate-oriented databases work best when most data interaction is done with the same
aggregate, while aggregate-ignorant databases are better when interactions use data organized in many
different formations. Aggregate-oriented databases make inter-aggregate relationships more
difficult to handle than intra-aggregate relationships. They often compute materialized views to
provide data organized differently from their primary aggregates; this is often done with
map-reduce computations.
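
As a minimal sketch of what an aggregate looks like (the case class names and values here are illustrative, not from the original text), an order together with its line items can be modeled as one unit in Scala:

case class LineItem(productId: String, quantity: Int, price: Double)
case class Order(orderId: String, customerId: String, items: List[LineItem])

// The whole Order, including its nested line items, is read and written as one unit.
// In an aggregate-oriented store this unit is also the natural boundary for a
// transaction and for distributing data across cluster nodes.
val order = Order("o-1001", "c-42",
  List(LineItem("p-7", 2, 9.99), LineItem("p-9", 1, 4.50)))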

Impedance Mismatch in Database Terminology

It is the difference between the relational model and in-memory data structures. The
relational data model organizes data into a structure of tables and rows, or more properly,
relations and tuples. In the relational model, a tuple is a set of name-value pairs and a relation is a
set of tuples. All operations in SQL consume and return relations, which leads to the
mathematically elegant relational algebra.

This foundation on relations provides a certain elegance and simplicity, but it also introduces
limitations. In particular, the values in a relational tuple have to be simple; they cannot contain
any structure, such as a nested record or a list. This limitation does not hold for in-memory data
structures, which can take on much richer structures than relations. As a result, if you want to use
a richer in-memory data structure, you have to translate it to a relational representation to store it
on disk. Hence the impedance mismatch: two different representations that require translation.
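
To make the mismatch concrete, here is a rough sketch (table and column names are invented for illustration) of how the nested order above would have to be flattened into relation-like tuples:

// In memory: one nested structure (the Order aggregate above).
// In a relational store: the nesting must be flattened into two tables,
// sketched here as sequences of simple name-value maps.
val ordersTable = Seq(
  Map("order_id" -> "o-1001", "customer_id" -> "c-42")
)
val lineItemsTable = Seq(
  Map("order_id" -> "o-1001", "product_id" -> "p-7", "quantity" -> "2"),
  Map("order_id" -> "o-1001", "product_id" -> "p-9", "quantity" -> "1")
)
// Rebuilding the in-memory object means joining these tables back together;
// this translation in both directions is the impedance mismatch.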

Replication and Sharding

 Replication takes the same data and copies it over multiple nodes, while sharding puts
different data on different nodes.
 Sharding is particularly valuable for performance because it can improve both read and
write performance. Replication, particularly with caching, can greatly improve read
performance but does little for applications that have a lot of writes. Sharding provides a
way to scale horizontally (see the sketch after this list).
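
A very simplified sketch of how sharding can route a key to a node is to hash the key and take the remainder modulo the number of shards. Real systems typically use consistent hashing or range-based partitioning instead, so treat this only as an illustration:

// Illustrative only: maps a key to one of numShards shards.
def shardFor(key: String, numShards: Int): Int =
  (key.hashCode & Int.MaxValue) % numShards

shardFor("user:42", 4)   // the same key always routes to the same shard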

NoSQL vs. Relational Database


Google needed a storage layer for their inverted search index and figured a traditional RDBMS was
not going to cut it, so they implemented a NoSQL data store, Bigtable, on top of their GFS file
system. The key point is that thousands of cheap commodity machines provide the
speed and the redundancy. Everyone else realized what Google had just done. Brewer's CAP theorem was
proven. All RDBMS systems in common use are CA systems, so people began experimenting with CP and AP
systems as well. Key-value stores are vastly simpler, so they became the primary vehicle for this research.

Software-as-a-service systems in general do not provide an SQL-like store, which made people
even more interested in NoSQL-style stores. I think much of the take-off can be related to this
history: scaling Google took some new ideas at Google, and everyone else follows suit because
this is the only solution they know to the scaling problem right now. Hence, you are willing to
rework everything around Google's distributed-database idea because it is the only known way to
scale beyond a certain size.

Architecture

Relation of NoSQL with Big Data

Big data applications are generally looked from 4 perspectives: Volume, Velocity, Variety and
Veracity. Whereas, NoSQL applications are driven by the inability of a current application to
efficiently scale. Though volume and velocity are important, NoSQL also focuses on variability
and agility.

NoSQL is often used to store big data. NoSQL stores provide simpler scalability and improved
performance relative to traditional RDMS. They help big data moment in a big way by storing
unstructured data and providing a means to query them as per requirements. There are different
kinds of NoSQL data stores, which are useful for different kind of applications. While evaluating
a particular NoSQL solution, one should looks for their requirements in terms of automatic
scalability, data loss, payment model etc.

Features

When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational model is
not designed to address:

 Large volumes of structured, semi-structured, and unstructured data


 Agile sprints, quick iteration, and frequent code pushes
 Object-oriented programming that is easy to use and flexible
 Efficient, scale-out architecture instead of expensive, monolithic architecture

Different Kinds Of NoSQL Data Stores


The many available NoSQL data stores can be broadly divided into four
categories:

 Key-value store: A simple data storage system that uses a key to access a value (see the
sketch after this list). Examples: Redis, Riak, DynamoDB, Memcached.
 Column family store: A sparse matrix system that uses a row and a column as keys.
Examples: HBase, Cassandra, Bigtable.
 Graph store: For relationship-intensive problems. Examples: Neo4j, InfiniteGraph.
 Document store: Stores hierarchical data structures directly in the database. Examples:
MongoDB, CouchDB, MarkLogic.
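
As a conceptual sketch of the key-value style (an in-memory toy, not the API of any particular product), a key-value store exposes little more than put and get on opaque values:

import scala.collection.mutable

class ToyKeyValueStore[V] {
  private val data = mutable.Map.empty[String, V]
  def put(key: String, value: V): Unit = data(key) = value
  def get(key: String): Option[V] = data.get(key)
}

val store = new ToyKeyValueStore[String]
store.put("session:123", """{"user":"alice","cart":3}""")
store.get("session:123")   // Some({"user":"alice","cart":3})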

Introduction to Scala and Spark

Scala

Scala is an object-functional programming and scripting language for general software
applications, designed to express solutions in a concise manner.

Scala source code is compiled to Java bytecode, so the resulting executable
code runs on a Java virtual machine. Scala provides language interoperability with Java, so that
libraries written in either language may be referenced directly in Scala or Java code. Like Java,
Scala is object-oriented and uses a curly-brace syntax reminiscent of the C programming
language. Unlike Java, Scala has many features of functional programming languages like
Scheme, Standard ML and Haskell, including currying, immutability, lazy evaluation, and
pattern matching. It also has an advanced type system supporting algebraic data types,
covariance and contravariance, higher-order types (but not higher-rank types), and anonymous
types. Other features of Scala not present in Java include operator overloading, optional
parameters, named parameters, and raw strings. Conversely, a feature of Java not in Scala is
checked exceptions, which has proved controversial.
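
The snippet below is a small sketch of several of the functional features mentioned above (currying, immutability, lazy evaluation and pattern matching over an algebraic data type); the names are illustrative:

// Currying: a method with multiple parameter lists can be partially applied.
def add(a: Int)(b: Int): Int = a + b
val addTen: Int => Int = add(10)

// Immutability: a val cannot be reassigned.
val greeting = "hello"

// Lazy evaluation: the right-hand side runs only on first use.
lazy val expensive = { println("computing"); 42 }

// Pattern matching over an algebraic data type.
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Rect(w: Double, h: Double) extends Shape

def area(s: Shape): Double = s match {
  case Circle(r)  => math.Pi * r * r
  case Rect(w, h) => w * h
}

area(Circle(1.0))   // about 3.14
addTen(5)           // 15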

Advantage Of Scala

 Less error-prone functional style


 High maintainability and productivity
 High scalability
 High testability
 Provides features of concurrent programming

Why Scala Is Better Than Other Programming Languages

 Arrays use regular generics, while in other languages generics are bolted on as an
afterthought and are completely separate from, yet have overlapping behaviors with, arrays.
 Scala has the immutable "val" as a first-class language feature (see the sketch after this
list). The "val" of Scala is similar to a Java final variable: contents may mutate, but the
top-level reference is immutable.
 Scala lets 'if' blocks, 'for-yield' loops, and code in braces return a value. This is more
expressive and eliminates the need for a separate ternary operator.
 Scala has singleton objects rather than the classic C++/Java/C# statics, which is a cleaner
solution.
 Persistent immutable collections are the default and are built into the standard library.
 It has native tuples and concise code.
 It has no boilerplate code.
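
A brief sketch of some of these points (names are illustrative):

// 'val': the reference is immutable, though the referenced object may itself be mutable.
val xs = scala.collection.mutable.Buffer(1, 2, 3)
xs += 4              // contents may change
// xs = Buffer()     // does not compile: the val itself cannot be reassigned

// 'if' is an expression that returns a value, so no separate ternary operator is needed.
val max = if (3 > 2) 3 else 2

// A singleton object instead of static members.
object MathUtil {
  def square(n: Int): Int = n * n
}

// Native tuples.
val pair: (String, Int) = ("answer", MathUtil.square(7))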

Variables

Scala has two kinds of variables: values and regular variables. A value is constant and cannot
be changed once assigned; it is immutable. A regular variable, on the other hand, is
mutable, and you can change its value.

The two types of variables are declared as:

var myVar: Int = 0

val myVal: Int = 1

Scala Literals

The different types of literals in Scala are listed below (a short example follows the list):

 Integer literals
 Floating point literals
 Boolean literals
 Symbol literals
 Character literals
 String literals
 Multi-Line strings
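
A small example showing one literal of each kind (the values are arbitrary):

val i    = 42               // integer literal
val hex  = 0xFF             // integer literal in hexadecimal
val d    = 3.14             // floating point literal
val flag = true             // Boolean literal
val sym  = 'label           // symbol literal (deprecated in newer Scala versions)
val ch   = 'A'              // character literal
val s    = "hello"          // string literal
val text =
  """line one
    |line two""".stripMargin // multi-line string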

Apache Spark
Apache Spark is an open-source distributed general-purpose cluster-computing framework.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.

Spark Ecosystem
Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like
clustering, regression, classification, etc.

Spark Streaming - This library is used to process real-time streaming data.

Spark GraphX - Spark API for graph-parallel computations with basic operators like
joinVertices, subgraph, aggregateMessages, etc.

Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI
tools.

RDDs (Resilient Distributed Datasets)


RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the
data coming into the system in object format. RDDs are used for in-memory computations on
large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records
that are:

Immutable - RDDs cannot be altered.

Resilient - If a node holding a partition fails, another node can take over the data.
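
A minimal sketch of creating RDDs, either from a local collection or from external storage (the application name and file path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc   = new SparkContext(conf)

// From an existing collection ...
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// ... or from external storage (placeholder path).
val lines = sc.textFile("hdfs:///data/input.txt")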

Transformation

Transformations are functions executed on demand to produce a new RDD. Transformations are
only evaluated when followed by an action. Some examples of transformations include map, filter and
reduceByKey.

Action
Actions are the results of RDD computations or transformations. After an action is performed,
the data from the RDD moves back to the driver on the local machine. Some examples of actions include reduce,
collect, first, and take.
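
A short sketch of transformations followed by actions, reusing the SparkContext sc from the earlier sketch (the data is illustrative):

val words     = sc.parallelize(Seq("spark", "scala", "spark", "nosql"))
val wordPairs = words.map(w => (w, 1))          // transformation
val counts    = wordPairs.reduceByKey(_ + _)    // transformation
val longWords = words.filter(_.length > 4)      // transformation

counts.collect()    // action: returns the word counts to the driver
longWords.first()   // action: returns the first matching element
words.take(2)       // action: returns the first two elements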

Lazy Evaluation

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate
on a given dataset, it heeds the instructions and makes a note of them, but it does nothing until
asked for the final result.

When a transformation like map() is called on an RDD, the operation is not performed
immediately. Transformations in Spark are not evaluated until you perform an action. This helps
optimize the overall data processing workflow.
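
A sketch of lazy evaluation in practice, again assuming the SparkContext sc from above (the log path is a placeholder):

// Nothing runs here: Spark only records the lineage of transformations.
val logs   = sc.textFile("hdfs:///logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

// Only when an action is called does Spark plan and execute the whole chain.
val errorCount = errors.count()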

Core Components Of A Distributed Spark Application


Driver: The process that runs the main() method of the program to create RDDs and perform
transformations and actions on them.

Executor: The worker processes that run the individual tasks of a Spark job.

Cluster Manager: A pluggable component in Spark, to launch Executors and Drivers. The
cluster manager allows Spark to run on top of other external managers like Apache Mesos or
YARN.
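
A minimal sketch of how the driver program selects the cluster manager through the master URL when it builds the SparkSession (the application name is illustrative; the master URLs in the comments are the standard forms):

import org.apache.spark.sql.SparkSession

// The driver builds the SparkSession / SparkContext. The master URL chooses the
// cluster manager that will launch the executors:
//   "local[*]"            - run everything inside the driver JVM (no cluster manager)
//   "yarn"                - Hadoop YARN
//   "mesos://host:5050"   - Apache Mesos
//   "spark://host:7077"   - Spark's standalone cluster manager
val spark = SparkSession.builder()
  .appName("driver-sketch")
  .master("local[*]")
  .getOrCreate()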

Apache SSL

SSL (Secure Socket Layer) data transport requires encryption, and many governments have
restrictions upon the import, export, and use of encryption technology. If Apache included SSL
in the base package, its distribution would involve all sorts of legal and bureaucratic issues, and
it would no longer be freely available. Also, some of the technology required to talk to current
clients using SSL is patented by RSA Data Security, which restricts its use without a license.
Spark SQL:
Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL
provides various APIs that give information about the structure of the data and the
computation being performed on that data. You can use SQL as well as the Dataset API to interact
with Spark SQL.

DataFrame:
A DataFrame is a Dataset organized into named columns and is conceptually equivalent to a
table in a relational database. DataFrames can be created from a variety of sources such as
structured data files, external databases, Hive tables and existing Resilient Distributed Datasets.
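
A short sketch of querying the same data through the DataFrame API and through SQL (the JSON path and column names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()

// A DataFrame created from a structured file (placeholder path).
val people = spark.read.json("hdfs:///data/people.json")

// Querying through the DataFrame API ...
people.select("name", "age").filter(people("age") > 30).show()

// ... or through SQL against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()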

Shark
Many data users know only SQL and are not comfortable with programming. Shark is a tool
developed for people from a database background to access Spark capabilities
through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering
compatibility with the Hive metastore, queries and data.

MLlib
MLlib is a library provided in Apache Spark for machine learning. It provides tools for common
machine learning algorithms, featurization, pipelines, persistence, and utilities for statistics and data
handling. Apache Spark MLlib provides ML Pipelines, which chain algorithms
together into a single workflow. An ML Pipeline consists of the following key components.

DataFrame - The Apache Spark ML API uses DataFrames provided in the Spark SQL library to
hold a variety of data types such as text, feature vectors, labels and predictions.

Transformer - A transformer is an algorithm that transforms one DataFrame into another
DataFrame.

Estimator - An estimator is an algorithm that can be fit on a DataFrame to produce a
Transformer.
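
A small sketch of these components wired together, in the spirit of the standard Spark ML pipeline example (the training rows are invented, and spark refers to a SparkSession as in the earlier sketches):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

// Two Transformers and an Estimator chained into one Pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the Pipeline (an Estimator) produces a PipelineModel (a Transformer).
val model = pipeline.fit(training)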

Different algorithms:

Extraction algorithms
Spark MLlib machine learning library provides the following feature extraction algorithms.
TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature extraction
algorithm that determines the importance of a term to a document.

Word2Vec - Word2Vec is an estimator algorithm which takes a sequence of words and
generates a Word2VecModel, which can be used as features for prediction, document similarity
and other similar calculations.

CountVectorizer - CountVectorizer is an extraction algorithm that converts a collection of
text documents to vectors of token counts that can be passed to learning algorithms.

Transformation algorithms

Tokenizer - Tokenizer breaks text into smaller terms, usually words.

StopWordsRemover - StopWordsRemover takes a sequence of strings as input and removes all
stop words from the input. Stop words are words that occur frequently in a document but carry
little meaning.

n-gram - An n-gram is a sequence of n tokens, usually words, where n is an integer.
NGram takes as input a sequence of strings and outputs a sequence of n-grams.

Binarizer - Binarizer is a transformation algorithm that transforms numerical features to binary
features based on a threshold value. Features greater than the threshold are set to 1 and
features equal to or less than the threshold are set to 0.

PolynomialExpansion - The PolynomialExpansion class provided in the Spark MLlib library
implements the polynomial expansion algorithm. Polynomial expansion is the process of
expanding features into a polynomial space based on an n-degree combination of the original
dimensions.

Discrete Cosine Transform - The discrete cosine transformation transforms a sequence in the
time domain to another sequence in the frequency domain.

StringIndexer - StringIndexer maps a column of string labels to a column of label indices.

IndexToString - IndexToString maps a column of label indices back to a column of the original
label strings.

OneHotEncoder - One-hot encoder maps a column of label indices to a column of binary
vectors.

VectorIndexer - VectorIndexer helps index categorical features in datasets of vectors.


Interaction - Interaction is a transformer which takes a vector or double-valued columns and
generates a single column that contains the product of all combinations of one value from each
input column.

Normalizer - Normalizer is a Transformer which transforms a dataset of Vector rows,
normalizing each Vector to have unit norm. This normalization can help standardize your input
data and improve the behavior of learning algorithms.

StandardScaler - StandardScaler transforms a dataset of Vector rows, normalizing each feature
to have unit standard deviation and/or zero mean.

MinMaxScaler - MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a
specific range (often [0, 1]).

MaxAbsScaler - MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to
the range [-1, 1] by dividing by the maximum absolute value in each feature. It does not
shift/center the data, and thus does not destroy any sparsity.

Bucketizer - Bucketizer transforms a column of continuous features to a column of feature
buckets, where the buckets are specified by users.

ElementwiseProduct - ElementwiseProduct multiplies each input vector by a provided "weight"
vector, using element-wise multiplication. In other words, it scales each column of the dataset by
a scalar multiplier. This represents the Hadamard product between the input vector v and the
transforming vector w, yielding a result vector.

SQLTransformer - SQLTransformer implements transformations that are defined by a SQL
statement.

VectorAssembler - VectorAssembler is a transformer that combines a given list of columns into
a single vector column.
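
A quick sketch of VectorAssembler combining two numeric columns into a single feature vector (the rows and column names are invented; spark is a SparkSession as before):

import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (0, 18.0, 1.0),
  (1, 32.0, 0.0)
)).toDF("id", "age", "clicks")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "clicks"))
  .setOutputCol("features")

assembler.transform(df).select("id", "features").show()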

QuantileDiscretizer - QuantileDiscretizer takes a column with continuous features and outputs a
column with binned categorical features.

Imputer - The Imputer transformer completes missing values in a dataset, either using the mean
or the median of the columns in which the missing values are located.

Selection algorithms:
VectorSlicer - VectorSlicer is a selection algorithm that takes a feature vector as input and
outputs a new feature vector that is a sub-array of the original features.

RFormula - RFormula selects columns specified by an R formula. RFormula produces a vector
column of features and a double or string column of labels.
ChiSqSelector - ChiSqSelector, which stands for Chi-Squared feature selection, operates on
labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence
to select features.

Locality Sensitive Hashing - LSH is a feature selection algorithm that hashes data points into
buckets, so that the data points which are close to each other are in the same buckets with high
probability, while data points that are far away from each other are very likely in different
buckets. Locality Sensitive Hashing is used in clustering, approximate nearest neighbor search
and outlier detection with large datasets.

Classification Algorithm:

Logistic Regression - Logistic regression is a classification algorithm that predicts categorical
responses. Spark MLlib supports binomial logistic regression to predict a binary outcome and
multinomial logistic regression to predict a multi-class outcome.
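
A brief sketch of fitting a binomial logistic regression model (the LIBSVM path is a placeholder; spark is a SparkSession as before):

import org.apache.spark.ml.classification.LogisticRegression

// Expects a DataFrame with "label" and "features" columns.
val data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(data)
println(s"Coefficients: ${lrModel.coefficients}  Intercept: ${lrModel.intercept}")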

Decision tree classifier - Decision trees are a popular family of classification and regression
methods.

Random forest classifier - Random forests are a popular family of classification and regression
methods.

Gradient-boosted tree classifier - Gradient-boosted trees (GBTs) are a popular classification
and regression method using ensembles of decision trees.

Multilayer perceptron classifier - The multilayer perceptron classifier (MLPC) is a classifier based
on the feedforward artificial neural network. An MLPC consists of multiple layers of nodes, where each
layer is fully connected to the next layer in the network. Nodes in the input layer represent the
input data. All other nodes map inputs to outputs by a linear combination of the inputs with the
node's weight w and bias b, followed by an activation function.

Linear support vector machine - A support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks.

Regression algorithms

Linear Regression - Linear regression models the label as a linear function of the feature vector;
Spark MLlib supports ordinary least squares as well as L1, L2 and elastic-net regularization.

Decision Tree Regression - Decision trees are a popular family of classification and regression
methods.

Random Forest Regression - Random forests are a popular family of classification and
regression methods.
Gradient-boosted tree regression - Gradient-boosted trees (GBTs) are a popular regression
method using ensembles of decision trees

Survival Regression - Spark MLlib implements the Accelerated failure time (AFT) model
which is a parametric survival regression model for censored data.

Isotonic Regression - Isotonic regression belongs to the family of regression algorithms.

Clustering algorithm

K-means - k-means is one of the most commonly used clustering algorithms that clusters the
data points into a predefined number of clusters.
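
A minimal k-means sketch (placeholder data path; spark is a SparkSession as before):

import org.apache.spark.ml.clustering.KMeans

// Expects a DataFrame with a "features" vector column.
val dataset = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

val kmeans = new KMeans().setK(2).setSeed(1L)
val model  = kmeans.fit(dataset)

model.clusterCenters.foreach(println)   // the learned cluster centres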

Latent Dirichlet allocation - LDA is implemented as an Estimator that supports both
EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model.
Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel
if needed.

Bisecting k-means - Bisecting k-means is a kind of hierarchical clustering using a divisive (or
"top-down") approach: all observations start in one cluster, and splits are performed recursively
as one moves down the hierarchy.

Gaussian Mixture Model (GMM) - A Gaussian Mixture Model represents a composite
distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its
own probability.

Filtering Algorithm:
Collaborative filtering is mostly used for recommender systems. Spark MLlib implements the
following collaborative filtering algorithms.

Explicit vs. implicit feedback - The standard approach to matrix factorization based
collaborative filtering treats the entries in the user-item matrix as explicit preferences given by
the user to the item, for example, users giving ratings to movies.

Scaling of the regularization parameter - The regularization parameter regParam in each least
squares problem is scaled by the number of ratings the user generated when updating user
factors, or the number of ratings the product received when updating product factors.

Cold-start strategy - When making predictions using an ALSModel, it is common to encounter
users and/or items in the test dataset that were not present when training the model. This
typically occurs in two scenarios.
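
A short sketch of ALS-based collaborative filtering with the cold-start strategy set to drop unseen users and items (the ratings are invented; spark is a SparkSession as before):

import org.apache.spark.ml.recommendation.ALS

// Toy explicit-feedback ratings: (userId, itemId, rating).
val ratings = spark.createDataFrame(Seq(
  (0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 3.0f)
)).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")   // drop NaN predictions for unseen users/items

val model       = als.fit(ratings)
val predictions = model.transform(ratings)
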
Spark Streaming:

Spark Streaming is a library provided in Apache Spark for processing live data streams that is
scalable, has high-throughput and is fault-tolerant. Spark Streaming can ingest data from
multiple sources such as Kafka, Flume, Kinesis or TCP sockets; and process this data using
complex algorithms provided in the Spark API including algorithms provided in the Spark MLlib
and GraphX libraries. Processed data can be pushed to live dashboards, file systems and
databases.

The Spark Streaming component receives live data streams from input sources such as
Kafka, Flume and Kinesis and divides them into batches. The Spark engine processes these
input batches and produces the final stream of results in batches.
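
A compact sketch of the classic streaming word count over 10-second batches (host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batches

// DStream from a TCP socket (placeholder host and port).
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until stopped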

DStream:
DStreams, or discretized streams, are the high-level abstraction provided in Spark Streaming that
represents a continuous stream of data. DStreams can be created either from input sources such
as Kafka, Flume or Kinesis, or by applying high-level operations on existing DStreams.

Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream
contains data from a certain interval.

Apache GraphX:

Apache Spark GraphX is a component library provided in the Apache Spark ecosystem that
seamlessly works with both graphs as well as with collections.

GraphX implements a variety of graph algorithms and provides a flexible API to utilize the
algorithms.
This kind of database fits best where, starting from a given node, we need the connected set of
all nodes and edges that satisfy a given predicate. A classic example is a social
networking site.

Apache Spark GraphX provides the following types of operators - Property operators, Structural
operators and Join operators.

Property Operators - Property operators modify the vertex or edge properties using a user-defined
map function and produce a new graph.

Structural Operators - Structural operators operate on the structure of an input graph and
produce a new graph.

Join Operators - Join operators add data to graphs and produce a new graph.

Join operators join data from external collections (RDDs) with graphs. Apache Spark GraphX
provides the following join operators.

joinVertices() - The joinVertices() operator joins the input RDD data with vertices and returns a
new graph. The vertex properties are obtained by applying the user defined map() function to the
result of the joined vertices. Vertices without a matching value in the RDD retain their original
value.

outerJoinVertices() - The outerJoinVertices() operator joins the input RDD data with vertices
and returns a new graph. The vertex properties are obtained by applying the user-defined map()
function to all vertices, including those that are not present in the input RDD.
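
A small sketch of building a property graph and using joinVertices to attach data from an external RDD (the vertices, edges and counts are invented; sc is a SparkContext as in the earlier sketches):

import org.apache.spark.graphx.{Edge, Graph}

// A tiny property graph: vertices carry user names, edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)

// joinVertices: merge follower counts from an external RDD into matching vertices.
val followerCounts = sc.parallelize(Seq((2L, 1), (3L, 1)))
val annotated = graph.joinVertices(followerCounts) {
  (id, name, count) => s"$name ($count followers)"
}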

Pros and Cons of Graph database

Pros:

 Graph databases seem to be tailor-made for networking applications. The prototypical
example is a social network, where nodes represent users who have various kinds of
relationships to each other. Modeling this kind of data using any of the other styles is
often a tough fit, but a graph database accepts it with relish.
 They are also perfect matches for an object-oriented system.

Cons

 Because of the high degree of interconnectedness between nodes, graph databases are
generally not suitable for network partitioning.
 Graph databases don't scale out well.
Apache Storm:

Storm UI is used in monitoring the topology. The Storm UI provides information about errors
happening in tasks and fine-grained stats on the throughput and latency performance of each
component of each running topology.

Benefits:

Easy to operate: Operating Storm is quite easy.

Real fast: Storm's published benchmarks report over a million tuples processed per second per node.

Fault tolerant: It detects faults automatically and restarts the failed workers.

Reliable: It guarantees that each unit of data will be processed at least once or exactly once.

Scalable: It runs across a cluster of machines.

Field grouping:

Field grouping in Storm uses a mod hash function to decide which task a tuple is sent to,
ensuring that all tuples with the same field value are processed by the same task. No cache is
required for this, so there is no time-out or limit on the number of known field values.

The stream is partitioned by the fields specified in the grouping. For example, if the stream is
grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task,
but tuples with different "user-id"s may go to different tasks.

Installation files

Apache is a Web (HTTP) server, not an application server. The base package does not include
any such functionality. The PHP project and the mod_perl project allow you to work with databases
from within the Apache environment.

srm.conf :- This is the default file for the ResourceConfig directive in httpd.conf. It is processed
after httpd.conf but before access.conf.

access.conf :- This is the default file for the AccessConfig directive in httpd.conf. It is processed
after httpd.conf and srm.conf.

httpd.conf :-The httpd.conf file is well-commented and mostly self-explanatory.

Storm in Financial Services

In financial services, Storm can be helpful in preventing:

Securities fraud:

1. Perform real-time anomaly detection on known patterns of activities and use learned
patterns from prior modeling and simulations.
2. Correlate transaction data with other streams (chat, email, etc.) in a cost-effective parallel
processing environment.
3. Reduce query time from hours to minutes on large volumes of data.
4. Build a single platform for operational applications and analytics that reduces total cost
of ownership (TCO)

Order routing: Order routing is the process by which an order goes from the end user to an
exchange. An order may go directly to the exchange from the customer, or it may go first to a
broker who then routes the order to the exchange.

Pricing: Pricing is the process whereby a business sets the price at which it will sell its products
and services, and may be part of the business's marketing plan.

Compliance violations: Compliance means conforming to a rule, such as a specification, policy,
standard or law. Regulatory compliance describes the goal that organizations aspire to achieve in
their efforts to ensure that they are aware of and take steps to comply with relevant laws and
regulations. Any failure to conform to such rules is a compliance violation.

Apache Storm vs. Apache Kafka

Apache Kafka: Kafka is a distributed and robust messaging system that can handle huge amounts of
data and allows the passage of messages from one end-point to another. Kafka is designed to allow a
single cluster to serve as the central data backbone for a large organization. It can be elastically
and transparently expanded without downtime. Data streams are partitioned and spread over a
cluster of machines to allow data streams larger than the capability of any single machine and to
allow clusters of coordinated consumers.

Apache Storm: Storm is a real-time message processing system with which you can edit or manipulate data
in real time. Storm can pull data from Kafka and apply the required manipulation. It makes
it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop
did for batch processing. Storm is simple, can be used with any programming language, and is a
lot of fun to use.
