Introduction To NoSQL
An aggregate is a collection of data that we interact with as a unit. Aggregates form the
boundaries for ACID operations with the database. Key-value, document, and column-family
databases can all be seen as forms of aggregate-oriented database. Aggregates make it easier for
the database to manage data storage over clusters.
Aggregate-oriented databases work best when most data interaction is done with the same
aggregate; aggregate-ignorant databases are better when interactions use data organized in many
different formations. Aggregate-oriented databases make inter-aggregate relationships more
difficult to handle than intra-aggregate relationships. They often compute materialized views to
provide data organized differently from their primary aggregates. This is often done with map-
reduce computations.
The impedance mismatch is the difference between the relational model and in-memory data
structures. The relational data model organizes data into a structure of tables and rows, or more
properly, relations and tuples. In the relational model, a tuple is a set of name-value pairs and a
relation is a set of tuples. All operations in SQL consume and return relations, which leads to the
mathematically elegant relational algebra.
This foundation on relations provides a certain elegance and simplicity, but it also introduces
limitations. In particular, the values in a relational tuple have to be simple; they cannot contain
any structure, such as a nested record or a list. This limitation isn't true for in-memory data
structures, which can take on much richer structures than relations. As a result, if you want to use
a richer in-memory data structure, you have to translate it to a relational representation to store it
on disk. Hence the impedance mismatch: two different representations that require translation.
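As a minimal illustration (with invented names) of both the aggregate idea and the impedance mismatch, the Scala sketch below keeps an order and its line items as one nested value, the way an aggregate-oriented store would, and then shows the flattening into plain tuples that a relational schema requires.

// Hypothetical order aggregate: the nested structure is one unit of interaction.
case class LineItem(productId: String, quantity: Int, price: BigDecimal)
case class Order(orderId: String, customer: String, items: List[LineItem])

object ImpedanceMismatchDemo extends App {
  // In memory (or in an aggregate-oriented store) the whole order is one value.
  val order = Order("o-1", "alice", List(
    LineItem("p-42", 2, BigDecimal(9.99)),
    LineItem("p-77", 1, BigDecimal(3.50))
  ))

  // To store it relationally, the nested list must be translated into flat
  // tuples (rows) in a child table keyed by the order id.
  val orderRow     = (order.orderId, order.customer)
  val lineItemRows = order.items.map(i => (order.orderId, i.productId, i.quantity, i.price))

  println(orderRow)
  lineItemRows.foreach(println)
}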
Replication takes the same data and copies it over multiple nodes. Sharding puts different
data on different nodes.
Sharding is particularly valuable for performance because it can improve both read and
write performance. Using replication, particularly with caching, can greatly improve read
performance but does little for applications that have a lot of writes. Sharding provides a
way to scale horizontally.
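A minimal sketch of the idea (node names and the hash-based routing rule are assumptions for illustration): each key is assigned to one shard, so reads and writes for different keys land on different nodes, whereas replication would copy the same data to every node.

// Illustrative shard routing: hash the key and pick one of N nodes.
object ShardRouter extends App {
  val nodes = Vector("node-a", "node-b", "node-c")   // hypothetical node names

  def shardFor(key: String): String =
    nodes(java.lang.Math.floorMod(key.hashCode, nodes.size))

  // Different keys spread across different nodes (horizontal scaling).
  Seq("user-1", "user-2", "user-3").foreach { k =>
    println(s"$k -> ${shardFor(k)}")
  }
}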
Software-as-a-service systems in general do not provide an SQL-like store, so people have
become more interested in NoSQL-type stores. Much of the take-off can be related to this
history. Scaling took some new ideas at Google, and everyone else follows suit because
this is the only solution they know to the scaling problem right now. Hence, you are willing to
rework everything around Google's distributed database idea because it is the only way to
scale beyond a certain size.
Architecture
Big data applications are generally viewed from four perspectives: volume, velocity, variety and
veracity. NoSQL applications, in contrast, are driven by the inability of a current application to
scale efficiently. Though volume and velocity are important, NoSQL also focuses on variability
and agility.
NoSQL is often used to store big data. NoSQL stores provide simpler scalability and improved
performance relative to traditional RDBMSs. They help the big data movement in a big way by
storing unstructured data and providing a means to query it as per requirements. There are
different kinds of NoSQL data stores, which are useful for different kinds of applications. While
evaluating a particular NoSQL solution, one should look at requirements such as automatic
scalability, data loss, payment model, etc.
Features
When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational model is
not designed to address:
Key-value store: A simple data storage system that uses a key to access a value.
Examples - Redis, Riak, DynamoDB, Memcached
Column family store: A sparse matrix system that uses a row and a column as keys.
Examples - HBase, Cassandra, Bigtable
Graph store: For relationship-intensive problems. Examples - Neo4j, InfiniteGraph
Document store: Storing hierarchical data structures directly in the database. Examples -
MongoDB, CouchDB, MarkLogic
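As a purely illustrative sketch of the key-value model (the class and method names below are invented, not any vendor's client API), a key-value store exposes little more than put, get and delete against an opaque value:

import scala.collection.concurrent.TrieMap

// A toy in-memory key-value store: the value is opaque to the store itself.
class KeyValueStore[V] {
  private val data = TrieMap.empty[String, V]

  def put(key: String, value: V): Unit = data.update(key, value)
  def get(key: String): Option[V]      = data.get(key)
  def delete(key: String): Unit        = data.remove(key)
}

object KeyValueStoreDemo extends App {
  val store = new KeyValueStore[String]
  store.put("session:42", """{"user":"alice","cart":["p-1","p-2"]}""")
  println(store.get("session:42"))
}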
Scala
Scala is an object-functional programming and scripting language for general software
applications, designed to express solutions in a concise manner.
Scala source code is intended to be compiled to Java bytecode, so that the resulting executable
code runs on a Java virtual machine. Scala provides language interoperability with Java, so that
libraries written in either language may be referenced directly in Scala or Java code. Like Java,
Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming
language. Unlike Java, Scala has many features of functional programming languages like
Scheme, Standard ML and Haskell, including currying, immutability, lazy evaluation, and
pattern matching. It also has an advanced type system supporting algebraic data types,
covariance and contravariance, higher-order types (but not higher-rank types), and anonymous
types. Other features of Scala not present in Java include operator overloading, optional
parameters, named parameters, and raw strings. Conversely, a feature of Java not in Scala is
checked exceptions, which has proved controversial.
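A short sketch of some of these functional features (all names below are illustrative): currying, immutability with case classes, pattern matching over an algebraic data type, and lazy evaluation.

object FunctionalFeatures extends App {
  // Currying: a function applied one argument list at a time.
  def add(x: Int)(y: Int): Int = x + y
  val addFive: Int => Int = add(5)
  println(addFive(3))                       // 8

  // An algebraic data type with immutable case classes and pattern matching.
  sealed trait Shape
  case class Circle(r: Double)          extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
  println(area(Rect(2, 3)))                 // 6.0

  // Lazy evaluation: the value is computed only on first use.
  lazy val expensive = { println("computing..."); 42 }
  println(expensive)
}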
Advantages Of Scala
Arrays use regular generics, while in other languages generics are bolted on as an
afterthought and are completely separate from, but have overlapping behaviors with, arrays.
Scala has an immutable "val" as a first-class language feature. The "val" of Scala is similar
to a Java final variable: contents may mutate, but the top-level reference is immutable.
Scala lets "if" blocks, "for-yield" loops and code in braces return a value. This is
preferable, and eliminates the need for a separate ternary operator (see the sketch after this list).
Scala has singleton objects rather than the classic statics of C++/Java/C#. It is a cleaner
solution.
Persistent immutable collections are the default and built into the standard library.
It has native tuples and concise code.
It has no boilerplate code.
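The sketch below (illustrative names only) shows a few of these points in code: "if" and "for-yield" as value-returning expressions, a singleton object in place of statics, and a native tuple.

object ScalaAdvantagesDemo extends App {
  val n = 7

  // 'if' is an expression, so no separate ternary operator is needed.
  val parity = if (n % 2 == 0) "even" else "odd"

  // 'for ... yield' also returns a value (here, a new immutable collection).
  val squares = for (i <- 1 to 5) yield i * i

  // A native tuple, destructured into two vals.
  val (name, age) = ("alice", 30)

  println(s"$n is $parity, squares = $squares, $name is $age")
}

// A singleton object instead of a class full of static members.
object Config {
  val appName = "demo-app"
}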
Variables
Values and variables are the two forms of bindings in Scala. A value is constant and cannot
be changed once assigned; it is immutable. A regular variable, on the other hand, is
mutable, and you can change its value.
Scala Literals
Integer literals
Floating point literals
Boolean literals
Symbol literals
Character literals
String literals
Multi-Line strings
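The sketch below gives one example of each literal form listed above (the values are arbitrary).

object LiteralsDemo extends App {
  val i: Int        = 42                 // integer literal
  val hex: Int      = 0xFF               // integer literal in hexadecimal
  val d: Double     = 3.14               // floating point literal
  val flag: Boolean = true               // boolean literal
  val sym: Symbol   = Symbol("status")   // symbol literal
  val c: Char       = 'A'                // character literal
  val s: String     = "hello"            // string literal
  val multi: String =
    """line one
      |line two""".stripMargin           // multi-line string literal
  println(s"$i $hex $d $flag $sym $c $s\n$multi")
}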
Apache Spark
Apache Spark is an open-source distributed general-purpose cluster-computing framework.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance.
Spark Ecosystem
Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like
clustering, regression, classification, etc.
Spark Streaming - This library is used to process real-time streaming data.
Spark GraphX - Spark API for graph-parallel computations with basic operators like
joinVertices, subgraph, aggregateMessages, etc.
Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI
tools.
Transformation
Transformations are functions executed on demand to produce a new RDD. Transformations are
evaluated lazily and only run when an action follows them. Some examples of transformations
include map, filter and reduceByKey.
Action
Actions are the results of RDD computations or transformations. After an action is performed,
the data from the RDD moves back to the local machine. Some examples of actions include
reduce, collect, first, and take.
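A minimal RDD sketch under assumed local-mode settings: the map, filter and reduceByKey transformations build up a new RDD lazily, and nothing runs until the collect action is called (which also illustrates the lazy evaluation described next).

import org.apache.spark.sql.SparkSession

object RddDemo extends App {
  val spark = SparkSession.builder()
    .appName("rdd-demo")
    .master("local[*]")                  // local mode, for illustration only
    .getOrCreate()
  val sc = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "scala", "spark", "nosql"))

  val counts = words
    .filter(_.nonEmpty)                  // transformation
    .map(w => (w, 1))                    // transformation
    .reduceByKey(_ + _)                  // transformation

  // The action triggers the computation and brings results to the driver.
  counts.collect().foreach(println)

  spark.stop()
}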
Lazy Evaluation
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate
on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget,
but it does nothing unless asked for the final result.
Executor: The worker processes that run the individual tasks of a Spark job.
Cluster Manager: A pluggable component in Spark, to launch Executors and Drivers. The
cluster manager allows Spark to run on top of other external managers like Apache Mesos or
YARN.
Apache SSL
SSL (Secure Socket Layer) data transport requires encryption, and many governments have
restrictions upon the import, export, and use of encryption technology. If Apache included SSL
in the base package, its distribution would involve all sorts of legal and bureaucratic issues, and
it would no longer be freely available. Also, some of the technology required to talk to current
clients using SSL is patented by RSA Data Security, which restricts its use without a license.
Spark SQL:
Spark SQL is a library provided in Apache Spark for processing structured data. Spark SQL
provides various APIs that provide information about the structure of the data and the
computation being performed on that data. You can use SQL as well as Dataset APIs to interact
with Spark SQL.
DataFrame:
A DataFrame is a Dataset organized into named columns. A DataFrame is equivalent to a
Relational Database Table. DataFrames can be created from a variety of sources such as
structured data files, external databases, Hive tables and Resilient Distributed Datasets.
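A small sketch (column names, table name and values are invented) of creating a DataFrame from an in-memory collection and querying it both through the Dataset API and through SQL over a temporary view:

import org.apache.spark.sql.SparkSession

object SparkSqlDemo extends App {
  val spark = SparkSession.builder()
    .appName("spark-sql-demo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // A DataFrame from an in-memory collection; it could equally come from
  // structured data files, Hive tables or an external database.
  val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

  // Dataset/DataFrame API
  people.filter($"age" > 26).show()

  // SQL over the same data via a temporary view
  people.createOrReplaceTempView("people")
  spark.sql("SELECT name FROM people WHERE age > 26").show()

  spark.stop()
}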
Shark
Most data users know only SQL and are not good at programming. Shark is a tool,
developed for people who are from a database background, to access Scala MLlib capabilities
through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering
compatibility with the Hive metastore, queries and data.
MLlib
MLlib is a library provided in Apache Spark for machine learning. It provides tools for common
machine learning algorithms, featurization, pipelines, persistence and utilities for statistics, data
handling, etc. Apache Spark MLlib provides ML Pipelines, which are chains of algorithms
combined into a single workflow. An ML Pipeline consists of the following key components.
DataFrame - The Apache Spark ML API uses DataFrames provided in the Spark SQL library to
hold a variety of data types such as text, feature vectors, labels and predictions.
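A sketch of an ML Pipeline as a single workflow, chaining a Tokenizer and HashingTF feature stage with a LogisticRegression estimator (the toy data and column names are invented for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineDemo extends App {
  val spark = SparkSession.builder().appName("pipeline-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val training = Seq(
    ("spark is great", 1.0),
    ("slow and buggy", 0.0)
  ).toDF("text", "label")

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr        = new LogisticRegression().setMaxIter(10)

  // The pipeline itself is an estimator; fit() runs each stage in order.
  val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

  model.transform(Seq(("spark is fast", 0.0)).toDF("text", "label"))
       .select("text", "prediction").show()

  spark.stop()
}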
Different algorithms:
Extraction algorithms
Spark MLlib machine learning library provides the following feature extraction algorithms.
TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature extraction
algorithm that determines the importance of a term to a document.
Transformation algorithms
StopWordsRemover - Stop words remover takes a sequence of strings as input and removes all
stop words from the input. Stop words are words that occur frequently in a document but carry
little importance.
Discrete Cosine Transform - The discrete cosine transformation transforms a sequence in the
time domain to another sequence in the frequency domain.
Imputer - The Imputer transformer completes missing values in a dataset, either using the mean
or the median of the columns in which the missing values are located.
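Small sketches of two of the transformers above, StopWordsRemover and Imputer (the data and column names are made up for illustration):

import org.apache.spark.ml.feature.{Imputer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object TransformerDemo extends App {
  val spark = SparkSession.builder().appName("transformer-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Remove common stop words from a tokenized column.
  val docs = Seq((0, Seq("the", "quick", "brown", "fox"))).toDF("id", "words")
  new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    .transform(docs).show(false)

  // Fill missing values with the column mean (Double.NaN marks the gaps).
  val nums = Seq((1.0, 4.0), (Double.NaN, 6.0), (3.0, Double.NaN)).toDF("a", "b")
  new Imputer().setInputCols(Array("a", "b")).setOutputCols(Array("a_out", "b_out"))
    .fit(nums).transform(nums).show()

  spark.stop()
}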
Selection algorithms:
VectorSlicer - VectorSlicer is a selection algorithm that takes a feature vector as input and
outputs a new feature vector that is a subarray of the original features.
Locality Sensitive Hashing - LSH is a feature selection algorithm that hashes data points into
buckets, so that the data points which are close to each other are in the same buckets with high
probability, while data points that are far away from each other are very likely in different
buckets. Locality Sensitive Hashing is used in clustering, approximate nearest neighbor search
and outlier detection with large datasets.
Classification Algorithm:
Decision tree classifier - Decision trees are a popular family of classification and regression
methods.
Random forest classifier - Random forests are a popular family of classification and regression
methods.
Linear support vector machine - A support vector machine constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks.
Regression algorithms
Linear Regression - Linear regression models the label as a linear combination of the features
and is one of the most commonly used regression methods.
Decision Tree Regression - Decision trees are a popular family of classification and regression
methods.
Random Forest Regression - Random forests are a popular family of classification and
regression methods.
Gradient-boosted tree regression - Gradient-boosted trees (GBTs) are a popular regression
method using ensembles of decision trees
Survival Regression - Spark MLlib implements the Accelerated failure time (AFT) model
which is a parametric survival regression model for censored data.
Clustering algorithm
K-means - k-means is one of the most commonly used clustering algorithms that clusters the
data points into a predefined number of clusters.
Bisecting k-means - Bisecting k-means is a kind of hierarchical clustering using a divisive (or
"top-down") approach: all observations start in one cluster, and splits are performed recursively
as one moves down the hierarchy.
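A minimal k-means sketch using Spark MLlib's DataFrame-based API (the 2-D points below are toy values): the points are clustered into a predefined number of clusters, k = 2.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansDemo extends App {
  val spark = SparkSession.builder().appName("kmeans-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val points = Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
  ).map(Tuple1.apply).toDF("features")

  // Cluster into k = 2 clusters and print the learned centers.
  val model = new KMeans().setK(2).setSeed(1L).fit(points)
  model.clusterCenters.foreach(println)

  spark.stop()
}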
Filtering Algorithm:
Collaborative filtering is mostly used for recommender systems. Spark MLlib implements the
following collaborative filtering algorithms.
Explicit vs. implicit feedback - The standard approach to matrix factorization based
collaborative filtering treats the entries in the user-item matrix as explicit preferences given by
the user to the item, for example, users giving ratings to movies.
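Spark MLlib's matrix factorization implementation is ALS (alternating least squares). The sketch below fits ALS on a few invented explicit ratings; setImplicitPrefs(true) would switch it to the implicit-feedback variant.

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsDemo extends App {
  val spark = SparkSession.builder().appName("als-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  // Explicit feedback: users rating items directly.
  val ratings = Seq(
    (0, 10, 4.0), (0, 11, 1.0),
    (1, 10, 5.0), (1, 12, 2.0)
  ).toDF("userId", "itemId", "rating")

  val als = new ALS()
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .setRank(5).setMaxIter(5)

  val model = als.fit(ratings)
  model.recommendForAllUsers(2).show(false)

  spark.stop()
}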
Spark Streaming is a library provided in Apache Spark for processing live data streams; it is
scalable, high-throughput and fault-tolerant. Spark Streaming can ingest data from
multiple sources such as Kafka, Flume, Kinesis or TCP sockets; and process this data using
complex algorithms provided in the Spark API including algorithms provided in the Spark MLlib
and GraphX libraries. Processed data can be pushed to live dashboards, file systems and
databases.
Apache Spark Streaming component receives live data streams from input sources such as
Kafka, Flume, Kinesis etc. and divides them into batches. The Spark engine processes these
input batches and produces the final stream of results in batches.
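A minimal Spark Streaming sketch: read lines from a TCP socket, split them into words, and count each 5-second batch (the host, port and batch interval are placeholders).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo extends App {
  val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches

  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()   // could instead be pushed to a dashboard, file system or database

  ssc.start()
  ssc.awaitTermination()
}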
DStream:
DStreams, or discretized streams, are high-level abstractions provided in Spark Streaming that
represent a continuous stream of data. DStreams can either be created from input sources such
as Kafka, Flume or Kinesis, or by applying high-level operations on existing DStreams.
Apache GraphX:
Apache Spark GraphX is a component library provided in the Apache Spark ecosystem that
seamlessly works with both graphs as well as with collections.
GraphX implements a variety of graph algorithms and provides a flexible API to utilize the
algorithms.
This kind of NoSQL database fits best for problems that traverse a connected set of nodes whose
edges satisfy a given predicate, starting from a given node. A classic example is a social
networking site.
Apache Spark GraphX provides the following types of operators - Property operators, Structural
operators and Join operators.
Property Operators - Property operators modify the vertex or edge properties using a user-
defined map function and produce a new graph.
Structural Operators - Structural operators operate on the structure of an input graph and
produce a new graph.
Join Operators - Join operators add data to graphs and produce a new graph.
Join operators join data from external collections (RDDs) with graphs. Apache Spark GraphX
provides the following join operators.
joinVertices() - The joinVertices() operator joins the input RDD data with vertices and returns a
new graph. The vertex properties are obtained by applying the user defined map() function to the
result of the joined vertices. Vertices without a matching value in the RDD retain their original
value.
outerJoinVertices() - The outerJoinVertices() operator joins the input RDD data with vertices
and returns a new graph. The vertex properties are obtained by applying the user-defined map()
function to all vertices, including ones that are not present in the input RDD.
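A sketch of joinVertices() on a tiny made-up graph: follower counts from an external RDD are attached to matching vertices, and vertices without a match keep their original value.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object JoinVerticesDemo extends App {
  val spark = SparkSession.builder().appName("graphx-demo").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
  val graph    = Graph(vertices, edges)

  // External collection: follower counts for some of the vertices.
  val followers = sc.parallelize(Seq((1L, 10), (2L, 3)))

  // Vertices with a match become "name (n followers)"; vertex 3 keeps "carol".
  val joined = graph.joinVertices(followers) { (_, name, n) => s"$name ($n followers)" }
  joined.vertices.collect().foreach(println)

  spark.stop()
}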
Cons:
Because of the high degree of interconnectedness between nodes, graph databases are
generally not suitable for network partitioning.
Graph databases don't scale out well.
Apache Storm:
Storm UI is used in monitoring the topology. The Storm UI provides information about errors
happening in tasks and fine-grained stats on the throughput and latency performance of each
component of each running topology.
Benefits:
Real fast: It can process around a million messages of 100 bytes per second per node.
Fault Tolerant: It detects faults automatically and restarts the functional attributes.
Reliable: It guarantees that each unit of data will be executed at least once or exactly once.
Field grouping:
Field grouping in Storm uses a mod hash function to decide which task a tuple is sent to, so that
tuples with the same field value are always processed by the same task and in the correct order.
No cache is required for this, so there is no time-out or limit on known field values.
The stream is partitioned by the fields specified in the grouping. For example, if the stream is
grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task,
but tuples with different "user-id"s may go to different tasks.
Installation files
Apache is a Web (HTTP) server, not an application server. The base package does not include
any such functionality. The PHP project and the mod_perl project allow you to work with
databases from within the Apache environment.
srm.conf :- This is the default file for the ResourceConfig directive in httpd.conf. It is processed
after httpd.conf but before access.conf.
access.conf :- This is the default file for the AccessConfig directive in httpd.conf. It is processed
after httpd.conf and srm.conf.
1. Perform real-time anomaly detection on known patterns of activities and use learned
patterns from prior modeling and simulations.
2. Correlate transaction data with other streams (chat, email, etc.) in a cost-effective parallel
processing environment.
3. Reduce query time from hours to minutes on large volumes of data.
4. Build a single platform for operational applications and analytics that reduces total cost
of ownership (TCO).
Order routing : Order routing is the process by which an order goes from the end user to an
exchange. An order may go directly to the exchange from the customer, or it may go first to a
broker who then routes the order to the exchange.
Pricing : Pricing is the process whereby a business sets the price at which it will sell its products
and services, and may be part of the business's marketing plan.
Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of
data and allows passage of messages from one endpoint to another. Kafka is designed to allow a
single cluster to serve as the central data backbone for a large organization. It can be elastically
and transparently expanded without downtime. Data streams are partitioned and spread over a
cluster of machines to allow data streams larger than the capability of any single machine and to
allow clusters of coordinated consumers.
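A minimal producer sketch for Kafka's publish side (the broker address, topic name and message are placeholders): messages with the same key are routed to the same partition of the topic.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaProducerDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Publish one message keyed by user id to the (hypothetical) "orders" topic.
  producer.send(new ProducerRecord[String, String]("orders", "user-1", """{"item":"p-42"}"""))
  producer.close()
}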
Apache Storm: It is a real-time message processing system, and you can edit or manipulate data
in real time. Storm pulls the data from Kafka and applies the required manipulation. It makes
it easy to reliably process unbounded streams of data, doing for real-time processing what
Hadoop did for batch processing. Storm is simple, can be used with any programming language,
and is a lot of fun to use.