IT 6001 DA 2 Marks With Answers
UNIT-I
Marketing
Finance
Government
Healthcare
Insurance
Retail
Public cloud
Private cloud
4.What is reporting?
It is the process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.
5.What is analysis?
It is the process of exploring data and reports in order to extract meaningful insights
which can be used to better understand and improve business performance.
Data Analytics
In cloud computing, cheap commodity nodes fail, especially when there are many of them. If the
mean time between failures (MTBF) for 1 node is 3 years, the MTBF for 1000 nodes is about 1 day.
In addition, commodity networks have low bandwidth.
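The figures above follow from a simple approximation: if nodes fail independently and share the
same MTBF, the expected time between failures anywhere in the cluster is roughly the single-node
MTBF divided by the number of nodes. A quick check in Python (the function name is illustrative,
and independent failures and a 365-day year are assumptions):

```python
def cluster_mtbf_days(node_mtbf_years: float, num_nodes: int) -> float:
    """Approximate cluster-wide MTBF in days, assuming independent, identical nodes."""
    days_per_year = 365
    return node_mtbf_years * days_per_year / num_nodes


print(cluster_mtbf_days(3, 1))     # 1095.0 days for a single node
print(cluster_mtbf_days(3, 1000))  # 1.095 days, i.e. roughly one failure per day
```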
UNIT-II
o Data preparation
o Data mining and rule finding
o Result validation and interpretation
Rule induction is an area of machine learning in which formal rules are extracted from a
set of observations. The rules extracted may represent a full scientific model of the data, or
merely represent local patterns in the data.
o General to specific
o Specific to general
Single layer
Multi layer
Recurrent
Self-organized
Beyond data mining tasks such as prediction and classification, fuzzy models can give insight
into the underlying system and can be derived automatically from the system's dataset.
The technique used to achieve this is the grid-based rule set.
Fuzzy modeling can be interpreted as a qualitative modeling scheme in which system behavior
is described qualitatively using natural language. A fuzzy qualitative model is a generalized
fuzzy model consisting of linguistic explanations of system behavior in the framework of
fuzzy logic, instead of mathematical equations with numerical values or conventional logical
formulas with logical symbols.
A time series is a sequential set of data points, typically measured at successive times. It
is mathematically defined as a set of vectors x(t), t = 0, 1, 2, …, where t represents the time
elapsed. The variable x(t) is treated as a random variable.
UNIT-III
Data Stream Mining is the process of extracting useful knowledge from continuous,
rapid data streams. Many traditional data mining algorithms can be recast to work with larger
datasets, but they cannot address the problem of a continuous supply of data.
Sensor networks are a huge source of data occurring in streams. They are used in
numerous situations that require constant monitoring of several variables, based on which
important decisions are made. In many cases, alerts and alarms may be generated as a
response to the information received from a series of sensors.
One-Time queries are queries that are evaluated once over a point-in-time snapshot of
the data set, with the answer returned to the user.
E.g.: A stock price checker may alert the user when a stock price crosses a particular price point.
Biased reservoir sampling uses a bias function to regulate sampling from the stream. The
bias gives a higher probability of selecting data points from recent parts of the stream than
from the distant past.
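The idea can be sketched with one simple replacement policy, which is an assumption here and not
necessarily the exact scheme the syllabus intends: each new point always enters the reservoir,
and with probability equal to the fraction of the reservoir that is already full it overwrites a
randomly chosen old point. Old points are therefore steadily pushed out, and recent points
dominate the sample. The function name and the `capacity` parameter are illustrative.

```python
import random


def biased_reservoir_sample(stream, capacity, rng=None):
    """Keep a sample of at most `capacity` points, biased toward recent arrivals."""
    rng = rng or random.Random()
    reservoir = []
    for point in stream:
        fill_fraction = len(reservoir) / capacity
        if rng.random() < fill_fraction:
            # Overwrite a randomly chosen old point, so older points decay away.
            reservoir[rng.randrange(len(reservoir))] = point
        else:
            # Otherwise grow the reservoir (only possible while it is not full).
            reservoir.append(point)
    return reservoir


# Stream of 10,000 increasing integers: larger values are "more recent".
sample = biased_reservoir_sample(range(10_000), capacity=100, rng=random.Random(0))
print(len(sample), min(sample), sum(sample) / len(sample))
```

Because the stream values increase with time, a sample mean far above the midpoint of the stream
indicates that the reservoir is dominated by recent points.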
o Financial services
o Government
o E-Commerce sites
10.What is RTSA?
Real-Time Sentiment Analysis (also known as opinion mining) refers to the use of natural
language processing, text analysis, and computational linguistics to identify and extract
subjective information in source materials.
UNIT-IV
In association rule mining, the main purpose of discovering frequent itemsets from a large
dataset is to discover a set of if-then rules called association rules. The form of an
association rule is I→j, where I is a set of items (products) and j is a particular item.
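A small example may help make the rule form I→j concrete. The sketch below uses made-up basket
data and the standard support and confidence measures, which the text above does not define:
support counts the baskets containing an itemset, and the confidence of I→j is the fraction of
baskets containing I that also contain j.

```python
# Hypothetical market-basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(I, j, baskets):
    """Fraction of the baskets containing I that also contain item j."""
    return support(I | {j}, baskets) / support(I, baskets)


I = {"bread", "milk"}
print(support(I, baskets))               # 3 baskets contain both bread and milk
print(confidence(I, "butter", baskets))  # 2 of those 3 baskets also contain butter
```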
o Apriori Algorithm
o FP-Growth Algorithm
o SON algorithm
o PCY algorithm
FOR(each basket):
Toivonen’s algorithm begins by selecting a small sample of the input dataset and finding from
it the candidate frequent itemsets. It then makes only one full pass over the database, and so
produces exact association rules in a single full pass. The algorithm gives neither false
negatives nor false positives, but there is a small yet non-zero probability that it will fail
to produce any answer at all.
o Collaborative filtering
o Customer segmentation
o Data summarization
o Dynamic trend detection
o Multimedia data analysis
o Biological data analysis
o Social network analysis
o Single-link clustering
o Complete-link clustering
o Average-link clustering
o Centroid link clustering
8.Define CLIQUE
CLIQUE is a subspace clustering algorithm that automatically finds subspaces with
high-density clustering in high-dimensional attribute spaces. It is a simple grid-based method
for finding density-based clusters in subspaces.
The k-means family of algorithms is of the point-assignment type and assumes a Euclidean space.
It is assumed that there are exactly k clusters for some known k. After picking k initial cluster
centroids, the points are considered one at a time and assigned to the closest centroid.
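The point-assignment step described above can be sketched as follows. This is a minimal
illustration that assumes the k initial centroids have already been chosen (here by hand) and
that covers only the assignment of points, not any later re-computation of centroids. The
function name and sample data are hypothetical.

```python
import math


def assign_to_closest(points, centroids):
    """Return, for each point, the index of the nearest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # k = 2 initial centroids, chosen by hand
print(assign_to_closest(points, centroids))  # [0, 0, 1, 1, 0]
```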
UNIT-V
o Scalable
o Fault tolerance
o Economical
o Handle hardware failures.
2.What is hive?
Hive provides a warehouse structure for other Hadoop input sources and SQL-like
access to data in HDFS. Hive’s query language, HiveQL, compiles to MapReduce and also allows
user-defined functions (UDFs).
The key-value store uses a key to access a value and has a schema-less format. The key can be
artificially generated or auto-generated, while the value can be a string, JSON, BLOB, etc.
The key-value store uses a hash table with a unique key and a pointer to a particular item of
data.
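As a rough illustration of this model, the toy store below uses a Python dict, which is itself a
hash table, to map unique keys to arbitrary values. The class name, method names, and sample
values are hypothetical and do not correspond to any particular NoSQL product.

```python
import uuid


class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._table = {}

    def put(self, value, key=None):
        """Store `value`; auto-generate a key when the caller does not supply one."""
        if key is None:
            key = uuid.uuid4().hex
        self._table[key] = value
        return key

    def get(self, key):
        """Return the value stored under `key`, or None if the key is absent."""
        return self._table.get(key)


store = KeyValueStore()
store.put("Alice", key="user:1:name")                # caller-supplied key, string value
auto_key = store.put({"city": "Chennai", "age": 30})  # auto-generated key, JSON-like value
blob_key = store.put(b"\x00\x01\x02")                 # auto-generated key, binary value
print(store.get("user:1:name"))
```

No schema is declared anywhere: each value is stored as-is, and only the key is used for lookup.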
6.What is sharding?