
It 6001 Da 2 Marks With Answer PDF

This document discusses data analytics and related concepts. It covers: 1. Big data approaches and applications of big data analytics such as marketing, finance, healthcare, etc. 2. Types of data analysis including reporting, which organizes data into summaries, and analysis, which extracts insights from reports. 3. Machine learning techniques including linear regression, Bayesian inference, rule induction, and neural networks. 4. Data stream mining and architectures like Lambda architecture for processing streaming data. 5. Association rule mining to discover relationships between variables in large datasets.

Uploaded by

kumar3544

Data Analytics

IT 6006 DATA ANALYTICS 2 MARKS WITH ANSWER

UNIT-I

1.What is big data approach?


Many IT tools are available for big data projects. Organizations whose data workloads
are constant and predictable are better served by a traditional database, whereas organizations
challenged by increasing data demands will need to take advantage of Hadoop’s scalable
infrastructure.

2.List out the applications of big data analytics.

o Marketing
o Finance
o Government
o Healthcare
o Insurance
o Retail

3.List the types of cloud environment.

o Public cloud
o Private cloud

4.What is reporting?

It is the process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.

5.What is analysis?

It is the process of exploring data and reports in order to extract meaningful insights
which can be used to better understand and improve business performance.

6.List out the cross validation techniques.

o Simple cross validation
o Double cross validation
o Multicross validation

7.Write short note on MapReduce?


MapReduce provides a data-parallel programming model for clusters of commodity
machines. It was pioneered by Google, which processes 20 PB of data per day with it. MapReduce
was popularized by the Apache Hadoop project and is used by Yahoo, Facebook, Amazon and others.
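
The data-parallel model can be illustrated with a minimal single-machine sketch of the classic word-count job. This is not Hadoop's API; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical and only mirror the three conceptual stages.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster each document would be mapped on a different node.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

counts = word_count(["big data big clusters", "data parallel model"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could run on separate machines, which is what makes the model data parallel.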

8.What is cloud computing?

Cloud computing is internet-based computing. It relies on sharing computing resources
on demand rather than on local servers, PCs and other devices. It is a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of configurable computing
resources that can be rapidly provisioned and released with minimal management effort.

9.Describe the drawbacks of cloud computing?

In cloud computing, cheap nodes fail, especially when you have many of them. If the mean
time between failures (MTBF) for 1 node is 3 years, the MTBF for 1000 nodes is about 1 day.
In addition, a commodity network has low bandwidth.
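
The step from 3 years to about 1 day can be checked with a short calculation. The sketch below assumes that nodes fail independently and at a constant rate, so that failure rates add up across the cluster; the function name is hypothetical.

```python
NODE_MTBF_DAYS = 3 * 365  # 3 years per node, the figure quoted above

def cluster_mtbf_days(node_mtbf_days, node_count):
    """Expected time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    cluster failure rate is node_count times the single-node rate.
    """
    return node_mtbf_days / node_count

mtbf_1000 = cluster_mtbf_days(NODE_MTBF_DAYS, 1000)  # a little over one day
```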

10.List out the four major types of resampling.

o Randomized exact test
o Cross-validation
o Jackknife
o Bootstrap
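
Two of the resampling types listed above, the bootstrap and the jackknife, can be sketched for the sample mean using only the standard library. The data values and function names are made up for illustration.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Bootstrap: resample WITH replacement, same size as the original sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def jackknife_means(sample):
    """Jackknife: recompute the statistic leaving out one observation at a time."""
    return [statistics.mean(sample[:i] + sample[i + 1:]) for i in range(len(sample))]

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # illustrative values only
boot = bootstrap_means(data)
jack = jackknife_means(data)

# The spread of the bootstrap means estimates the standard error of the mean.
boot_se = statistics.stdev(boot)
```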


UNIT – II

1.What are the three stages of IDA process?

o Data preparation
o Data mining and rule finding
o Result validation and interpretation

2. What is linear regression?

Linear regression is an approach for modeling the relationship between a scalar
dependent variable y and one or more explanatory variables (or independent variables)
denoted X. The case of one explanatory variable is called simple linear regression.
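
As a minimal sketch of simple linear regression, the ordinary least-squares slope and intercept for one explanatory variable can be computed directly. Least squares is the usual fitting criterion but is an assumption here, since the answer above does not name one; the data are made up.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_simple_linear_regression(xs, ys)
```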

3.Explain Bayesian Inference.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used
to update the probability for a hypothesis as more evidence or information becomes available.
Bayesian inference is an important technique in statistics, and especially in mathematical
statistics.
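
The updating step can be sketched for a single yes/no hypothesis. The prior of 0.01 and the likelihoods 0.95 and 0.05 are made-up numbers chosen only to show how the belief moves as evidence accumulates.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' theorem."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

first = bayes_update(0.01, 0.95, 0.05)    # belief after one piece of evidence
second = bayes_update(first, 0.95, 0.05)  # yesterday's posterior is today's prior
```

Feeding the posterior back in as the next prior is what "updating as more evidence becomes available" means in practice.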

4.What is meant by rule induction?

Rule induction is an area of machine learning in which formal rules are extracted from a
set of observations. The rules extracted may represent a full scientific model of the data, or
merely represent local patterns in the data.

5.What are the two strategies in the Learn-One-Rule function?

o General to specific
o Specific to general

6.Write down the topologies of neural networks.

o Single layer
o Multi layer
o Recurrent
o Self-organized


7.What is meant by fuzzy logic?

Beyond data mining tasks such as prediction and classification, fuzzy models can
give insight into the underlying system and can be automatically derived from the system’s
dataset. The technique used to achieve this is a grid-based rule set.

8. Write short note on fuzzy qualitative modeling.

Fuzzy modeling can be interpreted as a qualitative modeling scheme by which the
system behavior is qualitatively described using a natural language. A fuzzy qualitative model is
a generalized fuzzy model consisting of linguistic explanations about system behavior in the
framework of fuzzy logic, instead of mathematical equations with numerical values or
conventional logical formulas with logical symbols.

9.What are the steps for Bayesian data analysis?

o Setting up the prior distribution
o Setting up the posterior distribution
o Evaluating the fit of the model

10.Write short notes on time series model.

A time series is a sequential set of data points, typically measured at successive times. It
is mathematically defined as a set of vectors x(t), t = 0, 1, 2, …, where t represents the time
elapsed. The variable x(t) is treated as a random variable.


UNIT - III

1.What is data stream model?

A data stream is a real-time, continuous and ordered sequence of items. It is not
possible to control the order in which the items arrive, nor is it feasible to locally store a stream
in its entirety in any memory device.

2.Define Data Stream Mining.

Data Stream Mining is the process of extracting useful knowledge from continuous,
rapid data streams. Many traditional data mining algorithms can be recast to work with larger
datasets, but they cannot address the problem of a continuous supply of data.

3.Write short note about sensor networks.

Sensor networks are a huge source of data occurring in streams. They are used in
numerous situations that require constant monitoring of several variables, based on which
important decisions are made. In many cases, alerts and alarms may be generated as a
response to the information received from a series of sensors.

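4.What is meant by one-time queries?

One-time queries are queries that are evaluated once over a point-in-time snapshot of
the data set, with the answer returned to the user.

Eg: Looking up the current price of a stock is a one-time query. By contrast, a stock price
checker that alerts the user when a stock price crosses a particular price point is a continuous
query, because it is evaluated repeatedly as new data arrives.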

5.Define biased reservoir sampling.

Biased reservoir sampling uses a bias function to regulate the sampling from the
stream. The bias gives a higher probability of selecting data points from recent parts of the
stream than from the distant past.
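
One simple way to obtain such a recency bias is sketched below. It is an assumption about the scheme, not necessarily the one these notes have in mind: every arriving item enters the reservoir, and with probability equal to the fraction of the reservoir already filled it overwrites a randomly chosen older item. The function name and parameters are hypothetical.

```python
import random

def biased_reservoir_sample(stream, capacity, seed=0):
    """Keep at most `capacity` items, favouring recent parts of the stream."""
    rng = random.Random(seed)
    reservoir = []
    for item in stream:
        fill_fraction = len(reservoir) / capacity
        if rng.random() < fill_fraction:
            # Overwrite a random older item, so old items decay over time.
            reservoir[rng.randrange(len(reservoir))] = item
        else:
            # Reservoir not yet full and the coin flip said "grow".
            reservoir.append(item)
    return reservoir

sample = biased_reservoir_sample(range(10000), capacity=100)
```

Once the reservoir is full, each new item replaces one of the `capacity` stored items at random, so the chance that an old item is still present shrinks geometrically with its age.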

6.What is Bloom Filter?


A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton
Howard Bloom in 1970, that is used to test whether an element is a member of a set. False
positive matches are possible but false negatives are not, thus a Bloom filter has a 100% recall
rate.
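
A minimal sketch of the structure follows. The bit-array size, the number of hash functions and the use of salted SHA-256 digests are illustrative assumptions; practical implementations usually pack the bits and use faster hashes.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("apple")
bf.add("banana")
```

An added item always finds all of its bits set, which is why false negatives cannot occur; an item that was never added may still find its bits set by other items, which is the source of false positives.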

7.List out the applications of RTAP.

o Financial services
o Government
o E-Commerce sites

8.Draw a High-Level architecture for RADAR.

9.What are the three layers of Lambda architecture.

o Batch Layer: for batch processing of all data.
o Speed Layer: for real-time processing of streaming data.
o Serving Layer: for responding to queries.

10.What is RTSA?

Real-Time Sentiment Analysis (also known as opinion mining) refers to the use of natural
language processing, text analysis and computational linguistics to identify and extract
subjective information in source materials.


Unit-IV

1.What is Association Rule Mining?

In association rule mining, the main purpose of discovering frequent itemsets in a
large dataset is to derive a set of if-then rules called association rules. An
association rule has the form I→j, where I is a set of items (products) and j is a particular item.
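
A rule I→j is usually judged by its support and confidence. These two measures are standard but are not defined in the answer above, so the sketch below is an addition under that assumption; the baskets are made up.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket)) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Confidence of the rule antecedent -> consequent.

    Fraction of the baskets containing the antecedent I that also contain j.
    """
    both = set(antecedent) | {consequent}
    return support(both, baskets) / support(antecedent, baskets)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
conf = confidence({"bread"}, "milk", baskets)  # rule {bread} -> milk
```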

2.List any two algorithms for Finding Frequent Itemset.

o Apriori Algorithm
o FP-Growth Algorithm
o SON algorithm
o PCY algorithm

3.What is meant by curse of dimensionality?

Points in high-dimensional Euclidean spaces, as well as points in non-Euclidean spaces,
often behave unintuitively. Two unexpected properties of these spaces are that random
points are almost always at about the same distance from each other, and random vectors are
almost always close to orthogonal.
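
The second property can be observed numerically. The sketch below assumes vectors with independent standard normal components, which is one convenient way to draw random directions, and compares the average absolute cosine between random pairs in 2 and in 1000 dimensions.

```python
import math
import random

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cos(angle)| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        total += abs(dot / norm)
    return total / pairs

low_dim = mean_abs_cosine(2)      # angles are spread widely in the plane
high_dim = mean_abs_cosine(1000)  # cosines concentrate near 0
```

A cosine near zero means the angle is near 90 degrees, so the high-dimensional pairs are close to orthogonal.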

4.Write an algorithm of Park-Chen-Yu.

FOR (each basket):
    FOR (each item in the basket):
        add 1 to item’s count;
    FOR (each pair of items):
        hash the pair to a bucket;
        add 1 to the count for that bucket;
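
The pseudocode above covers the first pass of the algorithm and can be made runnable as follows. The CRC32-based bucket hash and the function name are assumptions for illustration; any hash that spreads pairs across the buckets would do.

```python
import zlib
from collections import Counter
from itertools import combinations

def pcy_first_pass(baskets, num_buckets):
    """Count single items and hash every pair into a bucket counter."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))  # canonical order, so (a, b) == (b, a)
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket = zlib.crc32(repr(pair).encode("utf-8")) % num_buckets
            bucket_counts[bucket] += 1
    return item_counts, bucket_counts

baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"]]
item_counts, bucket_counts = pcy_first_pass(baskets, num_buckets=8)
```

A pair can only be frequent if the bucket it hashes to has a count at or above the support threshold, which is how the bucket counts prune candidate pairs before the second pass.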

5.Define Toivonen’s Algorithm

Toivonen’s algorithm begins by selecting a small sample of the input dataset and finding
from it the candidate frequent itemsets. It then makes only one full pass over the database to
check them, and thus produces exact association rules in one full pass. The algorithm will give
neither false negatives nor false positives, but there is a small yet non-zero probability that it
will fail to produce any answer at all.

6.List out some applications of clustering.

o Collaborative filtering
o Customer segmentation
o Data summarization
o Dynamic trend detection
o Multimedia data analysis
o Biological data analysis
o Social network analysis

7.What are the types of Hierarchical Clustering Methods.

o Single-link clustering
o Complete-link clustering
o Average-link clustering
o Centroid-link clustering

8.Define CLIQUE

CLIQUE is a subspace clustering algorithm that automatically finds subspaces with
high-density clustering in high-dimensional attribute spaces. CLIQUE is a simple grid-based
method for finding density-based clusters in subspaces. The procedure for this grid-based
clustering is relatively simple.

9.What is meant by k-means algorithm?


The k-means family of algorithms is of the point-assignment type and assumes a Euclidean
space. It is assumed that there are exactly k clusters for some known k. After picking k initial
cluster centroids, the points are considered one at a time and assigned to the closest centroid.
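
A minimal sketch of the idea follows. It uses the common batch variant, in which all points are assigned and the centroids are then recomputed, repeated a fixed number of times; this differs slightly from the one-point-at-a-time description above. The points and initial centroids are made up.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Batch k-means in Euclidean space; k is len(initial_centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
```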

10.Draw the diagram for Hierarchical Clustering.

UNIT-V

1.What are the main goals of Hadoop?

o Scalable
o Fault tolerant
o Economical
o Handles hardware failures

2.What is hive?

Hive provides a warehouse structure for other Hadoop input sources and SQL-like
access to data in HDFS. Hive’s query language, HiveQL, compiles to MapReduce and also allows
user-defined functions (UDFs).

3.What are the responsibilities of MapReduce Framework?

o Provides overall coordination of execution.
o Selects nodes for running mappers.
o Starts and monitors the mappers’ execution.
o Sorts and shuffles the output of mappers.
o Chooses locations for the reducers’ execution.
o Delivers the output of mappers to the reducer nodes.
o Starts and monitors the reducers’ execution.

4.What is a Key-Value store?

The key-value store uses a key to access a value. The key-value store has a schema-less
format. The key can be artificially generated or auto-generated while the value can be a string,
JSON, BLOB, etc. The key-value store uses a hash table with a unique key and a pointer to a
particular item of data.

5.What is visualization? What are the three major goals in visualization?

Visualization is the presentation or communication of data using interactive
interfaces. It has three major goals:

o Communicating/presenting the analysis results efficiently and effectively.
o Serving as a tool for confirmatory analysis, that is, to examine a hypothesis, analyze and
confirm it.
o Supporting exploratory data analysis, an interactive and mostly undirected search for
structures and trends.

6.What is sharding?

Sharding is the horizontal partitioning of a large database, that is, partitioning of its rows.
Each partition forms a shard, meaning a small part of the whole. Each shard can be located
on a separate database server or at any physical location.
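
How rows end up in shards can be sketched with hash-based routing on a key column. Hash-based routing is only one possible strategy (range-based is another) and is an assumption here, as are the function names and the sample rows.

```python
import zlib

def shard_for(row_key, num_shards):
    """Map a row key to a shard index with a stable hash."""
    return zlib.crc32(str(row_key).encode("utf-8")) % num_shards

def partition_rows(rows, key_field, num_shards):
    """Horizontally partition rows: each whole row lands in exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[key_field], num_shards)].append(row)
    return shards

rows = [{"id": i, "name": f"user{i}"} for i in range(6)]
shards = partition_rows(rows, key_field="id", num_shards=3)
```

Each list in `shards` stands in for one database server; a lookup by `id` only needs to contact the server that `shard_for` returns.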
