Unit 1, 2, 3
2. What are the various kinds of data on which Data mining can be applied? Explain.
Data mining can be applied to various kinds of data from different domains and sources. The
types of data on which data mining can be applied include:
1. **Relational Data:** This is structured data that is stored in relational databases,
consisting of tables with rows and columns. Examples include customer databases, sales
records, and inventory databases. Data mining techniques can be used to discover patterns
like customer preferences, buying behavior, and market trends.
2. **Transactional Data:** These are records of transactions made by customers, such as
credit card transactions, online purchases, and banking transactions. Data mining can help
detect fraudulent activities, identify patterns in purchasing behavior, and provide insights
into consumer preferences.
3. **Textual Data:** Text data includes documents, emails, social media posts, and other
forms of unstructured text. Natural language processing techniques are employed for tasks
such as sentiment analysis, topic modeling, and information extraction from textual data.
4. **Temporal Data:** Temporal data involves a time component, such as time-series data
(data collected at regular intervals) or event sequences. Data mining can help forecast
future trends, analyze patterns over time, and understand temporal dependencies.
5. **Spatial Data:** Spatial data includes information about geographic locations.
Geographic Information Systems (GIS) and spatial data mining techniques can be used to
analyze location-based patterns, such as urban planning, transportation optimization, and
environmental monitoring.
6. **Multimedia Data:** Multimedia data includes images, videos, audio recordings, and
other non-textual formats. Data mining techniques can be applied to analyze visual content,
recognize objects, perform content-based image retrieval, and more.
7. **Biological Data:** Data mining is used in bioinformatics to analyze biological data such
as DNA sequences, protein structures, and gene expression profiles. It helps in identifying
genetic patterns, understanding disease markers, and drug discovery.
8. **Sensor Data:** With the rise of the Internet of Things (IoT), large volumes of sensor-
generated data are available. Data mining can help extract insights from sensor data for
applications like predictive maintenance, environmental monitoring, and smart cities.
9. **Web Data:** This includes data collected from web pages, online user behavior, and
web logs. Web data mining helps in understanding user preferences, improving website
design, and targeted marketing.
10. **Social Media Data:** Data mining techniques are widely used to analyze social media
platforms. This helps in sentiment analysis, trend detection, user behavior prediction, and
understanding social interactions.
11. **Financial Data:** Financial data includes stock market data, investment records, and
economic indicators. Data mining can be used for market trend prediction, portfolio
optimization, and risk assessment.
12. **Healthcare Data:** Medical records, patient data, and clinical trial results can be
analyzed using data mining techniques to improve disease diagnosis, treatment
effectiveness, and healthcare management.
3. List the data mining functionalities and their objectives that support Business
Intelligence
Data mining functionalities play a crucial role in supporting Business Intelligence (BI) by
extracting valuable insights and patterns from large datasets. These functionalities help
organizations make informed decisions, identify trends, and gain a competitive edge. Here
are some key data mining functionalities that support BI, along with their objectives:
1. **Classification:**
Objective: To assign data instances to predefined classes using a model learned from labeled training data.
BI Support: Helps in segmenting customers, predicting outcomes (e.g., churn prediction), and making targeted marketing decisions.
2. **Regression / Prediction:**
Objective: To estimate continuous numeric values from historical data.
BI Support: Useful for forecasting sales, demand, and trends in various industries.
3. **Association Rule Mining:**
Objective: To discover relationships between items that frequently occur together.
BI Support: Useful in market basket analysis, cross-selling, and understanding buying patterns.
4. **Clustering:**
Objective: To group similar data instances together without predefined class labels.
BI Support: Helps in customer segmentation and market analysis.
5. **Outlier / Anomaly Detection:**
Objective: To identify data instances that deviate significantly from the rest of the data.
BI Support: Helps in fraud detection, network security monitoring, and outlier identification.
6. **Sequential Pattern Mining:**
Objective: To discover ordered sequences of events or actions over time.
BI Support: Valuable for analyzing customer behavior over time, such as browsing patterns and user journeys.
7. **Text / Sentiment Mining:**
Objective: To extract opinions, topics, and trends from unstructured text.
BI Support: Enables monitoring customer reviews, social media sentiment, and understanding customer feedback.
8. **Recommendation (Collaborative Filtering):**
Objective: To make automatic predictions about user preferences by collecting preferences from many users.
BI Support: Supports personalized product recommendations and cross-selling.
a) Characterization & Discrimination:
Characterization summarizes the general features of a target class of data, while discrimination
compares the target class against one or more contrasting classes. The discrimination primitive
can be expressed as:
DISCRIMINATE
FROM dataset
COMPARE subgroup1 TO subgroup2 USING attribute
b) Association & Classification:
Association mining discovers relationships between items, while classification assigns data
instances to predefined classes.
ASSOCIATE
FROM dataset
FIND itemsets
WITH SUPPORT > threshold
CLASSIFY
FROM dataset
USING attributes
INTO class
c) Prediction:
Prediction involves estimating numeric values based on existing data patterns.
PREDICT
FROM dataset
ESTIMATE target_attribute
USING predictors
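To connect the PREDICT primitive above to working code, here is a minimal Python sketch (assuming scikit-learn and pandas are installed; the attribute names ad_spend, price, and sales are made-up examples) that estimates a numeric target from predictor attributes:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data: predictors (advertising spend, price) and a numeric target (sales).
data = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "price":    [9.5, 9.0, 8.5, 8.0, 7.5],
    "sales":    [120, 150, 185, 210, 240],
})

# Learn the relationship between the predictors and the target from existing data patterns.
model = LinearRegression()
model.fit(data[["ad_spend", "price"]], data["sales"])

# ESTIMATE target_attribute USING predictors, for a new, unseen record.
new_record = pd.DataFrame({"ad_spend": [35], "price": [8.2]})
print(model.predict(new_record))
```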
1st answer
In data mining, patterns refer to regularities, relationships, or trends found in the data that
provide valuable insights and knowledge. There are several different kinds of patterns that can
be mined from datasets, each serving a specific purpose in extracting meaningful information.
Here are some of the main types of patterns that can be mined:
1. **Frequent Itemsets:**
These patterns identify sets of items that frequently appear together in transactions.
Association rule mining is commonly used to find frequent itemsets, which can help in tasks
like market basket analysis and cross-selling (a minimal code sketch appears after this list).
2. **Sequential Patterns:**
Sequential patterns capture temporal relationships between items or events in sequences.
They are often used in analyzing sequences of actions, such as customer behavior over time or
clickstream data.
3. **Spatial Patterns:**
Spatial patterns are related to geographic locations. They help identify spatial relationships
and trends, such as identifying clusters of disease outbreaks or analyzing the distribution of
retail stores in a city.
4. **Temporal Patterns:**
Temporal patterns involve time-related relationships in the data. Time series analysis is used
to discover trends, seasonality, and other patterns in data collected over time, such as stock
prices or temperature records.
5. **Graph Patterns:**
Graph patterns involve relationships between entities represented as nodes and edges in a
graph. Social network analysis often focuses on mining graph patterns to uncover connections
and influence within networks.
6. **Subgroup Patterns:**
Subgroup patterns identify subsets of data instances that exhibit specific characteristics.
Discrimination tasks involve finding patterns that differentiate one subgroup from another,
while characterization tasks reveal general attributes of a subgroup.
7. **Anomalous Patterns:**
Anomalous patterns, also known as outliers, represent data instances that deviate
significantly from the norm. Anomaly detection aims to identify these unusual patterns, which
can be crucial for fraud detection, fault diagnosis, and quality control.
8. **Classification Patterns:**
Classification patterns involve assigning data instances to predefined classes or categories
based on their attributes. These patterns are derived from training data and are used to predict
the class of new, unseen instances.
9. **Regression Patterns:**
Regression patterns predict a continuous numeric value based on input variables. They are
used to estimate relationships between variables and make predictions, such as predicting sales
or house prices.
10. **Clustering Patterns:**
Clustering patterns group similar data instances together based on their attributes. These
patterns help in segmenting data into meaningful clusters, which can be useful for customer
segmentation and market analysis.
11. **Dependency Patterns:**
Dependency patterns highlight relationships and dependencies between attributes in the data.
These patterns provide insights into how changes in one attribute affect others.
12. **Textual Patterns:**
Textual patterns involve relationships and structures within text data. Text mining
techniques are used to extract patterns such as sentiment trends, frequent phrases, and topic
distributions.
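To make the frequent itemset idea from item 1 concrete, here is a minimal, self-contained Python sketch (the transactions and the 0.5 support threshold are illustrative assumptions, and only itemsets of size 1 and 2 are counted to keep it short):

```python
from itertools import combinations
from collections import Counter

# Toy transactions; items and the support threshold are illustrative assumptions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.5  # an itemset must appear in at least 50% of transactions

counts = Counter()
for t in transactions:
    # Count every itemset of size 1 or 2 contained in the transaction.
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
for itemset, support in sorted(frequent.items(), key=lambda x: -x[1]):
    print(itemset, round(support, 2))
```

In practice, a library implementation of Apriori or FP-growth would be used instead of enumerating itemsets by hand.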
7. What are the technologies needed to effectively implement data mining? Explain in
brief.
To effectively implement data mining, a combination of technologies and tools is required.
These technologies provide the necessary infrastructure, algorithms, and platforms to
perform data mining tasks efficiently. Here are some key technologies needed for effective
data mining:
1. **Databases:**
Robust database management systems (DBMS) are essential for storing and managing the
large volumes of data used in data mining. Data warehouses and data marts provide
optimized structures for querying and processing data.
5. **Programming Languages:**
Programming languages like Python, R, and Julia are commonly used for implementing
data mining algorithms, creating custom analytics solutions, and automating processes.
A data warehouse and a database are both data storage systems, but they have different
purposes and are optimized for different types of workloads.
A database is a collection of data organized in a way that allows for efficient access and
retrieval. Databases are typically used to store operational data, such as customer orders,
product inventory, and financial transactions. Databases are optimized for fast read and write
operations, as they need to be able to respond to requests from users and applications in real
time.
A data warehouse is a centralized repository of historical data that is used for analysis and
reporting. Data warehouses are typically used to store data from multiple sources, such as
transactional databases, operational systems, and external data sources. Data warehouses are
optimized for complex queries and analysis, as they need to be able to handle large amounts of
data and provide insights that can help businesses make better decisions.
In a database, the data is typically normalized, which means that each piece of data is stored in
only one place. This helps to improve data integrity and reduce redundancy.
In a data warehouse, the data is typically denormalized, which means that duplicate data may
be stored in multiple places. This is done to improve performance for analytical queries.
Data mining is the process of extracting knowledge from large data sets. It uses techniques
from statistics, machine learning, and pattern recognition to identify patterns and trends in data.
Data mining can be used to solve a variety of problems, such as fraud detection, customer
segmentation, and product recommendations.
Data mining is not a simple transformation of data. It requires a deep understanding of the data,
the problem that is being solved, and the techniques that can be used to extract knowledge from
the data. The data mining process typically involves the following steps:
Data collection: The first step is to collect the data that will be used for data mining. This data
can come from a variety of sources, such as transactional databases, customer surveys, and
social media.
Data preparation: The data that is collected often needs to be prepared before it can be used for
data mining. This may involve cleaning the data, removing outliers, and transforming the data
into a format that can be used by the data mining algorithms.
Data mining: The next step is to use data mining algorithms to extract knowledge from the
data. There are many different data mining algorithms available, each with its own strengths
and weaknesses. The choice of algorithm will depend on the specific problem that is being
solved.
Data interpretation: The final step is to interpret the results of the data mining process. This
involves understanding the meaning of the patterns and trends that have been identified. The
results of the data mining process can be used to make decisions, improve products and
services, and prevent fraud.
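To make these steps concrete, here is a small, hypothetical Python sketch (assuming pandas and scikit-learn are available; the customer attributes are invented) that walks through preparation, mining, and interpretation on a toy dataset:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Data collection: in practice this would come from a database, a survey export, or a CSV file.
df = pd.DataFrame({
    "age":       [23, 25, 31, 35, 52, 46, 56, 25, 23, 60],
    "income":    [30, 32, 45, 50, 90, 82, 95, 31, 29, 99],   # in thousands
    "purchases": [2, 3, 5, 6, 12, 11, 14, 2, 3, 15],
})

# Data preparation: remove duplicates, fill missing values, and scale attributes to [0, 1].
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))
X = (df - df.min()) / (df.max() - df.min())

# Data mining: apply a clustering algorithm to segment the customers.
df["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Data interpretation: examine the average profile of each segment to understand what it means.
print(df.groupby("segment").mean())
```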
Here are some examples of how data mining is used:
Fraud detection: Data mining can be used to identify fraudulent transactions in financial data.
Customer segmentation: Data mining can be used to segment customers into groups based on
their purchase behavior. This information can be used to target marketing campaigns more
effectively.
Product recommendations: Data mining can be used to recommend products to customers
based on their past purchase history.
Risk assessment: Data mining can be used to assess the risk of a customer defaulting on a loan.
Medical diagnosis: Data mining can be used to diagnose diseases by analyzing medical records.
12. Describe the challenges to data mining regarding data mining methodology and user
interaction issues.
13. Illustrate how data mining system can be integrated with database/data warehouse
system.
Direct integration: The data mining system is directly integrated with the database/data
warehouse system. This means that the data mining system has direct access to the data in the
database/data warehouse. This is the most efficient way to integrate a data mining system, but
it can also be the most complex.
Indirect integration: The data mining system is indirectly integrated with the database/data
warehouse system. This means that the data mining system does not have direct access to the
data in the database/data warehouse. Instead, the data is extracted from the database/data
warehouse and stored in a separate data mart. The data mining system then accesses the data
mart. This is a less efficient way to integrate a data mining system, but it is also less complex.
Hybrid integration: The data mining system is integrated with the database/data warehouse
system in a hybrid way. This means that the data mining system has both direct and indirect
access to the data in the database/data warehouse. This is the most flexible way to integrate a
data mining system, but it can also be the most complex.
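As a toy illustration of the difference between direct and indirect integration (assuming pandas and Python's built-in sqlite3; the table and file names are hypothetical), the sketch below queries a stand-in warehouse directly and then works from a separately extracted data mart:

```python
import sqlite3
import pandas as pd

# Set up a tiny in-memory stand-in warehouse so the sketch is self-contained.
warehouse = sqlite3.connect(":memory:")
pd.DataFrame({"region": ["East", "West", "East"],
              "amount": [100.0, 80.0, 120.0]}).to_sql("fact_sales", warehouse, index=False)

# Direct integration: the mining code queries the warehouse itself.
direct_df = pd.read_sql_query("SELECT region, amount FROM fact_sales", warehouse)

# Indirect integration: data is first extracted into a separate data mart (here, a flat file),
# and the mining system then works only on that extract.
direct_df.to_csv("sales_data_mart.csv", index=False)
mart_df = pd.read_csv("sales_data_mart.csv")

# Either DataFrame can now be fed to a mining algorithm (clustering, classification, ...).
print(mart_df)
```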
14. Discuss about the different types of data that can be mined.
2nd answer
Data mining architecture is the way that the data mining process is organized and structured.
There are three main types of data mining architectures:
2."Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36,
8):
i)Compute the Euclidean distance between the two objects.
ii)Compute the Manhattan distance between the two objects.
iii)Compute the Minkowski distance between the two objects, using q
= 3."
In this case, the two objects are represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), so the
absolute differences on the four attributes are 2, 1, 6, and 2.
i) Euclidean distance = sqrt(2^2 + 1^2 + 6^2 + 2^2) = sqrt(4 + 1 + 36 + 4) = sqrt(45) ≈ 6.71.
ii) Manhattan distance = 2 + 1 + 6 + 2 = 11.
iii) Minkowski distance with q = 3 = (2^3 + 1^3 + 6^3 + 2^3)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15.
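These values can be verified with a short Python snippet (plain Python, no libraries needed):

```python
# Verifying the three distances for x = (22, 1, 42, 10) and y = (20, 0, 36, 8).
x = [22, 1, 42, 10]
y = [20, 0, 36, 8]
diffs = [abs(a - b) for a, b in zip(x, y)]          # [2, 1, 6, 2]

euclidean = sum(d ** 2 for d in diffs) ** 0.5       # sqrt(45)  ≈ 6.71
manhattan = sum(diffs)                              # 11
minkowski_q3 = sum(d ** 3 for d in diffs) ** (1/3)  # 233^(1/3) ≈ 6.15

print(round(euclidean, 2), manhattan, round(minkowski_q3, 2))
```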
4. How is the redundancy issue resolved using data integration? Explain using any two
metrics used to check correlation amongst attributes.
Redundancy is a situation where the same information is stored in multiple places in a data set.
This can lead to a number of problems, such as:
Increased data storage requirements.
Increased data processing time.
Reduced data quality.
Difficulty in identifying the correct value of a particular piece of information.
Data integration can help to resolve the redundancy issue by combining data from different
sources into a single, consistent data set. This can be done by using a variety of techniques,
such as:
Data cleansing: This involves removing duplicate records, correcting typos, and filling in
missing values.
Data standardization: This involves converting data into a common format.
Once the data has been integrated, it can be analyzed to identify redundant attributes. There are a number
of metrics that can be used to check correlation amongst attributes, such as:
Pearson correlation coefficient: This is a measure of the linear correlation between two attributes.
Spearman's rank correlation coefficient: This is a measure of the monotonic correlation between two
attributes.
Kendall's tau correlation coefficient: This is a measure of the rank correlation between two attributes
and does not assume that the attributes are normally distributed.
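As a quick, hypothetical illustration (assuming pandas and SciPy are available; the two attributes are invented and deliberately redundant), these metrics can be computed directly:

```python
import pandas as pd
from scipy import stats

# Two attributes that are largely redundant (price roughly determines total_cost).
df = pd.DataFrame({
    "price":      [10, 12, 15, 20, 22, 30],
    "total_cost": [11, 13, 15, 21, 23, 32],
})

pearson  = df["price"].corr(df["total_cost"], method="pearson")    # linear correlation
spearman = df["price"].corr(df["total_cost"], method="spearman")   # monotonic (rank) correlation
kendall, _ = stats.kendalltau(df["price"], df["total_cost"])       # rank correlation

# Values close to +1 or -1 suggest one attribute is redundant given the other.
print(round(pearson, 3), round(spearman, 3), round(kendall, 3))
```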
Proximity for ordinal attributes can be computed using the following steps:
Assign a numerical value to each ordinal level. This can be done arbitrarily, as long as the values are
consistent. For example, the levels of a ranking attribute might be assigned the values 1, 2, 3, and so on.
Calculate the distance between each pair of objects. The distance can be calculated using a variety of
metrics, such as the Euclidean distance, the Manhattan distance, or the Minkowski distance.
Normalize the distances. This can be done by dividing each distance by the maximum distance in the
data set. This ensures that the distances are on a comparable scale.
Choose a proximity measure. There are a variety of proximity measures that can be used for ordinal
variables. Some common measures include:
Manhattan distance: The Manhattan distance is the sum of the absolute differences between the
corresponding values of the two objects.
Euclidean distance: The Euclidean distance is the square root of the sum of the squared differences
between the corresponding values of the two objects.
Minkowski distance: The Minkowski distance is a generalization of the Manhattan and Euclidean
distances. It is calculated by taking the qth root of the sum of the powers of the absolute differences
between the corresponding values of the two objects.
The choice of proximity measure will depend on the specific data set and the problem that is being
solved.
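A minimal sketch of the steps above, assuming a single ordinal attribute with levels low < medium < high (the objects and the numeric mapping are hypothetical); one common variant normalizes the ranks to [0, 1] before computing distances:

```python
# Map ordinal levels to consistent numeric ranks, then compute normalized distances.
levels = {"low": 1, "medium": 2, "high": 3}

objects = ["low", "high", "medium", "high"]
ranks = [levels[v] for v in objects]

# Normalize ranks to [0, 1] so the attribute is on a comparable scale: (r - 1) / (M - 1).
M = len(levels)
z = [(r - 1) / (M - 1) for r in ranks]

# Manhattan distance (equivalent to Euclidean in one dimension) between each pair of objects.
for i in range(len(z)):
    for j in range(i + 1, len(z)):
        print(f"d(obj{i}, obj{j}) = {abs(z[i] - z[j]):.2f}")
```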
7. Explain binning. What are the various data preprocessing techniques
where binning is used?
Binning is a data preprocessing technique that divides a continuous variable into a discrete number of
bins. This can be done for a variety of reasons, such as:
To improve the accuracy of data analysis by reducing the impact of outliers.
To simplify the data for analysis by reducing the number of data points.
To make the data more understandable by humans.
Binning is used in several preprocessing techniques: data smoothing (smoothing by bin means, bin
medians, or bin boundaries), discretization of continuous attributes, and handling of noisy data.
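For example (a small sketch assuming pandas is installed; the ages are made up), equal-width and equal-frequency binning can be performed as follows, and the bin means can then be used to smooth noisy values:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: each bin spans the same range of values.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equal-depth) binning: each bin holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = ages.groupby(equal_freq, observed=True).transform("mean")

print(pd.DataFrame({"age": ages, "width_bin": equal_width, "smoothed": smoothed}))
```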
Data inspection: The first step is to inspect the data to identify any errors or inconsistencies. This can
be done by looking for data that is missing, out of range, or inconsistent.
Data cleaning: Once the errors and inconsistencies have been identified, they need to be cleaned. This
can be done by correcting the errors, removing the data, or replacing the data with missing values.
Data validation: Once the data has been cleaned, it needs to be validated to ensure that the errors have
been corrected. This can be done by running tests on the data to ensure that it is accurate and consistent.
The data cleaning process can be a time-consuming and challenging task, but it is essential to ensure
the quality of the data. Here are some of the common data cleaning tasks:
Missing value imputation: This involves filling in missing values in the data. This can be done by using
a variety of methods, such as mean imputation, median imputation, or multiple imputation.
Outlier detection and removal: This involves identifying and removing outliers from the data. Outliers
are data points that are significantly different from the rest of the data. They can be caused by errors in
the data collection or by the presence of unusual data points.
Data transformation: This involves transforming the data into a format that is more suitable for analysis.
This can involve converting categorical data into numeric data or normalizing the data.
Data integration: This involves combining data from different sources into a single data set. This can
be necessary if the data is coming from different systems or if the data is in different formats.
Data standardization: This involves normalizing the data so that the values are on a comparable scale.
This can help to improve the accuracy of the analysis.
14. State the methods applied to measure proximity on data with nominal
and binary attributes with relevant examples
15. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.
Missing values are a common occurrence in real-world data. There are a number of methods
for handling missing values, each with its own advantages and disadvantages.
Here are some of the most common methods for handling missing values:
Deletion: This method simply deletes the tuples with missing values. This is the simplest method,
but it can also be the most drastic. If a large number of tuples are deleted, the data set may become
too small or biased.
Imputation: This method replaces the missing values with estimates. There are a number of
imputation methods, such as mean imputation, median imputation, and multiple imputation. Mean
imputation replaces the missing values with the mean of the non-missing values for the same
attribute. Median imputation replaces the missing values with the median of the non-missing values
for the same attribute. Multiple imputation replaces the missing values with a set of estimates that
are generated using a statistical model.
Modeling: This method uses a statistical model to predict the missing values. This can be a more
accurate method than imputation, but it can also be more complex.
Ignoring: This method simply ignores the missing values. This can be a good option if the missing
values are not too frequent or if they are not important for the analysis.
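A small pandas sketch of the deletion and imputation options described above (the column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 32, 41, None, 29],
    "income": [40000, 52000, None, 61000, 45000, 48000],
})

# Deletion: drop tuples that contain any missing value.
dropped = df.dropna()

# Mean / median imputation: replace missing values with a per-attribute estimate.
mean_imputed   = df.fillna(df.mean(numeric_only=True))
median_imputed = df.fillna(df.median(numeric_only=True))

print(dropped, mean_imputed, median_imputed, sep="\n\n")
```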
PCA works by finding the directions (principal components) along which the data shows the
greatest variance; components that explain little variance can be dropped, which reduces the
dimensionality while preserving most of the information. Some real-world applications of PCA
are image processing, movie recommendation systems, and optimizing the power allocation in
various communication channels.
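A minimal scikit-learn sketch (the data is synthetic) showing how PCA projects four correlated attributes onto two principal components that retain most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples with 4 attributes, where attributes 3-4 mostly duplicate 1-2.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + 0.1 * rng.normal(size=(100, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # 100 x 2 instead of 100 x 4

# Most of the variance is captured by the first two components.
print(pca.explained_variance_ratio_)
```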
Forward Feature Selection
In this technique, by specifying the desired performance of the model and the maximum
tolerable error rate, we can determine the optimal number of features required by the
machine learning algorithm.
o We start with a single feature only, and progressively add one feature at a time.
o The model is trained on each candidate feature separately.
o The feature with the best performance is selected.
o The process is repeated, adding one feature at a time, until adding more features no longer
gives a significant increase in the performance of the model.
Random Forest
Random Forest is a popular and very useful feature selection technique in machine
learning. The algorithm provides a built-in measure of feature importance, so it does not
need to be programmed separately. In this technique, a large set of trees is generated
against the target variable, and the usage statistics of each attribute are used to find the
most relevant subset of features.
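A brief scikit-learn sketch of this idea, using the library's built-in iris dataset (the choice of 200 trees is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Fit a forest of trees against the target variable.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Rank features by their built-in importance scores; the strongest subset can then be kept.
ranked = sorted(zip(feature_names, forest.feature_importances_), key=lambda p: -p[1])
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```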
Unit-3
1. Define data warehouse. Draw the architecture of data
warehouse and explain the three tiers in detail.
A data warehouse is a collection of data that is gathered
from various sources within an organization and stored in
a way that is optimized for analysis. It is a single,
consistent, and integrated view of data from multiple
sources.
Bottom Tier (Data Source):
The bottom tier represents the source systems that generate
data. These sources can include operational databases,
external data feeds, spreadsheets, and more. The data from
different sources might be heterogeneous in nature, with
varying formats and structures.
Middle Tier (Data Warehouse Server):
The middle tier is the heart of the data warehouse architecture. It includes several components that work
together to transform, integrate, and store data for analysis:
ETL (Extract, Transform, Load): ETL processes extract data from source systems, transform it to fit
into the data warehouse schema, and then load it into the warehouse. Transformation includes data
cleansing, aggregation, normalization, and other operations to ensure data quality and consistency
(a brief ETL sketch follows the middle-tier components below).
Data Warehouse Database: This is where the integrated and transformed data is stored. It's optimized
for querying and reporting, using specialized structures like star or snowflake schemas to facilitate fast
analytical processing.
Metadata Repository: Metadata provides information about the data stored in the warehouse, including
its source, transformation rules, and relationships. The metadata repository maintains documentation
about the data and helps users understand and interpret it correctly.
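To make the ETL component concrete, here is a small, self-contained Python sketch (assuming pandas and sqlite3; the table, column, and file names are assumptions) that extracts orders from a stand-in operational source, transforms them to the warehouse grain, and loads them into a warehouse table:

```python
import sqlite3
import pandas as pd

# A tiny in-memory stand-in operational source so the sketch is self-contained.
source = sqlite3.connect(":memory:")
pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "region":     ["East", "East", "West", "West"],
    "amount":     [100.0, 150.0, None, 200.0],
    "order_date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-15"],
}).to_sql("orders", source, index=False)

# Extract: pull raw rows out of the operational system.
orders = pd.read_sql_query("SELECT * FROM orders", source)

# Transform: cleanse (drop rows with missing amounts) and aggregate to the warehouse grain.
orders = orders.dropna(subset=["amount"])
orders["month"] = pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)
fact_sales = orders.groupby(["region", "month"], as_index=False)["amount"].sum()

# Load: write the aggregated fact table into the warehouse database.
warehouse = sqlite3.connect("warehouse.db")
fact_sales.to_sql("fact_sales", warehouse, if_exists="replace", index=False)
```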
Top Tier (Client Interface):
The top tier is the user-facing layer of the data warehouse architecture. It includes tools and interfaces
that allow users to interact with the data and extract meaningful insights:
Query and Reporting Tools: These tools enable users to formulate queries, generate reports, and analyze
data according to their business requirements. SQL-based interfaces, OLAP tools, and visualization
tools are commonly used in this tier.
Data Mining and Analysis Tools: Users can apply advanced analytics, data mining, and machine
learning techniques to discover patterns, trends, and correlations in the data.
Dashboard and Visualization Tools: These tools create interactive dashboards, charts, and graphs to
present data in a visually appealing and easy-to-understand manner.
Business Intelligence Applications: Business intelligence (BI) applications provide a comprehensive
environment for querying, reporting, and analyzing data, supporting decision-making processes across
the organization.
2. Explain the following in OLAP a) Roll up & Drill Down operation b) Slice
& Dice operation c) Pivot operation
OLAP (Online Analytical Processing) is a category of database applications that enables users to analyze
multidimensional data from various perspectives. It is used in data mining and data warehousing to find
patterns and trends in data.
The following are three of the most common OLAP operations:
a. Roll up: Roll up is an operation that aggregates data from multiple levels of a hierarchy. For example,
you could roll up sales data from individual products to product categories or to the overall company.
b. Drill down: Drill down is the opposite of roll up. It is an operation that expands data to a lower level
of detail. For example, you could drill down on sales data for a product category to see the sales data
for individual products.
c. Slice and dice: Slice and dice is an operation that filters and rearranges data to create a new view of
the data. For example, you could slice and dice sales data by product category, region, or time period.
d. Pivot: Pivot is an operation that rotates the axes of a data table. For example, you could pivot sales
data so that the product categories are rows and the time periods are columns.
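These operations can be imitated on a small pandas DataFrame (a hypothetical sales table with year, product, and region dimensions and a sales measure) to see what each one returns:

```python
import pandas as pd

# A tiny sales cube with three dimensions (year, product, region) and one measure (sales).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023, 2024],
    "product": ["phone", "laptop", "phone", "laptop", "tablet", "phone"],
    "region":  ["East", "East", "West", "East", "West", "West"],
    "sales":   [100, 250, 120, 300, 80, 150],
})

# Roll up: aggregate from individual products up to totals per year.
rollup = sales.groupby("year")["sales"].sum()

# Drill down: expand a year back into per-product detail.
drilldown = sales.groupby(["year", "product"])["sales"].sum()

# Slice: fix one dimension (region == "East") to get a sub-cube.
slice_east = sales[sales["region"] == "East"]

# Dice: select on two or more dimensions at once.
dice = sales[(sales["region"] == "West") & (sales["product"] == "phone")]

# Pivot: rotate the axes so products become rows and years become columns.
pivot = sales.pivot_table(index="product", columns="year", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_east, dice, pivot, sep="\n\n")
```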
The diagram shows a multidimensional data model with three dimensions: time, product, and
customer. Each dimension has a set of values. For example, the time dimension has the
values 2022, 2023, and 2024. The product dimension has the values phone, laptop, and tablet.
The customer dimension has the values John Doe, Jane Doe, and Peter Smith.
The data in the multidimensional data model is stored in a cube. The cube is a three-dimensional
array: one axis represents the values of the time dimension, a second axis represents the values of
the product dimension, and the third axis represents the values of the customer dimension. Each
cell of the cube holds a measure, such as the sales amount, for one combination of time, product,
and customer.
The multidimensional data model can be used to store a variety of data types, such as
numbers, text, and dates. The data can be stored in a variety of formats, such as relational
databases, NoSQL databases, and Hadoop.
The multidimensional data model is a powerful tool for data analysis. It allows users to explore
data from multiple perspectives and to identify patterns and trends.
Here are some of the advantages of using a multidimensional data model:
It is a powerful tool for data analysis.
It allows users to explore data from multiple perspectives.
It can be used to identify patterns and trends.
It is flexible and can be used to store a variety of data types.
Here are some of the disadvantages of using a multidimensional data model:
It can be complex to design and implement.
It can be expensive to store and manage.
2. Hardware integration: Once the hardware and software have been selected, they need to be
put together by integrating the servers, the storage systems, and the user software tools.
3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and
views. This may involve using a modeling tool if the data warehouse is sophisticated.
5. Sources: The information for the data warehouse is likely to come from several data sources.
This step involves identifying and connecting the sources using gateways, ODBC drivers, or
other wrappers.
6. ETL: The data from the source systems will need to go through an ETL phase. Designing and
implementing the ETL phase may involve identifying suitable ETL tool vendors, purchasing and
implementing the tools, and customizing the tools to suit the needs of the enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon, they will need to be
tested, perhaps using a staging area. Once everything is working adequately, the ETL tools may
be used to populate the warehouse according to the schema and view definitions.
8. User applications: For the data warehouse to be helpful, there must be end-user applications.
This step involves designing and implementing the applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the
end-user applications tested, the warehouse system and the applications may be rolled out for the
user community to use.
9th answer