Data Mining Ch1 30006
Data mining is the process of discovering various models, summaries, and derived values from a given
collection of data.
Or
Data mining is the process of extracting information from various data sources in order to identify patterns
and anomalies. The technique is particularly useful for extracting knowledge from large databases and from the internet.
In practice, the two primary goals of data mining tend to be prediction and description.
• Predictive data mining produces a model of the system described by the given data set. It uses
historical data and statistical algorithms to build models that can predict future outcomes or trends.
• Descriptive data mining produces new, nontrivial information based on the available data set. It
focuses on exploring and summarizing the existing data in order to understand the patterns and
relationships within it (a small sketch of both goals follows this list).
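As a rough illustration of the two goals, the short Python sketch below (the tiny synthetic dataset and all variable names are invented for this example) fits a predictive classifier on historical outcomes and then produces a simple descriptive grouping of the same data:

```python
# Sketch: the two goals of data mining on one small synthetic dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))              # two measured attributes
labels = (X[:, 0] + X[:, 1] > 0).astype(int)     # historical outcome (0/1)

# Predictive: learn from historical data to forecast the outcome
# for a new, unseen case.
clf = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print("predicted outcome for a new case:", clf.predict([[0.4, -0.1]])[0])

# Descriptive: summarize structure that already exists in the data,
# here by grouping similar cases and reporting each group's average.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for g in range(2):
    print(f"group {g}: mean attributes = {X[groups == g].mean(axis=0).round(2)}")
```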
• Structure Identification: This step uses prior knowledge of the target system to define a class of
candidate models, typically represented by a parameterized function y = f(u, t). The form of this
function is chosen based on the designer's expertise, intuition, and the laws governing the system.
• Parameter Identification: Once the model structure is fixed, optimization techniques are applied to
find the parameter vector t* that best fits the model to the system's observed behavior, so that the
resulting model y* = f(u, t*) describes the system as accurately as possible (a minimal curve-fitting
sketch follows these two steps).
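A minimal sketch of parameter identification, assuming the structure-identification step has already fixed a simple exponential form for y = f(u, t); the model form, data, and parameter values here are invented purely for illustration:

```python
# Sketch: parameter identification once the model structure is chosen.
import numpy as np
from scipy.optimize import curve_fit

# Assumed structure from the structure-identification step:
# y = f(u, t) with parameter vector t = (a, b).
def f(u, a, b):
    return a * np.exp(-b * u)

u = np.linspace(0, 4, 50)
y = f(u, 2.5, 1.3) + np.random.default_rng(0).normal(0, 0.05, u.size)

# Optimization finds the parameter vector t* that best fits the data.
t_star, _ = curve_fit(f, u, y)
print("estimated parameters t* =", t_star)
```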
2. Data Collection: Data can be obtained through designed experiments or observational approaches.
Understanding the data generation process is crucial, and ensuring consistency in the sampling
distribution between training and testing datasets is important for accurate model estimation and
application.
3. Data Preprocessing: Involves outlier detection and removal, as well as scaling, encoding, and feature
selection. These steps aim to enhance the quality of the data and ensure that variables have
appropriate weights for analysis.
4. Model Estimation: Selecting and implementing suitable data mining techniques to build models.
This process often involves comparing multiple models to identify the most effective one for the
given problem.
5. Model Interpretation and Conclusion: Data mining models should be interpretable so that they
support decision-making. Balancing model accuracy with interpretability is important, especially given
the complexity of modern high-dimensional models (an end-to-end sketch of steps 2 through 5 follows
this list).
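As a rough end-to-end illustration of steps 2 through 5, the sketch below (all data and names are synthetic and chosen only for this example) splits a dataset consistently into training and test sets, applies simple outlier removal and scaling, compares two candidate models, and finally inspects the simpler model's coefficients for interpretation:

```python
# Sketch: steps 2-5 of the data mining process on a synthetic dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(300, 3))
y = 4 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 300)

# Step 2 (data collection): split so that training and test data come
# from the same sampling distribution.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 3 (preprocessing): crude outlier removal on the target, then scaling.
mask = np.abs(y_train - y_train.mean()) < 3 * y_train.std()
X_train, y_train = X_train[mask], y_train[mask]
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 4 (model estimation): compare two candidate models on the test set.
for name, model in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test_s))
    print(f"{name}: test MSE = {mse:.3f}")

# Step 5 (interpretation): the linear model's coefficients are easy to read.
linear = LinearRegression().fit(X_train_s, y_train)
print("coefficients:", linear.coef_)
```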
For example, in a study analyzing housing prices, independent variables might include factors such as
square footage, number of bedrooms, and neighborhood quality. The dependent variable would be the
price of the house. The goal of data mining would be to understand how changes in the independent
variables affect the dependent variable, allowing for predictions or insights into housing prices.
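A minimal sketch of this housing example, assuming a tiny invented dataset with the three independent variables mentioned above (square footage, number of bedrooms, and a neighborhood-quality score) and price as the dependent variable:

```python
# Sketch: how independent variables relate to the dependent variable (price).
import pandas as pd
from sklearn.linear_model import LinearRegression

houses = pd.DataFrame({
    "sqft":         [1200, 1500, 1800, 2200, 2600, 3000],
    "bedrooms":     [2,    3,    3,    4,    4,    5],
    "neighborhood": [6,    7,    5,    8,    9,    7],   # quality score 1-10
    "price":        [210_000, 265_000, 275_000, 360_000, 430_000, 445_000],
})

X = houses[["sqft", "bedrooms", "neighborhood"]]   # independent variables
y = houses["price"]                                # dependent variable

model = LinearRegression().fit(X, y)

# Each coefficient estimates how price changes when that variable
# increases by one unit, holding the others fixed.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:,.0f} per unit")

# Predict the price of a new (hypothetical) house.
new_house = pd.DataFrame({"sqft": [2000], "bedrooms": [3], "neighborhood": [8]})
print("predicted price:", model.predict(new_house)[0])
```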
Or
A data warehouse can be defined as a central repository of integrated, structured, and preprocessed data
that is optimized for analysis and exploration.
A data warehouse includes several categories of data, classified according to how the data sources depend
on time.
Data transformation
1. Simple Transformations: Basic changes made to individual data fields, like converting data
types or replacing encoded values with decoded ones.
2. Cleansing and Scrubbing: Ensuring consistent formatting and accuracy of data, such as properly
formatting addresses or validating values within a specified range.
3. Integration: Combining data from different sources into a unified structure in the data
warehouse. Challenges include identifying the same entities across multiple systems and
resolving conflicts or missing values.
4. Aggregation and Summarization: Condensing operational data into fewer instances in the
warehouse. Summarization adds values along a dimension (e.g., rolling daily sales up to monthly
sales), while aggregation combines different business elements into a common total, depending on
the domain (e.g., combining the sales of different products and services). A small pandas sketch of
these transformations follows this list.
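To make a few of these transformation types concrete, here is a small pandas sketch (the column names and sample records are invented for illustration) that applies a simple type conversion and decoding, a basic cleansing rule, and a daily-to-monthly summarization:

```python
# Sketch: a few of the transformation types applied with pandas.
import pandas as pd

sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-02-01", "2024-02-15"],
    "region_code": ["N", "S", "N", "S"],
    "amount": ["120.5", "80.0", "-5.0", "200.0"],   # stored as text in the source system
})

# 1. Simple transformation: convert types and decode coded values.
sales["date"] = pd.to_datetime(sales["date"])
sales["amount"] = sales["amount"].astype(float)
sales["region"] = sales["region_code"].map({"N": "North", "S": "South"})

# 2. Cleansing: keep only values inside a valid range.
sales = sales[sales["amount"] >= 0]

# 4. Summarization: roll daily sales up to monthly totals per region.
monthly = sales.groupby([sales["date"].dt.to_period("M"), "region"])["amount"].sum()
print(monthly)
```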