Classification & Prediction
What is classification?
The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze the data in order to predict
whether a customer with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes
or no for marketing data.
Real-Life Examples
There are many real-life examples and applications of classification in data
mining. Some of the most common examples of applications include -
Email spam classification - This involves classifying emails as spam or
non-spam based on their content and metadata.
Image classification - This involves classifying images into different
categories, such as animals, plants, buildings, and people.
Medical diagnosis - This involves classifying patients into different
categories based on their symptoms, medical history, and test results.
Credit risk analysis - This involves classifying loan applications into
different categories, such as low-risk, medium-risk, and high-risk, based on
the applicant's credit score, income, and other factors.
Sentiment analysis - This involves classifying text data, such as reviews or
social media posts, into positive, negative, or neutral categories based on
the language used.
Customer segmentation - This involves classifying customers into
different segments based on their demographic information, purchasing
behavior, and other factors.
Fraud detection - This involves classifying transactions as fraudulent or
non-fraudulent based on various features such as transaction amount,
location, and frequency.
What is prediction?
The following is an example of a case where the data analysis task is prediction:
Suppose the marketing manager needs to predict how much a given customer
will spend during a sale at the company. In this example we need to
predict a numeric value, so the data analysis task is an example of numeric
prediction. In this case, a model or predictor is constructed that predicts a
continuous-valued function or ordered value.
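As a rough illustration of numeric prediction, the sketch below fits a straight line to past spending and uses it to estimate a new customer's spend. The income and spending figures, and the names fit_line and predict, are hypothetical and are not taken from the notes. Least-squares regression is only one possible predictor.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return the continuous value the fitted line gives for x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: annual income (in thousands) and
# the amount each customer spent during a past sale.
incomes = [20, 40, 60, 80]
spent = [110, 210, 310, 410]

model = fit_line(incomes, spent)
estimate = predict(model, 50)  # estimated spend for an income of 50
print(round(estimate, 2))
```

The output is a number rather than a category, which is what separates prediction from classification.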
How Does Classification Work?
With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes
two steps −
Building the Classifier or Model
Using the Classifier for Classification
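The two steps can be sketched with a deliberately simple classifier: it learns the majority label for each combination of attribute values, then looks up new tuples. The loan data, the attribute names, and the functions build_classifier and classify are hypothetical. A real classifier would use a learning algorithm such as a decision tree.

```python
from collections import Counter, defaultdict


def build_classifier(training_tuples):
    """Step 1: learn the majority label for each attribute combination."""
    votes = defaultdict(Counter)
    overall = Counter()
    for attributes, label in training_tuples:
        votes[attributes][label] += 1
        overall[label] += 1
    rules = {attrs: c.most_common(1)[0][0] for attrs, c in votes.items()}
    default = overall.most_common(1)[0][0]  # used for unseen combinations
    return rules, default


def classify(classifier, attributes):
    """Step 2: assign a class label to a new tuple."""
    rules, default = classifier
    return rules.get(attributes, default)


# Hypothetical loan data: (income level, credit rating) -> class label
training = [
    (("low", "poor"), "risky"),
    (("low", "good"), "risky"),
    (("high", "poor"), "risky"),
    (("high", "good"), "safe"),
    (("medium", "good"), "safe"),
]

classifier = build_classifier(training)
seen = classify(classifier, ("high", "good"))      # combination seen in training
unseen = classify(classifier, ("medium", "poor"))  # falls back to majority label
print(seen, unseen)
```

The training tuples carry known labels, and the new tuples do not: the classifier supplies them.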
Before the classifier is built, the data is prepared through the following steps:
1. Data Cleaning: Data cleaning involves removing noise and treating
missing values. Noise is removed by applying smoothing techniques,
and missing values are handled by replacing each missing value
with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes.
Correlation analysis is used to determine whether any two given attributes are
related.
3. Data Transformation and Reduction: The data can be transformed by any of the
following methods.
o Normalization: Normalization involves scaling all values for a given
attribute so that they fall within a small specified range, such as
0.0 to 1.0. Normalization is used when neural networks or methods
involving distance measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to
higher-level concepts. For this purpose, we can use concept
hierarchies.
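The preparation steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions: the sample values, the city-to-country hierarchy, and all function names are hypothetical. Pearson correlation and min-max scaling are common choices for correlation analysis and normalization, not the only ones.

```python
from collections import Counter
from math import sqrt


def fill_missing_with_mode(values, missing=None):
    """Data cleaning: replace missing entries with the most common value."""
    known = [v for v in values if v is not missing]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def pearson(xs, ys):
    """Relevance analysis: Pearson correlation between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: scale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def generalize(values, hierarchy):
    """Generalization: map each value to its higher-level concept."""
    return [hierarchy[v] for v in values]


# Hypothetical sample data
ratings = ["good", None, "poor", "good"]
income = [20, 40, 60, 100]
double_income = [40, 80, 120, 200]  # perfectly correlated with income
cities = ["Chicago", "Vancouver", "Chicago"]
city_to_country = {"Chicago": "USA", "Vancouver": "Canada"}

filled = fill_missing_with_mode(ratings)
correlation = pearson(income, double_income)
normalized = min_max_normalize(income)
countries = generalize(cities, city_to_country)
print(filled, round(correlation, 3), normalized, countries)
```

A correlation close to 1 or -1 suggests that one of the two attributes is redundant and could be dropped.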
Comparison of Classification and Prediction Methods
o Accuracy: The accuracy of the classifier can be referred to as the ability of
the classifier to predict the class label correctly, and the accuracy of the
predictor can be referred to as how well a given predictor can estimate the
unknown value.
o Speed: The speed of the method depends on the computational cost of
generating and using the classifier or predictor.
o Robustness: Robustness is the ability of the classifier or predictor to
make correct predictions from noisy data or data with missing values.
o Scalability: Scalability refers to the ability to construct the classifier
or predictor efficiently given a large amount of data.
o Interpretability: Interpretability is how readily we can understand the
reasoning behind predictions or classification made by the predictor or
classifier.
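Of these criteria, accuracy is the easiest to put into numbers. The sketch below computes the fraction of correct labels for a classifier and the mean absolute error for a predictor. Mean absolute error is an assumed choice of error measure, and all of the labels and values are hypothetical.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Fraction of tuples whose class label was predicted correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def mean_absolute_error(true_values, predicted_values):
    """Average distance between predicted and actual numeric values."""
    total = sum(abs(t - p) for t, p in zip(true_values, predicted_values))
    return total / len(true_values)


# Hypothetical classifier output on four loan applicants
true_labels = ["safe", "risky", "safe", "safe"]
predicted_labels = ["safe", "risky", "risky", "safe"]

# Hypothetical predictor output on three customers' spending
true_spend = [100, 200, 300]
predicted_spend = [110, 190, 310]

accuracy = classification_accuracy(true_labels, predicted_labels)
error = mean_absolute_error(true_spend, predicted_spend)
print(accuracy, error)
```

A higher accuracy is better for a classifier, while a lower error is better for a predictor.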
CLASSIFICATION TECHNIQUES
1. DECISION TREE
o Decision tree mining is a data mining technique used to build
classification models. As the name suggests, it builds classification
models in the form of a tree-like structure. It is a supervised learning
technique.
o In supervised learning, the target result (class label) of each training
example is already known. Decision trees can be used for both categorical
and numerical data. Categorical data include attributes such as gender and
marital status, while numerical data include attributes such as age and
temperature.
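To make the tree-like structure concrete, the sketch below stores a small tree as nested dictionaries and walks it to classify a record. The tree is written by hand, not learned from data. Its attributes, threshold, and labels are hypothetical. It mixes a categorical test (marital status) with a numerical test (age).

```python
# Leaves hold a class label. Internal nodes test one attribute:
# categorical nodes branch on the value, numerical nodes on a threshold.
tree = {
    "attribute": "marital_status",
    "branches": {
        "married": {"label": "safe"},
        "single": {
            "attribute": "age",
            "threshold": 30,
            "le": {"label": "risky"},  # age <= 30
            "gt": {"label": "safe"},   # age > 30
        },
    },
}


def classify(node, record):
    """Follow the tests from the root down to a leaf and return its label."""
    while "label" not in node:
        value = record[node["attribute"]]
        if "threshold" in node:
            node = node["le"] if value <= node["threshold"] else node["gt"]
        else:
            node = node["branches"][value]
    return node["label"]


young_single = classify(tree, {"marital_status": "single", "age": 25})
older_single = classify(tree, {"marital_status": "single", "age": 45})
young_married = classify(tree, {"marital_status": "married", "age": 25})
print(young_single, older_single, young_married)
```

A decision tree algorithm would choose the attributes and thresholds automatically from labeled training data. Only the classification step is shown here.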
Solution:
P(A | B) = (P(B | A) * P(A)) / P(B)
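The formula can be checked with a short sketch. The factors 0.53, 0.69 and 0 are the ones multiplied in the Mango step of this solution. The prior and evidence values, and the function names, are hypothetical placeholders, since the notes do not give them here.

```python
from math import prod


def naive_likelihood(conditional_probs):
    """P(X | class) as a product of per-feature conditional probabilities."""
    return prod(conditional_probs)


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Factors from the Mango step: one zero factor makes the whole product zero.
likelihood_mango = naive_likelihood([0.53, 0.69, 0])

# Hypothetical prior P(Mango) and evidence P(X), for illustration only.
posterior_mango = posterior(likelihood_mango, prior=0.3, evidence=0.2)

# Hypothetical non-zero case showing the formula on its own.
example = posterior(likelihood=0.8, prior=0.5, evidence=0.5)
print(likelihood_mango, posterior_mango, example)
```

Because the likelihood is zero, the posterior for Mango is zero whatever prior is used.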
1. Mango:
P(Long | Mango) = 0    (3)
Multiplying equations (1), (2) and (3): P(X | Mango) = 0.53 * 0.69 * 0
P(X | Mango) = 0
2. Banana:
P(Yellow | Banana) = 1    (4)
3. Others: