TM 4 - Data Mining and Machine Learning
[Figure: data mining process diagram. Task labels, which correspond to the CRISP-DM generic tasks: Assess Situation, Determine Data Mining Goals, Produce Project Plan; Describe Data, Explore Data, Verify Data Quality; Clean Data, Construct Data, Integrate Data, Format Data; Generate Test Design, Build Model, Assess Model; Review Process, Determine Next Steps; Plan Monitoring & Maintenance, Produce Final Report, Review Project]
Precourse: Exploratory Data Analysis
Case Study 1
An e-commerce company wants to optimize its marketing strategy
by identifying distinct customer groups. The company collects data
on purchase history, browsing behavior, and customer
demographics. Using data analytics, they aim to create targeted
marketing campaigns and improve customer retention.
https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset
Case Study 2
An online retailer wants to enhance the shopping experience by
recommending personalized products to customers. They have
historical transaction data, including products purchased together
and customer preferences. By leveraging machine learning, the
retailer aims to implement a recommendation system to increase
sales.
https://www.kaggle.com/datasets/asaniczka/amazon-canada-products-2023-2-1m-products
1. Business Understanding
● Define the business problem.
● Explain the business objectives and success criteria.
2. Data Understanding
● Describe the dataset (features, size, source, missing values,
etc.).
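As a rough illustration of the Data Understanding step, the sketch below reports the size, the feature names, and the missing-value count per column of a dataset. The inline CSV, its column names, and the helper `describe_dataset` are hypothetical stand-ins for illustration; they are not taken from either case-study dataset.

```python
import csv
import io

# Hypothetical sample standing in for a real customer dataset.
SAMPLE_CSV = """customer_id,age,city,total_spend
1,34,Jakarta,120.5
2,,Bandung,80.0
3,45,,
4,29,Surabaya,200.0
"""

def describe_dataset(csv_text):
    """Return the row count, column names, and missing-value count per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # An empty string (or None) is treated as a missing value.
    missing = {
        col: sum(1 for row in rows if row[col] in ("", None))
        for col in columns
    }
    return {"n_rows": len(rows), "columns": columns, "missing": missing}

summary = describe_dataset(SAMPLE_CSV)
print(summary)
```

With a real dataset, the same questions (how many rows, which features, where values are missing) would be asked of the downloaded file instead of the inline sample.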
Introduction to Machine Learning
Machine learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from
data, uncover patterns, and make decisions or predictions without being
explicitly programmed.
Examples of ML Implementation
● Predictive Analytics (forecasting sales, customer demand).
○ Predictive analytics is a branch of advanced analytics used to make predictions about future outcomes based on historical data.
● Customer Segmentation (personalizing marketing strategies).
● Fraud Detection (identifying abnormal behavior).
Types of Machine Learning
Supervised, Unsupervised, and Reinforcement Learning
Supervised Learning
Supervised learning relies on labeled data, where the model learns from both inputs (features) and
corresponding outputs (labels).
In supervised learning, we train the machine on well-labelled data, meaning each training example is already tagged with the correct answer. The algorithm analyses these training examples and learns a mapping from inputs to outputs. When the machine is then given a new set of examples, it uses that mapping to predict their outcomes.
Types of Problems on Supervised Learning
● Classification problems
The model predicts a discrete value (a class label).
● Regression problems
The model predicts a continuous value.
Classification problems
● The model predicts a discrete value: each input is treated as a member of a particular class or group.
● For instance, in a dataset of fruit photos, each photo is labelled as a mango, an apple, etc. The algorithm must then classify new images into one of these categories.
Algorithm
● Naive Bayes Classifier
● Support Vector Machines
● Logistic Regression
Predictive Analytics Techniques
Logistic Regression
Used when the dependent variable is binary (e.g., yes/no outcomes).
Logistic regression models the probability that a given input belongs to a
particular category.
Use in Management:
● Credit Scoring: Predicting the likelihood that a borrower will default on a loan.
● Employee Retention: Identifying factors that influence whether an employee will stay or leave.
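To make the idea of "modeling a probability" concrete, here is a minimal sketch of the logistic function applied to the credit scoring use case. The intercept and coefficient are hypothetical, hand-picked values for illustration (a real model would estimate them from data), and `predict_default_probability` is an assumed helper name.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default_probability(debt_to_income, intercept=-3.0, coefficient=6.0):
    """Probability of default under hypothetical, hand-picked coefficients."""
    return sigmoid(intercept + coefficient * debt_to_income)

for ratio in (0.2, 0.5, 0.8):
    p = predict_default_probability(ratio)
    # A 0.5 threshold turns the probability into a yes/no prediction.
    label = "default" if p >= 0.5 else "no default"
    print(f"debt-to-income={ratio:.1f}  P(default)={p:.3f}  -> {label}")
```

The output is a probability rather than a hard class. The 0.5 threshold is a common default, but it can be moved when false positives and false negatives carry different costs.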
Evaluation Metrics - Classification problems
● True Negative (TN): The model correctly predicted a negative
outcome (the actual outcome was negative).
● True Positive (TP): The model correctly predicted a positive
outcome (the actual outcome was positive).
● False Negative (FN): The model incorrectly predicted a negative
outcome (the actual outcome was positive).
● False Positive (FP): The model incorrectly predicted a positive
outcome (the actual outcome was negative).
Confusion matrix
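The four outcomes defined above can be counted directly from a list of actual and predicted labels. The sketch below assumes a binary task in which 1 marks the positive class; the label lists and the helper `confusion_counts` are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: wrong if the actual label is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```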
● Accuracy: the proportion of all predictions that the model makes correctly. It is calculated by dividing the number of correct predictions by the total number of predictions: (TP + TN) / (TP + TN + FP + FN).
● Precision: the proportion of positive predictions that are actually correct. It is calculated by dividing the number of true positives by the total number of positive predictions: TP / (TP + FP).
● Recall: the proportion of all actual positive examples that the model correctly identifies. It is calculated by dividing the number of true positives by the total number of positive examples: TP / (TP + FN).
● F1 score: the harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall). It is high only when both precision and recall are high.
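The four metrics above can be computed from the confusion-matrix counts alone. The following sketch uses hypothetical counts; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.85 while recall is only 0.80, which shows why a single metric rarely tells the whole story.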
Regression problems
● These problems involve predicting continuous values.
● For example, predicting the price of a piece of land in a city, given the area, location, number of rooms, etc. The machine estimates the price for a new input based on previous examples.
Algorithm
● Linear Regression
● Nonlinear Regression
● Bayesian Linear Regression
Predictive Analytics Techniques
Linear Regression
A foundational technique to model the relationship between a dependent
variable and one or more independent variables.
● Equation: $Y = a + bX$ (where Y is the predicted value, X is the independent variable, and a, b are
constants).
● Use in Management:
○ Sales Forecasting: Predicting sales volume based on marketing spend.
○ Cost Estimation: Estimating operational costs based on variable factors like demand.
○ Case Study: At MIT Sloan, linear regression models were used to predict energy consumption based on seasonal
variations and pricing models.
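For a single independent variable, the constants a (intercept) and b (slope) in Y = a + bX can be estimated by ordinary least squares. The sketch below is one standard way to do this; the marketing-spend and sales figures are hypothetical and chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate intercept a and slope b of Y = a + bX by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X.
marketing_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

a, b = fit_simple_linear_regression(marketing_spend, sales)
forecast = a + b * 6.0  # predicted sales for a spend of 6.0
print(f"a={a:.2f}, b={b:.2f}, forecast={forecast:.2f}")
```

Real data will not fall exactly on a line, so the fitted values then describe the best straight-line approximation rather than an exact relationship.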
Evaluation Metrics - Regression problems
● Mean Squared Error (MSE): MSE measures the average squared difference between the predicted values
and the actual values. Lower MSE values indicate better model performance.
● Root Mean Squared Error (RMSE): RMSE is the square root of MSE, so it is expressed in the same units as the target variable and can be read as the typical size of a prediction error. As with MSE, lower RMSE values indicate better model performance.
● Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
● R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the
target variable that is explained by the model. Higher R-squared values indicate better model fit.
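The four regression metrics can be computed in a few lines. This is a minimal sketch on hypothetical values; `regression_metrics` is an assumed helper name, and the R-squared formula used is the usual 1 - SS_res / SS_tot.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE, and R-squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # R-squared compares the model's squared error with that of always predicting the mean.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(actual, predicted)
print(scores)
```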
Example of Supervised Learning
Let’s say you have a basket of fruit that you want to identify. The machine would first analyze the image of each fruit to extract
features such as its shape, color, and texture. Then, it would compare these features to the features of the fruits
it has already learned about. If the new image’s features are most similar to those of an apple, the machine
would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train the
machine with all the different fruits one by one, like this:
● If the object is rounded, has a depression at the top, and is red in color, then it is labeled Apple.
● If the object is a long curving cylinder with a green-yellow color, then it is labeled Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is
asked to identify it.
Since the machine has already learned from the previous data, it can now apply that knowledge. It will
first classify the fruit by its shape and color, confirm the fruit name as Banana, and put it in the
Banana category. Thus the machine learns from training data (the basket containing fruits) and then
applies the knowledge to test data (the new fruit).
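One simple way to read "most similar" in the fruit example is a 1-nearest-neighbor rule: label a new fruit with the label of the closest training example. The slides do not name a specific algorithm, so this is only one possible choice, and the two numeric features (roundness and redness on a 0 to 1 scale) and their values are hypothetical.

```python
import math

# Hypothetical training data: (roundness, redness), each on a 0-1 scale, with a label.
TRAINING = [
    ((0.9, 0.9), "Apple"),
    ((0.8, 0.8), "Apple"),
    ((0.2, 0.1), "Banana"),
    ((0.1, 0.2), "Banana"),
]

def classify(features, training=TRAINING):
    """1-nearest-neighbor: return the label of the closest training example."""
    nearest = min(training, key=lambda item: math.dist(features, item[0]))
    return nearest[1]

print(classify((0.15, 0.15)))  # a long, green-yellow fruit
print(classify((0.85, 0.95)))  # a round, red fruit
```

The labeled tuples play the role of the training basket, and each call to `classify` plays the role of presenting a new fruit.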
Applications of Supervised learning
● Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails based
on their content, helping users avoid unwanted messages.
● Image classification: Supervised learning can automatically classify images into different categories,
such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and
image-based product recommendations.
● Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data, such as
medical images, test results, and patient history, to identify patterns that suggest specific diseases or
conditions.
● Fraud detection: Supervised learning models can analyze financial transactions and identify patterns
that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.
● Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks, including
sentiment analysis, machine translation, and text summarization, enabling machines to understand and
process human language effectively.
Q&A - Learning Methods in Machine Learning
1. Finding out whether someone will fail to repay their debt.
Question:
Which learning method is used to predict whether someone will default on their debt?
Answer:
The method used is Supervised Learning with a Classification technique. Models
such as Logistic Regression, Decision Tree, Random Forest, or Neural Networks can
be used to predict whether someone will default based on
variables such as credit history, income, and debt-to-income ratio.
2. Automatically grouping bird species based on several variables.
Question:
Which learning method is suitable for grouping bird species without
predefined labels?
Answer:
The method used is Unsupervised Learning with a Clustering technique, such as
K-Means, Hierarchical Clustering, or DBSCAN. This technique groups
birds based on features such as body size, feather color, and migration patterns
without requiring initial labels.
3. Estimating the air temperature at a given location on the following day.
Question:
Which learning method is suitable for predicting tomorrow's air temperature
based on historical data?
Answer:
The method used is Supervised Learning with a Regression technique, such as
Linear Regression, Random Forest Regression, or Neural Networks. The model
learns temperature patterns from historical data and other weather variables to make
predictions.
4. Calculating the travel time from one point to the next using
a particular mode of transportation.
Question:
How can machine learning be used to estimate
travel time?
Answer:
The method used is Supervised Learning with a Regression approach,
for example Gradient Boosting Regression or Neural Networks. The model
uses data such as distance, traffic conditions, the average speed of the mode of
transportation, and weather to predict travel time.
Advantages of Supervised learning
● Supervised learning uses labeled data collected from previous experience to produce outputs for new data.
● Helps to optimize performance criteria with the help of experience.
● Supervised machine learning helps to solve various types of real-world computation problems.
● It performs classification and regression tasks.
● It allows estimating or mapping the result to a new sample.
● We have complete control over choosing the number of classes we want in the training data.
Unsupervised Learning
● Unsupervised learning is self-organized learning. Its main aim is to explore the underlying patterns in the data and produce an
output (such as groupings) from them. Here we provide the machine with data and ask it to look for hidden features and cluster the data in a
way that makes sense.
● Unsupervised learning is the training of a machine using information that is neither classified nor labeled,
allowing the algorithm to act on that information without guidance.
● The task of the machine is to group unsorted information according to similarities, patterns, and differences
without any prior training on labeled data.
Example of Unsupervised Learning
Imagine a machine learning model is given a large dataset of unlabeled images containing both
dogs and cats. The model has never been told what a dog or a cat is, and it has no pre-existing labels or
categories for these animals. The task is to use unsupervised learning to separate the dogs from the cats.
Because the machine has no idea about the features of dogs and cats, it cannot name the groups ‘dogs’ and ‘cats’.
But it can categorize the images according to their similarities, patterns, and differences, i.e., it can split
the collection into two parts. The first may contain all pictures with dogs in them and the second may
contain all pictures with cats in them. The machine has not learned anything beforehand, which means there is no training data and there are no labeled
examples.
Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It
mainly deals with unlabelled data.
Types of Unsupervised Learning
Clustering
● Clustering is a type of unsupervised learning that is used to group similar data points together.
● Centroid-based clustering algorithms such as K-means work iteratively: each data point is assigned to its nearest cluster center, and each center is then moved to the mean of the points assigned to it. Other algorithms group points by density or by building a hierarchy.
Algorithm
● Hierarchical clustering
● K-means clustering
● Gaussian Mixture Models (GMMs)
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Related unsupervised techniques (dimensionality reduction and signal separation, not clustering)
● Principal Component Analysis
● Singular Value Decomposition
● Independent Component Analysis
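As an illustration of how an iterative clustering algorithm proceeds, here is a minimal K-means sketch (Lloyd's algorithm) that alternates between assigning points to the nearest center and recomputing each center as the mean of its points. The data points, the starting centers, and the fixed iteration count are hypothetical choices; practical implementations also handle initialization and convergence checks more carefully.

```python
import math

def k_means(points, centers, iterations=10):
    """Lloyd's algorithm: alternate the assignment step and the center-update step."""
    centers = list(centers)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
print(centers)
```

Note that no labels are supplied: the grouping comes only from distances between points.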
Types of Unsupervised Learning
Association
● Association rule learning is a type of unsupervised learning that is used to identify patterns in data.
Association rule learning algorithms work by finding relationships between different items in a dataset, for example items that are frequently purchased together.
Algorithm
● Apriori Algorithm
● Eclat Algorithm
● FP-Growth Algorithm
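Association rules are usually scored with support (how often an itemset appears) and confidence (how often the consequent appears when the antecedent does). The sketch below computes these two measures directly on a hypothetical list of shopping baskets. It does not implement Apriori, Eclat, or FP-Growth, which are methods for finding frequent itemsets efficiently.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

s = support(transactions, {"bread", "milk"})
c = confidence(transactions, {"bread"}, {"milk"})
print(f"support(bread, milk) = {s:.2f}")
print(f"confidence(bread -> milk) = {c:.2f}")
```

In this sample, bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain milk.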
Unsupervised Learning
Evaluation Metrics
● Silhouette score: The silhouette score measures how well each data point is clustered with its own
cluster members and separated from other clusters. It ranges from -1 to 1, with higher scores indicating
better clustering.
● Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance
between clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores
indicating better clustering.
● Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings, for example a clustering result and a set of reference labels. Its maximum is 1 (identical clusterings), values near 0 indicate agreement no better than chance, and negative values are possible. Higher scores indicate more similar clusterings.
● Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between clusters. It
ranges from 0 to infinity, with lower scores indicating better clustering.
● F1 score: The F1 score is the harmonic mean of precision and recall, two metrics that are
commonly used in supervised learning to evaluate classification models. It can also
be used to evaluate clustering models, but only when ground-truth labels are available to compare the clusters against.
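To show what the silhouette score measures, the sketch below computes it from scratch for a small, hypothetical one-dimensional dataset. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the point's score is (b - a) / max(a, b). The sketch assumes at least two clusters and scores singleton clusters as 0, which is a common convention.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 1-D points forming two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # groups mixed up
print(f"good={good:.3f}, bad={bad:.3f}")
```

The sensible grouping scores close to 1, while the mixed-up grouping scores below 0, matching the interpretation given above.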
Applications of Unsupervised learning
● Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal
behavior in data, enabling the detection of fraud, intrusion, or system failures.
● Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific
data, leading to new hypotheses and insights in various scientific fields.
● Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior
and preferences to recommend products, movies, or music that align with their interests.
● Customer segmentation: Unsupervised learning can identify groups of customers with similar
characteristics, allowing businesses to target marketing campaigns and improve customer service more
effectively.
● Image analysis: Unsupervised learning can group images based on their content, facilitating tasks such
as image classification, object detection, and image retrieval.
Disadvantages of Unsupervised learning
● Difficult to measure accuracy or effectiveness due to lack of predefined answers during training.
● The results often have lower accuracy.
● The user needs to spend time interpreting and labeling the groups that the algorithm produces.
● Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
● Without labeled data, it can be difficult to evaluate the performance of unsupervised learning models, making it
challenging to assess their effectiveness.
Types of Machine Learning
Reinforcement Learning
The algorithms learn to react to an environment on their own.
● Reinforcement Learning (RL) is a more complex approach where an agent learns by interacting with the environment,
receiving rewards or penalties based on its actions. The agent’s goal is to maximize cumulative reward over time.
● For a learning agent, there is always a start state and an end state, but there may be several different paths to the end state.
In a Reinforcement Learning problem, an agent acts on the environment and moves from
one state to another. The agent receives a reward on success and receives no reward (or a penalty)
on failure. In this way, the agent learns from the environment.
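The reward-driven loop described above can be sketched with tabular Q-learning, one common RL algorithm (the slides do not name a specific one). The environment here is a hypothetical five-state corridor: the agent starts in state 0, the end state is state 4, and the only reward is 1 for reaching the end state. The learning rate, discount factor, and exploration rate are illustrative values.

```python
import random

N_STATES = 5  # states 0..4; state 4 is the end (goal) state
ACTIONS = ("left", "right")

def step(state, action):
    """Deterministic environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            best = max(q[state].values())
            greedy = [a for a in ACTIONS if q[state][a] == best]
            # Explore with probability epsilon, otherwise act greedily
            # (ties between equally good actions are broken at random).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = rng.choice(greedy)
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy moves right in every state, and the learned values are larger for states closer to the goal because later rewards are discounted.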
Applications in Business
● Dynamic Pricing: RL can be used in e-commerce to adjust pricing dynamically, learning optimal price points that
maximize revenue while responding to competitor pricing.
● Supply Chain Optimization: In logistics, RL optimizes supply chain operations by learning the most efficient routing
and inventory levels over time.
Implementation Steps
Workflow Concept
• Workflow: a series of connected widgets that together build a data analysis
pipeline.
• Example workflow:
• Input Data → Preprocessing → Visualization → Machine Learning Model →
Evaluation
Function Categories
Student Tasks:
1. Save the accuracy results and confusion matrix of each model.
2. Compare the strengths and weaknesses of each model.
3. Determine the best model based on the evaluation results.
Part 3: Conclusion and Assignment Report
Students are asked to prepare a report in PDF or Word format with the following structure:
1. Introduction: A brief explanation of the Iris dataset and the purpose of the analysis. (TM3)
2. Descriptive Analysis
   1. Descriptive statistics table.
   2. Data visualizations with interpretation.
3. Classification Model Implementation
   1. Screenshots of the workflow in Orange Data Mining.
   2. Accuracy comparison results for the five models.
4. Model Evaluation
   1. Accuracy comparison table for all models.
   2. Confusion Matrix for the best model.
   3. Interpretation of the ROC Curve results.
5. Conclusion
   1. Which model is best at classifying Iris Setosa?
   2. What factors influence the classification results?
   3. Recommendations for improving model accuracy.
Submission and Grading Format
Submission format:
• Report in PDF/Word format.
• Orange workflow file (.ows).
Grading weights:
Aspect | Weight (%)
Data Exploration & Descriptive Statistics | 20%
Classification Model Implementation | 30%
Model Evaluation & Comparison | 30%
Conclusion & Interpretation | 20%
Homework
Perform EDA and build a machine learning model using this dataset:
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Submit it as a Google Colab notebook with your group.