TM 4 - Data Mining and Machine Learning

The document provides an introduction to Data Mining and Machine Learning, covering topics such as the CRISP-DM methodology, types of machine learning (supervised, unsupervised, and reinforcement learning), and their applications in business. It includes case studies on e-commerce and product recommendations, as well as discussions on classification and regression problems in supervised learning. Additionally, it outlines the advantages and disadvantages of supervised learning and emphasizes its importance in predictive analytics and decision-making.

Uploaded by

Musa Radhitia


Introduction to Data Mining and Machine Learning

Dr. Donny Maha Putra, S.Kom., M.Ak

Lecturer at UPN Veteran Jakarta and Data Analytic Officer at the Ministry of Finance of Indonesia

Salsabila Ramadhina S.Kom MSc

Lecturer at UPN Veteran Jakarta, Product Manager at Ultra Voucher, and Scholarship Mentor at Kobi Education

Feb.upnv febupnvj feb.upnvj.ac.id febupnvj@upnvj.ac.id

Session Objectives
● Overview of Machine Learning
● CRISP-DM
● Case Study (15 minutes discussion)
● Supervised vs Unsupervised Learning
● Supervised Learning Classification Problem
● Supervised Learning Regression Problem
● Mid-Exam Session
What is Data Science?
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and to apply actionable knowledge and insights from data across a wide range of application domains.
Machine Learning

Machine learning is a set of methods for automatically detecting patterns in data and using them to predict future data and guide decision-making. In other words, learning from data.
Deep Learning
Data Science Methodology (CRISP-DM)
What is CRISP-DM?
• Cross-Industry Standard Process for Data Mining
• The most popular methodology for data analytics projects
Data Mining Lifecycle
We will discuss this part
▪ Business Understanding
▪ Data Understanding
▪ Data Preparation
▪ Modeling
▪ Evaluation
▪ Deployment

▪ Notice the iteration!

THE FAMOUS CRISP-DM

CRISP-DM: Phases
• Business Understanding
Project objectives and requirements understanding, Data mining problem definition
• Data Understanding
Initial data collection and familiarization, Data quality problems identification
• Data Preparation
Table, record and attribute selection, Data transformation and cleaning
• Modeling
Modeling techniques selection and application, Parameters calibration
• Evaluation
Business objectives & issues achievement evaluation
• Deployment
Result model deployment, Repeatable data mining process implementation
Phases and Tasks
• Business Understanding
Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
• Data Understanding
Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
• Data Preparation
Select Data, Clean Data, Construct Data, Integrate Data, Format Data
• Modeling
Select Modeling Technique, Generate Test Design, Build Model, Assess Model
• Evaluation
Evaluate Results, Review Process, Determine Next Steps
• Deployment
Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
Precourse: Exploratory Data Analysis

Case Study 1
An e-commerce company wants to optimize its marketing strategy
by identifying distinct customer groups. The company collects data
on purchase history, browsing behavior, and customer
demographics. Using data analytics, they aim to create targeted
marketing campaigns and improve customer retention.

https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset

E-commerce Customer Behavior Dataset (Kaggle)

Precourse: Exploratory Data Analysis

Case Study 2
An online retailer wants to enhance the shopping experience by
recommending personalized products to customers. They have
historical transaction data, including products purchased together
and customer preferences. By leveraging machine learning, the
retailer aims to implement a recommendation system to increase
sales.

https://www.kaggle.com/datasets/asaniczka/amazon-canada-products-2023-2-1m-products

Amazon Product Recommendation Dataset

Precourse: Exploratory Data Analysis

Discuss with your team (15 minutes)

1. Business Understanding
● Define the business problem.
● Explain the business objectives and success criteria.

2. Data Understanding
● Describe the dataset (features, size, source, missing values, etc.).
Introduction to Machine Learning
Machine Learning is a branch of Artificial Intelligence (AI) that enables computers to learn from
data, uncover patterns, and make decisions or predictions without being
explicitly programmed.

How does it work?

ML systems build statistical models based on sample data (training data) to make decisions or predictions.

Why is it important in business?

Machine Learning enables predictive analytics, automation, and personalized customer experiences, which
significantly improve business performance and decision-making.

Examples of ML Implementation
● Predictive Analytics (forecasting sales, customer demand).
○ Predictive analytics is a branch of advanced analytics used to make predictions about future outcomes based on historical data.
● Customer Segmentation (personalizing marketing strategies).
● Fraud Detection (identifying abnormal behavior).
Types of Machine Learning
Supervised, Unsupervised, and Reinforcement Learning
Types of Machine Learning
Supervised Learning
relies on labeled data, where the model learns from both inputs (features) and
corresponding outputs (labels)

In supervised learning, we train the machine on well-labelled data, meaning each training example is already tagged with the correct answer. The algorithm analyses these training examples and learns a mapping from inputs to outputs. When the machine is then given a new set of examples, it uses that mapping to predict their labels.
Types of Problems on Supervised Learning
● Classification problems
The algorithm predicts a discrete value.
● Regression problems
The algorithm predicts a continuous value.
Types of Problems on Supervised Learning
Classification problems
● The algorithm predicts a discrete value: each input is assigned to a particular class or group.
● For instance, in a dataset of fruit photos, each photo is labelled as a mango, an apple, etc. The algorithm then has to classify new images into one of these categories.

Algorithm
● Naive Bayes Classifier
● Support Vector Machines
● Logistic Regression
Predictive Analytics Techniques
Logistic Regression
Used when the dependent variable is binary (e.g., yes/no outcomes).
Logistic regression models the probability that a given input belongs to a
particular category.

Use in Management:
● Credit Scoring: Predicting the likelihood that a borrower will default on a loan.
● Employee Retention: Identifying factors that influence whether an employee will stay or leave.
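As a rough illustration of how logistic regression turns an input into a probability, the sketch below applies the logistic (sigmoid) function to a single feature. The feature name `debt_to_income` and the coefficient values are made up for illustration; in practice the coefficients are estimated from labeled data.

```python
import math

def predict_default_probability(debt_to_income, intercept=-3.0, slope=4.0):
    """Logistic model: p = 1 / (1 + exp(-(intercept + slope * x))).

    The intercept and slope are hypothetical, not fitted values.
    """
    z = intercept + slope * debt_to_income
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Turn a probability into a binary yes/no decision."""
    return probability >= threshold

low_risk = predict_default_probability(0.2)   # linear score z = -2.2
high_risk = predict_default_probability(0.9)  # linear score z = 0.6
print(round(low_risk, 3), classify(low_risk))
print(round(high_risk, 3), classify(high_risk))
```

The 0.5 threshold is only a default; a lender worried about missed defaults might choose a lower one.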
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems
● True Positive (TP): The model correctly predicted a positive
outcome (the actual outcome was positive).
● True Negative (TN): The model correctly predicted a negative
outcome (the actual outcome was negative).
● False Positive (FP): The model incorrectly predicted a positive
outcome (the actual outcome was negative).
● False Negative (FN): The model incorrectly predicted a negative
outcome (the actual outcome was positive).

Confusion matrix
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems

● True Positive (TP): The number of cases where both the predicted
and the actual value are Dog.
● True Negative (TN): The number of cases where both the predicted
and the actual value are Not Dog.
● False Positive (FP): The number of cases where the prediction is
Dog but the actual value is Not Dog.
● False Negative (FN): The number of cases where the prediction is
Not Dog but the actual value is Dog.
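The four cells of the confusion matrix can be counted directly from paired lists of actual and predicted labels. The sketch below does this for the Dog / Not Dog example; the eight labels are invented for illustration.

```python
def confusion_counts(actual, predicted, positive="Dog"):
    """Count TP, TN, FP, FN for a binary problem with one positive label."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Hypothetical labels, not taken from a real dataset.
actual = ["Dog", "Dog", "Dog", "Not Dog", "Not Dog", "Not Dog", "Dog", "Not Dog"]
predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Not Dog"]

counts = confusion_counts(actual, predicted)
print(counts)
```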

Confusion matrix
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems
● Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is calculated by
dividing the number of correct predictions by the total number of predictions.
● Precision: Precision is the percentage of positive predictions that the model makes that are actually
correct. It is calculated by dividing the number of true positives by the total number of positive predictions.
● Recall: Recall is the percentage of all positive examples that the model correctly identifies. It is
calculated by dividing the number of true positives by the total number of positive examples.
● F1 score: The F1 score combines precision and recall into a single number. It is calculated as their
harmonic mean, so it is high only when both precision and recall are high.
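The four metrics above follow directly from the confusion-matrix counts. A minimal sketch, using made-up counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 when a denominator is zero is one common convention; other tools may report the value as undefined instead.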
Types of Problems on Supervised Learning
Regression problems
● These problems involve predicting continuous values.
● For example, predicting the price of a property in a city, given its area, location, number of rooms, etc.
The input is passed to the model, which estimates the price based on previous examples.

Algorithm
● Linear Regression
● Nonlinear Regression
● Bayesian Linear Regression
Predictive Analytics Techniques
Linear Regression
A foundational technique to model the relationship between a dependent
variable and one or more independent variables.

● Equation: $Y = a + bX$ (where Y is the predicted value, X is the independent variable, a is the intercept, and b is the slope).
● Use in Management:
○ Sales Forecasting: Predicting sales volume based on marketing spend.
○ Cost Estimation: Estimating operational costs based on variable factors like demand.
○ Case Study: At MIT Sloan, linear regression models were used to predict energy consumption based on seasonal
variations and pricing models.
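For a single independent variable, the constants a and b in Y = a + bX can be estimated with ordinary least squares. The sketch below fits the line on a small invented marketing-spend series and uses it for a forecast; the numbers are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (a, b) in Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data: sales are constructed as exactly 5 + 2 * spend.
marketing_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

a, b = fit_line(marketing_spend, sales)
forecast = a + b * 60  # predicted sales for a spend of 60
print(a, b, forecast)
```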
Types of Problems on Supervised Learning
Evaluation Metrics - Regression problems

● Mean Squared Error (MSE): MSE measures the average squared difference between the predicted values
and the actual values. Lower MSE values indicate better model performance.
● Root Mean Squared Error (RMSE): RMSE is the square root of MSE, so it is expressed in the same units as
the target variable. As with MSE, lower RMSE values indicate better model performance.
● Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
● R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the
target variable that is explained by the model. Higher R-squared values indicate better model fit.
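All four regression metrics can be computed from the list of prediction errors. A minimal sketch with four invented data points:

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R-squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)                 # sum of squared errors
    mse = sse / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1 - sse / sst}

# Hypothetical values for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

scores = regression_metrics(actual, predicted)
print(scores)
```

Note how the single error of 2 contributes 4 to the squared-error sum but only 2 to the absolute-error sum, which is why MAE reacts less strongly to outliers.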
Example of Supervised Learning

Let's say you have a basket of fruit that you want the machine to identify. The machine first analyzes each image to extract
features such as shape, color, and texture. It then compares these features to those of the fruits
it has already learned about. If the new image's features are most similar to those of an apple, the machine
predicts that the fruit is an apple.

For instance, suppose you are given a basket filled with different kinds of fruit. The first step is to train the
machine on each kind of fruit, one by one:

● If the object is round with a depression at the top and is red, it is labeled Apple.
● If the object is a long curving cylinder and is green-yellow, it is labeled Banana.

Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and
asked to identify it.

Because the machine has already learned from the previous data, it can now apply that knowledge. It
classifies the fruit by its shape and color, identifies it as a banana, and puts it in the
Banana category. In this way the machine learns from training data (the basket of fruit) and then
applies that knowledge to test data (the new fruit).
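One simple way to make "compare the new fruit's features to the fruits already learned" concrete is a nearest-neighbor classifier. This is only one possible technique, chosen here for brevity; the two numeric features (roundness and redness on a 0 to 1 scale) and their values are invented for illustration.

```python
# Labeled training examples: ((roundness, redness), label). Values are hypothetical.
training_fruits = [
    ((0.9, 0.9), "Apple"),
    ((0.8, 0.8), "Apple"),
    ((0.2, 0.1), "Banana"),
    ((0.1, 0.2), "Banana"),
]

def predict_fruit(features, training):
    """1-nearest-neighbor: return the label of the closest training example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda example: squared_distance(example[0], features))
    return nearest[1]

new_fruit = (0.15, 0.15)  # long and not red
label = predict_fruit(new_fruit, training_fruits)
print(label)
```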
Applications of Supervised learning
Supervised Learning
● Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails based
on their content, helping users avoid unwanted messages.
● Image classification: Supervised learning can automatically classify images into different categories,
such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and
image-based product recommendations.
● Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data, such as
medical images, test results, and patient history, to identify patterns that suggest specific diseases or
conditions.
● Fraud detection: Supervised learning models can analyze financial transactions and identify patterns
that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.
● Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks, including
sentiment analysis, machine translation, and text summarization, enabling machines to understand and
process human language effectively.
Q&A - Learning Methods in Machine Learning
1. Finding out whether someone will default on their debt.
Question:
Which learning method is used to predict whether someone will default on their debt?

Answer:
The method used is Supervised Learning with a Classification technique. Models
such as Logistic Regression, Decision Tree, Random Forest, or Neural Networks can be
used to predict whether someone will default, based on
variables such as credit history, income, and debt-to-income ratio.
Q&A - Learning Methods in Machine Learning
2. Automatically grouping bird species based on several variables.
Question:
Which learning method is suitable for grouping bird species without
predefined labels?
Answer:
The method used is Unsupervised Learning with a Clustering technique, such as
K-Means, Hierarchical Clustering, or DBSCAN. These techniques
group birds based on features such as body size, feather color, and migration pattern,
without needing initial labels.
Q&A - Learning Methods in Machine Learning
3. Estimating the air temperature at a given place on the following day.
Question:
Which learning method is suitable for predicting tomorrow's air temperature
based on historical data?

Answer:
The method used is Supervised Learning with a Regression technique, such as
Linear Regression, Random Forest Regression, or Neural Networks. The model
learns temperature patterns from historical data and other weather variables to make
predictions.
Q&A - Learning Methods in Machine Learning
4. Calculating the travel time from one point to the next using
a particular mode of transport.
Question:
How can machine learning be used to estimate
travel time?

Answer:
The method used is Supervised Learning with a Regression approach,
for example Gradient Boosting Regression or Neural Networks. The model
uses data such as distance, traffic conditions, the average speed of the mode of
transport, and weather to predict travel time.
Advantages of Supervised learning

● Supervised learning learns from previous experience (labeled examples) to produce outputs for new data.
● It helps optimize performance criteria using that experience.
● It helps solve many kinds of real-world computation problems.
● It performs classification and regression tasks.
● It allows estimating or mapping the result for a new sample.
● We have complete control over the number of classes in the training data.

Disadvantages of Supervised learning

● Classifying big data can be challenging.
● Training needs a lot of computation time.
● Supervised learning cannot handle all complex tasks in Machine Learning.
● It requires a labelled data set.
● It requires a training process.
Types of Machine Learning
Unsupervised Learning
deals with unlabeled data: the data does not have any pre-existing labels or
categories.

● Unsupervised learning is self-organized learning. Its main aim is to explore the underlying patterns in the data.
We provide the machine with data and ask it to look for hidden features and cluster the data in a
way that makes sense.
● Unsupervised learning is the training of a machine using information that is neither classified nor labeled,
allowing the algorithm to act on that information without guidance.
● The task of the machine is to group unsorted information according to similarities, patterns, and differences
without any prior training on labeled data.
Example of Unsupervised Learning

Imagine you have a large dataset of unlabeled images containing both
dogs and cats. The model has never been told which images show a dog and which show a cat, and it has no
pre-existing labels or categories for these animals. Your task is to use unsupervised learning to separate
the dogs from the cats.

For instance, suppose the machine is given a set of images of dogs and cats that it has never seen.

Because the machine has no labels for dogs and cats, it cannot name the groups 'dogs' and 'cats'.
But it can categorize the images according to their similarities, patterns, and differences, i.e., it can split
the collection into two parts. The first may contain all pictures with dogs in them and the second
all pictures with cats in them. The machine was not taught anything beforehand, which means there was no
labeled training data or examples.

Unsupervised learning allows the model to work on its own to discover patterns and information that were previously
undetected. It mainly deals with unlabelled data.
Types of Unsupervised Learning
Clustering
● Type of unsupervised learning that is used to group similar data points together.
● Centroid-based clustering algorithms such as K-means work iteratively: each data point is assigned to its nearest
cluster center, and each center is then moved to the mean of its assigned points, until the assignments stop changing.

Algorithm
● Hierarchical clustering
● K-means clustering
● Gaussian Mixture Models (GMMs)
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Related unsupervised techniques (dimensionality reduction and signal separation rather than clustering)
● Principal Component Analysis
● Singular Value Decomposition
● Independent Component Analysis
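To show what a clustering algorithm does step by step, here is a minimal K-means sketch in plain Python. The six 2-D points and the two starting centers are invented; real implementations usually pick starting centers randomly and stop when assignments no longer change, whereas this sketch simply runs a fixed number of iterations.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(point, centers[i]))
            clusters[nearest].append(point)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```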
Types of Unsupervised Learning
Association

● Association rule learning is a type of unsupervised learning that is used to identify patterns in data.
Association rule learning algorithms work by finding relationships between different items in a dataset.

Algorithm
● Apriori Algorithm
● Eclat Algorithm
● FP-Growth Algorithm
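Algorithms such as Apriori rank candidate rules using two basic quantities, support and confidence. The sketch below computes both for one rule on four invented shopping baskets; it illustrates the measures only and is not an implementation of Apriori itself.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```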
Unsupervised Learning
Evaluation Metrics
● Silhouette score: The silhouette score measures how well each data point is clustered with its own
cluster members and separated from other clusters. It ranges from -1 to 1, with higher scores indicating
better clustering.
● Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance
between clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores
indicating better clustering.
● Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings, corrected
for chance. It is at most 1 (identical clusterings), is close to 0 for random assignments, and can be negative.
It is typically used to compare a clustering against known reference labels.
● Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between clusters. It
ranges from 0 to infinity, with lower scores indicating better clustering.
● F1 score: The F1 score is the harmonic mean of precision and recall, two metrics that are
commonly used in supervised learning to evaluate classification models. It can also
be used to evaluate clustering models, but only when ground-truth labels are available for comparison.
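As an example of an internal clustering metric, the silhouette score can be computed from pairwise distances alone. For each point, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to any other cluster; the point's score is (b - a) / max(a, b). The sketch below uses four invented points and compares a sensible labeling with a poor one.

```python
import math

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    scores = []
    for i, point in enumerate(points):
        distances_by_cluster = {}
        for j, other in enumerate(points):
            if i != j:
                distances_by_cluster.setdefault(labels[j], []).append(math.dist(point, other))
        own = distances_by_cluster.pop(labels[i], None)
        if not own or not distances_by_cluster:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in distances_by_cluster.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight, well-separated pairs.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # pairs grouped correctly
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split across clusters
print(round(good, 3), round(bad, 3))
```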
Applications of Unsupervised learning
Unsupervised Learning

● Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal
behavior in data, enabling the detection of fraud, intrusion, or system failures.
● Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific
data, leading to new hypotheses and insights in various scientific fields.
● Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior
and preferences to recommend products, movies, or music that align with their interests.
● Customer segmentation: Unsupervised learning can identify groups of customers with similar
characteristics, allowing businesses to target marketing campaigns and improve customer service more
effectively.
● Image analysis: Unsupervised learning can group images based on their content, facilitating tasks such
as image classification, object detection, and image retrieval.
Q&A - Learning Methods in Machine Learning
2. Automatically grouping bird species based on several variables.
Question:
Which learning method is suitable for grouping bird species without
predefined labels?
Answer:
The method used is Unsupervised Learning with a Clustering technique, such as
K-Means, Hierarchical Clustering, or DBSCAN. These techniques
group birds based on features such as body size, feather color, and migration pattern,
without needing initial labels.
Advantages of Unsupervised learning

● It does not require training data to be labeled.

● Dimensionality reduction can be easily accomplished using unsupervised learning.
● Capable of finding previously unknown patterns in data.
● Unsupervised learning can help you gain insights from unlabeled data that you might not have been able to get
otherwise.
● Unsupervised learning is good at finding patterns and relationships in data without being told what to look for. This can
help you learn new things about your data.

Disadvantages of Unsupervised learning

● Difficult to measure accuracy or effectiveness because there are no predefined answers during training.
● The results are often less accurate.
● The user needs to spend time interpreting and labelling the resulting groups.
● Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
Types of Machine Learning
Reinforcement Learning
The algorithms learn to react to an environment on their own.

● Reinforcement Learning (RL) is a more complex approach where an agent learns by interacting with the environment,
receiving rewards or penalties based on its actions. The agent's goal is to maximize cumulative reward over time.
● For a learning agent, there is a start state and an end state, and there may be several different paths between them.
In a Reinforcement Learning problem, the agent acts on the environment and moves from
one state to another. The agent receives a reward on success and receives no reward (or a penalty)
on failure. In this way, the agent learns from the environment.
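The idea of an agent learning from rewards can be sketched with tabular Q-learning, one common RL algorithm, on a toy environment. Everything here is an assumption made for illustration: a corridor of five states, a start at state 0, a goal at state 4, a reward of 1 only on reaching the goal, and the chosen learning rate, discount, and exploration rate.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor; actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train_q_table()
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)
```

After training, the learned policy should prefer moving right in every non-goal state, since that is the shortest path to the reward.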

Applications in Business
● Dynamic Pricing: RL can be used in e-commerce to adjust pricing dynamically, learning optimal price points that
maximize revenue while responding to competitor pricing.
● Supply Chain Optimization: In logistics, RL optimizes supply chain operations by learning the most efficient routing
and inventory levels over time.
Implementation Steps

Step 1: Data Preparation

● Data Collection: Gather historical data relevant to the business problem (sales data, customer interactions, financial
records).
● Data Cleaning: Handle missing values and outliers, and normalize the data (as discussed in previous sessions).

Step 2: Model Selection

Choose the appropriate predictive technique based on:
● Nature of the Problem: Regression (continuous output) vs. classification (categorical output).
● Data Characteristics: Size, type (numerical, categorical), and structure.

Step 3: Training and Validation

● Split data into training and testing sets.
● Use cross-validation to ensure the model generalizes well to unseen data.
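The two bullets above can be sketched without any library: first hold out a test set, then rotate a validation fold through the remaining training rows. The row values, the 20% test fraction, and k = 5 are illustrative choices, and the helper names are hypothetical.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover rows.
        end = start + fold_size if fold < k - 1 else n_rows
        yield indices[:start] + indices[end:], indices[start:end]

rows = list(range(100))  # stand-in for 100 data records
train_rows, test_rows = train_test_split(rows)
folds = list(k_fold_indices(len(train_rows), k=5))
print(len(train_rows), len(test_rows), [len(val) for _, val in folds])
```

In each fold the model would be trained on the training indices and scored on the validation indices; the test set stays untouched until the final evaluation.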
Implementation Steps

Step 4: Performance Evaluation

● Metrics:
○ For classification: Accuracy, Precision, Recall, F1-Score.
○ For regression: Mean Squared Error (MSE), R-squared.
● Practical Example: Predicting the likelihood of customers subscribing to a new service based on past
behavior (classification problem). Evaluate performance using precision (to limit false
positives) and recall (to limit false negatives).
Step 5: Deployment and Continuous Monitoring
● Once the model is implemented, ensure it continues to perform well by monitoring predictions against
actual outcomes.
● Example: Use in financial forecasting, where predicted and actual revenue are compared periodically to
adjust the model.
Demo EDA and Machine Learning using Python [Colab]
Demo Machine Learning using Orange Data Mining
Fundamental Concepts in Orange Data Mining

Workflow Concept
• Workflow: A series of connected widgets that builds a data analysis pipeline.
• Example workflow:
• Input Data → Preprocessing → Visualization → Machine Learning Model → Evaluation

Supported Data Types

• CSV and Excel files for tabular data.
• SQL databases for processing large data.
• Images and text for image-based or text-based analysis.
• Streaming data for real-time analysis.
Introduction to Widgets in Orange
Widget Categories

Category | Function
Data | Uploading, cleaning, and filtering data
Visualization | Displaying data as charts
Model | Building and training Machine Learning models
Evaluation | Evaluating model performance
Text Mining | Analyzing text
Image Analytics | Analyzing images
Bioinformatics | Specialized for bioinformatics analysis

Frequently Used Main Widgets
• File: imports data.
• Data Table: displays data in a table.
• Select Columns: selects the relevant features.
• Scatter Plot: visualizes the data distribution.
• Preprocess: normalizes and transforms data.
• Linear Regression: a basic predictive model.
• Test & Score: evaluates models.
Assignment TM3: Iris Setosa Data Analysis Using Orange Data Mining
Assignment Description
Students will carry out a descriptive analysis and classification of Iris Setosa flowers
using Orange Data Mining. The assignment covers:
1. Exploration and descriptive analysis of the Iris dataset, including basic statistics and
visualization.
2. Classification using five supervised learning algorithms and a comparison of
their performance.
3. Evaluation of the classification results using appropriate metrics.
Instructions
The assignment is carried out in Orange Data Mining in two main stages:
1. Descriptive Analysis: examining the data distribution and visualizations.
2. Classification: comparing five machine learning algorithms.
Part 1: Descriptive Analysis of the Iris Dataset
1.1 Loading and Exploring the Data
• Step 1: Open Orange Data Mining and add a File widget.
• Step 2: Select the Iris dataset that ships with Orange.
• Step 3: Connect the File widget to a Data Table widget to view the contents of the dataset.
1.2 Descriptive Statistics
Use the Statistics widget to obtain a statistical summary of the dataset:
• Dataset size (number of rows and columns).
• Mean, median, standard deviation, minimum, and maximum for
each feature (sepal length, sepal width, petal length, petal width).

1.3 Data Visualization

Add several visualization widgets to understand the patterns in the dataset:
• Box Plot widget → Shows the distribution of each feature's values and any outliers.
• Scatter Plot widget → Shows the relationship between two features (for example, sepal length vs petal
length).
• Distribution widget → Shows the distribution of each variable by species category.
Student Tasks:
1. Include screenshots of the descriptive statistics and data visualizations in the report.
2. Interpret the patterns found, for example:
• How do the distributions differ between Iris Setosa, Versicolor, and Virginica?
• Is there a feature that can distinguish Iris Setosa from the other species?
Part 2: Classification of Iris Setosa Flowers
2.1 Choice of Classification Algorithms
Use the following 5 supervised learning algorithms:
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Support Vector Machine (SVM)
5. k-Nearest Neighbors (k-NN)
Implementation Steps in Orange:
• Add a Select Columns widget to make sure the species column is set as the target
(label).
• Add a Test & Score widget to compare model performance.
• Use the Confusion Matrix widget to inspect classification errors.
• Add a ROC Analysis widget to compare the ROC curves of the
models.

Student Tasks:
1. Save the accuracy results and confusion matrix of each model.
2. Compare the strengths and weaknesses of each model.
3. Determine the best model based on the evaluation results.
Part 3: Conclusion and Report
Students are asked to prepare a report in PDF or Word format with the following structure:
1. Introduction: A brief explanation of the Iris dataset and the purpose of the analysis.
2. Descriptive Analysis
1. Table of descriptive statistics.
2. Data visualizations with interpretation.
3. Classification Model Implementation
1. Screenshots of the workflow in Orange Data Mining.
2. Accuracy comparison of the five models.
4. Model Evaluation
1. Table comparing the accuracy of all models.
2. Confusion Matrix for the best model.
3. Interpretation of the ROC Curve results.
5. Conclusion
1. Which model is best for classifying Iris Setosa?
2. What factors affect the classification results?
3. Recommendations for improving model accuracy.
Submission Format and Grading
Submission Format:
• Report in PDF/Word format.
• Orange workflow file (.ows).
Grading Weights:
Aspect | Weight (%)
Data Exploration & Descriptive Statistics | 20%
Classification Model Implementation | 30%
Model Evaluation & Comparison | 30%
Conclusion & Interpretation | 20%
Homework
Please carry out EDA and Machine Learning on this dataset:
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Submit it in Google Colab with your group

Deadline 29 October 2024

Mid-Exam Project
1. Buat video maksimal 5 menit untuk presentasi projek
2. Membuat laporan dalam bentuk PDF/Word
3. Mengumpulkan dalam bentuk File Python (.ipynb) atau File
Orange (.ows)
4. Sertakan screen capture Workﬂow Orange/Hasil code di
Python

Maksimal dikumpulkan tgl 18 April 2025 jam 12.00 Siang

Mhs dibebaskan menggunakan Python atau Orange
Machine Learning Process
1. Find and collect a dataset (MINIMUM 200 ROWS); every group MUST use a DIFFERENT dataset.
2. Introduction: A brief description of the chosen dataset and the purpose of the analysis.
3. Descriptive Analysis
   1. Descriptive statistics table.
   2. Data visualizations with interpretation.
4. Model Implementation (use models that suit the chosen data)
   1. Screenshots of the workflow in Orange Data Mining (if using Orange).
   2. Accuracy comparison of five models. (The choice of models is free and depends on the data used.)
   3. Explain why those models were chosen.
5. Model Evaluation
   1. Accuracy comparison table for all models.
   2. Confusion matrix for the best model.
6. Conclusion
   1. Which model is best?
   2. What factors make that model the best?
   3. Recommendations for improving model accuracy.
Submission Format and Grading
Submission format:
• Report in PDF/Word format.
• Orange workflow file (.ows).
Grading weights:
Aspect: Weight (%)
Data Exploration & Descriptive Statistics: 20%
Classification Model Implementation: 30%
Model Evaluation & Comparison: 30%
Conclusion & Interpretation: 20%
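
For groups working in Python, the evaluation step (accuracy comparison table plus a confusion matrix for the best model) can be sketched as follows. This is a minimal standard-library illustration, not a required implementation: the labels and predictions are made up, and the model names model_a and model_b are placeholders for whatever models a group actually trains.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Made-up ground truth and predictions, purely for illustration.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predictions = {
    "model_a": ["yes", "no", "no", "no", "yes", "no"],
    "model_b": ["yes", "yes", "no", "yes", "no", "yes"],
}

# Accuracy comparison table for all models.
scores = {name: accuracy(actual, pred) for name, pred in predictions.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

# Confusion matrix for the model with the highest accuracy.
best = max(scores, key=scores.get)
matrix = confusion_matrix(actual, predictions[best], ["yes", "no"])
print("best model:", best)
print("confusion matrix (rows = actual, columns = predicted):", matrix)
```

In a real project the predictions should come from data held out from training, so that the accuracy table reflects performance on unseen rows.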

