TM 4 - Data Mining and Machine Learning

The document provides an introduction to Data Mining and Machine Learning, covering topics such as the CRISP-DM methodology, types of machine learning (supervised, unsupervised, and reinforcement learning), and their applications in business. It includes case studies on e-commerce and product recommendations, as well as discussions on classification and regression problems in supervised learning. Additionally, it outlines the advantages and disadvantages of supervised learning and emphasizes its importance in predictive analytics and decision-making.

Uploaded by

Musa Radhitia

Introduction to Data Mining and Machine Learning


Dr. Donny Maha Putra, S.Kom., M.Ak
Lecturer at UPN Veteran Jakarta and Data Analytic Officer at the Ministry of Finance of Indonesia

Salsabila Ramadhina S.Kom MSc
Lecturer at UPN Veteran Jakarta, Product Manager at Ultra Voucher, and Scholarship Mentor at Kobi Education

Feb.upnv febupnvj feb.upnvj.ac.id febupnvj@upnvj.ac.id


Session Objectives
● Overview of Machine Learning
● CRISP-DM
● Case Study (15 minutes discussion)
● Supervised vs Unsupervised Learning
● Supervised Learning Classification Problem
● Supervised Learning Regression Problem
● Mid-Exam Session
What is Data Science?
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and to apply that actionable knowledge and insight across a wide range of application domains.
Machine Learning

Machine learning is a set of methods for automatically detecting patterns in data and using them to predict future data and guide decision-making. In other words, learning from data.
Deep Learning
Data Science Methodology (CRISP-DM)
What is CRISP-DM?
• Cross-Industry Standard Process for Data Mining
• The most popular methodology for data analytics projects
Cross-Industry Standard Process for Data Mining
Data Mining Lifecycle
(We will discuss this part.)
▪ Business Understanding
▪ Data Understanding
▪ Data Preparation
▪ Modeling
▪ Evaluation
▪ Deployment

▪ Notice the iteration!

THE FAMOUS CRISP-DM


CRISP-DM: Phases
• Business Understanding
Understand the project objectives and requirements; define the data mining problem.
• Data Understanding
Collect the initial data and become familiar with it; identify data quality problems.
• Data Preparation
Select tables, records, and attributes; transform and clean the data.
• Modeling
Select and apply modeling techniques; calibrate their parameters.
• Evaluation
Evaluate whether the model achieves the business objectives and addresses the business issues.
• Deployment
Deploy the resulting model; implement a repeatable data mining process.
Phases and Tasks
• Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
• Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
• Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
• Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
• Evaluation: Evaluate Results; Review Process; Determine Next Steps
• Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project
Precourse: Exploratory Data Analysis

Case Study 1
An e-commerce company wants to optimize its marketing strategy
by identifying distinct customer groups. The company collects data
on purchase history, browsing behavior, and customer
demographics. Using data analytics, they aim to create targeted
marketing campaigns and improve customer retention.

https://www.kaggle.com/datasets/uom190346a/e-commerce-customer-behavior-dataset

E-commerce Customer Behavior Dataset (Kaggle)


Precourse: Exploratory Data Analysis

Case Study 2
An online retailer wants to enhance the shopping experience by
recommending personalized products to customers. They have
historical transaction data, including products purchased together
and customer preferences. By leveraging machine learning, the
retailer aims to implement a recommendation system to increase
sales.

https://www.kaggle.com/datasets/asaniczka/amazon-canada-products-2023-2-1m-products

Amazon Product Recommendation Dataset


Precourse: Exploratory Data Analysis

Discuss with your team (15 minutes)

1. Business Understanding
● Define the business problem.
● Explain the business objectives and success criteria.

2. Data Understanding
● Describe the dataset (features, size, source, missing values,
etc.).
Introduction to Machine Learning
Machine learning is a branch of Artificial Intelligence (AI) that enables computers to learn from data, uncover patterns, and make decisions or predictions without being explicitly programmed.

How does it work?


ML systems build statistical models based on sample data (training data) to make decisions or predictions.

Why is it important in business?


Machine Learning enables predictive analytics, automation, and personalized customer experiences, which
significantly improve business performance and decision-making.

Examples of ML Implementation
● Predictive Analytics (forecasting sales, customer demand).
○ Predictive analytics is a branch of advanced analytics used to make predictions about future outcomes based on historical data.
● Customer Segmentation (personalizing marketing strategies).
● Fraud Detection (identifying abnormal behavior).
Types of Machine Learning
Supervised, Unsupervised, and Reinforcement Learning
Types of Machine Learning
Supervised Learning
relies on labeled data, where the model learns from both inputs (features) and corresponding outputs (labels)

Supervised learning means training the machine on well-labelled data, that is, data in which each example is already tagged with the correct answer. The learning algorithm analyses these training examples, and the trained model is then given new, unseen examples and asked to predict the correct outcome for them.
Types of Problems on Supervised Learning
● Classification problems
Classification algorithms predict a discrete value (a class label).
● Regression problems
Regression algorithms predict a continuous value.
Types of Problems on Supervised Learning
Classification problems
● Classification algorithms predict a discrete value: each input is assigned to a particular class or group.
● For instance, in a dataset of fruit photos, each photo is labelled as a mango, an apple, and so on. The algorithm then has to classify new images into one of these categories.

Algorithm
● Naive Bayes Classifier
● Support Vector Machines
● Logistic Regression
Predictive Analytics Techniques
Logistic Regression
Used when the dependent variable is binary (e.g., yes/no outcomes).
Logistic regression models the probability that a given input belongs to a particular category.

Use in Management:
● Credit Scoring: Predicting the likelihood that a borrower will default on a loan.
● Employee Retention: Identifying factors that influence whether an employee will stay or leave.
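
As a rough illustration of how logistic regression turns inputs into a probability, the sketch below applies the logistic (sigmoid) function to a weighted sum of two features for the credit scoring case. The feature names and all coefficient values are hypothetical and chosen only for illustration; a real model would estimate them from labeled historical loans.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_default_probability(debt_to_income: float, late_payments: int) -> float:
    # Hypothetical coefficients, NOT fitted to any real data.
    intercept = -3.0
    w_debt_to_income = 4.0
    w_late_payments = 0.8
    z = intercept + w_debt_to_income * debt_to_income + w_late_payments * late_payments
    return sigmoid(z)


low_risk = predict_default_probability(debt_to_income=0.2, late_payments=0)
high_risk = predict_default_probability(debt_to_income=0.6, late_payments=3)

# Turn each probability into a yes/no decision with a 0.5 threshold.
print(round(low_risk, 3), low_risk >= 0.5)
print(round(high_risk, 3), high_risk >= 0.5)
```

The 0.5 threshold is a common default rather than a rule; a lender that wants to catch more potential defaults could lower it.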
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems
● True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
● True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
● False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive).
● False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative).

Confusion matrix
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems

● True Positive (TP): The number of cases where both the predicted and actual values are Dog.
● True Negative (TN): The number of cases where both the predicted and actual values are Not Dog.
● False Positive (FP): The number of cases where the prediction is Dog while the actual value is Not Dog.
● False Negative (FN): The number of cases where the prediction is Not Dog while the actual value is Dog.

Confusion matrix
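
The four cells of the confusion matrix can be counted directly from a list of actual labels and a list of predicted labels. The sketch below does this for the Dog / Not Dog case; the eight labels are made-up illustrative data, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive="Dog"):
    """Count TP, TN, FP, and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted Dog, actually Dog
        elif p != positive and a != positive:
            tn += 1  # predicted Not Dog, actually Not Dog
        elif p == positive and a != positive:
            fp += 1  # predicted Dog, actually Not Dog
        else:
            fn += 1  # predicted Not Dog, actually Dog
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for eight images (illustrative data only).
actual = ["Dog", "Dog", "Dog", "Not Dog", "Not Dog", "Dog", "Not Dog", "Not Dog"]
predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Not Dog"]

counts = confusion_counts(actual, predicted)
print(counts)
```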
Types of Problems on Supervised Learning
Evaluation Metrics - Classification problems
● Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is calculated by dividing the number of correct predictions by the total number of predictions.
● Precision: Precision is the percentage of positive predictions that the model makes that are actually correct. It is calculated by dividing the number of true positives by the total number of positive predictions.
● Recall: Recall is the percentage of all positive examples that the model correctly identifies. It is calculated by dividing the number of true positives by the total number of positive examples.
● F1 score: The F1 score combines precision and recall into a single number. It is calculated as the harmonic mean of precision and recall, so it is high only when both are high.
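
The four metrics above follow directly from the confusion matrix counts. A minimal sketch, assuming made-up counts for 100 predictions; returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Precision: share of positive predictions that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (illustrative only).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```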
Types of Problems on Supervised Learning
Regression problems
● Regression problems involve predicting continuous values.
● For example, predicting the price of a property in a city, given its area, location, number of rooms, and so on. The input is passed to the model, which calculates a price based on the previous examples it has learned from.

Algorithm
● Linear Regression
● Nonlinear Regression
● Bayesian Linear Regression
Predictive Analytics Techniques
Linear Regression
A foundational technique to model the relationship between a dependent
variable and one or more independent variables.

● Equation: Y = a + bX (where Y is the predicted value, X is the independent variable, a is the intercept, and b is the slope).
● Use in Management:
○ Sales Forecasting: Predicting sales volume based on marketing spend.
○ Cost Estimation: Estimating operational costs based on variable factors like demand.
○ Case Study: At MIT Sloan, linear regression models were used to predict energy consumption based on seasonal
variations and pricing models.
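
To make the equation Y = a + bX concrete, the sketch below estimates a and b with ordinary least squares for a single independent variable, using the sales forecasting case. The spend and sales figures are invented for illustration and lie exactly on a line, so the fit is perfect here, which real data would not be.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX with one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept a: makes the line pass through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: marketing spend (X) and sales volume (Y).
marketing_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

a, b = fit_line(marketing_spend, sales)
forecast = a + b * 60  # predicted sales for a spend of 60
print(a, b, forecast)
```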
Types of Problems on Supervised Learning
Evaluation Metrics - Regression problems

● Mean Squared Error (MSE): MSE measures the average squared difference between the predicted values and the actual values. Lower MSE values indicate better model performance.
● Root Mean Squared Error (RMSE): RMSE is the square root of MSE. It expresses the typical size of the prediction errors in the same units as the target variable. As with MSE, lower RMSE values indicate better model performance.
● Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
● R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the
target variable that is explained by the model. Higher R-squared values indicate better model fit.
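
The four regression metrics can be computed in a few lines. A minimal sketch, assuming four invented actual and predicted values:

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE, and R-squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Hypothetical values (illustrative only).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

scores = regression_metrics(actual, predicted)
print(scores)
```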
Example of Supervised Learning

Let’s say you have a basket of fruit and want to identify each fruit in it. The machine would first analyze the image of a fruit to extract features such as its shape, color, and texture. Then, it would compare these features to the features of the fruits it has already learned about. If the new image’s features are most similar to those of an apple, the machine would predict that the fruit is an apple.

For instance, suppose you are given a basket filled with different kinds of fruits. The first step is to train the machine with all the different fruits one by one, like this:

If the object is rounded, has a depression at the top, and is red in color, then it is labeled Apple.
If the object is a long curving cylinder with a green-yellow color, then it is labeled Banana.
Now suppose that, after training, the machine is given a new, separate fruit from the basket, say a banana, and is asked to identify it.

The machine has already learned from the previous data, and this time it has to apply that knowledge. It first examines the shape and color of the fruit, then confirms the fruit name as BANANA and puts it in the Banana category. Thus the machine learns from training data (the basket of fruits) and then applies the knowledge to test data (the new fruit).
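
One simple way to implement "compare the features and pick the most similar fruit" is a nearest-neighbour classifier. The sketch below uses that technique as an assumption, since the slides do not name a specific algorithm. The two numeric features (roundness and redness, both between 0 and 1) and their values are hypothetical stand-ins for shape and color.

```python
import math

# Labeled training data: (roundness, redness) -> fruit name.
# All feature values are hypothetical and for illustration only.
training_data = [
    ((0.90, 0.90), "Apple"),
    ((0.85, 0.80), "Apple"),
    ((0.20, 0.10), "Banana"),
    ((0.25, 0.15), "Banana"),
]


def classify(features):
    """Return the label of the training example closest to `features`."""
    nearest = min(training_data, key=lambda example: math.dist(example[0], features))
    return nearest[1]


# A new fruit that is long (low roundness) and not red.
new_fruit = (0.22, 0.12)
print(classify(new_fruit))
```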
Applications of Supervised learning
Supervised Learning
● Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails based
on their content, helping users avoid unwanted messages.
● Image classification: Supervised learning can automatically classify images into different categories,
such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and
image-based product recommendations.
● Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data, such as
medical images, test results, and patient history, to identify patterns that suggest specific diseases or
conditions.
● Fraud detection: Supervised learning models can analyze financial transactions and identify patterns
that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.
● Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks, including
sentiment analysis, machine translation, and text summarization, enabling machines to understand and
process human language effectively.
Q&A - Learning Methods in Machine Learning
1. Finding out whether someone will fail to repay their debt.
Question:
Which learning method is used to predict whether someone will default on their debt?

Answer:
The method is Supervised Learning with a Classification technique. Models such as Logistic Regression, Decision Tree, Random Forest, or Neural Networks can be used to predict whether someone will default, based on variables such as credit history, income, and debt-to-income ratio.
Q&A - Learning Methods in Machine Learning
2. Automatically grouping bird species based on several variables.
Question:
Which learning method is suitable for grouping bird species without predefined labels?
Answer:
The method is Unsupervised Learning with a Clustering technique, such as K-Means, Hierarchical Clustering, or DBSCAN. These techniques group birds based on features such as body size, feather color, and migration pattern, without needing initial labels.
Q&A - Learning Methods in Machine Learning
3. Predicting the air temperature at a given location on the following day.
Question:
Which learning method is suitable for predicting tomorrow's air temperature based on historical data?

Answer:
The method is Supervised Learning with a Regression technique, such as Linear Regression, Random Forest Regression, or Neural Networks. The model learns temperature patterns from historical data and other weather variables in order to make predictions.
Q&A - Learning Methods in Machine Learning
4. Calculating the travel time from one point to the next using a particular mode of transport.
Question:
How can machine learning methods be used to estimate travel time?

Answer:
The method is Supervised Learning with a Regression approach, for example Gradient Boosting Regression or Neural Networks. The model uses data such as distance, traffic conditions, the average speed of the mode of transport, and weather to predict travel time.
Advantages of Supervised learning

● Supervised learning uses previously collected labelled data (past experience) to produce outputs for new data.
● Helps to optimize performance criteria with the help of experience.
● Supervised machine learning helps to solve various types of real-world computation problems.
● It performs classification and regression tasks.
● It allows estimating or mapping the result to a new sample.
● We have complete control over choosing the number of classes we want in the training data.

Disadvantages of Supervised learning

● Classifying big data can be challenging.
● Training needs a lot of computation time, especially on large datasets.
● Supervised learning cannot handle all complex tasks in Machine Learning, for example tasks where no labels are available.
● It requires a labelled data set.
● It requires a training process.
Types of Machine Learning
Unsupervised Learning
deals with unlabeled data, that is, data without any pre-existing labels or categories.

● Unsupervised learning is self-organized learning. Its main aim is to explore the underlying patterns in the data and use them to group or describe it. We provide the machine with data and ask it to look for hidden features and cluster the data in a way that makes sense.
● Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.
● The task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior labeled training data.
Example of Unsupervised Learning

Imagine you have a large dataset of unlabeled images containing both dogs and cats. The model has never been told which images show a dog or a cat, and it has no pre-existing labels or categories for these animals. Your task is to use unsupervised learning to separate the dogs from the cats.

For instance, suppose the model is given a set of images of dogs and cats that it has never seen.

Because the machine has no idea what a ‘dog’ or a ‘cat’ is, it cannot name the categories ‘dogs’ and ‘cats’. But it can group the images according to their similarities, patterns, and differences, splitting the collection into two parts: the first may contain all pictures with dogs in them and the second all pictures with cats in them. Nothing was learned beforehand, which means there were no labeled training examples.

Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.
Types of Unsupervised Learning
Clustering
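● A type of unsupervised learning used to group similar data points together.
● Many clustering algorithms, such as K-means, work iteratively: each data point is assigned to its nearest cluster center, and the centers are then moved to the middle of their assigned points, so that clusters become compact and well separated.

Algorithm
● Hierarchical clustering
● K-means clustering
● Principal Component Analysis
● Singular Value Decomposition
● Independent Component Analysis
● Gaussian Mixture Models (GMMs)
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Note: Principal Component Analysis, Singular Value Decomposition, and Independent Component Analysis are dimensionality reduction or decomposition techniques rather than clustering algorithms; they are other unsupervised methods often used alongside clustering.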
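
The K-means idea can be sketched in a few lines: assign each point to its nearest center, then move each center to the mean of its points, and repeat. The six two-dimensional points and the choice of starting centers below are assumptions made for illustration; practical implementations usually pick starting centers at random or with a smarter scheme.

```python
import math


def k_means(points, centers, iterations=10):
    """A minimal K-means loop: assign points to centers, then update centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        # An empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Start from one point in each group (an assumption for reproducibility).
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(centers)
print(clusters)
```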
Types of Unsupervised Learning
Association

● Association rule learning is a type of unsupervised learning used to identify patterns in data. Association rule learning algorithms work by finding relationships between different items in a dataset, such as products that are frequently bought together.

Algorithm
● Apriori Algorithm
● Eclat Algorithm
● FP-Growth Algorithm
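
Algorithms such as Apriori rank candidate rules using two standard measures that the slides do not spell out: support (how often an itemset appears) and confidence (how often the rule holds when its left-hand side appears). The sketch below computes both for one rule on five invented shopping baskets; it illustrates the measures only and is not an implementation of Apriori, Eclat, or FP-Growth.

```python
# Hypothetical shopping baskets (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


# Rule under test: customers who buy diapers also buy beer.
rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```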
Unsupervised Learning
Evaluation Metrics
● Silhouette score: The silhouette score measures how well each data point fits within its own cluster and how well it is separated from other clusters. It ranges from -1 to 1, with higher scores indicating better clustering.
● Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance between clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores indicating better clustering.
● Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings, for example a clustering result and a set of reference labels. Its maximum is 1 (identical clusterings), values near 0 indicate agreement no better than chance, and negative values are possible. Higher scores indicate more similar clusterings.
● Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between clusters. It ranges from 0 to infinity, with lower scores indicating better clustering.
● F1 score: The F1 score is the harmonic mean of precision and recall, two metrics commonly used in supervised learning to evaluate classification models. It can also be used to evaluate unsupervised models such as clustering models, but only when reference labels are available for comparison.
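
As one worked example of these metrics, the sketch below computes the silhouette score by hand. For each point, a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster; the point's score is (b - a) / max(a, b), and the overall score is the average. The points and both labelings are invented, and scoring a single-point cluster as 0 is a convention assumed here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; `labels` is a list of cluster ids."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances to the other members of the same cluster.
        same = [
            math.dist(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if other == label and j != i
        ]
        if not same:
            scores.append(0.0)  # single-point cluster (assumed convention)
            continue
        a = sum(same) / len(same)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the visible groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(good, bad)
```

The labeling that matches the visible groups should score close to 1, while the mixed labeling scores much lower.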
Applications of Unsupervised learning
Unsupervised Learning

● Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal
behavior in data, enabling the detection of fraud, intrusion, or system failures.
● Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific
data, leading to new hypotheses and insights in various scientific fields.
● Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior
and preferences to recommend products, movies, or music that align with their interests.
● Customer segmentation: Unsupervised learning can identify groups of customers with similar
characteristics, allowing businesses to target marketing campaigns and improve customer service more
effectively.
● Image analysis: Unsupervised learning can group images based on their content, facilitating tasks such
as image classification, object detection, and image retrieval.
Advantages of Unsupervised learning

● It does not require training data to be labeled.
● Dimensionality reduction can be easily accomplished using unsupervised learning.
● Capable of finding previously unknown patterns in data.
● Unsupervised learning can help you gain insights from unlabeled data that you might not have been able to get otherwise.
● Unsupervised learning is good at finding patterns and relationships in data without being told what to look for. This can help you learn new things about your data.

Disadvantages of Unsupervised learning

● Without labeled data or predefined answers, it is difficult to measure accuracy or otherwise evaluate the performance of unsupervised learning models.
● The results often have lesser accuracy.
● The user needs to spend time interpreting and labeling the groups that the algorithm produces.
● Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
Types of Machine Learning
Reinforcement Learning
The algorithms learn to react to an environment on their own.

● Reinforcement Learning (RL) is a more complex approach where an agent learns by interacting with the environment, receiving rewards or penalties based on its actions. The agent’s goal is to maximize cumulative reward over time.
● In a typical episodic task, a learning agent has a start state and an end state, and there may be several different paths from one to the other. In a reinforcement learning problem, the agent acts on the environment and moves from one state to another. It receives a reward on success and no reward (or a penalty) on failure. In this way, the agent learns from the environment.
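
The start state, end state, and reward idea can be sketched with tabular Q-learning, used here as one common RL algorithm; the slides do not name a specific method. The environment is a hypothetical corridor of five states in which the agent starts at state 0 and is rewarded only when it reaches state 4. The learning rate, discount factor, and episode count are arbitrary illustrative choices.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the end (goal) state
ACTIONS = (-1, +1)  # move left or move right
ALPHA = 0.5         # learning rate (illustrative value)
GAMMA = 0.9         # discount factor (illustrative value)


def step(state, action):
    """Move within the corridor; reward 1 only on reaching the end state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0  # every episode begins at the start state
        while state != N_STATES - 1:
            # Explore with random actions; Q-learning still learns the
            # value of acting greedily (it is an off-policy method).
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# Greedy policy: in each non-terminal state, pick the action with the highest value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the greedy policy should choose "move right" in every state, since that is the shortest path to the reward.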

Applications in Business
● Dynamic Pricing: RL can be used in e-commerce to adjust pricing dynamically, learning optimal price points that
maximize revenue while responding to competitor pricing.
● Supply Chain Optimization: In logistics, RL optimizes supply chain operations by learning the most efficient routing
and inventory levels over time.
Implementation Steps

Step 1: Data Preparation


● Data Collection: Gather historical data relevant to the business problem (sales data, customer interactions, financial
records).
● Data Cleaning: Handle missing values, outliers, and ensure normalization (as discussed in previous sessions).

Step 2: Model Selection


Choose the appropriate predictive technique based on:
● Nature of the Problem: Regression (continuous output) vs. classification (categorical output).
● Data Characteristics: Size, type (numerical, categorical), and structure.

Step 3: Training and Validation


● Split data into training and testing sets.
● Use cross-validation to ensure the model generalizes well to unseen data.
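
Both ideas in Step 3 amount to deciding which rows are used for fitting and which are held back. The sketch below shows a shuffled train/test split and the index bookkeeping for k-fold cross-validation. The function names, the 80/20 ratio, and the ten stand-in rows are assumptions for illustration; libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a share of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra row each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


rows = list(range(10))  # stand-in for ten data records
train_rows, test_rows = train_test_split(rows)

# Cross-validate on the training portion only; the test set stays untouched.
folds = list(k_fold_indices(len(train_rows), k=4))
for train_idx, val_idx in folds:
    print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```

Each row of the training portion is used for validation exactly once, which is what lets the averaged score estimate performance on unseen data.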
Implementation Steps

Step 4: Performance Evaluation


● Metrics:
○ For classification: Accuracy, Precision, Recall, F1-Score.
○ For regression: Mean Squared Error (MSE), R-squared.
● Practical Example: Predicting the likelihood of customers subscribing to a new service based on past behavior (classification problem). Evaluate performance using precision (to limit false positives) and recall (to limit false negatives).
Step 5: Deployment and Continuous Monitoring
● Once the model is implemented, ensure it continues to perform well by monitoring predictions against
actual outcomes.
● Example: Use in financial forecasting, where predicted and actual revenue are compared periodically to
adjust the model.
Demo: EDA and Machine Learning using Python [Colab]
Demo: Machine Learning using Orange Data Mining
Fundamental Concepts in Orange Data Mining

The Workflow Concept
• Workflow: A series of connected widgets that together form a data analysis pipeline.
• Example workflow:
• Input Data → Preprocessing → Visualization → Machine Learning Model → Evaluation

Supported Data Types
• CSV and Excel files for tabular data.
• SQL databases for processing large data.
• Images and text for image-based or text-based analysis.
• Streaming data for real-time analysis.
Introduction to Widgets in Orange
Widget Categories

Category | Function
Data | Upload, clean, and filter data
Visualization | Display data as charts
Model | Build and train Machine Learning models
Evaluation | Evaluate model performance
Text Mining | Analyze text
Image Analytics | Analyze images
Bioinformatics | Specialized for bioinformatics analysis


Frequently Used Core Widgets
• File – Imports data.
• Data Table – Displays data in a table.
• Select Columns – Selects the relevant features.
• Scatter Plot – Visualizes the distribution of the data.
• Preprocess – Normalizes and transforms data.
• Linear Regression – A basic predictive model.
• Test & Score – Evaluates models.
Assignment TM3: Iris Setosa Data Analysis Using Orange Data Mining
Assignment Description
Students will carry out descriptive analysis and classification of Iris Setosa flowers using Orange Data Mining. The assignment covers:
1. Exploration and descriptive analysis of the Iris dataset, including basic statistics and visualization.
2. Classification using five supervised learning algorithms and a comparison of their performance.
3. Evaluation of the classification results using appropriate metrics.
Instructions
The assignment is completed in Orange Data Mining in two main stages:
1. Descriptive Analysis – Investigate the data distribution and visualizations.
2. Classification – Compare five machine learning algorithms.
Part 1: Descriptive Analysis of the Iris Dataset
1.1 Loading and Exploring the Data
• Step 1: Open Orange Data Mining and add a File widget.
• Step 2: Select the Iris dataset that ships with Orange.
• Step 3: Connect the File widget to a Data Table widget to view the contents of the dataset.
1.2 Descriptive Statistics
Use the Statistics widget to get a statistical summary of the dataset:
• Dataset size (number of rows and columns).
• Mean, median, standard deviation, minimum, and maximum for each feature (sepal length, sepal width, petal length, petal width).
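
For groups working in Python rather than Orange, the same summary can be sketched with the standard library. The five sepal-length values below are typed in by hand for illustration only and are not the full dataset; `describe` is a hypothetical helper name.

```python
import statistics


def describe(values):
    """Return the basic descriptive statistics for one numeric feature."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Illustrative sepal-length values in cm (NOT the full Iris dataset).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

summary = describe(sepal_length)
for name, value in summary.items():
    print(f"{name}: {value}")
```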

1.3 Data Visualization

Add several visualization widgets to understand the patterns in the dataset:
• Box Plot widget → Shows the distribution of each feature's values and any outliers.
• Scatter Plot widget → Shows the relationship between two features (for example, sepal length vs petal length).
• Distribution widget → Shows the distribution of each variable's values by species category.
Student Tasks:
1. Include screenshots of the descriptive statistics and data visualizations in the report.
2. Interpret the patterns found, for example:
• How do the distributions differ among Iris Setosa, Versicolor, and Virginica?
• Is there a feature that can distinguish Iris Setosa from the other species?
Part 2: Classification of Iris Setosa Flowers
2.1 Choosing the Classification Algorithms
Use the following five supervised learning algorithms:
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Support Vector Machine (SVM)
5. k-Nearest Neighbors (k-NN)
Implementation steps in Orange:
• Add a Select Columns widget to make sure the species column is set as the target (label).
• Add a Test & Score widget to compare model performance.
• Use a Confusion Matrix widget to inspect misclassifications.
• Add a ROC Analysis widget to compare the ROC curves of the models.

Student Tasks:
1. Save the accuracy results and confusion matrix for each model.
2. Compare the strengths and weaknesses of each model.
3. Determine the best model based on the evaluation results.
Part 3: Conclusion and Report
Students are asked to prepare a report in PDF or Word format with the following structure:
1. Introduction: A brief explanation of the Iris dataset and the purpose of the analysis.
2. Descriptive Analysis
1. Table of descriptive statistics.
2. Data visualizations with interpretation.
3. Classification Model Implementation
1. Screenshots of the workflow in Orange Data Mining.
2. Accuracy comparison results for the five models.
4. Model Evaluation
1. Table comparing the accuracy of all models.
2. Confusion matrix for the best model.
3. Interpretation of the ROC curve results.
5. Conclusion
1. Which model is best for classifying Iris Setosa?
2. What factors affect the classification results?
3. Recommendations for improving model accuracy.
Submission Format and Grading
Submission format:
• Report in PDF/Word format.
• Orange workflow file (.ows).
Grading weights:
Aspect | Weight (%)
Data Exploration & Descriptive Statistics | 20%
Classification Model Implementation | 30%
Model Evaluation & Comparison | 30%
Conclusion & Interpretation | 20%
Homework
Perform EDA and build a machine learning model with this dataset:
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Submit it as a Google Colab notebook with your group.

Deadline: 29 October 2024


Mid-Exam Project
1. Create a video of at most 5 minutes presenting the project.
2. Write a report in PDF/Word format.
3. Submit the work as a Python file (.ipynb) or an Orange file (.ows).
4. Include screen captures of the Orange workflow or of the Python code output.

Submission deadline: 18 April 2025, 12:00 noon.

Students are free to use either Python or Orange.
Machine Learning Process
1. Find and collect a dataset (MINIMUM 200 ROWS); every group MUST use a DIFFERENT dataset.
2. Introduction: A brief explanation of the chosen dataset and the purpose of the analysis.
3. Descriptive Analysis
1. Table of descriptive statistics.
2. Data visualizations with interpretation.
4. Model Implementation (use models suited to the chosen data)
1. Screenshots of the workflow in Orange Data Mining (if using Orange).
2. Accuracy comparison results for five models (the choice of models is free, depending on the data used).
3. Explain the reasons for choosing those models.
5. Model Evaluation
1. Table comparing the accuracy of all models.
2. Confusion matrix for the best model.
6. Conclusion
1. Which model is best?
2. What factors make that model the best?
3. Recommendations for improving model accuracy.
Submission Format and Grading
Submission format:
• Report in PDF/Word format.
• Orange workflow file (.ows).
Grading weights:
Aspect | Weight (%)
Data Exploration & Descriptive Statistics | 20%
Classification Model Implementation | 30%
Model Evaluation & Comparison | 30%
Conclusion & Interpretation | 20%