Lecture Slides - Model Selection and Evaluation


Model Selection and Evaluation
Agustinus Jacobus, ST, MCs
Data Science/ML/AI Methodologies
• Technical methodologies: DS/AI work is treated as a purely technical activity
• Knowledge Discovery and Data Mining (KDD)
• SEMMA from the SAS Institute
• End-to-end methodologies: DS/AI work is treated as a business activity (a business problem is translated into a DS/AI problem)
• CRISP-DM
• IBM Data Science Methodology
• Microsoft’s Team Data Science Process
• Domino DataLab Methodology
• Indonesia???
• Standard Kompetensi Kerja Nasional (KepMen Ketenagakerjaan No 299 thn 2020)
Knowledge Discovery and Data Mining

https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
SEMMA from the SAS Institute

https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
CRISP-DM
IBM Data Science Methodology
Microsoft’s Team Data Science Process

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
Domino DataLab Methodology
Standard Kompetensi Kerja Nasional:
KepMen Ketenagakerjaan No 299 thn 2020
Machine learning pipeline
DS/ML/AI Is Not a One-Man Show
Model Evaluation
Machine Learning in Practice
Why is evaluation important?
• So you know when you’ve succeeded
• So you know how much you’ve succeeded
• So you can decide when to stop
• So you can decide when to update the model
Basic questions for evaluation
• When to evaluate?
• What metric to use?
• On what data?
When to evaluate


Evaluation metric
• Evaluation metrics are tied to machine learning tasks
• Different metrics apply to different ML/AI tasks
• Some metrics, such as precision and recall, are useful across multiple tasks
Types of evaluation metric
• Training metric
• Validation metric
• Tracking metric
• Business metric
Example: recommender system
• Task: Given data on which users liked which items, recommend other
items to users
• Training metric
• How well is it predicting the preference score?
• Residual mean squared error: the mean of (actual − predicted)²
• Validation metric
• Does it rank known preferences correctly?
• Ranking loss
Example: recommender system
• Tracking metric
• Does it rank items correctly, especially for top items?
• Normalized Discounted Cumulative Gain (NDCG); a computation sketch follows below
• Business metric
• Does it increase the amount of time the user spends on the site/service?
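To make the tracking metric concrete, here is a minimal NDCG sketch. It assumes scikit-learn is available and uses made-up relevance scores for a single user, so it only illustrates the call, not data from these slides.

import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical relevance of five items for one user, and the scores the recommender gave them.
true_relevance = np.asarray([[3, 2, 3, 0, 1]])
predicted_scores = np.asarray([[0.9, 0.7, 0.2, 0.4, 0.1]])

print("NDCG  :", ndcg_score(true_relevance, predicted_scores))
print("NDCG@3:", ndcg_score(true_relevance, predicted_scores, k=3))  # focus on the top-ranked items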
Dealing with metrics
• Many possible metrics at different stages
• Defining the right metric is an art
• What’s useful?
• What’s feasible?
• Aligning the metrics will make everyone happier
• Not always possible: you cannot directly train a model to optimize for user engagement
Metrics Example (Classification)
• Confusion Matrix
• Accuracy
• Precision: answers the question: when the classifier predicts yes, how often is it correct?
• Recall: answers the question: when the actual class is yes, how often does the classifier predict yes?
• False Positive Rate (FPR): answers the question: when the actual class is no, how often does the classifier predict yes?
• F1 Score: the harmonic mean of precision and recall
• Receiver Operating Characteristic (ROC) Curve
• Precision-Recall (PR) Curve
• Logarithmic Loss (Log Loss): tells you how confident the model is in assigning a class to an observation
(several of these metrics are computed in the sketch below)
https://www.analyticsvidhya.com/blog/2020/10/quick-guide-to-evaluation-metrics-for-supervised-and-unsupervised-machine-learning/
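A minimal sketch of these classification metrics, assuming scikit-learn and toy labels/probabilities that are not taken from the slides:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # actual classes (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # predicted classes
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probability of class 1

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))  # area under the ROC curve
print("Log loss :", log_loss(y_true, y_prob))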
Metrics Example (Regression)
• Mean Absolute Error (MAE): the average of the absolute differences between the actual values and the predicted values
• Root Mean Squared Error (RMSE)
• R Squared / Coefficient of Determination
• Adjusted R Squared
(computed in the sketch below)
https://www.analyticsvidhya.com/blog/2020/10/quick-guide-to-evaluation-metrics-for-supervised-and-unsupervised-machine-learning/
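A minimal sketch of the regression metrics, assuming scikit-learn and toy values; adjusted R² is computed by hand because scikit-learn has no built-in for it, and the number of predictors p is an assumed value.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # actual values (toy data)
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])   # predicted values

mae  = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2   = r2_score(y_true, y_pred)

n, p = len(y_true), 2                           # p = number of predictors (assumption)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}  Adjusted R2={adj_r2:.3f}")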
Metrics Example (Clustering)
• Silhouette Coefficient for a set of samples is given as the mean of the
Silhouette Coefficient for each sample.
• The score is bounded between -1 for incorrect clustering and +1 for highly
dense clustering.
• Scores around zero indicate overlapping clusters.
• Dunn’s Index
• Dunn’s Index is the minimum inter-cluster distance divided by the maximum intra-cluster distance (the largest cluster diameter).
• A higher DI implies better clustering
(a Silhouette Coefficient sketch follows below)
https://www.analyticsvidhya.com/blog/2020/10/quick-guide-to-evaluation-metrics-for-supervised-and-unsupervised-machine-learning/
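A minimal sketch of the Silhouette Coefficient, assuming scikit-learn, k-means as the clusterer, and random toy points; the Dunn index has no scikit-learn implementation, so it is omitted here.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two loose blobs of toy 2-D points.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Silhouette Coefficient:", silhouette_score(X, labels))  # closer to +1 is better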
How to Validate/Evaluate the model
• In the supervised case, the model is tested by comparing the predicted values or classes with the actual values or classes.
• The comparison is carried out using:
• Test data, used to evaluate the model
• Validation data, used to evaluate the model during hyperparameter tuning
• Both the validation data and the test data are labeled data sets used for testing.
• The labels that come with the test data are the actual values/classes, which are compared with the predictions.
How to Validate/Evaluate the model
• Training data and validation data are used during the training process
• Test data are used to test the model that results from training
• The model is tested on data it has never seen before
Dataset Splitting
These are some of the most commonly used training/testing data split ratios (a split sketch follows this list):
• Train: 80%, Test: 20%
• Train: 70%, Test: 30%
• Train: 67%, Test: 33%
• Train: 50%, Test: 50%
https://sdsclub.com/how-to-train-and-test-data-like-a-pro/
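A minimal sketch of an 80/20 split with scikit-learn; the data are synthetic, and the stratify option is shown because it keeps the class proportions similar in both parts.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                 # 10 toy samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)               # (8, 2) (2, 2) for an 80/20 split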
Cross-Validation Basics
Classification Process:
Model Construction

The training data are fed to a classification algorithm, which produces the classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model (example rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process:
Using the Model in Prediction

The classifier is then applied to the testing data and to unseen data, e.g. (Jeff, Professor, 4): Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Confusion Matrix
Confusion Matrix Example
• Assume the evaluation results of a model for the computer-purchase case are presented in the following confusion matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity

• Classifier accuracy, or recognition rate: the percentage of test data that are classified correctly
  Accuracy = (TP + TN) / All
• Error rate: 1 − accuracy, or
  Error rate = (FP + FN) / All
• Class imbalance problem:
  • One class occurs rarely, e.g. fraud or HIV-positive cases
  • There is a very large gap between the positive and negative classes
• Sensitivity: true positive recognition rate
  • Sensitivity = TP / P
• Specificity: true negative recognition rate
  • Specificity = TN / N
(applied to the buy_computer confusion matrix in the sketch below)
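A minimal sketch that plugs the buy_computer confusion matrix from the earlier slide into these formulas; plain Python, no libraries assumed.

# Values taken from the buy_computer confusion matrix above.
TP, FN = 6954, 46      # actual yes
FP, TN = 412, 2588     # actual no
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL          # 0.9542
error_rate  = (FP + FN) / ALL          # 0.0458
sensitivity = TP / P                   # about 0.9934
specificity = TN / N                   # about 0.8627

print(f"accuracy={accuracy:.4f}  error={error_rate:.4f}  "
      f"sensitivity={sensitivity:.4f}  specificity={specificity:.4f}")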
Classifier Evaluation Metrics:
Precision, Recall, and F-measures

• Precision (exactness): the percentage of tuples predicted positive that are actually positive
  Precision = TP / (TP + FP)
• Recall (completeness): the percentage of actually positive tuples that are predicted positive
  Recall = TP / (TP + FN)
• F measure (F1 or F-score): the harmonic mean of precision and recall
  F1 = 2 × Precision × Recall / (Precision + Recall)
Classifier Evaluation Metrics: Example

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.40 (accuracy)

• Precision = 90/230 = 39.13%   Recall = 90/300 = 30.00% (checked in the sketch below)
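The same numbers can be verified in a few lines of plain Python; the F1 value is not on the slide, so it is derived here from the formulas above.

# Values taken from the cancer confusion matrix above.
TP, FN = 90, 210       # actual cancer = yes
FP, TN = 140, 9560     # actual cancer = no

precision = TP / (TP + FP)                                 # 90/230, about 0.3913
recall    = TP / (TP + FN)                                 # 90/300 = 0.3000
f1        = 2 * precision * recall / (precision + recall)  # about 0.34

print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}")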
Evaluating Classifier Accuracy:
Holdout
• The data are randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat the holdout k times; accuracy = the average of the accuracies obtained
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• (k-fold, where k = 10 is commonly used)
• The data are randomly partitioned into k subsets of roughly equal size.
• At iteration i, use subset Di as the test data and the remaining subsets as the training data.
• Leave-one-out: k folds with k = the number of tuples, used for small data sets.
• Stratified cross-validation: each class is distributed evenly across the partitions.
(a k-fold sketch follows the illustration below)
Cross-Validation Methods illustration

• K-fold

• Leave-One-Out
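A minimal stratified k-fold sketch, assuming scikit-learn, a synthetic toy data set, and logistic regression as a stand-in classifier.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified 10-fold keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))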
Evaluating Classifier Accuracy: Bootstrap
• Works well for small data sets.
• Training samples are drawn uniformly with replacement.
• Each time a tuple is selected, it is returned to the pool, so it may be selected again.
• The most commonly used bootstrap method (.632 bootstrap):
• Given a data set of d tuples, sample d times with replacement, producing a training set of d samples.
• The tuples that were never selected become the test data.
• About 63.2% of the tuples end up in the bootstrap (training) sample and the remaining 36.8% are used as test data, since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368.
• The sampling procedure is repeated k times; the overall accuracy of the model combines, for each repetition, the test-set accuracy (weighted 0.632) and the training-set accuracy (weighted 0.368). A sampling sketch follows below.
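A minimal sketch of a single .632-style bootstrap resample with NumPy; it only demonstrates sampling with replacement and the out-of-bag test set, not the full accuracy estimate, and d is an assumed toy value.

import numpy as np

rng = np.random.default_rng(0)
d = 1000                                   # number of tuples (toy value)
indices = np.arange(d)

# Sample d tuples with replacement to form the training (bootstrap) set.
train_idx = rng.choice(indices, size=d, replace=True)
# Tuples never selected form the out-of-bag test set (~36.8% of the data on average).
test_idx = np.setdiff1d(indices, train_idx)

print("fraction of unique tuples in training set:", len(np.unique(train_idx)) / d)  # about 0.632
print("fraction held out for testing            :", len(test_idx) / d)              # about 0.368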
Evaluating Classifier Accuracy:
Bootstrap illustration
Model Selection and Tuning
Why is hyperparameter tuning hard?
• Involves model training as a sub-process
• Can’t optimize directly
• Methods (a grid/random search sketch follows this list):
• Grid search
• Random search
• Smart search
• Gaussian processes/Bayesian optimization
• Random forests
• Derivative-free optimization
• Genetic algorithms
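A minimal sketch contrasting grid search and random search, assuming scikit-learn, a synthetic toy data set, and an SVM whose C and gamma are tuned; Bayesian optimization needs an external library and is left out.

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: try every combination in an explicit grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: sample a fixed budget of combinations from distributions.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print("grid search best  :", grid.best_params_, round(grid.best_score_, 3))
print("random search best:", rand.best_params_, round(rand.best_score_, 3))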
Online Evaluation
Model In Production
Assignment
• Write a paper explaining how to evaluate regression and clustering models and which metrics can be used.
• Use at least 5 references from reliable sources.
• The paper must be written in the JTI/JTEK journal format; the template can be downloaded from the following link:
• https://ejournal.unsrat.ac.id/index.php/elekdankom/pages/view/petunjuk
