
Workshop – A Taste of Classification Techniques: A Data Mining Workshop for Beginners

Dennis Foung

General Notes on R Coding


- Make sure you type every single letter / symbol exactly as shown in these notes.
- It is not easy to explain the use of square brackets / round brackets / commas in a few-hour
workshop. Please Google them if you want to find out more.
- To learn more about a command, use R's help function: help(COMMAND)

Legend
“XXXX” → Words in quotation marks are the words of a menu item
XXXX → Words in a box represent the code for R / R Studio

Details of Dataset
id_student     Index
gender         Male = M; Female = F
online_q1      Online quiz 1: 0-100
online_q2      Online quiz 2: 0-100
finalscores    Final score of Ss: 0-100
avgtimespent   Time spent on Canvas: 0-60
abbusage       Average Canvas usage (clicks): 0-infinity
exam           Exam scores: 0-100
midterm        Mid-term test scores: 0-100
mid_at_risk    Coded variable: mid-term test 0-59 = 1; 60+ = 0
essay          Scores for essay (grading scheme: 0-4.5)
grade          Final grade for Ss (grading scheme: 0-4.5)
final_at_risk  Coded variable: final grade 0-2.5 = 1; 3+ = 0
Step 1 – Data Retrieval
Importing and manipulating data for R Studio
Command: d=read.csv("Classification Tree.csv", sep=",", header=TRUE)
[CHANGE the working directory to where your file is]

** In this set of notes, the name for the dataset is “d”**
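As a minimal end-to-end sketch of this step (the folder path below is a placeholder you must change to your own):

setwd("C:/path/to/your/folder")   # point R at the folder that contains the CSV file
d = read.csv("Classification Tree.csv", sep=",", header=TRUE)
head(d)   # preview the first few rows to confirm the import worked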

Step 2 – Data Cleaning
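The notes do not spell this step out. As a generic sketch of checks one might run here (these are common practices, not the workshop's exact steps):

summary(d)       # check the range of every variable for impossible values
sum(is.na(d))    # count missing values in the whole dataset
d = na.omit(d)   # one simple option: drop records with missing values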


Step 2.5 – Splitting Dataset
1. Splitting the dataset (R studio)
Step 1: Identify a random sample
set.seed(12345)
- This allows you to re-do the procedures with the same results (i.e. the same random numbers).
- You can use any number to replace “12345”.
id=sample(1:378, size=252,replace=F)
- This identifies a random set of numbers for drawing a Training Dataset. “id” is the name
of this set of numbers. After entering this command, “id” will contain 252 numbers drawn from 1 to 378.
- For “1:378”, use the number of records you have.
- For “size”, use the number of records you need for the Training Dataset. A
recommended number is two-thirds of the total sample.

d1=d[id,]
d2=d[-id,]
- These two lines actually split the original dataset (“d”) into two. The first line asks R
to put the 252 records identified before into the “d1” dataset (i.e. the Training
dataset).
- The next line puts the remaining records into the “d2” dataset (i.e. the
Testing dataset).
- Be reminded to use square brackets!
- Replace the “d” before the square brackets if your dataset has another name.

dim(d1)
dim(d2)
- To play safe, use the above two lines to check that the split looks right.
- In the output, the first number is the number of rows (no. of records), and the
second number is the number of columns (no. of variables).
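If you prefer not to hard-code the record counts, here is a sketch that derives them from the dataset itself (the same logic as above, assuming the dataset is named “d”):

set.seed(12345)                                  # keep the random draw reproducible
n = nrow(d)                                      # total number of records
id = sample(1:n, size=round(2*n/3), replace=F)   # two-thirds for the Training dataset
d1 = d[id,]                                      # Training dataset
d2 = d[-id,]                                     # Testing dataset
dim(d1)
dim(d2)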

Step 3 – Training (Classification Tree)


1. Installing the packages required for classification tree analysis
install.packages("rpart")
install.packages("rpart.plot")
- You only need to install each package once on your machine; afterwards, calling library() is enough.

2. Running the Classification Tree


Step 1: Define the possible algorithm (same as WS2)
ag=(mid_at_risk~online_q1+online_q2)
- “ag” is the name you define to store the algorithm (a model formula).
- Right after the equal sign “=” comes the dependent variable. In the current example,
we want to predict whether students are at-risk in the mid-term test. Be reminded that this
is a re-coded variable based on the mid-term test scores.
- Use the “~”. You can enter it by pressing Shift and the “`” key, the key next to “1” on
the main keyboard.
- After the “~”, enter the equation you define. Enter the variable names (case sensitive!) and
separate the variables with “+”.
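A side note, as a quick sketch: the round brackets around the formula are optional, and you can ask R what it has stored:

ag = mid_at_risk ~ online_q1 + online_q2   # same as the command above
class(ag)   # prints "formula" – R stores the model specification, not the data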

Step 2: Run the classification tree analysis


library(rpart)
ctree1=rpart(ag,data=d1,method="class")
- The first line loads the library you installed earlier.
- “ctree1” is where you store the results of the classification tree.
- “rpart” is the command name, short for “Recursive Partitioning and Regression Trees”.
- “ag” is where you stored the algorithm (in the last step).
- “d1” refers back to the Training dataset you defined earlier.
- “class” tells R that you are building a classification tree.
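If you want more detail than the tree itself, rpart offers further inspection commands; a quick sketch:

printcp(ctree1)   # complexity parameter table: how the error changes as the tree grows
summary(ctree1)   # detailed node-by-node breakdown of the fitted tree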

Step 3: Preview the classification tree


print(ctree1)
- The “print” command asks R to show the analysis in a textual format.
- Perhaps drawing the tree is a better way to show the results:

library(rpart.plot)
rpart.plot(ctree1,extra=101)

- The first line loads the library you installed earlier.
- “rpart.plot” is the command to plot the classification tree.
See the output below: [Figure: the plotted classification tree]

- Each box is called a node.


- The first number is the predicted class of that node. In this example, “1” means that
students are at-risk and “0” means that students are not at-risk. (Read the first page of
these notes for more details.)
- The second line shows the distribution across the classes. The number on the left is the
number of records for the predicted class; the number on the right is the number of
records in the remaining class. The last number at the bottom is the percentage of
records in that node [considering the previous conditions]. (Note: the figures
displayed vary if the command is different, e.g. extra=108.)
- “XXX > XXX” refers to the condition for classification. If the condition is met, follow
the line on the left; if the condition is not met, follow the line on the right. The same
applies to ALL nodes.
Example:
Student A → Scores for Online Quiz 1 = 31

[Figure: Student A's path through the tree – YES at the first condition, then NO, then NO]

Based on the classification tree, Student A is predicted to be at-risk (1), and 4% (i.e. 9
records) of training records met the same set of conditions / were predicted in the same way.
Among these nine records (meeting all the conditions), 3 students are actually at-risk
while 6 students are actually not at-risk (these 6 records met the same set of conditions
and were therefore also predicted to be at-risk).

3. Checking the Prediction accuracy


Step 1: Check the prediction accuracy for the Training Dataset
pr=predict(ctree1)
cl=max.col(pr)
table(cl, d1$mid_at_risk)
- “predict” is the command to predict the result of each record based on “ctree1”.
- “pr” is where the probabilities of each class are stored (e.g. there is a 60% chance that
this student is at-risk and a 40% chance of not being at-risk).
- To determine the predicted class of a record, the “max.col” command asks R to take the
column with the highest probability in “pr” (e.g. R will say that a student is at-risk
if “pr” gives a 60% chance of being at-risk and a 40% chance of not being at-risk).
- “table” is a command to do crosstabs: the predicted number of records being at-risk
against the actual number. The columns show the actual numbers and the rows show the
predicted ones.
- This table is called a “Misclassification Table”. See the output below:


                         Actual: Not at-risk    Actual: At-risk
Predicted: Not at-risk    80 (CORRECT)            17 (INCORRECT)
Predicted: At-risk        37 (INCORRECT)         118 (CORRECT)

- Error Rate = Incorrect Predictions / Total No. of Predictions = (17+37) / (80+37+17+118) =
0.214. There is no general rule of thumb for an acceptable error rate; you need to consider
the practical implications.
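If you prefer to let R compute the error rate instead of doing the arithmetic by hand, a sketch (assuming the rows and columns line up as in the output above):

tab = table(cl, d1$mid_at_risk)   # the misclassification table from above
1 - sum(diag(tab)) / sum(tab)     # off-diagonal cells are the incorrect predictions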
(Step 4 - TESTING)
Step 2: Check the prediction accuracy for the Testing Dataset
pr<-predict(ctree1,d2)
cl<-max.col(pr)
table(cl,d2$mid_at_risk)
- First line: use the results of “ctree1” to predict the values for the Testing dataset and store
them in “pr”.
- As in the previous step, “pr” is where the probabilities of each class are stored (e.g.
there is a 60% chance that this student is at-risk and a 40% chance of not being at-risk).
- As in the previous step, the “max.col” command determines the predicted class of a record
by taking the column with the highest probability in “pr”.
- As in the previous step, “table” does the crosstabs: the predicted number of records
being at-risk against the actual number. The columns show the actual numbers and the rows
show the predicted ones.
- Once again, this table is called a “Misclassification Table”. See the output below:
[Misclassification table for the Testing dataset]

- Error rate: (29+26) / (23+29+26+48) = 0.437. This error rate is quite different from the
Training dataset's, which can be a red flag for “over-training” (overfitting). Google it for
more information. Of course, the error rate itself is not ideal either. In fact, the whole
situation is not ideal!
- Try to (1) manipulate the classification tree (see below); (2) re-code the variables (e.g. a
different cut-off point?); (3) use some other variables to predict this; and/or (4) use
another method.
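To avoid re-typing the three accuracy-checking lines for every tree and dataset, here is a small helper function, a sketch (not part of the original workshop code):

err_rate = function(tree, data) {
  pr = predict(tree, data)            # probabilities of each class
  cl = max.col(pr)                    # take the more probable class
  tab = table(cl, data$mid_at_risk)   # misclassification table
  1 - sum(diag(tab)) / sum(tab)       # proportion of incorrect predictions
}
err_rate(ctree1, d1)   # Training error, about 0.214 here
err_rate(ctree1, d2)   # Testing error, about 0.437 here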

4. Manipulating the Classification Tree


You can manipulate the classification tree to fit your purposes. Here are two commonly used
options (from R's help menu):

minsplit – the minimum number of observations that must exist in a node in order for a split to be attempted.

[For example, the node above had only 34 observations and was split into two nodes.
With minsplit=35, the two nodes below it would not exist.]

maxdepth – sets the maximum depth of any node of the final tree, with the root node [the first
node at the top] counted as depth 0. Values greater than 30 will give nonsense results on 32-bit machines.

[In the tree we built, the depth is 4.]

Step 1: Re-running and plotting the Classification tree


library(rpart)
ctree2=rpart(ag,data=d1,method="class", minsplit=35,maxdepth=4)
rpart.plot(ctree2,extra=101)

- The new tree will not attempt to split any node with fewer than 35 observations, and will
not grow deeper than 4 levels, so the tree has changed.
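The same settings can also be passed through rpart.control, which is equivalent and slightly more explicit; a quick sketch:

ctrl = rpart.control(minsplit=35, maxdepth=4)          # collect the tuning options
ctree2 = rpart(ag, data=d1, method="class", control=ctrl)
rpart.plot(ctree2, extra=101)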

Step 2: Re-checking the prediction accuracy for the Training / Testing Dataset
pr=predict(ctree2)
cl=max.col(pr)
table(cl, d1$mid_at_risk)

pr<-predict(ctree2,d2)
cl<-max.col(pr)
table(cl,d2$mid_at_risk)

Error rate for the Training Dataset = (34+23) / (83+34+23+112) = 0.226

Error rate for the Testing Dataset = (22+26) / (22+26+30+48) = 0.381

Still, the error rates are not ideal; perhaps it would be better to re-code the variables.
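As one illustration of re-coding with a different cut-off point, a hypothetical sketch (the threshold of 50 below is only an example, not a recommendation):

d$mid_at_risk2 = ifelse(d$midterm < 50, 1, 0)   # hypothetical new cut-off of 50
ag2 = mid_at_risk2 ~ online_q1 + online_q2      # then re-split “d” and re-run the tree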

