Notes - With R Code
Dennis Foung
Legend
“XXXX” Words in quotation marks are the name of a menu item
XXXX Words in a box represent code for R / RStudio
Details of Dataset
id_student	Index
gender	Male = M; Female = F
online_q1	Online quiz 1: 0-100
online_q2	Online quiz 2: 0-100
finalscores	Final score of students: 0-100
avgtimespent	Time spent on Canvas: 0-60
abbusage	Average Canvas usage (clicks): 0-infinity
exam	Exam scores: 0-100
midterm	Mid-term test scores: 0-100
mid_at_risk	Coded variable: mid-term test 0-59 = 1; 60+ = 0
essay	Scores for essay (grading scheme: 0-4.5)
grade	Final grade of students (grading scheme: 0-4.5)
final_at_risk	Coded variable: final grade 0-2.5 = 1; 3+ = 0
Step 1 – Data Retrieval
Importing and manipulating data in RStudio
Command: d=read.csv("Classification Tree.csv", sep=",", header=TRUE)
[CHANGE the working directory to the folder where your file is]
d1=d[id,]
d2=d[-id,]
- These two lines split the original dataset (“d”) into two. The first line asks R to put the
252 records identified earlier into the “d1” dataset (i.e. the training dataset).
- The second line puts the remaining records into the “d2” dataset (i.e. the testing
dataset).
- Be reminded to use square brackets!
- Replace the “d” before the square brackets if your dataset has another name.
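The `id` vector used above must have been created before this point. A minimal, self-contained sketch of the whole split (the data frame here is a stand-in for the real CSV; the sample size of 252 follows these notes):

```r
# Self-contained sketch of the train/test split described above.
# The data frame is simulated here only so the code runs on its own;
# in the workshop, `d` comes from read.csv().
set.seed(123)                                    # make the random split reproducible
d <- data.frame(x = rnorm(360), y = rnorm(360))  # stand-in for the real dataset
id <- sample(1:nrow(d), 252)                     # row numbers for the training set
d1 <- d[id, ]                                    # training dataset (252 records)
d2 <- d[-id, ]                                   # testing dataset (the rest)
```

Because `d1` keeps the rows listed in `id` and `d2` keeps everything else, every record lands in exactly one of the two datasets.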
dim(d1)
dim(d2)
- To play safe, use the above two lines to check that the split worked.
- In the R output, the first number is the number of rows (no. of records) and the
second number is the number of columns (no. of variables).
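A quick self-contained illustration of how to read the `dim()` output (the toy data frame here is not the workshop dataset):

```r
# dim() returns rows first, then columns.
d1 <- data.frame(a = 1:252, b = runif(252))
dim(d1)   # 252 2 -> 252 records (rows), 2 variables (columns)
```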
library(rpart.plot)
rpart.plot (ctree1,extra=101)
- The first line calls the library you installed earlier.
- rpart.plot() is the command that plots the classification tree.
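The tree object `ctree1` must have been fitted before it can be plotted. A self-contained sketch of fitting and plotting such a tree; the formula and the simulated data here are illustrative only, not the exact ones used in the workshop:

```r
# Hedged sketch: fit a classification tree with rpart, then plot it.
# The predictors and the simulated data are illustrative assumptions.
library(rpart)
library(rpart.plot)
set.seed(1)
d1 <- data.frame(online_q1    = runif(252, 0, 100),
                 avgtimespent = runif(252, 0, 60))
d1$mid_at_risk <- factor(ifelse(d1$online_q1 < 50, 1, 0))  # toy outcome

ctree1 <- rpart(mid_at_risk ~ online_q1 + avgtimespent,
                data = d1, method = "class")
rpart.plot(ctree1, extra = 101)   # extra=101 shows counts and percentages in each node
```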
See the output below:
Workshop – A Taste of Classification Techniques: A Data Mining Workshop for Beginners
[Output: classification tree plot, with each split labelled YES / NO]
Based on the classification tree, student A is predicted to be at-risk (1), and 4% of the
training records (i.e. 9 records) met the same set of conditions / were predicted in the
same way. Among these nine records, 3 students are actually at-risk while 6 are not (but
because these 6 records met the same set of conditions, they were also predicted to be
at-risk).
- Error rate: (29+26) / (23+29+26+48) = 0.437. This error rate is quite different from the
training dataset's, which can be a red flag for overfitting (“over-training”); search for
“overfitting” for more information. Of course, the error rate itself is not ideal either. In
fact, the whole situation is not ideal!
- Try to (1) manipulate the classification tree (see below); (2) recode the variables (e.g. a
different cut-off point?); (3) use some other variables as predictors; and/or (4) use
another method.
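The error rate quoted above can be recomputed directly from the 2x2 confusion table. The layout below (rows = predicted class, columns = actual class) is an illustrative assumption, since the original table is not reproduced here; the off-diagonal cells are the misclassified records either way:

```r
# Recompute the error rate (29+26) / (23+29+26+48) from a confusion table.
# Row/column orientation is assumed for illustration.
tab <- matrix(c(23, 26, 29, 48), nrow = 2,
              dimnames = list(predicted = c(0, 1), actual = c(0, 1)))
err <- (tab[1, 2] + tab[2, 1]) / sum(tab)  # misclassified / total
round(err, 3)   # 0.437
```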
[For example, the node above had only 34 observations and was split into two nodes.
With minsplit=35, the two nodes below it would not exist.]
maxdepth	Sets the maximum depth of any node of the final tree, with the root node counted as
depth 0 [the first node at the top]. For values greater than 30, rpart will give nonsense
results on 32-bit machines.
- The new tree will not split any node containing fewer than 35 observations, so the tree changes.
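These control parameters are passed to rpart() through rpart.control(). A self-contained sketch, assuming minsplit=35 as in the example above (the formula, data, and maxdepth value are illustrative):

```r
# Hedged sketch: refit the tree with control parameters.
# minsplit = 35 stops rpart from splitting any node with fewer than
# 35 observations; maxdepth = 5 is an illustrative choice.
library(rpart)
set.seed(1)
d1 <- data.frame(online_q1 = runif(252, 0, 100))   # simulated stand-in data
d1$mid_at_risk <- factor(ifelse(d1$online_q1 < 50, 1, 0))

ctree2 <- rpart(mid_at_risk ~ online_q1, data = d1, method = "class",
                control = rpart.control(minsplit = 35, maxdepth = 5))
```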
Step 2: Re-checking the prediction accuracy for the Training / Testing Dataset
pr <- predict(ctree2)
cl <- max.col(pr)
table(cl, d1$mid_at_risk)
pr <- predict(ctree2, d2)
cl <- max.col(pr)
table(cl, d2$mid_at_risk)
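The six lines above can be turned into a single error-rate number. A self-contained sketch with simulated stand-in data (the real `ctree2`, `d1`, and `d2` come from the earlier steps): predict() on a classification tree returns a matrix of class probabilities, max.col() picks the more likely class (column 1 = first level, column 2 = second level), and the diagonal of the cross-table holds the correct predictions.

```r
# Hedged sketch of the accuracy check on the testing dataset.
library(rpart)
set.seed(1)
d1 <- data.frame(online_q1 = runif(252, 0, 100))   # stand-in training data
d1$mid_at_risk <- factor(ifelse(d1$online_q1 < 50, 1, 0))
d2 <- data.frame(online_q1 = runif(100, 0, 100))   # stand-in testing data
d2$mid_at_risk <- factor(ifelse(d2$online_q1 < 50, 1, 0))
ctree2 <- rpart(mid_at_risk ~ online_q1, data = d1, method = "class")

pr  <- predict(ctree2, d2)         # matrix of class probabilities
cl  <- max.col(pr)                 # 1 = class "0", 2 = class "1"
tab <- table(cl, d2$mid_at_risk)   # rows: predicted, columns: actual
error_rate <- 1 - sum(diag(tab)) / sum(tab)
error_rate
```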
Still, the error rates are not ideal; perhaps it would be better to re-code the variables.