
Workshop – A Taste of Classification Techniques: A Data Mining Workshop for Beginners

Dennis Foung

General Notes on R Coding


- Make sure you type every single letter / symbol exactly as shown in these notes.
- It is not easy to explain the use of square brackets / round brackets / commas in a few-hour
workshop. Please Google them if you want to find out more.
- To learn more about a command, use R's help function: help(COMMAND)

Legend
“XXXX” → Words in quotation marks are the words of a menu item
XXXX → Words in a box represent the code for R / R Studio

Details of Dataset
id_student     Index
gender         Male = M; Female = F
online_q1      Online quiz 1: 0-100
online_q2      Online quiz 2: 0-100
finalscores    Final score of Ss: 0-100
avgtimespent   Time spent on Canvas: 0-60
abbusage       Average Canvas usage (clicks): 0-infinity
exam           Exam scores: 0-100
midterm        Mid-term test scores: 0-100
mid_at_risk    Coded variable: mid-term test 0-59 = 1; 60+ = 0
essay          Scores for essay (grading scheme: 0-4.5)
grade          Final grade for Ss (grading scheme: 0-4.5)
final_at_risk  Coded variable: final grade 0-2.5 = 1; 3+ = 0
Step 1 – Data Retrieval
Importing and manipulating data for R Studio
Command: d=read.csv("Classification Tree.csv", sep=",", header=TRUE)
[CHANGE the working directory to where your file is]

** In this set of notes, the name for the dataset is “d”**
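As a minimal end-to-end sketch of this step (the folder path below is a placeholder you must change to your own):

setwd("C:/path/to/your/folder")   # point R at the folder that contains the CSV file
d = read.csv("Classification Tree.csv", sep=",", header=TRUE)
head(d)   # preview the first few rows to confirm the import worked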

Step 2 – Data Cleaning
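The notes do not spell this step out. As a generic sketch of checks one might run here (these are common practices, not the workshop's exact steps):

summary(d)       # check the range of every variable for impossible values
sum(is.na(d))    # count missing values in the whole dataset
d = na.omit(d)   # one simple option: drop records with missing values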


Step 2.5 – Splitting Dataset
1. Splitting the dataset (R studio)
Step 1: Identify a random sample
set.seed(12345)
- This allows you to re-do the procedures with the same results (i.e. the same random numbers).
- You can use any number to replace “12345”.
id=sample(1:378, size=252,replace=F)
- This identifies a random set of numbers for drawing a Training Dataset. “id” is the name
of this set of numbers. After entering this command, “id” will contain 252 numbers drawn from 1 to 378.
- For “1:378”, use the number of records you have.
- For “size”, use the number of records you need for the Training Dataset. A
recommended number is two-thirds of the total sample.

d1=d[id,]
d2=d[-id,]
- These two lines actually split the original dataset (“d”) into two. The first line asks R
to put the 252 records identified before into the “d1” dataset (i.e. the Training
dataset).
- The next line puts the remaining records into the “d2” dataset (i.e. the
Testing dataset).
- Be reminded to use square brackets!
- Replace the “d” before the square brackets if your dataset has another name.

dim(d1)
dim(d2)
- To play safe, use the above two lines to check that the split looks right.
- In the output, the first number is the number of rows (no. of records), and the
second number is the number of columns (no. of variables).
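If you prefer not to hard-code the record counts, here is a sketch that derives them from the dataset itself (the same logic as above, assuming the dataset is named “d”):

set.seed(12345)                                  # keep the random draw reproducible
n = nrow(d)                                      # total number of records
id = sample(1:n, size=round(2*n/3), replace=F)   # two-thirds for the Training dataset
d1 = d[id,]                                      # Training dataset
d2 = d[-id,]                                     # Testing dataset
dim(d1)
dim(d2)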

Step 3 – Training (Classification Tree)


1. Installing the packages required for classification tree analysis
install.packages("rpart")
install.packages("rpart.plot")
- You only need to install each package once on your machine; afterwards, calling library() is enough.

2. Running the Classification Tree


Step 1: Define the possible algorithm (same as WS2)
ag=(mid_at_risk~online_q1+online_q2)
- “ag” is the name you define to store the algorithm (a model formula).
- Right after the equal sign “=” comes the dependent variable. In the current example,
we want to predict whether students are at-risk in the mid-term test. Be reminded that this
is a re-coded variable based on the mid-term test scores.
- Use the “~”. You can enter it by pressing Shift and the “`” key, the key next to “1” on
the main keyboard.
- After the “~”, enter the equation you define. Enter the variable names (case sensitive!) and
separate the variables with “+”.
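A side note, as a quick sketch: the round brackets around the formula are optional, and you can ask R what it has stored:

ag = mid_at_risk ~ online_q1 + online_q2   # same as the command above
class(ag)   # prints "formula" – R stores the model specification, not the data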

Step 2: Run the classification tree analysis


library(rpart)
ctree1=rpart(ag,data=d1,method="class")
- The first line loads the library you installed earlier.
- “ctree1” is where you store the results of the classification tree.
- “rpart” is the command name, short for “Recursive Partitioning and Regression Trees”.
- “ag” is where you stored the algorithm (in the last step).
- “d1” refers back to the Training dataset you defined earlier.
- “class” tells R that you are building a classification tree.
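If you want more detail than the tree itself, rpart offers further inspection commands; a quick sketch:

printcp(ctree1)   # complexity parameter table: how the error changes as the tree grows
summary(ctree1)   # detailed node-by-node breakdown of the fitted tree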

Step 3: Preview the classification tree


print(ctree1)
- The “print” command asks R to show the analysis in a textual format.
- Perhaps drawing the tree is a better way to show the results:

library(rpart.plot)
rpart.plot(ctree1,extra=101)

- The first line loads the library you installed earlier.
- “rpart.plot” is the command to plot the classification tree.
See the output below: [Figure: the plotted classification tree]

- Each box is called a node.


- The first number is the predicted class of that node. In this example, “1” means that
students are at-risk and “0” means that students are not at-risk. (Read the first page of
these notes for more details.)
- The second line shows the distribution across the classes. The number on the left is the
number of records for the predicted class; the number on the right is the number of
records in the remaining class. The last number at the bottom is the percentage of
records in that node [considering the previous conditions]. (Note: the figures
displayed vary if the command is different, e.g. extra=108.)
- “XXX > XXX” refers to the condition for classification. If the condition is met, follow
the line on the left; if the condition is not met, follow the line on the right. The same
applies to ALL nodes.
Example:
Student A → Scores for Online Quiz 1 = 31

[Figure: Student A's path through the tree – YES at the first condition, then NO, then NO]

Based on the classification tree, Student A is predicted to be at-risk (1), and 4% (i.e. 9
records) of training records met the same set of conditions / were predicted in the same way.
Among these nine records (meeting all the conditions), 3 students are actually at-risk
while 6 students are actually not at-risk (these 6 records met the same set of conditions
and were therefore also predicted to be at-risk).

3. Checking the Prediction accuracy


Step 1: Check the prediction accuracy for the Training Dataset
pr=predict(ctree1)
cl=max.col(pr)
table(cl, d1$mid_at_risk)
- “predict” is the command to predict the result of each record based on “ctree1”.
- “pr” is where the probabilities of each class are stored (e.g. there is a 60% chance that
this student is at-risk and a 40% chance of not being at-risk).
- To determine the predicted class of a record, the “max.col” command asks R to take the
column with the highest probability in “pr” (e.g. R will say that a student is at-risk
if “pr” gives a 60% chance of being at-risk and a 40% chance of not being at-risk).
- “table” is a command to do crosstabs: the predicted number of records being at-risk
against the actual number. The columns show the actual numbers and the rows show the
predicted ones.
- This table is called a “Misclassification Table”. See the output below:


                         Actual: Not at-risk    Actual: At-risk
Predicted: Not at-risk    80 (CORRECT)            17 (INCORRECT)
Predicted: At-risk        37 (INCORRECT)         118 (CORRECT)

- Error Rate = Incorrect Predictions / Total No. of Predictions = (17+37) / (80+37+17+118) =
0.214. There is no general rule of thumb for an acceptable error rate; you need to consider
the practical implications.
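If you prefer to let R compute the error rate instead of doing the arithmetic by hand, a sketch (assuming the rows and columns line up as in the output above):

tab = table(cl, d1$mid_at_risk)   # the misclassification table from above
1 - sum(diag(tab)) / sum(tab)     # off-diagonal cells are the incorrect predictions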
(Step 4 - TESTING)
Step 2: Check the prediction accuracy for the Testing Dataset
pr<-predict(ctree1,d2)
cl<-max.col(pr)
table(cl,d2$mid_at_risk)
- First line: use the results of “ctree1” to predict the values for the Testing dataset and store
them in “pr”.
- As in the previous step, “pr” is where the probabilities of each class are stored (e.g.
there is a 60% chance that this student is at-risk and a 40% chance of not being at-risk).
- As in the previous step, the “max.col” command determines the predicted class of a record
by taking the column with the highest probability in “pr”.
- As in the previous step, “table” does the crosstabs: the predicted number of records
being at-risk against the actual number. The columns show the actual numbers and the rows
show the predicted ones.
- Once again, this table is called a “Misclassification Table”. See the output below:
[Misclassification table for the Testing dataset]

- Error rate: (29+26) / (23+29+26+48) = 0.437. This error rate is quite different from the
Training dataset's, which can be a red flag for “over-training” (overfitting). Google it for
more information. Of course, the error rate itself is not ideal either. In fact, the whole
situation is not ideal!
- Try to (1) manipulate the classification tree (see below); (2) re-code the variables (e.g. a
different cut-off point?); (3) use some other variables to predict this; and/or (4) use
another method.
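To avoid re-typing the three accuracy-checking lines for every tree and dataset, here is a small helper function, a sketch (not part of the original workshop code):

err_rate = function(tree, data) {
  pr = predict(tree, data)            # probabilities of each class
  cl = max.col(pr)                    # take the more probable class
  tab = table(cl, data$mid_at_risk)   # misclassification table
  1 - sum(diag(tab)) / sum(tab)       # proportion of incorrect predictions
}
err_rate(ctree1, d1)   # Training error, about 0.214 here
err_rate(ctree1, d2)   # Testing error, about 0.437 here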

4. Manipulating the Classification Tree


You can manipulate the classification tree to fit your purposes. Here are two commonly used
options (from R's help menu):

minsplit – the minimum number of observations that must exist in a node in order for a split to be attempted.

[For example, the node above had only 34 observations and was split into two nodes.
With minsplit=35, the two nodes below it would not exist.]

maxdepth – sets the maximum depth of any node of the final tree, with the root node [the first
node at the top] counted as depth 0. Values greater than 30 will give nonsense results on 32-bit machines.

[In the tree we built, the depth is 4.]

Step 1: Re-running and plotting the Classification tree


library(rpart)
ctree2=rpart(ag,data=d1,method="class", minsplit=35,maxdepth=4)
rpart.plot(ctree2,extra=101)

- The new tree will not attempt to split any node with fewer than 35 observations, and will
not grow deeper than 4 levels, so the tree has changed.
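The same settings can also be passed through rpart.control, which is equivalent and slightly more explicit; a quick sketch:

ctrl = rpart.control(minsplit=35, maxdepth=4)          # collect the tuning options
ctree2 = rpart(ag, data=d1, method="class", control=ctrl)
rpart.plot(ctree2, extra=101)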

Step 2: Re-checking the prediction accuracy for the Training / Testing Dataset
pr=predict(ctree2)
cl=max.col(pr)
table(cl, d1$mid_at_risk)

pr<-predict(ctree2,d2)
cl<-max.col(pr)
table(cl,d2$mid_at_risk)

Error rate for the Training Dataset = (34+23) / (83+34+23+112) = 0.226

Error rate for the Testing Dataset = (22+26) / (22+26+30+48) = 0.381

Still, the error rates are not ideal; perhaps it would be better to re-code the variables.
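As one illustration of re-coding with a different cut-off point, a hypothetical sketch (the threshold of 50 below is only an example, not a recommendation):

d$mid_at_risk2 = ifelse(d$midterm < 50, 1, 0)   # hypothetical new cut-off of 50
ag2 = mid_at_risk2 ~ online_q1 + online_q2      # then re-split “d” and re-run the tree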

