Classification

3.1 INTRODUCTION

• There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. These two forms are:
  (i) Classification
  (ii) Prediction
• Classification is a type of data analysis in which models defining relevant data classes are extracted.
• Classification models, called classifiers, predict categorical class labels, while prediction models predict continuous-valued functions.
• For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditure (in dollars) of potential customers on computer equipment, given their income and occupation.

3.2 BASIC CONCEPTS

In this section we discuss what classification is, how it works, the issues involved in preparing data for classification, and the criteria used to compare classification methods.

3.2.1 What Is Classification?

• The following are examples of cases where the data analysis task is classification:
  (i) A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
  (ii) A marketing manager at a company wants to analyze whether a customer with a given profile will buy a new laptop.
• In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.

3.2.2 How Does Classification Work?

With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps:
1. Building the classifier or model
2. Using the classifier for classification

1. Building the Classifier or Model
• This step is the learning step or learning phase, in which the classification algorithm builds the classifier.
• The classifier is built from a training set made up of database tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a predefined class, as determined by the class label attribute. These tuples are also referred to as samples, objects or data points.

Fig. 3.2.1 : Building a classifier. The classification algorithm learns rules from the training data, for example:
    IF age = youth THEN loan_decision = risky
    IF income = high THEN loan_decision = safe
    IF age = middle_aged AND income = low THEN loan_decision = risky

2. Using the Classifier for Classification
• In this step, the classifier is used for classification. Test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

Fig. 3.2.2 : Testing a classifier. For a new tuple such as (John, middle_aged, low), the learned rules predict loan_decision = risky.

3.2.3 Classification Issues

The major issue is preparing the data for classification. Preparing the data involves the following activities:
• Data Cleaning : Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis : A database may also contain irrelevant attributes. Correlation analysis is used to find out whether any two given attributes are related.
• Data Transformation and Reduction : The data can be transformed by any of the following methods:
  (i) Normalization : Normalization involves scaling all values of a given attribute so that they fall within a small specified range. It is used when, in the learning step, neural networks or methods involving distance measurements are used. (A small computational sketch follows this list.)
  (ii) Generalization : The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.
  (iii) Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis and clustering.
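The following is a minimal sketch of the min-max normalization mentioned in point (i) above. It is written in Python purely for illustration; the attribute values and the [0, 1] target range are assumptions for the example, not taken from the text.

    # Min-max normalization: scale every value of an attribute into a small range.
    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        if old_max == old_min:                      # constant attribute: map everything to new_min
            return [new_min for _ in values]
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    incomes = [12000, 35000, 48000, 73600, 98000]   # hypothetical attribute values
    print(min_max_normalize(incomes))               # every value now lies in [0, 1]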
3.2.4 Comparison of Classification Methods

The criteria for comparing methods of classification and prediction are:
• Accuracy : The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed : This refers to the computational cost involved in generating and using the classifier or predictor.
• Robustness : This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability : This refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability : This refers to the extent to which the classifier or predictor can be understood.

3.3 DECISION TREE INDUCTION

UQ. Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning? Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
• The decision tree of Fig. 3.3.1 is for the concept buys_laptop; it indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class (either buys_laptop = yes or buys_laptop = no).

Fig. 3.3.1 : Representation of a decision tree for the concept buys_laptop

• The benefits of having a decision tree are as follows:
  1. It does not require any domain knowledge.
  2. It is easy to comprehend.
  3. The learning and classification steps of a decision tree are simple and fast.

3.3.1 Decision Tree Induction Algorithm

A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
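Before turning to the induction algorithm itself, it helps to see how a finished tree is used. The sketch below classifies a tuple by walking a buys_laptop tree from the root to a leaf. The particular tests (age at the root, then student and credit_rating) are an assumption modelled on the classic buys_computer example; the exact shape of the tree in Fig. 3.3.1 may differ.

    # Classify one tuple by following the branches of an (assumed) buys_laptop tree.
    def classify_buys_laptop(age, student, credit_rating):
        if age == "youth":                       # internal node: test on age
            return "yes" if student == "yes" else "no"
        elif age == "middle_aged":
            return "yes"                         # leaf node: class label
        else:                                    # age == "senior"
            return "yes" if credit_rating == "fair" else "no"

    print(classify_buys_laptop("youth", "yes", "fair"))   # -> yes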
Algorithm : Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input :
• Data partition D, which is a set of training tuples and their associated class labels.
• attribute_list, the set of candidate attributes.
• Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split-point or a splitting subset.

Output : A decision tree

Method :
1. create a node N;
2. if the tuples in D are all of the same class, C, then
3.    return N as a leaf node labeled with the class C;
4. if attribute_list is empty then
5.    return N as a leaf node labeled with the majority class in D; // majority voting
6. apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
7. label node N with splitting_criterion;
8. if splitting_attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
9.    attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
10. for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
11.    let Dj be the set of data tuples in D satisfying outcome j; // a partition
12.    if Dj is empty then
13.       attach a leaf labeled with the majority class in D to node N;
14.    else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
15. endfor
16. return N;

3.3.2 Tree Pruning

• The decision tree that is built may overfit the training data: there could be too many branches, some of which reflect anomalies in the training data due to noise or outliers.
• Tree pruning addresses this problem of overfitting by removing the least reliable branches (using statistical measures). This generally results in a more compact and reliable decision tree that is faster and more accurate in its classification of data.
• There are two approaches to pruning a tree:
  (i) Prepruning : the tree is "pruned" by halting its construction early.
  (ii) Post-pruning : this approach removes a sub-tree from a fully grown tree.

Drawback of using a separate set of tuples to evaluate pruning
• A separate set of tuples used to evaluate pruning may not be representative of the training tuples used to create the original decision tree. If this separate set is skewed, then using it to evaluate the pruned tree would not be a good indicator of the pruned tree's classification accuracy.
• Furthermore, using a separate set of tuples to evaluate pruning means there are fewer tuples available for the creation and testing of the tree. While this is considered a drawback in machine learning, it may not be so in data mining, due to the availability of larger data sets.

UQ. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

With method (b), pruning a subtree removes that subtree completely. With method (a), however, pruning a rule may remove any single precondition of it. The latter is less restrictive.

3.3.3 Cost Complexity

The cost complexity of a decision tree is measured by the following two parameters:
• the number of leaves in the tree, and
• the error rate of the tree.

3.3.4 Classification using Information Gain (ID3)

• ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively (repeatedly) dichotomizes (divides) the features into two or more groups at each step.
• Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.
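The Generate_decision_tree pseudocode of Section 3.3.1 maps almost line for line onto a short recursive function. The sketch below is an illustrative Python rendering under simplifying assumptions (tuples are dictionaries of nominal attribute values, and the attribute-selection measure is passed in as a function, mirroring the Attribute_selection_method input); it is not the book's code.

    from collections import Counter

    def generate_decision_tree(tuples, attribute_list, target, select_attribute):
        classes = [t[target] for t in tuples]
        if len(set(classes)) == 1:                        # steps 2-3: all tuples in one class
            return classes[0]
        if not attribute_list:                            # steps 4-5: majority voting
            return Counter(classes).most_common(1)[0][0]
        best = select_attribute(tuples, attribute_list, target)   # step 6
        node = {best: {}}                                 # step 7: label the node with the criterion
        remaining = [a for a in attribute_list if a != best]      # steps 8-9 (nominal attributes assumed)
        for value in set(t[best] for t in tuples):        # steps 10-11: partition and grow subtrees
            subset = [t for t in tuples if t[best] == value]
            node[best][value] = generate_decision_tree(subset, remaining,
                                                       target, select_attribute)
        return node

With an information-gain-based select_attribute (such as the one sketched a little further below), this skeleton behaves like ID3.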
• ID3 is generally used for classification problems with nominal features only.

Metrics in ID3

• As mentioned previously, the ID3 algorithm selects the best feature at each step while building the decision tree. ID3 uses Information Gain, or simply Gain, to find the best feature.
• Information Gain calculates the reduction in entropy and measures how well a given feature separates (or classifies) the target classes. The feature with the highest Information Gain is selected as the best one.
• In simple words, entropy is a measure of disorder, and the entropy of a dataset is the measure of disorder in the target feature of the dataset. Equivalently, entropy is the expected amount of information (in bits) needed to assign a class to a randomly drawn object.
• In the case of binary classification (where the target column has only two classes), the entropy is 0 if all values in the target column are homogeneous (similar) and 1 if the target column has an equal number of values of both classes.
• Denoting the dataset by D, entropy is calculated as

      Entropy(D) = H(D) = - Σ p_i log2(p_i)   (the sum is taken over the n classes)

  where n is the total number of classes in the target column and p_i is the probability of class i, i.e. the ratio of the number of rows with class i in the target column to the total number of rows in the dataset.
• The Information Gain for a feature column A is calculated as

      Gain(A) = Entropy(D) - Entropy(A)

  where Entropy(A) is the weighted average of the entropies of the partitions of D induced by the values of A.

ID3 Steps

1. Calculate the Information Gain of each feature.
2. If all rows do not belong to the same class, split the dataset D into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with that class as its label.
5. Repeat the above steps for the remaining features until we run out of features or the decision tree consists of all leaf nodes.
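The entropy and gain formulas above are easy to verify programmatically. The following is a small illustrative sketch; the data layout, a list of (feature value, class label) pairs for one column, is an assumption made for the example.

    import math

    def entropy(class_labels):
        n = len(class_labels)
        return -sum((class_labels.count(c) / n) * math.log2(class_labels.count(c) / n)
                    for c in set(class_labels))

    def information_gain(pairs):
        """pairs: list of (feature_value, class_label) tuples for one feature column."""
        labels = [c for _, c in pairs]
        remainder = 0.0
        for v in set(value for value, _ in pairs):
            subset = [c for value, c in pairs if value == v]
            remainder += len(subset) / len(pairs) * entropy(subset)
        return entropy(labels) - remainder

    # 9 "Yes" and 5 "No" tuples give the 0.940 bits used in Ex. 3.3.1 below.
    print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))    # 0.94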
Ex. 3.3.1 : Apply the ID3 algorithm on the following training dataset and extract the classification rules from the tree.

    Day | Outlook  | Temp. | Humidity | Wind   | Play_Tennis
    1   | Sunny    | Hot   | High     | Weak   | No
    2   | Sunny    | Hot   | High     | Strong | No
    3   | Overcast | Hot   | High     | Weak   | Yes
    4   | Rain     | Mild  | High     | Weak   | Yes
    5   | Rain     | Cool  | Normal   | Weak   | Yes
    6   | Rain     | Cool  | Normal   | Strong | No
    7   | Overcast | Cool  | Normal   | Strong | Yes
    8   | Sunny    | Mild  | High     | Weak   | No
    9   | Sunny    | Cool  | Normal   | Weak   | Yes
    10  | Rain     | Mild  | Normal   | Weak   | Yes
    11  | Sunny    | Mild  | Normal   | Strong | Yes
    12  | Overcast | Mild  | High     | Strong | Yes
    13  | Overcast | Hot   | Normal   | Weak   | Yes
    14  | Rain     | Mild  | High     | Strong | No

Soln. :
Let the class label attribute values be C1 = Play_Tennis = Yes = 9 samples and C2 = Play_Tennis = No = 5 samples. Therefore P(C1) = 9/14 and P(C2) = 5/14.

(i) Entropy before the split for the given database D:
    H(D) = (9/14) log2(14/9) + (5/14) log2(14/5) = 0.4098 + 0.5305 = 0.940

(ii) Choosing Outlook as the splitting attribute:

    Outlook  | C1 (Yes) | C2 (No) | Entropy
    Sunny    | 2        | 3       | 0.971
    Overcast | 4        | 0       | 0
    Rain     | 3        | 2       | 0.971

    H(Outlook) = (5/14) x 0.971 + (4/14) x 0 + (5/14) x 0.971 = 0.694
    Gain(Outlook) = H(D) - H(Outlook) = 0.940 - 0.694 = 0.246

(iii) Choosing Temperature as the splitting attribute:

    Temperature | C1 (Yes) | C2 (No) | Entropy
    Hot         | 2        | 2       | 1
    Mild        | 4        | 2       | 0.92
    Cool        | 3        | 1       | 0.81

    H(Temperature) = (4/14) x 1 + (6/14) x 0.92 + (4/14) x 0.81 = 0.911
    Gain(Temperature) = 0.940 - 0.911 = 0.029

(iv) Choosing Humidity as the splitting attribute:

    Humidity | C1 (Yes) | C2 (No) | Entropy
    High     | 3        | 4       | 0.985
    Normal   | 6        | 1       | 0.592

    H(Humidity) = (7/14) x 0.985 + (7/14) x 0.592 = 0.789
    Gain(Humidity) = 0.940 - 0.789 = 0.151

(v) Choosing Wind as the splitting attribute:

    Wind   | C1 (Yes) | C2 (No) | Entropy
    Strong | 3        | 3       | 1
    Weak   | 6        | 2       | 0.811

    H(Wind) = (6/14) x 1 + (8/14) x 0.811 = 0.892
    Gain(Wind) = 0.940 - 0.892 = 0.048

Summary : Gain(Outlook|D) = 0.246, Gain(Temperature|D) = 0.029, Gain(Humidity|D) = 0.151, Gain(Wind|D) = 0.048.
Outlook has the highest gain; therefore it is used as the decision attribute in the root node. Since Outlook has three possible values, the root node has three branches (Sunny, Overcast, Rain).

Fig. P.3.3.1(a) : Root node Outlook with the branches Sunny, Overcast and Rain

Now consider Outlook = Sunny and count the corresponding tuples of the original dataset D. Denote this subset as D1 (days 1, 2, 8, 9, 11). Let C1 = Play_Tennis = Yes|Sunny = 2 samples and C2 = Play_Tennis = No|Sunny = 3 samples, so P(C1) = 2/5 and P(C2) = 3/5.

(i) Entropy of D1:
    H(D1) = (2/5) log2(5/2) + (3/5) log2(5/3) = 0.971

(ii) Choosing Temperature as the splitting attribute for D1:

    Temperature | C1 (Yes|Sunny) | C2 (No|Sunny) | Entropy
    Hot         | 0              | 2             | 0
    Mild        | 1              | 1             | 1
    Cool        | 1              | 0             | 0

    H(Temperature) = (2/5) x 0 + (2/5) x 1 + (1/5) x 0 = 0.4
    Gain(Temperature) = H(D1) - H(Temperature) = 0.971 - 0.4 = 0.571

(iii) Choosing Humidity as the splitting attribute for D1:

    Humidity | C1 (Yes|Sunny) | C2 (No|Sunny) | Entropy
    High     | 0              | 3             | 0
    Normal   | 2              | 0             | 0

    H(Humidity) = (3/5) x 0 + (2/5) x 0 = 0
    Gain(Humidity) = 0.971 - 0 = 0.971

(iv) Choosing Wind as the splitting attribute for D1: Weak (1 Yes, 2 No; entropy 0.918) and Strong (1 Yes, 1 No; entropy 1) give H(Wind) = (3/5) x 0.918 + (2/5) x 1 = 0.951, so Gain(Wind) = 0.971 - 0.951 = 0.02.

Summary : Gain(Temperature|D1) = 0.571, Gain(Humidity|D1) = 0.971, Gain(Wind|D1) = 0.02.
Humidity has the highest gain; therefore it is placed below Outlook = "Sunny". Since Humidity has two possible values, the Humidity node has two branches (High, Normal). From D1 we find that when Humidity = High, Play_Tennis = No, and when Humidity = Normal, Play_Tennis = Yes.

Fig. P.3.3.1(b) : Humidity placed below the Sunny branch

Now consider Outlook = Overcast and count the corresponding tuples of D. Denote this subset as D2:

    Day | Outlook  | Temp. | Humidity | Wind   | Play_Tennis
    3   | Overcast | Hot   | High     | Weak   | Yes
    7   | Overcast | Cool  | Normal   | Strong | Yes
    12  | Overcast | Mild  | High     | Strong | Yes
    13  | Overcast | Hot   | Normal   | Weak   | Yes

From D2 we find that for all values of Outlook = "Overcast", Play_Tennis = Yes, so the Overcast branch ends in the leaf Yes.

Fig. P.3.3.1(c) : The Overcast branch ends in the leaf Yes

Now consider Outlook = Rain and count the corresponding tuples of D. Denote this subset as D3:

    Day | Outlook | Temp. | Humidity | Wind   | Play_Tennis
    4   | Rain    | Mild  | High     | Weak   | Yes
    5   | Rain    | Cool  | Normal   | Weak   | Yes
    6   | Rain    | Cool  | Normal   | Strong | No
    10  | Rain    | Mild  | Normal   | Weak   | Yes
    14  | Rain    | Mild  | High     | Strong | No

Let C1 = Play_Tennis = Yes|Rain = 3 samples and C2 = Play_Tennis = No|Rain = 2 samples, so P(C1) = 3/5 and P(C2) = 2/5.

(i) Entropy of D3:
    H(D3) = (3/5) log2(5/3) + (2/5) log2(5/2) = 0.971

(ii) Choosing Temperature as the splitting attribute for D3:

    Temperature | C1 (Yes|Rain) | C2 (No|Rain) | Entropy
    Mild        | 2             | 1            | 0.918
    Cool        | 1             | 1            | 1

    H(Temperature) = (3/5) x 0.918 + (2/5) x 1 = 0.951
    Gain(Temperature) = H(D3) - H(Temperature) = 0.971 - 0.951 = 0.02

(iii) Choosing Wind as the splitting attribute for D3:

    Wind   | C1 (Yes|Rain) | C2 (No|Rain) | Entropy
    Strong | 0             | 2            | 0
    Weak   | 3             | 0            | 0

    H(Wind) = (2/5) x 0 + (3/5) x 0 = 0
    Gain(Wind) = 0.971 - 0 = 0.971

Wind has the highest gain; therefore it is placed below Outlook = "Rain". Since Wind has two possible values, the Wind node has two branches (Strong, Weak). From D3 we find that when Wind = Strong, Play_Tennis = No, and when Wind = Weak, Play_Tennis = Yes.

Fig. P.3.3.1(d) : The complete decision tree

The decision tree can also be expressed in rule format as:
IF Outlook = Sunny AND Humidity = High THEN Play_Tennis = No
IF Outlook = Sunny AND Humidity = Normal THEN Play_Tennis = Yes
IF Outlook = Overcast THEN Play_Tennis = Yes
IF Outlook = Rain AND Wind = Strong THEN Play_Tennis = No
IF Outlook = Rain AND Wind = Weak THEN Play_Tennis = Yes

Ex. 3.3.2 : A simple example from the stock market, involving only discrete ranges, has Profit as the categorical class attribute with values {Up, Down}. The training data is:

    Age | Competition | Type     | Profit
    Old | Yes         | Software | Down
    Old | No          | Software | Down
    Old | No          | Hardware | Down
    Mid | Yes         | Software | Down
    Mid | Yes         | Hardware | Down
    Mid | No          | Hardware | Up
    Mid | No          | Software | Up
    New | Yes         | Software | Up
    New | No          | Hardware | Up
    New | No          | Software | Up

Apply the decision tree algorithm and show the generated rules.

Soln. :
Let C1 = Profit = Down = 5 samples and C2 = Profit = Up = 5 samples, so P(C1) = 5/10 and P(C2) = 5/10.

(i) Entropy before the split for the given database D:
    H(D) = (5/10) log2(10/5) + (5/10) log2(10/5) = 0.5 + 0.5 = 1

(ii) Choosing Age as the splitting attribute:

    Age | C1 (Down) | C2 (Up) | Entropy
    Old | 3         | 0       | 0
    Mid | 2         | 2       | 1
    New | 0         | 3       | 0

    H(Age) = (3/10) x 0 + (4/10) x 1 + (3/10) x 0 = 0.4
    Gain(Age) = H(D) - H(Age) = 1 - 0.4 = 0.6

(iii) Choosing Competition as the splitting attribute:

    Competition | C1 (Down) | C2 (Up) | Entropy
    Yes         | 3         | 1       | 0.8113
    No          | 2         | 4       | 0.9183

    H(Competition) = (4/10) x 0.8113 + (6/10) x 0.9183 = 0.8755
    Gain(Competition) = 1 - 0.8755 = 0.1245

(iv) Choosing Type as the splitting attribute:

    Type     | C1 (Down) | C2 (Up) | Entropy
    Software | 3         | 3       | 1
    Hardware | 2         | 2       | 1

    H(Type) = (6/10) x 1 + (4/10) x 1 = 1
    Gain(Type) = 1 - 1 = 0

Summary : Gain(Age) = 0.6, Gain(Competition) = 0.1245, Gain(Type) = 0.
Age has the highest gain; therefore it is used as the decision attribute in the root node. Since Age has three possible values, the root node has three branches (Old, Mid, New). From the dataset we find that:
IF Age = Old THEN Profit = Down
IF Age = Mid THEN Profit = Down OR Profit = Up
IF Age = New THEN Profit = Up

Fig. P.3.3.2(a) : Root node Age; the Old branch ends in Down and the New branch ends in Up

Now consider Age = Mid and denote the corresponding subset as D1:

    Age | Competition | Type     | Profit
    Mid | Yes         | Software | Down
    Mid | Yes         | Hardware | Down
    Mid | No          | Hardware | Up
    Mid | No          | Software | Up

Let C1 = Profit = Down = 2 samples and C2 = Profit = Up = 2 samples, so P(C1) = 2/4 and P(C2) = 2/4.

(i) Entropy of D1:
    H(D1) = (2/4) log2(4/2) + (2/4) log2(4/2) = 0.5 + 0.5 = 1

(ii) Choosing Competition as the splitting attribute for D1:
    H(Competition) = (2/4) x H(Yes) + (2/4) x H(No) = (2/4) x 0 + (2/4) x 0 = 0
    Gain(Competition) = H(D1) - H(Competition) = 1 - 0 = 1

(iii) Choosing Type as the splitting attribute for D1:
    H(Type) = (2/4) x H(Software) + (2/4) x H(Hardware) = (2/4) x 1 + (2/4) x 1 = 1
    Gain(Type) = H(D1) - H(Type) = 1 - 1 = 0

Summary : Gain(Competition|D1) = 1, Gain(Type|D1) = 0.
Competition has the highest gain; therefore it is placed below Age = "Mid". Since Competition has two possible values, the Competition node has two branches (Yes, No). From D1 we find that when Competition = Yes, Profit = Down, and when Competition = No, Profit = Up.

Fig. P.3.3.2(b) : The complete decision tree

The decision tree can also be expressed in rule format as:
IF Age = Old THEN Profit = Down
IF Age = Mid AND Competition = Yes THEN Profit = Down
IF Age = Mid AND Competition = No THEN Profit = Up
IF Age = New THEN Profit = Up

Ex. 3.3.3 : Using the following training data set of 12 tuples, described by the attributes Income and Age with the class label Own House = {Yes, Rented}, create a classification model using a decision tree and draw the final tree.

Soln. :
Let C1 = Own House = Yes = 7 samples and C2 = Own House = Rented = 5 samples, so P(C1) = 7/12 and P(C2) = 5/12.

(i) Entropy before the split for the given database D:
    H(D) = (7/12) log2(12/7) + (5/12) log2(12/5) = 0.454 + 0.526 = 0.980

(ii) Choosing Income as the splitting attribute:

    Income    | C1 (Yes) | C2 (Rented) | Entropy
    Very High | 2        | 0           | 0
    High      | 4        | 0           | 0
    Medium    | 1        | 2           | 0.918
    Low       | 0        | 3           | 0

    H(Income) = (2/12) x 0 + (4/12) x 0 + (3/12) x 0.918 + (3/12) x 0 = 0.229
    Gain(Income) = H(D) - H(Income) = 0.980 - 0.229 = 0.751

(iii) Choosing Age as the splitting attribute:

    Age    | C1 (Yes) | C2 (Rented) | Entropy
    Young  | 3        | 1           | 0.811
    Medium | 3        | 2           | 0.971
    Old    | 1        | 2           | 0.918

    H(Age) = (4/12) x 0.811 + (5/12) x 0.971 + (3/12) x 0.918 = 0.904
    Gain(Age) = 0.980 - 0.904 = 0.076

Summary : Gain(Income) = 0.751, Gain(Age) = 0.076.
Income has the highest gain; therefore it is used as the decision attribute in the root node. Since Income has four possible values, the root node has four branches (Very High, High, Medium, Low).

Fig. P.3.3.3(a) : Root node Income with four branches

• For Income = Very High, both tuples have class label Own House = Yes, so this branch directly ends in the leaf "Yes" (Fig. P.3.3.3(b)).
• For Income = High, all four tuples have class label Own House = Yes, so this branch also ends in the leaf "Yes" (Fig. P.3.3.3(c)).
• For Income = Low, all three tuples have class label Own House = Rented, so this branch ends in the leaf "Rented" (Fig. P.3.3.3(d)).
• For Income = Medium (subset D4), we find that:
  IF Income = Medium AND Age = Young THEN Own House = Yes
  IF Income = Medium AND Age = Medium THEN Own House = Rented
  IF Income = Medium AND Age = Old THEN Own House = Rented

Fig. P.3.3.3(e) : The complete decision tree

The decision tree can also be expressed in rule format as:
IF Income = Very High THEN Own House = Yes
IF Income = High THEN Own House = Yes
IF Income = Low THEN Own House = Rented
IF Income = Medium AND Age = Young THEN Own House = Yes
IF Income = Medium AND Age = Medium THEN Own House = Rented
IF Income = Medium AND Age = Old THEN Own House = Rented

Ex. 3.3.4 : The following table consists of generalized training data from an employee database. For example, "31-35" for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age and salary given in that row. Let status be the class label attribute.

    department | status | age   | salary  | count
    sales      | senior | 31-35 | 46K-50K | 30
    sales      | junior | 26-30 | 26K-30K | 40
    sales      | junior | 31-35 | 31K-35K | 40
    systems    | junior | 21-25 | 46K-50K | 20
    systems    | senior | 31-35 | 66K-70K | 5
    systems    | junior | 26-30 | 46K-50K | 3
    systems    | senior | 41-45 | 66K-70K | 3
    marketing  | senior | 36-40 | 46K-50K | 10
    marketing  | junior | 31-35 | 41K-45K | 4
    secretary  | senior | 46-50 | 36K-40K | 4
    secretary  | junior | 26-30 | 26K-30K | 6

(a) How would you modify the basic decision tree algorithm to take into consideration the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.

Soln. :
(a) The basic decision tree algorithm should be modified as follows to take the count of each generalized data tuple into consideration:
• The count of each tuple must be integrated into the calculation of the attribute selection measure (such as information gain).
• The count must also be taken into consideration when determining the most common class among the tuples.

(b) Using the ID3 algorithm to construct a decision tree from the given data:
Let C1 = status = junior = 113 samples and C2 = status = senior = 52 samples, so P(C1) = 113/165 and P(C2) = 52/165.

(i) Entropy before the split for the given database D:
    H(D) = (113/165) log2(165/113) + (52/165) log2(165/52) = 0.3740 + 0.5250 = 0.899

(ii) Choosing department as the splitting attribute:

    department | C1 (junior) | C2 (senior) | Entropy
    sales      | 80          | 30          | 0.8454
    systems    | 23          | 8           | 0.8238
    marketing  | 4           | 10          | 0.8631
    secretary  | 6           | 4           | 0.9709

    H(department) = (110/165) x 0.8454 + (31/165) x 0.8238 + (14/165) x 0.8631 + (10/165) x 0.9709 = 0.8504
    Gain(department) = 0.899 - 0.8504 = 0.0486

(iii) Choosing age as the splitting attribute:

    age   | C1 (junior) | C2 (senior) | Entropy
    21-25 | 20          | 0           | 0
    26-30 | 49          | 0           | 0
    31-35 | 44          | 35          | 0.9906
    36-40 | 0           | 10          | 0
    41-45 | 0           | 3           | 0
    46-50 | 0           | 4           | 0

    H(age) = (79/165) x 0.9906 = 0.4743
    Gain(age) = 0.899 - 0.4743 = 0.4247

(iv) Choosing salary as the splitting attribute:

    salary  | C1 (junior) | C2 (senior) | Entropy
    26K-30K | 46          | 0           | 0
    31K-35K | 40          | 0           | 0
    36K-40K | 0           | 4           | 0
    41K-45K | 4           | 0           | 0
    46K-50K | 23          | 40          | 0.9468
    66K-70K | 0           | 8           | 0

    H(salary) = (63/165) x 0.9468 = 0.3615
    Gain(salary) = 0.899 - 0.3615 = 0.5375

Summary : Gain(department) = 0.0486, Gain(age) = 0.4247, Gain(salary) = 0.5375.
salary has the highest gain; therefore it is used as the decision attribute in the root node. Since salary has six possible values, the root node has six branches. From the dataset D we find that:
IF salary = 26K-30K THEN status = junior
IF salary = 31K-35K THEN status = junior
IF salary = 36K-40K THEN status = senior
IF salary = 41K-45K THEN status = junior
IF salary = 66K-70K THEN status = senior
Only the branch salary = 46K-50K does not lead to a single class label.

Fig. P.3.3.4(a) : Root node salary; only the 46K-50K branch needs further splitting

Now consider salary = 46K-50K and denote the corresponding tuples as D1:

    department | status | age   | count
    sales      | senior | 31-35 | 30
    systems    | junior | 21-25 | 20
    systems    | junior | 26-30 | 3
    marketing  | senior | 36-40 | 10

Let C1 = status = junior|salary = 46K-50K = 23 samples and C2 = status = senior|salary = 46K-50K = 40 samples, so P(C1) = 23/63 and P(C2) = 40/63.

(i) Entropy of D1:
    H(D1) = (23/63) log2(63/23) + (40/63) log2(63/40) = 0.9468

(ii) Choosing department as the splitting attribute for D1:

    department | C1 (junior) | C2 (senior) | Entropy
    sales      | 0           | 30          | 0
    systems    | 23          | 0           | 0
    marketing  | 0           | 10          | 0

    H(department) = 0, so Gain(department) = 0.9468 - 0 = 0.9468

(iii) Choosing age as the splitting attribute for D1:

    age   | C1 (junior) | C2 (senior) | Entropy
    21-25 | 20          | 0           | 0
    26-30 | 3           | 0           | 0
    31-35 | 0           | 30          | 0
    36-40 | 0           | 10          | 0

    H(age) = 0, so Gain(age) = 0.9468 - 0 = 0.9468

Both attributes have the same gain; therefore we choose one of them arbitrarily and place it below salary = 46K-50K, which completes the tree.

3.4 NAIVE BAYESIAN CLASSIFICATION

Consider classifying the unknown sample
    X = (Age <= 30, Income = Medium, Student = Yes, Credit_rating = Fair)
where the class label attribute is Buys_Computer, with C1 = Buys_Computer = Yes (9 training tuples) and C2 = Buys_Computer = No (5 training tuples), so that P(C1) = 9/14 and P(C2) = 5/14.

Let event X1 be Age <= 30, X2 be Income = Medium, X3 be Student = Yes, and X4 be Credit_rating = Fair.

To compute P(X|C1): from Naive Bayesian classification, P(X|Ci) = Π P(Xk|Ci), so
    P(X1|C1) = P(Age <= 30 | Buys_Computer = Yes) = 2/9
    P(X2|C1) = P(Income = Medium | Buys_Computer = Yes) = 4/9
    P(X3|C1) = P(Student = Yes | Buys_Computer = Yes) = 6/9
    P(X4|C1) = P(Credit_rating = Fair | Buys_Computer = Yes) = 6/9
    P(X|C1) = (2/9) x (4/9) x (6/9) x (6/9) = 0.044
    P(X|C1) P(C1) = 0.044 x 9/14 = 0.028                                   ...(A)

To compute P(X|C2):
    P(X1|C2) = P(Age <= 30 | Buys_Computer = No) = 3/5
    P(X2|C2) = P(Income = Medium | Buys_Computer = No) = 2/5
    P(X3|C2) = P(Student = Yes | Buys_Computer = No) = 1/5
    P(X4|C2) = P(Credit_rating = Fair | Buys_Computer = No) = 2/5
    P(X|C2) = (3/5) x (2/5) x (1/5) x (2/5) = 0.019
    P(X|C2) P(C2) = 0.019 x 5/14 = 0.007                                   ...(B)

Naive Bayesian classification assigns the sample X to class Ci if and only if P(Ci|X) > P(Cj|X) for j not equal to i, i.e. P(X|Ci) P(Ci) / P(X) > P(X|Cj) P(Cj) / P(X). Since P(X) is constant for both classes, it is enough to compare P(X|Ci) P(Ci). From (A) and (B), P(X|C1) P(C1) > P(X|C2) P(C2), so we conclude that the unknown sample X = (Age <= 30, Income = Medium, Student = Yes, Credit_rating = Fair) belongs to the class Buys_Computer = Yes.

Ex. 3.4.2 : Apply the statistical (Naive Bayesian) approach, using the actual probabilities of each event, to classify X = (Dept = "Systems", Status = "Junior", Age = "26-30"). Use the following table:

    Dept    | Status | Age   | Salary  | Count
    Sales   | Senior | 31-35 | 46K-50K | 30
    Sales   | Junior | 26-30 | 26K-30K | 40
    Sales   | Junior | 31-35 | 31K-35K | 40
    Systems | Junior | 21-25 | 46K-50K | 20
    Systems | Senior | 31-35 | 66K-70K | 5
    Systems | Junior | 26-30 | 46K-50K | 5
    Systems | Senior | 41-45 | 66K-70K | 5

Soln. :
The class label attribute is Salary. Let
    C1 = Salary between 46K-50K = 55 samples
    C2 = Salary between 26K-30K = 40 samples
    C3 = Salary between 31K-35K = 40 samples
    C4 = Salary between 66K-70K = 10 samples
The total number of tuples is 145, so P(C1) = 55/145, P(C2) = 40/145, P(C3) = 40/145 and P(C4) = 10/145.

Let event X1 be Dept = "Systems", X2 be Status = "Junior" and X3 be Age = "26-30".

To compute P(X|C1):
    P(X1|C1) = P(Dept = Systems | Salary 46K-50K) = 25/55
    P(X2|C1) = P(Status = Junior | Salary 46K-50K) = 25/55
    P(X3|C1) = P(Age = 26-30 | Salary 46K-50K) = 5/55
    P(X|C1) = (25/55) x (25/55) x (5/55) = 0.019
    P(X|C1) P(C1) = 0.019 x 55/145 = 0.007                                 ...(A)

To compute P(X|C2):
    P(X1|C2) = P(Dept = Systems | Salary 26K-30K) = 0/40
    P(X2|C2) = P(Status = Junior | Salary 26K-30K) = 40/40
    P(X3|C2) = P(Age = 26-30 | Salary 26K-30K) = 40/40
    P(X|C2) = 0, so P(X|C2) P(C2) = 0                                      ...(B)

To compute P(X|C3):
    P(X1|C3) = P(Dept = Systems | Salary 31K-35K) = 0/40
    P(X2|C3) = P(Status = Junior | Salary 31K-35K) = 40/40
    P(X3|C3) = P(Age = 26-30 | Salary 31K-35K) = 0/40
    P(X|C3) = 0, so P(X|C3) P(C3) = 0                                      ...(C)

To compute P(X|C4):
    P(X1|C4) = P(Dept = Systems | Salary 66K-70K) = 10/10
    P(X2|C4) = P(Status = Junior | Salary 66K-70K) = 0/10
    P(X3|C4) = P(Age = 26-30 | Salary 66K-70K) = 0/10
    P(X|C4) = 0, so P(X|C4) P(C4) = 0                                      ...(D)

Since P(X) is constant for all classes, the sample is assigned to the class with the largest P(X|Ci) P(Ci). From (A) to (D), P(X|C1) P(C1) is the largest, so we conclude that the unknown sample X = (Dept = "Systems", Status = "Junior", Age = "26-30") belongs to class C1, i.e. Salary between 46K-50K.

Ex. 3.4.3 : Given the following training data for height classification, classify the tuple t, a tuple with Gender = M and Height in the range (1.9, 2.0], using Naive Bayesian classification.

    Name      | Gender | Height | Output
    Kiran     | F      | 1.6 m  | Short
    Jatin     | M      | 2 m    | Tall
    Madhuri   | F      | 1.9 m  | Medium
    Manisha   | F      | 1.88 m | Medium
    Shilpa    | F      | 1.7 m  | Short
    Bobby     | M      | 1.85 m | Medium
    Kavita    | F      | 1.6 m  | Short
    Dinesh    | M      | 1.7 m  | Short
    Rahul     | M      | 2.2 m  | Tall
    Shree     | M      | 2.1 m  | Tall
    Divya     | F      | 1.8 m  | Medium
    Tushar    | M      | 1.95 m | Medium
    Kim       | F      | 1.9 m  | Medium
    Aarti     | F      | 1.8 m  | Medium
    Rajashree | F      | 1.75 m | Medium

Soln. :
From the table it is clear that there are 4 tuples classified as Short, 8 tuples classified as Medium and 3 tuples classified as Tall. We divide the Height attribute into six ranges:
    (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)

From the given training data we estimate
    P(Short) = 4/15, P(Medium) = 8/15, P(Tall) = 3/15

The unseen tuple is t = (M, Height in (1.9, 2.0]).
    P(t|Short) x P(Short) = P(M|Short) x P((1.9, 2.0]|Short) x P(Short) = (1/4) x 0 x (4/15) = 0
    P(t|Medium) x P(Medium) = P(M|Medium) x P((1.9, 2.0]|Medium) x P(Medium) = (2/8) x (1/8) x (8/15) = 0.0167
    P(t|Tall) x P(Tall) = P(M|Tall) x P((1.9, 2.0]|Tall) x P(Tall) = (3/3) x (1/3) x (3/15) = 0.067

Based on these probabilities, we classify the new tuple as Tall because it has the highest probability.

Ex. 3.4.4 : Given the following training data for credit transactions, classify a new transaction with (Income = Medium, Credit = Good) using Naive Bayesian classification.

    Income    | Credit    | Decision
    Very High | Excellent | AUTHORIZE
    High      | Good      | AUTHORIZE
    Medium    | Excellent | AUTHORIZE
    High      | Good      | AUTHORIZE
    Very High | Good      | AUTHORIZE
    Medium    | Excellent | AUTHORIZE
    High      | Bad       | REQUEST ID
    Medium    | Bad       | REQUEST ID
    High      | Bad       | REJECT
    Low       | Bad       | CALL POLICE

Soln. :
There are 6 tuples with Decision = AUTHORIZE, 2 tuples with REQUEST ID, 1 tuple with REJECT and 1 tuple with CALL POLICE, so
    P(AUTHORIZE) = 6/10, P(REQUEST ID) = 2/10, P(REJECT) = 1/10, P(CALL POLICE) = 1/10

The conditional probabilities needed for the new tuple t = (Income = Medium, Credit = Good) are:
    P(Income = Medium | AUTHORIZE) = 2/6        P(Credit = Good | AUTHORIZE) = 3/6
    P(Income = Medium | REQUEST ID) = 1/2       P(Credit = Good | REQUEST ID) = 0/2
    P(Income = Medium | REJECT) = 0/1           P(Credit = Good | REJECT) = 0/1
    P(Income = Medium | CALL POLICE) = 0/1      P(Credit = Good | CALL POLICE) = 0/1

Therefore:
    P(t|AUTHORIZE) x P(AUTHORIZE) = (2/6) x (3/6) x (6/10) = 0.1
    P(t|REQUEST ID) x P(REQUEST ID) = (1/2) x 0 x (2/10) = 0
    P(t|REJECT) x P(REJECT) = 0 x 0 x (1/10) = 0
    P(t|CALL POLICE) x P(CALL POLICE) = 0 x 0 x (1/10) = 0

The highest value is obtained for AUTHORIZE, so the new transaction is classified as AUTHORIZE.

3.5 RULE BASED CLASSIFICATION

• Let us consider a rule R1:
    R1 : IF age = youth AND student = yes THEN buy_computer = yes
• The IF part of the rule is called the rule antecedent or precondition. The THEN part of the rule is called the rule consequent.
• Rules can be extracted directly from a decision tree: one rule is created for each path from the root node to a leaf node. To form the rule antecedent, each splitting criterion along the path is logically ANDed; the leaf node holds the class prediction, which forms the rule consequent.

3.6 ACCURACY AND ERROR MEASURES

• In data mining, classification involves the problem of predicting which category or class a new observation belongs to. The derived model (classifier) is based on the analysis of a set of training data in which each data item is given a class label. The trained model (classifier) is then used to predict the class label for new, unseen data.
• To understand classification metrics, one of the most important concepts is the confusion matrix. The different classifier evaluation measures are discussed below.

1. Confusion Matrix : A confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes. It is also called a contingency matrix. Each row in a confusion matrix represents an actual class, while each column represents a predicted class. The 2x2 confusion matrix is denoted as:

                        Predicted class
                        1       0
    Actual class   1    TP      FN
                   0    FP      TN

• TP (True Positives) : the tuples which are predicted to be positive and are actually positive.
• TN (True Negatives) : the tuples which are predicted to be negative and are actually negative.
• FP (False Positives) : the tuples which are predicted to be positive but are actually negative. Also called a Type I error.
• FN (False Negatives) : the tuples which are predicted to be negative but are actually positive. Also called a Type II error.

2. Sensitivity : Also called the true positive (recognition) rate, it is the proportion of positive tuples that are correctly identified.
    Sensitivity = TP / P, where P is the number of positive tuples.
3. Specificity : Also called the true negative rate, it is the proportion of negative tuples that are correctly identified.
    Specificity = TN / N, where N is the number of negative tuples.
4. Accuracy : The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. It is also referred to as the overall recognition rate of the classifier.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
5. Precision : It is a measure of exactness; it determines what percentage of the tuples labelled as positive are actually positive.
    Precision = TP / (TP + FP)
6. Recall : It is a measure of completeness; it determines what percentage of the positive tuples are labelled as positive.
    Recall = TP / (TP + FN)
7. F-Score : It is the harmonic mean of precision and recall and gives equal weight to both. It is also called the F-measure or F1 score.
    F = (2 x Precision x Recall) / (Precision + Recall)
8. F_beta Score : It is a weighted measure of precision and recall; it assigns beta times as much weight to recall as to precision. Commonly used F_beta measures are F2 (which weights recall twice as much as precision) and F0.5 (which weights precision twice as much as recall).
    F_beta = ((1 + beta^2) x Precision x Recall) / (beta^2 x Precision + Recall)
9. Error Rate : It is also called the misclassification rate of a classifier and is simply (1 - Accuracy).

3.7 EVALUATING THE ACCURACY OF A CLASSIFIER

Besides the evaluation measures discussed above, the following techniques are used to evaluate the accuracy of a classifier.

3.7.1 Holdout

In this method, the (usually large) dataset is randomly divided into three subsets:
1. The training set is the subset of the dataset used to build predictive models.
2. The validation set is the subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
3. The test set, or set of unseen examples, is the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Typically, two-thirds of the data are allocated to the training set and the remaining one-third is allocated to the test set.

Fig. 3.7.1 : Holdout (the available data is partitioned into training, validation (testing/holdout sample) and test portions)

3.7.2 Random Subsampling

• Random subsampling is a variation of the holdout method in which the holdout procedure is repeated k times. In each iteration the data is randomly split into a training set and a test set, the model is trained on the training set, and the mean square error (MSE) or accuracy is obtained from the predictions on the test set.
• A single split is not recommended because the resulting estimate depends on that particular split; a different split can give a different MSE, and there is no way to know which estimate to trust. Repeating the split addresses this.
• The overall accuracy is calculated by taking the average of the accuracies obtained in each iteration, i.e. acc_overall = (1/k) Σ acc_i.

Fig. 3.7.2 : Random subsampling (the train/test split is repeated for each of the k experiments)

3.7.3 Cross Validation

• When only a limited amount of data is available, k-fold cross-validation is used to achieve an unbiased estimate of the model performance.
• In k-fold cross-validation, we divide the data into k subsets of equal size and build the model k times, each time leaving out one of the subsets from training and using it as the test set.
• If k equals the sample size, this is called "leave-one-out" cross-validation.

3.7.4 Bootstrapping

• Bootstrapping is a technique used to make estimates from data by taking the average of estimates obtained from smaller data samples.
• The bootstrap method involves iteratively re-sampling the dataset with replacement. Instead of estimating a statistic only once on the complete data, we estimate it many times on re-samples (drawn with replacement) of the original sample.
• Repeating this re-sampling multiple times yields a vector of estimates, from which we can compute the variance, the expected value, the empirical distribution and other relevant statistics.

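The evaluation measures of Section 3.6 can be computed directly from two lists of labels. The sketch below is illustrative only; the label lists and the choice of "yes" as the positive class are assumptions made for the example.

    # Compute confusion-matrix counts and the derived measures of Section 3.6.
    def evaluation_measures(actual, predicted, positive_label):
        tp = sum(1 for a, p in zip(actual, predicted) if a == positive_label and p == positive_label)
        tn = sum(1 for a, p in zip(actual, predicted) if a != positive_label and p != positive_label)
        fp = sum(1 for a, p in zip(actual, predicted) if a != positive_label and p == positive_label)
        fn = sum(1 for a, p in zip(actual, predicted) if a == positive_label and p != positive_label)
        accuracy  = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall    = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        return {"accuracy": accuracy, "precision": precision, "recall": recall, "F1": f1}

    actual    = ["yes", "yes", "no", "no", "yes", "no"]      # hypothetical test-set labels
    predicted = ["yes", "no",  "no", "yes", "yes", "no"]     # hypothetical classifier output
    print(evaluation_measures(actual, predicted, "yes"))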