Discriminant Analysis: Course Material
• The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936.
• Discriminant Analysis is a Dependence technique.
• Discriminant Analysis is used to predict group membership.
• This technique is used to classify individuals/objects into one of several alternative groups on the basis of a set of predictor (independent) variables.
• The dependent variable in discriminant analysis is categorical and measured on a nominal scale, whereas the independent variables are interval- or ratio-scaled.
• When the dependent variable has two groups (categories), it is a case of two-group discriminant analysis.
• When the dependent variable has more than two groups (categories), it is a case of multiple discriminant analysis.
INTRODUCTION
• Discriminant Analysis is applicable in situations in which the total sample can be divided into groups based on a non-metric dependent variable.
• Examples: male vs. female; high vs. medium vs. low.
• The primary objectives of multiple discriminant analysis are to understand group differences and to predict the likelihood that an entity (individual or object) will belong to a particular class or group based on several independent variables.
ASSUMPTIONS OF DISCRIMINANT ANALYSIS
• No Multicollinearity
• Multivariate Normality
• Independence of Observations
• Homoscedasticity
• No Outliers
• Adequate Sample Size
• Linearity (for LDA)
ASSUMPTIONS OF DISCRIMINANT ANALYSIS (CONTINUED)
• The assumptions of discriminant analysis are as follows. The analysis is quite sensitive to outliers, and the size of the smallest group must be larger than the number of predictor variables.
• No multicollinearity: If one of the independent variables is very highly correlated with another, or is a function (e.g., the sum) of other independents, the tolerance value for that variable will approach 0 and the matrix will not have a unique discriminant solution. Multicollinearity among the independents must therefore be low. To the extent that the independents are correlated, the standardized discriminant function coefficients will not reliably assess the relative importance of the predictor variables, and predictive power can decrease with increased correlation between predictor variables. Logistic regression may offer an alternative to DA, as it usually involves fewer violations of assumptions. (A tolerance/VIF check is sketched below.)
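The sketch below illustrates such a tolerance/VIF check in Python with statsmodels; the DataFrame and its column names are hypothetical stand-ins, not data from these slides.

```python
# A minimal sketch of a tolerance/VIF check with statsmodels; the DataFrame
# and its column names are hypothetical stand-ins for the predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "annual_income": [18.2, 15.1, 20.3, 14.8, 19.5, 13.9],
    "household_size": [4, 3, 5, 2, 4, 3],
})

Xc = sm.add_constant(X)            # VIF is computed with an intercept present
for i, name in enumerate(Xc.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(Xc.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
    # tolerance approaching 0 (VIF very large) signals severe multicollinearity
```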
• Multivariate normality: The independent variables are normal for each level of the grouping variable; that is, the data (for the variables) are assumed to represent a sample from a multivariate normal distribution. You can examine whether or not variables are normally distributed with histograms of the frequency distributions. Note, however, that violations of the normality assumption are not "fatal": the resulting significance tests are still reliable as long as the non-normality is caused by skewness and not by outliers (Tabachnick and Fidell 1996).
• DA is fairly robust to violations of most of these assumptions, but it is highly sensitive to departures from multivariate normality and to outliers.
PROCESS FLOW CHART
Research Problem
STAGE 1 – Select objectives:
• Evaluate group differences on a multivariate profile
• Classify observations into groups
• Identify dimensions of discrimination between groups
STAGE 3 – Assumptions:
• Normality of independent variables
• Linearity of relationships
• Lack of multicollinearity
• Equal dispersion matrices
STAGE 5:
• The classification of the existing data points is done using the estimated equation, and the accuracy of the model is determined.
• This output is given by the classification matrix (also called the confusion matrix), which tells what percentage of the existing data points is correctly classified by the model.
RELATIVE IMPORTANCE OF INDEPENDENT VARIABLES
• Suppose we have two independent variables, X1 and X2.
• How do we know which one is more important in discriminating between the groups?
• The coefficients of the two variables provide the answer: the variable with the larger standardized coefficient (in absolute value) contributes more to the discrimination.
PREDICTING GROUP MEMBERSHIP FOR A NEW DATA POINT
• For any new data point that we want to classify into one of the groups, the coefficients of the equation are used to calculate its discriminant score Y.
• A decision rule is then formulated to determine the cutoff score, which is usually the midpoint of the mean discriminant scores of the two groups.
COEFFICIENTS
• Cutoff score for the two-group discriminant function: the criterion against which an individual's discriminant Z score is compared to determine predicted group membership.

C = (n₂ȳ₁ + n₁ȳ₂) / (n₁ + n₂)

(Basic formula for computing the optimal cutoff score between any two groups, where n₁ and n₂ are the group sizes and ȳ₁ and ȳ₂ are the mean discriminant scores of groups 1 and 2. A classification sketch using this formula follows.)
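A minimal sketch of this cutoff rule in Python, assuming the group sizes and group mean discriminant scores (centroids) are already known; the numbers in the example call are placeholders, not values from the slides.

```python
# A minimal sketch of the cutoff rule: compute C, then assign a case to
# whichever side of the cutoff its discriminant Z score falls on.

def cutoff_score(n1, n2, ybar1, ybar2):
    """Optimal cutoff: C = (n2*ybar1 + n1*ybar2) / (n1 + n2)."""
    return (n2 * ybar1 + n1 * ybar2) / (n1 + n2)

def classify(z, c, centroid1):
    """Assign group 1 if the Z score falls on group 1's side of the cutoff."""
    if centroid1 > c:
        return 1 if z > c else 2
    return 1 if z < c else 2

# Equal group sizes with centroids +1.0 and -1.0 give C = 0.
C = cutoff_score(n1=10, n2=10, ybar1=1.0, ybar2=-1.0)
print(C)                                    # 0.0
print(classify(z=0.8, c=C, centroid1=1.0))  # 1 (above the cutoff)
```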
TERMS
• Function – This indicates the first or second canonical linear discriminant function. The number of functions is the smaller of the number of discriminating variables and the number of levels of the group variable minus one. In this example, job has three levels and three discriminating variables were used, so two functions are calculated. Each function acts as a projection of the data onto a dimension that best separates or discriminates between the groups.
• Eigenvalue – These are the eigenvalues of the matrix product of the inverse of the within-group sums-of-squares and cross-products matrix and the between-groups sums-of-squares and cross-products matrix. These eigenvalues are related to the canonical correlations and describe how much discriminating ability a function possesses; the magnitudes of the eigenvalues are indicative of the functions' discriminating abilities. (A sketch of this computation appears after this list.)
• % of Variance – This is the proportion of discriminating ability of the three continuous variables found in a given function. This proportion is calculated as the
proportion of the function’s eigenvalue to the sum of all the eigenvalues. In this analysis, the first function accounts for 77% of the discriminating ability of the
discriminating variables and the second function accounts for 23%. We can verify this by noting that the sum of the eigenvalues is 1.081+.321 = 1.402. Then
(1.081/1.402) = 0.771 and (0.321/1.402) = 0.229.
• Cumulative % – This is the cumulative proportion of discriminating ability. For any analysis, the proportions of discriminating ability sum to one, so the last entry in the cumulative column will also be one.
• Canonical Correlation – These are the canonical correlations of our predictor variables (outdoor, social and conservative) and the groupings in job. If we
consider our discriminating variables to be one set of variables and the set of dummies generated from our grouping variable to be another set of variables, we can
perform a canonical correlation analysis on these two sets. From this analysis, we would arrive at these canonical correlations.
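The following is a sketch, not SPSS output, of the eigenvalue computation the Eigenvalue entry describes: the eigenvalues of the product of the inverse within-group SSCP matrix and the between-group SSCP matrix, together with the derived % of variance and canonical correlations. The data are randomly generated placeholders.

```python
# Eigenvalues of inv(W) @ B, where W and B are the within-group and
# between-group sums-of-squares and cross-products (SSCP) matrices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))      # 30 cases, 3 discriminating variables
g = np.repeat([0, 1, 2], 10)      # 3 groups -> min(3, 3 - 1) = 2 functions

grand_mean = X.mean(axis=0)
W = np.zeros((3, 3))
B = np.zeros((3, 3))
for k in np.unique(g):
    Xk = X[g == k]
    centered = Xk - Xk.mean(axis=0)
    W += centered.T @ centered                  # within-group SSCP
    d = (Xk.mean(axis=0) - grand_mean)[:, None]
    B += len(Xk) * (d @ d.T)                    # between-group SSCP

eig = np.sort(np.linalg.eigvals(np.linalg.inv(W) @ B).real)[::-1][:2]
print("eigenvalues:      ", eig)
print("% of variance:    ", eig / eig.sum())
print("canonical correl.:", np.sqrt(eig / (1 + eig)))
```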
EIGENVALUES AND MULTIVARIATE TESTS
• Test of Function(s) – These are the functions included in a given test with the null hypothesis that the canonical correlations
associated with the functions are all equal to zero. In this example, we have two functions. Thus, the first test presented in this table
tests both canonical correlations (“1 through 2”) and the second test presented tests the second canonical correlation alone.
• Wilks’ Lambda – Wilks’ Lambda is one of the multivariate statistics calculated by SPSS. It is the product of the values of (1 − canonical correlation²). In this example, our canonical correlations are 0.721 and 0.493, so the Wilks’ Lambda testing both canonical correlations is (1 − 0.721²) × (1 − 0.493²) = 0.364, and the Wilks’ Lambda testing the second canonical correlation alone is (1 − 0.493²) = 0.757.
• Chi-square – This is the Chi-square statistic testing that the canonical correlation of the given function is equal to zero. In other
words, the null hypothesis is that the function, and all functions that follow, have no discriminating ability. This hypothesis is tested
using this Chi-square statistic.
• df – This is the effect degrees of freedom for the given function. It is based on the number of groups present in the categorical
variable and the number of continuous discriminant variables. The Chi-square statistic is compared to a Chi-square distribution with
the degrees of freedom stated here.
• Sig. – This is the p-value associated with the Chi-square statistic of a given test. The null hypothesis that a given function’s
canonical correlation and all smaller canonical correlations are equal to zero is evaluated with regard to this p-value. For a given
alpha level, such as 0.05, if the p-value is less than alpha, the null hypothesis is rejected. If not, then we fail to reject the null
hypothesis.
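The tests above can be reproduced in a few lines. The sketch below uses the canonical correlations quoted in the text (0.721 and 0.493) and assumes Bartlett's chi-square approximation with the N = 244, p = 3, g = 3 of the three-group example; the printed Lambda values should match 0.364 and 0.757.

```python
# Wilks' Lambda for each dimension-reduction test, with Bartlett's
# chi-square approximation for the significance test.
import numpy as np
from scipy.stats import chi2

r = np.array([0.721, 0.493])   # canonical correlations, largest first
N, p, g = 244, 3, 3            # cases, predictors, groups

for k in range(len(r)):
    lam = np.prod(1 - r[k:] ** 2)                # Lambda for functions k+1..m
    stat = -(N - 1 - (p + g) / 2) * np.log(lam)  # Bartlett's chi-square
    df = (p - k) * (g - k - 1)
    print(f"functions {k + 1} through {len(r)}: Lambda = {lam:.3f}, "
          f"chi2 = {stat:.1f}, df = {df}, p = {chi2.sf(stat, df):.4f}")
```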
TWO GROUPS DISCRIMINANT ANALYSIS
• A retail outlet wants to know the consumer behaviour patterns behind purchases in two product categories, national brands and local brands, which would help it place orders depending on the demand and requirements of its customers. The outlet uses data from a retail outlet in another location to arrive at decisions about the customers visiting its own store.
• The retail outlet wants to use discriminant analysis to screen the responsiveness of customers towards the national-brand and local-brand categories and find out the following:
1. The percentage of customers that it is able to classify correctly.
2. The statistical significance of the discriminant function.
3. Which variable (annual income in lakh rupees or household size) is relatively better at discriminating between consumers of national and local brands.
4. The classification of new customers into one of the two groups, namely national-brand acceptors (group 1) and local-brand acceptors (group 2).
DATA
• After the input data has been typed in, along with the variable labels and value labels, in an SPSS file, proceed as below to get the output for a discriminant analysis problem:
• 3. On the dialogue box which appears, select the GROUPING VARIABLE (dependent categorical variable in discriminant analysis) by clicking on the
right arrow to transfer it from the variable list on the left to the grouping variable box on the right.
• 4. Define the range of values of the grouping variable by clicking on DEFINE RANGE just below the grouping variable box. Fill in the minimum and maximum values of the variable (the codes used in our problem are 0 and 1) in the box which appears. Then click CONTINUE.
• 5. Select all the independent variables for discriminant analysis from the variable list by clicking on the arrow which transfers them to the
INDEPENDENTS box on the right.
• 6. Just below the INDEPENDENTS box select ‘Enter independents together’ if you want all the selected independent variables (that are in the box)
in the discriminant model. (Here you have an option to use a STEPWISE discriminant analysis by selecting ‘Use Stepwise Method’ instead of ‘Enter
independents together’).
SPSS COMMANDS FOR DISCRIMINANT ANALYSIS
• 7. Click on STATISTICS on the lower part of the main dialog box. This opens up a smaller dialog box. Under
STATISTICS, click on MEANS and UNIVARIATE ANOVAS. Under the title FUNCTION COEFFICIENTS,
choose UNSTANDARDIZED to obtain the unstandardized coefficients of the discriminant function. These are
used to classify a new object in a discriminant analysis. Under MATRICES click on WITHIN GROUP
CORRELATION. Click on CONTINUE to return to the main dialog box.
• 8. Click on CLASSIFY on the lower part of the main dialog box. Select SUMMARY TABLE and LEAVE-ONE-OUT CLASSIFICATION under the heading DISPLAY in the smaller dialog box that appears. This gives you the classification table (also called the confusion matrix) that judges the accuracy of the discriminant model when applied to the input data points. Click on CONTINUE to return to the main dialog box.
• 9. Click on SAVE and then select PREDICTED GROUP MEMBERSHIP and DISCRIMINANT SCORES.
• 10. Click OK to get the discriminant analysis output.
NORMALITY TESTING
[SPSS output: Tests of Normality table (Kolmogorov-Smirnov and Shapiro-Wilk statistics) and Coefficients table with collinearity statistics (Tolerance and VIF).]
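A minimal sketch of a per-group Shapiro-Wilk normality check; the DataFrame and column names are hypothetical stand-ins for the actual data file.

```python
# Per-group Shapiro-Wilk test: run the normality check separately for
# each level of the grouping variable, as the assumption requires.
import pandas as pd
from scipy.stats import shapiro

df = pd.DataFrame({
    "annual_income": [18.2, 15.1, 20.3, 14.8, 19.5, 13.9, 16.0, 17.2],
    "group": [1, 2, 1, 2, 1, 2, 1, 2],
})

for grp, sub in df.groupby("group"):
    stat, p = shapiro(sub["annual_income"])     # H0: the sample is normal
    print(f"group {grp}: W = {stat:.3f}, p = {p:.3f}")
```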
Box’s test is used to determine whether two or more covariance matrices are
equal. Bartlett’s test for homogeneity of variance presented in Homogeneity
of Variances is derived from Box’s test. One caution: Box’s test is sensitive to
departures from normality. If the samples come from non-normal
distributions, then Box’s test may simply be testing for non-normality.
Suppose that we have m independent populations and we want to test the null
hypothesis that the population covariance matrices are all equal, i.e.
H0: Σ1 = Σ2 =⋯= Σm
TEST OF HOMOSCEDASTICITY: BOX'S M TEST
• 1. Box's M test tests the assumption of homogeneity of covariance matrices. This test is very sensitive to
meeting the assumption of multivariate normality.
• 2. Discriminant function analysis is robust even when the homogeneity of variances assumption is not
met, provided the data do not contain important outliers.
• 3. For our data, we conclude the groups do not differ in their covariance matrices, satisfying the
assumption of DA.
• 4. When n is large, small deviations from homogeneity will be found significant, which is why Box's M must be interpreted in conjunction with an inspection of the log determinants. Log determinants are a measure of the variability of the groups: larger log determinants correspond to more variable groups, and large differences in log determinants indicate groups that have different covariance matrices.
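For readers outside SPSS, below is a sketch of Box's M and its standard chi-square approximation under the stated assumptions; `groups` is a list of per-group data matrices, and the data are randomly generated placeholders.

```python
# Box's M: M = (N-g)*ln|S_pooled| - sum_i (n_i-1)*ln|S_i|, with the usual
# chi-square approximation M*(1-c) on p(p+1)(g-1)/2 degrees of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
groups = [rng.normal(size=(15, 2)), rng.normal(size=(15, 2))]

p = groups[0].shape[1]                       # number of variables
g = len(groups)                              # number of groups
ns = np.array([len(x) for x in groups])
N = ns.sum()

covs = [np.cov(x, rowvar=False) for x in groups]            # unbiased S_i
S_pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - g)

M = (N - g) * np.log(np.linalg.det(S_pooled)) \
    - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))) \
    * (np.sum(1.0 / (ns - 1)) - 1.0 / (N - g))
stat = M * (1 - c)                           # chi-square approximation
df = p * (p + 1) * (g - 1) // 2
print(f"Box's M = {M:.3f}, chi2 = {stat:.3f}, df = {df}, "
      f"p = {chi2.sf(stat, df):.3f}")        # p > .05 -> fail to reject H0
```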
DISCRIMINANT ANALYSIS: CORE OUTPUT: GROUP STATISTICS
As the two groups (National Brands/Local Brands) are to be compared on the basis of two characteristics of the respondents, namely Annual Income and Household Size, it is useful to compute their mean values to get an idea of the differences in their mean scores. The mean score of Annual Income for the NB group is 18.87, whereas for the LB group it is 15.05; the absolute difference in Annual Income is (18.87 − 15.05) = 3.82, whereas it is (4.8 − 2.5) = 2.3 for Household Size. Hence, initially, we can expect Annual Income to discriminate more strongly between the two groups.
DISCRIMINANT ANALYSIS: CORE OUTPUT: TESTING EQUALITY OF GROUP MEANS
Tests of Equality of Group Means
In the ANOVA table, the smaller the Wilks' Lambda, the more important the independent variable is to the discriminant function. Wilks' Lambda is significant by the F test for all independent variables. Here both F statistics are significant, so we can include both independent variables in our discriminant model, but Annual Income, with the smaller Wilks' Lambda, is more important in the discrimination. (Wilks' Lambda represents the proportion of unexplained variability in the dependent variable, the opposite of the R-square of multiple regression.)
TO CHECK THE MODEL FIT
Run a one-way ANOVA in SPSS (Compare Means) with the saved discriminant scores as the dependent variable and Brand as the factor variable; the same check is sketched below.
• Since Wilks' lambda is the ratio of the within-group sum of squares to the total sum of squares, its value should equal (18.0/38.041) = 0.473. (Stepwise discriminant analysis uses a variable selection method that chooses variables for entry into the equation on the basis of how much they lower Wilks' lambda; at each step, the variable that minimizes the overall Wilks' lambda is entered.)
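A sketch of this model-fit check, assuming the saved discriminant scores and the Brand factor are available; the scores below are simulated stand-ins, not the case-study data.

```python
# One-way ANOVA on the discriminant scores, plus Wilks' lambda computed
# directly as SS_within / SS_total.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
scores_nb = rng.normal(1.0, 1.0, 10)    # national-brand group scores
scores_lb = rng.normal(-1.0, 1.0, 10)   # local-brand group scores

F, p = f_oneway(scores_nb, scores_lb)   # one-way ANOVA on the scores

all_scores = np.concatenate([scores_nb, scores_lb])
ss_total = np.sum((all_scores - all_scores.mean()) ** 2)
ss_within = sum(np.sum((s - s.mean()) ** 2) for s in (scores_nb, scores_lb))
print(f"F = {F:.2f}, p = {p:.4f}")
print(f"Wilks' lambda = SS_within / SS_total = {ss_within / ss_total:.3f}")
```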
Wilks' Lambda
▪ Wilks’ Lambda is the ratio of within-groups sums of squares to the total sums of
squares. This is the proportion of the total variance in the discriminant scores not
explained by differences among groups.
▪ Wilks’ lambda takes a value between 0 and 1, and the lower the value of Wilks’ lambda, the higher the significance of the discriminant function.
▪ Therefore, a value close to 0 (zero) is the most preferred one. A lambda of 1.00 occurs when the observed group means are equal, while a small lambda indicates that the group means appear to differ. The associated significance value indicates whether the difference is significant.
SPSS CORE OUTPUT: Wilks’ Lambda
• The statistical test of significance for Wilks’ lambda is carried out with the chi-square-transformed statistic, which in our case is 12.721 with 2 degrees of freedom (the degrees of freedom equal the number of predictor variables) and a p value of 0.002. Since the p value is less than 0.05, the assumed level of significance, it is inferred that the discriminant function is significant and can be used for further interpretation of the results.
• Here, the lambda of 0.473 has a significant value (Sig. = 0.002); thus, the group means appear to differ.
SPSS OUTPUT: Eigenvalues
[SPSS output: Eigenvalues table with columns Function, Eigenvalue, % of Variance, Cumulative %, and Canonical Correlation.]
• The basic principle in the estimation of a discriminant function is that the variance between the groups relative to the variance within the groups should be maximized. The ratio of between-group variance to within-group variance is given by the eigenvalue, so a higher eigenvalue is always desirable; the eigenvalue also indicates the proportion of discriminating variance a function explains relative to the other functions.
• A large eigenvalue is associated with a strong function.
CANONICAL CORRELATION
• The last column of the table above indicates the canonical correlation, which is the simple correlation coefficient between the discriminant scores and the corresponding group membership (NB/LB). Its value here is 0.726, which readers may verify. The square of the canonical correlation is (0.726)² = 0.527, which means 52.7 per cent of the variance in the model discriminating between national-brand and local-brand acceptors is due to changes in the two predictor variables, namely Annual Income and Household Size.
SPSS OUTPUT: STRUCTURE MATRIX
• The structure matrix table shows the correlations of each variable with each discriminant function. The correlations serve like factor loadings in factor analysis; that is, variables are interpreted by identifying the largest absolute correlations associated with each discriminant function.

Structure Matrix (Function 1): ANNUALINCOME = .887, HOUSEHOLDSIZE = −.734
SPSS OUTPUT: Functions at Group Centroids
• These are the mean discriminant scores for each group; more specifically, the discriminant score for each group when the variable means (rather than the individual values for each subject) are entered into the discriminant equation: 1.001 for group 1.00 and −1.001 for group 2.00.
• Note that the two scores are equal in absolute value but have opposite signs.
SPSS OUTPUT: Unstandardized Coefficients
[SPSS output: unstandardized canonical discriminant function coefficients; e.g., HOUSEHOLDSIZE = −.490.]
• The relative magnitude of the coefficients of the discriminant function indicates the relative contribution of the variables in discriminating between the two groups.
CLASSIFICATION MATRIX AND CROSS VALIDATION
[SPSS output: Classification Results table; e.g., of the 10 cases in group 2.00, 2 were classified into group 1.00 and 8 into group 2.00.]
• The overall classificatory ability of the model, as measured by the hit ratio, is given in the Results below.
RESULTS
• In the last table, the starred values indicate that respondents 3 and 11 were wrongly classified into group 1.
• Respondents 3 and 11 actually belong to group 2.
• Respondent 13 was wrongly classified into group 2; it actually belongs to group 1.
• Hit ratio = (no. of correct predictions / total no. of cases) × 100 = (17/20) × 100 = 85%
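The hit-ratio computation can be reproduced from the classification matrix; the sketch below reconstructs the 20-case pattern described above (respondents 3, 11, and 13 misclassified) and uses scikit-learn for the matrix and the accuracy.

```python
# Hit ratio from a classification (confusion) matrix.
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = [1] * 10 + [2] * 10                 # 10 cases per group
predicted = [1] * 9 + [2] + [1] * 2 + [2] * 8   # 3 cases misclassified

print(confusion_matrix(actual, predicted))      # [[9 1]
                                                #  [2 8]]
print(f"hit ratio = {accuracy_score(actual, predicted):.0%}")   # 85%
```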
THREE GROUPS DISCRIMINANT ANALYSIS
• We are interested in the relationship between the three continuous variables and our
categorical variable. Specifically, we would like to know how many dimensions we would
need to express this relationship. Using this relationship, we can predict a classification
based on the continuous variables or assess how well the continuous variables separate
the categories in the classification. We will be discussing the degree to which the
continuous variables can be used to discriminate between the groups.
THREE GROUP LDA AND QDA
CASE STUDY FOR THREE-GROUP LDA
• A large international air carrier has collected data on employees in three different job classifications: 1) customer service personnel, 2) mechanics, and 3) dispatchers. The director of Human Resources wants to know whether these three job classifications appeal to different personality types. Each employee is administered a battery of psychological tests, which includes measures of interest in outdoor activity, sociability, and conservativeness.
DATA DESCRIPTION
• The data used in this example are from a data file, Discrim.xls, with 244 observations on
four variables. The variables include three continuous, numeric variables
(outdoor, social and conservative) and one categorical variable (job) with three
levels: 1) customer service, 2) mechanic and 3) dispatcher.
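A minimal sketch of fitting this three-group analysis with scikit-learn's LinearDiscriminantAnalysis; random placeholder data stand in for the Discrim.xls variables (reading the actual file is assumed, not shown).

```python
# Three-group LDA: with 3 groups and 3 predictors there are
# min(p, g - 1) = 2 discriminant dimensions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(244, 3))            # outdoor, social, conservative
y = rng.integers(1, 4, size=244)         # job: 1, 2 or 3

lda = LinearDiscriminantAnalysis(n_components=2)
scores = lda.fit_transform(X, y)         # the two discriminant dimensions
print(scores.shape)                      # (244, 2)
print(lda.explained_variance_ratio_)     # analogous to "% of Variance"
```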
SPSS OUTPUT: Descriptive Statistics
• From this output, we can see that some of the means of outdoor,
social and conservative differ noticeably from group to group in job. These differences
will hopefully allow us to use these predictors to distinguish observations in
one job group from observations in another job group. Next, we can look at the
correlations between these three predictors. These correlations will give us some
indication of how much unique information each predictor will contribute to the
analysis. If two predictor variables are very highly correlated, then they will be
contributing shared information to the analysis. Uncorrelated variables are likely
preferable in this respect. We will also look at the frequency of each job group.
SPSS OUTPUT: Correlations
SPSS COMMANDS
Analysis Case Processing Summary – This table summarizes the analysis dataset in terms of
valid and excluded cases. The reasons why SPSS might exclude an observation from the
analysis are listed here, and the number (“N”) and percent of cases falling into each category
(valid or one of the exclusions) are presented. In this example, all of the observations in the
dataset are valid.
SPSS OUTPUT: Group Statistics
Group Statistics – This table presents the distribution of observations into the three groups
within job. We can see the number of observations falling into each of the three groups. In this example,
we are using the default weight of 1 for each observation in the dataset, so the weighted number of
observations in each group is equal to the unweighted number of observations in each group.
UNSTANDARDIZED COEFFICIENTS
[SPSS output: canonical discriminant function coefficients for Functions 1 and 2.]
• Standardized Canonical Discriminant Function Coefficients – These coefficients can be used to calculate the discriminant score
for a given case. The score is calculated in the same manner as a predicted value from a linear regression, using the standardized
coefficients and the standardized variables. For example, let zoutdoor, zsocial and zconservative be the variables created by
standardizing our discriminating variables. Then, for each case, the function scores would be calculated using the following equations:
• Score1 = 0.379*zoutdoor – 0.831*zsocial + 0.517*zconservative
• Score2 = 0.926*zoutdoor + 0.213*zsocial – 0.291*zconservative
• The distribution of the scores from each function is standardized to have a mean of zero and a standard deviation of one. The magnitudes of these coefficients indicate how strongly the discriminating variables affect the score. For example, we can see that the standardized coefficient for zsocial in the first function is greater in magnitude than the coefficients for the other two variables; thus, social will have the greatest impact of the three on the first discriminant score. (A sketch applying these scoring equations follows this list.)
• Structure Matrix – This is the canonical structure, also known as canonical loading or discriminant loading, of the discriminant
functions. It represents the correlations between the observed variables (the three continuous discriminating variables) and the
dimensions created with the unobserved discriminant functions (dimensions).
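The sketch referenced above applies the two scoring equations to one case; the standardized values in `z` are made up for illustration.

```python
# Applying the standardized scoring equations:
# Score1 = 0.379*zoutdoor - 0.831*zsocial + 0.517*zconservative
# Score2 = 0.926*zoutdoor + 0.213*zsocial - 0.291*zconservative
import numpy as np

coef = np.array([
    [0.379, -0.831,  0.517],   # function 1: zoutdoor, zsocial, zconservative
    [0.926,  0.213, -0.291],   # function 2
])

z = np.array([0.5, -1.2, 0.3])           # one standardized case
score1, score2 = coef @ z
print(f"Score1 = {score1:.3f}, Score2 = {score2:.3f}")
```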
SPSS OUTPUT: Discriminant Function Output: Functions at Group Centroids
Functions at Group Centroids – These are the means of the discriminant function scores by group for
each function calculated. If we calculated the scores of the first function for each case in our dataset, and
then looked at the means of the scores by group, we would find that the customer service group has a
mean of -1.219, the mechanic group has a mean of 0.107, and the dispatch group has a mean of
1.420. We know that the function scores have a mean of zero overall, and we can check this by looking at the sum of the group means multiplied by the number of cases in each group: (85 × −1.219) + (93 × 0.107) + (66 × 1.420) ≈ 0, as verified numerically below.
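A quick numerical check of that zero-mean property:

```python
# Group centroids weighted by group sizes should sum to (approximately) zero.
import numpy as np

n = np.array([85, 93, 66])                      # group sizes
centroids = np.array([-1.219, 0.107, 1.420])    # function 1 centroids
print(np.dot(n, centroids))   # ~0.06, i.e. 0 up to rounding of the centroids
```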
SPSS OUTPUT: Predicted Classifications: Classification Processing Summary
• Predicted Group Membership – These are the predicted frequencies of groups from the analysis. The numbers going down each column indicate how many were correctly and incorrectly classified. For example, of the 89 cases that were predicted to be in the customer service group, 70 were correctly predicted and 19 were incorrectly predicted (16 cases were in the mechanic group and three cases were in the dispatch group).
• Original – These are the frequencies of groups found in the data. We can see from the row totals that 85 cases fall into the customer service group, 93 fall into the mechanic group, and 66 fall into the dispatch group. These match the results we saw earlier in the output for the frequencies command. Across each row, we see how many of the cases in the group are classified by our analysis into each of the different groups. For example, of the 85 cases that are in the customer service group, 70 were predicted correctly and 15 were predicted incorrectly (11 were predicted to be in the mechanic group and four were predicted to be in the dispatch group).
SPSS OUTPUT: Predicted Classifications Interpretation
• Count – This portion of the table presents the number of observations falling into the given
intersection of original and predicted group membership. For example, we can see in this
portion of the table that the number of observations originally in the customer
service group, but predicted to fall into the mechanic group is 11. The row totals of these
counts are presented, but column totals are not.
• % – This portion of the table presents the percent of observations originally in a given group
(listed in the rows) predicted to be in a given group (listed in the columns). For example, we
can see that the percent of observations in the mechanic group that were predicted to be in
the dispatch group is 16.1%. This is NOT the same as the percent of observations predicted
to be in the dispatch group that were in the mechanic group. The latter is not presented
in this table.
FINAL NOTES
•Note that the Standardized Canonical Discriminant Function Coefficients table and the
Structure Matrix table are listed in different orders.
•The number of discriminant dimensions is the number of groups minus 1. However, some
discriminant dimensions may not be statistically significant.
•In this example, there are two discriminant dimensions, both of which are statistically
significant.
FINAL NOTES (CONTINUED)
• The canonical correlations for the dimensions one and two are 0.72 and 0.49,
respectively.
• The standardized discriminant coefficients function in a manner analogous to
standardized regression coefficients in OLS regression.
• For example, a one standard deviation increase on the outdoor variable will result in a
0.32 standard deviation decrease in the predicted values on discriminant function 1.
• The canonical structure, also known as canonical loadings or discriminant loadings, represents the correlations between the observed variables and the unobserved discriminant functions (dimensions).
• The discriminant functions are a kind of latent variable and the correlations are loadings
analogous to factor loadings.
• Group centroids are the class (i.e., group) means of canonical variables.
THINGS TO CONSIDER
• The multivariate normal distribution assumption must hold for the response variables. This means that each of the dependent variables is normally distributed within groups, that any linear combination of the dependent variables is normally distributed, and that all subsets of the variables must be multivariate normal.
• Each group must have a sufficiently large number of cases.
• Different classification methods may be used depending on whether the variance-covariance
matrices are equal (or very similar) across groups.
• Non-parametric discriminant function analysis, called kth-nearest-neighbor, can also be performed; a sketch follows below.
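The sketch below shows the k-nearest-neighbor alternative with scikit-learn; the placeholder data mirror the shape of the three-group example, and k = 5 is an arbitrary choice.

```python
# Non-parametric alternative to discriminant analysis: classify each case
# by the majority group among its k nearest neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(244, 3))
y = rng.integers(1, 4, size=244)

knn = KNeighborsClassifier(n_neighbors=5)   # k is a tuning choice
knn.fit(X, y)
print(knn.predict(X[:5]))                   # predicted group memberships
```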
THANK YOU