Exercise 1 Instruction Pca
@Copyright 2008 Laboratory of Process Control and Automation, Helsinki University of Technology, Finland.
@Comments chenghui@cc.hut.fi

The purpose of this exercise is to show how to perform the PCA calculation with MATLAB for process monitoring. The exercise is organized as follows:
1. Practice MATLAB programming for PCA with a short example
2. Homework: fault detection for the data collected from the paper machine simulator

Note: The homework is obligatory. The first deadline is 2 weeks. After the homework has been corrected, those who do not get a satisfactory result are required to revise it until it is acceptable; the deadline for this revision is 1 month. You are allowed to take the examination only after you have passed all the homework assignments.
1. Practice MATLAB programming for PCA
1.1 The basic procedure for the PCA
1. Data preprocessing
   a. Zero-mean the training data set (faultless data)
2. Build model
   a. Calculate the covariance matrix
   b. Calculate the eigenvalues and eigenvectors and sort them in decreasing order
   c. Select the number of principal components and form the transform matrix
   d. Calculate the confidence limits for the Hotelling T2, SPE and individual scores
3. Test for the new data
   a. Scale the new data
   b. Calculate the score by using the transform matrix
   c. Calculate the indices: Hotelling T2, SPE
   d. Detect the change if an index or an individual score exceeds its confidence limit
   e. If any abnormality is detected, draw the contribution plot and find the responsible variables (the contributions of the variables to the Hotelling T2 index)
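The data set X consists of N observations and m variables. In our example m < N, and the data set is formed as follows, with the variables in the columns and the observations in the rows:

\[
X = \begin{bmatrix}
x_{11} & x_{21} & \cdots & x_{m1} \\
\vdots & \vdots &        & \vdots \\
x_{1N} & x_{2N} & \cdots & x_{mN}
\end{bmatrix}
\]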
Suppose the data set is

\[
X = \begin{bmatrix}
118 & 30 \\
119 & 32 \\
120 & 33 \\
120 & 31 \\
121 & 32 \\
122 & 34
\end{bmatrix}
\]

with 2 variables and 6 observations. Firstly, X is zero-meaned. The mean of X is given by
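\[ \bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ji}, \qquad j = 1, \ldots, m \] (Equation 1)

where $x_{ji}$ is the ith observation of variable j.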
(Equation 2)
The zero-mean data set has to be scaled by the standard deviation of each variable, as follows:
>> Xstd=std(X); % the standard deviation of each variable in X
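The preprocessing step for the example data can be sketched as follows. This is only an illustration: Xs and N are assumed variable names, and repmat is used so that the code does not depend on implicit expansion in newer MATLAB versions.

```matlab
% Training data of the example: 6 observations (rows), 2 variables (columns)
X = [118 30; 119 32; 120 33; 120 31; 121 32; 122 34];
N = size(X,1);                 % number of observations

Xmean = mean(X);               % row vector with the mean of each variable
Xstd  = std(X);                % row vector with the standard deviation of each variable

% Subtract the mean and divide by the standard deviation, column by column.
% Xs is an assumed name for the scaled training data.
Xs = (X - repmat(Xmean,N,1)) ./ repmat(Xstd,N,1);
```

For this data set the sketch gives Xmean = [120 32] and Xstd = [1.4142 1.4142], and every column of Xs has zero mean and unit standard deviation.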
1.2.2 Build model
b. Calculate the eigenvalues and eigenvectors for covariance matrix C and sort in decreasing order
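The eigenvalues of the covariance matrix C are calculated from

\[ \det(C - \lambda I) = 0 \] (Equation 4)

Put the eigenvalues in a matrix given by

\[
\Lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_m
\end{bmatrix}
\] (Equation 5)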
where the $\lambda_i$:s are in decreasing order, i.e. the biggest is in the first row, etc. The eigenvectors are calculated from

\[ C e_i = \lambda_i e_i \] (Equation 6)

where the $e_i$:s are the eigenvectors corresponding to the $\lambda_i$:s. The eigenvector matrix V is

\[ V = [e_1, e_2, e_3, \ldots, e_m] \] (Equation 7)
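One way to carry out steps a and b in MATLAB is sketched below for the example data. The names lamda and eigvec are chosen to match the code used later in this text; Xs, V, D, lambdaSorted and idx are assumed helper names. Because eig does not guarantee any particular ordering here, the eigenvalues are sorted explicitly.

```matlab
% Scaled training data of the example (zero mean, unit standard deviation)
X  = [118 30; 119 32; 120 33; 120 31; 121 32; 122 34];
N  = size(X,1);
Xs = (X - repmat(mean(X),N,1)) ./ repmat(std(X),N,1);

C = cov(Xs);                                   % covariance matrix of the scaled data

[V, D] = eig(C);                               % columns of V are eigenvectors, D is diagonal
[lambdaSorted, idx] = sort(diag(D), 'descend');% eigenvalues in decreasing order

lamda  = diag(lambdaSorted);                   % eigenvalue matrix (Equation 5)
eigvec = V(:, idx);                            % eigenvectors reordered to match (Equation 7)
```

For the scaled example data the covariance matrix is [1 0.8; 0.8 1], so the sorted eigenvalues are 1.8 and 0.2. The sign of each eigenvector is arbitrary and may differ between MATLAB versions.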
c. Select the number of principal components and form the transform matrix
Usually fewer than all the principal components are used, and there are several techniques for choosing how many to keep. One method is to look at the cumulative variance percentage (the cumulative sum of the variances captured by each PC) and to select the number of PCs at which, say, over 75% of the cumulative variance is captured. The variance captured by each PC is calculated from the eigenvalues:

\[ \text{Captured variance}(PC_i) = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \cdot 100\% \] (Equation 8)
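When the number of PCs (K) is selected, only the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_K$ and the eigenvectors $e_1, e_2, \ldots, e_K$ related to them are used. The matrices are:

\[
\Lambda_K = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_K
\end{bmatrix}
\] (Equation 9)

\[ V_K = [e_1, e_2, \ldots, e_K] \] (Equation 10)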
Cusm=cumsum(diag(lamda)) % the cumulative sum of the eigenvalues
Percentage=Cusm'/Cusm(end)*100; % the cumulative percentage of variance captured
npc = 1; % in this case one principal component is selected, capturing 90% of the variance
Vk = eigvec(:,1:npc); % transform matrix Vk
sigma = lamda(1:npc,1:npc); % eigenvalue matrix of the selected principal components
d. Calculate the confidence limit for the Hotelling T2, SPE and individual scores
For the Hotelling T2, the limit for the confidence level α = 95% is

\[ T^2_{lim} = \frac{K(N-1)}{N-K} F(K, N-K, \alpha) \] (Equation 11)

where F(K, N-K, α) corresponds to the probability point on the F-distribution with (K, N-K) degrees of freedom and confidence level α, K is the number of principal components, and N is the number of observations.
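The Hotelling T2 limit can be evaluated as in the following sketch, which assumes that the Statistics Toolbox function finv is available. T2lim and alpha are assumed variable names; N = 6 and K = 1 are the values of the example.

```matlab
N     = 6;      % number of training observations in the example
K     = 1;      % number of selected principal components
alpha = 0.95;   % confidence level

% Limit for the Hotelling T2: K(N-1)/(N-K) times the F-distribution point
T2lim = K*(N-1)/(N-K) * finv(alpha, K, N-K);
```

With these values the result is approximately 6.61.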
For an individual score, the limit for the confidence level α = 95% is

\[ \mathrm{Conf}(PC_i) = \sqrt{\lambda_i}\; t(N-1, \alpha) \] (Equation 12)

where t(N-1, α) corresponds to the probability point on the single-sided t-distribution with N-1 degrees of freedom and tail area (1-α)/2.
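By MATLAB,
>> T2lim_1 = sqrt(sigma(1,1))*tinv(0.975,N-1); % for the first principal component

For the SPE index, the limit for the confidence level α = 95% is

\[ SPE_{lim} = \theta_1 \left[ \frac{c_\alpha h_0 \sqrt{2\theta_2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \]

where

\[ \theta_i = \sum_{j=K+1}^{m} \lambda_j^i, \qquad i = 1, 2, 3 \]

\[ h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2} \]

and $c_\alpha$ is the value of the standard normal distribution corresponding to the confidence level α. The quantities $\theta_1$, $\theta_2$ and $\theta_3$ are called zeta1, zeta2 and zeta3 in the MATLAB code.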
By MATLAB,
>> zeta1=lamda(2,2); % the second eigenvalue (the only one left out of the model here, since m=2 and K=1)
>> zeta2=lamda(2,2)^2; % the square of the second eigenvalue
>> zeta3=lamda(2,2)^3; % the third power of the second eigenvalue
>> h0=1-2*zeta1*zeta3/3/zeta2^2;
>> ca=norminv(0.95, 0, 1); % ca is the value of the standard normal distribution for the confidence level 95%
>> SPElim=zeta1*(ca*h0*sqrt(2*zeta2)/zeta1+1+zeta2*h0*(h0-1)/zeta1^2)^(1/h0);
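1.2.3 Test for the new data
a. Scale the new data

The new data are scaled with the mean and the standard deviation of the training data:

\[ S_{x_{new}} = \frac{x_{new} - \bar{x}_{train}}{\sigma_{train}} \] (Equation 15)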
By MATLAB,
>> Sxnew1=(Xnew1-Xmean')./Xstd'; % Xnew1 is assumed to be a column vector; Xmean and Xstd are row vectors, so they are transposed
>> Sxnew2=(Xnew2-Xmean')./Xstd';
>> Sxnew3=(Xnew3-Xmean')./Xstd';
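The Hotelling T2 index is calculated from the scores $t = V_K^T S_{x_{new}}$:

\[ T^2 = \sum_{i=1}^{K} \frac{t^2(i)}{\sigma_i^2} \] (Equation 17)

where t(i) is the ith score and $\sigma_i^2 = \lambda_i$ is its variance.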
By MATLAB,
>> T21=Sxnew1'*Vk*inv(sigma)*Vk'*Sxnew1;
>> T22=Sxnew2'*Vk*inv(sigma)*Vk'*Sxnew2;
>> T23=Sxnew3'*Vk*inv(sigma)*Vk'*Sxnew3;

For the calculation of the SPE, refer to the following figure.
The difference between original data and the uncompressed data is the residual:
(Equation 18)
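A minimal sketch of the SPE calculation for one scaled observation is given below. It assumes that the SPE is the sum of the squared residuals. The observation Sxnew is a hypothetical value, not one of the data sets of this exercise; Vk is the one-component transform matrix of the example (up to sign); t, xhat, e and SPE are assumed names.

```matlab
Vk    = [1; 1]/sqrt(2);   % transform matrix with one principal component
Sxnew = [0.5; -0.3];      % hypothetical scaled observation, as a column vector

t    = Vk' * Sxnew;       % score of the observation (compressed data)
xhat = Vk * t;            % uncompressed data
e    = Sxnew - xhat;      % residual: original data minus uncompressed data
SPE  = e' * e;            % squared prediction error
```

For this hypothetical observation the residual is [0.4; -0.4], so SPE = 0.32.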
d. Detect the change if an index or an individual score exceeds its confidence limit
Compare the calculated scores or indices with the corresponding limits to detect a fault in the process. For example, in our case the limit for the Hotelling T2 is 6.61 and the T2 indices of the three new observations are 0.556, 5 and 11.25. The third observation was therefore collected while the process was faulty, whereas the other two represent the healthy process.
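The comparison can be written in one line, as in this sketch that uses the numbers quoted above (T2 and faulty are assumed names):

```matlab
T2lim  = 6.61;               % Hotelling T2 limit of the example
T2     = [0.556 5 11.25];    % T2 indices of the three new observations
faulty = T2 > T2lim          % logical vector; 1 marks an index above the limit
```

Only the third element of faulty is 1.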
e. The contribution plot can be drawn in order to tell the responsible variables
In response to T2 violations we can obtain contribution plots as follows:
1. For one observation xnew, find the r cases where the absolute value of the score exceeds its limit, abs(t(i)) > limit(t(i)), where t(i) is the ith element of the vector t.
2. Calculate the contribution of each variable $x_j$ to the out-of-control scores t(i):

\[ cont_{i,j} = \frac{t(i)}{\lambda_i} v_{i,j} x_j \]

where $v_{i,j}$ is the (i, j)th element of the matrix $V_K^T$.
3. When $cont_{i,j}$ is negative, set it equal to zero.
4. Calculate the total contribution of the jth process variable:

\[ CONT_j = \sum_{i=1}^{r} cont_{i,j} \]
In our case, the third new observation violates the limit of the T2 index, and its first score violates its own limit. We therefore calculate the contributions of the two variables to this score, where t3 is the score of the third new observation.

By MATLAB,
>> Cont11=t3/sigma(1,1)*Vk(1,1)*Sxnew3(1,1); % contribution of the first variable of the third data set to its first score; Vk(1,1) is the (1,1) element of Vk'
>> Cont12=t3/sigma(1,1)*Vk(2,1)*Sxnew3(2,1); % contribution of the second variable of the third data set to its first score; Vk(2,1) is the (1,2) element of Vk'

Since there is only one score, the contribution plot of the two variables to the score is drawn below.
[Figure: contribution plot of the two variables to the first score]
The contribution plot shows the second variable is the reason for the abnormality of the process.
Homework
Some data were collected from the papermaking process simulator. 11 variables with 1100 observations were selected for PCA monitoring. After the first 300 observations, one fault occurred in the process and lasted for 400 observations. You can find the data on the web; the data file is described as follows:

Column   Description
1        Fiber consistency in the wire pit
2        Fiber consistency in the headbox
3        Fiber consistency in the machine chest
4        Stock mass flow rate from machine screen to headbox
5        Pressure in the headbox
6        Basis weight valve position
7        Headbox feed pump speed
8        The valve 1 position for dryer group
9        Pressure for the dryer group 5 first cylinder
10       Basis weight
11       Moisture

Your tasks:
1. Build the PCA model with the faultless data (the first 300 observations) using MATLAB. (Note: when scaling the data, subtract the mean and divide by the respective standard deviation.)
2. Calculate the limits for the Hotelling T2, the SPE and the individual scores (confidence level = 95% for Hotelling T2 and SPE, but 98% for the individual scores).
3. Use the whole data set as the new data to detect the fault. (Note: scale the data with the mean and standard deviation obtained from the training data.)
4. For one faulty observation, draw the contribution plot for the T2 index.
5. Please submit your code (m-file) and your report (describing how you built your model and selected the principal components) to Di Zhang by email.
NOTE: The formulas given for the T2 and SPE indices only work when the new data is a single column vector (not a matrix or a row vector). You therefore need to reorganize the data set before you calculate T2 and SPE, and you need to calculate T2 and SPE for each observation individually by using a loop in the program.
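The loop described in the note might be organized as in the following sketch. It is only an illustration: Xall is a hypothetical stand-in for the whole data set, and the model quantities Xmean, Xstd, Vk and sigma are set to the values of the two-variable example in section 1 rather than to a model built from the homework data. The SPE is assumed to be the sum of the squared residuals.

```matlab
% Model of the two-variable example (replace with your own model in the homework)
Xmean = [120 32];              % mean of the training data (row vector)
Xstd  = [sqrt(2) sqrt(2)];     % standard deviation of the training data (row vector)
Vk    = [1; 1]/sqrt(2);        % transform matrix with one principal component
sigma = 1.8;                   % eigenvalue matrix of the selected component

% Hypothetical new data: one observation per row
Xall = [121 33; 119 31; 124 37];

nObs = size(Xall,1);
T2   = zeros(nObs,1);          % Hotelling T2 index of each observation
SPE  = zeros(nObs,1);          % SPE index of each observation

for k = 1:nObs
    x = ((Xall(k,:) - Xmean) ./ Xstd)';   % scaled observation as a column vector
    t = Vk' * x;                          % scores
    T2(k) = t' * (sigma \ t);             % same as t'*inv(sigma)*t
    e = x - Vk*t;                         % residual
    SPE(k) = e' * e;
end
```

The vectors T2 and SPE can then be compared element by element with their limits, or plotted against the observation number together with the limit.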