
Principal Component Analysis Tutorial 101 With NumXL

This is the first entry in what will become an ongoing series on principal components analysis (PCA). In this tutorial, we will start with the general definition, motivation and applications of PCA, and then use NumXL to carry out such an analysis. Next, we will closely examine the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to a more advanced treatment in future issues. For more information, visit us at www.numxl.com

Uploaded by

NumXL Pro
Copyright
© Attribution Non-Commercial (BY-NC)

PCA 101 Tutorial, Spider Financial Corp, 2013

Tutorial: Principal Component 101
This is the first entry in what will become an ongoing series on principal components analysis (PCA). In this tutorial, we will start with the general definition, motivation and applications of PCA, and then use NumXL to carry out such an analysis. Next, we will closely examine the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to a more advanced treatment in future issues.
In this tutorial, we will use the socioeconomic data provided by Harman (1976). The five variables represent total population (Population), median school years (School), total employment (Employment), miscellaneous professional services (Services), and median house value (House Value). Each observation represents one of twelve census tracts in the Los Angeles Standard Metropolitan Statistical Area.
Data Preparation
First, let's organize our input data. We place the values of each variable in a separate column, and each observation (i.e. census tract in LA) on a separate row.

Note that the scales (i.e. magnitudes) of the variables vary significantly, so any analysis of the raw data will be biased toward the variables with a larger scale and will downplay the effect of those with a smaller scale.
To better understand the problem, let's compute the correlation matrix for the 5 variables:
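Outside Excel, the same correlation matrix can be sketched with NumPy. The example below is only an illustration under stated assumptions: the data are synthetic (they are not the Harman (1976) values), and the column scales are made up to mimic variables of very different magnitude.

```python
import numpy as np

# Synthetic stand-in for the worksheet: 12 rows (tracts) x 5 columns (variables).
# These numbers are NOT the Harman (1976) data; they only mimic columns on
# very different scales that share a common driver.
rng = np.random.default_rng(0)
driver = rng.normal(size=12)
scales = np.array([3000.0, 1.5, 1200.0, 100.0, 6000.0])   # assumed magnitudes
X = driver[:, None] * scales + rng.normal(size=(12, 5)) * scales * 0.5

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

Because a correlation coefficient is scale-free, the large differences in magnitude between the columns do not affect this matrix.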


The five (5) variables are highly correlated, so one may wonder:
1. If we were to use those variables to predict another variable, do we need all 5 variables?
2. Are there hidden forces (drivers or other factors) that move those 5 variables?
In practice, we often encounter correlated data series: commodity prices in different locations, futures prices for different contracts, stock prices, interest rates, etc.
In plain English, what is principal component analysis (PCA)?
PCA is a technique that takes a set of correlated variables and linearly transforms those variables into a set of uncorrelated factors.
To explain it further, you can think about PCA as an axis-system transformation. Let's examine this plot of two correlated variables:


Simply put, in the (X, Y) Cartesian system, the data points are highly correlated. By transforming (rotating) the axes into (Z, W), the data points are no longer correlated.
In theory, PCA finds the transformation (of the axes) under which the data points look uncorrelated with respect to the new axes.
OK, now where are the principal components?
To transform the data points from the (X, Y) Cartesian system to (Z, W), we need to compute the z and w values of each data point:

$$
\begin{aligned}
z_i &= \alpha_1 x_i + \beta_1 y_i \\
w_i &= \alpha_2 x_i + \beta_2 y_i
\end{aligned}
$$

In effect, we are replacing the input variables $(x_i, y_i)$ with $(z_i, w_i)$. The $(z_i, w_i)$ values are the ones we refer to as the principal components.
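The axis rotation can be tried out numerically. The sketch below is a minimal illustration, not NumXL's procedure: it assumes synthetic two-variable data, centers it (which does not change any correlation), and takes the rotation weights from the eigenvectors of the sample covariance matrix. The names `alpha1`, `beta1`, `alpha2` and `beta2` stand for the weights of x and y in z and w.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)   # y is strongly correlated with x

# Center the data, then take the eigenvectors of the 2x2 covariance matrix.
data = np.column_stack([x - x.mean(), y - y.mean()])
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigvals)[::-1]           # largest variance first
eigvecs = eigvecs[:, order]

# Row 0 holds the weights applied to x, row 1 the weights applied to y;
# column 0 belongs to z and column 1 to w.
(alpha1, alpha2), (beta1, beta2) = eigvecs
z = alpha1 * data[:, 0] + beta1 * data[:, 1]
w = alpha2 * data[:, 0] + beta2 * data[:, 1]

print("corr(x, y):", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(z, w):", round(np.corrcoef(z, w)[0, 1], 3))
```

In this sketch the correlation between z and w comes out as zero up to floating-point error, while most of the variation ends up in z.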


Alright, how do we reduce the dimensions of the variables?
When we transform the values of the data points $(x_i, y_i)$ into the new axis system $(z_i, w_i)$, we may find that a few axes capture more of the values' variation than others. For instance, in our example above, we may claim that all $w_i$ values are plain zero and don't really matter.
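$$
\begin{aligned}
x_i &= \gamma_1 z_i + \delta_1 w_i \\
y_i &= \gamma_2 z_i + \delta_2 w_i
\end{aligned}
\quad\Longrightarrow\quad
\begin{aligned}
x_i &= \gamma_1 z_i \\
y_i &= \gamma_2 z_i
\end{aligned}
$$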

In effect, the two-dimensional system $(z_i, w_i)$ is reduced to a one-dimensional system ($z_i$).
Of course, for this example, dropping the W factor distorts our data, but for higher dimensions it may not be so bad.
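To get a feel for how much is lost when W is dropped, one can rebuild x and y from z alone and measure the error. The following is a rough sketch on synthetic, centered data (an assumption made for illustration); `lost` is a hypothetical name for the share of total variation that disappears.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.column_stack([x - x.mean(), y - y.mean()])

eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # first column = direction of z

scores = data @ eigvecs            # columns: z, w
z = scores[:, 0]

# Rebuild (x, y) from z alone, i.e. pretend every w_i is zero.
approx = np.outer(z, eigvecs[:, 0])

# Share of the total variation that is lost by dropping w.
lost = np.sum((data - approx) ** 2) / np.sum(data ** 2)
print(f"variation lost by dropping w: {lost:.1%}")
```

The lost share equals the variance of w divided by the total variance, which is why components with small variance are the natural candidates for removal.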
Which component should we drop?
In practice, we order the components (aka factors) in terms of their variance (highest first) and examine the effect of removing the ones of lower variance (rightmost) in an effort to reduce the dimension of the data set with minimal loss of information.
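One simple way to act on this ordering is to keep the highest-variance components until a target share of the total variance is reached. The helper below is a hypothetical sketch (the function name, the 90% default and the example variances are all assumptions, not part of NumXL).

```python
import numpy as np

def components_to_keep(variances, target=0.90):
    """Return how many of the highest-variance components reach `target`."""
    ordered = np.sort(np.asarray(variances, dtype=float))[::-1]   # highest first
    cumulative = np.cumsum(ordered) / ordered.sum()
    # First position where the cumulative share reaches the target,
    # capped at the total number of components.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, ordered.size)

# Made-up component variances, for illustration only.
example = [0.2, 3.0, 0.1, 1.5, 0.2]
print(components_to_keep(example, target=0.95))   # prints 4
```

With these made-up variances the cumulative shares are roughly 60%, 90%, 94%, 98% and 100%, so four components are needed to pass 95%.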
Why should we care about principal components?
A risk manager can quantify their overall risk in terms of a portfolio's aggregate exposure to a handful of drivers, instead of tens or hundreds of correlated securities prices. Furthermore, designing an effective hedging strategy is vastly simplified.
For traders, quantifying trades in terms of their sensitivities (e.g. delta, gamma, etc.) to those drivers gives them options to substitute (or trade) one security for another, construct a trading strategy, hedge, synthesize a security, etc.
A data modeler can reduce the number of input variables with minimal loss of information.
Process
Now we are ready to conduct our PCA. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the PCA icon in the NumXL tab (or toolbar).

The Regression Wizard pops up.


Select the cells range for the five input variable values.
Notes:
1. The cells range includes (optionally) the heading (Label) cell, which would be used in the output tables where it references those variables.
2. The input variables (i.e. X) are already grouped by columns (each column represents a variable), so we don't need to change that.
3. Leave the Variable Mask field blank for now. We will revisit this field in later entries.
4. By default, the output cells range is set to the current selected cell in your worksheet.
Finally, once we select the X and Y cells range, the Options and Missing Values tabs become available (enabled).
Next, select the Options tab.


Initially, the tab is set to the following values:
- Standardize Input is checked. This option in effect replaces the values of each variable with its standardized version (i.e. subtract the mean and divide by the standard deviation). This option overcomes the bias issue when the values of the input variables have different magnitude scales. Leave this option checked.
- Principal Component Output is checked. This option instructs the wizard to generate PCA-related tables. Leave it checked.
- Under Principal Component, check the Values option to display the values for each principal component.
- The significance level (aka $\alpha$) is set to 5%.
- The Input Variables option is unchecked. Leave it unchecked for now.
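The standardization described under Standardize Input can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the two columns are made-up numbers, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the tutorial does not say which version NumXL applies.

```python
import numpy as np

def standardize(X, ddof=1):
    """Subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)

# Two made-up columns on very different scales.
X = np.array([[5000.0, 12.0],
              [1000.0, 10.0],
              [3000.0,  9.0],
              [9000.0, 13.0]])
Z = standardize(X)
print(np.round(Z, 3))
```

After this step every column has mean zero and unit standard deviation, so no variable dominates the analysis merely because of its units.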
Now, click on the Missing Values tab.


In this tab, you can select an approach to handle missing values in the data set (X and Y). By default, any missing value found in X or Y in any observation would exclude the observation from the analysis.
This treatment is a good approach for our analysis, so let's leave it unchanged.
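The default treatment (dropping any observation that has a missing value) is straightforward to mimic. The sketch below assumes missing entries are encoded as NaN; the function name and the small array are hypothetical.

```python
import numpy as np

def drop_incomplete_rows(X):
    """Keep only the observations (rows) that have no missing (NaN) entry."""
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    return X[complete]

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],   # one missing value -> whole row is excluded
              [7.0, 8.0, 9.0]])
clean = drop_incomplete_rows(X)
print(clean)
```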
Now, click OK to generate the output tables.

Analysis
1. PCA Statistics


1. The principal components are ordered (and named) according to their variance in descending order, i.e. PC(1) has the highest variance.
2. In the second row, the proportion statistics explain the percentage of variation in the original data set (5 variables combined) that each principal component captures or accounts for.
3. The cumulative proportion is a measure of the total variation explained by the principal components up to the current component.
Note: In our example, the first three PCs account for 94.3% of the variation of the 5 variables.
4. Note that the sum of the variances of the PCs should yield the number of input variables, which in this case is five (5).
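The statistics in this table can be reproduced in outline from the eigenvalues of the correlation matrix. The sketch below uses synthetic data (not the Harman values) and is not NumXL's implementation; it only shows where the variance, proportion and cumulative proportion rows come from, and why the variances sum to the number of variables when the inputs are standardized.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 12 x 5 data set (NOT the Harman data): a shared driver plus noise.
X = rng.normal(size=(12, 1)) @ rng.normal(size=(1, 5)) + 0.5 * rng.normal(size=(12, 5))

corr = np.corrcoef(X, rowvar=False)          # PCA on standardized inputs
variances = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, highest first
proportion = variances / variances.sum()
cumulative = np.cumsum(proportion)

for k, (v, p, c) in enumerate(zip(variances, proportion, cumulative), start=1):
    print(f"PC({k}): variance={v:.3f}  proportion={p:.1%}  cumulative={c:.1%}")
print("sum of variances:", round(float(variances.sum()), 6))
```

The sum equals the trace of the correlation matrix, and each diagonal entry of that matrix is one, hence the total of five.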
2. Loadings
In the loadings table, we outline the weights of a linear transformation from the input variable (standardized) coordinate system to the principal components.

For example, the linear transformation for $PC_1$ is expressed as follows:

$$PC_1 = 0.227\,X_1 + 0.503\,X_2 + 0.339\,X_3 + 0.56\,X_4 + 0.516\,X_5$$
Note:
1. The squared loadings in each column add up to one:

$$\sum_{j=1}^{5} \beta_j^2 = 1$$

[Figure: Loadings of the five input variables (population, median school yrs, total employment, misc professional services, median house value) in PC(1), PC(2) and PC(3); vertical axis from -80% to 80%.]


2. In the graph above, we plotted the loadings for our input variables in the first three components.
3. The median school years, misc. professional services and median house value variables have comparable loadings in PC(1); next comes the total employment loading and, finally, population. One may propose this as a proxy for the wealth/income factor.
4. Interpreting the loadings for the input variables in the remaining components proves to be more difficult, and requires a deeper level of domain expertise.
5. Finally, computing the input variables back from the PCs can be easily done by applying the weights in the row instead of the column. For example, the population factor is expressed as follows:
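$$X_1 = 0.227\,PC_1 \pm 0.657\,PC_2 \pm 0.64\,PC_3 \pm 0.308\,PC_4 \pm 0.109\,PC_5$$
(Only the magnitudes of the weights are legible here; the sign of each term is the one shown in the population row of the loadings table.)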
6. We'll discuss the PC loadings later in this tutorial.
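The two properties of the loadings noted above (unit sum of squares, and recovering the inputs by reading the weights along a row) can be checked numerically. The sketch below is an illustration on synthetic data and assumes the loadings are the unit-length eigenvectors of the correlation matrix; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic 12 x 5 data set (NOT the Harman data).
X = rng.normal(size=(12, 1)) @ rng.normal(size=(1, 5)) + 0.5 * rng.normal(size=(12, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized inputs

eigvals, loadings = np.linalg.eigh(np.corrcoef(X, rowvar=False))
loadings = loadings[:, np.argsort(eigvals)[::-1]]    # column j = loadings of PC(j+1)

pcs = Xs @ loadings            # weights read down a column: inputs -> PCs
X_back = pcs @ loadings.T      # weights read along a row: PCs -> inputs

print(np.round(np.sum(loadings ** 2, axis=0), 6))    # squared loadings per column
print(np.allclose(X_back, Xs))                       # inputs recovered exactly
```

Because the loadings matrix is orthogonal, its transpose is its inverse, which is why the row weights undo the column weights.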
3. Principal Component Values

In the PC values table, we calculate the transformation output value for each dimension (i.e. component), so the 1st row corresponds to the 1st data point, and so on.
The variance of each column matches the value in the PCA statistics table. Using Excel, compute the biased version of the variance function (VARA).
By definition, the values in the PCs are uncorrelated. To verify, we can calculate the correlation matrix:
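Both checks (column variances and zero cross-correlation) can be sketched outside Excel as follows. The data are synthetic, and the choice of `ddof=1` for both the standardization and the variance is an assumption; if the two steps use different divisors, the column variances differ from the eigenvalues by a constant factor of (n-1)/n or its inverse.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic 12 x 5 data set (NOT the Harman data).
X = rng.normal(size=(12, 1)) @ rng.normal(size=(1, 5)) + 0.5 * rng.normal(size=(12, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, vecs = eigvals[order], vecs[:, order]

pc_values = Xs @ vecs          # one row per observation, one column per PC

# Same divisor (n - 1) in the standardization and in the variance,
# so the column variances reproduce the eigenvalues.
col_var = pc_values.var(axis=0, ddof=1)
pc_corr = np.corrcoef(pc_values, rowvar=False)

print(np.round(col_var, 4))
print(np.round(pc_corr, 4))
```

The printed correlation matrix is the identity up to rounding, confirming that the component values are mutually uncorrelated.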


Conclusion
In this tutorial, we converted a set of five correlated variables into five uncorrelated variables without any loss of information.
Furthermore, we examined the proportion (and cumulative proportion) of each component as a measure of the variance captured by each component, and we found that the first three factors (components) account for 94.3% of the five variables' variation, and the first four components account for 98%.
What do we do now?
One of the applications of PCA is dimension reduction; as in, can we drop one or more components and yet retain the information in the original data set for modeling purposes?
In our second entry, we will look at the variation of each input variable captured by principal components (micro-level) and compute the fitted values using a reduced set of PCs.
We will cover this particular issue in a separate entry of our series.
