Outlayer PDF
The papers contained in the SESUG proceedings are the property of their authors, unless otherwise stated. Do not reprint without permission. SESUG papers are distributed freely as a courtesy of the Institute for Advanced Analytics (http://analytics.ncsu.edu). Paper SA11
INTRODUCTION
You often want to compare data sets. You can't really do this point-by-point, so you summarize each set individually and compare their descriptive statistics on the aggregate. By looking only at the summary, you are making an assumption that all observations are related in some way. How do you verify this assumption? By testing for outliers. Values that are spurious or unrelated to the others must be excluded from summarization.

In this paper we present a simple, efficient, and gentle macro for filtering a data set. There are many techniques, and the subject of robust statistics is modern and rich though not at all simple. Chauvenet's criterion is easy to understand, can be quickly computed for a billion rows of data, and is even gentle enough to be used on a tiny set of only ten points.

We have seen that Chauvenet's criterion is used in astronomy, nuclear technology, geology, epidemiology, molecular biology, radiology and many fields of physical science. It is widely used by government laboratories, industries, and universities. Although Chauvenet's criterion is not currently used in clinical trials, we would like to explore its possible application to the trials environment as well.
THE FILTERING PROCESS
You want some way to identify which observations in your data set need closer study. It's not appropriate to simply throw away or delete an observation; you must keep it around to look at later. The picture is as follows.
Our macro does this easily. To scrutinize variable x and split origdata into good and bad pieces, use this macro call:

    %chauv(origdata, x, good=theGood, bad=theBad);

Then, for example,

    proc means data=theGood; run;   * summarize the good ... *;
    proc print data=theBad; run;    * ... and show us the bad *;
IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort step required can be costly, especially when the dataset is huge. For example, a data set with a billion observations takes half an hour to calculate the percentiles, even with the piecewise-parabolic algorithm, while it takes only six minutes to generate the distribution moments μ and σ.

                              Computation Time
    Number of Observations    Median, percentiles    Mean, Std
    1,000,000,000             32:31.43               6:07.45
    100,000,000                3:13.79                 30.15
    10,000,000                   18.62                  1.35
    1,000,000                     1.89                  0.14
    100,000                       0.20                  0.01
    10,000                        0.03                  0.00
    1,000                         0.01                  0.00
    100                           0.00                  0.01
Sorting and observation numbering cannot be done in SQL because the rows of a table are independent of one another. You cannot compute the median in a database query. Neither can you use the ORDER BY clause in a sub-query. Therefore, we must seek an alternative outlier test which can be performed within a database query environment.
WHO IS CHAUVENET
A character in a play: the auntie of someone whose best friend is a pooka. Only Mrs. Chauvenet knows the truth about Elwood P. Dowd. But this observation is itself an outlier.

American mathematician William Chauvenet (1820-1870) is best known for his clear and simple writing style and pioneering contributions to the U.S. Naval Academy. He mathematically verified the first bridge spanning the Mississippi River, and was the second chancellor of Washington University in St. Louis. In his honor, each year since 1925 a well-written mathematical article receives the Chauvenet Prize.
CHAUVENET'S CRITERION
If the expected number of measurements at least as bad as the suspect measurement is less than 1/2, then the suspect measurement should be rejected.
PROCEDURE
Let's assume you have a data set with numeric variable x. Suppose there are n observations in your dataset. You want to throw away all observations which are "not good enough". How do you do this? Remember that in clinical practice no point is considered "not good enough", so the subject of outliers does not apply there.

1. Calculate μ and σ
2. If n × erfc( |x_i - μ| / σ ) < ½ then reject x_i
3. Repeat steps 1 and 2 until step 2 passes or too many points are removed
4. Report final μ, σ, and n
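The four steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the macro name %chauv_sketch, the work data sets _work, _stats, _flag and _bad, and the maxloop= stop limit (a cap on the number of passes rather than on the number of points removed) are all hypothetical and are not the authors' implementation. The test is written literally as in step 2; whether the erfc argument should be rescaled for a particular distribution is left exactly as stated there.

```sas
/* Sketch only: every name below is hypothetical, not the authors' macro. */
%macro chauv_sketch(indat, var, maxloop=10);
  %local pass nbad;
  %let pass = 0;
  %let nbad = 1;

  /* start from a working copy; clear rejects left by an earlier run */
  proc datasets lib=work nolist nowarn; delete _bad; quit;
  data _work; set &indat.; run;

  %do %while (&nbad. > 0 and &pass. < &maxloop.);
    %let pass = %eval(&pass. + 1);
    proc sql noprint;
      /* step 1: n, mean and sigma of the points still kept */
      create table _stats as
        select count(&var.) as n, mean(&var.) as mu, std(&var.) as sigma
        from _work;
      /* step 2: a point is good when n*erfc(|x-mu|/sigma) >= 1/2 */
      create table _flag as
        select w.*,
               (s.n * erfc(abs(w.&var. - s.mu) / s.sigma) >= 0.5) as _isgood
        from _work as w, _stats as s;
      select count(*) into :nbad trimmed from _flag where _isgood = 0;
    quit;
    /* keep the rejects for later study, then continue with the rest (step 3) */
    proc append base=_bad data=_flag(where=(_isgood = 0)) force; run;
    data _work (drop=_isgood); set _flag; where _isgood = 1; run;
  %end;

  /* step 4: report the final mean, sigma and n */
  proc means data=_work n mean std; var &var.; run;
%mend chauv_sketch;
```

A call such as %chauv_sketch(origdata, x, maxloop=5) would leave the retained points in _work and the rejected ones in _bad. The sketch assumes the variable has a non-zero standard deviation.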
When the dust settles, you have two data sets: the set of all good data points, and the set of "bad" points. Although most often we don't care about bad data points, sometimes the bad points tell us much more than the good points do. We must not be too hasty to forget about the bad data points completely, but keep them aside for later and careful evaluation.

Some questions to ask about the "bad" data points, which we will talk about later:
1. Why were these particular points excluded?
2. What do all the bad points have in common?
EXAMPLE
Suppose we have 14 measurements of some parameter, shown below in Table 1. It takes two iterations of the Chauvenet procedure to eliminate all the "bad" data values. The first pass marks 2 values as bad: 29.87 and 25.71. Then, the second pass marks another value bad: 20.46. Further applications of Chauvenet don't mark any more points. As "bad" data points are excluded, notice how the standard deviation significantly improves from 8.77 to 5.17 to 3.38.

Table 1
              Original    Pass #1    Pass #2
               8.02       .          .
               8.16       .          .
               3.97       .          .
               8.64       .          .
               0.84       .          .
               4.46       .          .
               0.81       .          .
               7.74       .          .
               8.78       .          .
               9.26       .          .
              20.46       .          outlier    (shielded outlier)
              29.87       outlier    .
              10.38       .          .
              25.71       outlier    .
    avg:      10.51       7.63       6.46
    stdev:     8.77       5.17       3.38
    n:        14          12         11

The third outlier value, caught and excluded in Pass #2, is called a shielded outlier. At first, its value is small enough -- or close enough to the mean -- to be considered good. Only when the most extreme values are removed does this next-largest value become noticeable. As we will see later, each removal of a data point "lightens the mass" of a distribution. Smaller sample sizes require their values to be closer together. The shielding effect produced by very large values is precisely why we must perform an outlier test iteratively.
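The first pass over these measurements can be reproduced with a single query. The sketch below is not the authors' macro: the data set name table1 and the column names z, expected and reject are hypothetical, and the test uses the expression n × erfc(|x - μ| / σ) exactly as written in the procedure.

```sas
/* the 14 measurements from Table 1 */
data table1;
  input x @@;
  datalines;
8.02 8.16 3.97 8.64 0.84 4.46 0.81 7.74 8.78 9.26 20.46 29.87 10.38 25.71
;
run;

proc sql;
  /* pass 1: expected number of points at least as extreme as each x */
  select t.x,
         abs(t.x - s.mu) / s.sigma    as z         format=6.3,
         s.n * erfc(calculated z)     as expected  format=8.3,
         (calculated expected < 0.5)  as reject
  from table1 as t,
       (select count(x) as n, mean(x) as mu, std(x) as sigma
        from table1) as s
  order by expected;
quit;
```

With these values the expected count comes out at roughly 0.03 for 29.87 and roughly 0.2 for 25.71, both below ½, while 20.46 sits near 1.5 and survives the first pass. Only after the two largest values are removed and the statistics are recomputed does 20.46 fall below the threshold, which is the shielding effect described above.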
HOW IT WORKS
If there are fewer points than you expect, then throw away those few points.
[Figure: distribution sketch annotated "These 3 points are outliers".]
WHAT IS ERFC?
The complementary error function, erfc, is the residual area under the tails of a distribution. Its value gets smaller for values further away from the center of the distribution. Thus, the complementary error function of infinity is always zero. That's like saying there is nothing else to see when you look at the whole picture.

There is nothing specific to any particular distribution. erfc is simply the integral of a probability density function. In most database systems and statistics programs, erfc assumes the normal or Gaussian distribution. Using the appropriate calculation or look-up table makes Chauvenet's criterion a universal test.
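To get a feel for how quickly the tail area shrinks, the short step below prints erfc for a few arguments using the SAS functions ERFC and PROBNORM. It is only an illustration; the variable names are arbitrary. It also prints the two-sided tail area of the standard normal distribution, which is related to erfc by the identity 2 × (1 - Φ(z)) = erfc(z / √2).

```sas
data _null_;
  do z = 0, 0.5, 1, 2, 3;
    tail             = erfc(z);                /* 1 at z=0, shrinks toward 0   */
    two_sided_normal = 2 * (1 - probnorm(z));  /* normal area beyond +/- z     */
    same_thing       = erfc(z / sqrt(2));      /* same area, written with erfc */
    put z= tail= two_sided_normal= same_thing=;
  end;
run;
```

The log shows tail equal to 1 at z=0 and falling rapidly as z grows, and the last two columns agreeing with each other.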
WHY THE ½?
This is the magical Chauvenet number. We give each value a 50% chance of survival. Said another way, there must be as many points closer to the mean as there are further away. A value is an outlier if it is so far away that there are hardly any other values greater than it.

Sample size is very important. A distribution with more data points is less likely to be affected by any single point, regardless of its value. Think about sample size as the "mass" of a physical system. It is difficult for a cat to get a bowling ball moving. But a cat can easily play with a ping pong ball all day. Greater mass means more inertia. The analogy carries over exactly to distributions. A greater number of data points means that there is little chance for any single data point to affect the distribution shape. A value must be very far away from the mean in order to "move" the distribution of other points and be considered an outlier. With 200 data points, an outlying value is more than 3σ distant -- very far away from the mean! On the other hand, suppose we have a nearly "mass-less", lightweight distribution with only 10 values. A bad value or outlier need only be 1.96×σ away from the mean. Therefore, smaller sample sizes place more rigid requirements on the individual values.

The critical threshold which separates good values from bad is shown in the figure below. Z is the usual z-score, |x-μ|/σ, indicating how far away a value is from the mean. Percentages show how confident we are that a particular value belongs to the distribution. This plot assumes a normal, Gaussian distribution, although the basic concept here is universal and other distributions may be similarly considered by tabulating the appropriate integral.
[Figure: critical z-score Z (vertical axis, 1.1 to 3.3) versus Sample Size (n) (horizontal axis, 0 to 200). The region above the curve is labeled OUTLIERS and the region below is labeled GOOD; confidence levels of 90%, 95%, 98.3%, 99.5% and 99.75% are marked.]
This figure is simply the z-score corresponding to the confidence level of ( 1 - 1/(2n) ). The "2" in this formula is the magical Chauvenet number. Two example calculations make this picture clear.
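The two sample sizes discussed earlier, n=10 and n=200, can be worked through with the SAS PROBIT function. This is a sketch under one assumption: the confidence level 1 - 1/(2n) is treated as two-sided for a normal distribution, so half of the excluded area 1/(2n) lies in each tail. The data set and variable names are hypothetical.

```sas
data thresholds;
  do n = 10, 200;
    confidence = 1 - 1 / (2 * n);        /* 0.95 for n=10, 0.9975 for n=200 */
    /* two-sided: an area of 1/(4n) is cut from each tail of the normal */
    z_crit = probit(1 - 1 / (4 * n));
    output;
  end;
run;

proc print data=thresholds noobs;
  format confidence 8.4 z_crit 6.2;
run;
```

Under that assumption the threshold is about 1.96 for n=10 and just over 3 for n=200, in line with the values quoted in the text.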
Changing the Chauvenet factor changes the sensitivity of the outlier test, and corresponds to the nature of the distribution with which you are testing. Using 1/3 means that there must be twice as many smaller values as there are larger values. Similarly, 5/6 means there should be only one smaller value for every five larger values. Shown in the table below are a few values for the Chauvenet factor and a qualitative comparison.

    Chauvenet   Distribution   Rejection      Outlier if ...   Acceptance criteria
    factor      shape          sensitivity    (for N=10)
    1/4         peaky          very lenient   x > 2.25×σ       allows more points closer to mean
    1/3         skinny         loose          x > 2.14×σ       ...
    1/2         normal         moderate       x > 1.96×σ       equal quantity of points close to and far from mean
    2/3         disperse       tight          x > 1.84×σ       ...
    5/6         long-tailed    very rigid     x > 1.74×σ
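The thresholds in this table can be approximated by replacing the ½ with a general factor f, so that the two-sided normal confidence level becomes 1 - f/n. The step below is a sketch under that assumption, with hypothetical data set and variable names; it is not taken from the authors' macro.

```sas
data factor_thresholds;
  n = 10;
  do factor = 1/4, 1/3, 1/2, 2/3, 5/6;
    /* an area of factor/(2n) is cut from each tail of the normal */
    z_crit = probit(1 - factor / (2 * n));
    output;
  end;
run;

proc print data=factor_thresholds noobs;
  format factor 6.3 z_crit 6.2;
run;
```

The printed thresholds fall from about 2.24 for a factor of 1/4 to about 1.73 for 5/6, close to the tabulated values; small differences in the last digit may come from rounding.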
[Figure: SPC chart with reference lines at μ+3×σ, μ, and μ-3×σ, with outliers marked.]
It might appear that we assume normality of the data distribution. Our first step is to compute the mean and sigma. However, we only consider the z-score, which is the ratio of the deviation from the mean to sigma, so their distribution inference is nullified. When the mean or sigma does not exist, such as for Lorentzian distributions found in Nuclear Magnetic Resonance experiments, other criteria must be applied.

Chauvenet's criterion has trouble when the data distribution is strongly bi-modal. When there are widely separated, resolvable modes, all data points will be rejected. That's why we put a "stop limit" in step 3 of the procedure so that the entire data set isn't whacked away.
In practice, a bi-modal or multi-modal distribution of a parameter usually means that we have mixed disparate data sources which should not be mixed at all.
CONCLUSION
We have presented a simple, efficient, and gentle macro to make a cleaner data set. It is easier to interpret a summary of test results when the raw results are clean and all related (i.e., not spurious). The number of points excluded from summarization is an important parameter. Keep a log of which points were excluded. If many events exclude the same measurement, look for a systematic trend such as wrong test position, incorrect precondition, etc. With Chauvenet's criterion you can be sure your analyses are free of invisible rabbits.
AUTHOR CONTACT
Lily Lin
2715 South Norfolk Street, Apt. 103
San Mateo, CA 94403
(973) 978-8292
lily4lin@yahoo.com

Paul D Sherman
335 Elan Village Lane, Apt. 424
San Jose, CA 95134
(408) 383-0471
sherman@idiom.com
www.idiom.com/~sherman/paul/pubs/chauv

On-line Demonstration: http://www.idiom.com/~sherman/paul/pubs/demo/chauvdemo.html
REFERENCES
Chase, Mary Coyle. 1953. Harvey. UK: Oxford University Press.
Dixon, W. J. (1953). "Processing data for outliers." Biometrics, vol. 9, pp. 74-89.
Ferguson, T. S. (1961). "On the rejection of outliers." Proceedings of the 4th Berkeley Symp. Mathematical Statistics and Probability, 1, pp. 253-287.
Grubbs, F. (1969). "Procedures for Detecting Outlying Observations in Samples." Technometrics, 11, pp. 1-21.
Herzog, Erik D. (2003, Jan. 24). "Picturing Our Past." In Record, vol. 27, no. 17, St. Louis, MO: Washington University. Retrieved July 27, 2007, from http://record.wustl.edu/2003/1-24-03/picturing_our_past.html
Mathematical Association of America. "The Mathematical Association of America's Chauvenet Prize." Retrieved July 27, 2007, from http://www.maa.org/awards/chauvent.html
Peirce, B. (1852). "Criterion for the rejection of doubtful observations." Astronomical Journal, 11 (21), pp. 161-163.
Ross, Stephen M. (2003). "Peirce's Criterion for the Elimination of Suspect Experimental Data." J. Engr. Technology.
Taylor, John R. 1997. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, second ed. Herndon, VA: University Science Books.
Tietjen, Gary L. and Roger H. Moore. (1972, August). "Some Grubbs-Type Statistics for the Detection of Several Outliers." Technometrics, 14 (3), pp. 583-597.
ACKNOWLEDGMENTS
The authors would like to thank Grant Luo for creating and discussing the IQR comparison study. Annie Perlin deserves a warm round of applause for her role as the other Chauvenet. We greatly appreciate the generosity of MedFocus for allowing us to work on this article.
TRADEMARK INFORMATION
SAS, SAS Certified Professional, and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
    * the main macro *;
    %macro chauv(indat, var, good=outdatg, bad=outdatb, chaufac=0.5);
      %local isAllgood loopnum;

      * initialize all data points GOOD *;
      * assumes there is not already a variable called ISGOOD *;
      data chaudat;
        set &indat.;
        isgood=1;
      run;

      * loop forever until all values pass the test *;
      %let isAllgood=FALSE;
      %let loopnum=1;
      %do %until(&isAllgood. eq TRUE);
        %chauv0(chaudat, &var., &chaufac., isAllgood, &loopnum.);
        %let loopnum=%eval(&loopnum.+1);
      %end;

      data &good. (drop=isgood);
        set chaudat;
      run;

      %if &bad. ne . %then %do;
        data &bad. (drop=isgood);
          set %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
        run;
      %end;

      proc datasets lib=work nodetails nolist nowarn;
        delete chaudat summ;
        delete %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
      quit;
    %mend chauv;

    ***************************************
    *** CLEANING DATA THE CHAUVENET WAY ***
    ***         a demonstration         ***
    *** by Lily Lin and Paul D Sherman  ***
    ***************************************;

    *** fake some data ***;
    proc sql noprint;
      create table raw (value integer);
      insert into raw (value)
        values (8.02) values (8.16) values (3.97) values (0.84)
        values (4.46) values (0.81) values (8.78) values (9.26)
        values (20.46) values (10.38) values (25.71)
      ;
    quit;

    %chauv(raw, value, good=theGood, bad=theBad);
    proc means data=theGood; run;   * summarize the good ... *;
    proc print data=theBad; run;    * ... and show us the bad *;

    %chauv(raw, value, good=raw, bad=.);   * overwrite the original data *;
The code which generates this data is shown below.

    %let n_obs=10000;
    data a;
      group=1;
      do i= 1 to 5;       x=25+0.7*rannor(6789); output; end;
      do i= 6 to 10;      x=3+0.2*rannor(6789);  output; end;
      do i=11 to &n_obs;  x=10+1*rannor(6789);   output; end;
    run;

    %macro mycompare;
      data final;
        merge a a_good(keep=i in=b) temp(keep=i in=c);
        by i;
        format flag $45.;
        if (not b) and (not c) then flag="Removed by Both IQR and Chauv Method";
        if (not b) and c then flag="Removed by IQR ONLY";
        if b and (not c) then flag="Removed by Chauv Method ONLY";
      run;
      proc freq data=final;
        table flag;
      run;
    %mend mycompare;

    %macro iqr(inds=a);
      proc univariate data=&inds noprint;
        var x;
        output out=sum_&inds mean=mean median=p50 q1=p25 q3=p75 std=std;
      run;
      data sum_&inds;
        set sum_&inds;
        range=p75-p25;
        UL=p75+1.5*range;
        LL=p25-1.5*range;
        call symput('UL', UL);
        call symput('LL', LL);
      run;
      title "Inter Quartile Range Method";
      title2 "All Data Point";
      footnote "Program: SESUG 2007 Paper SA-11 Output: outlier";
      data &inds._good &inds._bad;
        set &inds;
        if &LL<=x<=&UL then output &inds._good;
        else output &inds._bad;
      run;
      proc sql;
        title "Number of Bad Records";
        select count(*) into: n_bad from &inds._bad;
        title "Inter Quartile Range Method";
        title2 "Good Data Point";
      quit;
    %mend iqr;
    %iqr(inds=a);
    %macro chauv(inds=a, loop=5);
      %do i=1 %to &loop;
        proc univariate data=&inds noprint;
          var x;
          output out=sum_&inds n=n mean=mean median=p50 q1=p25 q3=p75 std=std;
        run;
        proc sql;
          create table &inds. as
            select r.i, r.x,
                   s.n*erfc(abs(r.x-s.mean)/s.std)>0.5 as is_good
            from sum_&inds as s inner join &inds as r on 1=1;
          title "Number of Bad Records in Loop &i";
          select count(*) into: n_bad from &inds where is_good=0;
        quit;
        data &inds;
          set &inds;
          group=1;
          where is_good=1;
        run;
        %if &n_bad=0 %then %do;
          %let i=1000;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop End";
        %end;
        %else %do;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop &i";
        %end;
        %mycompare;
      %end;
    %mend chauv;

    data temp;
      set a;
    run;
    %chauv(inds=temp, loop=50);