You often want to compare data sets. You can't really do this point-by-point, so you summarize each set individually and compare their descriptive statistics in the aggregate. By looking only at the summary, you are assuming that all observations are related in some way. How do you verify this assumption? By testing for outliers. Values that are spurious or unrelated to the others must be excluded from summarization. In this paper we present a simple, efficient, and gentle macro for filtering a data set.

There are many techniques, and the subject of robust statistics is modern and rich, though not at all simple. Chauvenet's criterion is easy to understand, can be quickly computed for a billion rows of data, and is even gentle enough to be used on a tiny set of only ten points. We have seen Chauvenet's criterion used in astronomy, nuclear technology, geology, epidemiology, molecular biology, radiology, and many fields of physical science. It is widely used by government laboratories, industries, and universities. Although Chauvenet's criterion is not currently used in clinical trials, we would like to explore the possibility of applying it to the trials environment as well.
You want some way to identify which observations in your data set need closer study. It's not appropriate to simply throw away or delete an observation; you must keep it around to look at later. The picture is as follows.
Our macro does this easily. To scrutinize variable x and split origdata into good and bad pieces, use this macro call:

    %chauv(origdata, x, good=theGood, bad=theBad);

Then, for example,

    proc means data=theGood; run;   * summarize the good ... *;
    proc print data=theBad; run;    * ... and show us the bad *;
IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort step required can be costly, especially when the dataset is huge. For example, a data set with a billion observations takes half an hour to calculate the percentiles, even with the piecewise-parabolic algorithm, but only six minutes to generate the distribution moments μ and σ.

    Number of          Computation time
    observations       Median, P25, P75    Mean, Std
    1,000,000,000      32:31.43            6:07.45
    100,000,000         3:13.79              30.15
    10,000,000            18.62               1.35
    1,000,000              1.89               0.14
    100,000                0.20               0.01
    10,000                 0.03               0.00
    1,000                  0.01               0.00
    100                    0.00               0.01

Times are in seconds; entries containing a colon are minutes:seconds.
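The cost difference arises because μ and σ need only a single pass over the data, with no sorting. As a minimal illustration in Python (not the authors' SAS code), the sketch below accumulates both moments in one pass using Welford's update, a standard technique swapped in here for clarity. The function name `running_moments` and the sample values are hypothetical.

```python
import math
import statistics

def running_moments(values):
    """Return (n, mean, sample standard deviation) from a single pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, std

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n, mean, std = running_moments(data)
print(n, mean, round(std, 4))

# The median, by contrast, needs the values in sorted order.
print(statistics.median(data))
```

The loop keeps only three numbers in memory, which is why the moments scale so much better than percentiles as the data set grows.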
Sorting and observation numbering cannot be done in SQL because the rows of a table are independent of one another. You cannot compute the median in a database query. Neither can you use the ORDER BY clause in a sub-query. Therefore, we must seek an alternative outlier test which can be performed within a database query environment.
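To illustrate that such a test needs only aggregates, here is a hedged sketch of a single pass expressed as one SQL query, run through Python's built-in sqlite3 module rather than SAS. It applies the test in the form n × erfc(|x − μ| / σ) < 1/2. The table name `obs`, the helper function name `root`, and the sample values are assumptions made for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register Python helpers so the query can call erfc and a square root.
conn.create_function("erfc", 1, math.erfc)
conn.create_function("root", 1, math.sqrt)

conn.execute("CREATE TABLE obs (x REAL)")
values = [9, 11, 9, 11, 10, 9, 11, 9, 11, 100]  # made-up sample
conn.executemany("INSERT INTO obs (x) VALUES (?)", [(v,) for v in values])

# One pass of the test: aggregates only, no ORDER BY and no row numbering.
query = """
WITH s AS (
    SELECT COUNT(*) AS n,
           AVG(x)   AS mean,
           root((SUM(x * x) - COUNT(*) * AVG(x) * AVG(x)) / (COUNT(*) - 1)) AS std
    FROM obs
)
SELECT o.x
FROM obs AS o CROSS JOIN s
WHERE s.n * erfc(abs(o.x - s.mean) / s.std) < 0.5
"""
rejected = [row[0] for row in conn.execute(query)]
print(rejected)
```

For this sample only the value 100 is returned. A full implementation would repeat the query on the surviving rows until nothing more is rejected.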
A character in a play? The auntie of someone whose best friend is a pooka. Only Mrs. Chauvenet knows the truth about Elwood P. Dowd. But this observation is itself an outlier. The American mathematician William Chauvenet (1820-1870) is best known for his clear and simple writing style and his pioneering contributions to the U.S. Naval Academy. He mathematically verified the design of the Eads Bridge spanning the Mississippi River at St. Louis, and was the second chancellor of Washington University in St. Louis. In his honor, each year since 1925 a well-written mathematical article receives the Chauvenet Prize.
If the expected number of measurements at least as bad as the suspect measurement is less than 1/2, then the suspect measurement should be rejected.
Let's assume you have a data set with numeric variable x. Suppose there are n observations in your dataset. You want to throw away all observations which are "not good enough". How do you do this? Remember that in clinical practice, no point is "not good enough", so the subject of outliers does not apply.

1. Calculate μ and σ.
2. If n × erfc( |x_i − μ| / σ ) < 1/2, then reject x_i.
3. Repeat steps 1 and 2 until step 2 passes or too many points have been removed.
4. Report the final μ, σ, and n.
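The four steps can be sketched as a short loop. This is a Python illustration under stated assumptions, not the authors' macro. The function name `chauvenet_filter`, the `max_removed_fraction` stop limit, and the sample values are hypothetical. By default the tail function is erfc applied directly to the z-score, as in step 2. Passing `tail=lambda z: math.erfc(z / math.sqrt(2))` instead uses the two-sided tail probability of a normal distribution, which is a more lenient test.

```python
import math
import statistics

def chauvenet_filter(values, factor=0.5, max_removed_fraction=0.5, tail=math.erfc):
    """Split values into (good, bad) by repeating steps 1 and 2.

    factor               -- threshold on the expected count (1/2 in step 2)
    max_removed_fraction -- hypothetical stop limit for step 3
    tail                 -- tail-area function applied to the z-score
    """
    good = list(values)
    bad = []
    limit = max_removed_fraction * len(good)
    while len(good) > 2:
        mu = statistics.mean(good)            # step 1
        sigma = statistics.stdev(good)
        if sigma == 0:
            break
        n = len(good)
        rejected = [x for x in good           # step 2
                    if n * tail(abs(x - mu) / sigma) < factor]
        if not rejected or len(bad) + len(rejected) > limit:
            break                             # step 3: nothing rejected, or stop limit hit
        bad.extend(rejected)
        good = [x for x in good if x not in rejected]
    return good, bad

sample = [9, 11, 9, 11, 10, 9, 11, 9, 11, 100]  # made-up sample
good, bad = chauvenet_filter(sample)
print(bad, len(good), statistics.mean(good), statistics.stdev(good))  # step 4
```

On this sample the first pass rejects 100, and the second pass finds nothing further, leaving nine good points with mean 10 and standard deviation 1. The rejected values are returned rather than discarded, so they remain available for later study.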
When the dust settles, you have two data sets: the set of all good data points, and the set of "bad" points. Although most often we don't care about bad data points, sometimes the bad points tell us much more than the good points do. We must not be too hasty to forget about the bad data points completely, but keep them aside for later and careful evaluation.
Some questions to ask about the "bad" data points, which we will talk about later:

1. Why were these particular points excluded?
2. What do all the bad points have in common?
Suppose we have 14 measurements of some parameter, shown below in Table 1. It takes two iterations of the Chauvenet procedure to eliminate all the "bad" data values. The first pass marks 2 values as bad: 29.87 and 25.71. Then, the second pass marks another value bad: 20.46. Further applications of Chauvenet don't mark any more points. As "bad" data points are excluded, notice how the standard deviation improves significantly, from 8.77 to 5.17 to 3.38.

    Table 1
    Original    Pass #1     Pass #2
     8.02       .           .
     8.16       .           .
     3.97       .           .
     8.64       .           .
     0.84       .           .
     4.46       .           .
     0.81       .           .
     7.74       .           .
     8.78       .           .
     9.26       .           .
    20.46       .           outlier    (shielded outlier)
    29.87       outlier     .
    10.38       .           .
    25.71       outlier     .

    avg:    10.51       7.63        6.46
    stdev:   8.77       5.17        3.38
    n:      14         12          11

The third outlier value, caught and excluded in Pass #2, is called a shielded outlier. At first, its value is small enough, or close enough to the mean, to be considered good. Only when the most extreme values are removed does this next largest value become noticeable. As we will see later, each removal of a data point "lightens the mass" of a distribution. Smaller sample sizes require their values to be closer together. The shielding effect produced by very large values is precisely why we must perform an outlier test iteratively.
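The summary statistics quoted for this example can be re-derived from the fourteen values. The short Python check below takes the exclusions named in the text as given (it does not run the outlier test itself) and recomputes n, the mean, and the sample standard deviation at each stage. The helper name `summarize` is hypothetical.

```python
import statistics

original = [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81,
            7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71]
removed_pass1 = {29.87, 25.71}   # exclusions stated in the text
removed_pass2 = {20.46}

after_pass1 = [x for x in original if x not in removed_pass1]
after_pass2 = [x for x in after_pass1 if x not in removed_pass2]

def summarize(values):
    """Return (n, mean, sample standard deviation), rounded to two decimals."""
    return (len(values),
            round(statistics.mean(values), 2),
            round(statistics.stdev(values), 2))

for label, values in [("Original", original),
                      ("Pass #1", after_pass1),
                      ("Pass #2", after_pass2)]:
    print(label, summarize(values))
```

The three summaries come out as (14, 10.51, 8.77), (12, 7.63, 5.17), and (11, 6.46, 3.38), in agreement with the standard deviations 8.77, 5.17, and 3.38 quoted in the text.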
If there are fewer points than you expect, then throw away those few points.
The complementary error function, erfc, is the residual area under the tails of a distribution. Its value gets smaller for values further away from the center of the distribution. Thus, the value of erfc at infinity is always zero. That's like saying there is nothing else to see when you look at the whole picture. There is nothing specific to any particular distribution: erfc is simply the integral of a probability density function. In most database systems and statistics programs, erfc assumes the normal or Gaussian distribution. Using the appropriate calculation or look-up table makes Chauvenet's criterion a universal test.
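For reference, the standard mathematical definition of the complementary error function, and its relation to the two-sided tail area of a normal distribution with mean μ and standard deviation σ, can be written as follows. This is textbook material rather than something stated in the paper. Note the factor of √2 that relates the z-score to the argument of erfc under this definition.

```latex
\operatorname{erfc}(t) = \frac{2}{\sqrt{\pi}} \int_{t}^{\infty} e^{-u^{2}}\,du,
\qquad \operatorname{erfc}(0) = 1,
\qquad \lim_{t \to \infty} \operatorname{erfc}(t) = 0

P\bigl(|x - \mu| > z\,\sigma\bigr) = \operatorname{erfc}\!\left(\frac{z}{\sqrt{2}}\right)
\quad \text{for } x \sim \mathcal{N}(\mu, \sigma^{2})
```

For example, z = 1.96 gives a two-sided tail area of about 0.05, so among ten normally distributed observations the expected number lying at least that far from the mean is about 1/2.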
The 1/2 in the criterion is the magical Chauvenet number. We give each value a 50% chance of survival. Said another way, there must be as many points closer to the mean as there are further away. A value is an outlier if it is so far away that there are hardly any other values greater than it.

Sample size is very important. A distribution with more data points is less likely to be affected by any single point, regardless of its value. Think of sample size as the "mass" of a physical system. It is difficult for a cat to get a bowling ball moving. But a cat can easily play with a ping pong ball all day. Greater mass means more inertia. The analogy carries over exactly to distributions. A greater number of data points means that there is little chance for any single data point to affect the shape of the distribution. A value must be very far away from the mean in order to "move" the distribution of other points and be considered an outlier. With 200 data points, an outlying value is more than 3σ distant, which is very far away from the mean! On the other hand, suppose we have a nearly "mass-less", lightweight distribution with only 10 values. A bad value or outlier need only be 1.96×σ away from the mean. Therefore, smaller sample sizes place more rigid requirements on the individual values.

The critical threshold which separates good values from bad is shown in the figure below. Z is the usual z-score, |x−μ|/σ, indicating how far away a value is from the mean. Percentages show how confident we are that a particular value belongs to the distribution. This plot assumes a normal (Gaussian) distribution, although the basic concept is universal and other distributions may be considered similarly by tabulating the appropriate integral.
This figure is simply the z-score corresponding to the confidence level of ( 1 − 1/(2N) ). The "2" in this formula is the magical Chauvenet number. Two example calculations make this picture clear.
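The two example calculations can be reproduced numerically. The Python sketch below assumes a normal distribution and a two-sided test, so the tail area 1/(2N) is split evenly between the two tails. The function name `chauvenet_threshold` is hypothetical.

```python
from statistics import NormalDist

def chauvenet_threshold(n, factor=0.5):
    """z-score beyond which the expected count of such values drops below `factor`.

    Assumes a normal distribution: the two-sided tail area factor/n
    is split evenly between the upper and lower tails.
    """
    tail_area = factor / n
    return NormalDist().inv_cdf(1.0 - tail_area / 2.0)

for n in (10, 200):
    print(n, round(chauvenet_threshold(n), 2))
```

This gives 1.96 for N = 10 and 3.02 for N = 200, matching the thresholds of 1.96×σ and "more than 3σ" quoted for those sample sizes.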
Changing the Chauvenet factor changes the sensitivity of the outlier test, and corresponds to the nature of the distribution which you are testing. Using 1/3 means that there must be twice as many smaller values as there are larger values. Similarly, 5/6 means there should be only one smaller value for every five larger values. Shown in the table below are a few values for the Chauvenet factor and a qualitative comparison.

    Chauvenet   Distribution   Rejection      Outlier if ...   Acceptance criteria
    factor      shape          sensitivity    (for N=10)
    1/4         peaky          very lenient   x > 2.25×σ       allows more points closer to mean
    1/3         skinny         loose          x > 2.14×σ       ...
    1/2         normal         moderate       x > 1.96×σ       equal quantity of points close to and far from mean
    2/3         disperse       tight          x > 1.84×σ       ...
    5/6         long-tailed    very rigid     x > 1.74×σ
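The thresholds for N = 10 can be approximated from the normal quantile function. The Python sketch below assumes, as an illustration, that a factor f corresponds to a two-sided normal tail area of f/N. The function name `chauvenet_threshold` is hypothetical.

```python
from fractions import Fraction
from statistics import NormalDist

def chauvenet_threshold(n, factor):
    """z-score at which the two-sided normal tail area equals factor / n."""
    tail_area = factor / n
    return NormalDist().inv_cdf(1.0 - tail_area / 2.0)

factors = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 2),
           Fraction(2, 3), Fraction(5, 6)]
thresholds = {f: chauvenet_threshold(10, float(f)) for f in factors}

for f, z in thresholds.items():
    print(f"{f}: outlier if more than {z:.2f} sigma from the mean")
```

Under this assumption the computed values agree with the tabulated thresholds to within about 0.01, and they decrease as the factor grows: a larger factor makes the test more rigid.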
It might appear that we assume normality of the data distribution, since our first step is to compute the mean and sigma. However, we only consider the z-score, which is the ratio of the deviation from the mean to sigma, so the distribution inference they carry is nullified. When the mean or sigma does not exist, such as for the Lorentzian distributions found in Nuclear Magnetic Resonance experiments, other criteria must be applied.

Chauvenet's criterion has trouble when the data distribution is strongly bi-modal. When there are widely separated, resolvable modes, all data points will be rejected. That's why we put a "stop limit" in step 3 of the procedure, so that the entire data set isn't whacked away.
In practice, a bi-modal or multi-modal distribution of a parameter usually means that we have mixed disparate data sources which should not be mixed at all.
We have presented a simple, efficient, and gentle macro to make a cleaner data set. It is easier to interpret a summary of test results when the raw results are clean and all related (i.e., not spurious). The number of points excluded from summarization is an important parameter. Keep a log of which points were excluded. If many events exclude the same measurement, look for a systematic trend such as a wrong test position, an incorrect precondition, etc. With Chauvenet's criterion you can be sure your analyses are free of invisible rabbits.
Lily Lin
San Mateo, CA
lilyJlin@yahoo.com
www.idiom.com/~sherman/paul/pubs/chauv

Paul D Sherman
San Jose, CA
sherman@idiom.com
The authors would like to thank Grant Luo for creating and discussing the IQR comparison study. Annie Perlin deserves a warm round of applause for her role as the other Chauvenet. We greatly appreciate the generosity of MedFocus for allowing us to work on this article.
* the main macro *;
%macro chauv(indat, var, good=outdatg, bad=outdatb, chaufac=0.5);
  %local isallgood loopnum i;

  * initialize all data points GOOD *;
  * assumes there is not already a variable called ISGOOD *;
  data chaudat;
    set &indat.;
    isgood=1;
  run;

  * loop forever until all values pass the test *;
  * each call to chauv0 performs one pass (chauv0 is not shown in this listing) *;
  %let isallgood=FALSE;
  %let loopnum=1;
  %do %until(&isallgood. eq TRUE);
    %chauv0(chaudat, &var., &chaufac., isallgood, &loopnum.);
    %let loopnum=%eval(&loopnum.+1);
  %end;

  * the surviving observations are the good ones *;
  data &good. (drop=isgood);
    set chaudat;
  run;

  * collect the points rejected in each pass, unless bad=. was given *;
  %if &bad. ne . %then %do;
    data &bad. (drop=isgood);
      set %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
    run;
  %end;

  * clean up the work data sets *;
  proc datasets lib=work nodetails nolist nowarn;
    delete chaudat summ;
    delete %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
  quit;
%mend chauv;

***************************************
*** CLEANING DATA THE CHAUVENET WAY ***
***        a demonstration          ***
*** by Lily Lin and Paul D Sherman  ***
***************************************;

*** fake some data ***;
proc sql noprint;
  create table raw (value num);
  insert into raw (value)
    values (8.02) values (8.16) values (3.97) values (0.84)
    values (4.46) values (0.81) values (8.78) values (9.26)
    values (20.46) values (10.38) values (25.71)
  ;
quit;

%chauv(raw, value, good=theGood, bad=theBad);
proc means data=theGood; run;   * summarize the good ... *;
proc print data=theBad; run;    * ... and show us the bad *;

%chauv(raw, value, good=raw, bad=.);   * overwrite the original data *;
The code which generates this data is shown below.

%let n_obs=10000;

* 5 points near 25, 5 points near 3, and the rest near 10 *;
data a;
  group=1;
  do i=1 to 5;        x=25+0.7*rannor(6789); output; end;
  do i=6 to 10;       x=3+0.2*rannor(6789);  output; end;
  do i=11 to &n_obs;  x=10+1*rannor(6789);   output; end;
run;

* compare which points each method removed *;
%macro mycompare;
  data final;
    merge a a_good(keep=i in=b) temp(keep=i in=c);
    by i;
    format flag $45.;
    if (not b) and (not c) then flag="Removed by Both IQR and Chauv";
    if (not b) and c       then flag="Removed by IQR ONLY";
    if b and (not c)       then flag="Removed by Chauv Method ONLY";
  run;
  proc freq data=final;
    table flag;
  run;
%mend mycompare;

* inter-quartile range method: keep x within 1.5*IQR of the quartiles *;
%macro iqr(inds=a);
  proc univariate data=&inds noprint;
    var x;
    output out=sum_&inds mean=mean median=p50 q1=p25 q3=p75 std=std;
  run;
  data sum_&inds;
    set sum_&inds;
    range=p75-p25;
    UL=p75+1.5*range;
    LL=p25-1.5*range;
    call symput('UL', UL);
    call symput('LL', LL);
  run;
  title "Inter Quartile Range Method";
  title2 "All Data Point";
  footnote "Program: SESUG 2007 Paper SA-11 Output: outlier";
  data &inds._good &inds._bad;
    set &inds;
    if &LL<=x<=&UL then output &inds._good;
    else output &inds._bad;
  run;
  proc sql;
    title "Number of Bad Records";
    select count(*) into: n_bad from &inds._bad;
  quit;
  title "Inter Quartile Range Method";
  title2 "Good Data Point";
%mend iqr;
%iqr(inds=a);

* Chauvenet method: repeat the test until no bad points remain *;
%macro chauv(inds=a, loop=5);
  %do i=1 %to &loop;
    proc univariate data=&inds noprint;
      var x;
      output out=sum_&inds n=n mean=mean median=p50 q1=p25 q3=p75 std=std;
    run;
    proc sql;
      * flag each row using only the aggregate statistics *;
      create table &inds. as
        select r.i, r.x,
               s.n*erfc(abs(r.x-s.mean)/s.std)>0.5 as is_good
        from sum_&inds as s inner join &inds as r on 1=1
      ;
      title "Number of Bad Records in Loop &i";
      select count(*) into: n_bad from &inds where is_good=0;
    quit;
    data &inds;
      set &inds;
      group=1;
      where is_good=1;
    run;
    %if &n_bad=0 %then %do;
      %let i=1000;   * force the loop to end *;
      title "Chauvenet Method";
      title2 "Good Data Point after Loop End";
    %end;
    %else %do;
      title "Chauvenet Method";
      title2 "Good Data Point after Loop &i";
    %end;
    %mycompare;
  %end;
%mend chauv;

data temp;
  set a;
run;
%chauv(inds=temp, loop=50);