
SESUG Proceedings (c) SESUG, Inc (http://www.sesug.org). The papers contained in the SESUG proceedings are the property of their authors, unless otherwise stated. Do not reprint without permission. SESUG papers are distributed freely as a courtesy of the Institute for Advanced Analytics (http://analytics.ncsu.edu).

Paper SA11

Cleaning Data the Chauvenet Way

Lily Lin, MedFocus, San Mateo, CA
Paul D Sherman, Independent Consultant, San Jose, CA

ABSTRACT
Throwing away data is a touchy subject. Keep the maverick and you contaminate a general trend. Toss away good data points and you don't know if something important has happened until it's too late. How do you know what, if any, data to exclude? Chauvenet is the answer. Here is a really gentle test you can apply to any distribution of numbers. It works equally well for normal, skewed, and even multi-modal populations. This article gives you a macro tool for cleaning up your data and separating the good from the bad.

Skill Level: Basic statistics, Data step, SAS/MACRO, and Proc SQL

INTRODUCTION
You often want to compare data sets. You can't really do this point-by-point, so you summarize each set individually and compare their descriptive statistics in the aggregate. By looking only at the summary, you are assuming that all observations are related in some way. How do you verify this assumption? By testing for outliers. Values that are spurious or unrelated to the others must be excluded from summarization.

In this paper we present a simple, efficient, and gentle macro for filtering a data set. There are many techniques, and the subject of robust statistics is modern and rich, though not at all simple. Chauvenet's criterion is easy to understand, can be quickly computed for a billion rows of data, and is gentle enough to be used on a tiny set of only ten points.

We have seen that Chauvenet's criterion is used in astronomy, nuclear technology, geology, epidemiology, molecular biology, radiology and many fields of physical science. It is widely used by government laboratories, industries, and universities. Although Chauvenet's criterion is not currently used in clinical trials, we would like to explore the possibility of applying it to the trials environment as well.

THE FILTERING PROCESS

You want some way to identify what observations in your data set need closer study. It's not appropriate to simply throw away or delete an observation; you must keep it around to look at later. The picture is as follows.

[Figure: the Original Data Set is split into Good Values (→ continue with analysis) and Bad Values (→ why are these bad?).]

Our macro does this easily. To scrutinize variable x and split origdata into good and bad pieces, use this macro call:

    %chauv(origdata, x, good=theGood, bad=theBad);

Then, for example,

    proc means data=theGood; run; * summarize the good ... *;
    proc print data=theBad; run;  * ... and show us the bad *;

INTERQUARTILE RANGE (IQR) TEST

The IQR test is commonly used in clinical studies to reject data points. The one-way analysis of variance using "box plots" also uses the IQR test. Whiskers of the box are called Outlier Limits and are set 1.5 × IQR beyond the quartiles, that is, 50% further away than the IQR itself. The range for a data point to be considered "good" is defined below.

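[Figure: geometry of the box plot, showing P25, P75, and the outlier limits LOL and UOL.]

$$\mathrm{LOL} = P_{25} - 1.5 \times \mathrm{IQR}$$
$$\mathrm{UOL} = P_{75} + 1.5 \times \mathrm{IQR}$$
$$\text{where } \mathrm{IQR} = P_{75} - P_{25}$$

An acceptable data value must lie within these limits:

$$2.5 \times P_{25} - 1.5 \times P_{75} < x < 2.5 \times P_{75} - 1.5 \times P_{25}$$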
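The limits above can be sketched in a few lines of Python. This is only an illustration, not part of the authors' SAS code: the function names `iqr_limits` and `iqr_split` are hypothetical, and the "inclusive" quartile definition is an assumption (SAS offers several percentile definitions that can give slightly different quartiles).

```python
from statistics import quantiles

def iqr_limits(values):
    """Return (LOL, UOL), the lower and upper outlier limits of the IQR test."""
    # quantiles(..., n=4) returns the three cut points P25, P50, P75.
    # The 'inclusive' method is an assumption made for this sketch.
    p25, _, p75 = quantiles(values, n=4, method="inclusive")
    iqr = p75 - p25
    return p25 - 1.5 * iqr, p75 + 1.5 * iqr

def iqr_split(values):
    """Split values into (good, bad) using LOL <= x <= UOL."""
    lol, uol = iqr_limits(values)
    good = [x for x in values if lol <= x <= uol]
    bad = [x for x in values if not (lol <= x <= uol)]
    return good, bad

good, bad = iqr_split([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(iqr_limits([1, 2, 3, 4, 5, 6, 7, 8, 100]), bad)
```

Note that the quartiles require the data to be ordered, which is the cost discussed next.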

IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort step required can be costly, especially when the dataset is huge. For example, a data set with a billion observations takes half an hour to calculate the percentiles, even with the piecewise-parabolic algorithm, while it takes only six minutes to generate the distribution moments μ and σ.

    Number of Observations    Computation Time (min:sec)
                              Median, P25, P75    Mean, Std
    1,000,000,000             32:31.43            6:07.45
    100,000,000                3:13.79              30.15
    10,000,000                   18.62               1.35
    1,000,000                     1.89               0.14
    100,000                       0.20               0.01
    10,000                        0.03               0.00
    1,000                         0.01               0.00
    100                           0.00               0.01
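One reason the moments are cheap is that they need no sorting: the mean and standard deviation can be accumulated in a single pass over the data. The sketch below illustrates this with Welford's update, a standard one-pass technique chosen here for illustration; it is not a description of how SAS computes its statistics, and `running_moments` is a hypothetical name.

```python
import math

def running_moments(values):
    """Return (n, mean, std) in one pass, without sorting, via Welford's update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Sample standard deviation (divisor n - 1); undefined for fewer than 2 points.
    std = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, std

print(running_moments([2, 4, 4, 4, 5, 5, 7, 9]))
```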

Sorting and observation numbering cannot be done in SQL because rows of a table are independent among themselves. You cannot compute the median in a database query. Neither can you use the ORDER BY clause in a sub-query. Therefore, we must seek an alternative outlier test which can be performed within a database query environment.

WHO IS CHAUVENET
A character in a play, the auntie of someone whose best friend is a pooka. Only Mrs. Chauvenet knows the truth about Elwood P. Dowd. But this observation is itself an outlier.

American mathematician William Chauvenet, 1820-1870, is best known for his clear and simple writing style and pioneering contributions to the U.S. Naval Academy. He mathematically verified the first bridge spanning the Mississippi River, and was the second chancellor of Washington University in St. Louis. In his honor, each year since 1925 a well-written mathematical article receives the Chauvenet Prize.

CHAUVENET'S CRITERION
If the expected number of measurements at least as bad as the suspect measurement is less than 1/2, then the suspect measurement should be rejected.

PROCEDURE

Let's assume you have a data set with numeric variable x. Suppose there are n observations in your dataset. You want to throw away all observations which are "not good enough". How do you do this? Remember that in clinical practice, no point is "not good enough", so the subject of outliers does not apply.

1. Calculate μ and σ.
2. If n × erfc( |x_i − μ| / σ ) < ½ then reject x_i.
3. Repeat steps 1 and 2 until step 2 passes or too many points have been removed.
4. Report the final μ, σ, and n.

When the dust settles, you have two data sets: the set of all good data points, and the set of "bad" points. Although most often we don't care about bad data points, sometimes the bad points tell us much more than the good points do. We must not be too hasty to completely forget about the bad data points, but keep them aside for later and careful evaluation.

Some questions to ask about the "bad" data points, which we will talk about later:

1. Why were these particular points excluded?
2. What do all the bad points have in common?

EXAMPLE
Suppose we have 14 measurements of some parameter, shown below in Table 1. It takes two iterations of the Chauvenet procedure to eliminate all the "bad" data values. The first pass marks 2 values as bad: 29.87 and 25.71. Then, the second pass marks another value bad: 20.46. Subsequent applications of Chauvenet don't mark any more points. As "bad" data points are excluded, notice how the standard deviation significantly improves from 8.77 to 5.17 to 3.38.

    Table 1
    Original    Pass #1    Pass #2
      8.02         .          .
      8.16         .          .
      3.97         .          .
      8.64         .          .
      0.84         .          .
      4.46         .          .
      0.81         .          .
      7.74         .          .
      8.78         .          .
      9.26         .          .
     20.46         .       outlier    (shielded outlier)
     29.87      outlier       .
     10.38         .          .
     25.71      outlier       .

    avg:    10.51       7.63       6.46
    stdev:   8.77       5.17       3.38
    n:         14         12         11

The third outlier value, caught and excluded in Pass #2, is called a shielded outlier. At first, its value is small enough, or close enough to the mean, to be considered good. Only when the most extreme values are removed does this next-largest value become noticeable. As we will see later, each removal of a data point "lightens the mass" of a distribution. Smaller sample sizes require their values to be closer together. The shielding effect produced by very large values is precisely why we must perform an outlier test iteratively.
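The two passes of this example can be traced with a short Python sketch. It is an illustration under stated assumptions, not the authors' SAS macro: the names `chauvenet_pass`, `chauvenet`, and `max_passes` are hypothetical, the standard deviation is the sample value (divisor n − 1), and the test uses the signed deviation (x − μ)/σ as written in the macro listing, so only values above the mean can fail. A version using |x − μ| would also test low-side values and may reject more points.

```python
import math
from statistics import mean, stdev

def chauvenet_pass(values, factor=0.5):
    """One pass: return (good, bad) for the current mean and standard deviation."""
    n, mu, sigma = len(values), mean(values), stdev(values)
    if sigma == 0:
        return list(values), []
    good, bad = [], []
    for x in values:
        # Signed deviation, as in the macro listing (no absolute value).
        expected = n * math.erfc((x - mu) / sigma)
        (good if expected > factor else bad).append(x)
    return good, bad

def chauvenet(values, factor=0.5, max_passes=100):
    """Repeat passes until nothing is rejected; return (good, bad values per pass)."""
    good, rejected = list(values), []
    for _ in range(max_passes):      # stop limit, so the loop cannot run forever
        if len(good) < 3:
            break
        good, bad = chauvenet_pass(good, factor)
        if not bad:
            break
        rejected.append(bad)
    return good, rejected

data = [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26,
        20.46, 29.87, 10.38, 25.71]
good, rejected = chauvenet(data)
print(rejected)  # outliers found in each pass
print(len(good), round(mean(good), 2), round(stdev(good), 2))
```

Each pass recomputes the mean and standard deviation from the surviving points only, which is what exposes the shielded outlier on the second pass.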

HOW IT WORKS
If there are fewer points than you expect, then throw away those few points.

[Figure: expected number of points with value x_i; the points marked in the figure are outliers.]

WHAT IS ERFC?

The complementary error function, erfc, is the residual area under the tails of a distribution. Its value gets smaller for values further away from the center of the distribution. Thus, the complementary error function of infinity is always zero. That's like saying there is nothing else to see when you look at the whole picture. There is nothing specific to any particular distribution. erfc simply is the integral of a probability density function. In most database systems and statistics programs, erfc assumes the normal or Gaussian distribution. Using the appropriate calculation or look-up table makes Chauvenet's criterion a universal test.
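A few values make this behavior concrete. The sketch below uses Python's `math.erfc`, which is the usual Gaussian-based function; for a standard normal variable, the two-sided tail area beyond z equals erfc(z/√2). The helper name `two_sided_normal_tail` is hypothetical.

```python
import math
from statistics import NormalDist

# erfc falls from 1 at the center of the distribution toward 0 far out in the tails.
for z in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(z, math.erfc(z))

def two_sided_normal_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return 2.0 * (1.0 - NormalDist().cdf(z))

# The same tail area, written with erfc.
print(two_sided_normal_tail(1.96), math.erfc(1.96 / math.sqrt(2)))
```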
WHY THE ½?

This is the magical Chauvenet number. We give each value a 50% chance of survival. Said another way, there must be as many points closer to the mean as there are further away. A value is an outlier if it is so far away that there are hardly any other values greater than it.

Sample size is very important. A distribution with more data points is less likely to be affected by any single point, regardless of its value. Think about sample size as the "mass" of a physical system. It is difficult for a cat to get a bowling ball moving. But a cat can easily play with a ping pong ball all day. Greater mass means more inertia. This analogy is exactly the same for distributions. Greater numbers of data points mean that there is little chance for any single data point to affect the distribution shape. A value must be very far away from the mean in order to "move" the distribution of other points and be considered an outlier. With 200 data points, an outlying value is more than 3σ distant: very far away from the mean! On the other hand, suppose we have a nearly "mass-less", lightweight distribution with only 10 values. A bad value or outlier need only be 1.96×σ away from the mean. Therefore, smaller sample sizes place more rigid requirements on the individual values.

The critical threshold which separates good values from bad is shown in the figure below. Z is the usual z-score, |x−μ|/σ, indicating how far away a value is from the mean. Percentages show how confident we are that a particular value belongs to the distribution. This plot assumes a normal, Gaussian distribution, although the basic concept here is universal and other distributions may be similarly considered by tabulating the appropriate integral.
[Figure: critical z-score Z (vertical axis, 1.1 to 3.3) versus Sample Size (n) (horizontal axis, 0 to 200). The curve separates OUTLIERS (above) from GOOD values (below), with confidence levels of 90%, 95%, 98.3%, 99.5%, and 99.75% marked along it.]

This figure is simply the z-score corresponding to the confidence level of ( 1 − 1/(2N) ). The "2" in this formula is the magical Chauvenet number. Two example calculations make this picture clear.

    N=5:   1/(2N) = 1/(2×5)  = 1/10 = 10/100 = 0.10    1−p = 0.10    p = 0.90  → table look-up →  z = 1.65
    N=10:  1/(2N) = 1/(2×10) = 1/20 =  5/100 = 0.05    1−p = 0.05    p = 0.95  → table look-up →  z = 1.96

WHAT HAPPENS IF WE USE 1/3 OR 5/6 OR ...?

This changes the sensitivity of the outlier test, and corresponds to the nature of the distribution with which you are testing. Using 1/3 means that there must be twice as many smaller values as there are larger values. Similarly, 5/6 means there should be only one smaller value for every five larger values. Shown in the table below are a few values for the Chauvenet factor and a qualitative comparison.

    Chauvenet   Distribution   Rejection      Outlier if ...   Acceptance criteria
    factor      shape          sensitivity    (for N=10)
    1/4         peaky          very lenient   x > 2.25×σ       allows more points closer to mean
    1/3         skinny         loose          x > 2.14×σ       ...
    1/2         normal         moderate       x > 1.96×σ       equal quantity of points close to and far from mean
    2/3         disperse       tight          x > 1.84×σ       ...
    5/6         long-tailed    very rigid     x > 1.74×σ       requires more points further from mean
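The thresholds quoted for different sample sizes and Chauvenet factors can be recomputed instead of looked up. The sketch below assumes the table look-up is the two-sided normal one, so that the tail area beyond the threshold is factor/N; `critical_z` is a hypothetical helper name. Computed values may differ in the second decimal from the look-up values quoted above.

```python
from statistics import NormalDist

def critical_z(n, factor=0.5):
    """z-score beyond which the two-sided normal tail area equals factor / n."""
    tail = factor / n  # this is 1 - p, e.g. 0.05 for n = 10 and factor = 1/2
    return NormalDist().inv_cdf(1.0 - tail / 2.0)

# Effect of sample size with the usual factor of 1/2.
for n in (5, 10, 200):
    print(n, round(critical_z(n), 2))

# Effect of the Chauvenet factor for N = 10.
for factor in (1/4, 1/3, 1/2, 2/3, 5/6):
    print(round(factor, 3), round(critical_z(10, factor), 2))
```

A smaller factor pushes the threshold further from the mean (more lenient), while a larger factor pulls it in (more rigid).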

SUGGESTION FOR GOOD PROCESS CONTROL

Monitoring summary statistics from complete detail or raw data sets can lead to many anomalous out-of-control alerts. Too many warnings dilute the effectiveness of a quality control team. Too few, obviously, is not good either; you don't want to be so blind as to pass all the junk.

Using a gentle outlier filtering method such as Chauvenet's criterion is a good idea. You split your raw data set into two pieces: good and bad. Existing SPC (Statistical Process Control) methods are then performed on the good data set.

There is no distribution among the bad observations; they are all useless garbage. However, the quantity of them is useful. Simply count the number of observations in the bad data set and watch a trend chart of that. A consistent, in-control process should have a similar number of outliers over time. If, for example, one day you come in to work and see your process within SPC control yet with almost no outliers, then you know something has changed and must be looked at.

[Figure: the Original Data Set is split into Good Values and Bad Values. The Good Values feed an SPC chart over time t with limits μ+3×σ, μ, and μ−3×σ. The Bad Values feed a trend chart of the number of outliers (Count) over time t with limits nmax and nmin, where an excursion signals that something changed.]
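Watching the outlier count can be as simple as comparing each period's count against the two limits. The sketch below is only an illustration: the function name `count_alerts`, the daily counts, and the limits `n_min` and `n_max` are all hypothetical, and how the limits should be chosen is not specified here.

```python
def count_alerts(daily_outlier_counts, n_min, n_max):
    """Return the indices of days whose outlier count falls outside [n_min, n_max]."""
    return [day for day, count in enumerate(daily_outlier_counts)
            if not (n_min <= count <= n_max)]

# Hypothetical daily counts of observations rejected by the filter.
counts = [4, 5, 3, 6, 4, 0, 5]
print(count_alerts(counts, n_min=2, n_max=8))
```

A day with too few outliers is flagged just like a day with too many, matching the point that almost no outliers is also a sign of change.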

WHAT MIGHT GO WRONG

It might appear that we assume normality of the data distribution. Our first step is to compute mean and sigma. However, we only consider the z-score, which is the ratio of the deviation from the mean to sigma, so their distribution inference is nullified. When the mean or sigma does not exist, such as for Lorentzian distributions found in Nuclear Magnetic Resonance experiments, other criteria must be applied.

Chauvenet's criterion has trouble when the data distribution is strongly bi-modal. When there are widely separated, resolvable modes, all data points will be rejected. That's why we put a "stop limit" in step 3 of the procedure so that the entire data set isn't whacked away.

In practice, a bi-modal or multi-modal distribution of a parametric usually means that we have mixed disparate data sources which should not be mixed at all.

CONCLUSION
We have presented a simple, efficient, and gentle macro to make a cleaner data set. It is easier to interpret a summary of test results when the raw results are clean and all related (i.e., not spurious). The number of points excluded from summarization is an important parameter. Keep a log of which points were excluded. If many events exclude the same measurement, look for a systematic trend such as wrong test position, incorrect precondition, etc. With Chauvenet's criterion you can be sure your analyses are free of invisible rabbits.

AUTHOR CONTACT
Lily Lin
2715 South Norfolk Street, Apt. 103
San Mateo, CA 94403
(973) 978-8292
lily4lin@yahoo.com

Paul D Sherman
335 Elan Village Lane, Apt. 424
San Jose, CA 95134
(408) 383-0471
sherman@idiom.com
www.idiom.com/~sherman/paul/pubs/chauv

On-line Demonstration: http://www.idiom.com/~sherman/paul/pubs/demo/chauvdemo.html

REFERENCES
Chase, Mary Coyle. (1953). Harvey. UK: Oxford University Press.
Dixon, W. J. (1953). "Processing data for outliers." Biometrics, vol. 9, pp. 74-89.
Ferguson, T. S. (1961). "On the rejection of outliers." Proceedings of the 4th Berkeley Symp. Mathematical Statistics and Probability, 1, pp. 253-287.
Grubbs, F. (1969). "Procedures for Detecting Outlying Observations in Samples." Technometrics, 11, pp. 1-21.
Herzog, Erik D. (2003, Jan. 24). "Picturing Our Past." In Record, vol. 27, no. 17, St. Louis, MO: Washington University. Retrieved July 27, 2007, from http://record.wustl.edu/2003/1-24-03/picturing_our_past.html
Mathematical Association of America. "The Mathematical Association of America's Chauvenet Prize." Retrieved July 27, 2007, from http://www.maa.org/awards/chauvent.html
Peirce, B. (1852). "Criterion for the rejection of doubtful observations." Astronomical Journal, 11 (21), pp. 161-163.
Ross, Stephen M. (2003). "Peirce's Criterion for the Elimination of Suspect Experimental Data." J. Engr. Technology.
Taylor, John R. (1997). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, second ed. Herndon, VA: University Science Books.
Tietjen, Gary L. and Roger H. Moore. (1972, August). "Some Grubbs-Type Statistics for the Detection of Several Outliers." Technometrics, 14 (3), pp. 583-597.

ACKNOWLEDGMENTS
The authors would like to thank Grant Luo for creating and discussing the IQR comparison study. Annie Perlin deserves a warm round of applause for her role as the other Chauvenet. We greatly appreciate the generosity of MedFocus for allowing us to work on this article.

TRADEMARK INFORMATION
SAS, SAS Certified Professional, and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

THE CHAUVENET OUTLIER FILTERING MACRO

    options nosource nonotes;
    /* ====================================================
     * CHAUV   - Chauvenet's criteria data cleaner
     *
     * INDAT   - input dataset
     * VAR     - variable name to process
     * GOOD    - output dataset for the GOOD observations
     * BAD     - ditto, for the OUTLIERS. Dot (.) throws away
     * CHAUFAC - sensitivity factor. positive, less than 1.
     *
     * 1.0 2007-04-24 pds lpl initial cut
     * ==================================================== */

    * perform a single step of filtering *;
    %macro chauv0(iodat, var, chaufac, macvar, loopnum);
      * (re)compute summary on only the good points *;
      proc means data=&iodat. noprint mean std;
        where isgood=1;
        var &var.;  * summarize only the variable under test *;
        output out=summ (drop=_type_ _freq_) mean=x std=s n=n;
      run;
      * apply the test: isgood is 1 when n*erfc(...) exceeds the factor *;
      proc sql;
        create table &iodat. as (
          SELECT raw.&var.,
                 summ.n*erfc((raw.&var.-summ.x)/summ.s)>&chaufac. AS isgood
          FROM summ inner join &iodat. AS raw
          ON raw.isgood=1
        );
      quit;
      * report back whether any point failed, and save the failures *;
      %let &macvar.=TRUE;
      data loopdat&loopnum.;
        set &iodat.;
        if isgood eq 0 then do;
          call symput("&macvar.", 'FALSE');
          output loopdat&loopnum.;
        end;
      run;
    %mend chauv0;

    * the main macro *;
    %macro chauv(indat, var, good=outdatg, bad=outdatb, chaufac=0.5);
      %local isAllgood loopnum;
      * initialize all data points GOOD *;
      * assumes there is not already a variable called ISGOOD *;
      data chaudat;
        set &indat.;
        isgood=1;
      run;
      * loop forever until all values pass the test *;
      %let isAllgood=FALSE;
      %let loopnum=1;
      %do %until(&isAllgood. eq TRUE);
        %chauv0(chaudat, &var., &chaufac., isAllgood, &loopnum.);
        %let loopnum=%eval(&loopnum.+1);
      %end;
      data &good. (drop=isgood);
        set chaudat;
      run;
      %if &bad. ne . %then %do;
        data &bad. (drop=isgood);
          set %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
        run;
      %end;
      proc datasets lib=work nodetails nolist nowarn;
        delete chaudat summ;
        delete %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
      quit;
    %mend chauv;

    ***************************************
    *** CLEANING DATA THE CHAUVENET WAY ***
    *** a demonstration                 ***
    *** by Lily Lin and Paul D Sherman  ***
    ***************************************;

    *** fake some data ***;
    proc sql noprint;
      create table raw (value integer);
      insert into raw (value)
        values (8.02)  values (8.16)  values (3.97)  values (8.64)
        values (0.84)  values (4.46)  values (0.81)  values (7.74)
        values (8.78)  values (9.26)  values (20.46) values (29.87)
        values (10.38) values (25.71)
      ;
    quit;

    %chauv(raw, value, good=theGood, bad=theBad);
    proc means data=theGood; run; * summarize the good ... *;
    proc print data=theBad; run;  * ... and show us the bad *;

    %chauv(raw, value, good=raw, bad=.); * overwrite the original data *;

EXAMPLE - PERCENTILES AND QUARTILES TAKE A LONG TIME TO COMPUTE

Percentile-based calculations take significantly longer to perform than do distribution moments. This is due to the former involving internal sorting steps. These results appear to be quite general, and invariant of how many CPUs or threads are allocated to a system.

    options nosource notes;
    %macro means(n);
      data dum;
        do i=0 to &n.;
          x=ranexp(6789);
          output;
        end;
      run;
      proc means data=dum noprint nonobs p50 p25 p75 qmethod=p2 qntldef=5;
        output out=dums (drop=_type_ _freq_) median=p50 p25=p25 p75=p75;
      run;
      proc means data=dum noprint nonobs mean std;
        var x;
        output out=dumss (drop=_type_ _freq_) mean=avg std=std;
      run;
    %mend;
    %means(1000000000);
    %means(100000000);
    %means(10000000);
    %means(1000000);
    %means(100000);
    %means(10000);
    %means(1000);
    %means(100);
    %put *** DONE ***;

EXAMPLE - COMPARING IQR TEST AND CHAUVENET'S CRITERION

The IQR test is a single-pass algorithm. Therefore, its number of outlier values is constant. For small sample sizes of a normal distribution with a few artificially "bad" points thrown in, Chauvenet's criterion somewhat overestimates the number of bad points. On the other hand, you might think that IQR somewhat underestimates the number of bad points when the data set is small. Remember that smaller sample sizes place stricter rules on what is an outlying value.

When the data set is large, more than 10,000 observations, the situation is reversed. IQR rejects many more points than does Chauvenet. The cross-over point, where both tests report about the same quantity of outlying values, seems to be about 4,000 observations. Therefore, with very large data sets Chauvenet's criterion is superior. It rejects only those few really bad values, and without percentile calculations is much more time and memory efficient in its calculations.

    Loop      N=200          N=4,000        N=10,000       N=1,000,000
    Number    IQR   Chauv    IQR   Chauv    IQR   Chauv    IQR    Chauv
    1         10    10       39    14       84    29       7032   408
    2         10    15       39    39       84    49       7032   430
    3         10    20       39    42       84    55       7032   431
    4         10    22       39    44       84    57       .      .
    5         10    23       39    45       .     .        .      .

The code which generates this data is shown below.

    %let n_obs=10000;
    data a;
      group=1;
      do i= 1 to 5;       x=25+0.7*rannor(6789); output; end;
      do i= 6 to 10;      x=3+0.2*rannor(6789);  output; end;
      do i=11 to &n_obs;  x=10+1*rannor(6789);   output; end;
    run;

    %macro mycompare;
      data final;
        merge a a_good(keep=i in=b) temp(keep=i in=c);
        by i;
        format flag $45.;
        if (not b) and (not c) then flag="Removed by Both IQR and Chauv Method";
        if (not b) and c then flag="Removed by IQR ONLY";
        if b and (not c) then flag="Removed by Chauv Method ONLY";
      run;
      proc freq data=final;
        table flag;
      run;
    %mend mycompare;

    %macro iqr(inds=a);
      proc univariate data=&inds noprint;
        var x;
        output out=sum_&inds mean=mean median=p50 q1=p25 q3=p75 std=std;
      run;
      data sum_&inds;
        set sum_&inds;
        range=p75-p25;
        UL=p75+1.5*range;
        LL=p25-1.5*range;
        call symput('UL', UL);
        call symput('LL', LL);
      run;
      title "Inter Quartile Range Method";
      title2 "All Data Point";
      footnote "Program: SESUG 2007 Paper SA-11 Output: outlier";
      data &inds._good &inds._bad;
        set &inds;
        if &LL<=x<=&UL then output &inds._good;
        else output &inds._bad;
      run;
      proc sql;
        title "Number of Bad Records";
        select count(*) into: n_bad from &inds._bad;
        title "Inter Quartile Range Method";
        title2 "Good Data Point";
      quit;
    %mend iqr;
    %iqr(inds=a);

    %macro chauv(inds=a, loop=5);
      %do i=1 %to &loop;
        proc univariate data=&inds noprint;
          var x;
          output out=sum_&inds n=n mean=mean median=p50 q1=p25 q3=p75 std=std;
        run;
        proc sql;
          create table &inds. as
            select r.i, r.x, s.n*erfc(abs(r.x-s.mean)/s.std)>0.5 as is_good
            from sum_&inds as s inner join &inds as r on 1=1;
          title "Number of Bad Records in Loop &i";
          select count(*) into: n_bad from &inds where is_good=0;
        quit;
        data &inds;
          set &inds;
          group=1;
          where is_good=1;
        run;
        %if &n_bad=0 %then %do;
          %let i=1000;  * nothing rejected: force the loop to end *;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop End";
        %end;
        %else %do;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop &i";
        %end;
        %mycompare;
      %end;
    %mend chauv;

    data temp;
      set a;
    run;
    %chauv(inds=temp, loop=50);
