Statistical Advisor
Beta – arises from a transformation of the F distribution and is typically used to model the distribution of order statistics
Binomial – is useful for describing distributions of binomial events, such as the number of defective components in samples of 20 units taken from a production process
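As a minimal sketch of the Binomial entry above, the following computes the probability of finding k defective components in a sample of 20 units. The defect rate p = 0.05 is an assumed value chosen only for illustration; it does not come from this sheet.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defectives among n independent units."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 20    # sample size, as in the entry above
p = 0.05  # assumed defect rate (illustrative only)

# Full distribution over 0..n defectives
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
prob_at_most_one = pmf[0] + pmf[1]

print(f"P(0 defectives)  = {pmf[0]:.4f}")
print(f"P(<=1 defective) = {prob_at_most_one:.4f}")
```

Changing p or n shows how quickly the chance of a defect-free sample drops as the defect rate or the sample size grows.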
Tree clustering – objects are linked in successive steps, yielding a tree that ultimately joins all objects
Cauchy – is often used in statistics as the canonical example of a pathological distribution, since both its mean and variance are undefined
k-means – one specifies a priori how many clusters to expect, and the program will try to find the best division of objects into the requested number of clusters
Chi-square – the sum of n independent squared random variables, each distributed following the standard normal distribution, is distributed as Chi-square with n degrees of freedom
Factor analysis – is to detect underlying dimensions that explain relations between multiple variables
Process Analysis – to what extent does the long-term performance of the process comply with engineering requirements or managerial goals?
Two-way joining – both cases and variables are clustered automatically; it will attempt to form clusters of similar data points
Association rules – is to detect relationships or associations between (values) specific values of categorical variables in large data sets
Exponential – is frequently used to model the time interval between successive random events; an example would be the gap length between cars crossing an intersection
Correspondence analysis – to analyze two-way and multi-way tables containing some measure of correspondence between rows and columns
Extreme Value – to model extreme events such as the size of floods, gust velocities encountered by airplanes, maxima of stock indices over a given year, etc
EM clustering – by fitting a mixture of distributions to the data
Pareto chart – identify the few that account for the majority of problems
Regression control chart – monitor the relationship between two aspects of the production process
Neural networks – analytic techniques modeled after the processes of learning in the cognitive system and the neurological functions of the brain, and capable of predicting new observations from other observations after executing a process of so-called learning from existing data
ARIMA – is to detect seasonal patterns
F – is mostly used in tests of variance (e.g., ANOVA)
Multidimensional scaling – is to detect underlying dimensions for a set of multiple input variables
Association Rules Analysis
Autocorrelation analysis – allows you to examine the lagged effect (correlation) of a variable with itself or with another variable
Process capability indices are ratios that summarize the extent to which a manufacturing process or supplier turns out products or parts within specified engineering limits, for example 6Sigma
Gamma – used when modeling the distribution of the life-times of a product such as an electric light bulb, or the serving time taken at a ticket booth at a baseball game
PLS – allows you to extract factors (components) from a data set that includes one or more predictor variables and one or more dependent variables
Fourier analysis – is to decompose a time series into simple underlying wave forms
Quality control charts – monitor the extent to which our products meet specifications
Perform Pareto analysis
Cluster analysis – tree clustering, k-means clustering, two-way joining, EM clustering
Geometric – if independent Bernoulli trials are made until a “success” occurs, then the total number of trials required is a geometric random variable
Associations between variables
Regression control charts
Distributed lags – for analyzing the lagged effects of one independent variable with one dependent variable. This can
simultaneously evaluate multiple lags
Gompertz – is a theoretical distribution of survival times
Standard process capability histogram
Perform gage repeatability / reproducibility analysis
X-bar chart – the sample means are plotted in order to control the mean value of a variable
Autocorrelation analysis, ARIMA, Fourier analysis, Neural networks
Logistic – is used to model binary responses (e.g., gender) and is commonly used in logistic regression
Compute quality control charts for on-line quality control (Shewhart control charts)
Clusters or natural groups of variables or cases
Neural networks – to predict lagged response variables
Log-normal – is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals, etc
Log linear – techniques for the analysis of multi-way frequency (cross tabulation) tables, to select the best model for the data
R-chart – the sample ranges are plotted in order to control the variability of a variable
Factor Analysis, Correspondence analysis, Multidimensional scaling, Partial least squares
Generalized linear models, Log linear
Classification trees – for predicting the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables
Compute quality control charts
Normal – is a bell-shaped curve which is symmetrical about the mean; it is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions, and it’s a good model for a random variable
T**2 chart – simultaneously monitor several measurements
Patterns or trends in observations over time
Decision trees, Classification trees, Discriminant analysis, cluster analysis, nonparametric statistics - MARSplines, nonlinear estimation, Naïve Bayes, Random Forest
MARSplines – nonparametric regression procedure which makes no assumption about the underlying functional relationship between the dependent and independent variables
Pareto – is commonly used in monitoring production processes; for example, a machine which produces copper wire will occasionally generate a flaw at some point along the wire
Perform process capability analysis
OC Curves – extremely useful for exploring the power of our quality control procedure
Survival analysis – Weibull, Gompertz, exponential, linear hazard
Analyze failure times
Factors or dimensions in multiple continuous variables
Relationships in multi-way cross tabulation tables
Poisson – distribution of rare events, for example the number of accidents per person, number of sweepstakes won per person, etc
Random Forest – consists of an arbitrary number of simple trees, which are used to determine the final outcome
Develop/analyze sampling plans
Fourier – decompose a complex time series with cyclical components into a few underlying sinusoidal functions of particular wavelengths
Rayleigh – distance of darts from the target in a dart-throwing game
Power analysis – sample size estimation &
Relationships between Simple frequency tables and plot Basic statistics -> mean, sd,
Rectangular – useful for describing random variables with a constant probability density over the defined range confidence interval estimation ARIMA – allows us to uncover the hidden
Designing / analyzing histograms or stem-and-leaf diagrams variance, skewness, kurtosis,
predictors and responses patterns in the data but also generate forecasts
Student’s t – is symmetric about 0, and its shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population
experiments (including Compute statistics for Explore data and Use Time Series, on patterns of the data are unclear and frequency distribution tables,
on categorical dependent Autocorrelation analysis, observations involve considerable error
Experimental designs – apply analysis of Taguchi) industrial quality control search for structure / variables Fourier decomposition,
Weibull – used when the failure probability varies over time, often used in reliability testing (e.g., ball bearings)
histograms, pie charts,
variance principles to product development Independent variable – are those that are manipulated / improvement? patterns / factors / Explore/Summarize seasonal & non- Cross tabulate two or more variables Chernoff faces, star, sun ray
(male/female) / grouping variables seasonal ARIMA plots, etc
clusters? a time series to describe their joint distribution
(cross tabulation tables, banner tables)
Statistical significance test - Compare the observed distribution of variables against several theoretical distributions and test the discrepancy of the observed data from the respective theoretical distributions
Dependent variable – are those that are measured / registered (WCC/height/weight/salary/etc)
Tabulate/Plot categorical data Nonparametric – where we know nothing about Basic statistics + Harmonic & Basic statistics + two & higher way
Distributions – normal, exponential, gamma, log-normal, Chi-square, binomial,
geometric, etc. Statistical significance Nominal /Categorical – gender, race, color, city, etc Test hypothesis Do you want to (such as gender, occupation) the parameters of the variable of interest in the geometric mean, mode, cross tabulation tables, banner tables,
p-value – statistical significance of a result is an estimated measure of the degree to which it is tests – Chi-square test, Kolmogorov - Ordinal – middle class / rich, bachelor / masters, etc population, do not rely on mean or sd describing log linear, 2d & 3d histograms
“true” Smirnov test Interval/Ratio – temperature, income, etc (predictions) and compute frequencies, distribution of the variable of interest. Or data
median, quartiles, percentiles,
Describe / percentages, etc contains rankings rather than precise measurements quartile range, minimum,
Survival analysis - Weibull, Gompertz, The shape of or about your maximum, sum, etc
linear & exponential hazard data? Summarize / Detailed descriptive statistics
Z-value – standardized value: the value is expressed in terms of its difference from the mean, divided by the sd
Kolmogorov-Smirnov test & Shapiro-Wilk W test – are used to test for normality
Various curve fitting routines
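The Z-value entry above (difference from the mean, divided by sd) can be sketched as follows. The data values are made up for illustration, and using the sample standard deviation (n - 1 denominator) is an assumption; the sheet does not say which sd is meant.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (x - mean) / sd for each x, using the sample sd."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up measurements
z = z_scores(data)
print([round(v, 2) for v in z])
```

After standardizing, the values have mean 0 and sample sd 1, which is what makes Z-values comparable across variables measured on different scales.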
fitting distributions Life table analysis – the proportion surviving up to the
Tabulate Data? Summarize/Plot the shape of the Nonparametric Censored data Use Survival respective interval, the proportion failing in the interval,
the hazard rate for the interval, percentiles of the cumulative survival function
Confidence intervals – give us a range of values around the mean where we expect the “true” (population) mean to be located
Kaplan-Meier product-limit method (failure times,
distribution of continuous Specialized industrial analysis
Monte-Carlo studies – to determine how sensitive they are to
violations of the assumptions of normal distribution of the analyzed Differences survival times)
variables in population variables/measurements descriptive stats and stats for
between Differences Correlation – is a measure of the relation between two or survival/failure times
Kaplan-Meier product-limit estimates – life table
t-Test – to evaluate the differences in means
Differences in relationship Process capability analysis computed continuously
between groups. If the resultant t-value is groups/samples between variables more variables
(of an Use Process
Perfect Negative = -1, Perfect Positive = +1, No Correlation =
statistically significant then one can conclude between variables in Descriptive statistics by industrial/manufac analysis Specialized distributions – Weibull, Gompertz, linear
that the means in the two variables are t-Test for independent 0 Simple descriptive statistics
Residual = deviations from the regression line different groups groups, samples or variables turing process)
hazard
different samples, F-test for and frequency distributions
Outliers = atypical / infrequent observations
the comparison of Comparison of Basic statistics -> mean, sd, variance,
F-test – for comparison of the variances in variances in two Comparison of Breakdown table of means of a Gage repeatability Use Process
two groups, if statistically significant, one can means/variances in General Pearson r – determines the extent to which values of two skewness, kurtosis, frequency distribution & reproducibility analysis
groups, means/variances in Comparison of variables are “proportional” to each other variable (income by gender,
conclude that the variances (variability) in the two groups/samples (nonparametric)
survival/failure times
Relationship tables,
two groups are different Nonparametric tests, multiple groups within gender by occupation,
comparisons of between variables Spearman R – similar to Pearson r except that it is computed
histograms, line in two or more groups etc), compute box plots,
distributions in two or from ranks histograms, pie charts, Chernoff faces,
graphs, scatter plots, statistics broken down by
more groups star, sun ray plots, etc
etc another variable. Gage repeatability – to the extent to which repeated measurements of
Histograms, line graphs, the same part by the same operator (of the gage) produce identical
results
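The Pearson r and Spearman R entries above differ only in what is correlated: raw values versus ranks. A minimal sketch, assuming average ranks for ties (the sheet does not specify tie handling) and using made-up data:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_r(x, y):
    """Spearman R: Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear

print(f"Pearson r  = {pearson_r(x, y):.3f}")
print(f"Spearman R = {spearman_r(x, y):.3f}")
```

Because y rises monotonically with x, the ranks agree perfectly and Spearman R reaches 1, while Pearson r stays below 1 since the relation is not linear.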
General Linear Models (GLM) – assumes Comparison of means Comparison of means General General categorized plots
General Linear
that the variables in the comparison are Rank order tests, for two variables or for several variables (nonparametric) (nonparametric) Gage reproducibility – to the extent to which different operators
Models – ANOVA / Mean – average or central tendency of a continuous
normally distributed within the groups Kolmogorov-Smirnov measurements or repeated comparison of two comparison of several measuring the same parts with the same gage produce identical
MANOVA Use Survival analysis distribution measurements
two sample test measurements variables variables
Generalized Linear Models (GLZ) – Generalized Linear – Gehan’s
Generalized Linear Standard deviation & Variance – measures of
doesn’t assume that the variables in the Models – ANOVA generalized Wilcoxon
analysis follow normal distribution Models – maximum Simple linear Simple Multiple linear dispersion, that is, the variability of data
like, maximum test, the Cox-Mantel Multiple Relation Nonlinear Time-dependent Analysis of Stratified linear or
likelihood methods, relationships between relationships relationships
Variance components & Mixed model likelihood methods test, the Cox F-test, relationships between two relationships (lagged) relationships covariance nonlinear regression Skewness – measure of the sidedness of the
Variance the log-rank test, Peto two continuous between two between distribution – skewed towards right or left of the mean
ANOVA/MANOVA – techniques for
Histograms, line t-Test for dependent General Linear Model Nonparametric tests, Nonparametric tests, between sets of
analyzing research designs with random Components and and Peto’s variables categorical continuous
graphs, scatter plots, samples – repeated measures rank order tests, rank order tests, categorical variables General linear Kurtosis – measure of pointedness or peakedness of
effects, including the estimation of variance
mixed model ANOVA generalized Wilcoxon variables variables Time series, Neural
components for such effects. etc Nonparametric tests, ANOVA / MANOVA McNemar test McNemar test variables ANCOVA models, Generalized the distribution - spread or centered closely around
/ ANCOVA, test. Nonparametric networks the mean
It is also well suited for analyzing large main Nonparametric tests – linear models,
effect designs, and designs with many Discriminant function tests. Two-way or higher-
Wald - Wolfowitz test, Histograms, line Friedman analysis of Histograms, line Histograms, line Nonlinear estimation,
factors where the higher order interactions Analysis,
Mann-Whitney U test, graphs, scatter plots, variance, graphs, scatter plots, graphs, scatter plots, way cross tabulation Canonical Correlation Survival analysis, Histogram – examine frequency distributions of values of variables
are not of interest, and analysis involving
Nonparametric tests, tables, Nonparametric Scatter plots, surface
case weights Kolmogorov-Smirnov etc etc etc
tests, Fisher exact Canonical correlation Scatter plot – visualize relations between two or three variables
two-sample test, Histograms, line Log linear, plots etc
Discriminant function analysis – how to Histograms, line Standard Pearson r, test, McNemar’s test – to investigate the
Kruskal -Wallis test, graphs, scatter plots, Correspondence Probability plot – provide a quick way to visually inspect to what extent the pattern of data follows a distribution
identify the specific variables that show graphs, scatter plots, Multiple regression, simultaneous
different means in different groups Friedman’s two way etc analysis, Factor relationship between
etc Nonparametric 3D histograms Quantile-Quantile plot – is useful for finding the best fitting distribution within a family of distributions
analysis, Cochran Q McNemar test for changes in proportions analysis, predictive two set of variables
(Spearman R,
test – example, if one wants to compare how
mapping, Visual
many students in a class fail a particular test Kendall tau, Gamma Probability-Probability plot – is useful for determining how well a specific theoretical distribution fits the observed data
ANOVA/MANOVA – is to test for significant generalized linear Probit / Logit
differences between means by comparing at the beginning of the semester, and at the coefficient) Polynomial
end of semester model, link functions regression for
Line plot – individual data points are connected by a line, which provides a simple way to visually represent a sequence of values
variances Survival analysis, regression analysis General nonlinear
Proportional hazard General regression categorical Nonlinear regression
regression
Box plot – ranges of values of selected variables are plotted separately for groups of cases
regression model models, Generalized dependant and analysis for censored
linear models, continuous survival times Pie chart – used for representing proportions of values of variables
2D scatter plots, line multiple regression, independent variables
Missing data point plot – to visualize the pattern or distribution of missing data
graphs, etc partial least squares, General regression Generalized linear Use Survival analysis
survival analysis, models, general model, nonlinear Ternary plot – to examine relations between three or more dimensions where three of those dimensions represent
proportional hazard linear models, Generalized linear estimation components of a mixture
regression model multiple regression model, nonlinear 2D, 3D fitting of lines Icon plots – represent cases or units of observation as multidimensional symbols, to spot complex relations
estimation or surfaces to
observed data
Chernoff faces – a face is drawn for each case, with the relative values of the selected variables assigned to the shapes and sizes of individual face features (sun rays, stars, etc are variations of the same plot)
Regression – objective is to estimate the value of a continuous output variable from some input variables
Contour plot – is the projection of a 3-dimensional surface onto a 2-dimensional plane
Polynomial regression computes the relationship between a dependent variable and one or more independent variables, and those independent variables squared, cubed, etc. Polynomial regression is to detect some curvilinearity in a relationship
Logit and probit regression are used for analyzing the relationship between one or more independent variables and a categorical dependent variable at two levels
A data set is Censored if some observations are incomplete but not missing. For example, a part or product may not fail within the time span covered by the study; however, we do not know how long it will function properly thereafter
Ishikawa chart – to depict the factors or variables that make up a process (cause-and-effect diagram)
Multiple regression analysis – in which one dependent variable is related to multiple independent variables
Gain/Lift chart – provides a visual summary of the usefulness of the information provided by one or more statistical models
Matrix plot – summarizes the relationships between several variables in a matrix of true x-y plots
PLS – allows you to extract factors from a dataset that includes one or more predictor variables, and one or more dependent variables
Pareto chart – identify the causes of quality problems or loss
ROC curve – used to evaluate the goodness of fit for a classifier
Ternary plot – the triangular coordinate systems are used to plot three (or more) variables
Brushing – an interactive method that enables us to select on-screen specific data points and identify their characteristics