
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.[1][2][3] In applying statistics to a scientific, industrial, or
social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all
people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design
of surveys and experiments.[4]
When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that
inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under
study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the
measurements. In contrast, an observational study does not involve experimental manipulation.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation,
and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation).[5] Descriptive statistics are most
often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution's central or typical value,
while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences in mathematical statistics are
made under the framework of probability theory, which deals with the analysis of random phenomena.
A standard statistical procedure involves the collection of data leading to a test of the relationship between two statistical data sets, or a data set and synthetic data drawn from
an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of
no relationship between two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false,
given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (null hypothesis is falsely rejected giving a "false
positive") and Type II errors (null hypothesis fails to be rejected and an actual relationship between populations is missed giving a "false negative").[6] Multiple problems have
come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.[5]
Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of
errors (e.g., blunder, such as when an analyst reports incorrect units) can also occur. The presence of missing data or censoring may result in biased estimates and specific
techniques have been developed to address these problems.


Introduction
Main article: Outline of statistics
Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data,[7] or, in some accounts, a branch
of mathematics.[8] Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data,
statistics is concerned with the use of data in the context of uncertainty and decision making in the face of uncertainty.[9][10]
In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as "all people living in a
country" or "every atom composing a crystal". Ideally, statisticians compile data about the entire population (an operation called census). This may be organized by
governmental statistical institutes. Descriptive statistics can be used to summarize the population data. Numerical descriptors include mean and standard
deviation for continuous data (like income), while frequency and percentage are more useful in terms of describing categorical data (like education).
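As a concrete illustration of these numerical descriptors, here is a minimal Python sketch; the income and education values are hypothetical:

```python
# Minimal sketch (hypothetical data): mean and standard deviation for a
# continuous variable, frequency and percentage for a categorical one.
import statistics
from collections import Counter

incomes = [32_000, 45_500, 51_200, 28_750, 39_900]          # continuous data
education = ["primary", "secondary", "tertiary",
             "secondary", "tertiary"]                        # categorical data

print("mean income:", statistics.mean(incomes))
print("standard deviation:", statistics.stdev(incomes))     # sample standard deviation

for level, count in Counter(education).items():
    print(level, count, f"{100 * count / len(education):.0f}%")
```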
When a census is not feasible, a chosen subset of the population called a sample is studied. Once a sample that is representative of the population is determined, data is
collected for the sample members in an observational or experimental setting. Again, descriptive statistics can be used to summarize the sample data. However, drawing the
sample contains an element of randomness; hence, the numerical descriptors from the sample are also prone to uncertainty. To draw meaningful conclusions about the entire
population, inferential statistics is needed. It uses patterns in the sample data to draw inferences about the population represented while accounting for randomness. These
inferences may take the form of answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation),
describing associations within the data (correlation), and modeling relationships within the data (for example, using regression analysis). Inference can extend
to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. It can include extrapolation and interpolation of time
series or spatial data, and data mining.

Mathematical statistics
Main article: Mathematical statistics
Mathematical statistics is the application of mathematics to statistics. Mathematical techniques used for this include mathematical analysis, linear algebra, stochastic
analysis, differential equations, and measure-theoretic probability theory.[11][12]

History
Gerolamo Cardano, a pioneer on the mathematics of probability.

Main articles: History of statistics and Founders of statistics


The early writings on statistical inference date back to Arab mathematicians and cryptographers, during the Islamic Golden Age between the 8th and 13th centuries. Al-
Khalil (717–786) wrote the Book of Cryptographic Messages, which contains the first use of permutations and combinations, to list all possible Arabic words with and without
vowels.[13] In his book, Manuscript on Deciphering Cryptographic Messages, Al-Kindi gave a detailed description of how to use frequency analysis to
decipher encrypted messages. Al-Kindi also made the earliest known use of statistical inference, while he and later Arab cryptographers developed the early statistical
methods for decoding encrypted messages. Ibn Adlan (1187–1268) later made an important contribution, on the use of sample size in frequency analysis.[13]
The earliest European writing on statistics dates back to 1663, with the publication of Natural and Political Observations upon the Bills of Mortality by John Graunt.[14] Early
applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data, hence its stat- etymology. The scope of the discipline
of statistics broadened in the early 19th century to include the collection and analysis of data in general. Today, statistics is widely employed in government, business, and
natural and social sciences.
The mathematical foundations of modern statistics were laid in the 17th century with the development of probability theory by Gerolamo Cardano, Blaise Pascal and Pierre
de Fermat. Mathematical probability theory arose from the study of games of chance, although the concept of probability was already examined in medieval law and by
philosophers such as Juan Caramuel.[15] The method of least squares was first described by Adrien-Marie Legendre in 1805.
Karl Pearson, a founder of mathematical statistics.

The modern field of statistics emerged in the late 19th and early 20th century in three stages.[16] The first wave, at the turn of the century, was led by the work of Francis
Galton and Karl Pearson, who transformed statistics into a rigorous mathematical discipline used for analysis, not just in science, but in industry and politics as well. Galton's
contributions included introducing the concepts of standard deviation, correlation, regression analysis and the application of these methods to the study of the variety of human
characteristics—height, weight, eyelash length among others.[17] Pearson developed the Pearson product-moment correlation coefficient, defined as a product-
moment,[18] the method of moments for the fitting of distributions to samples and the Pearson distribution, among many other things.[19] Galton and Pearson
founded Biometrika as the first journal of mathematical statistics and biostatistics (then called biometry), and the latter founded the world's first university statistics department
at University College London.[20]
Ronald Fisher coined the term null hypothesis during the Lady tasting tea experiment, which "is never proved or established, but is possibly disproved, in the course of
experimentation".[21][22]
The second wave of the 1910s and 20s was initiated by William Sealy Gosset, and reached its culmination in the insights of Ronald Fisher, who wrote the textbooks that were
to define the academic discipline in universities around the world. Fisher's most important publications were his 1918 seminal paper The Correlation between Relatives on the
Supposition of Mendelian Inheritance (which was the first to use the statistical term, variance), his classic 1925 work Statistical Methods for Research Workers and his
1935 The Design of Experiments,[23][24][25] where he developed rigorous design of experiments models. He originated the concepts of sufficiency, ancillary statistics, Fisher's
linear discriminator and Fisher information.[26] In his 1930 book The Genetical Theory of Natural Selection, he applied statistics to various biological concepts such as Fisher's
principle[27] (which A. W. F. Edwards called "probably the most celebrated argument in evolutionary biology") and Fisherian runaway,[28][29][30][31][32][33] a concept in sexual
selection about a positive feedback runaway effect found in evolution.
The final wave, which mainly saw the refinement and expansion of earlier developments, emerged from the collaborative work between Egon Pearson and Jerzy Neyman in
the 1930s. They introduced the concepts of "Type II" error, power of a test and confidence intervals. Jerzy Neyman in 1934 showed that stratified random sampling was in
general a better method of estimation than purposive (quota) sampling.[34]
Today, statistical methods are applied in all fields that involve decision making, for making accurate inferences from a collated body of data and for making decisions in the
face of uncertainty based on statistical methodology. The use of modern computers has expedited large-scale statistical computations and has also made possible new
methods that are impractical to perform manually. Statistics continues to be an area of active research, for example on the problem of how to analyze big data.[35]

Statistical data
Main article: Statistical data

Data collection
Sampling
When full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools
for prediction and forecasting through statistical models.
To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and
conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative.
Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design for experiments
that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.
Sampling theory is part of the mathematical discipline of probability theory. Probability is used in mathematical statistics to study the sampling distributions of sample
statistics and, more generally, the properties of statistical procedures. The use of any statistical method is valid when the system or population under consideration satisfies the
assumptions of the method. The difference in point of view between classic probability theory and sampling theory is, roughly, that probability theory starts from the given
parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction—inductively inferring from
samples to the parameters of a larger or total population.
Experimental and observational studies
A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors
or independent variables on dependent variables. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of
studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable are observed. The difference between the two types lies in
how the study is actually conducted. Each can be very effective. An experimental study involves taking measurements of the system under study, manipulating the system,
and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an
observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated. While the tools
of data analysis work best on data from randomized studies, they are also applied to other kinds of data—like natural experiments and observational studies[36]—for which a
statistician would use a modified, more structured estimation method (e.g., Difference in differences estimation and instrumental variables, among many others) that
produces consistent estimators.
Experiments
The basic steps of a statistical experiment are:

1. Planning the research, including finding the number of replicates of the study, using the following information: preliminary estimates regarding the size of treatment
effects, alternative hypotheses, and the estimated experimental variability. Consideration of the selection of experimental subjects and the ethics of research is
necessary. Statisticians recommend that experiments compare (at least) one new treatment with a standard treatment or control, to allow an unbiased estimate of the
difference in treatment effects.
2. Design of experiments, using blocking to reduce the influence of confounding variables, and randomized assignment of treatments to subjects to allow unbiased
estimates of treatment effects and experimental error. At this stage, the experimenters and statisticians write the experimental protocol that will guide the performance
of the experiment and which specifies the primary analysis of the experimental data.
3. Performing the experiment following the experimental protocol and analyzing the data following the experimental protocol.
4. Further examining the data set in secondary analyses, to suggest new hypotheses for future study.
5. Documenting and presenting the results of the study.
Experiments on human behavior have special concerns. The famous Hawthorne study examined changes to the working environment at the Hawthorne plant of the Western
Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The
researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity. It
turned out that productivity indeed improved (under the experimental conditions). However, the study is heavily criticized today for errors in experimental procedures,
specifically for the lack of a control group and blindness. The Hawthorne effect refers to the finding that an outcome (in this case, worker productivity) changed due to observation
itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed.[37]
Observational study
An example of an observational study is one that explores the association between smoking and lung cancer. This type of study typically uses a survey to collect observations
about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through
a cohort study, and then look for the number of cases of lung cancer in each group.[38] A case-control study is another type of observational study in which people with and
without the outcome of interest (e.g. lung cancer) are invited to participate and their exposure histories are collected.

Types of data
Main articles: Statistical data type and Levels of measurement
Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio
scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one (injective) transformation. Ordinal measurements have imprecise
differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have
meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit),
and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any
rescaling transformation.
Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical
variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature.
Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type,
polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating
point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
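The mapping between statistical data types and computer science data types described above can be sketched with pandas dtypes; the column names and values below are hypothetical:

```python
# Illustrative sketch only: dichotomous -> Boolean, ordinal -> ordered category,
# interval/ratio -> floating point or integer (hypothetical columns).
import pandas as pd

df = pd.DataFrame({
    "smoker": [True, False, True],                       # dichotomous categorical
    "education": pd.Categorical(
        ["primary", "tertiary", "secondary"],
        categories=["primary", "secondary", "tertiary"],
        ordered=True),                                    # ordinal categorical
    "temperature_c": [21.5, 19.0, 23.2],                  # interval scale
    "income": [32000, 45500, 51200],                      # ratio scale
})
print(df.dtypes)
```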
Other categorizations have been proposed. For example, Mosteller and Tukey (1977)[39] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder
(1990)[40] described continuous counts, continuous ratios, count ratios, and categorical modes of data. (See also: Chrisman (1998),[41] van den Berg (1991).[42])
The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by
issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely
reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is
sensible to contemplate depends on the question one is trying to answer."[43]:82

Methods

Descriptive statistics
Main article: Descriptive statistics
A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features of a collection of information,[44] while descriptive
statistics in the mass noun sense is the process of using and analyzing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics),
in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.

Inferential statistics
Main article: Statistical inference
Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.[45] Inferential statistical analysis infers properties of
a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population. Inferential statistics can
be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data
come from a larger population.
Terminology and theory of inferential statistics
Statistics, estimators and pivotal quantities
Consider independent identically distributed (IID) random variables with a given probability distribution: standard statistical inference and estimation theory defines a random
sample as the random vector given by the column vector of these IID variables.[46] The population being examined is described by a probability distribution that may have
unknown parameters.
A statistic is a random variable that is a function of the random sample, but not a function of unknown parameters. The probability distribution of the statistic, though, may have
unknown parameters. Consider now a function of the unknown parameter: an estimator is a statistic used to estimate such a function. Commonly used estimators include the sample
mean, the unbiased sample variance and the sample covariance.
A random variable that is a function of the random sample and of the unknown parameter, but whose probability distribution does not depend on the unknown parameter is
called a pivotal quantity or pivot. Widely used pivots include the z-score, the chi square statistic and Student's t-value.
Between two estimators of a given parameter, the one with lower mean squared error is said to be more efficient. Furthermore, an estimator is said to be unbiased if
its expected value is equal to the true value of the unknown parameter being estimated, and asymptotically unbiased if its expected value converges at the limit to the true
value of such parameter.
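Restating these definitions in standard notation, with $\hat{\theta}_n$ an estimator of the parameter $\theta$ computed from a sample of size $n$:

```latex
\operatorname{bias}(\hat{\theta}_n) = \operatorname{E}[\hat{\theta}_n] - \theta ,
\qquad
\hat{\theta}_n \text{ unbiased} \iff \operatorname{bias}(\hat{\theta}_n) = 0 ,
\qquad
\text{asymptotically unbiased} \iff \lim_{n\to\infty} \operatorname{E}[\hat{\theta}_n] = \theta .
```

As an example of a pivot, for $X_1,\dots,X_n$ independent $N(\mu,\sigma^2)$ observations with known $\sigma$, the z-score $Z = (\bar{X}-\mu)/(\sigma/\sqrt{n})$ has a standard normal distribution that does not depend on the unknown $\mu$.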
Other desirable properties for estimators include: UMVUE estimators that have the lowest variance for all possible values of the parameter to be estimated (this is usually an
easier property to verify than efficiency) and consistent estimators, which converge in probability to the true value of such parameter.
This still leaves the question of how to obtain estimators in a given situation and carry out the computation; several methods have been proposed: the method of moments,
the maximum likelihood method, the least squares method and the more recent method of estimating equations.
Null hypothesis and alternative hypothesis
Interpretation of statistical information can often involve the development of a null hypothesis which is usually (but not necessarily) that no relationship exists among variables
or that no change occurred over time.[47][48]
The best illustration for a novice is the predicament encountered in a criminal trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative
hypothesis, H1, asserts that the defendant is guilty. The indictment comes because of suspicion of guilt. The H0 (status quo) stands in opposition to H1 and is maintained
unless H1 is supported by evidence "beyond a reasonable doubt". However, "failure to reject H0" in this case does not imply innocence, but merely that the evidence was
insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0. While one can not "prove" a null hypothesis, one can test how close it is to being true
with a power test, which tests for type II errors.
What statisticians call an alternative hypothesis is simply a hypothesis that contradicts the null hypothesis.
Error
Working from a null hypothesis, two broad categories of error are recognized:

 Type I errors where the null hypothesis is falsely rejected, giving a "false positive".
 Type II errors where the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a "false negative".
Standard deviation refers to the extent to which individual observations in a sample differ from a central value, such as the sample or population mean, while standard
error refers to an estimate of the difference between the sample mean and the population mean.
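In symbols, for a sample $x_1, \dots, x_n$ with sample mean $\bar{x}$ (a standard formulation):

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\operatorname{SE}(\bar{x}) = \frac{s}{\sqrt{n}} .
```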
A statistical error is the amount by which an observation differs from its expected value; a residual is the amount by which an observation differs from the value that the estimator of
the expected value assumes on a given sample (also called the prediction).
Mean squared error is used for obtaining efficient estimators, a widely used class of estimators. Root mean square error is simply the square root of mean squared error.
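In symbols, for an estimator $\hat{\theta}$ of a parameter $\theta$:

```latex
\operatorname{MSE}(\hat{\theta}) = \operatorname{E}\!\left[(\hat{\theta} - \theta)^2\right]
= \operatorname{Var}(\hat{\theta}) + \operatorname{bias}(\hat{\theta})^2 ,
\qquad
\operatorname{RMSE}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} .
```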
A least squares fit: in red the points to be fitted, in blue the fitted line.

Many statistical methods seek to minimize the residual sum of squares, and these are called "methods of least squares" in contrast to Least absolute deviations. The latter
gives equal weight to small and big errors, while the former gives more weight to large errors. Residual sum of squares is also differentiable, which provides a handy property
for doing regression. Least squares applied to linear regression is called ordinary least squares method and least squares applied to nonlinear regression is called non-linear
least squares. In a linear regression model, the non-deterministic part of the model is called the error term, disturbance or more simply noise. Both linear regression and non-
linear regression are addressed in polynomial least squares, which also describes the variance in a prediction of the dependent variable (y axis) as a function of the
independent variable (x axis) and the deviations (errors, noise, disturbances) from the estimated (fitted) curve.
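The least-squares idea can be sketched in a few lines of Python with NumPy; the synthetic data and the assumed linear model (y = a + b·x plus noise) are illustrative only:

```python
# Minimal ordinary least squares sketch on synthetic data; np.polyfit with
# deg=1 minimizes the residual sum of squares for a straight-line fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # noisy observations

b, a = np.polyfit(x, y, deg=1)                           # slope first, then intercept
residuals = y - (a + b * x)
print("intercept:", a, "slope:", b)
print("residual sum of squares:", np.sum(residuals ** 2))
```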
Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of
errors (e.g., blunder, such as when an analyst reports incorrect units) can also be important. The presence of missing data or censoring may result in biased estimates and
specific techniques have been developed to address these problems.[49]
Interval estimation
Main article: Interval estimation

Confidence intervals: the red line is the true value for the mean in this example, the blue lines are random confidence intervals for 100 realizations.

Most studies only sample part of a population, so results don't fully represent the whole population. Any estimates obtained from the sample only approximate the population
value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95%
confidence intervals. Formally, a 95% confidence interval for a value is a range where, if the sampling and analysis were repeated under the same conditions (yielding a
different dataset), the interval would include the true (population) value in 95% of all possible cases. This does not imply that the probability that the true value is in the
confidence interval is 95%. From the frequentist perspective, such a claim does not even make sense, as the true value is not a random variable. Either the true value is or is
not within the given interval. However, it is true that, before any data are sampled and given a plan for how to construct the confidence interval, the probability is 95% that the
yet-to-be-calculated interval will cover the true value: at this point, the limits of the interval are yet-to-be-observed random variables. One approach that does yield an interval
that can be interpreted as having a given probability of containing the true value is to use a credible interval from Bayesian statistics: this approach depends on a different way
of interpreting what is meant by "probability", that is as a Bayesian probability.
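The repeated-sampling interpretation can be illustrated with a short simulation sketch; the parameters are hypothetical and the interval uses the normal-approximation critical value 1.96 rather than a t quantile:

```python
# Simulate many samples and check how often the 95% interval covers the true mean.
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n, reps = 5.0, 2.0, 30, 10_000
covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)                 # estimated standard error
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= true_mean <= hi
print("empirical coverage:", covered / reps)             # close to 0.95
```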
In principle confidence intervals can be symmetrical or asymmetrical. An interval can be asymmetrical because it works as a lower or upper bound for a parameter (left-sided
interval or right-sided interval), but it can also be asymmetrical because the two-sided interval is built violating symmetry around the estimate. Sometimes the bounds for a
confidence interval are reached asymptotically and these are used to approximate the true bounds.
Significance
Main article: Statistical significance
Statistics rarely give a simple Yes/No type answer to the question under analysis. Interpretation often comes down to the level of statistical significance applied to the numbers
and often refers to the probability of a value accurately rejecting the null hypothesis (sometimes referred to as the p-value).

In this graph the black line is the probability distribution for the test statistic, the critical region is the set of values to the right of the observed data point (observed value of the test statistic), and the p-value is represented by the green area.

The standard approach[46] is to test a null hypothesis against an alternative hypothesis. A critical region is the set of values of the estimator that leads to refuting the null
hypothesis. The probability of type I error is therefore the probability that the estimator belongs to the critical region given that null hypothesis is true (statistical significance)
and the probability of type II error is the probability that the estimator doesn't belong to the critical region given that the alternative hypothesis is true. The statistical power of a
test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false.
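In symbols, with $T$ the test statistic (estimator) and $C$ the critical region:

```latex
\alpha = P(T \in C \mid H_0 \text{ true}), \qquad
\beta = P(T \notin C \mid H_1 \text{ true}), \qquad
\text{power} = 1 - \beta .
```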
Referring to statistical significance does not necessarily mean that the overall result is significant in real world terms. For example, in a large study of a drug it may be shown
that the drug has a statistically significant but very small beneficial effect, such that the drug is unlikely to help the patient noticeably.
Although in principle the acceptable level of statistical significance may be subject to debate, the significance level is the largest p-value that allows the test to reject the null
hypothesis. This test is logically equivalent to saying that the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test
statistic. Therefore, the smaller the significance level, the lower the probability of committing type I error.
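As an illustration, here is a sketch of a one-sample t-test at the 5% significance level using SciPy; the sample values and the hypothesized mean of 5.0 are hypothetical:

```python
# One-sample t-test sketch: reject H0 if the two-sided p-value falls below alpha.
from scipy import stats

sample = [5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: population mean == 5.0

alpha = 0.05
print("t =", t_stat, "p =", p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```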
Some problems are usually associated with this framework (See criticism of hypothesis testing):

 A difference that is highly statistically significant can still be of no practical significance, but it is possible to properly formulate tests to account for this. One response
involves going beyond reporting only the significance level to include the p-value when reporting whether a hypothesis is rejected or accepted. The p-value, however, does
not indicate the size or importance of the observed effect and can also seem to exaggerate the importance of minor differences in large studies. A better and increasingly
common approach is to report confidence intervals. Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both
the size of the effect and the uncertainty surrounding it.
 Fallacy of the transposed conditional, aka prosecutor's fallacy: criticisms arise because the hypothesis testing approach forces one hypothesis (the null hypothesis) to be
favored, since what is being evaluated is the probability of the observed result given the null hypothesis and not probability of the null hypothesis given the observed result.
An alternative to this approach is offered by Bayesian inference, although it requires establishing a prior probability.[50]
 Rejecting the null hypothesis does not automatically prove the alternative hypothesis.
 Like everything in inferential statistics, it relies on sample size, and therefore under fat tails p-values may be seriously miscomputed.[clarification needed]
Examples
Some well-known statistical tests and procedures are:

 Analysis of variance (ANOVA)
 Chi-squared test
 Correlation
 Factor analysis
 Mann–Whitney U
 Mean square weighted deviation (MSWD)
 Pearson product-moment correlation coefficient
 Regression analysis
 Spearman's rank correlation coefficient
 Student's t-test
 Time series analysis
 Conjoint Analysis
Exploratory data analysis
Main article: Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or
not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Misuse
Main article: Misuse of statistics
Misuse of statistics can produce subtle but serious errors in description and interpretation—subtle in the sense that even experienced professionals make such errors, and
serious in the sense that they can lead to devastating decision errors. For instance, social policy, medical practice, and the reliability of structures like bridges all rely on the
proper use of statistics.
Even when statistical techniques are correctly applied, the results can be difficult to interpret for those lacking expertise. The statistical significance of a trend in the data—
which measures the extent to which a trend could be caused by random variation in the sample—may or may not agree with an intuitive sense of its significance. The set of
basic statistical skills (and skepticism) that people need to deal with information in their everyday lives properly is referred to as statistical literacy.
There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the
presenter.[51] A mistrust and misunderstanding of statistics is associated with the quotation, "There are three kinds of lies: lies, damned lies, and statistics". Misuse of statistics
can be both inadvertent and intentional, and the book How to Lie with Statistics[51] outlines a range of considerations. In an attempt to shed light on the use and misuse of
statistics, reviews of statistical techniques used in particular fields are conducted (e.g. Warne, Lazo, Ramos, and Ritter (2012)).[52]
Ways to avoid misuse of statistics include using proper diagrams and avoiding bias.[53] Misuse can occur when conclusions are overgeneralized and claimed to be
representative of more than they really are, often by either deliberately or unconsciously overlooking sampling bias.[54] Bar graphs are arguably the easiest diagrams to use and
understand, and they can be made either by hand or with simple computer programs.[53] Unfortunately, most people do not look for bias or errors, so they are not noticed. Thus,
people may often believe that something is true even if it is not well represented.[54] To make data gathered from statistics believable and accurate, the sample taken must be
representative of the whole.[55] According to Huff, "The dependability of a sample can be destroyed by [bias]... allow yourself some degree of skepticism."[56]
To assist in the understanding of statistics Huff proposed a series of questions to be asked in each case:[51]

 Who says so? (Does he/she have an axe to grind?)
 How does he/she know? (Does he/she have the resources to know the facts?)
 What's missing? (Does he/she give us a complete picture?)
 Did someone change the subject? (Does he/she offer us the right answer to the wrong problem?)
 Does it make sense? (Is his/her conclusion logical and consistent with what we already know?)

The confounding variable problem: X and Y may be correlated, not because there is a causal relationship between them, but because both depend on a third variable Z. Z is called a confounding
factor.

Misinterpretation: correlation
See also: Correlation does not imply causation
The concept of correlation is particularly noteworthy for the potential confusion it can cause. Statistical analysis of a data set often reveals that two variables (properties) of the
population under consideration tend to vary together, as if they were connected. For example, a study of annual income that also looks at age of death might find that poor
people tend to have shorter lives than affluent people. The two variables are said to be correlated; however, they may or may not be the cause of one another. The correlation
phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. For this reason, there is no way to
immediately infer the existence of a causal relationship between the two variables.
Applications
Applied statistics, theoretical statistics and mathematical statistics
Applied statistics comprises descriptive statistics and the application of inferential statistics.[57][58] Theoretical statistics concerns the logical arguments underlying justification of
approaches to statistical inference, as well as encompassing mathematical statistics. Mathematical statistics includes not only the manipulation of probability
distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments.
Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions.

Machine learning and data mining


Machine learning models are statistical and probabilistic models that capture patterns in the data through use of computational algorithms.

Statistics in academia
Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Business statistics applies statistical
methods in econometrics, auditing and production and operations, including services improvement and marketing research.[59] A study of two journals in tropical biology found
that the 12 most frequent statistical tests are: Analysis of Variance (ANOVA), Chi-Square Test, Student’s T Test, Linear Regression, Pearson’s Correlation Coefficient, Mann-
Whitney U Test, Kruskal-Wallis Test, Shannon’s Diversity Index, Tukey's Test, Cluster Analysis, Spearman’s Rank Correlation Test and Principal Component Analysis.[60]
A typical statistics course covers descriptive statistics, probability, binomial and normal distributions, test of hypotheses and confidence intervals, linear regression, and
correlation.[61] Modern fundamental statistical courses for undergraduate students focus on correct test selection, results interpretation, and use of free statistics software.[60]

Statistical computing

gretl, an example of an open source statistical package

Main article: Computational statistics


The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science.
Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest
in nonlinear models (such as neural networks) as well as the creation of new types, such as generalized linear models and multilevel models.
Increased computing power has also led to the growing popularity of computationally intensive methods based on resampling, such as permutation tests and the bootstrap,
while techniques such as Gibbs sampling have made use of Bayesian models more feasible. The computer revolution has implications for the future of statistics with a new
emphasis on "experimental" and "empirical" statistics. A large number of both general and special purpose statistical software are now available. Examples of available
software capable of complex statistical computation include programs such as Mathematica, SAS, SPSS, and R.

Business statistics
In business, "statistics" is a widely used management- and decision support tool. It is particularly applied in financial management, marketing management,
and production, services and operations management . [62] [63] Statistics is also heavily used in management accounting and auditing. The discipline of Management
Science formalizes the use of statistics, and other mathematics, in business. (Econometrics is the application of statistical methods to economic data in order to give empirical
content to economic relationships.)
A typical "Business Statistics" course is intended for business majors, and covers [64] descriptive statistics (collection, description, analysis, and summary of data), probability
(typically the binomial and normal distributions), test of hypotheses and confidence intervals, linear regression, and correlation; (follow-on) courses may
include forecasting, time series, decision trees, multiple linear regression, and other topics from business analytics more generally. See also Business mathematics
§ University level. Professional certification programs, such as the CFA, often include topics in statistics.

Statistics applied to mathematics or the arts


Traditionally, statistics was concerned with drawing inferences using a semi-standardized methodology that was "required learning" in most sciences.[citation needed] This tradition
has changed with the use of statistics in non-inferential contexts. What was once considered a dry subject, taken in many fields as a degree-requirement, is now viewed
enthusiastically.[according to whom?] Initially derided by some mathematical purists, it is now considered essential methodology in certain areas.

 In number theory, scatter plots of data generated by a distribution function may be transformed with familiar tools used in statistics to reveal underlying patterns, which may
then lead to hypotheses.
 Methods of statistics including predictive methods in forecasting are combined with chaos theory and fractal geometry to create video works that are considered to have
great beauty.[citation needed]
 The process art of Jackson Pollock relied on artistic experiments whereby underlying distributions in nature were artistically revealed.[citation needed] With the advent of
computers, statistical methods were applied to formalize such distribution-driven natural processes to make and analyze moving video art.[citation needed]
 Methods of statistics may be used predictively in performance art, as in a card trick based on a Markov process that only works some of the time, the occasion of which
can be predicted using statistical methodology.
 Statistics can be used to predictively create art, as in the statistical or stochastic music invented by Iannis Xenakis, where the music is performance-specific. Though this
type of artistry does not always come out as expected, it does behave in ways that are predictable and tunable using statistics.

Specialized disciplines
Main article: List of fields of application of statistics
Statistical techniques are used in a wide range of types of scientific and social research, including: biostatistics, computational biology, computational sociology, network
biology, social science, sociology and social research. Some fields of inquiry use applied statistics so extensively that they have specialized terminology. These disciplines
include:

 Actuarial science (assesses risk in the insurance and finance industries)
 Applied information economics
 Astrostatistics (statistical evaluation of astronomical data)
 Biostatistics
 Chemometrics (for analysis of data from chemistry)
 Data mining (applying statistics and pattern recognition to discover knowledge from data)
 Data science
 Demography (statistical study of populations)
 Econometrics (statistical analysis of economic data)
 Energy statistics
 Engineering statistics
 Epidemiology (statistical analysis of disease)
 Geography and geographic information systems, specifically in spatial analysis
 Image processing
 Jurimetrics (law)
 Medical statistics
 Political science
 Psychological statistics
 Reliability engineering
 Social statistics
 Statistical mechanics
In addition, there are particular types of statistical analysis that have also developed their own specialised terminology and methodology:

 Bootstrap / jackknife resampling
 Multivariate statistics
 Statistical classification
 Structured data analysis
 Structural equation modelling
 Survey methodology
 Survival analysis
 Statistics in various sports, particularly baseball – known as sabermetrics – and cricket
Statistics is also a key tool in business and manufacturing. It is used to understand measurement system variability, to control processes (as in statistical process
control or SPC), to summarize data, and to make data-driven decisions. In these roles, it is a key tool, and perhaps the only reliable one.

Descriptive statistics
A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection
of information,[1] while descriptive statistics (in the mass noun sense) is the process of using and analysing those statistics. Descriptive statistics is
distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about
the population that the sample of data is thought to represent.[2] This generally means that descriptive statistics, unlike inferential statistics, is not
developed on the basis of probability theory, and is frequently non-parametric.[3] Even when a data analysis draws its main conclusions using
inferential statistics, descriptive statistics are generally also presented.[4] For example, in papers reporting on human subjects, typically a table is included
giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical
characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.
Some measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of
central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and
maximum values of the variables, kurtosis and skewness.[5]


Use in statistical analysis


Descriptive statistics provide simple summaries about the sample and about the observations that have been made. Such summaries may be
either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may either form the basis of the initial
description of the data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a particular investigation.
For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the
number of shots made divided by the number of shots taken. A player who shoots 33% is making approximately one shot in every three.
The percentage summarizes or describes multiple discrete events. Consider also the grade point average. This single number describes the general
performance of a student across the range of their course experiences.[6]
The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of populations and of economic data was the
first way the topic of statistics appeared. More recently, a collection of summarisation techniques has been formulated under the heading of exploratory
data analysis: an example of such a technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For example, investors and brokers may use a historical
account of return behaviour by performing empirical and analytical analyses on their investments in order to make better investing decisions in the future.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and
dispersion (including the range and quartiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the
distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf display.
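A minimal univariate summary can be computed with Python's standard library; the sample values below are hypothetical:

```python
# Central tendency, dispersion, and quartiles for a single (hypothetical) variable.
import statistics

data = [2.3, 3.1, 2.9, 4.0, 3.6, 2.9, 3.3, 5.1]
print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))                    # most frequent value (2.9)
print("variance:", statistics.variance(data))
print("standard deviation:", statistics.stdev(data))
print("range:", max(data) - min(data))
print("quartiles:", statistics.quantiles(data, n=4))     # Q1, median, Q3
```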
Bivariate and multivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship between pairs of variables. In this
case, descriptive statistics include:

 Cross-tabulations and contingency tables
 Graphical representation via scatterplots
 Quantitative measures of dependence
 Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not only a simple descriptive analysis, but it also describes
the relationship between two different variables.[7] Quantitative measures of dependence include correlation (such as Pearson's r when both variables are
continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale variables are measured on). The slope, in regression
analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit change in the criterion variable for a one unit
change in the predictor. The standardised slope indicates this change in standardised (z-score) units. Highly skewed data are often transformed by taking
logarithms. Use of logarithms makes graphs more symmetrical and look more similar to the normal distribution, making them easier to interpret
intuitively.[8]:47
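These bivariate measures can be sketched with SciPy; the paired values below are hypothetical:

```python
# Pearson's r, Spearman's rho, and the unstandardised regression slope
# for a pair of (hypothetical) variables.
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.6, 4.8, 5.1, 6.3]

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("Pearson r:", pearson_r)
print("Spearman rho:", spearman_rho)
print("slope (unit change in y per unit change in x):", slope)
```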

Data collection
Example of data collection in the biological sciences: Adélie penguins are identified and weighed each time they cross the automated weighbridge on their way to or from the sea.[1]

Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer
relevant questions and evaluate outcomes. Data collection is a research component in all study fields, including physical and social
sciences, humanities,[2] and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The
goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the
questions that have been posed. Data collection and validation consists of four steps when it involves taking a census and seven steps when it involves
sampling.[3]

Contents

 1Importance
 2Data integrity issues[6]
o 2.1Quality assurance
o 2.2Quality control
 3Data collection on z/OS
 4DMPs and data collection
 5References
 6External links

Importance
Regardless of the field of study or preference for defining data (quantitative or qualitative), accurate data collection is essential to maintain research
integrity. The selection of appropriate data collection instruments (existing, modified, or newly developed) and delineated instructions for their correct use
reduce the likelihood of errors.
A formal data collection process is necessary as it ensures that the data gathered are both defined and accurate. This way, subsequent decisions based
on arguments embodied in the findings are made using valid data.[4] The process provides both a baseline from which to measure and in certain cases an
indication of what to improve.
There are 5 common data collection methods:

1. closed-ended surveys and quizzes,
2. open-ended surveys and questionnaires,
3. 1-on-1 interviews,
4. focus groups, and
5. direct observation.[5]

Data integrity issues[6]


The main reason for maintaining data integrity is to support the detection of errors in the data collection process. Those errors may be made
intentionally (deliberate falsification) or non-intentionally (random or systematic errors).
There are two approaches, introduced by Craddick, Crawford, Rhodes, Redican, Rukenbrod and Laws in 2003, that may protect data integrity and secure the scientific validity
of study results:

 Quality assurance – all actions carried out before data collection
 Quality control – all actions carried out during and after data collection
Quality assurance
Further information: quality assurance
Its main focus is prevention, which is primarily a cost-effective activity to protect the integrity of data collection. Standardization of protocol best
demonstrates this cost-effective activity when it is developed in a comprehensive and detailed procedures manual for data collection. Poorly written guidelines raise
the risk of failing to identify problems and errors in the research process. Listed below are several examples of such failures:

 Uncertainty of timing, methods and identification of the responsible person
 Partial listing of items needed to be collected
 Vague description of data collection instruments instead of rigorous step-by-step instructions on administering tests
 Failure to recognize exact content and strategies for training and retraining staff members responsible for data collection
 Unclear instructions for using, making adjustments to, and calibrating data collection equipment
 No predetermined mechanism to document changes in procedures that occur during the investigation
Quality control[edit]
Further information: quality control
Since quality control actions occur during or after data collection, all the details should be carefully documented. A clearly defined communication structure is a precondition for establishing monitoring systems: a poorly organized communication structure creates uncertainty about the flow of information, leads to lax monitoring, and can limit the opportunities for detecting errors. Quality control is also responsible for identifying the actions needed to correct faulty data collection practices and to minimize such future occurrences. A team is less likely to realize the necessity of these actions if its procedures are written vaguely and are not based on feedback or education.
Data collection problems that necessitate prompt action:

 Systematic errors
 Violation of protocol
 Fraud or scientific misconduct
 Errors in individual data items
 Individual staff or site performance problems

Data collection on z/OS[edit]


z/OS is a widely used operating system for IBM mainframes. It is designed to offer a stable, secure, and continuously available environment for applications running on the mainframe. Operational data is data that a z/OS system produces when it runs. This data indicates the health of the system and can be used to identify sources of performance and availability issues in the system. The analysis of operational data by analytics platforms provides
insights and recommended actions to make the system work more efficiently, and to help resolve or prevent problems. IBM Z Common Data Provider
collects IT operational data from z/OS systems, transforms it to a consumable format, and streams it to analytics platforms.[7]
IBM Z Common Data Provider supports the collection of the following operational data:[8]

 System Management Facilities (SMF) data
 Log data from the following sources:
o Job log, the output which is written to a data definition (DD) by a running job
o z/OS UNIX log file, including the UNIX System Services system log (syslogd)
o Entry-sequenced Virtual Storage Access Method (VSAM) cluster
o z/OS system log (SYSLOG)
o IBM Tivoli NetView for z/OS messages
o IBM WebSphere Application Server for z/OS High Performance Extensible Logging (HPEL) log
o IBM Resource Measurement Facility (RMF) Monitor III reports
 User application data, the operational data from users' own applications
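The exact record layouts are product specific, but a minimal, hypothetical consumer sketch can illustrate the general pattern of receiving streamed operational records and counting them by source. The newline-delimited JSON layout and the field name sourceType are assumptions made for illustration, not the actual IBM Z Common Data Provider format.

```python
import json
import sys

def consume_operational_records(stream):
    """Read newline-delimited JSON records (assumed layout) and count them by source type."""
    counts = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # 'sourceType' is a hypothetical field; real feeds define their own schema.
        source = record.get("sourceType", "unknown")
        counts[source] = counts.get(source, 0) + 1
    return counts

if __name__ == "__main__":
    # Example: pipe a stream of records into this script.
    print(consume_operational_records(sys.stdin))
```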

DMPs and data collection[edit]


DMP is the abbreviation for data management platform, a centralized storage and analytical system for data. Mainly used by marketers, DMPs exist to compile and transform large amounts of data into discernible information.[9] Marketers may want to receive and utilize first-, second- and third-party data. DMPs enable this because they aggregate data from DSPs (demand-side platforms) and SSPs (supply-side platforms). When it comes to advertising, DMPs are integral for optimizing and guiding marketers in future campaigns. These systems and their effectiveness are evidence that categorized, analyzed, and compiled data is far more useful than raw data.

See also

 Scientific data archiving
 Data curation
 Data management
 Data collection system
 Experiment
 Observational study
 Sampling (statistics)
 Statistical survey
 Survey data collection
 Qualitative method
 Quantitative method
 Quantitative methods in criminology
Statistical inference
Not to be confused with Statistical interference.


Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability.[1] Inferential statistical analysis
infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a
larger population.
Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it
does not rest on the assumption that the data come from a larger population. In machine learning, the term inference is sometimes used instead to mean
"make a prediction, by evaluating an already trained model";[2] in this context inferring properties of the model is referred to as training or learning (rather
than inference), and using a model for prediction is referred to as inference (instead of prediction); see also predictive inference.

Introduction[edit]
Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about
a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the
data and (second) deducing propositions from the model.[citation needed]
Konishi & Kitagawa state, "The majority of the problems in statistical inference can be considered to be problems related to statistical
modeling".[3] Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical
part of an analysis".[4]
The conclusion of a statistical inference is a statistical proposition.[5] Some common forms of statistical proposition are the following:
 a point estimate, i.e. a particular value that best approximates some parameter of interest;
 an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a dataset drawn from a population so that, under
repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the stated confidence level;
 a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
 rejection of a hypothesis;[note 1]
 clustering or classification of data points into groups.
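As a minimal numerical illustration of the first two forms above (a point estimate and an interval estimate), the following sketch assumes a sample modelled as normally distributed with unknown mean and computes the sample mean together with a 95% confidence interval based on the t distribution. The simulated data are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)  # simulated data; in practice this is the observed dataset

# Point estimate of the population mean
mean_hat = sample.mean()

# 95% confidence interval for the mean, based on the t distribution
sem = sample.std(ddof=1) / np.sqrt(len(sample))
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean_hat, scale=sem)
print(mean_hat, (ci_low, ci_high))
```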

Models and assumptions[edit]


Main articles: Statistical model and Statistical assumptions
Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and
similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw
inference.[6] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[7]
Degree of models/assumptions[edit]
Statisticians distinguish between three levels of modeling assumptions:

 Fully parametric: The probability distributions describing the data-generation process are assumed to be fully described by a family of probability
distributions involving only a finite number of unknown parameters.[6] For example, one may assume that the distribution of population values is truly
Normal, with unknown mean and variance, and that datasets are generated by 'simple' random sampling. The family of generalized linear models is a
widely used and flexible class of parametric models.
 Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal.[8] For
example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen
estimator, which has good properties when the data arise from simple random sampling.
 Semi-parametric: This term typically implies assumptions 'in between' fully and non-parametric approaches. For example, one may assume that a
population distribution has a finite mean. Furthermore, one may assume that the mean response level in the population depends in a truly linear
manner on some covariate (a parametric assumption) but not make any parametric assumption describing the variance around that mean (i.e. about
the presence or possible form of any heteroscedasticity). More generally, semi-parametric models can often be separated into 'structural' and 'random
variation' components. One component is treated parametrically and the other non-parametrically. The well-known Cox model is a set of semi-
parametric assumptions.
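As a small, hedged illustration of the parametric/non-parametric contrast above, the sketch below estimates a location parameter for the same sample in two ways: the sample mean (natural under a fully parametric Normal assumption) and the sample median (a non-parametric estimate that assumes only that the distribution has a median). The heavy-tailed simulated data are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed data: the Normal assumption is wrong here, which is the point of the comparison.
sample = rng.standard_t(df=2, size=200)

parametric_estimate = sample.mean()          # natural estimate if the data really were Normal
nonparametric_estimate = np.median(sample)   # assumes only that a median exists
print(parametric_estimate, nonparametric_estimate)
```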
Importance of valid models/assumptions[edit]
See also: Statistical model validation
Whatever level of assumption is made, correctly calibrated inference in general requires these assumptions to be correct; i.e. that the data-generating
mechanisms really have been correctly specified.
Incorrect assumptions of 'simple' random sampling can invalidate statistical inference.[9] More complex semi- and fully parametric assumptions are also
cause for concern. For example, incorrectly assuming the Cox model can in some cases lead to faulty conclusions.[10] Incorrect assumptions of Normality
in the population also invalidates some forms of regression-based inference.[11] The use of any parametric model is viewed skeptically by most experts in
sampling human populations: "most sampling statisticians, when they deal with confidence intervals at all, limit themselves to statements about
[estimators] based on very large samples, where the central limit theorem ensures that these [estimators] will have distributions that are nearly
normal."[12] In particular, a normal distribution "would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any
kind of economic population."[12] Here, the central limit theorem states that the distribution of the sample mean "for very large samples" is approximately
normally distributed, if the distribution is not heavy tailed.
Approximate distributions[edit]
Main articles: Statistical distance, Asymptotic theory (statistics), and Approximation theory
Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these.
With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: For example, with
10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population
distributions, by the Berry–Esseen theorem.[13] Yet for many practical purposes, the normal approximation provides a good approximation to the sample-
mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience.[13] Following
Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this
approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler
divergence, Bregman divergence, and the Hellinger distance.[14][15][16]
With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution, if one exists. Limiting
results are not statements about finite samples, and indeed are irrelevant to finite samples.[17][18][19] However, the asymptotic theory of limiting distributions is
often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use
of generalized estimating equations, which are popular in econometrics and biostatistics. The magnitude of the difference between the limiting distribution
and the true distribution (formally, the 'error' of the approximation) can be assessed using simulation.[20] The heuristic application of limiting results to finite
samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-
parameter exponential families).
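The following sketch illustrates the kind of simulation check mentioned above: it draws repeated samples of size 10 from a skewed (exponential) population and compares a tail probability of the sample mean under simulation with the normal approximation suggested by the central limit theorem. The sample size and population are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 10, 100_000

# Sample means of n exponential(1) observations, replicated many times
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Normal approximation implied by the CLT: mean 1, standard deviation 1/sqrt(n)
approx = stats.norm(loc=1.0, scale=1.0 / np.sqrt(n))

# Compare a tail probability under the simulation and under the approximation
print((means > 1.5).mean(), 1 - approx.cdf(1.5))
```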
Randomization-based models[edit]
Main article: Randomization
See also: Random sample and Random assignment
For a given dataset that was produced by a randomization design, the randomization distribution of a statistic (under the null-hypothesis) is defined by
evaluating the test statistic for all of the plans that could have been generated by the randomization design. In frequentist inference, randomization allows
inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of
experiments.[21][22] Statistical inference from randomized studies is also more straightforward than many other situations.[23][24][25] In Bayesian inference,
randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the
population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.[26]
Objective randomization allows properly inductive procedures.[27][28][29][30][31] Many statisticians prefer randomization-based analysis of data that was
generated by well-defined randomization procedures.[32] (However, it is true that in fields of science with developed theoretical knowledge and
experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences.[33][34]) Similarly,
results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do
observational studies of the same phenomena.[35] However, a good observational study may be better than a bad randomized experiment.
The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a
subjective model.[36][37]
However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or
random samples. In some cases, such randomized studies are uneconomical or unethical.
Model-based analysis of randomized experiments[edit]
It is standard practice to refer to a statistical model, e.g., a linear or logistic model, when analyzing data from randomized experiments.[38] However, the
randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization
scheme.[22] Seriously misleading results can be obtained analyzing data from randomized experiments while ignoring the experimental protocol; common
mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent
replicates of the treatment applied to different experimental units.[39]
Model-free randomization inference[edit]
Model-free techniques provide a complement to model-based methods, which employ reductionist strategies of reality-simplification. The former
combine, evolve, ensemble and train algorithms dynamically adapting to the contextual affinities of a process and learning the intrinsic characteristics of
the observations.[38][40]
For example, model-free simple linear regression is based either on

 a random design, where the pairs of observations (X_1, Y_1), ..., (X_n, Y_n) are independent and identically distributed (iid), or
 a deterministic design, where the variables X_1, ..., X_n are deterministic, but the corresponding response variables Y_1, ..., Y_n are random and independent with a common conditional distribution, i.e. P(Y_j ≤ y | X_j = x) = D_x(y), which is independent of the index j.

In either case, the model-free randomization inference for features of the common conditional distribution D_x(.) relies on some regularity conditions, e.g. functional smoothness. For instance, model-free randomization inference for the population feature conditional mean, μ(x) = E(Y | X = x), can be consistently estimated via local averaging or local polynomial fitting, under the assumption that μ(x) is smooth. Also, relying on asymptotic normality or resampling, we can construct confidence intervals for the population feature, in this case the conditional mean, μ(x).[41]
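A minimal sketch of local averaging for the conditional mean μ(x): a Nadaraya–Watson estimator with a Gaussian kernel, which is one simple way to implement the smoothing idea described above. The bandwidth and the simulated data are arbitrary illustrative choices.

```python
import numpy as np

def local_average(x_grid, x, y, bandwidth=0.3):
    """Nadaraya-Watson kernel estimate of E(Y | X = x) at each point of x_grid."""
    estimates = []
    for x0 in x_grid:
        weights = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # Gaussian kernel weights
        estimates.append(np.sum(weights * y) / np.sum(weights))
    return np.array(estimates)

rng = np.random.default_rng(4)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # smooth mean function plus noise

grid = np.linspace(0, 2 * np.pi, 5)
print(np.round(local_average(grid, x, y), 2))  # should roughly track sin(grid)
```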

Paradigms for inference[edit]


Different schools of statistical inference have become established. These schools—or "paradigms"—are not mutually exclusive, and methods that work
well under one paradigm often have attractive interpretations under other paradigms.
Bandyopadhyay & Forster[42] describe four paradigms: "(i) classical statistics or error statistics, (ii) Bayesian statistics, (iii) likelihood-based statistics, and
(iv) the Akaikean-Information Criterion-based statistics". The classical (or frequentist) paradigm, the Bayesian paradigm, the likelihoodist paradigm, and
the AIC-based paradigm are summarized below.
Frequentist inference[edit]
Main article: Frequentist inference
This paradigm calibrates the plausibility of propositions by considering (notional) repeated sampling of a population distribution to produce datasets
similar to the one at hand. By considering the dataset's characteristics under repeated sampling, the frequentist properties of a statistical proposition can
be quantified—although in practice this quantification may be challenging.
Examples of frequentist inference[edit]

 p-value
 Confidence interval
 Null hypothesis significance testing
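As a small numerical illustration of these, the sketch below assumes a one-sample t-test of the null hypothesis that the population mean is zero and reports the test statistic and its p-value; the simulated data with a small true effect are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(loc=0.4, scale=1.0, size=30)  # simulated data with a small true effect

# One-sample t-test of H0: population mean equals 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(t_stat, p_value)
```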
Frequentist inference, objectivity, and decision theory[edit]
One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated
sampling from a population. However, the approach of Neyman[43] develops these procedures in terms of pre-experiment probabilities. That is, before
undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way:
such a probability need not have a frequentist or repeated sampling interpretation. In contrast, Bayesian inference works in terms of conditional
probabilities (i.e. probabilities conditional on the observed data), compared to the marginal (but conditioned on unknown parameters) probabilities used in
the frequentist approach.
The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some
elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions.[citation needed] In particular, frequentist developments of
optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions, which play the role of
(negative) utility functions. Loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality
property.[44] However, loss-functions are often useful for stating optimality properties: for example, median-unbiased estimators are optimal under absolute
value loss functions, in that they minimize expected loss, and least squares estimators are optimal under squared error loss functions, in that they
minimize expected loss.
While statisticians using frequentist inference must choose for themselves the parameters of interest, and the estimators/test statistic to be used, the
absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as 'objective'.[45]
Bayesian inference[edit]
See also: Bayesian inference
The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate to one, and obey probability axioms.
Bayesian inference uses the available posterior beliefs as the basis for making statistical propositions.[46] There are several different justifications for using the Bayesian approach.
Examples of Bayesian inference[edit]

 Credible interval for interval estimation
 Bayes factors for model comparison
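As a minimal sketch of the first item, consider a conjugate Beta-Binomial update: starting from a uniform Beta(1, 1) prior on a success probability and observing, for illustration, 7 successes in 10 trials, the posterior is Beta(8, 4) and a 95% credible interval can be read off its quantiles.

```python
from scipy import stats

# Prior Beta(1, 1); observed data (assumed for illustration): 7 successes, 3 failures
prior_a, prior_b = 1, 1
successes, failures = 7, 3

posterior = stats.beta(prior_a + successes, prior_b + failures)

# Posterior mean and a central 95% credible interval
print(posterior.mean(), posterior.interval(0.95))
```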
Bayesian inference, subjectivity and decision theory[edit]
Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and
mode, highest posterior density intervals, and Bayes Factors can all be motivated in this way. While a user's utility function need not be stated for this
sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods
of prior construction which do not require external input have been proposed but not yet fully developed.)
Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function; the 'Bayes rule' is the one which maximizes
expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision
theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference
need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use
proper priors (i.e. those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that
inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization
of posterior beliefs.
Likelihood-based inference[edit]
Main article: Likelihoodism

This section needs expansion. You


can help by adding to it. (March
2019)

Likelihoodism approaches statistics by using the likelihood function. Some likelihoodists reject inference, considering statistics as only computing support
from evidence. Others, however, propose inference based on the likelihood function, of which the best-known is maximum likelihood estimation.
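A minimal sketch of maximum likelihood estimation, assuming the data are modelled as exponential with unknown rate λ: the log-likelihood is maximized either numerically or in closed form (λ̂ = 1/x̄), and the two answers should agree. The simulated data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
data = rng.exponential(scale=2.0, size=500)  # true rate is 0.5

def negative_log_likelihood(rate):
    # Exponential log-likelihood: n*log(rate) - rate*sum(x)
    return -(len(data) * np.log(rate) - rate * data.sum())

numeric = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
closed_form = 1.0 / data.mean()
print(numeric.x, closed_form)  # both are the MLE of the rate
```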
AIC-based inference[edit]
Main article: Akaike information criterion

This section needs expansion. You


can help by adding to it. (November
2017)

The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for
the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that
generated the data. (In doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
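A minimal sketch of AIC-based model comparison, assuming Gaussian errors so that the maximized log-likelihood can be derived from the residual sum of squares of a least-squares fit (AIC = 2k − 2 ln L̂). The candidate models here, a line versus a quadratic fit with numpy.polyfit, and the simulated data are illustrative choices.

```python
import numpy as np

def gaussian_aic(y, y_hat, k):
    """AIC = 2k - 2 ln(L) for a least-squares fit with Gaussian errors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_likelihood = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * log_likelihood

rng = np.random.default_rng(7)
x = np.linspace(0, 3, 60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)  # data generated by a line

for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    # k counts the polynomial coefficients plus the error variance
    print(degree, gaussian_aic(y, y_hat, k=degree + 2))
```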
Other paradigms for inference[edit]
Minimum description length[edit]
Main article: Minimum description length
The minimum description length (MDL) principle has been developed from ideas in information theory[47] and the theory of Kolmogorov complexity.[48] The
(MDL) principle selects statistical models that maximally compress the data; inference proceeds without assuming counterfactual or non-falsifiable "data-
generating mechanisms" or probability models for the data, as might be done in frequentist or Bayesian approaches.
However, if a "data generating mechanism" does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of
the data, on average and asymptotically.[49] In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood
estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors). However, MDL avoids assuming that the underlying
probability model is known; the MDL principle can also be applied without assumptions that e.g. the data arose from independent sampling.[49][50]
The MDL principle has been applied in communication-coding theory in information theory, in linear regression,[50] and in data mining.[48]
The evaluation of MDL-based inferential procedures often uses techniques or criteria from computational complexity theory.[51]
Fiducial inference[edit]
Main article: Fiducial inference
Fiducial inference was an approach to statistical inference based on fiducial probability, also known as a "fiducial distribution". In subsequent work, this
approach has been called ill-defined, extremely limited in applicability, and even fallacious.[52][53] However this argument is the same as that which
shows[54] that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence
intervals, it does not necessarily invalidate conclusions drawn from fiducial arguments. An attempt was made to reinterpret the early work of
Fisher's fiducial argument as a special case of an inference theory using upper and lower probabilities.[55]
Structural inference[edit]
Developing ideas of Fisher and of Pitman from 1938 to 1939,[56] George A. Barnard developed "structural inference" or "pivotal inference",[57] an approach
using invariant probabilities on group families. Barnard reformulated the arguments behind fiducial inference on a restricted class of models on which
"fiducial" procedures would be well-defined and useful.

Inference topics[edit]
The topics below are usually included in the area of statistical inference.

1. Statistical assumptions
2. Statistical decision theory
3. Estimation theory
4. Statistical hypothesis testing
5. Revising opinions in statistics
6. Design of experiments, the analysis of variance, and regression
7. Survey sampling
8. Summarizing statistical data

History[edit]
Al-Kindi, an Arab mathematician in the 9th century, made the earliest known use of statistical inference in his Manuscript on Deciphering Cryptographic
Messages, a work on cryptanalysis and frequency analysis.[58]

See also[edit]
 Algorithmic inference
 Induction (philosophy)
 Informal inferential reasoning
 Population proportion
 Philosophy of statistics
 Predictive inference
 Information field theory
Correlation and dependence
This article is about correlation and dependence in statistical data. For other uses, see correlation (disambiguation).

Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope
of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because
the variance of Y is zero.

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the
broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related.
Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the
price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce
less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because
extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer
the presence of a causal relationship (i.e., correlation does not imply causation).
Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal parlance, correlation is
synonymous with dependence. However, when used in a technical sense, correlation refers to any of several specific types of mathematical operations
between the tested variables and their respective expected values. Essentially, correlation is the measure of how two or more variables are related to one another. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a nonlinear function of the other). Other correlation coefficients – such as Spearman's rank correlation – have been developed to be more robust than Pearson's, that is, more sensitive to nonlinear relationships.[1][2][3] Mutual information can also be applied to measure dependence between two variables.


Pearson's product-moment coefficient[edit]


Main article: Pearson product-moment correlation coefficient

Example scatterplots of various datasets with various correlation coefficients.


Definition[edit]
The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient (PPMCC), or "Pearson's
correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by taking the ratio of the covariance of the two variables to the square root of the product of their variances; equivalently, one divides the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.[4]
A Pearson product-moment correlation coefficient attempts to establish a line of best fit through a dataset of two variables by essentially laying out the
expected values and the resulting Pearson's correlation coefficient indicates how far away the actual dataset is from the expected values. Depending on
the sign of our Pearson's correlation coefficient, we can end up with either a negative or positive correlation if there is any sort of relationship between the
variables of our data set.

The population correlation coefficient ρ_X,Y between two random variables X and Y with expected values μ_X and μ_Y and standard deviations σ_X and σ_Y is defined as

ρ_X,Y = corr(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

where E is the expected value operator, cov means covariance, and corr is a widely used alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and positive. An alternative formula purely in terms of moments is

ρ_X,Y = (E[XY] − E[X] E[Y]) / (sqrt(E[X²] − E[X]²) · sqrt(E[Y²] − E[Y]²))
Symmetry property[edit]

The correlation coefficient is symmetric: corr(X, Y) = corr(Y, X). This is verified by the commutative property of multiplication.
Correlation and independence[edit]
It is a corollary of the Cauchy–Schwarz inequality that the absolute value of the Pearson correlation coefficient is not bigger than 1. Therefore, the value
of a correlation coefficient ranges between -1 and +1. The correlation coefficient is +1 in the case of a perfect direct (increasing) linear relationship

(correlation), −1 in the case of a perfect inverse (decreasing) linear relationship (anti-correlation),[5] and some value in the open interval (−1, 1) in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true because the correlation coefficient detects only linear
dependencies between two variables.
For example, suppose the random variable X is symmetrically distributed about zero, and Y = X². Then Y is completely determined by X, so that X and Y are perfectly dependent, but their correlation is zero; they are uncorrelated. However, in the special case when X and Y are jointly normal, uncorrelatedness is equivalent to independence.


Even though uncorrelated data does not necessarily imply independence, one can check if random variables are independent if their mutual
information is 0.
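A minimal numerical sketch of the example above: with X symmetric about zero and Y = X², the sample correlation is close to zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=100_000)   # symmetric about zero
y = x ** 2                     # completely determined by x

# Sample Pearson correlation is near 0 despite perfect dependence
print(np.corrcoef(x, y)[0, 1])
```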
Sample correlation coefficient[edit]
Given a series of n measurements of the pair (X_i, Y_i) indexed by i = 1, ..., n, the sample correlation coefficient can be used to estimate the population Pearson correlation ρ_X,Y between X and Y. The sample correlation coefficient is defined as

r_xy = Σᵢ (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y)

where x̄ and ȳ are the sample means of X and Y, and s_x and s_y are the corrected sample standard deviations of X and Y.
Equivalent expressions for r_xy are

r_xy = (Σᵢ x_i y_i − n x̄ ȳ) / (n σ′_x σ′_y) = (n Σᵢ x_i y_i − Σᵢ x_i Σᵢ y_i) / (sqrt(n Σᵢ x_i² − (Σᵢ x_i)²) · sqrt(n Σᵢ y_i² − (Σᵢ y_i)²))

where σ′_x and σ′_y are the uncorrected sample standard deviations of X and Y.
If x and y are results of measurements that contain measurement error, the realistic limits on the correlation coefficient are not −1 to +1 but a smaller range.[6] For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r_xy, Pearson's product-moment coefficient.
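A minimal sketch computing the sample correlation coefficient, both directly from the definition above and with numpy's built-in estimator; the two results should agree up to floating-point error. The simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.6, size=200)  # linearly related with noise

# From the definition: covariance divided by the product of standard deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)

r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```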

Example[edit]
Consider a joint probability distribution of X and Y given by a table of probabilities P(X = x, Y = y). From the joint distribution, the marginal distributions of X and Y are obtained by summing over the other variable; these yield the expectations and variances E[X], E[Y], σ_X and σ_Y, and therefore the correlation coefficient ρ_X,Y = cov(X, Y) / (σ_X σ_Y).
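Because the original article's table is not reproduced here, the sketch below uses an assumed joint distribution (a hypothetical two-value X and three-value Y) purely to illustrate how the marginals, moments, and correlation are computed from a joint probability table.

```python
import numpy as np

# Assumed joint probability table P(X = x, Y = y); rows index x in {0, 1}, columns index y in {-1, 0, 1}
p_xy = np.array([[0.25, 0.00, 0.25],
                 [0.00, 0.50, 0.00]])
x_vals = np.array([0, 1])
y_vals = np.array([-1, 0, 1])

p_x = p_xy.sum(axis=1)  # marginal distribution of X
p_y = p_xy.sum(axis=0)  # marginal distribution of Y

e_x, e_y = p_x @ x_vals, p_y @ y_vals
var_x = p_x @ (x_vals - e_x) ** 2
var_y = p_y @ (y_vals - e_y) ** 2
e_xy = sum(p_xy[i, j] * x_vals[i] * y_vals[j]
           for i in range(len(x_vals)) for j in range(len(y_vals)))

cov = e_xy - e_x * e_y
rho = cov / np.sqrt(var_x * var_y)
print(p_x, p_y, rho)  # here rho is 0: X and Y are uncorrelated but clearly dependent
```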

Rank correlation coefficients[edit]


Main articles: Spearman's rank correlation coefficient and Kendall tau rank correlation coefficient
Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient
(τ) measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that
increase to be represented by a linear relationship. If, as the one variable increases, the other decreases, the rank
correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to
Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-
normality in distributions. However, this view has little mathematical basis, as rank correlation coefficients measure a
different type of relationship than the Pearson product-moment correlation coefficient, and are best seen as measures of
a different type of association, rather than as an alternative measure of the population correlation coefficient.[7][8]
To illustrate the nature of rank correlation, and its difference from linear correlation, consider the following four pairs of
numbers (x, y):
(0, 1), (10, 100), (101, 500), (102, 2000).
As we go from each pair to the next pair, x increases, and so does y. This relationship is perfect, in the sense
that an increase in x is always accompanied by an increase in y. This means that we have a perfect rank
correlation, and both Spearman's and Kendall's correlation coefficients are 1, whereas in this example the Pearson
product-moment correlation coefficient is 0.7544, indicating that the points are far from lying on a straight line. In the
same way if y always decreases when x increases, the rank correlation coefficients will be −1, while the
Pearson product-moment correlation coefficient may or may not be close to −1, depending on how close the points
are to a straight line. Although in the extreme cases of perfect rank correlation the two coefficients are both equal
(being both +1 or both −1), this is not generally the case, and so values of the two coefficients cannot meaningfully
be compared.[7] For example, for the three pairs (1, 1) (2, 3) (3, 2) Spearman's coefficient is 1/2, while Kendall's
coefficient is 1/3.
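A minimal check of these figures with scipy.stats: for the four pairs, Spearman's and Kendall's coefficients are 1 while Pearson's is about 0.7544, and for the three pairs (1, 1), (2, 3), (3, 2), Spearman gives 0.5 and Kendall gives about 0.33.

```python
from scipy import stats

x4, y4 = [0, 10, 101, 102], [1, 100, 500, 2000]
print(stats.pearsonr(x4, y4)[0])    # ~0.7544
print(stats.spearmanr(x4, y4)[0])   # 1.0
print(stats.kendalltau(x4, y4)[0])  # 1.0

x3, y3 = [1, 2, 3], [1, 3, 2]
print(stats.spearmanr(x3, y3)[0])   # 0.5
print(stats.kendalltau(x3, y3)[0])  # ~0.333
```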

Other measures of dependence among random variables[edit]


See also: Pearson product-moment correlation coefficient § Variants
The information given by a correlation coefficient is not enough to define the dependence structure between random
variables.[9] The correlation coefficient completely defines the dependence structure only in very particular cases, for
example when the distribution is a multivariate normal distribution. (See diagram above.) In the case of elliptical
distributions it characterizes the (hyper-)ellipses of equal density; however, it does not completely characterize the
dependence structure (for example, a multivariate t-distribution's degrees of freedom determine the level of tail
dependence).
Distance correlation[10][11] was introduced to address the deficiency of Pearson's correlation that it can be zero for
dependent random variables; zero distance correlation implies independence.
The Randomized Dependence Coefficient[12] is a computationally efficient, copula-based measure of dependence
between multivariate random variables. RDC is invariant with respect to non-linear scalings of random variables, is
capable of discovering a wide range of functional association patterns and takes value zero at independence.
For two binary variables, the odds ratio measures their dependence, and takes range non-negative numbers, possibly infinity: [0, +∞]. Related statistics such as Yule's Y and Yule's Q normalize this to the correlation-like range [−1, 1]. The odds ratio is generalized by the logistic model to model cases where the dependent variables are discrete and there may be one or more independent variables.
The correlation ratio, entropy-based mutual information, total correlation, dual total correlation and polychoric
correlation are all also capable of detecting more general dependencies, as is consideration of the copula between
them, while the coefficient of determination generalizes the correlation coefficient to multiple regression.

Sensitivity to the data distribution[edit]


Further information: Pearson product-moment correlation coefficient § Sensitivity to the data distribution

The degree of dependence between variables X and Y does not depend on the scale on which the variables are expressed. That is, if we are analyzing the relationship between X and Y, most correlation measures are unaffected by transforming X to a + bX and Y to c + dY, where a, b, c, and d are constants (b and d being positive). This is true of some correlation statistics as well as their population analogues. Some correlation statistics, such as the rank correlation coefficient, are also invariant to monotone transformations of the marginal distributions of X and/or Y.

Pearson/Spearman correlation coefficients between X and Y are shown when the two variables' ranges are unrestricted, and when the range of X is restricted to the interval (0, 1).

Most correlation measures are sensitive to the manner in which X and Y are sampled. Dependencies tend to
be stronger if viewed over a wider range of values. Thus, if we consider the correlation coefficient between the
heights of fathers and their sons over all adult males, and compare it to the same correlation coefficient calculated
when the fathers are selected to be between 165 cm and 170 cm in height, the correlation will be weaker in the latter
case. Several techniques have been developed that attempt to correct for range restriction in one or both variables,
and are commonly used in meta-analysis; the most common are Thorndike's case II and case III equations.[13]
Various correlation measures in use may be undefined for certain joint distributions of X and Y. For example, the
Pearson correlation coefficient is defined in terms of moments, and hence will be undefined if the moments are
undefined. Measures of dependence based on quantiles are always defined. Sample-based statistics intended to
estimate population measures of dependence may or may not have desirable statistical properties such as
being unbiased, or asymptotically consistent, based on the spatial structure of the population from which the data
were sampled.
Sensitivity to the data distribution can be used to an advantage. For example, scaled correlation is designed to use
the sensitivity to the range in order to pick out correlations between fast components of time series.[14] By reducing the
range of values in a controlled manner, the correlations on long time scale are filtered out and only the correlations
on short time scales are revealed.

Correlation matrices[edit]
The correlation matrix of n random variables X_1, ..., X_n is the n × n matrix whose (i, j) entry is corr(X_i, X_j). Thus the diagonal entries are all identically unity. If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables X_i / σ(X_i) for i = 1, ..., n. This applies both to the matrix of population correlations (in which case σ is the population standard deviation), and to the matrix of sample correlations (in which case σ denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix. Moreover, the correlation matrix is strictly positive definite if no variable can have all its values exactly generated as a linear function of the values of the others.
The correlation matrix is symmetric because the correlation between X_i and X_j is the same as the correlation between X_j and X_i.
A correlation matrix appears, for example, in one formula for the coefficient of multiple determination, a measure of
goodness of fit in multiple regression.
In statistical modelling, correlation matrices representing the relationships between variables are categorized into
different correlation structures, which are distinguished by factors such as the number of parameters required to
estimate them. For example, in an exchangeable correlation matrix, all pairs of variables are modeled as having the
same correlation, so all non-diagonal elements of the matrix are equal to each other. On the other hand,
an autoregressive matrix is often used when variables represent a time series, since correlations are likely to be
greater when measurements are closer in time. Other examples include independent, unstructured, M-dependent,
and Toeplitz.
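A minimal sketch constructing two of the structures mentioned above for five variables: an exchangeable correlation matrix (a single common correlation) and a first-order autoregressive, AR(1), matrix in which correlation decays with the lag between measurements. The dimension and parameter value are illustrative.

```python
import numpy as np

n = 5
rho = 0.4

# Exchangeable: every pair of distinct variables has the same correlation rho
exchangeable = np.full((n, n), rho)
np.fill_diagonal(exchangeable, 1.0)

# AR(1): correlation between variables i and j is rho**|i - j|
idx = np.arange(n)
ar1 = rho ** np.abs(idx[:, None] - idx[None, :])

print(exchangeable)
print(ar1)
print(np.all(np.linalg.eigvalsh(ar1) > 0))  # positive definite for |rho| < 1
```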
Nearest valid correlation matrix[edit]
In some applications (e.g., building data models from only partially observed data) one wants to find the "nearest"
correlation matrix to an "approximate" correlation matrix (e.g., a matrix which typically lacks semi-definite
positiveness due to the way it has been computed).
In 2002, Higham[15] formalized the notion of nearness using the Frobenius norm and provided a method for computing
the nearest correlation matrix using the Dykstra's projection algorithm, of which an implementation is available as an
online Web API.[16]
This sparked interest in the subject, with new theoretical (e.g., computing the nearest correlation matrix with factor
structure[17]) and numerical (e.g. use of Newton's method for computing the nearest correlation matrix[18]) results
obtained in the subsequent years.

Uncorrelatedness and independence of stochastic processes[edit]


As with pairs of random variables, for two stochastic processes {X_t} and {Y_t}: if they are independent, then they are uncorrelated.[19]:p. 151

Common misconceptions[edit]
Correlation and causality[edit]
Main article: Correlation does not imply causation
See also: Normally distributed and uncorrelated does not imply independent
The conventional dictum that "correlation does not imply causation" means that correlation cannot be used by itself
to infer a causal relationship between the variables.[20] This dictum should not be taken to mean that correlations
cannot indicate the potential existence of causal relations. However, the causes underlying the correlation, if any,
may be indirect and unknown, and high correlations also overlap with identity relations (tautologies), where no causal
process exists. Consequently, a correlation between two variables is not a sufficient condition to establish a causal
relationship (in either direction).
A correlation between age and height in children is fairly causally transparent, but a correlation between mood and
health in people is less so. Does improved mood lead to improved health, or does good health lead to good mood, or
both? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a possible
causal relationship, but cannot indicate what the causal relationship, if any, might be.
Simple linear correlations[edit]
Four sets of data with the same correlation of 0.816

The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value
generally does not completely characterize their relationship.[21] In particular, if the conditional mean of Y given X, denoted E(Y | X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y | X).
The adjacent image shows scatter plots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe.[22] The four y variables have the same mean (7.5), variance (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the variables is very different.
The first one (top left) seems to be distributed normally, and corresponds to what one would expect when
considering two variables correlated and following the assumption of normality. The second one (top right) is not
distributed normally; while an obvious relationship between the two variables can be observed, it is not linear. In this
case the Pearson correlation coefficient does not indicate that there is an exact functional relationship: only the
extent to which that relationship can be approximated by a linear relationship. In the third case (bottom left), the
linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient
from 1 to 0.816. Finally, the fourth example (bottom right) shows another example when one outlier is enough to
produce a high correlation coefficient, even though the relationship between the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace visual examination of
the data. The examples are sometimes said to demonstrate that the Pearson correlation assumes that the data
follow a normal distribution, but this is only partially correct.[4] The Pearson correlation can be accurately calculated
for any distribution that has a finite covariance matrix, which includes most distributions encountered in practice.
However, the Pearson correlation coefficient (taken together with the sample mean and variance) is only a sufficient
statistic if the data is drawn from a multivariate normal distribution. As a result, the Pearson correlation coefficient
fully characterizes the relationship between variables if and only if the data are drawn from a multivariate normal
distribution.

Bivariate normal distribution[edit]


If a pair (X, Y) of random variables follows a bivariate normal distribution, the conditional mean E(X | Y) is a linear function of Y, and the conditional mean E(Y | X) is a linear function of X. The correlation coefficient ρ_X,Y between X and Y, along with the marginal means and variances of X and Y, determines this linear relationship:

E(Y | X = x) = E(Y) + ρ_X,Y · σ_Y · (x − E(X)) / σ_X

where E(X) and E(Y) are the expected values of X and Y, respectively, and σ_X and σ_Y are the standard deviations of X and Y, respectively.

See also[edit]

 Mathematics portal

Further information: Correlation (disambiguation)

 Autocorrelation
 Canonical correlation
 Coefficient of determination
 Cointegration
 Concordance correlation coefficient
 Cophenetic correlation
 Correlation function
 Correlation gap
 Covariance
 Covariance and correlation
 Cross-correlation
 Ecological correlation
 Fraction of variance unexplained
 Genetic correlation
 Goodman and Kruskal's lambda
 Illusory correlation
 Interclass correlation
 Intraclass correlation
 Lift (data mining)
 Mean dependence
 Modifiable areal unit problem
 Multiple correlation
 Point-biserial correlation coefficient
 Quadrant count ratio
 Spurious correlation
 Statistical arbitrage
 Subindependence
Regression analysis


Regression line for 50 random points in a Gaussian distribution around the line y=1.5x+2 (not shown).

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called
the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features'). The most common form of regression
analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific
mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared
differences between the true data and that line (or hyperplane). For specific mathematical reasons (see linear regression), this allows the researcher to
estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of
values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or
Necessary Condition Analysis[1]) or estimate the conditional expectation across a broader collection of non-linear models (e.g., nonparametric regression).
Regression analysis is primarily used for two conceptually distinct purposes. First, regression analysis is widely used for prediction and forecasting,
where its use has substantial overlap with the field of machine learning. Second, in some situations regression analysis can be used to infer causal
relationships between the independent and dependent variables. Importantly, regressions by themselves only reveal relationships between a dependent
variable and a collection of independent variables in a fixed dataset. To use regressions for prediction or to infer causal relationships, respectively, a
researcher must carefully justify why existing relationships have predictive power for a new context or why a relationship between two variables has a
causal interpretation. The latter is especially important when researchers hope to estimate causal relationships using observational data.[2][3]


History[edit]
The earliest form of regression was the method of least squares, which was published by Legendre in 1805,[4] and by Gauss in 1809.[5] Legendre and
Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but
also later the then newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821,[6] including a version of
the Gauss–Markov theorem.
The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean).[7][8] For
Galton, regression had only this biological meaning,[9][10] but his work was later extended by Udny Yule and Karl Pearson to a more general statistical
context.[11][12] In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption
was weakened by R.A. Fisher in his works of 1922 and 1925.[13][14][15] Fisher assumed that the conditional distribution of the response variable is Gaussian,
but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
In the 1950s and 1960s, economists used electromechanical desk "calculators" to calculate regressions. Before 1970, it sometimes took up to 24 hours
to receive the result from one regression.[16]
Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression, regression
involving correlated responses such as time series and growth curves, regression in which the predictor (independent variable) or response variables are
curves, images, graphs, or other complex data objects, regression methods accommodating various types of missing data, nonparametric
regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor
variables than observations, and causal inference with regression.

Regression model[edit]
In practice, researchers first select a model they would like to estimate and then use their chosen method (e.g., ordinary least squares) to estimate the
parameters of that model. Regression models involve the following components:

 The unknown parameters, often denoted as a scalar or vector β.
 The independent variables, which are observed in data and are often denoted as a vector X_i (where i denotes a row of data).
 The dependent variable, which is observed in data and often denoted using the scalar Y_i.
 The error terms, which are not directly observed in data and are often denoted using the scalar e_i.
In various fields of application, different terminologies are used in place of dependent and independent variables.

Most regression models propose that Y_i is a function of X_i and β, with e_i representing an additive error term that may stand in for un-modeled determinants of Y_i or random statistical noise:

Y_i = f(X_i, β) + e_i

The researchers' goal is to estimate the function f(X_i, β) that most closely fits the data. To carry out regression analysis, the form of the function f must be specified. Sometimes the form of this function is based on knowledge about the relationship between Y_i and X_i that does not rely on the data. If no such knowledge is available, a flexible or convenient form for f is chosen. For example, a simple univariate regression may propose f(X_i, β) = β_0 + β_1 X_i, suggesting that the researcher believes Y_i = β_0 + β_1 X_i + e_i to be a reasonable approximation for the statistical process generating the data.
Once researchers determine their preferred statistical model, different forms of regression analysis provide tools to estimate the parameters $\beta$. For example, least squares (including its most common variant, ordinary least squares) finds the value of $\beta$ that minimizes the sum of squared errors $\sum_i (Y_i - f(X_i, \beta))^2$. A given regression method will ultimately provide an estimate of $\beta$, usually denoted $\hat{\beta}$ to distinguish the estimate from the true (unknown) parameter value that generated the data. Using this estimate, the researcher can then use the fitted value $\hat{Y}_i = f(X_i, \hat{\beta})$ for prediction or to assess the accuracy of the model in explaining the data. Whether the researcher is intrinsically interested in the estimate $\hat{\beta}$ or the predicted value $\hat{Y}_i$ will depend on context and their goals. As described in ordinary least squares, least squares is widely used because the estimated function $f(X_i, \hat{\beta})$ approximates the conditional expectation $E(Y_i \mid X_i)$.[5] However, alternative variants (e.g., least absolute deviations or quantile regression) are useful when researchers want to model other functions $f(X_i, \beta)$.


It is important to note that there must be sufficient data to estimate a regression model. For example, suppose that a researcher has access to $N$ rows of data with one dependent and two independent variables: $(Y_i, X_{1i}, X_{2i})$. Suppose further that the researcher wants to estimate a bivariate linear model via least squares: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$. If the researcher only has access to $N = 2$ data points, then they could find infinitely many combinations $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ that explain the data equally well: any combination can be chosen that satisfies $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i}$, all of which lead to $\sum_i \hat{e}_i^2 = 0$ and are therefore valid solutions that minimize the sum of squared residuals. To understand why there are infinitely many options, note that the system of $N = 2$ equations is to be solved for 3 unknowns, which makes the system underdetermined. Alternatively, one can visualize infinitely many 3-dimensional planes that go through $N = 2$ fixed points.

More generally, to estimate a least squares model with $k$ distinct parameters, one must have $N \geq k$ distinct data points. If $N > k$, then there does not generally exist a set of parameters that will perfectly fit the data. The quantity $N - k$ appears often in regression analysis, and is referred to as the degrees of freedom in the model. Moreover, to estimate a least squares model, the independent variables $(X_{1i}, X_{2i}, \ldots, X_{ki})$ must be linearly independent: one must not be able to reconstruct any of the independent variables by adding and multiplying the remaining independent variables. As discussed in ordinary least squares, this condition ensures that $X^{\mathsf{T}}X$ is an invertible matrix and therefore that a unique solution $\hat{\beta}$ exists.
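The following is a minimal sketch of this requirement and of least-squares estimation itself, using NumPy on simulated data; the variable names, coefficient values, and sample sizes are illustrative assumptions rather than anything prescribed by the method:

```python
# Sketch: ordinary least squares on simulated data with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n = 100                                    # number of observations (N)
X = np.column_stack([np.ones(n),           # intercept column
                     rng.normal(size=n),   # first regressor
                     rng.normal(size=n)])  # second regressor
beta_true = np.array([1.0, 2.0, -0.5])     # illustrative "true" parameters
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# k = 3 parameters: estimation needs N >= k and linearly independent columns.
beta_hat, sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
dof = n - X.shape[1]                       # degrees of freedom, N - k
print(beta_hat, dof)
```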

Underlying assumptions[edit]

By itself, a regression is simply a calculation using the data. In order to interpret the output of a regression as a meaningful statistical quantity that
measures real-world relationships, researchers often rely on a number of classical assumptions. These often include:

 The sample is representative of the population at large.
 The independent variables are measured with no error.
 Deviations from the model have an expected value of zero, conditional on covariates: $E(e_i \mid X_i) = 0$.
 The variance of the residuals $e_i$ is constant across observations (homoscedasticity).
 The residuals $e_i$ are uncorrelated with one another. Mathematically, the variance–covariance matrix of the errors is diagonal.
A handful of conditions are sufficient for the least-squares estimator to possess desirable properties: in particular, the Gauss–Markov assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. Practitioners have developed a variety of methods to maintain some or all of these desirable properties in real-world settings, because these classical assumptions are unlikely to hold exactly. For example, modeling errors-in-variables can lead to reasonable estimates when the independent variables are measured with error. Heteroscedasticity-consistent standard errors allow the variance of $e_i$ to change across values of $X_i$. Correlated errors that exist within subsets of the data or follow specific patterns can be handled using clustered standard errors, geographic weighted regression, or Newey–West standard errors, among other techniques. When rows of data correspond to locations in space, the choice of how to model $e_i$ within geographic units can have important consequences.[17][18] The subfield of econometrics is largely focused on developing techniques that allow researchers to draw reasonable real-world conclusions in settings where the classical assumptions do not hold exactly.

Linear regression[edit]
Main article: Linear regression
See simple linear regression for a derivation of these formulas and a numerical example

In linear regression, the model specification is that the dependent variable $y_i$ is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling $n$ data points there is one independent variable, $x_i$, and two parameters, $\beta_0$ and $\beta_1$:

straight line: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n$

In multiple linear regression, there are several independent variables or functions of independent variables.
Adding a term in $x_i^2$ to the preceding regression gives:

parabola: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \quad i = 1, \ldots, n$

This is still linear regression; although the expression on the right hand side is quadratic in the independent variable $x_i$, it is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$.
In both cases, $\varepsilon_i$ is an error term and the subscript $i$ indexes a particular observation.
Returning our attention to the straight line case: Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

$\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i$

The residual, $e_i = y_i - \widehat{y}_i$, is the difference between the value of the dependent variable predicted by the model, $\widehat{y}_i$, and the true value of the dependent variable, $y_i$. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSR:

$\mathrm{SSR} = \sum_{i=1}^{n} e_i^2$

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, $\widehat{\beta}_0, \widehat{\beta}_1$.

Illustration of linear regression on a data set.

In the case of simple regression, the formulas for the least squares estimates are

$\widehat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x}$,

where $\bar{x}$ is the mean (average) of the $x$ values and $\bar{y}$ is the mean of the $y$ values.
Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

$\widehat{\sigma}^2_{\varepsilon} = \frac{\mathrm{SSR}}{n - 2}$

This is called the mean square error (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data, $(n - p)$ for $p$ regressors or $(n - p - 1)$ if an intercept is used.[19] In this case, $p = 1$ so the denominator is $n - 2$.


The standard errors of the parameter estimates are given by

$\widehat{\sigma}_{\beta_0} = \widehat{\sigma}_{\varepsilon} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$ and $\widehat{\sigma}_{\beta_1} = \widehat{\sigma}_{\varepsilon} \sqrt{\frac{1}{\sum (x_i - \bar{x})^2}}$.

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
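A short sketch of these closed-form estimates, the MSE, and the standard errors, computed with NumPy; the x and y values below are purely illustrative assumptions:

```python
# Sketch: closed-form simple linear regression estimates and standard errors.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
b0 = ybar - b1 * xbar                                           # intercept

residuals = y - (b0 + b1 * x)
mse = np.sum(residuals ** 2) / (n - 2)        # denominator n - 2: two parameters
se_b1 = np.sqrt(mse / np.sum((x - xbar) ** 2))
se_b0 = np.sqrt(mse * (1.0 / n + xbar ** 2 / np.sum((x - xbar) ** 2)))
print(b0, b1, se_b0, se_b1)
```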
General linear model[edit]
For a derivation, see linear least squares
For a numerical example, see linear regression

In the more general multiple regression model, there are $p$ independent variables:

$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$,

where $x_{ij}$ is the $i$-th observation on the $j$-th independent variable. If the first independent variable takes the value 1 for all $i$, $x_{i1} = 1$, then $\beta_1$ is called the regression intercept.
The least squares parameter estimates are obtained from $p$ normal equations. The residual can be written as

$\varepsilon_i = y_i - \widehat{\beta}_1 x_{i1} - \cdots - \widehat{\beta}_p x_{ip}$

The normal equations are

$\sum_{i=1}^{n} \sum_{k=1}^{p} x_{ij} x_{ik} \widehat{\beta}_k = \sum_{i=1}^{n} x_{ij} y_i, \quad j = 1, \ldots, p$

In matrix notation, the normal equations are written as

$(X^{\mathsf{T}} X)\, \widehat{\boldsymbol{\beta}} = X^{\mathsf{T}} \mathbf{y}$,

where the $ij$ element of $X$ is $x_{ij}$, the $i$ element of the column vector $\mathbf{y}$ is $y_i$, and the $j$ element of $\widehat{\boldsymbol{\beta}}$ is $\widehat{\beta}_j$. Thus $X$ is $n \times p$, $\mathbf{y}$ is $n \times 1$, and $\widehat{\boldsymbol{\beta}}$ is $p \times 1$. The solution is

$\widehat{\boldsymbol{\beta}} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} \mathbf{y}$.
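As a sketch, the matrix form of the normal equations can be solved directly, here with NumPy on simulated data; the sample size and coefficients are illustrative, and in practice a least-squares or QR routine is usually preferred for numerical stability:

```python
# Sketch: solving the normal equations (X'X) beta_hat = X'y with NumPy.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = 1 (intercept)
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.2, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
print(beta_hat)
```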

Diagnostics[edit]
Main article: Regression diagnostics
See also: Category:Regression diagnostics
Once a regression model has been constructed, it may be important to confirm the goodness of
fit of the model and the statistical significance of the estimated parameters. Commonly used
checks of goodness of fit include the R-squared, analyses of the pattern of residuals and
hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed
by t-tests of individual parameters.
Interpretations of these diagnostic tests rest heavily on the model's assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.
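A minimal sketch of two such checks, R-squared and a crude residual inspection, on illustrative data; formal F- and t-tests would normally come from a statistical package rather than code like this:

```python
# Sketch: goodness-of-fit checks computed from observed and fitted values.
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot

def residual_spread_check(y, y_hat):
    # Crude homoscedasticity check: residual spread for low vs high fitted
    # values should be roughly comparable if constant variance holds.
    e = y - y_hat
    order = np.argsort(y_hat)
    half = len(e) // 2
    return e[order[:half]].std(ddof=1), e[order[half:]].std(ddof=1)

y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])        # illustrative observations
y_hat = np.array([2.1, 4.0, 6.0, 8.0, 10.0])   # illustrative fitted values
print(r_squared(y, y_hat), residual_spread_check(y, y_hat))
```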
Limited dependent variables[edit]
Limited dependent variables, which are response variables that are categorical variables or are
variables constrained to fall only in a certain range, often arise in econometrics.
The response variable may be non-continuous ("limited" to lie on some subset of the real line). For
binary (zero or one) variables, if analysis proceeds with least-squares linear regression, the model
is called the linear probability model. Nonlinear models for binary dependent variables include
the probit and logit model. The multivariate probit model is a standard method of estimating a joint
relationship between several binary dependent variables and some independent variables.
For categorical variables with more than two values there is the multinomial logit. For ordinal
variables with more than two values, there are the ordered logit and ordered
probit models. Censored regression models may be used when the dependent variable is only
sometimes observed, and Heckman correction type models may be used when the sample is not
randomly selected from the population of interest. An alternative to such procedures is linear
regression based on polychoric correlation (or polyserial correlations) between the categorical
variables. Such procedures differ in the assumptions made about the distribution of the variables in
the population. If the variable is positive with low values and represents the repetition of the
occurrence of an event, then count models like the Poisson regression or the negative
binomial model may be used.
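As a hedged sketch of one such model, the following fits a logit for a binary dependent variable by Newton–Raphson on simulated data; the data-generating values are assumptions, and in practice a library routine (for example in statsmodels or scikit-learn) would normally be used:

```python
# Sketch: logit model for a binary outcome fitted by Newton-Raphson.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])          # intercept + one regressor
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.2]))))    # illustrative probabilities
y = rng.binomial(1, p_true)                                    # binary (0/1) responses

beta = np.zeros(X.shape[1])
for _ in range(25):                          # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # current fitted probabilities
    W = p * (1.0 - p)                        # iteration weights
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    beta = beta + np.linalg.solve(hess, grad)
print(beta)                                  # estimated logit coefficients
```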

Nonlinear regression[edit]
Main article: Nonlinear regression
When the model function is not linear in the parameters, the sum of squares must be minimized by
an iterative procedure. This introduces many complications which are summarized in Differences
between linear and non-linear least squares.
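A brief sketch of such an iterative fit, using SciPy's curve_fit on an exponential-decay model chosen purely for illustration; the model form, starting values, and noise level are assumptions, not part of the article:

```python
# Sketch: iterative nonlinear least squares for a model nonlinear in its parameters.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a * np.exp(-b * x) + c            # nonlinear in the parameter b

rng = np.random.default_rng(3)
x = np.linspace(0, 4, 60)
y = model(x, 2.5, 1.3, 0.5) + rng.normal(scale=0.1, size=x.size)

params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0])  # p0: starting values
print(params)
```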

Interpolation and extrapolation[edit]

In the middle, the interpolated straight line represents the best balance between the points above and below this line.
The dotted lines represent the two extreme lines. The first curves represent the estimated values. The outer curves
represent a prediction for a new measurement.[20]

Regression models predict a value of the Y variable given known values of the X variables.
Prediction within the range of values in the dataset used for model-fitting is known informally
as interpolation. Prediction outside this range of the data is known as extrapolation. Performing
extrapolation relies strongly on the regression assumptions. The further the extrapolation goes
outside the data, the more room there is for the model to fail due to differences between the
assumptions and the sample data or the true values.
It is generally advised[citation needed] that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.
For such reasons and others, some tend to say that it might be unwise to undertake
extrapolation.[21]
However, this does not cover the full set of modeling errors that may be made: in particular, the
assumption of a particular form for the relation between Y and X. A properly conducted regression
analysis will include an assessment of how well the assumed form is matched by the observed
data, but it can only do so within the range of values of the independent variables actually
available. This means that any extrapolation is particularly reliant on the assumptions being made
about the structural form of the regression relationship. Best-practice advice here[citation needed] is that a
linear-in-variables and linear-in-parameters relationship should not be chosen simply for
computational convenience, but that all available knowledge should be deployed in constructing a
regression model. If this knowledge includes the fact that the dependent variable cannot go outside
a certain range of values, this can be made use of in selecting the model – even if the observed
dataset has no values particularly near such bounds. The implications of this step of choosing an
appropriate functional form for the regression can be great when extrapolation is considered. At a
minimum, it can ensure that any extrapolation arising from a fitted model is "realistic" (or in accord
with what is known).
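A small sketch of the distinction, fitting a straight line with NumPy and then predicting inside and far outside the observed x-range; the data points are illustrative only:

```python
# Sketch: interpolation vs extrapolation with a fitted straight line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)             # slope and intercept of a straight line

x_inside, x_outside = 3.5, 12.0
print(b0 + b1 * x_inside)    # interpolation: within [1, 5], relatively safe
print(b0 + b1 * x_outside)   # extrapolation: far outside the data, relies
                             # heavily on the assumed linear form
```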

Power and sample size calculations[edit]


There are no generally agreed methods for relating the number of observations versus the number of independent variables in the model. One rule of thumb conjectured by Good and Hardin is $N = m^{n}$, where $N$ is the sample size, $n$ is the number of independent variables and $m$ is the number of observations needed to reach the desired precision if the model had only one independent variable.[22] For example, a researcher is building a linear regression model using a dataset that contains 1000 patients ($N = 1000$). If the researcher decides that five observations are needed to precisely define a straight line ($m = 5$), then the maximum number of independent variables the model can support is 4, because $\frac{\log 1000}{\log 5} \approx 4.29$.
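The arithmetic behind this example can be sketched in a few lines, using the same numbers as above:

```python
# Sketch: with N = 1000 and m = 5, the largest n satisfying m**n <= N is
# floor(log(1000) / log(5)) = 4, since 5**4 = 625 <= 1000 < 5**5 = 3125.
import math

N, m = 1000, 5
n_max = math.floor(math.log(N) / math.log(m))
print(n_max, m ** n_max, m ** (n_max + 1))   # 4, 625, 3125
```
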
Other methods[edit]
Although the parameters of a regression model are usually estimated using the method of least
squares, other methods which have been used include:

 Bayesian methods, e.g. Bayesian linear regression
 Percentage regression, for situations where reducing percentage errors is deemed more appropriate.[23]
 Least absolute deviations, which is more robust in the presence of outliers, leading to quantile regression
 Nonparametric regression, which requires a large number of observations and is computationally intensive
 Scenario optimization, leading to interval predictor models
 Distance metric learning, which is learned by the search of a meaningful distance metric in a given input space.[24]

Software[edit]
For a more comprehensive list, see List of statistical packages.
All major statistical software packages perform least squares regression analysis and
inference. Simple linear regression and multiple regression using least squares can be done in
some spreadsheet applications and on some calculators. While many statistical software
packages can perform various types of nonparametric and robust regression, these methods
are less standardized; different software packages implement different methods, and a method
with a given name may be implemented differently in different packages. Specialized
regression software has been developed for use in fields such as survey analysis and
neuroimaging.

See also[edit]

 Mathematics portal

 Anscombe's quartet
 Curve fitting
 Estimation theory
 Forecasting
 Fraction of variance unexplained
 Function approximation
 Generalized linear models
 Kriging (a linear least squares estimation algorithm)
 Local regression
 Modifiable areal unit problem
 Multivariate adaptive regression splines
 Multivariate normal distribution
 Pearson product-moment correlation coefficient
 Quasi-variance
 Prediction interval
 Regression validation
 Robust regression
 Segmented regression
 Signal processing
 Stepwise regression
 Trend estimation
Multivariate statistics
"Multivariate analysis" redirects here. For the usage in mathematics, see Multivariable calculus.
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable.
Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they
relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate
analyses in order to understand the relationships between variables and their relevance to the problem being studied.
In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both

 how these can be used to represent the distributions of observed data;


 how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis.
Certain types of problems involving multivariate data, for example simple linear regression and multiple regression, are not usually considered to be
special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome
variable given the other variables.

Contents

 1 Multivariate analysis
o 1.1 Types of analysis
 2 Important probability distributions
 3 History
 4 Applications
 5 Software and tools
 6 See also
 7 References
 8 Further reading
 9 External links

Multivariate analysis[edit]
Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to address the situations where multiple
measurements are made on each experimental unit and the relations among these measurements and their structures are important.[1] A modern,
overlapping categorization of MVA includes:[1]

 Normal and general multivariate models and distribution theory


 The study and measurement of relationships
 Probability computations of multidimensional regions
 The exploration of data structures and patterns
Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical
"system-of-systems". Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often
eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an
equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a Monte Carlo simulation across the
design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-
surface equations.
Types of analysis[edit]
There are many different models, each with its own type of analysis:

1. Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover cases where there is more than one dependent variable
to be analyzed simultaneously; see also Multivariate analysis of covariance (MANCOVA).
2. Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the general linear model. Some suggest that multivariate regression is distinct from multivariable regression; however, that is debated and not consistently true across scientific fields.[2]
3. Principal components analysis (PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates
the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
4. Factor analysis is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving
the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to
account for covariation in a group of observed variables.
5. Canonical correlation analysis finds linear relationships among two sets of variables; it is the generalised (i.e. canonical) version of
bivariate[3] correlation.
6. Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables
from one set of (independent) variables that explain as much variance as possible in another (independent) set. It is a multivariate analogue
of regression.
7. Correspondence analysis (CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The
underlying model assumes chi-squared dissimilarities among records (cases).
8. Canonical (or "constrained") correspondence analysis (CCA) for summarising the joint variation in two sets of variables (like redundancy
analysis); combination of correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared
dissimilarities among records (cases).
9. Multidimensional scaling comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances
between records. The original method is principal coordinates analysis (PCoA; based on PCA).
10. Discriminant analysis, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or
more groups of cases.
11. Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new
observations.
12. Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other
than objects from different clusters.
13. Recursive partitioning creates a decision tree that attempts to correctly classify members of the population based on a dichotomous
dependent variable.
14. Artificial neural networks extend regression and clustering methods to non-linear multivariate models.
15. Statistical graphics such as tours, parallel coordinate plots, scatterplot matrices can be used to explore multivariate data.
16. Simultaneous equations models involve more than one regression equation, with different dependent variables, estimated together.
17. Vector autoregression involves simultaneous regressions of various time series variables on their own and each other's lagged values.
18. Principal response curves analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting
for changes in control treatments over time.[4]
19. Iconography of correlations consists of replacing a correlation matrix with a diagram in which the "remarkable" correlations are represented by a solid line (positive correlation) or a dotted line (negative correlation).

Important probability distributions[edit]


There is a set of probability distributions used in multivariate analyses that play a similar role to the corresponding set of distributions that are used
in univariate analysis when the normal distribution is appropriate to a dataset. These multivariate distributions are:

 Multivariate normal distribution


 Wishart distribution
 Multivariate Student-t distribution.
The Inverse-Wishart distribution is important in Bayesian inference, for example in Bayesian multivariate linear regression. Additionally, Hotelling's
T-squared distribution is a multivariate distribution, generalising Student's t-distribution, that is used in multivariate hypothesis testing.
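A minimal sketch of working with the first of these, the multivariate normal: drawing a sample and estimating its mean vector and covariance matrix with NumPy. The mean and covariance values used here are illustrative assumptions:

```python
# Sketch: sampling from a multivariate normal and estimating its parameters.
import numpy as np

mean = np.array([0.0, 1.0, 2.0])
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 2.0, 0.3],
                [0.2, 0.3, 1.5]])

rng = np.random.default_rng(4)
sample = rng.multivariate_normal(mean, cov, size=1000)   # 1000 x 3 data matrix

mean_hat = sample.mean(axis=0)
cov_hat = np.cov(sample, rowvar=False)   # sample covariance matrix
print(mean_hat)
print(cov_hat)
```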

History[edit]
Anderson's 1958 textbook, An Introduction to Multivariate Statistical Analysis,[5] educated a generation of theorists and applied statisticians;
Anderson's book emphasizes hypothesis testing via likelihood ratio tests and the properties of power
functions: admissibility, unbiasedness and monotonicity.[6][7]
MVA was once confined largely to statistical theory because of the size and complexity of the underlying data sets and the high computational cost of the analyses. With the dramatic growth of computational power, MVA now plays an increasingly important role in data analysis and has wide application in OMICS fields.

Applications[edit]
 Multivariate hypothesis testing
 Dimensionality reduction
 Latent structure discovery
 Clustering
 Multivariate regression analysis
 Classification and discrimination analysis
 Variable selection
 Multidimensional Scaling
 Data mining

Software and tools[edit]


There are an enormous number of software packages and other tools for multivariate analysis, including:

 JMP (statistical software)


 MiniTab
 Calc
 PSPP
 R[8]
 SAS (software)
 SciPy for Python
 SPSS
 Stata
 STATISTICA
 The Unscrambler
 WarpPLS
 SmartPLS
 MATLAB
 Eviews
 NCSS (statistical software)
 SIMCA

See also[edit]
 Estimation of covariance matrices
 Important publications in multivariate analysis
 Multivariate testing in marketing
 Structured data analysis (statistics)
 Structural equation modeling
 RV coefficient
 Bivariate analysis
 Design of experiments (DoE)
 Dimensional analysis
 Exploratory data analysis
 OLS
 Partial least squares regression
 Pattern recognition
 Principal component analysis (PCA)
 Regression analysis
 Soft independent modelling of class analogies (SIMCA)
 Statistical interference
 Univariate analysis
Time series
Time series: random data plus trend, with best-fit line and different applied filters

In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at
successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts
of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series are very frequently plotted via run charts (a temporal line chart). Time series are used in statistics, signal processing, pattern
recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control
engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which
involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the
data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often
employed in such a way as to test relationships between one or more different time series, this type of analysis is not usually called "time series
analysis", which refers in particular to relationships between different points in time within a single series. Interrupted time series analysis is used to
detect changes in the evolution of a time series from before to after some intervention which may affect the underlying variable.
Time series data have a natural temporal ordering. This makes time series analysis distinct from cross-sectional studies, in which there is no natural
ordering of the observations (e.g. explaining people's wages by reference to their respective education levels, where the individuals' data could be
entered in any order). Time series analysis is also distinct from spatial data analysis where the observations typically relate to geographical locations
(e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A stochastic model for a time series will generally
reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will
often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values,
rather than from future values (see time reversibility).
Time series analysis can be applied to real-valued, continuous data, discrete numeric data, or discrete symbolic data (i.e. sequences of characters, such
as letters and words in the English language[1]).

Contents

 1 Methods for analysis
 2 Panel data
 3 Analysis
o 3.1 Motivation
o 3.2 Exploratory analysis
o 3.3 Curve fitting
o 3.4 Function approximation
o 3.5 Prediction and forecasting
o 3.6 Classification
o 3.7 Signal estimation
o 3.8 Segmentation
 4 Models
o 4.1 Notation
o 4.2 Conditions
o 4.3 Tools
o 4.4 Measures
 5 Visualization
o 5.1 Overlapping charts
o 5.2 Separated charts
 6 See also
 7 References
 8 Further reading
 9 External links

Methods for analysis[edit]


Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral
analysis and wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In the time domain, correlation and analysis can be made
in a filter-like manner using scaled correlation, thereby mitigating the need to operate in the frequency domain.
Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the
underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using
an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic
process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process
has any particular structure.
Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.
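As a sketch of one basic time-domain tool, the sample autocorrelation function can be computed directly with NumPy; the series below, a simulated random walk, is purely illustrative:

```python
# Sketch: sample autocorrelation function (ACF) of a series.
import numpy as np

def sample_acf(x, max_lag):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    # Lag-k autocorrelation: covariance at lag k divided by the variance
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(5)
series = np.cumsum(rng.normal(size=200))   # random walk: strong serial dependence
print(sample_acf(series, max_lag=5))
```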

Panel data[edit]
A time series is one type of panel data. Panel data is the general class, a multidimensional data set, whereas a time series data set is a one-dimensional panel (as is a cross-sectional data set). A data set may exhibit characteristics of both panel data and time series data. One way to tell is to ask what makes one data record unique from the other records. If the answer is the time data field, then this is a time series data set candidate. If determining a unique record requires a time data field and an additional identifier which is unrelated to time (student ID, stock symbol, country code), then it is a panel data candidate. If the differentiation lies in the non-time identifier, then the data set is a cross-sectional data set candidate.

Analysis[edit]
There are several types of motivation and data analysis available for time series which are appropriate for different purposes.
Motivation[edit]
In the context of statistics, econometrics, quantitative finance, seismology, meteorology, and geophysics the primary goal of time series analysis
is forecasting. In the context of signal processing, control engineering and communication engineering it is used for signal detection. Other applications are in data mining, pattern recognition and machine learning, where time series analysis can be used for clustering,[2][3] classification,[4] query by content,[5] anomaly detection as well as forecasting.[citation needed]
Exploratory analysis[edit]

Tuberculosis incidence US 1953-2009

Further information: Exploratory analysis


A straightforward way to examine a regular time series is manually with a line chart. An example chart is shown on the right for tuberculosis incidence in
the United States, made with a spreadsheet program. The number of cases was standardized to a rate per 100,000 and the percent change per year in
this rate was calculated. The nearly steadily dropping line shows that the TB incidence was decreasing in most years, but the percent change in this rate
varied by as much as +/- 10%, with 'surges' in 1975 and around the early 1990s. The use of both vertical axes allows the comparison of two time series
in one graphic.
Other techniques include:

 Autocorrelation analysis to examine serial dependence


 Spectral analysis to examine cyclic behavior which need not be related to seasonality. For example, sun spot activity varies over 11 year
cycles.[6][7] Other common examples include celestial phenomena, weather patterns, neural activity, commodity prices, and economic activity.
 Separation into components representing trend, seasonality, slow and fast variation, and cyclical irregularity: see trend estimation and decomposition
of time series
Curve fitting[edit]
Main article: Curve fitting
Curve fitting[8][9] is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points,[10] possibly subject to
constraints.[11][12] Curve fitting can involve either interpolation,[13][14] where an exact fit to the data is required, or smoothing,[15][16] in which a "smooth" function
is constructed that approximately fits the data. A related topic is regression analysis,[17][18] which focuses more on questions of statistical inference such as
how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data
visualization,[19][20] to infer values of a function where no data are available,[21] and to summarize the relationships among two or more
variables.[22] Extrapolation refers to the use of a fitted curve beyond the range of the observed data,[23] and is subject to a degree of uncertainty[24] since it
may reflect the method used to construct the curve as much as it reflects the observed data.
The construction of economic time series involves the estimation of some components for some dates by interpolation between values ("benchmarks")
for earlier and later dates. Interpolation is estimation of an unknown quantity between two known quantities (historical data), or drawing conclusions
about missing information from the available information ("reading between the lines").[25] Interpolation is useful where the data surrounding the missing
data is available and its trend, seasonality, and longer-term cycles are known. This is often done by using a related series known for all relevant
dates.[26] Alternatively polynomial interpolation or spline interpolation is used where piecewise polynomial functions are fit into time intervals such that they
fit smoothly together. A different problem which is closely related to interpolation is the approximation of a complicated function by a simple function (also
called regression). The main difference between regression and interpolation is that polynomial regression gives a single polynomial that models the entire data set. Spline interpolation, however, yields a piecewise continuous function composed of many polynomials to model the data set.
Extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another
variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a
higher risk of producing meaningless results.
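A short sketch of the contrast between a single fitted polynomial and a piecewise spline, on illustrative data and assuming SciPy is available:

```python
# Sketch: polynomial regression vs spline interpolation on the same points.
import numpy as np
from scipy.interpolate import CubicSpline

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative time points ("benchmarks")
v = np.array([1.0, 2.7, 5.8, 6.1, 8.9])   # illustrative values

poly = np.polynomial.Polynomial.fit(t, v, deg=2)  # one polynomial for the whole series
spline = CubicSpline(t, v)                        # piecewise polynomials joined smoothly

t_new = 2.5                                       # interpolation between benchmarks
print(poly(t_new), spline(t_new))
```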
Function approximation[edit]
Main article: Function approximation
In general, a function approximation problem asks us to select a function among a well-defined class that closely matches ("approximates") a target
function in a task-specific way. One can distinguish two major classes of function approximation problems: First, for known target functions approximation
theory is the branch of numerical analysis that investigates how certain known functions (for example, special functions) can be approximated by a
specific class of functions (for example, polynomials or rational functions) that often have desirable properties (inexpensive computation, continuity,
integral and limit values, etc.).
Second, the target function, call it g, may be unknown; instead of an explicit formula, only a set of points (a time series) of the form (x, g(x)) is provided.
Depending on the structure of the domain and codomain of g, several techniques for approximating g may be applicable. For example, if g is an
operation on the real numbers, techniques of interpolation, extrapolation, regression analysis, and curve fitting can be used. If the codomain (range or
target set) of g is a finite set, one is dealing with a classification problem instead. A related problem of online time series approximation[27] is to summarize
the data in one-pass and construct an approximate representation that can support a variety of time series queries with bounds on worst-case error.
To some extent the different problems (regression, classification, fitness approximation) have received a unified treatment in statistical learning theory,
where they are viewed as supervised learning problems.
Prediction and forecasting[edit]
In statistics, prediction is a part of statistical inference. One particular approach to such inference is known as predictive inference, but the prediction can
be undertaken within any of the several approaches to statistical inference. Indeed, one description of statistics is that it provides a means of transferring
knowledge about a sample of a population to the whole population, and to other related populations, which is not necessarily the same as prediction over
time. When information is transferred across time, often to specific points in time, the process is known as forecasting.

 Fully formed statistical models for stochastic simulation purposes, so as to generate alternative versions of the time series, representing what might
happen over non-specific time-periods in the future
 Simple or fully formed statistical models to describe the likely outcome of the time series in the immediate future, given knowledge of the most recent
outcomes (forecasting).
 Forecasting on time series is usually done using automated statistical software packages and programming languages, such
as Julia, Python, R, SAS, SPSS and many others.
 Forecasting on large scale data can be done with Apache Spark using the Spark-TS library, a third-party package.[28]
Classification[edit]
Main article: Statistical classification
Assigning a time series pattern to a specific category, for example identifying a word based on a series of hand movements in sign language.
Signal estimation[edit]
See also: Signal processing and Estimation theory
This approach is based on harmonic analysis and filtering of signals in the frequency domain using the Fourier transform, and spectral density estimation,
the development of which was significantly accelerated during World War II by mathematician Norbert Wiener, electrical engineers Rudolf E.
Kálmán, Dennis Gabor and others for filtering signals from noise and predicting signal values at a certain point in time. See Kalman filter, Estimation
theory, and Digital signal processing
Segmentation[edit]
Main article: Time-series segmentation
Splitting a time-series into a sequence of segments. It is often the case that a time-series can be represented as a sequence of individual segments,
each with its own characteristic properties. For example, the audio signal from a conference call can be partitioned into pieces corresponding to the times
during which each person was speaking. In time-series segmentation, the goal is to identify the segment boundary points in the time-series, and to
characterize the dynamical properties associated with each segment. One can approach this problem using change-point detection, or by modeling the
time-series as a more sophisticated system, such as a Markov jump linear system.

Models[edit]
Models for time series data can have many forms and represent different stochastic processes. When modeling variations in the level of a process, three
broad classes of practical importance are the autoregressive (AR) models, the integrated (I) models, and the moving average (MA) models. These three
classes depend linearly on previous data points.[29] Combinations of these ideas produce autoregressive moving average (ARMA) and autoregressive
integrated moving average (ARIMA) models. The autoregressive fractionally integrated moving average (ARFIMA) model generalizes the former three.
Extensions of these classes to deal with vector-valued data are available under the heading of multivariate time-series models and sometimes the
preceding acronyms are extended by including an initial "V" for "vector", as in VAR for vector autoregression. An additional set of extensions of these
models is available for use where the observed time-series is driven by some "forcing" time-series (which may not have a causal effect on the observed
series): the distinction from the multivariate case is that the forcing series may be deterministic or under the experimenter's control. For these models, the
acronyms are extended with a final "X" for "exogenous".
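As a sketch of the simplest of these classes, the following simulates an AR(2) process and forms a one-step-ahead forecast with NumPy; the coefficients are illustrative and chosen to satisfy the stationarity conditions:

```python
# Sketch: simulating an AR(2) process and making a one-step-ahead forecast.
import numpy as np

rng = np.random.default_rng(6)
phi1, phi2 = 0.6, 0.3            # illustrative autoregressive coefficients
n = 300
x = np.zeros(n)
eps = rng.normal(scale=1.0, size=n)
for t in range(2, n):
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]

# One-step-ahead forecast from the last two observed values
forecast = phi1 * x[-1] + phi2 * x[-2]
print(forecast)
```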
Non-linear dependence of the level of a series on previous data points is of interest, partly because of the possibility of producing a chaotic time series.
However, more importantly, empirical investigations can indicate the advantage of using predictions derived from non-linear models, over those from
linear models, as for example in nonlinear autoregressive exogenous models. Further references on nonlinear time series analysis: (Kantz and
Schreiber),[30] and (Abarbanel)[31]
Among other types of non-linear time series models, there are models to represent the changes of variance over time (heteroskedasticity). These models represent autoregressive conditional heteroskedasticity (ARCH) and the collection comprises a wide variety of representations (GARCH, TARCH, EGARCH, FIGARCH, CGARCH, etc.). Here changes in variability are related to, or predicted by, recent past values of the observed series. This is in
contrast to other possible representations of locally varying variability, where the variability might be modelled as being driven by a separate time-varying
process, as in a doubly stochastic model.
In recent work on model-free analyses, wavelet transform based methods (for example locally stationary wavelets and wavelet decomposed neural
networks) have gained favor. Multiscale (often referred to as multiresolution) techniques decompose a given time series, attempting to illustrate time
dependence at multiple scales. See also Markov switching multifractal (MSMF) techniques for modeling volatility evolution.
A Hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved
(hidden) states. An HMM can be considered as the simplest dynamic Bayesian network. HMM models are widely used in speech recognition, for
translating a time series of spoken words into text.
Notation[edit]
A number of different notations are in use for time-series analysis. A common notation specifying a time series X that is indexed by the natural
numbers is written
X = (X1, X2, ...).
Another common notation is
Y = (Yt: t ∈ T),
where T is the index set.
Conditions[edit]
There are two sets of conditions under which much of the theory is built:

 Stationary process
 Ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict stationarity and second-order stationarity. Both models
and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly
specified.
In addition, time-series analysis can be applied where the series are seasonally stationary or non-stationary. Situations where the amplitudes of
frequency components change with time can be dealt with in time-frequency analysis which makes use of a time–frequency representation of a
time-series or signal.[32]
Tools[edit]
Tools for investigating time-series data include:

 Consideration of the autocorrelation function and the spectral density function (also cross-correlation functions and cross-spectral density
functions)
 Scaled cross- and auto-correlation functions to remove contributions of slow components[33]
 Performing a Fourier transform to investigate the series in the frequency domain
 Use of a filter to remove unwanted noise
 Principal component analysis (or empirical orthogonal function analysis)
 Singular spectrum analysis
 "Structural" models:
o General State Space Models
o Unobserved Components Models
 Machine Learning
o Artificial neural networks
o Support vector machine
o Fuzzy logic
o Gaussian process
o Hidden Markov model
 Queueing theory analysis
 Control chart
o Shewhart individuals control chart
o CUSUM chart
o EWMA chart
 Detrended fluctuation analysis
 Nonlinear mixed-effects modeling
 Dynamic time warping[34]
 Cross-correlation[35]
 Dynamic Bayesian network
 Time-frequency analysis techniques:
o Fast Fourier transform
o Continuous wavelet transform
o Short-time Fourier transform
o Chirplet transform
o Fractional Fourier transform
 Chaotic analysis
o Correlation dimension
o Recurrence plots
o Recurrence quantification analysis
o Lyapunov exponents
o Entropy encoding
Measures[edit]
Time series metrics or features that can be used for time series classification or regression analysis:[36]

 Univariate linear measures


o Moment (mathematics)
o Spectral band power
o Spectral edge frequency
o Accumulated Energy (signal processing)
o Characteristics of the autocorrelation function
o Hjorth parameters
o FFT parameters
o Autoregressive model parameters
o Mann–Kendall test
 Univariate non-linear measures
o Measures based on the correlation sum
o Correlation dimension
o Correlation integral
o Correlation density
o Correlation entropy
o Approximate entropy[37]
o Sample entropy
o Fourier entropy
o Wavelet entropy
o Rényi entropy
o Higher-order methods
o Marginal predictability
o Dynamical similarity index
o State space dissimilarity measures
o Lyapunov exponent
o Permutation methods
o Local flow
 Other univariate measures
o Algorithmic complexity
o Kolmogorov complexity estimates
o Hidden Markov Model states
o Rough path signature[38]
o Surrogate time series and surrogate correction
o Loss of recurrence (degree of non-stationarity)
 Bivariate linear measures
o Maximum linear cross-correlation
o Linear Coherence (signal processing)
 Bivariate non-linear measures
o Non-linear interdependence
o Dynamical Entrainment (physics)
o Measures for Phase synchronization
o Measures for Phase locking
 Similarity measures:[39]
o Cross-correlation
o Dynamic Time Warping[34]
o Hidden Markov Models
o Edit distance
o Total correlation
o Newey–West estimator
o Prais–Winsten transformation
o Data as Vectors in a Metrizable Space
 Minkowski distance
 Mahalanobis distance
o Data as time series with envelopes
 Global standard deviation
 Local standard deviation
 Windowed standard deviation
o Data interpreted as stochastic series
 Pearson product-moment correlation coefficient
 Spearman's rank correlation coefficient
o Data interpreted as a probability distribution function
 Kolmogorov–Smirnov test
 Cramér–von Mises criterion

Visualization[edit]
Time series can be visualized with two categories of chart: overlapping charts and separated charts. Overlapping charts display all time series on the same layout, while separated charts present them on different layouts (but aligned for comparison purposes).[40]
Overlapping charts[edit]
 Braided graphs
 Line charts
 Slope graphs
 GapChart
Separated charts[edit]
 Horizon graphs
 Reduced line chart (small multiples)
 Silhouette graph
 Circular silhouette graph

See also[edit]
 Anomaly time series
 Chirp
 Decomposition of time series
 Detrended fluctuation analysis
 Digital signal processing
 Distributed lag
 Estimation theory
 Forecasting
 Hurst exponent
 Monte Carlo method
 Panel analysis
 Random walk
 Scaled correlation
 Seasonal adjustment
 Sequence analysis
 Signal processing
 Time series database (TSDB)
 Trend estimation
 Unevenly spaced time series

References[edit]
1. ^ Lin, Jessica; Keogh, Eamonn; Lonardi, Stefano; Chiu, Bill (2003). "A symbolic representation of time series, with implications for streaming algorithms". Proceedings of the 8th
ACM SIGMOD workshop on Research issues in data mining and knowledge discovery. New York: ACM Press. pp. 2–
11. CiteSeerX 10.1.1.14.5597. doi:10.1145/882082.882086. S2CID 6084733.
2. ^ Liao, T. Warren (2005). "Clustering of time series data - a survey". Pattern Recognition. Elsevier. 38 (11): 1857–1874. doi:10.1016/j.patcog.2005.01.025. –
via ScienceDirect (subscription required)
3. ^ Aghabozorgi, Saeed; Shirkhorshidi, Ali S.; Wah, Teh Y. (2015). "Time-series clustering – A decade review". Information Systems. Elsevier. 53: 16–
38. doi:10.1016/j.is.2015.04.007. – via ScienceDirect (subscription required)
4. ^ Keogh, Eamonn J. (2003). "On the need for time series data mining benchmarks". Data Mining and Knowledge Discovery. Kluwer. 7: 349–
371. doi:10.1145/775047.775062. ISBN 158113567X. – via ACM Digital Library (subscription required)
5. ^ Agrawal, Rakesh; Faloutsos, Christos; Swami, Arun (October 1993). "Efficient Similarity Search In Sequence Databases". Proceedings of the 4th International Conference on
Foundations of Data Organization and Algorithms. International Conference on Foundations of Data Organization and Algorithms. 730. pp. 69–84. doi:10.1007/3-540-57301-
1_5. – via SpringerLink (subscription required)
6. ^ Bloomfield, P. (1976). Fourier analysis of time series: An introduction. New York: Wiley. ISBN 978-0471082569.
7. ^ Shumway, R. H. (1988). Applied statistical time series analysis. Englewood Cliffs, NJ: Prentice Hall. ISBN 978-0130415004.
8. ^ Sandra Lach Arlinghaus, PHB Practical Handbook of Curve Fitting. CRC Press, 1994.
9. ^ William M. Kolb. Curve Fitting for Programmable Calculators. Syntec, Incorporated, 1984.
10. ^ S.S. Halli, K.V. Rao. 1992. Advanced Techniques of Population Analysis. ISBN 0306439972 Page 165 (cf. ... functions are fulfilled if we have a good to moderate fit for the
observed data.)
11. ^ The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. By Nate Silver
12. ^ Data Preparation for Data Mining: Text. By Dorian Pyle.
13. ^ Numerical Methods in Engineering with MATLAB®. By Jaan Kiusalaas. Page 24.
14. ^ Numerical Methods in Engineering with Python 3. By Jaan Kiusalaas. Page 21.
15. ^ Numerical Methods of Curve Fitting. By P. G. Guest, Philip George Guest. Page 349.
16. ^ See also: Mollifier
17. ^ Fitting Models to Biological Data Using Linear and Nonlinear Regression. By Harvey Motulsky, Arthur Christopoulos.
18. ^ Regression Analysis By Rudolf J. Freund, William J. Wilson, Ping Sa. Page 269.
19. ^ Visual Informatics. Edited by Halimah Badioze Zaman, Peter Robinson, Maria Petrou, Patrick Olivier, Heiko Schröder. Page 689.
20. ^ Numerical Methods for Nonlinear Engineering Models. By John R. Hauser. Page 227.
21. ^ Methods of Experimental Physics: Spectroscopy, Volume 13, Part 1. By Claire Marton. Page 150.
22. ^ Encyclopedia of Research Design, Volume 1. Edited by Neil J. Salkind. Page 266.
23. ^ Community Analysis and Planning Techniques. By Richard E. Klosterman. Page 1.
24. ^ An Introduction to Risk and Uncertainty in the Evaluation of Environmental Investments. DIANE Publishing. Pg 69
25. ^ Hamming, Richard. Numerical methods for scientists and engineers. Courier Corporation, 2012.
26. ^ Friedman, Milton. "The interpolation of time series by related series." Journal of the American Statistical Association 57.300 (1962): 729–757.
27. ^ Gandhi, Sorabh, Luca Foschini, and Subhash Suri. "Space-efficient online approximation of time series data: Streams, amnesia, and out-of-order." Data Engineering (ICDE),
2010 IEEE 26th International Conference on. IEEE, 2010.
28. ^ Sandy Ryza (2020-03-18). "Time Series Analysis with Spark" (slides of a talk at Spark Summit East 2016). Databricks. Retrieved 2021-01-12.
29. ^ Gershenfeld, N. (1999). The Nature of Mathematical Modeling. New York: Cambridge University Press. pp. 205–208. ISBN 978-0521570954.
30. ^ Kantz, Holger; Thomas, Schreiber (2004). Nonlinear Time Series Analysis. London: Cambridge University Press. ISBN 978-0521529020.
31. ^ Abarbanel, Henry (Nov 25, 1997). Analysis of Observed Chaotic Data. New York: Springer. ISBN 978-0387983721.
32. ^ Boashash, B. (ed.), (2003) Time-Frequency Signal Analysis and Processing: A Comprehensive Reference, Elsevier Science, Oxford, 2003 ISBN 0-08-044335-4
33. ^ Nikolić, D.; Muresan, R. C.; Feng, W.; Singer, W. (2012). "Scaled correlation analysis: a better way to compute a cross-correlogram". European Journal of
Neuroscience. 35 (5): 742–762. doi:10.1111/j.1460-9568.2011.07987.x. PMID 22324876. S2CID 4694570.
34. ^ Sakoe, Hiroaki; Chiba, Seibi (1978). "Dynamic programming algorithm optimization for spoken word recognition". IEEE Transactions on Acoustics, Speech, and Signal Processing. 26. pp. 43–49. doi:10.1109/TASSP.1978.1163055. S2CID 17900407.
35. ^ Goutte, Cyril; Toft, Peter; Rostrup, Egill; Nielsen, Finn Å.; Hansen, Lars Kai (1999). "On Clustering fMRI Time Series". NeuroImage. 9. pp. 298–310. doi:10.1006/nimg.1998.0391. PMID 10075900. S2CID 14147564.
36. ^ Mormann, Florian; Andrzejak, Ralph G.; Elger, Christian E.; Lehnertz, Klaus (2007). "Seizure prediction: the long and winding road". Brain. 130 (2): 314–
333. doi:10.1093/brain/awl241. PMID 17008335.
37. ^ Land, Bruce; Elias, Damian. "Measuring the 'Complexity' of a time series".
38. ^ [1] Chevyrev, I., Kormilitzin, A. (2016) "A Primer on the Signature Method in Machine Learning, arXiv:1603.03788v1"
39. ^ Ropella, G. E. P.; Nag, D. A.; Hunt, C. A. (2003). "Similarity measures for automated comparison of in silico and in vitro experimental results". Engineering in Medicine and
Biology Society. 3: 2933–2936. doi:10.1109/IEMBS.2003.1280532. ISBN 978-0-7803-7789-9. S2CID 17798157.
40. ^ Tominski, Christian; Aigner, Wolfgang. "The TimeViz Browser:A Visual Survey of Visualization Techniques for Time-Oriented Data". Retrieved 1 June 2014.

Further reading[edit]
 Box, George; Jenkins, Gwilym (1976), Time Series Analysis: forecasting and control, rev. ed., Oakland, California: Holden-Day
 Durbin J., Koopman S.J. (2001), Time Series Analysis by State Space Methods, Oxford University Press.
 Gershenfeld, Neil (2000), The Nature of Mathematical Modeling, Cambridge University Press, ISBN 978-0-521-57095-4, OCLC 174825352
 Hamilton, James (1994), Time Series Analysis, Princeton University Press, ISBN 978-0-691-04289-3
 Priestley, M. B. (1981), Spectral Analysis and Time Series, Academic Press. ISBN 978-0-12-564901-8
 Shasha, D. (2004), High Performance Discovery in Time Series, Springer, ISBN 978-0-387-00857-8
 Shumway R. H., Stoffer D. S. (2017), Time Series Analysis and its Applications: With R Examples (ed. 4), Springer, ISBN 978-3-319-52451-1
 Weigend A. S., Gershenfeld N. A. (Eds.) (1994), Time Series Prediction: Forecasting the Future and Understanding the Past. Proceedings of
the NATO Advanced Research Workshop on Comparative Time Series Analysis (Santa Fe, May 1992), Addison-Wesley.
 Wiener, N. (1949), Extrapolation, Interpolation, and Smoothing of Stationary Time Series, MIT Press.
 Woodward, W. A., Gray, H. L. & Elliott, A. C. (2012), Applied Time Series Analysis, CRC Press.

External links[edit]
Wikimedia Commons has media related to Time series.

 Introduction to Time series Analysis (Engineering Statistics Handbook) — A practical guide to Time series analysis.
Survival analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as death in biological organisms and
failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration
modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer certain questions, such as what is the proportion
of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be
taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?
To answer such questions, it is necessary to define "lifetime". In the case of biological survival, death is unambiguous, but for mechanical
reliability, failure may not be well-defined, for there may well be mechanical systems in which failure is partial, a matter of degree, or not otherwise
localized in time. Even in biological problems, some events (for example, heart attack or other organ failure) may have the same ambiguity.
The theory outlined below assumes well-defined events at specific times; other cases may be better treated by models which explicitly account for
ambiguous events.
More generally, survival analysis involves the modelling of time to event data; in this context, death or failure is considered an "event" in the survival
analysis literature – traditionally only a single event occurs for each subject, after which the organism or mechanism is dead or broken. Recurring
event or repeated event models relax that assumption. The study of recurring events is relevant in systems reliability, and in many areas of social
sciences and medical research.

Contents

 1 Introduction to survival analysis
o 1.1 Definitions of common terms in survival analysis
o 1.2 Example: Acute myelogenous leukemia survival data
 1.2.1 Kaplan–Meier plot for the aml data
 1.2.2 Life table for the aml data
 1.2.3 Log-rank test: Testing for differences in survival in the aml data
o 1.3 Cox proportional hazards (PH) regression analysis
 1.3.1 Example: Cox proportional hazards regression analysis for melanoma
 1.3.2 Cox model using a covariate in the melanoma data
 1.3.3 Extensions to Cox models
o 1.4 Tree-structured survival models
 1.4.1 Example survival tree analysis
 1.4.2 Survival random forests
 2 General formulation
o 2.1 Survival function
o 2.2 Lifetime distribution function and event density
o 2.3 Hazard function and cumulative hazard function
o 2.4 Quantities derived from the survival distribution
 3 Censoring
 4 Fitting parameters to data
 5 Non-parametric estimation
 6 Computer software for survival analysis
 7 Distributions used in survival analysis
 8 Applications
 9 See also
 10 References
 11 Further reading
 12 External links

Introduction to survival analysis


Survival analysis is used in several ways:

 To describe the survival times of members of a group


o Life tables
o Kaplan–Meier curves
o Survival function
o Hazard function
 To compare the survival times of two or more groups
o Log-rank test
 To describe the effect of categorical or quantitative variables on survival
o Cox proportional hazards regression
o Parametric survival models
o Survival trees
o Survival random forests
Definitions of common terms in survival analysis
The following terms are commonly used in survival analyses:

 Event: Death, disease occurrence, disease recurrence, recovery, or other experience of interest
 Time: The time from the beginning of an observation period (such as surgery or beginning treatment) to (i) an event, or (ii) end of the study, or (iii)
loss of contact or withdrawal from the study.
 Censoring / Censored observation: Censoring occurs when we have some information about individual survival time, but we do not know the survival
time exactly. The subject is censored in the sense that nothing is observed or known about that subject after the time of censoring. A censored
subject may or may not have an event after the end of observation time.
 Survival function S(t): The probability that a subject survives longer than time t.
Example: Acute myelogenous leukemia survival data
This example uses the Acute Myelogenous Leukemia survival data set "aml" from the "survival" package in R. The data set is from Miller (1997)[1] and the
question is whether the standard course of chemotherapy should be extended ('maintained') for additional cycles.
The aml data set sorted by survival time is shown in the box.
aml data set sorted by survival time

 Time is indicated by the variable "time", which is the survival or censoring time
 Event (recurrence of aml cancer) is indicated by the variable "status". 0 = no event (censored), 1 = event (recurrence)
 Treatment group: the variable "x" indicates if maintenance chemotherapy was given
The last observation (11), at 161 weeks, is censored. Censoring indicates that the patient did not have an event (no recurrence of aml cancer). Another
subject, observation 3, was censored at 13 weeks (indicated by status=0). This subject was in the study for only 13 weeks, and the aml cancer did not
recur during those 13 weeks. It is possible that this patient was enrolled near the end of the study, so that they could be observed for only 13 weeks. It is
also possible that the patient was enrolled early in the study, but was lost to follow up or withdrew from the study. The table shows that other subjects
were censored at 16, 28, and 45 weeks (observations 17, 6, and 9 with status=0). The remaining subjects all experienced events (recurrence of aml
cancer) while in the study. The question of interest is whether recurrence occurs later in maintained patients than in non-maintained patients.
Kaplan–Meier plot for the aml data
The survival function S(t), is the probability that a subject survives longer than time t. S(t) is theoretically a smooth curve, but it is usually estimated using
the Kaplan–Meier (KM) curve. The graph shows the KM plot for the aml data and can be interpreted as follows:

 The x axis is time, from zero (when observation began) to the last observed time point.
 The y axis is the proportion of subjects surviving. At time zero, 100% of the subjects are alive without an event.
 The solid line (similar to a staircase) shows the progression of event occurrences.
 A vertical drop indicates an event. In the aml table shown above, two subjects had events at five weeks, two had events at eight weeks, one had an
event at nine weeks, and so on. These events at five weeks, eight weeks and so on are indicated by the vertical drops in the KM plot at those time
points.
 At the far right end of the KM plot there is a tick mark at 161 weeks. The vertical tick mark indicates that a patient was censored at this time. In the
aml data table five subjects were censored, at 13, 16, 28, 45 and 161 weeks. There are five tick marks in the KM plot, corresponding to these
censored observations.
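The Kaplan–Meier estimate described above can be reproduced with the R survival package; a minimal sketch, using the aml data shipped with that package (variables time, status and x as in the table):

library(survival)

# Kaplan–Meier estimate of the survival function for the aml data
km_fit <- survfit(Surv(time, status) ~ 1, data = aml)

# Staircase plot; mark.time = TRUE adds tick marks at the censoring times
plot(km_fit, xlab = "Weeks", ylab = "Proportion surviving", mark.time = TRUE)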
Life table for the aml data
A life table summarizes survival data in terms of the number of events and the proportion surviving at each event time point. The life table for the aml
data, created using the R software, is shown.

Life table for the aml data

The life table summarizes the events and the proportion surviving at each event time point. The columns in the life table have the following interpretation:
 time gives the time points at which events occur.
 n.risk is the number of subjects at risk immediately before the time point, t. Being "at risk" means that the subject has not had an event before time t,
and is not censored before or at time t.
 n.event is the number of subjects who have events at time t.
 survival is the proportion surviving, as determined using the Kaplan–Meier product-limit estimate.
 std.err is the standard error of the estimated survival. The standard error of the Kaplan–Meier product-limit estimate is calculated using
Greenwood's formula, and depends on the number at risk (n.risk in the table), the number of deaths (n.event in the table) and the proportion surviving
(survival in the table).
 lower 95% CI and upper 95% CI are the lower and upper 95% confidence bounds for the proportion surviving.
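The life table described above can be printed from the fitted Kaplan–Meier object; a brief sketch, again using the aml data from the R survival package:

library(survival)
km_fit <- survfit(Surv(time, status) ~ 1, data = aml)

# summary() prints the life table: time, n.risk, n.event, survival,
# std.err (Greenwood) and the lower/upper 95% confidence bounds
summary(km_fit)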
Log-rank test: Testing for differences in survival in the aml data
The log-rank test compares the survival times of two or more groups. This example uses a log-rank test for a difference in survival in the maintained
versus non-maintained treatment groups in the aml data. The graph shows KM plots for the aml data broken out by treatment group, which is indicated
by the variable "x" in the data.

Kaplan–Meier graph by treatment group in aml

The null hypothesis for a log-rank test is that the groups have the same survival. The expected number of subjects surviving in each group at each time point is
adjusted for the number of subjects at risk in the groups at each event time. The log-rank test determines if the observed number of events in each group
is significantly different from the expected number. The formal test is based on a chi-squared statistic. When the log-rank statistic is large, it is evidence
for a difference in the survival times between the groups. The log-rank statistic approximately has a chi-squared distribution with one degree of freedom,
and the p-value is calculated using the chi-squared distribution.
For the example data, the log-rank test for difference in survival gives a p-value of p=0.0653, indicating that the treatment groups do not differ
significantly in survival, assuming an alpha level of 0.05. The sample size of 23 subjects is modest, so there is little power to detect differences between
the treatment groups. The chi-squared test is based on asymptotic approximation, so the p-value should be regarded with caution for small sample sizes.
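The log-rank comparison of the maintained and non-maintained groups can be carried out with survdiff() in the R survival package; a minimal sketch whose printed chi-squared statistic and p-value correspond to the test described above:

library(survival)

# Log-rank test for a difference in survival between the treatment groups (variable x)
survdiff(Surv(time, status) ~ x, data = aml)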
Cox proportional hazards (PH) regression analysis
Kaplan–Meier curves and log-rank tests are most useful when the predictor variable is categorical (e.g., drug vs. placebo), or takes a small number of
values (e.g., drug doses 0, 20, 50, and 100 mg/day) that can be treated as categorical. The log-rank test and KM curves don't work easily with
quantitative predictors such as gene expression, white blood count, or age. For quantitative predictor variables, an alternative method is Cox proportional
hazards regression analysis. Cox PH models work also with categorical predictor variables, which are encoded as {0,1} indicator or dummy variables.
The log-rank test is a special case of a Cox PH analysis, and can be performed using Cox PH software.
Example: Cox proportional hazards regression analysis for melanoma
This example uses the melanoma data set from Dalgaard Chapter 14. [2]
Data are in the R package ISwR. The Cox proportional hazards regression using R gives the results shown in the box.

Cox proportional hazards regression output for melanoma data. Predictor variable is sex 1: female, 2: male.

The Cox regression results are interpreted as follows.

 Sex is encoded as a numeric vector (1: female, 2: male). The R summary for the Cox model gives the hazard ratio (HR) for the second group relative
to the first group, that is, male versus female.
 coef = 0.662 is the estimated logarithm of the hazard ratio for males versus females.
 exp(coef) = 1.94 = exp(0.662) - The log of the hazard ratio (coef= 0.662) is transformed to the hazard ratio using exp(coef). The summary for the Cox
model gives the hazard ratio for the second group relative to the first group, that is, male versus female. The estimated hazard ratio of 1.94 indicates
that males have higher risk of death (lower survival rates) than females, in these data.
 se(coef) = 0.265 is the standard error of the log hazard ratio.
 z = 2.5 = coef/se(coef) = 0.662/0.265. Dividing the coef by its standard error gives the z score.
 p=0.013. The p-value corresponding to z=2.5 for sex is p=0.013, indicating that there is a significant difference in survival as a function of sex.
The summary output also gives upper and lower 95% confidence intervals for the hazard ratio: lower 95% bound = 1.15; upper 95% bound = 3.26.
Finally, the output gives p-values for three alternative tests for overall significance of the model:

 Likelihood ratio test = 6.15 on 1 df, p=0.0131


 Wald test = 6.24 on 1 df, p=0.0125
 Score (log-rank) test = 6.47 on 1 df, p=0.0110
These three tests are asymptotically equivalent. For large enough N, they will give similar results; for small N, they may differ somewhat. The last row, "Score (logrank) test", with p=0.011, is what a log-rank test of sex would give, because the log-rank test is a special case of a Cox PH regression. The likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.
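A sketch of this analysis in R follows, assuming the melanom data from the ISwR package with variables days, status (1 = dead from melanoma) and sex, as used by Dalgaard; the exact variable coding is an assumption here, not shown in the text above.

library(survival)
library(ISwR)   # assumed to provide the melanom data set used by Dalgaard

# Cox proportional hazards model with sex as the only predictor;
# status == 1 marks death from melanoma as the event of interest
fit_sex <- coxph(Surv(days, status == 1) ~ sex, data = melanom)
summary(fit_sex)   # coef, exp(coef) (hazard ratio), se(coef), z, p, and the three overall tests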
Cox model using a covariate in the melanoma data
The Cox model extends the log-rank test by allowing the inclusion of additional covariates. This example uses the melanoma data set, where the predictor variables include a continuous covariate, the thickness of the tumor (variable name = "thick").

Histograms of melanoma tumor thickness

In the histograms, the thickness values do not look normally distributed. Regression models, including the Cox model, generally give more reliable results with normally distributed variables, so this example uses a log transform. The log of the tumor thickness appears more normally distributed, so the Cox models will use log thickness. The Cox PH analysis gives the results in the box.
Cox PH output for melanoma data set with covariate log tumor thickness

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. The p-value for log(thick) is
6.9e-07, with a hazard ratio HR = exp(coef) = 2.18, indicating a strong relationship between the thickness of the tumor and increased risk of death.
By contrast, the p-value for sex is now p=0.088. The hazard ratio HR = exp(coef) = 1.58, with a 95% confidence interval of 0.934 to 2.68. Because the
confidence interval for HR includes 1, these results indicate that sex makes a smaller contribution to the difference in the HR after controlling for the
thickness of the tumor, and only trends toward significance. Examination of graphs of log(thickness) by sex and a t-test of log(thickness) by sex both
indicate that there is a significant difference between men and women in the thickness of the tumor when they first see the clinician.
The Cox model assumes that the hazards are proportional. The proportional hazard assumption may be tested using the R function cox.zph(). A p-value less than 0.05 indicates that the hazards are not proportional. For the melanoma data, p=0.222, indicating that the hazards are, at least approximately,
proportional. Additional tests and graphs for examining a Cox model are described in the textbooks cited.
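A sketch of the extended model and of the proportional-hazards check with cox.zph(), under the same assumed ISwR variable names as in the previous sketch:

library(survival)
library(ISwR)

# Cox model with sex and log tumor thickness as predictors
fit_thick <- coxph(Surv(days, status == 1) ~ sex + log(thick), data = melanom)
summary(fit_thick)

# Test of the proportional hazards assumption; p-values below 0.05
# would suggest that the hazards are not proportional
cox.zph(fit_thick)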
Extensions to Cox models
Cox models can be extended to deal with variations on the simple analysis.

 Stratification. The subjects can be divided into strata, where subjects within a stratum are expected to be relatively more similar to each other than to
randomly chosen subjects from other strata. The regression parameters are assumed to be the same across the strata, but a different baseline
hazard may exist for each stratum. Stratification is useful for analyses using matched subjects, for dealing with patient subsets, such as different
clinics, and for dealing with violations of the proportional hazard assumption.
 Time-varying covariates. Some variables, such as gender and treatment group, generally stay the same in a clinical trial. Other clinical variables,
such as serum protein levels or dose of concomitant medications may change over the course of a study. Cox models may be extended for such
time-varying covariates.
Tree-structured survival models
The Cox PH regression model is a linear model. It is similar to linear regression and logistic regression. Specifically, these methods assume that a single
line, curve, plane, or surface is sufficient to separate groups (alive, dead) or to estimate a quantitative response (survival time).
In some cases alternative partitions give more accurate classification or quantitative estimates. One set of alternative methods are tree-structured
survival models, including survival random forests. Tree-structured survival models may give more accurate predictions than Cox models. Examining
both types of models for a given data set is a reasonable strategy.
Example survival tree analysis
This example of a survival tree analysis uses the R package "rpart". The example is based on 146 stage C prostate cancer patients in the data set stagec
in rpart. Rpart and the stagec example are described in the PDF document "An Introduction to Recursive Partitioning Using the RPART Routines". Terry
M. Therneau, Elizabeth J. Atkinson, Mayo Foundation. September 3, 1997.
The variables in stagec are:

 pgtime: time to progression, or last follow-up free of progression


 pgstat: status at last follow-up (1=progressed, 0=censored)
 age: age at diagnosis
 eet: early endocrine therapy (1=no, 0=yes)
 ploidy: diploid/tetraploid/aneuploid DNA pattern
 g2: % of cells in G2 phase
 grade: tumor grade (1-4)
 gleason: Gleason grade (3-10)
The survival tree produced by the analysis is shown in the figure.
Survival tree for prostate cancer data set

Each branch in the tree indicates a split on the value of a variable. For example, the root of the tree splits subjects with grade < 2.5 versus subjects with
grade 2.5 or greater. The terminal nodes indicate the number of subjects in the node, the number of subjects who have events, and the relative event
rate compared to the root. In the node on the far left, the values 1/33 indicate that one of the 33 subjects in the node had an event, and that the relative
event rate is 0.122. In the node on the far right bottom, the values 11/15 indicate that 11 of 15 subjects in the node had an event, and the relative event
rate is 2.7.
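A sketch of how such a tree can be fitted with the R package rpart, using the stagec data it ships with (the formula below is illustrative and may not reproduce the exact figure):

library(rpart)
library(survival)   # for Surv()

# Survival tree for time to progression in the stage C prostate cancer data
tree_fit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade + gleason + ploidy,
                  data = stagec)

print(tree_fit)               # splits, node sizes, event counts and relative rates
plot(tree_fit, uniform = TRUE)
text(tree_fit, use.n = TRUE)  # label nodes with events/number of subjects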
Survival random forests
An alternative to building a single survival tree is to build many survival trees, where each tree is constructed using a sample of the data, and average the
trees to predict survival. This is the method underlying the survival random forest models. Survival random forest analysis is available in the R package
"randomForestSRC".
The randomForestSRC package includes an example survival random forest analysis using the data set pbc. This data is from the Mayo Clinic Primary
Biliary Cirrhosis (PBC) trial of the liver conducted between 1974 and 1984. In the example, the random forest survival model gives more accurate
predictions of survival than the Cox PH model. The prediction errors are estimated by bootstrap re-sampling.
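A sketch of a survival random forest fit with randomForestSRC on its pbc example data (the column names days and status are assumptions based on that package's example data set):

library(randomForestSRC)
library(survival)   # for Surv()

data(pbc, package = "randomForestSRC")

# Random survival forest; all remaining columns are used as predictors
rf_fit <- rfsrc(Surv(days, status) ~ ., data = pbc)
print(rf_fit)   # includes an out-of-bag estimate of prediction error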

General formulation

Survival function
Main article: Survival function
The object of primary interest is the survival function, conventionally denoted S, which is defined as

S(t) = Pr(T > t),
where t is some time, T is a random variable denoting the time of death, and "Pr" stands for probability. That is, the survival function is the probability
that the time of death is later than some specified time t. The survival function is also called the survivor function or survivorship function in problems
of biological survival, and the reliability function in mechanical survival problems. In the latter case, the reliability function is denoted R(t).
Usually one assumes S(0) = 1, although it could be less than 1 if there is the possibility of immediate death or failure.
The survival function must be non-increasing: S(u) ≤ S(t) if u ≥ t. This property follows directly because T>u implies T>t. This reflects the notion that
survival to a later age is possible only if all younger ages are attained. Given this property, the lifetime distribution function and event density
(F and f below) are well-defined.
The survival function is usually assumed to approach zero as age increases without bound (i.e., S(t) → 0 as t → ∞), although the limit could be
greater than zero if eternal life is possible. For instance, we could apply survival analysis to a mixture of stable and unstable carbon isotopes;
unstable isotopes would decay sooner or later, but the stable isotopes would last indefinitely.
Lifetime distribution function and event density
Related quantities are defined in terms of the survival function.
The lifetime distribution function, conventionally denoted F, is defined as the complement of the survival function,

F(t) = Pr(T ≤ t) = 1 − S(t).

If F is differentiable then the derivative, which is the density function of the lifetime distribution, is conventionally denoted f,

f(t) = F′(t) = d/dt F(t).

The function f is sometimes called the event density; it is the rate of death or failure events per unit time.
The survival function can be expressed in terms of probability distribution and probability density functions

S(t) = Pr(T > t) = ∫_t^∞ f(u) du = 1 − F(t).

Similarly, a survival event density function can be defined as

s(t) = S′(t) = d/dt ∫_t^∞ f(u) du = −f(t).
In other fields, such as statistical physics, the survival event density function is known as the first passage time density.
Hazard function and cumulative hazard function

The hazard function, conventionally denoted λ or h, is defined as the event rate at time t conditional on survival until time t or later (that is, T ≥ t). Suppose that an item has survived for a time t and we desire the probability that it will not survive for an additional time dt:

λ(t) = lim_{dt→0} Pr(t ≤ T < t + dt | T ≥ t) / dt = f(t) / S(t) = −S′(t) / S(t).
Force of mortality is a synonym of hazard function which is used particularly in demography and actuarial science, where it is denoted by μ. The term hazard rate is another synonym.

The force of mortality of the survival function is defined as

μ(t) = −S′(t) / S(t) = f(t) / S(t).
The force of mortality is also called the force of failure. It is the probability density function of the distribution of mortality.
In actuarial science, the hazard rate is the rate of death for lives aged x. For a life aged x, the force of mortality t years later is the
force of mortality for a (x + t)–year old. The hazard rate is also called the failure rate. Hazard rate and failure rate are names used in
reliability theory.
Any function h is a hazard function if and only if it satisfies the following properties:

1. ∀ x ≥ 0, h(x) ≥ 0,

2. ∫_0^∞ h(x) dx = ∞.
In fact, the hazard rate is usually more informative about the underlying mechanism of failure than the other representatives of a
lifetime distribution.

The hazard function must be non-negative, λ(t) ≥ 0, and its integral over [0, ∞) must be infinite, but is not otherwise constrained; it
may be increasing or decreasing, non-monotonic, or discontinuous. An example is the bathtub curve hazard function, which is large
for small values of t, decreasing to some minimum, and thereafter increasing again; this can model the property of some
mechanical systems to either fail soon after operation, or much later, as the system ages.
The hazard function can alternatively be represented in terms of the cumulative hazard function, conventionally denoted Λ or H:

Λ(t) = −log S(t),

so transposing signs and exponentiating

S(t) = exp(−Λ(t)),

or differentiating (with the chain rule)

d/dt Λ(t) = −S′(t) / S(t) = λ(t).

The name "cumulative hazard function" is derived from the fact that

Λ(t) = ∫_0^t λ(u) du,

which is the "accumulation" of the hazard over time.

From the definition of Λ(t), we see that it increases without bound as t tends to infinity (assuming that S(t) tends to zero). This implies that λ(t) must not decrease too quickly, since, by definition, the cumulative hazard has to diverge. For example, exp(−t) is not the hazard function of any survival distribution, because its integral converges to 1.
The survival function S(t), the cumulative hazard function Λ(t), the density f(t), the hazard function λ(t), and the lifetime distribution function F(t) are related through

S(t) = exp(−Λ(t)),   F(t) = 1 − exp(−Λ(t)),   f(t) = λ(t) exp(−Λ(t)),   λ(t) = f(t) / S(t).
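As a quick numerical check of these relations (an illustration added here, not part of the original text), consider a constant hazard λ, i.e., an exponentially distributed lifetime; the identities can be verified in a few lines of R:

lambda <- 0.5                       # assumed constant hazard rate
t0     <- 2                         # an arbitrary time point
S      <- exp(-lambda * t0)         # survival function S(t) = exp(-Lambda(t)), Lambda(t) = lambda * t
Fdist  <- 1 - S                     # lifetime distribution function F(t) = 1 - exp(-Lambda(t))
f      <- lambda * exp(-lambda * t0)  # event density f(t) = lambda(t) * exp(-Lambda(t))
all.equal(f / S, lambda)            # lambda(t) = f(t)/S(t) recovers the constant hazard
all.equal(-log(S), lambda * t0)     # Lambda(t) = -log S(t) recovers the cumulative hazard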

Quantities derived from the survival distribution

Future lifetime at a given time t_0 is the time remaining until death, given survival to age t_0. Thus, it is T − t_0 in the present notation. The expected future lifetime is the expected value of future lifetime. The probability of death at or before age t_0 + t, given survival until age t_0, is just

P(T ≤ t_0 + t | T > t_0) = [F(t_0 + t) − F(t_0)] / S(t_0).

Therefore, the probability density of future lifetime is

f(t_0 + t) / S(t_0),

and the expected future lifetime is

(1 / S(t_0)) ∫_0^∞ t f(t_0 + t) dt = (1 / S(t_0)) ∫_{t_0}^∞ S(t) dt,

where the second expression is obtained using integration by parts.

For t_0 = 0, that is, at birth, this reduces to the expected lifetime.
In reliability problems, the expected lifetime is called the mean time to failure, and the expected future
lifetime is called the mean residual lifetime.
As the probability of an individual surviving until age t or later is S(t), by definition, the expected
number of survivors at age t out of an initial population of n newborns is n × S(t), assuming the same
survival function for all individuals. Thus the expected proportion of survivors is S(t). If the survival of
different individuals is independent, the number of survivors at age t has a binomial distribution with
parameters n and S(t), and the variance of the proportion of survivors is S(t) × (1-S(t))/n.
The age at which a specified proportion of survivors remain can be found by solving the equation S(t)
= q for t, where q is the quantile in question. Typically one is interested in the median lifetime, for
which q = 1/2, or other quantiles such as q = 0.90 or q = 0.99.
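In practice, the median and other quantiles are usually read off a fitted Kaplan–Meier curve; a brief sketch with the aml data from the R survival package:

library(survival)
km_fit <- survfit(Surv(time, status) ~ 1, data = aml)

print(km_fit)                   # the printout includes the estimated median survival time
quantile(km_fit, probs = 0.5)   # time at which the estimated S(t) crosses 0.5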

Censoring
Censoring is a form of missing data problem in which time to event is not observed for reasons such
as termination of study before all recruited subjects have shown the event of interest or the subject
has left the study prior to experiencing an event. Censoring is common in survival analysis.
If only the lower limit l for the true event time T is known such that T > l, this is called right censoring.
Right censoring will occur, for example, for those subjects whose birth date is known but who are still
alive when they are lost to follow-up or when the study ends. We generally encounter right-censored
data.
If the event of interest has already happened before the subject is included in the study but it is not
known when it occurred, the data is said to be left-censored.[3] When it can only be said that the event
happened between two observations or examinations, this is interval censoring.
Left censoring occurs for example when a permanent tooth has already emerged prior to the start of a
dental study that aims to estimate its emergence distribution. In the same study, an emergence time is
interval-censored when the permanent tooth is present in the mouth at the current examination but not
yet at the previous examination. Interval censoring often occurs in HIV/AIDS studies. Indeed, time to
HIV seroconversion can be determined only by a laboratory assessment which is usually initiated after
a visit to the physician. Then one can only conclude that HIV seroconversion has happened between
two examinations. The same is true for the diagnosis of AIDS, which is based on clinical symptoms
and needs to be confirmed by a medical examination.
It may also happen that subjects with a lifetime less than some threshold may not be observed at all:
this is called truncation. Note that truncation is different from left censoring, since for a left censored
datum, we know the subject exists, but for a truncated datum, we may be completely unaware of the
subject. Truncation is also common. In a so-called delayed entry study, subjects are not observed at
all until they have reached a certain age. For example, people may not be observed until they have
reached the age to enter school. Any deceased subjects in the pre-school age group would be
unknown. Left-truncated data are common in actuarial work for life insurance and pensions.[4]
Left-censored data can occur when a person's survival time becomes incomplete on the left side of the
follow-up period for the person. For example, in an epidemiological example, we may monitor a patient
for an infectious disorder starting from the time when he or she is tested positive for the infection.
Although we may know the right-hand side of the duration of interest, we may never know the exact
time of exposure to the infectious agent.[5]
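In the R survival package, these censoring patterns are encoded through the Surv() object; a minimal sketch (the numeric values below are made up purely for illustration):

library(survival)

# Right censoring: event = 1 means the event was observed, event = 0 means censored
Surv(time = c(9, 13, 13), event = c(1, 1, 0))

# Interval censoring: the event is only known to lie between time and time2;
# an NA lower or upper bound yields a left- or right-censored observation
Surv(time = c(2, NA, 5), time2 = c(4, 3, NA), type = "interval2")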

Fitting parameters to data


Survival models can be usefully viewed as ordinary regression models in which the response variable
is time. However, computing the likelihood function (needed for fitting parameters or making other
kinds of inferences) is complicated by the censoring. The likelihood function for a survival model, in the
presence of censored data, is formulated as follows. By definition the likelihood function is
the conditional probability of the data given the parameters of the model. It is customary to assume
that the data are independent given the parameters. Then the likelihood function is the product of the
likelihood of each datum. It is convenient to partition the data into four categories: uncensored, left
censored, right censored, and interval censored. These are denoted "unc.", "l.c.", "r.c.", and "i.c." in the
equation below.

L(θ) = ∏_{T_i ∈ unc.} Pr(T = T_i | θ) · ∏_{i ∈ l.c.} Pr(T < T_i | θ) · ∏_{i ∈ r.c.} Pr(T > T_i | θ) · ∏_{i ∈ i.c.} Pr(T_{i,l} < T < T_{i,r} | θ).

For uncensored data, with T_i equal to the age at death, we have

Pr(T = T_i | θ) = f(T_i | θ).

For left-censored data, such that the age at death is known to be less than T_i, we have

Pr(T < T_i | θ) = F(T_i | θ) = 1 − S(T_i | θ).

For right-censored data, such that the age at death is known to be greater than T_i, we have

Pr(T > T_i | θ) = 1 − F(T_i | θ) = S(T_i | θ).

For an interval censored datum, such that the age at death is known to be less than T_{i,r} and greater than T_{i,l}, we have

Pr(T_{i,l} < T < T_{i,r} | θ) = S(T_{i,l} | θ) − S(T_{i,r} | θ).

An important application where interval-censored data arises is current status data, where an event is known not to have occurred before an observation time and to have occurred before the next observation time.
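For a parametric model, this censored-data likelihood is maximized numerically; in R this is what survreg() in the survival package does. A brief sketch for a Weibull model on the aml data introduced earlier:

library(survival)

# Maximum likelihood fit of a Weibull survival model; right-censored
# observations contribute S(T_i) to the likelihood, events contribute f(T_i)
weibull_fit <- survreg(Surv(time, status) ~ x, data = aml, dist = "weibull")
summary(weibull_fit)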

Non-parametric estimation
The Kaplan–Meier estimator can be used to estimate the survival function.
The Nelson–Aalen estimator can be used to provide a non-parametric estimate of
the cumulative hazard rate function.
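Both estimators are available in the R survival package; a brief sketch, with the cumulative hazard of the fitted object used as the Nelson–Aalen-type estimate (this relies on the package's default hazard computation):

library(survival)
fit <- survfit(Surv(time, status) ~ 1, data = aml)

plot(fit)                   # Kaplan–Meier estimate of the survival function S(t)
plot(fit, fun = "cumhaz")   # estimated cumulative hazard over time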

Computer software for survival analysis


The textbook by Kleinbaum has examples of survival analyses using SAS, R, and
other packages.[6] The textbooks by Brostrom,[7] Dalgaard[2] and Tableman and
Kim[8] give examples of survival analyses using R (or using S, and which run in R).

Distributions used in survival analysis


 Exponential distribution
 Weibull distribution
 Log-logistic distribution
 Gamma distribution
 Exponential-logarithmic distribution

Applications
 Credit risk[9][10]
 False conviction rate of inmates sentenced to death[11]
 Lead times for metallic components in the aerospace industry[12]
 Predictors of criminal recidivism[13]
 Survival distribution of radio-tagged animals[14]
 Time-to-violent death of Roman emperors[15]

See also
 Accelerated failure time model
 Bayesian survival analysis
 Cell survival curve
 Censoring (statistics)
 Failure rate
 Frequency of exceedance
 Kaplan–Meier estimator
 Logrank test
 Maximum likelihood
 Mortality rate
 MTBF
 Proportional hazards models
 Reliability theory
 Residence time (statistics)
 Survival function
 Survival rate

References
1. ^ Miller, Rupert G. (1997), Survival analysis, John Wiley & Sons, ISBN 0-471-25218-2.
2. ^ a b Dalgaard, Peter (2008), Introductory Statistics with R (Second ed.), Springer, ISBN 978-0387790534.
3. ^ Darity, William A. Jr., ed. (2008). "Censoring, Left and Right". International Encyclopedia of the Social Sciences. 1 (2nd ed.). Macmillan. pp. 473–474. Retrieved 6 November 2016.
4. ^ Richards, S. J. (2012). "A handbook of parametric survival models for actuarial use". Scandinavian Actuarial Journal. 2012 (4): 233–257. doi:10.1080/03461238.2010.506688. S2CID 119577304.
5. ^ Singh, R.; Mukhopadhyay, K. (2011). "Survival analysis in clinical trials: Basics and must know areas". Perspect Clin Res. 2 (4): 145–148. doi:10.4103/2229-3485.86872. PMC 3227332. PMID 22145125.
6. ^ Kleinbaum, David G.; Klein, Mitchel (2012), Survival analysis: A Self-learning text (Third ed.), Springer, ISBN 978-1441966452.
7. ^ Brostrom, Göran (2012), Event History Analysis with R (First ed.), Chapman & Hall/CRC, ISBN 978-1439831649.
8. ^ Tableman, Mara; Kim, Jong Sung (2003), Survival Analysis Using S (First ed.), Chapman and Hall/CRC, ISBN 978-1584884088.
9. ^ Stepanova, Maria; Thomas, Lyn (2002-04-01). "Survival Analysis Methods for Personal Loan Data". Operations Research. 50 (2): 277–289. doi:10.1287/opre.50.2.277.426. ISSN 0030-364X.
10. ^ Glennon, Dennis; Nigro, Peter (2005). "Measuring the Default Risk of Small Business Loans: A Survival Analysis Approach". Journal of Money, Credit and Banking. 37 (5): 923–947. doi:10.1353/mcb.2005.0051. ISSN 0022-2879. JSTOR 3839153. S2CID 154615623.
11. ^ Kennedy, Edward H.; Hu, Chen; O'Brien, Barbara; Gross, Samuel R. (2014-05-20). "Rate of false conviction of criminal defendants who are sentenced to death". Proceedings of the National Academy of Sciences. 111 (20): 7230–7235. Bibcode:2014PNAS..111.7230G. doi:10.1073/pnas.1306417111. ISSN 0027-8424. PMC 4034186. PMID 24778209.
12. ^ de Cos Juez, F. J.; García Nieto, P. J.; Martínez Torres, J.; Taboada Castro, J. (2010-10-01). "Analysis of lead times of metallic components in the aerospace industry through a supported vector machine model". Mathematical and Computer Modelling. Mathematical Models in Medicine, Business & Engineering 2009. 52 (7): 1177–1184. doi:10.1016/j.mcm.2010.03.017. ISSN 0895-7177.
13. ^ Spivak, Andrew L.; Damphousse, Kelly R. (2006). "Who Returns to Prison? A Survival Analysis of Recidivism among Adult Offenders Released in Oklahoma, 1985–2004". Justice Research and Policy. 8 (2): 57–88. doi:10.3818/jrp.8.2.2006.57. ISSN 1525-1071. S2CID 144566819.
14. ^ Pollock, Kenneth H.; Winterstein, Scott R.; Bunck, Christine M.; Curtis, Paul D. (1989). "Survival Analysis in Telemetry Studies: The Staggered Entry Design". The Journal of Wildlife Management. 53 (1): 7–15. doi:10.2307/3801296. ISSN 0022-541X. JSTOR 3801296.
15. ^ Saleh, Joseph Homer (2019-12-23). "Statistical reliability analysis for a most dangerous occupation: Roman emperor". Palgrave Communications. 5 (1): 1–7. doi:10.1057/s41599-019-0366-y. ISSN 2055-1045.

Further reading
 Collett, David (2003). Modelling Survival Data in Medical
Research (Second ed.). Boca Raton: Chapman & Hall/CRC. ISBN 1584883251.
 Elandt-Johnson, Regina; Johnson, Norman (1999). Survival Models and Data
Analysis. New York: John Wiley & Sons. ISBN 0471349925.
 Kalbfleisch, J. D.; Prentice, Ross L. (2002). The statistical analysis of failure time
data. New York: John Wiley & Sons. ISBN 047136357X.
 Lawless, Jerald F. (2003). Statistical Models and Methods for Lifetime
Data (2nd ed.). Hoboken: John Wiley and Sons. ISBN 0471372153.
 Rausand, M.; Hoyland, A. (2004). System Reliability Theory: Models, Statistical
Methods, and Applications. Hoboken: John Wiley & Sons. ISBN 047147133X.

External links
 Therneau, Terry. "A Package for Survival Analysis in S". Archived from the
original on 2006-09-07. via Dr. Therneau's page on the Mayo Clinic website
 "Engineering Statistics Handbook". NIST/SEMATEK.
 SOCR, Survival analysis applet and interactive learning activity.
 Survival/Failure Time Analysis @ Statistics' Textbook Page
 Survival Analysis in R
 Lifelines, a Python package for survival analysis
 Survival Analysis in NAG Fortran Library
Actuarial science

2003 US mortality (life) table, Table 1, Page 1

Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in insurance, finance, and other industries and
professions. More generally, actuaries apply rigorous mathematics to model matters of uncertainty.
Actuaries are professionals trained in this discipline. In many countries, actuaries must demonstrate their competence by passing a series of rigorous
professional examinations.
Actuarial science includes a number of interrelated subjects, including mathematics, probability theory, statistics, finance, economics, and computer
science. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary
changes since the 1980s due to the proliferation of high speed computers and the union of stochastic actuarial models with modern financial theory.[1]
Many universities have undergraduate and graduate degree programs in actuarial science. In 2010, a study published by job search website CareerCast
ranked actuary as the #1 job in the United States.[2] The study used five key criteria to rank jobs: environment, income, employment outlook, physical
demands, and stress. A similar study by U.S. News & World Report in 2006 included actuaries among the 25 Best Professions that it expects will be in
great demand in the future.[3]

Contents

 1 Life insurance, pensions and healthcare
 2 Applied to other forms of insurance
 3 Development
o 3.1 Pre-formalization
o 3.2 Initial development
o 3.3 Early actuaries
o 3.4 Technological advances
o 3.5 Actuarial science related to modern financial economics
o 3.6 History
o 3.7 Actuaries in criminal justice
 4 See also
 5 References
o 5.1 Works cited
o 5.2 Bibliography
 6 External links

Life insurance, pensions and healthcare


Actuarial science became a formal mathematical discipline in the late 17th century with the increased demand for long-term insurance coverage such as
burial, life insurance, and annuities. These long term coverages required that money be set aside to pay future benefits, such as annuity and death
benefits many years into the future. This requires estimating future contingent events, such as the rates of mortality by age, as well as the development
of mathematical techniques for discounting the value of funds set aside and invested. This led to the development of an important actuarial concept,
referred to as the present value of a future sum. Certain aspects of the actuarial methods for discounting pension funds have come under criticism from
modern financial economics.[citation needed]
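As a small numerical illustration of the present-value idea (a hypothetical sketch, not taken from the text above), the expected present value of a life annuity of 1 per year can be computed in R from assumed mortality rates and an assumed interest rate:

# Hypothetical mortality and interest assumptions, for illustration only
qx <- c(0.010, 0.012, 0.015, 0.020, 0.030)   # one-year death probabilities at ages x, x+1, ...
i  <- 0.04                                   # assumed annual effective interest rate
v  <- 1 / (1 + i)                            # one-year discount factor

px_t <- cumprod(1 - qx)                            # probability of surviving t years, t = 1, 2, ...
annuity_due <- 1 + sum(v^seq_along(px_t) * px_t)   # present value of 1 per year paid in advance, truncated at the last assumed age
annuity_due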

 In traditional life insurance, actuarial science focuses on the analysis of mortality, the production of life tables, and the application of compound
interest to produce life insurance, annuities and endowment policies. Contemporary life insurance programs have been extended to include credit and
mortgage insurance, key person insurance for small businesses, long term care insurance and health savings accounts.[4]
 In health insurance, including insurance provided directly by employers, and social insurance, actuarial science focuses on the analysis of rates of
disability, morbidity, mortality, fertility and other contingencies. The effects of consumer choice and the geographical distribution of the utilization of
medical services and procedures, and the utilization of drugs and therapies, are also of great importance. These factors underlay the development of the Resource-Based Relative Value Scale (RBRVS) at Harvard in a multi-disciplined study.[5] Actuarial science also aids in the design of benefit
structures, reimbursement standards, and the effects of proposed government standards on the cost of healthcare.[6]
 In the pension industry, actuarial methods are used to measure the costs of alternative strategies with regard to the design, funding, accounting,
administration, and maintenance or redesign of pension plans. The strategies are greatly influenced by short-term and long-term bond rates, the
funded status of the pension and benefit arrangements, collective bargaining; the employer's old, new and foreign competitors; the changing
demographics of the workforce; changes in the internal revenue code; changes in the attitude of the internal revenue service regarding the calculation
of surpluses; and equally importantly, both the short and long term financial and economic trends. It is common with mergers and acquisitions that
several pension plans have to be combined or at least administered on an equitable basis. When benefit changes occur, old and new benefit plans
have to be blended, satisfying new social demands and various government discrimination test calculations, and providing employees and retirees
with understandable choices and transition paths. Benefit plans liabilities have to be properly valued, reflecting both earned benefits for past service,
and the benefits for future service. Finally, funding schemes have to be developed that are manageable and satisfy the standards board or regulators
of the appropriate country, such as the Financial Accounting Standards Board in the United States.[citation needed]
 In social welfare programs, the Office of the Chief Actuary (OCACT), Social Security Administration plans and directs a program of actuarial
estimates and analyses relating to SSA-administered retirement, survivors and disability insurance programs and to proposed changes in those
programs. It evaluates operations of the Federal Old-Age and Survivors Insurance Trust Fund and the Federal Disability Insurance Trust Fund,
conducts studies of program financing, performs actuarial and demographic research on social insurance and related program issues involving
mortality, morbidity, utilization, retirement, disability, survivorship, marriage, unemployment, poverty, old age, families with children, etc., and projects
future workloads. In addition, the Office is charged with conducting cost analyses relating to the Supplemental Security Income (SSI) program, a
general-revenue financed, means-tested program for low-income aged, blind and disabled people. The Office provides technical and consultative
services to the Commissioner, to the Board of Trustees of the Social Security Trust Funds, and its staff appears before Congressional Committees to
provide expert testimony on the actuarial aspects of Social Security issues.[citation needed]

Applied to other forms of insurance


Actuarial science is also applied to property, casualty, liability, and general insurance. In these forms of insurance, coverage is generally provided on a
renewable period (such as one year). Coverage can be cancelled at the end of the period by either party.[citation needed]
Property and casualty insurance companies tend to specialize because of the complexity and diversity of risks.[citation needed] One division is to organize around
personal and commercial lines of insurance. Personal lines of insurance are for individuals and include fire, auto, homeowners, theft and umbrella
coverages. Commercial lines address the insurance needs of businesses and include property, business continuation, product liability, fleet/commercial
vehicle, workers compensation, fidelity & surety, and D&O insurance. The insurance industry also provides coverage for exposures such as catastrophe,
weather-related risks, earthquakes, patent infringement and other forms of corporate espionage, terrorism, and "one-of-a-kind" (e.g., satellite launch).
Actuarial science provides data collection, measurement, estimating, forecasting, and valuation tools to provide financial and underwriting data for
management to assess marketing opportunities and the nature of the risks. Actuarial science often helps to assess the overall risk from catastrophic
events in relation to its underwriting capacity or surplus.[citation needed]
In the reinsurance fields, actuarial science can be used to design and price reinsurance and retrocession arrangements, and to establish reserve funds
for known claims and future claims and catastrophes.[citation needed]

Development
Pre-formalization
Elementary mutual aid agreements and pensions arose in antiquity.[7] Early in the Roman empire, associations were formed to meet the expenses of
burial, cremation, and monuments—precursors to burial insurance and friendly societies. A small sum was paid into a communal fund on a weekly basis,
and upon the death of a member, the fund would cover the expenses of rites and burial. These societies sometimes sold shares in the building
of columbāria, or burial vaults, owned by the fund—the precursor to mutual insurance companies.[8] Other early examples of mutual surety and assurance
pacts can be traced back to various forms of fellowship within the Saxon clans of England and their Germanic forebears, and to Celtic society.[9] However,
many of these earlier forms of surety and aid would often fail due to lack of understanding and knowledge.[10]
Initial development
The 17th century was a period of advances in mathematics in Germany, France and England. At the same time there was a rapidly growing desire and
need to place the valuation of personal risk on a more scientific basis. Independently of each other, compound interest was studied and probability
theory emerged as a well-understood mathematical discipline. Another important advance came in 1662 from a London draper named John Graunt, who
showed that there were predictable patterns of longevity and death in a group, or cohort, of people of the same age, despite the uncertainty of the date of
death of any one individual. This study became the basis for the original life table. One could now set up an insurance scheme to provide life insurance or
pensions for a group of people, and to calculate with some degree of accuracy how much each person in the group should contribute to a common fund
assumed to earn a fixed rate of interest. The first person to demonstrate publicly how this could be done was Edmond Halley (of Halley's comet fame).
Halley constructed his own life table, and showed how it could be used to calculate the premium amount someone of a given age should pay to purchase
a life annuity.[11]
Early actuaries
James Dodson's pioneering work on the long term insurance contracts under which the same premium is charged each year led to the formation of the
Society for Equitable Assurances on Lives and Survivorship (now commonly known as Equitable Life) in London in 1762.[12] William Morgan is often
considered the father of modern actuarial science for his work in the field in the 1780s and 90s. Many other life insurance companies and pension funds
were created over the following 200 years. Equitable Life was the first to use the word "actuary" for its chief executive officer in 1762.[13] Previously,
"actuary" meant an official who recorded the decisions, or "acts", of ecclesiastical courts.[10] Other companies that did not use such mathematical and
scientific methods most often failed or were forced to adopt the methods pioneered by Equitable.[14]
Technological advances
In the 18th and 19th centuries, calculations were performed without computers. The calculations of life insurance premiums and reserving requirements
are rather complex, and actuaries developed techniques to make the calculations as easy as possible, for example "commutation functions" (essentially
precalculated columns of summations over time of discounted values of survival and death probabilities).[15] Actuarial organizations were founded to
support and further both actuaries and actuarial science, and to protect the public interest by promoting competency and ethical standards.[16] However,
calculations remained cumbersome, and actuarial shortcuts were commonplace. Non-life actuaries followed in the footsteps of their life insurance
colleagues during the 20th century. The 1920 revision of rates for the New York-based National Council on Workmen's Compensation Insurance took over
two months of around-the-clock work by day and night teams of actuaries.[17] In the 1930s and 1940s, the mathematical foundations
for stochastic processes were developed.[18] Actuaries could now begin to estimate losses using models of random events, instead of
the deterministic methods they had used in the past. The introduction and development of the computer further revolutionized the actuarial profession.
From pencil-and-paper to punchcards to current high-speed devices, the modeling and forecasting ability of the actuary has rapidly improved, while still
being heavily dependent on the assumptions input into the models, and actuaries needed to adjust to this new world.[19]
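A brief sketch of what commutation functions look like in code (hypothetical mortality and interest values; D_x and N_x are the classical precalculated columns referred to above):

# Hypothetical excerpt of a life table, for illustration only
x  <- 60:64                        # ages
lx <- c(1000, 985, 968, 948, 925)  # assumed number of survivors at each age
i  <- 0.04                         # assumed annual interest rate
v  <- 1 / (1 + i)

Dx <- v^x * lx                # commutation column D_x = v^x * l_x
Nx <- rev(cumsum(rev(Dx)))    # commutation column N_x = sum of D_y for y >= x
ax <- Nx / Dx                 # whole-life annuity-due factors, N_x / D_x
ax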
Actuarial science related to modern financial economics
Traditional actuarial science and modern financial economics in the US have different practices, which is caused by different ways of calculating funding
and investment strategies, and by different regulations.[citation needed]
These regulations stem from the Armstrong investigation of 1905, the Glass–Steagall Act of 1932, the adoption of the Mandatory Security Valuation Reserve by
the National Association of Insurance Commissioners, which cushioned market fluctuations, and the Financial Accounting Standards Board, (FASB) in
the US and Canada, which regulates pensions valuations and funding.[citation needed]
History
Historically, much of the foundation of actuarial theory predated modern financial theory. In the early twentieth century, actuaries were developing many
techniques that can be found in modern financial theory, but for various historical reasons, these developments did not achieve much recognition.[20]
As a result, actuarial science developed along a different path, becoming more reliant on assumptions, as opposed to the arbitrage-free risk-neutral
valuation concepts used in modern finance. The divergence is not related to the use of historical data and statistical projections of liability cash flows, but
is instead caused by the manner in which traditional actuarial methods apply market data with those numbers. For example, one traditional actuarial
method suggests that changing the asset allocation mix of investments can change the value of liabilities and assets (by changing the discount
rate assumption). This concept is inconsistent with financial economics.[citation needed]
The potential of modern financial economics theory to complement existing actuarial science was recognized by actuaries in the mid-twentieth
century.[21] In the late 1980s and early 1990s, there was a distinct effort for actuaries to combine financial theory and stochastic methods into their
established models.[22] Ideas from financial economics became increasingly influential in actuarial thinking, and actuarial science has started to embrace
more sophisticated mathematical modelling of finance.[23] Today, the profession, both in practice and in the educational syllabi of many actuarial
organizations, is cognizant of the need to reflect the combined approach of tables, loss models, stochastic methods, and financial theory.[24] However,
assumption-dependent concepts are still widely used (such as the setting of the discount rate assumption as mentioned earlier), particularly in North
America.[citation needed]
Product design adds another dimension to the debate. Financial economists argue that pension benefits are bond-like and should not be funded with
equity investments without reflecting the risks of not achieving expected returns. But some pension products do reflect the risks of unexpected returns. In
some cases, the pension beneficiary assumes the risk, or the employer assumes the risk. The debate now seems to focus on four
principles:

1. financial models should be free of arbitrage


2. assets and liabilities with identical cash flows should have the same price. This is at odds with FASB
3. the value of an asset is independent of its financing
4. the final issue deals with how pension assets should be invested
Essentially, financial economics states that pension assets should not be invested in equities for a variety of theoretical and practical reasons.[25]
Actuaries in criminal justice
There is an increasing trend to recognize that actuarial skills can be applied to a range of applications outside the traditional fields of insurance, pensions,
etc. One notable example is the use in some US states of actuarial models to set criminal sentencing guidelines. These models attempt to predict the
chance of re-offending according to rating factors which include the type of crime, age, educational background and ethnicity of the offender.[26] However,
these models have been open to criticism as providing justification for discrimination against specific ethnic groups by law enforcement personnel.
Whether this is statistically correct or a self-fulfilling correlation remains under debate.[27]
Another example is the use of actuarial models to assess the risk of sex offense recidivism. Actuarial models and associated tables, such as the
MnSOST-R, Static-99, and SORAG, have been used since the late 1990s to determine the likelihood that a sex offender will re-offend and thus whether
he or she should be institutionalized or set free.[28]

Census
Census taker visits a family of Indigenous Dutch Travellers living in a caravan, Netherlands 1925

A census is the procedure of systematically calculating, acquiring and recording information about the members of a given population. This term is used
mostly in connection with national population and housing censuses; other common censuses include the census of agriculture, and other censuses such
as the traditional culture, business, supplies, and traffic censuses. The United Nations defines the essential features of population and housing censuses
as "individual enumeration, universality within a defined territory, simultaneity and defined periodicity", and recommends that population censuses be
taken at least every ten years. United Nations recommendations also cover census topics to be collected, official definitions, classifications and other
useful information to co-ordinate international practices.[1][2]
The Food and Agriculture Organization of the United Nations (FAO), in turn, defines the census of agriculture as "a statistical operation for collecting, processing and disseminating data on the structure of agriculture, covering the whole or a significant part of a country". In a census of agriculture, data are collected at the holding level.[3]
The word is of Latin origin: during the Roman Republic, the census was a list that kept track of all adult males fit for military service. The modern census
is essential to international comparisons of any kind of statistics, and censuses collect data on many attributes of a population, not just how many people
there are. Censuses typically began as the only method of collecting national demographic data, and are now part of a larger system of different surveys.
Although population estimates remain an important function of a census, including the exact geographic distribution of the population or the agricultural
population, statistics can be produced about combinations of attributes e.g. education by age and sex in different regions. Current administrative
data systems allow for other approaches to enumeration with the same level of detail but raise concerns about privacy and the possibility of biasing
estimates.[4]
A census can be contrasted with sampling in which information is obtained only from a subset of a population; typically main population estimates are
updated by such intercensal estimates. Modern census data are commonly used for research, business marketing, and planning, and as a baseline for
designing sample surveys by providing a sampling frame such as an address register. Census counts are necessary to adjust samples to be
representative of a population by weighting them as is common in opinion polling. Similarly, stratification requires knowledge of the relative sizes of
different population strata, which can be derived from census enumerations. In some countries, the census provides the official counts used to apportion
the number of elected representatives to regions (sometimes controversially – e.g., Utah v. Evans). In many cases, a carefully chosen random sample
can provide more accurate information than attempts to get a population census.[5]

World map showing countries' most recent censuses as of 2014 (map legend: 2005 or after; 2000–2004; 1995–1999; 1990–1994; 1970–1989).

Contents

 1 Sampling
 2 Residence definitions
 3 Enumeration strategies
 4 Technology
 5 Development
 6 Uses of census data
o 6.1 Census data and research
 7 Privacy and data stewardship
 8 History of censuses
o 8.1 Egypt
o 8.2 Ancient Greece
o 8.3 Israel
o 8.4 China
o 8.5 India
o 8.6 Rome
o 8.7 Rashidun and Umayyad Caliphates
o 8.8 Medieval Europe
o 8.9 Inca Empire
o 8.10 Spanish Empire
 9 World population estimates
 10 Impact of COVID-19 on census
o 10.1 Impact
o 10.2 Adaptation
 11 Modern implementation
 12 See also
 13 Sources
 14 Notes
 15 References
 16 External links

Sampling

Tehran Census 1869[6]

A census is often construed as the opposite of a sample as its intent is to count everyone in a population rather than a fraction. However, population
censuses do rely on a sampling frame to count the population. This is the only way to be sure that everyone has been included as otherwise those not
responding would not be followed up on and individuals could be missed. The fundamental premise of a census is that the population is not known and a
new estimate is to be made by the analysis of primary data. The use of a sampling frame is counterintuitive as it suggests that the population size is
already known. However, a census is also used to collect attribute data on the individuals in the nation, not only to assess population size. This process
of sampling marks the difference between a historical census, which was a house to house process or the product of an imperial decree, and the modern
statistical project. The sampling frame used by census is almost always an address register. Thus it is not known if there is anyone resident or how many
people there are in each household. Depending on the mode of enumeration, a form is sent to the householder, an enumerator calls, or administrative
records for the dwelling are accessed. As a preliminary to the dispatch of forms, census workers will check any address problems on the ground. While it
may seem straightforward to use the postal service file for this purpose, this can be out of date and some dwellings may contain a number of
independent households. A particular problem is what are termed 'communal establishments', a category that includes student residences, religious orders, homes for the elderly, prisons, etc. As these are not easily enumerated by a single householder, they are often treated differently and visited by special teams of census workers to ensure they are classified appropriately.

Residence definitions
Individuals are normally counted within households, and information is typically collected about the household structure and the housing. For this reason
international documents refer to censuses of population and housing. Normally the census response is made by a household, indicating details of
individuals resident there. An important aspect of census enumerations is determining which individuals can be counted and which cannot be counted.
Broadly, three definitions can be used: de facto residence; de jure residence; and permanent residence. This is important in considering individuals who
have multiple or temporary addresses. Every person should be identified uniquely as resident in one place; but the place where they happen to be
on Census Day, their de facto residence, may not be the best place to count them. Where an individual uses services may be more useful, and this is at
their usual residence. An individual may be recorded at a "permanent" address, which might be a family home for students or long term migrants.
A precise definition of residence is needed, to decide whether visitors to a country should be included in the population count. This is becoming more
important as students travel abroad for education for a period of several years. Other groups causing problems of enumeration are new-born babies,
refugees, people away on holiday, people moving home around census day, and people without a fixed address.
People with second homes because they are working in another part of the country or have a holiday cottage are difficult to fix at a particular address;
this sometimes causes double counting or houses being mistakenly identified as vacant. Another problem is where people use a different address at
different times e.g. students living at their place of education in term time but returning to a family home during vacations, or children whose parents have
separated who effectively have two family homes. Census enumeration has always been based on finding people where they live, as there is no
systematic alternative: any list used to find people is likely to be derived from census activities in the first place. Recent UN guidelines provide
recommendations on enumerating such complex households.[7]
In the census of agriculture, data is collected at the agricultural holding unit. An agricultural holding is an economic unit of agricultural production under
single management comprising all livestock kept and all land used wholly or partly for agricultural production purposes, without regard to title, legal form,
or size. Single management may be exercised by an individual or household, jointly by two or more individuals or households, by a clan or tribe, or by a
juridical person such as a corporation, cooperative or government agency. The holding's land may consist of one or more parcels, located in one or more
separate areas or in one or more territorial or administrative divisions, providing the parcels share the same production means, such as labour, farm
buildings, machinery or draught animals.[3]

Enumeration strategies
Historical censuses used crude enumeration and assumed absolute accuracy. Modern approaches take into account the problems of overcount and undercount, and the coherence of census enumerations with other official sources of data.[8] This reflects a realist approach to measurement, acknowledging that under any definition of residence there is a true value of the population, but that this can never be measured with complete accuracy. An important aspect of the census process is to evaluate the quality of the data.[9]
Many countries use a post-enumeration survey to adjust the raw census counts.[10] This works in a similar manner to capture-recapture estimation for
animal populations. Among census experts this method is called dual system enumeration (DSE). A sample of households is visited by interviewers
who record the details of the household as at census day. These data are then matched to census records, and the number of people missed can be
estimated by considering the numbers of people who are included in one count but not the other. This allows adjustments to the count for non-response,
varying between different demographic groups. An explanation using a fishing analogy can be found in "Trout, Catfish and Roach..."[11] which won an
award from the Royal Statistical Society for excellence in official statistics in 2011.
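In its simplest form, dual system enumeration reduces to the same arithmetic as the Lincoln–Petersen capture–recapture estimator: multiply the two independent counts and divide by the number of matched records. The Python sketch below is purely illustrative; the function name and all counts are invented and do not come from any actual census or post-enumeration survey.

    # Minimal sketch of dual system enumeration (DSE) in its simplest,
    # Lincoln-Petersen form: list 1 is the census, list 2 is the
    # post-enumeration survey, and the overlap is found by record matching.
    # All counts are invented for illustration.

    def dual_system_estimate(census_count, pes_count, matched_count):
        """Estimate the population size from two independent counts."""
        # N_hat = (n1 * n2) / m, where m is the number found in both lists
        return census_count * pes_count / matched_count

    n1 = 9_500   # persons enumerated by the census in a hypothetical area
    n2 = 1_000   # persons found by the post-enumeration survey sample
    m  = 930     # persons matched to both sources

    n_hat = dual_system_estimate(n1, n2, m)
    print(f"estimated population: {n_hat:.0f}")
    print(f"implied net undercount: {n_hat - n1:.0f}")

In practice the estimate is computed separately for different demographic groups, since non-response varies between them, which is exactly the adjustment described above.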

Enumerator conducting a survey using a mobile phone-based questionnaire in rural Zimbabwe.

Triple system enumeration has been proposed as an improvement as it would allow evaluation of the statistical dependence of pairs of sources.
However, as the matching process is the most difficult aspect of census estimation this has never been implemented for a national enumeration. It would
also be difficult to identify three different sources that were sufficiently different to make the triple system effort worthwhile. The DSE approach has another weakness in that it assumes there is no person counted twice (overcount). Under de facto residence definitions this would not be a problem, but under de jure definitions individuals risk being recorded on more than one form, leading to double counting. A particular problem here is students, who often have both a term-time and a family address.
Several countries have used a system which is known as short form/long form.[12] This is a sampling strategy which randomly chooses a proportion of
people to send a more detailed questionnaire to (the long form). Everyone receives the short form questions. This means more data are collected, but
without imposing a burden on the whole population. This also reduces the burden on the statistical office. Indeed, in the UK until 2001 all residents were
required to fill in the whole form but only a 10% sample were coded and analysed in detail.[13] New technology means that all data are now scanned and
processed. Recently there has been controversy in Canada about the cessation of the mandatory long form census; the head of Statistics Canada, Munir
Sheikh, resigned upon the federal government's decision to do so.[14]
The use of alternative enumeration strategies is increasing[15] but these are not as simple as many people assume, and are only used in developed
countries.[16] The Netherlands has been most advanced in adopting a census using administrative data. This allows a simulated census to be conducted
by linking several different administrative databases at an agreed time. Data can be matched and an overall enumeration established allowing for
discrepancies between different data sources. A validation survey is still conducted in a similar way to the post enumeration survey employed in a
traditional census.
Other countries which have a population register use this as a basis for all the census statistics needed by users. This is most common among Nordic
countries, but requires many distinct registers to be combined, including population, housing, employment and education. These registers are then
combined and brought up to the standard of a statistical register by comparing the data in different sources and ensuring the quality is sufficient for
official statistics to be produced.[17] A recent innovation is the French instigation of a rolling census programme with different regions enumerated each
year, so that the whole country is completely enumerated every 5 to 10 years.[18] In Europe, in connection with the 2010 census round, many countries
adopted alternative census methodologies, often based on the combination of data from registers, surveys and other sources.[19]
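The register-based approach described above amounts to linking several administrative sources on a common person identifier and tabulating the combined records as if they were census returns. The sketch below is a toy illustration under that assumption; the register names, fields and identifiers are all invented rather than drawn from any country's actual registers.

    # Toy "virtual census": link invented administrative registers on a
    # shared person identifier and produce a census-style tabulation.
    from collections import Counter

    population_register = {1: {"age": 34}, 2: {"age": 71}, 3: {"age": 19}}
    employment_register = {1: "employed", 3: "student"}            # id -> status
    education_register  = {1: "tertiary", 2: "secondary", 3: "secondary"}

    linked = []
    for pid, person in population_register.items():
        linked.append({
            "age": person["age"],
            # a person missing from a register gets an explicit "unknown" code;
            # such discrepancies are what the validation survey examines
            "employment": employment_register.get(pid, "unknown"),
            "education": education_register.get(pid, "unknown"),
        })

    # census-style cross-tabulation: education level by employment status
    table = Counter((r["education"], r["employment"]) for r in linked)
    for (education, employment), count in sorted(table.items()):
        print(f"{education:>10} | {employment:>9} | {count}")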

Technology
Censuses have evolved in their use of technology: censuses in 2010 used many new types of computing. In Brazil, handheld devices were used by
enumerators to locate residences on the ground. In many countries, census returns could be made via the Internet as well as in paper form. DSE is
facilitated by computer matching techniques which can be automated, such as propensity score matching. In the UK, all census formats are scanned and
stored electronically before being destroyed, replacing the need for physical archives. The record linking to perform an administrative census would not
be possible without large databases being stored on computer systems.
There are sometimes problems in introducing new technology. The US census had been intended to use handheld computers, but cost escalated and
this was abandoned, with the contract being sold to Brazil. Online response has some advantages, but one of the functions of the census is to make sure
everyone is counted accurately. A system which allowed people to enter their address without verification would be open to abuse. Therefore, households have to be verified on the ground, typically by an enumerator visit or by posting forms out to listed addresses. Paper forms are still necessary for those without access to the internet. It is also possible that the less visible nature of a register-based administrative census means that users are not engaged with the importance of contributing their data to official statistics.
Alternatively, population estimations may be carried out remotely with GIS and remote sensing technologies.[20]

Development
According to UNFPA, "The information generated by a population and housing census – numbers of people, their distribution, their living conditions and
other key data – is critical for development."[21] This is because this type of data is essential for policymakers so that they know where to invest.
Unfortunately, many countries have outdated or inaccurate data about their populations and thus have difficulty in addressing the needs of the
population.
UNFPA said:[21]
"The unique advantage of the census is that it represents the entire statistical universe, down to the smallest geographical units, of a country or region.
Planners need this information for all kinds of development work, including: assessing demographic trends; analysing socio-economic
conditions;[22] designing evidence-based poverty-reduction strategies; monitoring and evaluating the effectiveness of policies; and tracking progress
toward national and internationally agreed development goals."
In addition to making policymakers aware of population issues, the census is also an important tool for identifying forms of social, demographic or economic exclusion, such as inequalities relating to race, ethnicity, and religion, as well as disadvantaged groups such as those with disabilities and the poor.
An accurate census can empower local communities by providing them with the necessary information to participate in local decision-making and
ensuring they are represented.
The importance of the census of agriculture for development is that it gives a snapshot of the structure of the agricultural sector in a country and, when
compared with previous censuses, provides an opportunity to identify trends and structural transformations of the sector, and points towards areas for
policy intervention. Census data are used as a benchmark for current statistics and their value is increased when they are employed together with other
data sources.[3]

Uses of census data


Early censuses in the 19th century collected paper documents which had to be collated by hand, so the statistical information obtained was quite basic.
The government, which owned the data, could publish statistics on the state of the nation.[23] The results were used to measure changes in the population and
apportion representation. Population estimates could be compared to those of other countries.
By the beginning of the 20th century, censuses were recording households and some indications of their employment. In some countries, census
archives are released for public examination after many decades, allowing genealogists to track the ancestry of interested people. Archives provide a
substantial historical record which may challenge established views. Information such as job titles and arrangements for the destitute and sick may also
shed light on the historical structure of society.
Political considerations influence the census in many countries. In Canada in 2010 for example, the government under the leadership of Stephen Harper
abolished the mandatory long-form census. This abolition was a response to protests from some Canadians who resented the personal questions.[24] The
long-form census was reinstated by the Justin Trudeau government in 2016.
Census data and research
As governments assumed responsibility for schooling and welfare, large government research departments made extensive use of census data.
Population projections could be made, to help plan for provision in local government and regions. Central government could also use census data to
allocate funding. Even in the mid 20th century, census data was only directly accessible to large government departments. However, computers meant
that tabulations could be used directly by university researchers, large businesses and local government offices. They could use the detail of the data to
answer new questions and add to local and specialist knowledge.
Nowadays, census data are published in a wide variety of formats to be accessible to business, all levels of government, media, students and teachers,
charities, and any citizen who is interested; researchers in particular have an interest in the role of Census Field Officers (CFO) and their
assistants.[25] Data can be represented visually or analysed in complex statistical models, to show the difference between certain areas, or to understand
the association between different personal characteristics. Census data offer a unique insight into small areas and small demographic groups which
sample data would be unable to capture with precision.
In the census of agriculture, users need census data to:

1. support and contribute to evidence-based agricultural planning and policy-making. The census information is essential, for example, to monitor the
performance of a policy or programme designed for crop diversification or to address food security issues;
2. provide data to facilitate research, investment and business decisions both in the public and private sector;
3. contribute to monitoring environmental changes and evaluating the impact of agricultural practices on the environment such as tillage practices,
crop rotation or sources of greenhouse gas (GHG) emissions;
4. provide relevant data on work inputs and main work activities, as well as on the labour force in the agriculture sector;
5. provide an important information base for monitoring some key indicators of the Sustainable Development Goals (SDGs), in particular those goals
related to food security in agricultural holdings, the role of women in agricultural activities and rural poverty;
6. provide baseline data both at the national and small administrative and geographical levels for formulating, monitoring and evaluating programme and project interventions;
7. provide essential information on subsistence agriculture and for the estimation of the non-observed economy, which plays an important role in the
compilation of the national accounts and the economic accounts for agriculture.[3]

Privacy and data stewardship


Although the census provides useful statistical information about a population, the availability of this information could sometimes lead to abuses, political
or otherwise, by the linking of individuals' identities to anonymous census data.[26] This is particularly important when individuals' census responses are
made available in microdata form, but even aggregate-level data can result in privacy breaches when dealing with small areas and/or rare
subpopulations.
For instance, when reporting data from a large city, it might be appropriate to give the average income for black males aged between 50 and 60.
However, doing this for a town that only has two black males in this age group would be a breach of privacy because either of those persons, knowing his
own income and the reported average, could determine the other man's income.
Typically, census data are processed to obscure such individual information. Some agencies do this by intentionally introducing small statistical errors to prevent the identification of individuals in marginal populations;[27] others swap variables for similar respondents. These protective measures are collectively known as statistical disclosure control. Whatever is done to reduce the privacy risk, however, ever more powerful electronic analysis of data can threaten to reveal sensitive individual information.
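The perturbation idea mentioned above can be sketched very simply: add small random noise to published counts and suppress cells that fall below a minimum size. The threshold, the noise range and the counts below are assumptions made for illustration, not any agency's actual disclosure rules.

    # Toy statistical disclosure control: perturb small-area counts with
    # small random noise and suppress very small cells. Figures invented.
    import random

    random.seed(42)
    SUPPRESSION_THRESHOLD = 5        # assumed minimum publishable cell size

    def protect(count):
        if count < SUPPRESSION_THRESHOLD:
            return None                       # suppress: too few people
        return count + random.randint(-2, 2)  # small, bounded random noise

    raw_counts = {"area A": 1_523, "area B": 847, "area C": 2}
    for area, count in raw_counts.items():
        published = protect(count)
        print(area, "suppressed" if published is None else published)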
Another possibility is to present survey results by means of statistical models in the form of a multivariate distribution mixture.[28] The statistical information
in the form of conditional distributions (histograms) can be derived interactively from the estimated mixture model without any further access to the
original database. As the final product does not contain any protected microdata, the model-based interactive software can be distributed without any
confidentiality concerns.
Another method is simply to release no data at all, except very large-scale data directly to the central government. Different release strategies between governments have led to an international project (IPUMS) to co-ordinate access to microdata and corresponding metadata. Projects such as SDMX also promote the standardisation of metadata, so that best use can be made of the minimal data available.

History of censuses
Egypt
Censuses in Egypt first appeared in the late Middle Kingdom and developed in the New Kingdom.[29] Pharaoh Amasis, according to Herodotus, required every Egyptian to declare annually to the nomarch, "whence he gained his living".[30] Under the Ptolemies and the Romans several censuses were conducted in Egypt by government officials.[31]
Ancient Greece
There are several accounts of ancient Greek city states carrying out censuses.[32]
Israel
Censuses are mentioned in the Bible. God commands a per capita tax to be paid with the census[33] for the upkeep of the Tabernacle. The Book of
Numbers is named after the counting of the Israelite population[34] according to the house of the Fathers after the exodus from Egypt. A second census
was taken while the Israelites were camped in the plains of Moab.[35]
King David performed a census that produced disastrous results.[36] His son, King Solomon, had all of the foreigners in Israel counted.[37]
When the Romans took over Judea in AD 6, the legate Publius Sulpicius Quirinius organised a census for tax purposes. The Gospel of Luke links the
birth of Jesus either to this event, or to an otherwise unknown census conducted prior to Quirinius’ tenure.[38][39]
China
One of the world's earliest preserved censuses[40] was held in China in AD 2 during the Han Dynasty, and is still considered by scholars to be quite accurate.[41][42][43][44] The population was registered as 57,671,400 individuals in 12,366,470 households, but on this occasion only taxable families had been taken into account, indicating the income and the number of soldiers who could be mobilized.[45][43] Another census was held in AD 144.
India
The oldest recorded census in India is thought to have occurred around 330 BC during the reign of Emperor Chandragupta Maurya under the leadership
of Kautilya or Chanakya and Ashoka.[46]
Rome
See also: Roman censor and Indiction
The English term is taken directly from the Latin census, from censere ("to estimate"). The census played a crucial role in the administration of the
Roman government, as it was used to determine the class a citizen belonged to for both military and tax purposes. Beginning in the middle republic, it
was usually carried out every five years.[47] It provided a register of citizens and their property from which their duties and privileges could be listed. It is
said to have been instituted by the Roman king Servius Tullius in the 6th century BC,[48] at which time the number of arms-bearing citizens was
supposedly counted at around 80,000.[49] The 6 AD "census of Quirinius" undertaken following the imposition of direct Roman rule in Judea was partially
responsible for the development of the Zealot movement and several failed rebellions against Rome that ended in the Diaspora. The 15-
year indiction cycle established by Diocletian in AD 297 was based on quindecennial censuses and formed the basis for dating in late antiquity and under
the Byzantine Empire.
Rashidun and Umayyad Caliphates
In the Middle Ages, the Caliphate began conducting regular censuses soon after its formation, beginning with the one ordered by the
second Rashidun caliph, Umar.[50]
Medieval Europe
The Domesday Book was undertaken in AD 1086 by William I of England so that he could properly tax the land he had recently conquered. In 1183, a
census was taken of the crusader Kingdom of Jerusalem, to ascertain the number of men and amount of money that could possibly be raised against an
invasion by Saladin, sultan of Egypt and Syria.
In 1328, the first national census of France (L'État des paroisses et des feux) was undertaken, mostly for fiscal purposes. It estimated the French population at 16 to 17 million.
Inca Empire
In the 15th century, the Inca Empire had a unique way to record census information. The Incas did not have any written language, but recorded information collected during censuses and other numeric information, as well as non-numeric data, on quipus: strings of llama or alpaca hair or cotton cords with numeric and other values encoded by knots in a base-10 positional system.
Spanish Empire
On May 25, 1577, King Philip II of Spain ordered by royal cédula the preparation of a general description of Spain's holdings in the Indies. Instructions
and a questionnaire, issued in 1577 by the Office of the Cronista Mayor, were distributed to local officials in the Viceroyalties of New Spain and Peru to
direct the gathering of information. The questionnaire, composed of fifty items, was designed to elicit basic information about the nature of the land and
the life of its peoples. The replies, known as "relaciones geográficas", were written between 1579 and 1585 and were returned to the Cronista Mayor in
Spain by the Council of the Indies.

World population estimates


The earliest estimate of the world population was made by Giovanni Battista Riccioli in 1661; the next by Johann Peter Süssmilch in 1741, revised in
1762; the third by Karl Friedrich Wilhelm Dieterici in 1859.[51]
In 1931, Walter Willcox published a table in his book, International Migrations: Volume II Interpretations, that estimated the 1929 world population to be
roughly 1.8 billion.

League of Nations and International Statistical Institute estimates of the world population in 1929

Impact of COVID-19 on census


Impact
UNFPA predicts that the COVID-19 pandemic will threaten the successful conduct of censuses of population and housing in many countries through delays, interruptions that compromise quality, or complete cancellation of census projects. Domestic and donor financing for the census may be diverted to address COVID-19, leaving the census without crucial funds. Several countries have already decided to postpone the census, with many others yet to announce the way forward; in some countries this is already happening.[52]
The pandemic has also affected the planning and implementation of censuses of agriculture in all of the world's regions. The extent of the impact has varied according to the stage each census had reached, ranging from planning (i.e. staffing, procurement, preparation of frames and questionnaires) through fieldwork (field training and enumeration) to the data processing/analysis stages. The census of agriculture's reference period is the agricultural year. Thus, a delay in any census activity may be critical and can result in a full year's postponement of the enumeration if the agricultural season is missed. Some publications have discussed the impact of COVID-19 on national censuses of agriculture.[53][54][55][56]
Adaptation
UNFPA has requested a global effort to assure that even where census is delayed, census planning and preparations are not cancelled, but continue in
order to assure that implementation can proceed safely when the pandemic is under control. While new census methods, including online, register-based, and hybrid approaches, are being used across the world, these demand extensive planning and preconditions that cannot be created at short notice. The continuing low supply of personal protective equipment to protect against COVID-19 has immediate implications for conducting the census in communities at risk of transmission. The UNFPA Procurement Office is partnering with other agencies to explore new supply chains and resources.[52]

Demographic statistics
From Wikipedia, the free encyclopedia

Demographic statistics are measures of the characteristics of, or changes to, a population. Records of births, deaths, marriages, immigration and
emigration and a regular census of population provide information that is key to making sound decisions about national policy.
A useful summary of such data is the population pyramid. It provides data about the sex and age distribution of the population in an accessible graphical
format.
Another summary is called the life table. For a cohort of persons born in the same year, it traces and projects their life experiences from birth to death.
For a given cohort, the proportion expected to survive each year (or decade in an abridged life table) is presented in tabular or graphical form.
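A life table of the kind just described can be computed from age-specific probabilities of dying. The abridged (ten-year) sketch below uses invented probabilities and the common simplification that deaths occur halfway through each interval; it is illustrative only, not an official methodology.

    # Abridged life table sketch: from assumed probabilities of dying within
    # each ten-year age interval (qx), compute survivors (lx) out of 100,000
    # births and a crude life expectancy at birth. All qx values are invented.
    qx = [0.01, 0.002, 0.004, 0.01, 0.03, 0.08, 0.20, 0.45, 1.0]  # ages 0-9, 10-19, ...

    radix = 100_000
    lx = [radix]
    for q in qx:
        lx.append(lx[-1] * (1 - q))   # survivors to the start of the next interval

    # person-years lived, assuming deaths fall mid-interval (interval width 10 years)
    person_years = sum(10 * (lx[i] + lx[i + 1]) / 2 for i in range(len(qx)))
    life_expectancy_at_birth = person_years / radix

    for i, survivors in enumerate(lx[:-1]):
        print(f"age {10 * i:3d}+  l(x) = {survivors:9.0f}")
    print(f"life expectancy at birth ≈ {life_expectancy_at_birth:.1f} years")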
The ratio of males to females by age indicates the consequences of differing mortality rates on the sexes. Thus, while values above one are common for
newborns, the ratio dwindles until it is well below one for the older population.

Contents

 1 Collection
 2 Population estimates and projections
 3 History
 4 Metadata
 5 Statistical sources
 6 See also
 7 Further reading
 8 External links

Collection
National population statistics are usually collected by conducting a census. However, because these are usually huge logistical exercises, countries
normally conduct censuses only once every five to 10 years. Even when a census is conducted it may miss counting everyone (known as undercount).
Also, some people counted in the census may be recorded in a different place than where they usually live, because they are travelling, for example (this
may result in overcounting). Consequently, raw census numbers are often adjusted to produce census estimates that distinguish such categories as resident population, residents, tourists and other visitors, and nationals and aliens (non-nationals). For privacy reasons, particularly when there are small counts, some census results may be rounded, often to the nearest ten, hundred or thousand, and sometimes randomly up or down to another nearby number, such as within 3 of the actual count.
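The rounding just described can be made unbiased by rounding randomly up or down with probabilities proportional to the distance to each neighbouring multiple. The sketch below applies random rounding to base 3; the base and the example counts are assumptions for illustration, not the rules of any particular census office.

    # Sketch of unbiased random rounding of counts to base 3: each count is
    # moved up or down to a multiple of 3 with probabilities chosen so that
    # the expected published value equals the true count.
    import random

    def random_round(count, base=3):
        remainder = count % base
        if remainder == 0:
            return count
        if random.random() < remainder / base:
            return count + (base - remainder)   # round up
        return count - remainder                # round down

    random.seed(0)
    for true_count in [0, 1, 2, 7, 128, 1_291]:
        print(true_count, "->", random_round(true_count))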
Between censuses, administrative data collected by various agencies about population events such as births, deaths, and cross-border migration may be
used to produce intercensal estimates.

Population estimates and projections


Population estimates are usually derived from census and other administrative data. Population estimates are normally produced after the date the
estimate is for.
Some estimates, such as the usually resident population, estimate the number of people who usually live in a locality as at the census date, even though the census did not count them within that locality. Census questions usually include questions about where a person usually lives, whether they are a resident or a visitor, or whether they also live somewhere else, to allow these estimates to be made.
Other estimates are concerned with estimating population on a particular date that is different from the census date, for example the middle or end of a
calendar or financial year. These estimates often use birth and death records and migration data to adjust census counts for the changes that have
happened since the census.
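These adjustments follow the demographic balancing equation: the population at the estimate date equals the census count plus births, minus deaths, plus net migration since census day. A minimal sketch, with all figures invented:

    # Postcensal estimate via the demographic balancing equation:
    #   P(t) = census count + births - deaths + net migration since census day
    # All figures are invented for illustration.
    census_count  = 4_823_000   # resident population counted on census day
    births        = 61_500      # registered births since the census
    deaths        = 38_200      # registered deaths since the census
    net_migration = 12_700      # arrivals minus departures, from administrative data

    estimate = census_count + births - deaths + net_migration
    print(f"estimated resident population at the reference date: {estimate:,}")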
Population projections are produced in advance of the date they are for. They use time series analysis of existing census data and other sources of
population information to forecast the size of future populations. Because there are unknown factors that may affect future population changes,
population projections often incorporate high and low as well as expected values for future populations. Population projections are often recomputed after a census has been conducted, and may also need revision when the boundaries of the areas concerned are redrawn.
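A deliberately simplified projection along these lines can be made by extrapolating an assumed annual growth rate under low, expected and high scenarios; official projections instead use cohort-component methods with separate fertility, mortality and migration assumptions. The rates and base population below are invented.

    # Toy scenario projection: exponential growth under three assumed
    # annual growth rates, ten years beyond the jump-off estimate.
    base_population = 4_859_000
    scenarios = {"low": 0.003, "expected": 0.008, "high": 0.013}

    for name, annual_rate in scenarios.items():
        projected = base_population * (1 + annual_rate) ** 10
        print(f"{name:>8} scenario, year +10: {projected:,.0f}")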

History
While many censuses were conducted in antiquity, there are few population statistics that survive. One example though can be found in the Bible, in
chapter 1 of the Book of Numbers. Not only are the statistics given, but the method used to compile those statistics is also described. In modern-day
terms, this metadata about the census is probably of as much value as the statistics themselves as it allows researchers to determine not only what was
being counted but how and why it was done.
Metadata
Modern population statistics are normally accompanied by metadata that explains how the statistics have been compiled and adjusted to compensate for
any collection issues.

Statistical sources
Most countries have a census bureau or government agency responsible for conducting censuses. Many of these agencies publish their country's census
results and other population statistics on their agency's website.

See also
 Demographic window
 Census - Census Bureau, Census tract, Census block group, Census block.
 Intercensal estimate
 Population projection
Econometrics
From Wikipedia, the free encyclopedia

For broader coverage of this topic, see Mathematical economics.
Econometrics is the application of statistical methods to economic data in order to give empirical content to economic relationships.[1] More precisely, it is
"the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate
methods of inference".[2] An introductory economics textbook describes econometrics as allowing economists "to sift through mountains of data to extract
simple relationships".[3] The first known use of the term "econometrics" (in cognate form) was by Polish economist Paweł Ciompa in 1910.[4] Jan
Tinbergen is one of the two founding fathers of econometrics.[5][6][7] The other, Ragnar Frisch, also coined the term in the sense in which it is used today.[8]
A basic tool for econometrics is the multiple linear regression model.[9] Econometric theory uses statistical theory and mathematical statistics to evaluate
and develop econometric methods.[10][11] Econometricians try to find estimators that have desirable statistical properties including unbiasedness, efficiency,
and consistency. Applied econometrics uses theoretical econometrics and real-world data for assessing economic theories, developing econometric
models, analysing economic history, and forecasting.

Contents

 1 Basic models: linear regression
 2 Theory
 3 Methods
 4 Example
 5 Journals
 6 Limitations and criticisms
 7 See also
 8 Further reading
 9 References
 10 External links

Basic models: linear regression


A basic tool for econometrics is the multiple linear regression model.[9] In modern econometrics, other statistical tools are frequently used, but linear
regression is still the most frequently used starting point for an analysis.[9] Estimating a linear regression on two variables can be visualised as fitting a line
through data points representing paired values of the independent and dependent variables.

Okun's law representing the relationship between GDP growth and the unemployment rate. The fitted line is found using regression analysis.

For example, consider Okun's law, which relates GDP growth to the unemployment rate. This relationship is represented in a linear regression where the change in the unemployment rate (\(\Delta\,\text{Unemployment}\)) is a function of an intercept (\(\beta_0\)), a given value of GDP growth multiplied by a slope coefficient \(\beta_1\), and an error term \(\varepsilon\):

\[ \Delta\,\text{Unemployment} = \beta_0 + \beta_1\,\text{Growth} + \varepsilon . \]

The unknown parameters \(\beta_0\) and \(\beta_1\) can be estimated. Here \(\beta_1\) is estimated to be −1.77 and \(\beta_0\) is estimated to be 0.83. This means that each additional percentage point of GDP growth lowers the predicted change in the unemployment rate by 1.77 points. The model could then be tested for statistical significance as to whether an increase in GDP growth is associated with a decrease in the unemployment rate, as hypothesized. If the estimate of \(\beta_1\) were not significantly different from 0, the test would fail to find evidence that changes in the growth rate and unemployment rate were related. The variance in a prediction of the dependent variable (unemployment) as a function of the independent variable (GDP growth) is given in polynomial least squares.
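A regression of this form can be fitted by ordinary least squares. The NumPy sketch below estimates the intercept and slope on a handful of invented (growth, change-in-unemployment) observations; the data are not the series behind the −1.77 and 0.83 estimates quoted above.

    # OLS fit of an Okun's-law style regression
    #   d_unemployment = b0 + b1 * gdp_growth + error
    # on invented observations.
    import numpy as np

    gdp_growth = np.array([3.0, 4.5, 1.0, -0.5, 2.0, 5.0])     # percent
    d_unemp    = np.array([-0.4, -0.9, 0.6, 1.2, 0.1, -1.1])   # percentage points

    X = np.column_stack([np.ones_like(gdp_growth), gdp_growth])  # add intercept column
    beta, *_ = np.linalg.lstsq(X, d_unemp, rcond=None)           # least-squares solution
    b0, b1 = beta
    print(f"estimated intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
    print(f"predicted change in unemployment at 3% growth: {b0 + b1 * 3:.2f} points")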

Theory
See also: Estimation theory
Econometric theory uses statistical theory and mathematical statistics to evaluate and develop econometric methods.[10][11] Econometricians try to
find estimators that have desirable statistical properties including unbiasedness, efficiency, and consistency. An estimator is unbiased if its expected
value is the true value of the parameter; it is consistent if it converges to the true value as the sample size gets larger, and it is efficient if the
estimator has lower standard error than other unbiased estimators for a given sample size. Ordinary least squares (OLS) is often used for estimation
since it provides the BLUE or "best linear unbiased estimator" (where "best" means most efficient, unbiased estimator) given the Gauss-
Markov assumptions. When these assumptions are violated or other statistical properties are desired, other estimation techniques such as maximum
likelihood estimation, generalized method of moments, or generalized least squares are used. Estimators that incorporate prior beliefs are advocated
by those who favour Bayesian statistics over traditional, classical or "frequentist" approaches.

Methods
Main article: Methodology of econometrics
Applied econometrics uses theoretical econometrics and real-world data for assessing economic theories, developing econometric models,
analysing economic history, and forecasting.[12]
Econometrics may use standard statistical models to study economic questions, but most often these are applied to observational data, rather than data from controlled experiments.[13] In this, the design of observational studies in econometrics is similar to the design of studies in other observational
disciplines, such as astronomy, epidemiology, sociology and political science. Analysis of data from an observational study is guided by the study
protocol, although exploratory data analysis may be useful for generating new hypotheses.[14] Economics often analyses systems of equations and
inequalities, such as supply and demand hypothesized to be in equilibrium. Consequently, the field of econometrics has developed methods
for identification and estimation of simultaneous equations models. These methods are analogous to methods used in other areas of science, such as
the field of system identification in systems analysis and control theory. Such methods may allow researchers to estimate models and investigate
their empirical consequences, without directly manipulating the system.
One of the fundamental statistical methods used by econometricians is regression analysis.[15] Regression methods are important in econometrics
because economists typically cannot use controlled experiments. Econometricians often seek illuminating natural experiments in the absence of
evidence from controlled experiments. Observational data may be subject to omitted-variable bias and a list of other problems that must be
addressed using causal analysis of simultaneous-equation models.[16]
In addition to natural experiments, quasi-experimental methods have been used increasingly commonly by econometricians since the 1980s, in order
to credibly identify causal effects.[17]

Example
A simple example of a relationship in econometrics from the field of labour economics is:

\[ \ln(\text{wage}) = \beta_0 + \beta_1\,(\text{years of education}) + \varepsilon . \]

This example assumes that the natural logarithm of a person's wage is a linear function of the number of years of education that person has acquired. The parameter \(\beta_1\) measures the increase in the natural log of the wage attributable to one more year of education. The term \(\varepsilon\) is a random variable representing all other factors that may have a direct influence on wage. The econometric goal is to estimate the parameters \(\beta_0\) and \(\beta_1\) under specific assumptions about the random variable \(\varepsilon\). For example, if \(\varepsilon\) is uncorrelated with years of education, then the equation can be estimated with ordinary least squares.
If the researcher could randomly assign people to different levels of education, the data set thus generated would allow estimation of the effect of
changes in years of education on wages. In reality, those experiments cannot be conducted. Instead, the econometrician observes the years of
education of and the wages paid to people who differ along many dimensions. Given this kind of data, the estimated coefficient on Years of
Education in the equation above reflects both the effect of education on wages and the effect of other variables on wages, if those other variables
were correlated with education. For example, people born in certain places may have higher wages and higher levels of education. Unless the
econometrician controls for place of birth in the above equation, the effect of birthplace on wages may be falsely attributed to the effect of
education on wages.
The most obvious way to control for birthplace is to include a measure of the effect of birthplace in the equation above. Exclusion of birthplace, together with the assumption that \(\varepsilon\) is uncorrelated with education, produces a misspecified model. Another technique is to include in the equation an additional set of measured covariates which are not instrumental variables, yet render \(\beta_1\) identifiable.[18] An overview of econometric methods used to study this problem was provided by Card (1999).[19]
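The birthplace problem can be illustrated by simulation: generate log wages that depend on education and on an unobserved regional factor correlated with education, then compare the estimated education coefficient with and without the regional control. All parameter values below are arbitrary choices for the simulation, not estimates from real data.

    # Simulated omitted-variable bias in a wage equation:
    #   ln(wage) = 1.0 + 0.08*education + 0.20*region + noise,
    # where education is itself correlated with the unobserved region effect.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5_000
    region = rng.normal(size=n)                        # unobserved regional advantage
    education = 12 + 2 * region + rng.normal(size=n)   # correlated with region
    log_wage = 1.0 + 0.08 * education + 0.20 * region + rng.normal(scale=0.3, size=n)

    def ols(y, *regressors):
        X = np.column_stack([np.ones(len(y)), *regressors])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    short = ols(log_wage, education)           # region omitted -> biased upward
    long  = ols(log_wage, education, region)   # region controlled -> near 0.08
    print(f"education coefficient, region omitted : {short[1]:.3f}")
    print(f"education coefficient, region included: {long[1]:.3f}")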

Journals
The main journals that publish work in econometrics are Econometrica, the Journal of Econometrics, The Review of Economics and
Statistics, Econometric Theory, the Journal of Applied Econometrics, Econometric Reviews, The Econometrics Journal,[20] and the Journal of
Business & Economic Statistics.

Limitations and criticisms


See also: Criticisms of econometrics
Like other forms of statistical analysis, badly specified econometric models may show a spurious relationship where two variables are correlated
but causally unrelated. In a study of the use of econometrics in major economics journals, McCloskey concluded that some economists report p-
values (following the Fisherian tradition of tests of significance of point null-hypotheses) and neglect concerns of type II errors; some economists
fail to report estimates of the size of effects (apart from statistical significance) and to discuss their economic importance. She also argues that
some economists fail to use economic reasoning for model selection, especially for deciding which variables to include in a regression.[21][22]
In some cases, economic variables cannot be experimentally manipulated as treatments randomly assigned to subjects.[23] In such cases,
economists rely on observational studies, often using data sets with many strongly associated covariates, resulting in enormous numbers of
models with similar explanatory ability but different covariates and regression estimates. Regarding the plurality of models compatible with
observational data-sets, Edward Leamer urged that "professionals ... properly withhold belief until an inference can be shown to be adequately
insensitive to the choice of assumptions".[23]

National accounts
From Wikipedia, the free encyclopedia

National accounts is included in the JEL classification codes as JEL: C82 and JEL: E01.

National accounts or national account systems (NAS) are the implementation of complete and consistent accounting techniques for measuring the
economic activity of a nation. These include detailed underlying measures that rely on double-entry accounting. By design, such accounting makes the
totals on both sides of an account equal even though they each measure different characteristics, for example production and the income from it. As
a method, the subject is termed national accounting or, more generally, social accounting.[1] Stated otherwise, national accounts as systems may be
distinguished from the economic data associated with those systems.[2] While sharing many common principles with business accounting, national
accounts are based on economic concepts.[3] One conceptual construct for representing flows of all economic transactions that take place in an economy
is a social accounting matrix with accounts in each respective row-column entry.[4]
National accounting has developed in tandem with macroeconomics from the 1930s with its relation of aggregate demand to total output through
interaction of such broad expenditure categories as consumption and investment.[5] Economic data from national accounts are also used for empirical
analysis of economic growth and development.[1][6]

Contents

 1 Scope
 2 Main components
 3 History
 4 See also
 5 References
 6 External links

Scope
National accounts broadly present the output, expenditure, and income activities of the economic actors (households, corporations, government) in an economy, including their relations with other countries' economies, and their wealth (net worth). They present both flows (measured over a period) and stocks (measured at the end of a period), ensuring that the flows are reconciled with the stocks. As to flows, the national income and product
accounts (in U.S. terminology) provide estimates for the money value of income and output per year or quarter, including GDP. As to stocks, the 'capital
accounts' are a balance-sheet approach that has assets on one side (including values of land, the capital stock, and financial assets) and liabilities
and net worth on the other, measured as of the end of the accounting period. National accounts also include measures of the changes in assets,
liabilities, and net worth per accounting period. These may refer to flow of funds accounts or, again, capital accounts.[1]
There are a number of aggregate measures in the national accounts, notably including gross domestic product or GDP, perhaps the most widely cited
measure of aggregate economic activity. Ways of breaking down GDP include as types of income (wages, profits, etc.) or expenditure (consumption,
investment/saving, etc.). Measures of these are examples of macro-economic data.[7][8][9][10] Such aggregate measures and their change over time are
generally of strongest interest to economic policymakers, although the detailed national accounts contain a source of information for economic analysis,
for example in the input-output tables which show how industries interact with each other in the production process.
National accounts can be presented in nominal or real amounts, with real amounts adjusted to remove the effects of price changes over time.[11] A
corresponding price index can also be derived from national output. Rates of change of the price level and output may also be of interest. An inflation
rate (growth rate of the price level) may be calculated for national output or its expenditure components. Economic growth rates (most commonly the
growth rate of GDP) are generally measured in real (constant-price) terms. One use of economic-growth data from the national accounts is in growth accounting across longer periods of time for a country, or across countries, to estimate different sources of growth, whether from growth of factor inputs or technological change.[12]
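The distinction between nominal and real amounts can be shown with a two-good, two-year toy economy: real GDP revalues the later year's quantities at the earlier year's prices, and the implicit price deflator is the ratio of nominal to real GDP. All prices and quantities below are invented.

    # Toy nominal vs. real GDP and implicit deflator for a two-good economy.
    prices_y0     = {"bread": 2.0, "machines": 100.0}
    quantities_y0 = {"bread": 500, "machines": 10}
    prices_y1     = {"bread": 2.2, "machines": 105.0}
    quantities_y1 = {"bread": 520, "machines": 11}

    def value(prices, quantities):
        return sum(prices[good] * quantities[good] for good in prices)

    nominal_y0 = value(prices_y0, quantities_y0)
    nominal_y1 = value(prices_y1, quantities_y1)
    real_y1    = value(prices_y0, quantities_y1)   # year-1 output at year-0 prices

    real_growth = real_y1 / nominal_y0 - 1         # constant-price growth rate
    deflator    = nominal_y1 / real_y1             # implicit price deflator (year 0 = 1)
    print(f"real GDP growth: {real_growth:.1%}, implied inflation: {deflator - 1:.1%}")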
The accounts are derived from a wide variety of statistical source data including surveys, administrative and census data, and regulatory data, which are
integrated and harmonized in the conceptual framework. They are usually compiled by national statistical offices and/or central banks in each country,
though this is not always the case, and may be released on both an annual and (less detailed) quarterly frequency. Practical issues include inaccuracies
from differences between economic and accounting methodologies, lack of controlled experiments on quality of data from diverse sources, and
measurement of intangibles and services of the banking and financial sectors.[13]
Two developments relevant to the national accounts since the 1980s include the following. Generational accounting is a method for measuring
redistribution of lifetime tax burdens across generations from social insurance, including social security and social health insurance. It has been proposed
as a better guide to the sustainability of a fiscal policy than budget deficits, which reflect only taxes minus spending in the current
year.[14] Environmental or green national accounting is the method of valuing environmental assets, which are usually not counted in measuring national
wealth, in part due to the difficulty of valuing them. The method has been proposed as an alternative to an implied zero valuation of environmental assets
and as a way of measuring the sustainability of welfare levels in the presence of environmental degradation.[15]
Macroeconomic data not derived from the national accounts are also of wide interest, for example some cost-of-living indexes, the unemployment rate,
and the labor force participation rate.[16] In some cases, a national-accounts counterpart of these may be estimated, such as a price index computed from
the personal consumption expenditures and the GDP gap (the difference between observed GDP and potential GDP).[17]
Main components
The presentation of national accounts data may vary by country (commonly, aggregate measures are given greatest prominence); however, the main national accounts include the following accounts for the economy as a whole and its main economic actors. A schematic sketch of how the balancing items link together follows the list.

 Current accounts:
production accounts which record the value of domestic output and the goods and services used up in producing that output. The balancing item
of the accounts is value added, which is equal to GDP when expressed for the whole economy at market prices and in gross terms;
income accounts, which show primary and secondary income flows - both the income generated in production (e.g. wages and salaries) and
distributive income flows (predominantly the redistributive effects of government taxes and social benefit payments). The balancing item of the
accounts is disposable income ("National Income" when measured for the whole economy);
expenditure accounts, which show how disposable income is either consumed or saved. The balancing item of these accounts is saving.

 Capital accounts, which record the net accumulation, as the result of transactions, of non-financial assets; and the financing, by way of saving and
capital transfers, of the accumulation. Net lending/borrowing is the balancing item for these accounts
 Financial accounts, which show the net acquisition of financial assets and the net incurrence of liabilities. The balance on these accounts is the net
change in financial position.
 Balance sheets, which record the stock of assets, both financial and non-financial, and liabilities at a particular point in time. Net worth is the
balance from the balance sheets (United Nations, 1993).
The accounts may be measured as gross or net of consumption of fixed capital (a concept in national accounts similar to depreciation in business
accounts).
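The way the balancing items cascade from one account to the next can be sketched schematically with invented whole-economy figures; actual compilation involves far more detail, sectoral breakdowns and reconciliation between sources.

    # Schematic sequence of accounts with invented figures, showing how each
    # balancing item feeds the next account.
    output                   = 1_800
    intermediate_consumption =   800
    value_added = output - intermediate_consumption        # production account (≈ GDP)

    net_income_flows = 15                                   # net primary/secondary income
    disposable_income = value_added + net_income_flows      # income accounts

    final_consumption = 780
    saving = disposable_income - final_consumption          # expenditure account

    capital_formation = 210
    capital_transfers = 5
    net_lending = saving + capital_transfers - capital_formation   # capital account

    print(f"value added (≈ GDP):          {value_added}")
    print(f"disposable income:            {disposable_income}")
    print(f"saving:                       {saving}")
    print(f"net lending(+)/borrowing(-):  {net_lending}")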
Notably absent from these components, however, is unpaid work, because its value is not included in any of the aforementioned categories of
accounts, just as it is not included in calculating gross domestic product (GDP). An Australian study has shown the value of this uncounted work to be
approximately 50% of GDP, making its exclusion rather significant.[18] As GDP is tied closely to the national accounts system,[19] this may lead to a
distorted view of national accounts. Because national accounts are widely used by governmental policy-makers in implementing controllable
economic agendas,[20] some analysts have advocated for either a change in the makeup of national accounts or adjustments in the formulation
of public policy.[21]

History
The original motivation for the development of national accounts and the systematic measurement of employment was the need for accurate
measures of aggregate economic activity. This was made more pressing by the Great Depression and as a basis
for Keynesian macroeconomic stabilisation policy and wartime economic planning. The first efforts to develop such measures were undertaken in the
late 1920s and 1930s, notably by Colin Clark and Simon Kuznets. Richard Stone of the U.K. led later contributions during World War II and thereafter.
The first formal national accounts were published by the United States in 1947. Many European countries followed shortly thereafter, and the United
Nations published A System of National Accounts and Supporting Tables in 1952.[1][22] International standards for national accounting are defined by
the United Nations System of National Accounts, with the most recent version released for 2008.[23]
Even before that, in the early 1920s, there were national economic accounting tables. One such system, called the Balance of the national economy, was used in the USSR and other socialist countries to measure the efficiency of socialist production.[24]
In Europe, the worldwide System of National Accounts has been adapted in the European System of Accounts (ESA), which is applied by members
of the European Union and many other European countries. Research on the subject continues from its beginnings through today.[25]

Official statistics
From Wikipedia, the free encyclopedia

Official statistics on Germany in 2010, published in UNECE Countries in Figures 2011.

Official statistics are statistics published by government agencies or other public bodies such as international organizations as a public good. They
provide quantitative or qualitative information on all major areas of citizens' lives, such as economic and social development,[1] living
conditions,[2] health,[3] education,[4] and the environment.[5]
During the 15th and 16th centuries, statistics were a method for counting and listing populations and State resources. The term statistics comes from the New Latin statisticum collegium (council of state) and refers to the science of the state.[6] According to the Organisation for Economic Co-operation and Development, official statistics are "statistics disseminated by the national statistical system, excepting those that are explicitly not to be official".[7]
Governmental agencies at all levels, including municipal, county, and state administrations, may generate and disseminate official statistics. This broader
possibility is accommodated by later definitions. For example:
Almost every country in the world has one or more government agencies (usually national institutes) that supply decision-makers and other users
including the general public and the research community with a continuing flow of information (...). This bulk of data is usually called official statistics.
Official statistics should be objective and easily accessible and produced on a continuing basis so that measurement of change is possible.[8]
Official statistics result from the collection and processing of data into statistical information by a government institution or international organisation.
They are then disseminated to help users develop their knowledge about a particular topic or geographical area, make comparisons between countries or
understand changes over time. Official statistics make information on economic and social development accessible to the public, allowing the impact of
government policies to be assessed, thus improving accountability.
Contents

 1 Aim
 2 Various categories
 3 Most common indicators used in official statistics
 4 Users
o 4.1 Users with a general interest
o 4.2 Users with a business interest
o 4.3 Users with a research interest
 5 Producers at the national level
 6 Production process
 7 Data revision
 8 Data Sources
o 8.1 Statistical survey or sample survey
o 8.2 Census
o 8.3 Register
 9 Official Statistics presentation
 10 Release
 11 Quality criteria to be respected
o 11.1 Relevance
o 11.2 Impartiality
o 11.3 Dissemination
o 11.4 Independence
o 11.5 Transparency
o 11.6 Confidentiality
o 11.7 International standards
 12 See also
 13 References
 14 Further reading
 15 External sources

Aim
Official statistics provide a picture of a country or of different phenomena through data and images such as graphs and maps. Statistical information covers different subject areas (economic, demographic, social, etc.). It provides basic information for decision making, evaluations and assessments at different levels.
The goal of statistical organizations is to produce relevant, objective and accurate[9] statistics to keep users well informed and assist good policy and
decision-making.
Various categories
The Fundamental Principles of Official Statistics were adopted in 1992 by the United Nations Economic Commission for Europe, and subsequently
endorsed as a global standard by the United Nations Statistical Commission.[10] According to the first Principle "Official statistics provide an indispensable
element in the information system of a democratic society, serving the government, the economy and the public with data about the economic,
demographic, social and environmental situation".[11]
The categorization of the domains of official statistics has been further developed in the Classification of Statistical Activities, endorsed by the
Conference of European Statisticians and various other bodies.[12]

Most common indicators used in official statistics


Statistical indicators provide an overview of the social, demographic and economic structure of society. Moreover, these indicators facilitate comparisons
between countries and regions.
For population, the main indicators are:

 Total population
 Population density
 Population by age
 Life expectancy at birth and at age 65
 Foreign born
 Foreigners in population
 Total fertility rate
 Infant mortality
The gender statistics include:

 Women in labour force
 Gender pay gap[13]
In the employment category:

 Employment rate
 Unemployment rate
 Youth unemployment rate
 Economic activity rate (women and men)
 Employment in major sectors: agriculture, industry, services
There are many indicators for the economy:

 Gross Domestic Product
 Gross Domestic Product per capita
 Real GDP growth rate
 GDP by major economic sectors: agriculture, industry, services
 Consumer price index[14]
 Purchasing Power Parity[15]
 Exchange rate
 Gross external debt
For trade indicators we find:

 Exports of goods and services
 Imports of goods and services
 Balance of payments[16]
 Trade balance
 Major import partners
 Major export partners
Environment indicators:

 Land use
 Water supply and consumption
 Environmental protection expenditure
 Generation and treatment of waste
 Chemical use
For the energy field:

 Total energy consumption
 Primary energy sources
 Energy consumption in transport
 Electricity consumption
 Consumption of renewable energy sources

Users
The three user types of official statistics

Official statistics are intended for a wide range of users including governments (central and local), research institutions, professional statisticians,
journalists and the media, businesses, educational institutions and the general public. There are three types of users: those with a general interest,
business interest or research interest. Each of these user groups has different needs for statistical information.
Users with a general interest
Users with a general interest include the media, schools and the general public. They use official statistics in order to be informed on a particular topic or
to observe trends in the society of a local area, a country or a region of the world.
Users with a business interest
Users with a business interest include decision makers and other users with a specific interest for which they need more detailed information. For them,
official statistics are an important reference, providing information on the phenomena or circumstances their own work focuses on. For instance, such
users will take official statistics into consideration before launching a product, or before deciding on a specific policy or marketing strategy. As with
the general interest users, this group does not usually have a deep understanding of statistical methodology, but it needs more detailed information
than the general users.
Users with a research interest
Users with a research interest include universities, consultants and government agencies. They generally understand something about statistical
methodology and want to dig deeper into the facts and the statistical observations; they have an analytical purpose in identifying or explaining
interrelations between the causes and effects of different phenomena. In this field, official statistics are also used to assess a government's policies.
One common point for all these users is their need to be able to trust the official information. They need to be confident that the results published are
authoritative and unbiased. Producers of official statistics must maintain a reputation of professionalism and independence.
The statistical system must be free from interference that could influence decisions on the choice of sources, methods used for data collection, the
selection of results to be released as official, and the timing and form of dissemination. Statistical business processes should be transparent and
follow international standards of good practice.
Statistical programs are decided on an annual or multi-annual basis by governments in many countries. They also provide a way to judge the
performance of the statistical system.

Producers at the national level


See also: List of national and international statistical services
Official statistics are collected and produced by national statistical organisations (NSOs), or other organisations (e.g. central banks) that form part of
the national statistical system in countries where statistical production is de-centralized. These organisations are responsible for producing and
disseminating official statistical information, providing the highest quality data. Quality in the context of official statistics is a multi-faceted concept,
consisting of components such as relevance, completeness, timeliness, accuracy, accessibility, clarity, cost-efficiency, transparency, comparability and
coherence.
The core tasks of NSOs, for both centralized and decentralized systems, are determining user needs and filtering these for relevance. Then they
transform the relevant user needs into measurable concepts to facilitate data collection and dissemination. The NSO is in charge of the coordination
between statistical producers and of ensuring the coherence and compliance of the statistical system to agreed standards. The NSO has a coordination
responsibility as its President/Director General represents the entire national system of official statistics, both at the national and at international levels.

Production process
The production process of official statistics comprises eight phases, as documented in the Generic Statistical Business Process Model (GSBPM); a minimal sketch follows the list below:

 Specify Needs
 Design
 Build
 Collect
 Process
 Analyse
 Disseminate
 Evaluate
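
Purely as an illustration, the eight GSBPM phases can be modelled as an ordered pipeline. The GsbpmPhase enum and run_statistical_production function below are hypothetical names, and the run_phase callback stands in for whatever work each phase actually involves.

    from enum import Enum

    class GsbpmPhase(Enum):
        SPECIFY_NEEDS = 1
        DESIGN = 2
        BUILD = 3
        COLLECT = 4
        PROCESS = 5
        ANALYSE = 6
        DISSEMINATE = 7
        EVALUATE = 8

    def run_statistical_production(run_phase):
        """Run one hypothetical production cycle, phase by phase, in GSBPM order."""
        for phase in GsbpmPhase:  # Enum iteration preserves declaration order
            run_phase(phase)

    # Usage: a stub callback that simply logs each phase.
    run_statistical_production(lambda phase: print(f"Phase {phase.value}: {phase.name}"))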

Data revision
Even after they have been published, some official statistics may be revised. Policy-makers may need preliminary statistics quickly for decision-making
purposes, but eventually it is important to publish the best available information, so official statistics are often published in several 'vintages'.
In order to understand the accuracy of economic data and the possible impact of data errors on macroeconomic decision-making, the Federal Reserve
Bank of Philadelphia has published a dataset[17] that records both initial real-time data estimates, and subsequent data revisions, for a large number of
macroeconomic series. A similar dataset for Europe[18] has been developed by the Euro-Area Business Cycle Network.
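
As a minimal sketch, with invented figures that are not taken from either of those datasets, successive vintages of the same series can be compared to measure the size of revisions:

    # Hypothetical quarterly GDP growth estimates (per cent), by publication vintage.
    vintages = {
        "first release":  {"2023Q4": 0.3, "2024Q1": 0.2},
        "second release": {"2023Q4": 0.4, "2024Q1": 0.1},
        "latest":         {"2023Q4": 0.5, "2024Q1": 0.1},
    }

    def revision(series, period, early="first release", late="latest"):
        """Revision = later estimate minus earlier estimate for the same reference period."""
        return series[late][period] - series[early][period]

    for period in ("2023Q4", "2024Q1"):
        print(period, "revised by", round(revision(vintages, period), 2), "percentage points")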

Data Sources
There are two sources of data for statistics. Primary, or "statistical" sources are data that are collected primarily for creating official statistics, and include
statistical surveys and censuses. Secondary, or "non-statistical" sources, are data that have been primarily collected for some other purpose
(administrative data, private sector data etc.).
Statistical survey or sample survey
A statistical survey or sample survey is an investigation of the characteristics of a phenomenon, carried out by collecting data from a sample of the
population and estimating its characteristics through the systematic use of statistical methodology.

o The main advantages are the direct control over data collection and the possibility to ask for data according to statistical definitions.
o Disadvantages include the high cost of data collection and the quality issues relating to non-response and survey errors.
There are various survey methods that can be used, such as direct interviewing, telephone, mail and online surveys.
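
A minimal sketch of design-based estimation from a simple random sample drawn without replacement from a population of known size; the household incomes and the population size below are invented for illustration.

    import math

    def estimate_total_and_se(sample_values, population_size):
        """Estimate a population total and its standard error from a simple random sample,
        using the standard design-based formulas with a finite population correction."""
        n, N = len(sample_values), population_size
        mean = sum(sample_values) / n
        var = sum((y - mean) ** 2 for y in sample_values) / (n - 1)  # sample variance
        total = N * mean                                             # scale the sample mean up to the population
        se = N * math.sqrt((1 - n / N) * var / n)
        return total, se

    # Usage: incomes from a hypothetical sample of 5 households out of 1,000.
    total, se = estimate_total_and_se([30_000, 42_000, 28_500, 51_000, 39_000], population_size=1_000)
    print(f"Estimated total income: {total:,.0f} +/- {1.96 * se:,.0f} (approximate 95% interval)")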
Census
A census is a complete enumeration of a population or of groups at a point in time with respect to well-defined characteristics (population, production, etc.). Data
are collected for a specific reference period. A census should be taken at regular intervals in order to have comparable information available; therefore,
most statistical censuses are conducted every 5 or 10 years. Data are usually collected through questionnaires mailed to respondents or completed via the
Internet, or by an enumerator visiting respondents or contacting them by telephone.

o An advantage is that censuses provide better data than surveys for small geographic areas or sub-groups of the population. Census data can also
provide a basis for sampling frames used in subsequent surveys.
o The major disadvantage of censuses is usually the high cost associated with planning and conducting them, and processing the resulting data.
In 2005, the United Nations Economic and Social Council adopted a resolution urging: "Member States to carry out a population and housing census and
to disseminate census results as an essential source of information for small area, national, regional and international planning and development; and to
provide census results to national stakeholders as well as the United Nations and other appropriate intergovernmental organizations to assist in studies
on population, environment, and socio-economic development issues and programs".[19]
Register
A register is a database that is updated continuously for a specific purpose and from which statistics can be collected and produced. It contains
information on a complete group of units.

o An advantage is total coverage at a comparatively low cost of collection and processing. Registers allow more detailed statistics to be produced than
surveys do. Different registers can be combined and linked together on the basis of defined keys (personal identification codes, business
identification codes, address codes, etc.), as in the sketch at the end of this subsection. Moreover, individual administrative registers are usually of high quality and very detailed.
o A disadvantage is possible under-coverage, which can occur if the incentive or the cultural tradition for registering events and changes is
weak, if the classification principles of the register are not clearly defined, or if the classifications do not correspond to the needs of the statistical
production to be derived from them.
There are different types of registers:
 Administrative registers[20] or records can help the NSO in collecting data. Using existing administrative data for statistical production may be
accepted by the public because it can be seen as a cost-efficient method: individuals and enterprises face a lower response burden, and data
security is better because fewer people handle the data and the data are in electronic format.
 Private registers, such as registers operated by insurance companies and employer organizations, can also be used in the production of
official statistics, provided there is an agreement or legislation allowing this.
 Statistical registers are frequently based on combined data from different administrative registers or other data sources.
 Businesses are often legally required to be entered in their country's business register, a system that makes the collection of business
information easier.
 Agricultural registers and registers of dwellings also exist.
Even though different types of data collection exist, the best estimates are usually based on a combination of sources, exploiting the strengths and
reducing the weaknesses of each individual source.
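
A minimal sketch of linking two registers on a shared key, as mentioned above; the register contents, identification codes and the derived statistic are entirely hypothetical.

    # Hypothetical population register and tax register, keyed by a personal identification code.
    population_register = {
        "ID001": {"municipality": "North"},
        "ID002": {"municipality": "South"},
    }
    tax_register = {
        "ID001": {"taxable_income": 35_000},
        "ID002": {"taxable_income": 48_000},
        "ID003": {"taxable_income": 27_000},  # present in only one register: an under-coverage risk
    }

    def link_registers(base, other):
        """Join two registers on their shared key, keeping only units found in both."""
        return {key: {**base[key], **other[key]} for key in base.keys() & other.keys()}

    # Example derived statistic: average taxable income by municipality.
    linked = link_registers(population_register, tax_register)
    by_area = {}
    for record in linked.values():
        by_area.setdefault(record["municipality"], []).append(record["taxable_income"])
    for area, incomes in sorted(by_area.items()):
        print(area, sum(incomes) / len(incomes))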

Official Statistics presentation


Official statistics can be presented in different ways. Analytical texts and tables are the most traditional ways. Graphs and charts summarize data and
highlight their information content visually. They can be extremely effective in expressing key results or illustrating a presentation. Sometimes a picture is
worth a thousand words. Graphs and charts usually have a heading describing the topic.
There are different types of graphics, but usually the data determine the type to be used (a minimal plotting sketch follows the list below).

 To illustrate changes over time, a line graph would be recommended. This is usually used to display variables whose values represent a regular
progression.

Stacked bar chart showing the sectoral contribution to total business services growth, 2001-2005 for members of UNECE.

 For categorical data, it is better to use a bar graph, either vertical or horizontal. Bar graphs are often used to represent percentages and rates, to
compare countries or groups, or to illustrate changes over time. The same variable can be plotted against itself for two groups; an example of this is the
age pyramid.
 Pie charts can be used to represent shares of a whole (100 per cent). Pie charts highlight the topic well only when there are few segments.
 Stacked bar charts, whether vertical or horizontal, are used to compare compositions across categories. They can be used to compare percentage
composition and are most effective for categories that add up to 100 per cent, which makes a full stacked bar chart. Their use is usually restricted to a
small number of categories.
 Tables are a complement to related texts and support the analysis. They help to keep detailed numbers out of the running text and eliminate the need to
discuss minor values that are not essential. Tables rank data by order or other hierarchies to make the numbers easy to understand. They
usually show the figures from the highest to the lowest.
 Another type of visual presentation of statistical information is the thematic map. Thematic maps can be used to illustrate differences or similarities between
geographical areas, regions or countries. The most common statistical map is the choropleth map, in which different shades of a
colour are used to highlight contrasts between regions; a darker colour means a greater statistical value. This type of map is best used for ratio[21] data;
for other data, proportional or graduated symbol maps, such as circles, are preferred, with the size of the symbol increasing in proportion to the value
of the observed object.
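
A minimal plotting sketch, assuming the matplotlib library is available; the growth rates and sector shares are invented and simply stand in for any indicator that might be shown as a line graph or as a full stacked bar chart.

    import matplotlib.pyplot as plt

    years = [2019, 2020, 2021, 2022, 2023]
    gdp_growth = [1.8, -4.2, 3.5, 2.1, 0.9]        # hypothetical real GDP growth, per cent
    sectors = {                                    # hypothetical sector shares of GDP, per cent
        "agriculture": [3, 3, 3, 2, 2],
        "industry":    [27, 25, 26, 26, 25],
        "services":    [70, 72, 71, 72, 73],
    }

    fig, (line_ax, bar_ax) = plt.subplots(1, 2, figsize=(10, 4))

    # Line graph: change over time.
    line_ax.plot(years, gdp_growth, marker="o")
    line_ax.set_title("Real GDP growth rate")
    line_ax.set_ylabel("per cent")

    # Full stacked bar chart: a composition adding up to 100 per cent.
    bottom = [0] * len(years)
    for sector, shares in sectors.items():
        bar_ax.bar(years, shares, bottom=bottom, label=sector)
        bottom = [b + s for b, s in zip(bottom, shares)]
    bar_ax.set_title("GDP by major economic sector")
    bar_ax.legend()

    plt.tight_layout()
    plt.show()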

Release
Official statistics are part of our everyday life. They are everywhere: in newspapers, on television and radio, in presentations and discussions. For most
citizens, the media provide their only exposure to official statistics. Television is the primary news source for citizens in industrialized countries, even
if radio and newspapers still play an important role in the dissemination of statistical information. On the other hand, newspapers and specialized
economic and social magazines can provide more detailed coverage of statistical releases as the information on a specific theme can be quite
extensive. Official statistics provide us with important information on the situation and the development trends in our society.
Users can also gather information through the services of national statistical offices, most easily on each agency's website. The
development of computing technologies and the Internet has enabled users - businesses, educational institutions and households, among others - to
access statistical information online, and the Internet has become an important dissemination tool for statistical producers. The supply of information
from statistical agencies has increased, and today the more advanced agencies present the information on their websites in an understandable way, often
categorized for different groups of users. Several glossaries have been set up by organizations and statistical offices to provide further information and
definitions in the field of statistics, and consequently of official statistics.

Quality criteria to be respected


The quality criteria of a national statistical office are the following: relevance, impartiality, dissemination, independence, transparency, confidentiality and
international standards. These principles apply not only to the NSO but to all producers of official statistics. Therefore, not every figure reported by
a public body should be considered official statistics; only those produced and disseminated according to these principles qualify. Adherence to the principles
will enhance the credibility of the NSO and other official statistical producers and build public trust in the reliability of the information and results that are
produced.
Relevance
Relevance is the first and most important principle to be respected by national statistical offices. When released, information, data and official statistics
should be relevant in order to fulfil the needs of users as well as of both public and private sector decision makers. The production of official statistics is relevant
if it corresponds to the needs of different users, such as the public, governments, businesses, the research community, educational institutions, NGOs and
international organizations, and if it satisfies the basic information needs in each area and the citizens' right to information.
Impartiality
Once a survey has been carried out, the NSO checks the quality of the results, and the results then have to be disseminated no matter what impact they may
have on some users, whether favourable or not. All should accept the results released by the NSO as authoritative, and users need to perceive the results as an
unbiased representation of relevant aspects of society. Moreover, the impartiality principle implies that NSOs have to use understandable terminology in
disseminated statistics, questionnaires and published material, so that everyone can have access to their information.
Dissemination
In order to maximize dissemination, statistics should be presented in a way that facilitates proper interpretation and meaningful comparisons. To reach
the general public and non-expert users when disseminating, NSOs have to add explanatory comments to explain the significance of the results released
and make analytical comments when necessary. There is a need to identify clearly what the preliminary, final and revised results are, in order to avoid
confusion for users. All results of official statistics have to be publicly accessible. There are no results that should be characterized as official and for the
exclusive use of the government. Moreover, they should be disseminated simultaneously.
Independence
Users can be consulted by NSOs, but decisions should be made by the statistical bodies themselves. The information and activities of producers of official statistics
should be independent of political control, and NSOs have to be free of any political interference that could influence their work and thus the
results. They should not give policy advice or make policy-perspective comments on the results released at any time, even at press conferences or in
interviews with the media.
Transparency
Transparency is essential for NSOs to gain the trust of the public. They have to disclose to the public the methods they use to produce official
statistics and be accountable for all the decisions they take and the results they publish. Statistical producers should also warn users against likely
misinterpretations and false conclusions, even if they try to be as precise as possible. Furthermore, the accuracy and timeliness of results must be
assessed prior to release. If errors in the results are found before or after a data revision,[22] they should be corrected promptly and the corrected
information disseminated to users at the earliest possible time. Producers of official statistics have to set up analytical systems in order to review and
improve their activities and methods.
Confidentiality
The national statistical office must protect the privacy of individual respondents, whether persons or businesses, in all the data it collects. Government
units such as public institutions, by contrast, cannot invoke statistical confidentiality. All respondents have to be informed about the purpose and legal basis of
the survey, and especially about the confidentiality measures. The statistical office should not release any information that could identify an individual or
group without prior consent. After data collection, replies should go back directly to the statistical producer, without involving any intermediary, and
filled-in paper and electronic forms containing full names should be destroyed once processed.
International standards
The use of international standards at the national level aims to improve international comparability for national users and to facilitate decision-making,
especially on controversial issues. Moreover, the overall structure, including concepts and definitions, should follow internationally accepted standards,
guidelines or good practices. International recommendations and standards for statistical methods approved by many countries provide a
common basis, such as the two standards of the International Monetary Fund: the Special Data Dissemination Standard (SDDS) and the General
Data Dissemination System (GDDS). Their aim is to guide countries in the dissemination of their economic and financial data to the public. Once adopted,
these standards have to be observed by all producers of official statistics, not only by the NSO.
Demographic statistics
From Wikipedia, the free encyclopedia



Demographic statistics are measures of the characteristics of, or changes to, a population. Records of births, deaths, marriages, immigration and
emigration and a regular census of population provide information that is key to making sound decisions about national policy.
A useful summary of such data is the population pyramid. It provides data about the sex and age distribution of the population in an accessible graphical
format.
Another summary is called the life table. For a cohort of persons born in the same year, it traces and projects their life experiences from birth to death.
For a given cohort, the proportion expected to survive each year (or decade in an abridged life table) is presented in tabular or graphical form.
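
A minimal sketch of the survivorship column of an abridged life table, assuming hypothetical probabilities of dying within each ten-year age interval; real life tables are built from observed mortality rates and contain further columns.

    # Hypothetical probabilities of dying within each age interval (abridged, ten-year groups).
    q = {"0-9": 0.006, "10-19": 0.002, "20-29": 0.005, "30-39": 0.008,
         "40-49": 0.018, "50-59": 0.045, "60-69": 0.110, "70-79": 0.280, "80+": 1.000}

    def survivorship(qx, radix=100_000):
        """Return l(x): the number, out of `radix` births, surviving to the start of each age group."""
        lx, alive = {}, radix
        for age_group, prob_dying in qx.items():
            lx[age_group] = round(alive)
            alive *= 1 - prob_dying
        return lx

    for age_group, survivors in survivorship(q).items():
        print(f"{age_group:>6}: {survivors:>7} survivors per 100,000 births")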
The ratio of males to females by age indicates the consequences of differing mortality rates on the sexes. Thus, while values above one are common for
newborns, the ratio dwindles until it is well below one for the older population.

Contents

 1 Collection
 2 Population estimates and projections
 3 History
 4 Metadata
 5 Statistical sources
 6 See also
 7 Further reading
 8 External links

Collection
National population statistics are usually collected by conducting a census. However, because these are usually huge logistical exercises, countries
normally conduct censuses only once every five to 10 years. Even when a census is conducted it may miss counting everyone (known as undercount).
Also, some people counted in the census may be recorded in a different place than where they usually live, because they are travelling, for example (this
may result in overcounting). Consequently, raw census numbers are often adjusted to produce census estimates that distinguish such categories as resident
population, residents, tourists and other visitors, and nationals and aliens (non-nationals). For privacy reasons, particularly when counts are small, some
census results may be rounded, often to the nearest ten, hundred or thousand, and sometimes rounded randomly up or down, for example to within 3 of
the actual count.
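
A minimal sketch of one common form of such perturbation, unbiased random rounding to base 3; the function and the seeded generator are illustrative choices, not any particular agency's published algorithm.

    import random

    def random_round_base3(count, rng=random.Random(42)):
        """Randomly round a count to a multiple of 3 so that its expected value equals the true count."""
        remainder = count % 3
        if remainder == 0:
            return count
        # Round up with probability remainder/3, otherwise round down (unbiased on average).
        return count + (3 - remainder) if rng.random() < remainder / 3 else count - remainder

    # Usage: protect small cell counts in a hypothetical table before release.
    raw_counts = {"area A": 7, "area B": 2, "area C": 12}
    print({area: random_round_base3(c) for area, c in raw_counts.items()})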
Between censuses, administrative data collected by various agencies about population events such as births, deaths, and cross-border migration may be
used to produce intercensal estimates.

Population estimates and projections


Population estimates are usually derived from census and other administrative data, and are normally produced after the date the
estimate refers to.
Some estimates, such as the usually resident population estimate, count the people who usually live in a locality as at the census date, even though the
census did not count them within that locality. Census questions usually include questions about where a person usually lives, whether they are a resident
or a visitor, and whether they also live somewhere else, to allow these estimates to be made.
Other estimates are concerned with estimating population on a particular date that is different from the census date, for example the middle or end of a
calendar or financial year. These estimates often use birth and death records and migration data to adjust census counts for the changes that have
happened since the census.
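
A minimal sketch of the demographic balancing equation that underlies such estimates: the population at the estimate date equals the census count plus births, minus deaths, plus net migration since the census. All figures below are invented.

    def intercensal_estimate(census_count, births, deaths, arrivals, departures):
        """Demographic balancing equation: roll a census count forward using registered events."""
        return census_count + births - deaths + (arrivals - departures)

    # Usage: a hypothetical estimate for a date some time after the census.
    estimate = intercensal_estimate(
        census_count=4_950_000, births=92_000, deaths=58_000,
        arrivals=140_000, departures=110_000,
    )
    print(f"Estimated resident population: {estimate:,}")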
Population projections are produced in advance of the date they are for. They use time series analysis of existing census data and other sources of
population information to forecast the size of future populations. Because there are unknown factors that may affect future population changes,
population projections often incorporate high and low variants as well as expected values for future populations. Population projections are often
recomputed after a census has been conducted, and the results also depend on how the boundaries of the area concerned are adjusted under a particular
demarcation.
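
A minimal sketch of a projection with low, expected and high variants, assuming a constant annual growth rate per scenario; official projections normally use cohort-component methods, so this is purely illustrative.

    def project_population(base_population, annual_growth_rate, years):
        """Project a population forward assuming a constant annual growth rate."""
        return [round(base_population * (1 + annual_growth_rate) ** t) for t in range(years + 1)]

    base = 5_000_000                                               # hypothetical base-year population
    scenarios = {"low": 0.002, "expected": 0.006, "high": 0.010}   # assumed annual growth rates
    for name, rate in scenarios.items():
        series = project_population(base, rate, years=10)
        print(f"{name:>8}: {series[-1]:,} after 10 years")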

History
While many censuses were conducted in antiquity, there are few population statistics that survive. One example though can be found in the Bible, in
chapter 1 of the Book of Numbers. Not only are the statistics given, but the method used to compile those statistics is also described. In modern-day
terms, this metadata about the census is probably of as much value as the statistics themselves as it allows researchers to determine not only what was
being counted but how and why it was done.

Metadata
Modern population statistics are normally accompanied by metadata that explains how the statistics have been compiled and adjusted to compensate for
any collection issues.

Statistical sources
Most countries have a census bureau or government agency responsible for conducting censuses. Many of these agencies publish their country's census
results and other population statistics on their agency's website.

See also
 Demographic window
 Census - Census Bureau, Census tract, Census block group, Census block.
 Intercensal estimate
 Population projection
