Principles of Impact Evaluation
PART I KEY CONCEPTS
Definition
Impact evaluation is an assessment of how the
intervention being evaluated affects outcomes,
whether these effects are intended or
unintended.
The proper analysis of impact requires a
counterfactual of what those outcomes would
have been in the absence of the intervention.
Impact evaluation guidelines accordingly
define impact as “the attainment of
development goals of the project or program,
or rather the contributions to their attainment”.
The ADB guidelines state the same point as
follows: “project impact evaluation establishes
whether the intervention had a welfare effect
on individuals, households, and communities,
and whether this effect can be attributed to the
concerned intervention”.
The counterfactual
Counterfactual: outputs and outcomes in the
absence of the intervention.
The counterfactual is necessary for comparing
actual outputs and outcomes to what they would
have been in the absence of the intervention, i.e.
with versus without.
The most common counterfactual is to use a
comparison group.
The difference in outcomes between the
beneficiaries of the intervention (the treatment
group) and the comparison group is a single
difference measure of impact.
This measure can suffer from various problems, so
that a double difference, comparing the difference in
the change in the outcome for treatment and
comparison groups, is to be preferred.
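To make the distinction concrete, the following is a minimal sketch using made-up outcome means (the numbers and names are purely illustrative): the single difference compares treatment and comparison groups after the intervention only, while the double difference also nets out any pre-existing gap between the two groups.

```python
# Illustrative difference-in-differences calculation with made-up outcome means.
# Hypothetical mean outcomes (e.g. household income) before and after the intervention.
treat_before, treat_after = 100.0, 130.0   # treatment group
comp_before, comp_after = 105.0, 115.0     # comparison group

# Single difference: treatment vs comparison after the intervention only;
# biased if the two groups already differed at baseline.
single_difference = treat_after - comp_after                                   # 15.0

# Double difference: change for the treatment group minus change for the comparison group.
double_difference = (treat_after - treat_before) - (comp_after - comp_before)  # 20.0

print(f"Single difference: {single_difference}")
print(f"Double difference (impact estimate): {double_difference}")
```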
Purpose of impact
evaluation
Impact evaluation serves both objectives of
evaluation: lesson-learning and accountability.
A properly designed impact evaluation can answer
the question of whether the program is working or
not, and hence assist in decisions about scaling up.
A well-designed impact evaluation can also answer
questions about program design: which bits work
and which bits don’t, and so provide policy-relevant
information for redesign and the design of future
programs.
We want to know why and how a program works,
not just if it does.
When to do an impact
evaluation?
It is not feasible to conduct impact evaluations
for all interventions.
Rather, the need is to build a strong evidence
base for all sectors, in a variety of contexts, to
provide guidance for policy-makers.
The following are examples of the types of
intervention for which impact evaluation would be
useful:
Innovative schemes
Pilot programs which are due to be substantially scaled up
Interventions for which there is scant solid evidence of impact in the given context
A selection of other interventions across an agency’s portfolio on an occasional basis
PART II EVALUATION
DESIGN
Key elements in evaluation design
Deciding whether to proceed with the evaluation
Identifying key evaluation questions
The evaluation design should be embedded in the
program theory
The comparison group must serve as the basis for a
credible counterfactual, addressing issues of selection
bias (the comparison group is drawn from a different
population than the treatment group) and contagion
(the comparison group is affected by the intervention
or a similar intervention by another agency).
Findings should be triangulated
The evaluation must be well contextualised
Establishing the program theory
The program theory documents the causal (or results)
chain from inputs to outcomes.
The theory is an expression of the log frame, but with
a more explicit analysis of the assumptions underlying
the theory.
Alternative causal paths may also be identified.
The theory must also allow for the major external
factors influencing outcomes.
Using the theory-based approach avoids ‘black box’
impact evaluations. Black box evaluations are those
which give a finding on impact, but no indication as to
why the intervention is or is not working. Answering the
why question requires looking inside the box, or along
the results chain.
Selecting the evaluation approach
A major concern in selecting the
evaluation approach is the way in which
the problem of selection bias will be
addressed.
How this will be done depends on an
understanding of how such biases may be
generated, which requires a good
understanding of how the beneficiaries are
identified by the program.
Figure 1 (Annex B) shows a
decision tree for selecting an
evaluation approach. The basic
steps in this decision tree are as
follows:
1. If the evaluation is being designed ex-ante, is
randomization possible? If the treatment group is
chosen at random then a random sample drawn from
the same population is a valid comparison group,
and will remain so provided contamination can be
avoided. This approach does not mean that targeting
is not possible. The random allocation may be to a
subgroup of the total population, e.g. from the poorest
districts.
2. If not, are all selection determinants observed? If they
are, then there are a number of regression-based
approaches which can remove the selection bias (a
sketch follows this list).
3. If the selection determinants are unobserved but are
thought to be time invariant, then using panel data
will remove their influence, so a baseline is essential
(or some means of substituting for a baseline).
4. If the study is ex post, so a panel is not possible, and
selection is determined by unobservables, then some
means of observing the supposed unobservables
should be sought. If that is not possible, then a
pipeline approach can be used if there are as yet
untreated beneficiaries.
5. If none of the above are possible then the problem
of selection bias cannot be addressed. Any impact
evaluation will have to rely heavily on the program
theory and triangulation to build an argument by
plausible association.
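As an illustration of step 2, the following is a minimal sketch, assuming all selection determinants are observed and can be included as controls in an ordinary regression. The data, the column names (farm_size, education) and the use of pandas/statsmodels are assumptions made for the example, not part of the guidelines.

```python
# Sketch of a regression-based impact estimate when all selection determinants are observed
# (decision-tree step 2). Controlling for the variables that drive programme participation
# removes the selection bias they would otherwise cause. All names and values are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "outcome":   [12.0, 15.5, 9.8, 14.2, 11.0, 16.1, 10.5, 13.7],
    "treated":   [1, 1, 0, 1, 0, 1, 0, 0],                    # 1 = programme beneficiary
    "farm_size": [2.0, 3.5, 1.8, 3.0, 2.2, 3.8, 1.5, 2.6],    # observed selection determinant
    "education": [6, 9, 5, 8, 7, 10, 4, 7],                   # observed selection determinant
})

# OLS of the outcome on treatment status plus the observed selection determinants;
# the coefficient on 'treated' is the conditional single-difference impact estimate.
model = smf.ols("outcome ~ treated + farm_size + education", data=df).fit()
print(model.params["treated"])
```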
What is Ex-ante
evaluation?
• The term ex-ante (sometimes written ex
ante or exante) is a phrase meaning
"before the event".
• Ex-ante is used most commonly in the
commercial world, where results of a
particular action, or series of actions, are
forecast in advance (or intended). The
opposite of ex-ante is ex-post (actual)
(or ex post).
• Ex-ante evaluation enables analysis of the anticipated
impacts of the planned programme.
• Analysis-based ex-ante evaluation endeavours to
optimise the structure of the programme, the
sequence of priorities, as well as the external and
internal coherence of the programme.
• This evaluation projects anticipated results, and
therefore estimates – in accordance with the indicators
and parameters of the concerned area – the economic
and social grounds of the approved priorities and
objectives.
• It justifies decisions recommending the use of funding,
and provides the information needed for donor
decisions and for monitoring local implementation.
Designing the baseline
survey
Ideally a baseline survey will be available so that double
difference estimates can be made. Important principles in
designing the survey are:
Conduct the baseline survey as early as possible.
The survey design must be based on the evaluation design
which is, in turn, based on the program theory. Data must be
collected across the results chain, not just on outcomes.
The comparison group sample must be of adequate size, and
subject to the same, or virtually the same, questionnaire. Whilst
some intervention-specific questions may not be appropriate,
similar questions of a more general nature can help test for
contagion.
• Multiple instruments (e.g. household and facility
level) are usually desirable, and must be coded in
such a way that they can be linked (a linking sketch
follows this list).
• Survey design takes time. Allow six months from
beginning design to going to the field, though 3-4
months can be possible. Test, test and re-test the
instruments. Run planned tabulations and analyses
with dummy data or the data from the pilot. Once
data are collected one to two months are required for
data entry and cleaning.
• Include information to allow tracing of the
respondents for later rounds of the survey, and ensure
that they can be linked in the data.
• Avoid changes in survey design between rounds.
Ideally the same team will conduct all rounds of the
survey.
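As a small illustration of linking instruments, the sketch below merges hypothetical household and facility records on a shared facility code; the column names and the use of pandas are assumptions for the example only.

```python
# Sketch of linking two survey instruments (household and facility questionnaires)
# through a shared identifier carried in both data sets. All names are hypothetical.
import pandas as pd

households = pd.DataFrame({
    "household_id": [1, 2, 3],
    "facility_id":  [10, 10, 11],   # code of the clinic/school serving the household
    "outcome":      [3.2, 4.1, 2.7],
})
facilities = pd.DataFrame({
    "facility_id": [10, 11],
    "staff_count": [5, 2],
})

# One row per household, with facility-level characteristics attached.
linked = households.merge(facilities, on="facility_id", how="left")
print(linked)
```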
Options when there is no
baseline
• Evaluations are often conducted ex post, and there is
no baseline available. Under these circumstances the
following options can be considered:
1. If treatment and comparison groups are drawn from
the same population and some means is found to
address selection bias, then a single difference
estimate is in principle valid. (That means will have to
be quasi-experimental, since randomization is ruled
out unless the treatment had been randomized; but
program designers who had thought of that would
most likely have thought of a baseline also.)
2. Find another data set to serve as a baseline. If there
was a baseline survey but with a poor or absent
comparison group, then a national survey might be
used to create a comparison group using propensity
score matching (a sketch follows this list).
3. Field a survey using recall on the variables of interest.
Many commentators are critical of relying on recall.
But all survey questions are recall, so it is a question of
degree. The evaluator needs to use his or her
judgment as to what it is reasonable to expect a
respondent to remember. It is reasonable to expect
people to recall major life changes, introduction of new
farming methods or crops, acquisition of large assets
and so on. But not the exact amounts and prices of
transactions. When people do recall there may be
telescoping (thinking things were more recent than
they were), so it is useful to refer to some widely
known event as a time benchmark for recall questions.
4. If all the above fail, then the study must
build a strong analysis of the causal
chain (program theory). Often a
relatively descriptive analysis can
identify breaks in the chain and so very
plausibly argue there was low impact.
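As an illustration of option 2, the following minimal sketch matches each beneficiary to the most similar non-beneficiary from a pooled survey on the estimated propensity score. The variable names and the use of scikit-learn are assumptions for the example, not a prescribed implementation.

```python
# Sketch of propensity score matching to construct a comparison group (option 2 above).
# Variable names, data values and the use of scikit-learn are illustrative only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Pooled data: programme beneficiaries plus candidate households from a national survey.
df = pd.DataFrame({
    "treated":   [1, 1, 1, 0, 0, 0, 0, 0],
    "farm_size": [2.5, 3.0, 1.8, 2.6, 1.5, 3.2, 2.0, 2.9],
    "education": [8, 6, 7, 7, 5, 9, 6, 8],
})

# Estimate the propensity score: probability of being a beneficiary given observables.
X = df[["farm_size", "education"]]
df["pscore"] = LogisticRegression().fit(X, df["treated"]).predict_proba(X)[:, 1]

treated = df[df["treated"] == 1]
untreated = df[df["treated"] == 0]

# Match each beneficiary to the untreated unit with the closest propensity score.
matches = [
    untreated.iloc[(untreated["pscore"] - p).abs().values.argmin()]
    for p in treated["pscore"]
]
comparison = pd.DataFrame(matches)
print(comparison[["farm_size", "education", "pscore"]])
```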