Wynne Harlen*
Faculty of Education, University of Cambridge
This article concerns the use of assessment for learning (formative assessment) and assessment of
learning (summative assessment), and how one can affect the other in either positive or negative
ways. It makes a case for greater use of teachers’ judgements in summative assessment, the reasons
for this being found in the research that is reviewed in the first sections of the article. This research,
concerning the impact of summative assessment, particularly high-stakes testing and examinations,
on students’ motivation for learning and on teachers and the curriculum, reveals some seriously
detrimental effects. Suggestions for changes that would reduce the negative effects include making
greater use of teachers’ summative assessment. However, this raises other issues, about the
reliability and validity of teachers’ assessment. Research on ways of improving the dependability of
teachers’ summative assessment suggests actions that would equally support more effective use of
assessment to help learning. The later sections of the article address the issues and opportunities
relating to the possibility of assessment that serves both formative and summative purposes, with
examples of what this means in practice, leading to the conclusion that the distinction between
formative and summative purposes of assessment should be maintained, while assessment systems
should be planned and implemented to enable evidence of students’ ongoing learning to be used for
both purposes.
Introduction
All assessment in the context of education involves making decisions about what is
relevant evidence for a particular purpose, how to collect the evidence, how to
interpret it and how to communicate it to intended users. Such decisions follow from
the purpose of conducting the assessment. These purposes include helping learning,
*Haymount Coach House, Bridgend, Duns, Berwickshire, TD11 3DJ, UK. Email:
wynne@torphin.freeserve.co.uk
teachers of using teachers’ judgements, are outlined. These point to actions that need
to be taken to improve the dependability of teachers’ assessments, actions that
coincide with the key features of using assessment formatively. This leads to the
discussion of how to bring about synergy between the processes of formative and
summative assessment.
The reference here to a threat to validity of the assessment is but one of several. High-
stakes tests are inevitably designed to be as ‘objective’ as possible, since there is a
premium on reliable marking in the interests of fairness. This has the effect of
reducing what is assessed to what can be readily and reliably marked. Generally this
excludes many worthwhile outcomes of education such as problem-solving and
critical thinking.
The review not only identified the negative impacts of testing, but also gave clues as to what actions could be taken to reduce these impacts. Suggested actions included, at the class level: explaining to students the purpose of tests and other assessments of their learning, and involving them in decisions about tests; using assessment to convey to students a sense of progress in their learning; providing feedback that helps further learning; and developing students’ self-assessment skills and their use of criteria relating to learning rather than to test performance. It is noteworthy that these actions refer to several of the key features of assessment used to help learning.
Implications for assessment policy were drawn from the findings by convening a
consultation conference of experts representing policy-makers, practitioners, teacher
educators and researchers. The policy implications included steps that should be taken
to reduce the high stakes of summative assessment, by using a wider range of indicators
of school performance, and by using a more valid approach to tracking standards at the
national level, through testing a sample of students rather than a whole age group. It was
also emphasized that more valid information about individual student performance was
needed than could be obtained through testing alone, and that more use should be made
of teachers’ judgements as part of summative assessment. We now turn to the potential
advantages and disadvantages of this latter course of action.
Two further systematic reviews of research (Harlen, 2004a, 2004b) were carried out
to bring together relevant evidence to answer these questions. The definition of
summative assessment by teachers adopted in the reviews was
The process by which teachers gather evidence in a planned and systematic way in order
to draw inferences about their students’ learning, based on their professional judgement,
and to report at a particular time on their students’ achievements.
This excludes the role of teachers as markers or examiners in the context of external
examinations, where they do not mark their own students’ work.
In addition to defining reliability and validity it was found useful to discuss
approaches in terms of dependability. The interdependence between the concepts of
reliability and validity means that increasing one tends to decrease the other.
Dependability is a combination of the two, defined in this instance as the extent to
which reliability is optimized while ensuring validity. This definition prioritizes
validity, since a main reason for using teachers’ assessment rather than depending
entirely on tests for external summative assessment is to increase the construct
validity of the assessment.
The main findings from the two systematic reviews of research on the use of
teachers’ assessment for summative purposes are given in Box 2.
• The extent to which the assessment tasks, and the criteria used in judging them, are specified are key variables affecting dependability. Where neither tasks nor criteria are well specified, dependability is low.
• Detailed criteria, describing progressive levels of competency, have been shown to be capable of supporting reliable assessment by teachers.
• Tightly specifying tasks does not necessarily increase reliability and is likely to reduce validity by reducing the opportunity for a broad range of learning outcomes to be included.
• Greater dependability is found where there are detailed, but generic, criteria that allow evidence to be gathered from the full range of classroom work.
• Bias in teachers’ assessments is generally due to teachers taking into account information about non-relevant aspects of students’ behaviour or being apparently influenced by gender, special educational needs, or the general or verbal ability of a student in judging performance in a particular task.
• Researchers claim that bias in teachers’ assessment is susceptible to correction through focused workshop training.
Opportunities
There is considerable similarity in some of the implications from the research
evidence in the three reviews, relating particularly to: the importance of providing
non-judgemental feedback that helps students know where they are in relation to
learning goals; the need for teachers to share with students the reasons for, and goals
of, assessment; the value to teachers of using assessment to learn more about their
students and to reflect on the adequacy of the learning opportunities being provided;
teachers and students placing less emphasis on comparisons among students and
more on individual development; and helping students to take responsibility for their
learning and work towards learning goals rather than performance goals. All these
points are ones that favour formative assessment as well as improving the
dependability and positive impact of summative assessment by teachers. It follows
that the actions teachers need to take in developing their assessment for summative
purposes overlap to a great extent with the actions required for practising formative
assessment.
The next section explores the extent to which assessment information can be used
for both summative and formative purposes, without the use for one purpose
endangering the effectiveness of use for the other. Some of those involved in
developing assessment have argued that the distinction is not helpful and that we
should simply strive for ‘good assessment’. Good formative assessment will support
good judgements by teachers about student progress and levels of attainment
(Hutchinson, 2001) and good summative assessment will provide feedback that can
be used to help learning. Maxwell (2004) describes progressive assessment, which we
consider below, as blurring the boundary between formative and summative
assessment. However, it remains the case that formative and summative are different
purposes of assessment and while the same information may be used for both, it is
necessary to ensure that the information is used in ways that serve these purposes. In practice, under current arrangements, information seems to be gathered initially with one of these purposes in mind and may or may not be used for the other. These are arguments to return to after looking at current practices and considering the possibility of collecting information designed for both purposes.
materials and opportunities for learning available and, most importantly, making
clear the purposes and goals of the work.
Some examples of using assessment in this way are provided by Maxwell (2004)
and Black et al. (2003). Maxwell describes the approach to assessment used in the
Senior Certificate in Queensland, in which evidence is collected over time in a
student portfolio, as ‘progressive assessment’. He states that
All progressive assessment necessarily involves feedback to the student about the quality
of their (sic) performance. This can be expressed in terms of the student’s progress
towards desired learning outcomes and suggested steps for further development and
improvement…
For this approach to work, it is necessary to express the learning expectations in terms of
common dimensions of learning (criteria). Then there can be discussion about whether
the student is on-target with respect to the learning expectations and what needs to be
done to improve performance on future assessment where the same dimensions appear.
As the student builds up the portfolio of evidence of their performance, earlier assessment
may be superseded by later assessment covering the same underlying dimensions of
learning. The aim is to report ‘where the student got to’ in their learning journey, not
where they started or where they were on the average across the whole course. (Maxwell,
2004, pp. 2, 3)
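The way later evidence supersedes earlier evidence on the same dimension can be pictured with a short sketch. The following Python fragment is purely illustrative: the dimension names, levels and dates are invented, and it does not represent the Queensland system’s actual machinery. It simply keeps, for each dimension of learning, the most recent judgement in the portfolio, so that what is reported is ‘where the student got to’.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Evidence:
        dimension: str   # common dimension of learning (criterion)
        level: str       # judged level on that dimension
        when: date       # when the piece of work was assessed

    def latest_by_dimension(portfolio: list[Evidence]) -> dict[str, Evidence]:
        """Keep only the most recent judgement for each dimension."""
        latest: dict[str, Evidence] = {}
        for item in sorted(portfolio, key=lambda e: e.when):
            latest[item.dimension] = item  # later evidence overwrites earlier evidence
        return latest

    portfolio = [
        Evidence('argumentation', 'developing', date(2004, 3, 1)),
        Evidence('use of evidence', 'secure', date(2004, 4, 20)),
        Evidence('argumentation', 'secure', date(2004, 6, 15)),  # supersedes the March judgement
    ]

    for dimension, evidence in latest_by_dimension(portfolio).items():
        print(dimension, '->', evidence.level)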
innovation was to ask students to set test questions and devise marking schemes. This
helped them ‘both to understand the assessment process and to focus further efforts
for improvement’ (Black et al., 2003, p. 54). The third change was for the teachers to
use the outcome of tests diagnostically and to involve students in marking each
other’s tests, in some cases after devising the mark scheme. This has some similarity
to the approach reported by Carter (1997), which she called ‘test analysis’. In this the
teacher returned test papers to students after indicating where there were errors, but
leaving the students to find and correct these errors. The students’ final mark
reflected their response to the test analysis as well as the initial answers. Carter
described this as shifting the responsibility for learning to the students, who were
encouraged to work together to find and correct their errors.
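Carter does not give a formula for how the final mark was arrived at, so the sketch below is only an assumption made for illustration: the mark for the first attempt and the mark obtained after the students have found and corrected their own errors are combined with an arbitrary equal weighting.

    def final_mark(initial_score: float, corrected_score: float,
                   weight_on_correction: float = 0.5) -> float:
        """Combine the first attempt with the mark after 'test analysis'
        (students finding and correcting their own errors); the weighting
        is an assumption, not taken from Carter (1997)."""
        return ((1 - weight_on_correction) * initial_score
                + weight_on_correction * corrected_score)

    # A student scoring 60% initially and 90% after correcting errors:
    print(final_mark(60, 90))  # 75.0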
These approaches are ones that teachers can use in the context of classroom tests
over which they have complete control. Black et al. (2003) noted that when external
tests are involved, the process can move ‘from developing understanding to "teaching to the test". More generally, the pressures exerted by current external testing and assessment requirements are not fully consistent with good formative practices’
(Black et al., 2003, p. 56). These teachers used their creativity to graft formative value
on to summative procedures. A more fundamental change is needed if assessment is
to be designed to serve both purposes from the start.
There is the potential for such change in the use of computers for assessment,
which provide the opportunity for assessment to serve both formative and
summative purposes. In the majority of studies of the use of ICT for assessment
of creative and critical thinking, reviewed by Harlen & Deakin Crick (2003b), the
assessment was intended to help development of understanding and skills as well as
to assess attainment in understanding and skills. The effectiveness of computer
programs for both these purposes was demonstrated by those studies where
computer-based assessment was compared with assessment by paper and pencil
(Jackson, 1989; Kumar et al., 1993). The mechanism for the formative impact was
the feedback that students received from the program. In some cases this was no
more than reflecting back to the students the moves or links they made between
concepts or variables as they attempted to solve a problem. In others (e.g. Osmundson et al., 1999), the feedback took the form of a ‘score’ for a concept map
that they created on the screen by dragging concepts and links. The score compared
the students’ maps with an ‘expert map’ and required a much greater degree of
analysis than could be provided in any other way. In other studies (Schacter et al.,
1997) the computer program used a record of all mouse clicks in order to provide feedback to the students, and information to the teacher, about the processes used in
reaching a solution. Schacter et al. referred to this as ‘bridging the gap between
testing and instruction’.
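A toy example may make the concept-map scoring more concrete. The sketch below is not the procedure used in the studies cited; it simply treats a map as a set of propositions (concept, link, concept) and reports the proportion of an assumed ‘expert map’ that the student’s map reproduces, the kind of comparison a program can carry out instantly for every student.

    # Illustrative only: the propositions and the scoring rule are assumptions,
    # not the scheme used by Osmundson et al. (1999).
    Proposition = tuple[str, str, str]  # (concept, link, concept)

    def map_score(student: set[Proposition], expert: set[Proposition]) -> float:
        """Fraction of the expert map's propositions present in the student's map."""
        return len(student & expert) / len(expert)

    expert_map = {
        ('rain', 'forms from', 'water vapour'),
        ('water vapour', 'rises by', 'evaporation'),
        ('clouds', 'contain', 'water droplets'),
    }
    student_map = {
        ('rain', 'forms from', 'water vapour'),
        ('clouds', 'contain', 'water droplets'),
        ('rain', 'causes', 'evaporation'),   # not in the expert map
    }

    print(round(map_score(student_map, expert_map), 2))  # 0.67, fed back to student and teacher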
In order for assessment to have a formative purpose it is necessary to be able to
report not only the students’ final performance, but also what processes students need
to improve in order to raise their performance. The collection of information about
processes, even if feasible in a non-computer-based assessment, is immensely time-
consuming and would not be a realistic approach to meeting the need for information
for improving learning. The use of computers makes this information available, in
some cases instantly, so that it provides feedback for the learner and the teacher that
can be used both formatively and summatively. In these cases the process of
assessment itself begins to impact on performance; teaching and assessment begin to
coalesce. Factors identified as benefits of using computers for learning then become equally valuable for assessment. These include: speed of processing, which
supports speed of learning; elements of motivation such as confidence, autonomy,
self-regulation and enthusiasm, which support concentration and effort; ease of
making revisions and improved presentation, which support quality of writing and
other products; and information handling and organization, which support understanding (NCET, 1994).
Figure 1. Formative and summative assessment using the same evidence but different criteria
reporting levels. In this process the change over time can be taken into account so
that, as in the Queensland portfolio assessment, preference is given to evidence that
shows progress during the period covered by the summative assessment. This process
is similar to the one teachers are advised to use in arriving at their teacher assessment for reporting at the end of key stages in the National Curriculum
assessment. The difference is that in the approach suggested here teachers have
gathered information in ways suggested above (incorporating the key features of
formative assessment) over the whole period of students’ learning, and used it to help
students with their learning.
The detailed indicators will map onto the broader criteria, as suggested in
Figure 1. The mapping will smooth out any misplacement of the detailed
indicators. But it is important not to see this mapping as a summation of
judgements about each indicator. Instead the evidence is re-evaluated against the
broader reporting criteria.
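As a minimal sketch of this re-evaluation (the criterion, indicators and pieces of evidence below are invented, not taken from any prescribed scheme), the detailed indicators can be thought of as a way of collecting and organizing evidence, while the reported level comes from judging the pooled evidence as a whole against the broader criterion rather than from averaging indicator-by-indicator judgements.

    from typing import Callable

    # Hypothetical mapping from a broad reporting criterion to detailed indicators.
    CRITERION_TO_INDICATORS = {
        'scientific enquiry': ['plans fair tests', 'interprets evidence', 'evaluates methods'],
    }

    def report_level(criterion: str,
                     evidence_by_indicator: dict[str, list[str]],
                     judge: Callable[[list[str]], str]) -> str:
        """Pool the classroom evidence linked to a criterion's indicators and
        judge it as a whole against the broad criterion (no summing of scores)."""
        pooled: list[str] = []
        for indicator in CRITERION_TO_INDICATORS[criterion]:
            pooled.extend(evidence_by_indicator.get(indicator, []))
        return judge(pooled)  # a holistic judgement stands in for the teacher's re-evaluation

    evidence = {
        'plans fair tests': ['investigation plan, June'],
        'interprets evidence': ['graph interpretation task, May'],
    }
    print(report_level('scientific enquiry', evidence, judge=lambda items: 'level 4'))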
Conclusion
What do the research evidence reviewed and the arguments presented here have to say about whether teachers’ summative assessment and assessment for learning need to be considered as distinct from each other, and about how they can be harmonized? There seems to be value in maintaining the distinction between
formative and summative purposes of assessment while seeking synergy in relation to
the processes of assessment. These different purposes are real. One can conduct the
same assessment and use it for different purposes just as one can travel between two
places for different purposes. Just as the purpose of a journey is the basis for judging its success, so the purpose of an assessment is the basis for judging whether it has been achieved. If we fuse, or confuse, formative and summative
purposes, experience strongly suggests that ‘good assessment’ will mean good
assessment of learning, not for learning.
It is suggested here that the synergy of formative and summative assessment
comes from making use of the same evidence for the two purposes. This can be done, as in the Queensland example, where work collected in the portfolio is used to
provide feedback to the students at the time it is completed as well as being used
later in assessing overall attainment. Here the procedures for using the assessment
to help learning are less well defined than in the approach that starts from the
formative use. Possibly different emphases are appropriate at different stages of
education, the detailed indicators being particularly suited to the primary level where teachers have the opportunity to gather evidence frequently but, at the same
time, need more structured help in deciding next steps across the range of subjects
they teach.
Synergy also comes from having the same person responsible for using the evidence
for both purposes. All assessment involves judgement and will therefore be subject to
some error and bias, as noted in the research findings. While this aspect has been
given attention in the context of teachers’ assessment for summative uses, it no doubt
exists in teachers’ assessment for formative purposes. Although it is not necessary to
be over-concerned about the reliability of assessment for this purpose (because it
occurs regularly and the teacher will be able to use feedback to correct a mistaken
judgement), the more carefully the assessment is made, the more value it will have in
helping learning. Thus the procedures for ensuring more dependable summative
assessment will benefit the formative use and, as noted, the teacher’s understanding
of the learning goals and the nature of progression in achieving them. For example,
experience shows that moderation of teachers’ judgements, necessary for external
uses of summative assessment, can be conducted so that it not only serves a quality
control function, but also has an impact on the process of assessment by teachers,
having a quality assurance function as well (ASF, 2004b). This will improve the
collection and use of evidence for a formative purpose as well as a summative
purpose.
The procedures that will most help both the effectiveness of formative assessment
and the reliability of summative assessment are those that involve teachers in planning
assessment and developing criteria. Through this involvement they develop owner-
ship of the procedures and criteria and understand the process of assessment,
including such matters as what makes an adequate sample of behaviour, as well as the
goals and processes of learning. This leads to the position that synergy between
formative and summative assessment requires that systems should be designed with
these two purposes in mind and should include arrangements for using evidence for
both purposes.
References
Ames, C. (1990) Motivation: what teachers need to know, Teachers College Record, 91, 409–21.
ARG (Assessment Reform Group) (2002) Testing, motivation and learning (Cambridge, Cambridge University Faculty of Education). Available from the ARG website: www.assessment-reform-group.org
ASF (2004a) ASF Working Paper 2. Available from the ARG website.
ASF (2004b) ASF Working Paper 1. Available from the ARG website.
Black, P. & Wiliam, D. (1998) Assessment and classroom learning, Assessment in Education, 5(1), 7–74.
Black, P., Harrison, C., Lee, C., Marshall, B. & Wiliam, D. (2002) Working inside the black box
(London, King’s College London).
Black, P., Harrison, C., Lee, C., Marshall, B. & Wiliam, D. (2003) Assessment for learning: putting it
into practice (Maidenhead, Open University Press).
Broadfoot, P., Pollard, A., Osborn, M., McNess, E. & Triggs, P. (1998) Categories, standards and
instrumentalism: theorizing the changing discourse of assessment policy in English primary
education, paper presented at the Annual Meeting of the American Educational Research
Association, 13–17 April, San Diego, California, USA.
Carter, C. R. (1997) Assessment: shifting the responsibility, Journal of Secondary Gifted Education,
9(2), Winter 1997/8, 68–75.
Crooks, T. J. (1988) The impact of classroom evaluation practices on students, Review of
Educational Research, 58, 438–81.
Cumming, J. & Maxwell, G. S. (2004) Assessment in Australian schools: current practice and
trends, Assessment in Education, 11(1), 89–108.
Dweck, C. S. (1992) The study of goals in psychology, Psychological Science, 3, 165–7.
Gordon, S. & Rees, M. (1997) High-stakes testing: worth the price?, Journal of School Leadership, 7,
345–68.
Harlen, W. (2004a) A systematic review of the evidence of reliability and validity of assessment by teachers used for summative purposes (EPPI-Centre Review), Research Evidence in Education Library, issue 3 (London, EPPI-Centre, Social Science Research Unit, Institute of Education). Available on the website at: http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_three.htm
Harlen, W. (2004b) A systematic review of the evidence of the impact on students, teachers and the curriculum of the process of using assessment by teachers for summative purposes (EPPI-Centre Review), Research Evidence in Education Library, issue 4 (London, EPPI-Centre, Social Science Research Unit, Institute of Education). Available on the website at: http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_four.htm
Harlen, W. & Deakin Crick, R. (2002) A systematic review of the impact of summative assessment and tests on students’ motivation for learning (EPPI-Centre Review), Research Evidence in Education Library, issue 1 (London, EPPI-Centre, Social Science Research Unit, Institute of Education). Available on the website at: http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_one.htm
Harlen, W. & Deakin Crick, R. (2003a) Teaching and motivation for learning, Assessment in Education, 10(2), 169–208.
Harlen, W. & Deakin Crick, R. (2003b) A systematic review of the impact on students and teachers of the use of ICT for assessment of creative and critical thinking skills (EPPI-Centre Review), Research Evidence in Education Library, issue 2 (London, EPPI-Centre, Social Science Research Unit, Institute of Education). Available on the website at: http://eppi.ioe.ac.uk/EPPIWeb/home.aspx?page=/reel/review_groups/assessment/review_two.htm
Harlen, W. & James, M. J. (1997) Assessment and learning: differences and relationships between
formative and summative assessment, Assessment in Education, 4(3), 365–80.
Harlen, W. & Qualter, A. (2004) The teaching of science in primary schools (4th edn) (London, David
Fulton).
Hutchinson, C. (2001) Assessment is for learning: the way ahead (Internal Policy Paper, Scottish
Executive Education Department (SEED)).
Jackson, B. (1989) A comparison between computer-based and traditional assessment tests, and
their effects on pupil learning and scoring, School Science Review, 69, 809–15.
Johnston, J. & McClune, W. (2000) Selection project sel 5.1: pupil motivation and attitudes–self-
esteem, locus of control, learning disposition and the impact of selection on teaching and
learning, in: The effects of the selective system of secondary education in Northern Ireland, Research
Papers, Vol. II (Bangor, Co. Down, Department of Education), 1–37.
Kellaghan, T., Madaus, G. F. & Raczek, A. (1996) The use of external examinations to improve student
motivation (Washington, DC, American Educational Research Association).
Kohn, A. (2000) The case against standardized testing (Portsmouth, NH, Heinemann).
Koretz, D. (1988) Arriving at Lake Wobegon: are standardized tests exaggerating achievement and distorting instruction?, American Educator, 12(2), 8–15, 46–52.
Koretz, D., Linn, R. L., Dunbar, S. B. & Shepard, L. A. (1991) The effects of high-stakes testing on
achievement: preliminary findings about generalization across tests, paper presented at the
annual meeting of the American Educational Research Association, 11 April, Chicago.
Kumar, D. (1993) Effect of HyperCard and traditional performance assessment methods on expert-novice
chemistry problem solving, annual meeting of the National Association for Research in Science
Teaching, Atlanta, GA.
Linn, R. (2000) Assessments and accountability, Educational Researcher, 29, 4–16.
Linn, R., Dunbar, S., Harnisch, D. & Hastings, C. (Eds) (1982) The validity of the Title I evaluation and reporting systems (Beverley Hills, CA, Sage).
Masters, G. & Forster, M. (1996) Progress maps (Victoria, Australian Council for Educational
Research).
Maxwell, G. S. (2004) Progressive assessment for learning and certification: some lessons from
school-based assessment in Queensland, paper presented at the third conference of the
Association of Commonwealth Examination and Assessment Boards, March, Nadi, Fiji.
National Council for Educational Technology (NCET) (1994) Integrated learning systems: a report of
the pilot evaluation of ILS in the UK (Coventry, NCET).
Osborn, M., McNess, E., Broadfoot, P., Pollard, A. & Triggs, P. (2000) What teachers do: changing
policy and practice in primary education (London, Continuum).
Osmundson, E., Chung, G., Herl, H. & Klein, D. (1999) Knowledge mapping in the classroom: a tool
for examining the development of students’ conceptual understandings, Research Report (Los
Angeles, CA, Centre for Research on Evaluation, Standards and Student Testing).
Pollard, A., Triggs, P., Broadfoot, P., McNess, E. & Osborn, M. (2000) What pupils say: changing
policy and practice in primary education (London, Continuum), chaps 7 and 10.
Reay, D. & Wiliam, D. (1999) ‘I’ll be a nothing’: structure, agency and the construction of identity
through assessment, British Educational Research Journal, 25, 345–54.
Schacter, J., Herl, H. E., Chung, G. K. W. K., O’Neil, H. F. O., Dennis, R. & Lee, J. J. (1997)
Feasibility of a web-based assessment of problem solving, annual meeting of the American
Educational Research Association, April, Chicago.
Shepard, L. (1991) Will national tests improve student learning?, Phi Delta Kappan, 72(4), 232–8.
Stiggins, R. J. (1999) Assessment, student confidence and school success, Phi Delta Kappan, 81(3),
191–8.
Wilson, M. (1990) Measurement of developmental levels, in: T. Husen & T. N. Postlethwaite
(Eds) International encyclopedia of education: research and studies. Supplementary vol. 2 (Oxford,
Pergamon), 152–8.
Wilson, M., Kennedy, C. & Draney, K. (2004) GradeMap (Version 4.0) [computer program]
(Berkeley, University of California, BEAR Center).
Wood, R. (1991) Assessment and testing: a survey of research (Cambridge, Cambridge University
Press).