Assessment in Mathematics Education Responding To Issues

ZDM (2018) 50:555–570
https://doi.org/10.1007/s11858-018-0963-z
SURVEY PAPER
Assessment in mathematics education: responding to issues

regarding methodology, policy, and equity
Guri A. Nortvedt1 · Nils Buchholtz1
Accepted: 19 June 2018 / Published online: 23 June 2018

© The Author(s) 2018
Abstract
In educational contexts, assessments may be designed to target students, preservice teachers, or teachers, either as individuals
or as representatives of a group, and for a multitude of purposes. One key aim of assessment in mathematics education is to
provide evidence that can be used to make decisions about or improve mathematics education, which then raises questions
about which aspects of mathematical competence should be assessed—as well as how and for what purpose. This review
paper addresses three related themes: (1) issues related to the assessment process and to the development of assessments that
can validly assess mathematical competence in all its complexity; (2) issues related to educational policy and policy-making
based on assessment data, in particular the reciprocal relationship between assessment and policy; and (3) issues related
to equity, such as gender issues or the achievement gap between majority and minority students. Awareness of the relation
between assessment, teaching, and learning is shown throughout the paper. Strong relationships between the three focus areas
are found, that impact assessment validity and call for further development of assessment practices in mathematics education.
Keywords Assessment · Validity · Methodology · Policy · Equity
1 Introduction higher education or to monitor or evaluate the effectiveness

of schools or educational systems. In addition, assessment
Why have a ZDM Mathematics Education issue on assess- outcomes are widely used to inform policy-making and deci-
ment in mathematics education? The last three decades have sions about educational reform (Elstad et al. 2009; Newton
seen an increasing focus on assessment (see, for instance, 2007; Nortvedt 2018).
Wiliam 2007; Suurtamm et al. 2016). From a strong focus Students, teachers, policy-makers, and even researchers
on examinations and knowledge-based assessments, assess- may have naïve and strong beliefs about the objectivity and
ments for learning, national tests, and international com- validity of assessments, including the belief that a single test
parative studies have gradually entered the scene. Today a or observation can tell the truth about the achievements of
wide range of different assessment formats and purposes students, teachers, or educational systems (Stobart 2008).
exist. Throughout the world, various forms of assessment The applied assessment and the purposes for which the data
are employed to elicit information that can be used to are to be used may not align well (Newton 2007), which
inform decisions about mathematics education at the indi- might be the case even when assessment is used for admis-
vidual, institutional, and national levels. Many scholars have sion to higher education or for policy-making.
claimed that assessment should be used primarily to improve Mathematics education as a research field is young com-
learning (Black and Wiliam 2012; Hattie and Timperley pared with mathematics itself (Kilpatrick 2014) or educa-
2007; Niss 1993; van den Heuvel-Panhuizen and Becker tion in general (Wiliam 2003). Thus discussions within
2003). Mathematics tests may also be used for admission to the field of mathematics education are often influenced by
discussions in the neighbouring disciplines. When we dis-
cuss assessment in mathematics education, the discussion is
* Guri A. Nortvedt
g.a.nortvedt@ils.uio.no often based on insights from educational research in com-
bination with our knowledge about mathematics education
1
Department of Teacher Education and School Research, and beliefs about mathematics. That is, current discussions
University of Oslo, PO Box 1099, Blindern, N0317 Oslo, of assessment in mathematics education reflect ongoing
Norway
13
Vol.:(0123456789)
556 G. A. Nortvedt, N. Buchholtz
discussions about the purpose of mathematics education in external or large-scale assessment. This section is some-
general. Past and current debates have demonstrated that what longer than the two following sections.
academics within our field might not share a unified under- 2. Policy issues related to the interpretation, use, and mis-
standing of the purpose of mathematics education (Niss use of assessment outcomes in policy development and
2007); we might not agree on which aspects of mathemat- the possible consequences for mathematics education.
ics are worth teaching, or how learners learn mathematics. This aspect includes a discussion of the reciprocal rela-
Our views about mathematics also colour what we believe tionship between assessment and policy.
should be assessed and how such assessment should be done, 3. Equity issues, including equity and social-justice issues,
in addition to which issues we tend to identify when discuss- and what should be taken into account to develop fair
ing current and future aspects of assessment in mathematics assessments. We use gender differences and issues
education. The mathematics education research community related to assessing migrant students as illustrative
has engaged in numerous debates about assessment instru- examples in the discussion of possible consequences of
ments, procedures, and outcomes over the past few decades. the current assessment policies and practices.
For instance, it is much easier to assess students’ calculation
skills than it is to assess their problem-solving skills, and The primary aim of this special issue is to move beyond
many teacher-made tests primarily comprise algorithmic traditional divides—such as large-scale versus classroom
tasks (Palm et al. 2011; Schoenfeld 2007). What should a assessment, assessment at different levels, and targeting of
‘good’ test look like, and what should it assess? Many of different groups—and instead to discuss more fundamental
the unresolved issues that have emerged over the past dec- issues related to assessment in mathematics education.
ade (e.g. Black and Wiliam 2005; Kaiser et al. 2017; Niss Each of the articles in this ZDM special issue addresses
2007; Suurtaam et al. 2016) still remain unanswered. These one or more of the three focus areas, as these areas have
debates about the methodological and technical issues con- many strong connections, including between assessment
nected to assessment design and implementation are con- formats and the opportunity for students from diverse
cerned with not only what we assess but also how we assess backgrounds to demonstrate their competence, among oth-
and what conclusions we can draw from our assessments. ers. The 13 articles represent novel perspectives on the
Thus, these debates also concern how assessment outcomes three issues identified for this article, or they discuss how
can and are used in decision-making. these issues have been treated within mathematics educa-
We should note that these factors refer not only to the tion research in general. We have included them all among
issues and challenges we have discussed briefly in this intro- the pool of papers, book chapters, and articles that we used
duction but also to the possibilities that assessment provides for this review; many of them appear in more than one of
in connection with mathematics education ‘for all’. Thus, the three sections.
the assessment debate concerns equity issues in addition to
methodology and policy. Strong relationships might exist
between equity and how we assess. For instance, when low-
socioeconomic status (SES) students are frequently reported 2 Review procedures
to have lower achievement scores than those of high-SES
students (OECD 2013a; Mullis et al. 2016), is it because Our review might be termed a state-of-the-art review,
the low-SES students have acquired less of the measured according to the categorisation of Grant and Booth (2009),
competence, or because of some artefact connected to the who state that such reviews “tend to address more cur-
items, or the assessment itself? The goal of this paper is to rent matters in contrast to other combined retrospective
discuss issues connected with each of these three areas sepa- and current approaches. [They] may offer new perspec-
rately; we do so on the basis of a selective review of existing tives on issue or point out area(s) for further research”
research literature on assessment in mathematics education. (p. 95, authors’ additions). In line with Grant and Booth’s
The three focus areas may be described as follows. description, we conducted an extensive search of current
research literature on assessment in mathematics education
1. Methodological and technical issues in developing (2000–2018). No formal quality assessments (i.e., formal
and conducting assessments, related both to what is categorisations) were performed on the research results.
assessed and how it is assessed—that is, to the rela- Rather, the purpose of the analysis has been to describe the
tionship between the purpose of the assessment and the current state of knowledge on assessment in mathematics
assessment format. The discussion focusses on the four education—particularly in relation to the three areas of
stages of the assessment process, which include frame- concern we identified above—and to discuss priorities for
works, operationalisation, measurement, and validation. future investigations and development.
The discussions include both classroom assessment and
13
Assessment in mathematics education: responding to issues regarding methodology, policy,… 557
2.1 Emergence of the three issues ‘equity’. The searches returned a large number of research
publications, and the two authors used their knowledge of
Traditionally, classroom or teacher assessment has been the field to identify any sub-topics within each of the three
discussed separately from large-scale or high-stakes assess- areas (cf., Grant and Booth 2009). A snowballing technique
ment. For example, in the Second Handbook of Research was applied to follow discussions or themes that emerged
on Mathematics Teaching and Learning, edited by Frank from the articles that were initially identified. The purpose
Lester (2007), the section on assessment comprises three of our review was to provide a good overview of emerg-
chapters that focus on classroom assessment, high-stakes ing issues rather than to discuss each issue in full detail.
testing, and international large-scale testing. This traditional The review thus did not follow the procedures typically
division between classroom and high-stakes assessment still used in systematic reviews; searches were ended once suf-
very much exists, although efforts have been made to look ficient research had been found to identify central ideas and
past it. In the research literature, authors now look at over- issues, thus enabling each topic to be discussed thoroughly.
arching issues and problematiques related to all assessment According to Grant and Booth (2009), the state-of-the-art
formats and purposes. For instance, members of the Topic review is particularly well suited to identifying the research
Study Group (TSG) 33 on assessment and testing in math- approaches and main characteristics of a topic. A careful
ematics education from the 12th International Congress on reader will discover that some of the issues we discuss are
Mathematical Education (ICME-12) in Seoul reflected on not novel but have been discussed for a long time. In these
broad issues such as the development of assessment tasks cases, we use ‘older’ references (2000–2007) together with
in light of the complexity of mathematical thinking or the more recent ones.
design of alternative assessment modes (e.g., online test- In this review we have narrowed the scope of our analy-
ing) in mathematics (Suurtamm and Neubrand 2015). In sis to the assessment of competence of students, preservice
addition, the members of TSG 39 (Large-Scale Assessment teachers, and teachers. In some parts of the review, we have
and Testing in Mathematics Education) and TSG 40 (Class- not distinguished between the three groups, since similar
room Assessment for Mathematics Learning) from the 2016 issues emerged in the literature search (e.g., connected
ICME-13 congress in Hamburg chose to work together to to assessment formats) for all groups. In the sections that
develop a Springer volume on assessment in mathematics focussed on policy, research on policy-making for primary
education (Suurtamm et al. 2016). This development indi- and secondary education emerged more often than research
cates that certain issues related to assessment in mathematics on policy issues targeted towards mathematics teacher edu-
education are not connected to format or level but rather to cation; we found the same situation for research on equity.
more fundamental, underlying structures. Assessment for- Although we did find some research on equity issues related
mat and purpose affect students who take the assessment to the assessment of preservice teachers, very little of this
differently and also affect the lessons that we can learn from research focusses on issues connected to underserved
implementing an assessment or from analyses of assessment groups.
outcomes. As a result, not only methodology but also equity
and policy have emerged as key issues to be included in this
review. 3 Issues in methodology
2.2 Procedures The major methodological issues in assessment in math-

ematics can be related to the ‘what’ and ‘how’ questions
After identifying the three areas (methodology, policy, and of assessment. The ‘what’ question relates to the aspects
equity), we performed thematic searches to identify scien- of mathematical competencies that can be validly assessed,
tific articles, book chapters, or books that address funda- while the ‘how’ question regards the assessment format or
mental aspects in relation to these three issues. Searches method for measuring the competencies addressed in the
were made using Eric, Google Scholar, and ORIA1, applying ‘what’ question. Figure 1 shows the assessment process,
search words and phrases such as ‘mathematics assessment’ which consists of several consecutive sub-processes that
together with terms such as ‘methodology’, ‘policy’, and guide the development of assessments, both in large-scale
studies and in classroom assessments (for an example, see
OECD 2013c).
1
ORIA is a search engine supported by the Norwegian University Methodological challenges may be found in each of the
Library that searches several external databases simultaneously (https sub-processes: (1) in defining a framework or conceptualis-
://bibsys-almapr imo.hosted.exlibr isgroup.com/primo-explore/searc ing the content to be assessed, (2) in operationalising the
h?vid=UIO&sortby=rank&lang=no_NO). Searches are performed in
a large number of library data bases such as the Springer data bases framework and developing assessment formats and items,
or Thomson Reuter data bases, ProQuest. (3) in implementing the measurement, and (4) in interpreting
13
very little progress has been made as regards the

Framework assessment of essential ingredients in mathematical
competencies, such as asking questions, conjecturing,
posing problems, constructing argument, including
formal proofs, making use of and switching between
Validaon Operaonalisaon representations, communication and such like. Not
only is research lacking, assessment instruments are
largely lacking as well. Much assessment and testing is
still focused on students’ solving of already formulated
problems. This shows that as far as assessment is con-
Measurement cerned there is indeed a long way to Utopia. (p. 1306).
In the years since Niss (2007) articulated this concern,
some progress has been made in tasks for classroom assess-
Fig. 1 The assessment process
ment (as well as tasks in large-scale assessment) that better
reflect the complexity of mathematical thinking and problem
solving. Our field has even seen progress in how we assess
the outcomes of assessment and validating the assessment
various sub-areas of mathematical competence, includ-
instruments.
ing that of both students and preservice teachers (see, for
In educational assessments such as large-scale interna-
instance, Suurtamm et al. 2016; Fujita et al. 2018; Ubuz
tional and national assessments—exemplified by PISA,
and Aydin 2018). The PISA 2012 framework, for instance,
TIMSS, and national tests and examinations—an assessment
attempted to describe and assess the different modelling and
framework is usually developed that describes the content
problem-solving processes, and identified these processes
or concepts to be assessed and how the framework should
as the main reporting categories, rather than using the tradi-
be operationalised. Standards regulating how the assessment
tional division into different mathematical content strands as
should be implemented are often included in the framework
in previous cycles (Niss 2015; OECD 2013a, c).
or in accompanying guides or technical frameworks (e.g.,
But once we define what we will assess, how do we
OECD 2013c). In classroom assessments, a national curricu-
operationalise the framework and develop an assessment
lum may be viewed as an assessment framework, although
situation from it? Challenges connected to the process of
a framework might also be provided by the local school or
designing and implementing an assessment will not differ
local school authorities that includes standards and goals
depending on whom we assess, be it students in compulsory
for mathematics teachers. Teachers have to interpret these
education or in teacher education; rather, it is the process of
frameworks and standards and decide on what to teach and
operationalising the framework content that is challenging
how to assess this content. They often use a range of teacher-
for different stakeholders to agree on (Kaarstein 2014). A
selected or teacher-made assessment instruments that are
closer look at the research on mathematics teacher educa-
closely linked to what the students have been learning (Suur-
tion might illuminate the relationship between framework
tam et al. 2016; Wiliam 2007).
conceptualisation and operationalisation. Neubrand (2018)
Teachers also need to interpret and validate assessment
provides an overview of different conceptualisations of
outcomes, both from the assessments they themselves make
mathematics teacher competence as well as approaches to
and from external assessments. If the tasks used for instruc-
assessing such competence. Starting from the assumption
tion and assessment are too similar, however, then there is
that teaching is a profession, he defines critical aspects of
a danger of overestimating what students have achieved,
teaching competence and how major studies in the field have
because the curriculum and instruction might be narrowed
addressed these aspects over the past two decades. He con-
towards tested topics and even towards certain problematic
cludes that a common denominator across projects is that
styles or formats (Hamilton et al. 2007). Previous research
they have disregarded teaching practice when operationalis-
has revealed that teacher-made tests often assess what might
ing the assessment framework. This disregard may be seen
be termed lower-order skills (see, for instance, Palm et al.
as a major weakness that affects the validity of these studies
2011). Even established large-scale assessments might
and calls for further methodological developments in math-
test only certain aspects of mathematical competence; for
ematics (preservice) teacher assessments.
instance the TIMSS study aims to assess what is shared by
Further, we may identify the assumption of unidimen-
the curricula in the participating countries (Mullis et al.
sionality of content or competence defined in most assess-
2016). This lack of rigour might be a common issue shared
ment frameworks as a single trait as a key issue connected
at all levels of educational assessment. According to Niss
to the operationalisation of theoretical constructs defined
(2007),
13
in corresponding frameworks. This assumption tends to from secondary-level mathematics teaching (Rowland and
yield large-scale assessments in particular (van den Heuvel- Ruthven 2010; Speer et al. 2015). Corresponding issues of
Panhuizen and Becker 2003). The conceptualisations and operationalisation can be identified regarding student assess-
operationalisations that studies such as PISA make, subsume ment. Care must be taken when deciding on what content
various mathematical activities that define mathematical lit- areas in mathematics (geometry, algebra, or arithmetic, for
eracy; they might include mathematising, arguing, proving, example) and what mathematical processes (such as proving,
and problem solving, among others, under one broad psy- modelling, understanding, or interpreting) are to be assessed
chological construct (e.g., OECD 2013c). Researchers often and linked to the theoretical concepts behind the traits that
critique such assumptions when discussing large-scale stud- are to be assessed.
ies in mathematics education (e.g., Wuttke 2007). Moving on to the issue of measurement, assessment
In what might be seen as a contrast to this approach, other frameworks that integrate a wide range of mathematical
studies might restrict what is measured in ways that could processes or multiple tests, and where a variety of different
lead to construct underrepresentation. Several examples assessment formats are involved, such as paper-based and
may be found regarding changes to how and what aspects computer-based testing, represent a methodological chal-
of competence are assessed; these changes might seem like lenge, given that both assessment content and instruments
pragmatic choices at the time. For instance, when assessing in these cases are multifaceted and complex. Recent tech-
teacher competence, we have observed a tendency to narrow nological developments have facilitated a change of assess-
the assessed abilities to discrete compartmentalised facets ment mode, but these developments have not come without
of teaching competence that are ‘easier’ to assess and can challenges. As of 2015, for example, the PISA study has
be explained with local theories from mathematics educa- switched from paper-based to computer-based assessment as
tion, such as the teaching of algebra (e.g., Lynch and Star its main assessment mode. Even if meta-analyses conclude
2014), diagnostic competence (e.g., Hoth et al. 2016), or that the mode of delivery does not greatly affect scores when
school-related mathematical knowledge (e.g., Buchholtz assessing established constructs (Wang et al. 2007), other
et al. 2013). studies have revealed that factors such as on-screen reading,
An important question to consider in assessment design screen size, or resolution do affect cross-country compari-
is whether to attempt to assess an overall competence or sons (Jerrim 2016). One possible solution to this challenge
whether to restrict what is to be assessed to some predefined of mode effects, which has been applied in the PISA study,
aspects of teacher competence. For instance, Martinovic and is to statistically adjust for the results (OECD 2016). Still, a
Manizade (2018) describe in their paper the development of major question to be discussed is: Are we really measuring
an instrument for assessing teachers’ knowledge for teaching the same construct?
geometry. They focus on methodological issues connected to The issue of assessment mode relates to more than large-
measuring mathematical knowledge for teaching; they also scale assessment. Several modes may be observed in class-
describe their approach to task design—targeting knowl- room assessments that might involve a change from written
edge for teaching the area of a trapezoid and for accom- tests to oral presentations or from paper-based to computer-
panying assessment tools. Unlike assessing mathematics based assessments. The shift from the extensive use of writ-
teacher competence on a more generic level, they discuss ten tests to assessments for learning, for instance, might be
the benefits of developing assessment instruments within seen as a shift from summative to formative assessment
a well-defined and narrow topic in mathematics, and of or from focussing on answers to focussing on mathemati-
combining different measures to ensure the validity of the cal processes (Wiliam 2007). Hoogland and Tout (2018)
assessed construct. This approach can provide insight into discuss how computer-based assessment might offer new
well-defined restricted areas of teacher competence. Still, opportunities to assess more of students’ mathematical
questions remain of the generalisability of assessment results competence. For instance, recent developments in technol-
to other aspects of teacher competence, for example in terms ogy might support the assessment of higher-order thinking
of policy-making. skills in mathematics while also offering opportunities to
Another issue connected to operationalisation that applies use authentic tasks. Fujita et al. (2018), for instance, discuss
to well-known teacher-education studies in mathemat- computer-based feedback and analyse the use of procedural
ics is the lack of consideration of practical mathematical feedback when conducting geometry proofs. In particular,
knowledge for teaching (Buchholtz et al. 2014). Most of the they analyse how learners can overcome logical circular-
applied assessment frameworks to date neglect the teaching ity with the help of computer-generated feedback and thus
context that teachers experience in their classrooms. Thus address more than issues about assessing procedural skills
the frameworks include only a few facets of professional in mathematics, such as conducting a flow-chart proof in
abilities and lack generalisability to other teaching content geometry. They discuss methodical challenges about devel-
or contexts, often overlooking what distinguishes elementary oping computer-based assessments as well as the finding
13
that, for some students, supplementary teacher interventions administered to a group of preservice teachers participating
are necessary, thus indicating that some further development in a school practice course. In their paper, the authors inte-
is still needed. grate findings from an analysis of survey data with analyses
A key difference between the approaches that Fujita et al. of e-portfolio data to gather as much information as possible
(2018) and Hoogland and Tout (2018) take compared to the on preservice teachers’ learning opportunities and the acqui-
OECD’s approach (2016) is that the OECD, in PISA 2015, sition of situation-specific skills during the course. Leder
merely transferred their paper-based mathematics trend and Forgasz (2018) also discuss how several assessment for-
items2 to a computerised format but failed to take advan- mats might be combined to create equal opportunities for
tage of the possibilities a digital platform might offer, as in male and female students to demonstrate their mathematical
the two other studies. While PISA for trend items primarily competence in a summative assessment. While the authors
utilised possibilities for automated scoring, Fujita et al. and indicate that not simply using multiple tests but using mul-
Hoogland and Tout discuss challenges related to designing tiple tests with different types of tasks and different formats
richer and more authentic tasks. Hoogland and Tout state might be more equitable, integrating or combining similar
that more and more advanced and sophisticated tools exist or different assessment formats might influence the validity
that now enable efficient, automated testing and scoring of of the conclusions that are drawn from the interpretations of
what might be termed lower-order content, such as calcula- assessment outcomes.
tion skills, procedural speed, and factual knowledge. Thus, Any assessment needs to be validated so that all interpre-
technology can also limit what is assessed. The way forward tations drawn from the assessment results will be justified
might be to introduce frameworks that will allow categorisa- and appropriate for the intended use and the assessed group
tion of task content and, as such, could be used to scrutinise of students or teachers (Newton and Shaw 2014). Pankow
the operationalisation of the test content. et al. (2018) present a specific validation approach for an
Another challenge in assessment design is to avoid com- assessment of teacher competencies. They validated a test
partmentalisation or the loss of detail due to overly inclu- to be used for assessing teachers’ perception of students’
sive assessment designs. This scenario might be remedied errors: a capacity they judge to be crucial to mathematics
by combining or integrating diverse assessment formats in teachers’ competence. To check whether the test they devel-
mathematics into the same assessment, or by applying sev- oped was appropriate for the intended group, they compared
eral linked assessments to assess a larger variety of contexts the test results of different control groups, including stu-
or situations. When a single assessment is used in situa- dents, preservice teachers, mathematics teachers, and math-
tions where far-reaching consequences might occur—such ematicians. They concluded that not only could mathemat-
as admission to further studies or for certification—such ics teachers recognise students’ errors faster than could the
situations involve a great risk of making the wrong deci- other groups, but their perception of students’ errors was
sion, since affective and physical conditions can influence also more closely related to the domain of teachers’ math-
test takers’ possibilities of demonstrating their competence ematical content knowledge than to the domain of teachers’
(Pajares and Miller 1995; Ma 1999). mathematics pedagogical content knowledge.
Learning or longitudinal development of mathematical Ubuz and Aydin (2018), who also addresses the valida-
thinking cannot be displayed by a single summative assess- tion of test instruments and the use of test results in educa-
ment. Correspondingly, in order to contribute to fair assess- tional research, applies the Standards for Educational and
ment, all assessments should take into account different and Psychological Testing (AERA, APA, and NCME 2014) to
multiple sources of individual students’ performance (Leder validate an assessment of students’ knowledge of triangles.
and Forgasz 2018), including classroom-based performance. Unlike most tests, this test was developed to be multidimen-
High-quality formative and summative assessments, includ- sional and to assess declarative, procedural, and conditional
ing multiple modes such as computer-based assessment, knowledge about triangles. Ubuz and Aydin applied factor
offer students a range of opportunities to demonstrate their analysis to identify differentiated structures between the dif-
mathematical knowledge (National Council of Teachers of ferent knowledge facets and thus broadened the validity of
Mathematics 2016). Buchholtz et al. (2018) discuss how an her instrument; she also stresses the need to take the external
approach that integrates summative and formative assess- validity of assessments into account.
ment formats in mathematics teacher education might con- Validity not only affects different facets of test design but
tribute to the preservice teachers’ opportunities to learn in a also affects the use of assessment data to inform educational
German teacher education programme. All assessments were decisions such as policy-making. In terms of both intended
and unintended consequences, any inferences drawn from
invalid conceptualised assessments might have far-reaching
2
This remark concerns only trend items. New science items used in consequences, which lead to an important question: Are
PISA 2015 utilised the digital media to a larger extent. assessment data valid for what we want to use them for?
13
4 Policy issues Taiwan. Their analysis focusses on the chain of policy

development, from assessment studies to an educational
According to Ernest (2014), policy in mathematics educa- vision, followed by the implementation of a change to
tion is primarily related to the curriculum, that is, what is national assessments or curricula. Lin et al. to a large
viewed as valid and important to teach in compulsory and degree attributed the mechanisms of success and failure
further education. He distinguishes amongst policy aspects in evidence-based policy-making to traditional Confucian
such as (1) policy debates about the aim and design of heritage, although they also found influences of Western
mathematics education, (2) those concerning mathematics thinking on curriculum development.
teacher education, and (3) the assessment system. As such, Applying assessment data to shape educational policy
policy influences mathematics education, both what hap- does come with certain risks connected to operational-
pens in classrooms and what happens in teacher education. ising the theoretical constructs found in the assessment
Although the term ‘educational policy’ usually refers to framework. This situation is illustrated in a study by Yang
principles or guidelines that educational authorities have Hansen and Strietholt (2018), who reanalysed PISA data to
developed, policies might also be developed by a munici- investigate whether schooling perpetuates SES inequalities
pality, a school board, a school, or even a teacher associa- in mathematics performance. The rationale for their study
tion (e.g., Australian Association of Mathematics Teachers is that previous research has shown that low-SES students
Inc. 2008). Such guidelines might identify how different show lower mathematics achievement than students from
assessments are to be used by different stakeholders in high-SES backgrounds. Yang Hansen and Strietholt, in not-
education (see for instance Elstad et al. 2009). ing that previous research has found that schooling does
A key aim of educational assessment is to elicit evi- not contribute to learning to the same extent across ability
dence to be used to inform or monitor teaching. For levels, suggest that this situation is due to how opportunity
instance, international comparative studies might provide for learning (OTL) is often measured. The complexity of
the educational authorities of a country with insights about the test design, as well as the analytical approaches used
how their country is doing compared with others (Sälzer in international large-scale studies, together make it chal-
and Prenzel 2014). Assessment studies can provide use- lenging to apply these studies for policy-making. A study
ful information to improve mathematics education (Cai by Shen and Tam (2008) illustrates this scenario very well.
et al. 2016); thus assessment is a primary source for pol- They examined the problem of culturally different reference
icy-making. Burkhardt and Schoenfeld (2018) argue that standards by comparing subjective indicators with student
significant advances have been made within the field of performance in TIMSS data from 1995, 1999, and 2003.
mathematics education in conducting both formative and While measures of students’ liking of mathematics and
summative assessments but that these advances have not science, self-perceived competence, self-perception, and
made a comparable impact on learning. In their review of mathematics achievement were usually slightly positively
previous research that has aimed to improve assessment correlated within countries, the correlation was negative
practices and the use of assessment data to improve teach- when the authors conducted between-country analyses. This
ing and learning at the classroom level, the authors con- outcome was most likely due to the high academic standards
clude that policy-makers typically underestimate the chal- attributed to high-achieving countries and the low academic
lenges involved in assessment design and development. standards attributed to low-performing countries.
This underestimation can lead to low-quality high-stakes Clearly, evidence-based policy is challenging, and evi-
tests and to a lack of uptake of assessments for learning in dence exists that stakeholders at all levels often reduce
mathematics classrooms. the complexity found in assessment data when looking for
When educational policies are developed based on information to support their policies. For instance, rather
empirical data from assessment studies, this is usually than using the detailed information that assessments offer
referred to as evidence-based policy-making (Gaber et al. to identify possible insights or shortcomings in mathematics
2012). For instance, Hsieh et al. (2014) discuss how Tai- education, both the media and policy-makers tend to focus
wan used The Teacher Education and Development Study more on overall results and the ranking of countries; they
in Mathematics (TEDS-M) results to inform and adjust conceive the rank as a quality indicator (Auld and Morris
mathematics teacher education. Further, Lin et al. (2018) 2016; Hopfenbeck and Görgen 2017). Some efforts may
applied a four-phase framework that recognises social and be observed in which agencies and official bodies attempt
cultural characteristics of both Taiwanese and Western to break down assessment outcomes to better explain and
educational systems to discuss how outcomes from inter- inform policy-makers about assessment outcomes; one
national comparative studies have influenced educational example is the OECD report series Education at a Glance.
policy in compulsory education and teacher education in Such efforts might not take into account the cultural context
of national educational systems. Analyses by Baird et al.
13
(2016) and Nortvedt (2018) have indicated that the national teaching and should provide information about individual
context influences policy decisions to a large extent. students as well as information that can be used to evalu-
Many academics question the use of assessment data to ate the success of mathematics education at the regional
inform educational practice (e.g., Biesta 2009) and voice or national level. Having multiple purposes such as these
a concern that large-scale international studies move the is highly problematic, since an assessment that validly and
world towards a globalised and more uniform mathematics reliably provides information at the national level might not
education that fails to take national traditions or needs into provide the same at the student level (Newton 2007; Fischer
account (Stobart 2008). One might argue that such uses of 2004).
international studies are based on surface-level analyses and To summarise, a two-directional, reciprocal relationship
that in-depth analyses of assessment data are still neces- may be seen between policy and mathematics education,
sary. Consequently, the question of what might be done to where assessment outcomes inform and influence policy-
improve stakeholders’ abilities to interpret assessment data making (Baird et al. 2016) and educational policies influ-
remains a key question. Burkhardt and Schoenfeld (2018) ence assessment, teaching, or education programmes (Cai
propose that the research community within mathematics et al. 2015, 2016; Middleton et al. 2015). For instance, the
education should engage with decision-makers. The authors phenomenon of ‘teaching to the test’ (Hamilton et al. 2007)
indicate that the real challenge, however, is to find better is usually interpreted as a negative policy influence on math-
ways to mitigate trust in simple statistics gleaned from ematics education. In this case, low-performing students
what they have called high-stakes low-quality tests as well might be asked not to attend school on the day students are
as increasing interest in high-quality assessments that can tested, or teachers might fabricate test results (Nichols and
contribute to improved mathematics education. Berliner 2007). Equally, national examinations and high-
Policy-makers and teachers alike often struggle to inter- stakes tests may influence the content that is offered within
pret the assessment data provided by high- and low-stakes mathematics teaching (Hamilton et al. 2007). Educators
tests and to use the test outcomes to inform their teaching have been discussing the ‘what you test is what you get
(Groß Ophoff 2013). This scenario is often the case with (WYTIWYG)’ principle for a long time, but it is a prin-
external assessments such as international comparative ciple that works both ways: educational authorities might
studies and national tests, where what is tested often does initiate assessment reforms to influence mathematics teach-
not cover the full national curriculum, and only sample test ing. While emphasising certain educational standards more
items are released. In this situation, test outcomes might be than others can restrict the implementation of a curriculum,
used primarily for school self-presentation rather than to ini- educational authorities might also use changes to national
tiate change processes (Brown and Harris 2009). In addition, examinations to influence changes in mathematics education
teachers might feel controlled rather than encouraged by (Lin et al. 2018), including the uptake of digital tools such
the assessments (Stobart 2008). Hallinger and Heck (2010) as dynamic geometry or CAS tools.
promote collaborative school leadership, where school lead-
ers and teachers share accountability for school practices
(such as ownership and responsibility for assessment out- 5 Equity issues
comes) and collaboratively interpret assessment data and
plan interventions. Their research indicates that such prac- Equity in mathematics education concerns equal opportuni-
tices can, over time, lead to higher student achievement in ties to learn important mathematical content for all students
mathematics. (see for instance Burkhardt and Schoenfeld 2018). Similarly,
A key issue discussed in the research literature concerns equity within mathematics assessment means that all stu-
the use of assessment data for policy development (Lin et al. dents should have the same opportunities to demonstrate
this issue; Nortvedt 2018) and whether existing assessment their mathematical competence. In an educational system
practices provide the information that stakeholders need to focussed on ‘education for all’ (Niss 2007), equity is the
make informed decisions (Gaber et al. 2012). This issue gold standard compared with equality, where the same treat-
points to the potential consequences linked to the use of ment is offered to all students but without the recognition
assessment data to inform decisions, since (1) more than that different students might need different kinds of support
one stakeholder is often involved and (2) the potential for to achieve equity (Heritage and Wylie 2018). Achievement
misinterpretation always exists. Different stakeholders will gaps between groups of students might indicate inequity,
apply assessment results for different purposes and therefore especially if the differences are systematic. Various gaps
often need different kinds of evidence to support their deci- such as these exist today; for instance, both gender differ-
sions (Newton 2007). For example, some countries imple- ences and an achievement gap between majority and minor-
ment national tests to provide evidence both to teachers and ity students are frequently visible in mathematics assess-
decision-makers. That is, the same assessment should inform ments (e.g., OECD 2013a, 2015; Klenowski 2009). Indeed,
13
we could point to several cases of inequitable assessment as a control variable might be challenging when investigat-
practices, for example with regard to centralised versus ing gender differences in achievement, given potentially dif-
decentralised national examinations (Wößmann 2005) or ferent response patterns to questionnaires that investigate
assessments of students with special needs (Scherer et al. self-beliefs. If gendered ways of expressing oneself can be
2016). In this review paper we use gender differences and identified when beliefs are measured, such expression should
the achievement gap between majority and minority students be taken into consideration when reporting outcomes. Gen-
to illustrate how equity issues are linked to methodology dered ways of responding might also yield assessment for-
and policy. mats. For example, Leder and Forgasz (2018) discuss how
Gender differences in mathematics education have assessment format and purpose influence male and female
been discussed for a long time (Leder and Forgasz 2018). students’ assessment outcomes. Taking into account differ-
Achievement differences have traditionally been visible in ences in interaction patterns, they conclude that within the
large-scale studies and high-stakes achievement tests such as Australian context, mathematics assessment still leads to
examinations and national tests (e.g., Liu and Wilson 2009; inequity. The authors question whether national tests provide
OECD 2013a). Male students usually outperform female the same credible and important information about boys and
students, although the reverse pattern is visible in some girls; they ask critical questions about the terminology of
instances. Girls outperform boys in TIMSS or PISA in some testing and test validity from a gender perspective.
countries for mathematics overall or for specific content Equity is a key concern in multicultural classrooms. The
areas (OECD 2013a; Mullis et al. 2016). Interestingly, the heterogeneity of students—and consequently the diversity
last PISA cycle showed that within OECD nations, the gen- in mathematics classrooms—has increased considerably
der differences in mathematics are decreasing (OECD 2016), in recent years (OECD 2015), not least as a result of the
indicating that gender differences in achievement might not increased number of refugees and the integration of students
be particularly consistent across countries and time. An from crisis regions into the world’s school systems (Paxton
emerging alternative explanation concerns the mathematics et al. 2011). Many refugees have very little formal educa-
teaching and curricula students are exposed to. Ayalon and tion and severely interrupted or no substantive schooling, all
Livneh (2013) found that in countries with a standardised of which limits their education (Miller and Mitchell 2006).
educational system, boys and girls are exposed to the same Increased migration for work purposes is also visible in our
content and teaching activities, which might lead to more globalising society. The PISA study has shown that 13%
similar achievement. The authors identified gender stratifi- of the 15-year-old students in OECD countries came from
cation as the most marginal factor involved in the creation immigrant backgrounds in 2015, compared with only 9% in
of an achievement gap. We argue that these findings relate 2006. In this context, classrooms are not only characterised
to the validity of the assessment, since the interpretations of by increased linguistic heterogeneity but also by increased
achievement gaps might reflect national or institutional dif- heterogeneity in terms of prior mathematical knowledge
ferences in schooling. When boys and girls are not exposed (OECD 2016).
to the same curricula or teaching experiences, the same test International comparative studies (e.g., OECD 2015;
might not be a valid assessment for both groups. Museus et al. 2011) have frequently identified achievement
In addition, differences between male and female students gaps between majority and minority students. Similar pat-
have previously been visible in attitudes toward mathematics terns may be observed in national assessments, and overall
and beliefs about mathematics or oneself as a mathemat- differences between majority and migrant students might
ics student (e.g., one’s self-efficacy and motivation). Boys indicate inequity in relation to mathematics teaching and/or
generally report more positive attitudes, while girls tend to assessment. While first-generation immigrant students tend
report more anxiety (OECD 2013b). This difference might to score significantly below non-immigrant students in most
be an indication of varying attitudes or beliefs, or it might countries, second-generation immigrant students tend to per-
indicate gendered response patterns to survey questions. form somewhere between the two groups (OECD 2015),
Indeed, some researchers have suggested that gender differ- although a fair degree of complexity makes the result pat-
ences might be the outcome of different attitudes towards the terns a challenge to disentangle. In PISA 2012, for instance,
test situation (e.g., test anxiety or performance avoidance) immigrant students scored at the level of non-immigrant
rather than real differences in mathematical knowledge or students in some countries (e.g., New Zealand); in addition,
competence (Cotton et al. 2010; Hannon 2012; Hyde and second-generation students scored higher than non-immi-
Mertz 2009; Lindberg et al. 2010). Studies often explain gen- grant students in a few countries, such as Australia (OECD
der differences in mathematics achievement by boys’ more 2015).
positive attitudes toward competition in general, but when Previous research has identified language factors as a
controlling for such factors, these gender differences disap- factor contributing to the assessment gap between majority
pear (Hannon 2012; Cotton et al. 2010). The use of beliefs and minority students (Klenowski 2009; Abedi and Lord
13
2001). Paper and pencil tests that assess students’ mathemat- assessment as being student-focussed by (1) being mindful
ics competence might draw heavily on a student’s ability to of the student populations within the class/school/district/
read and interpret texts. Previous research has demonstrated country, (2) using appropriate language, (3) acknowledg-
a high positive correlation between reading comprehension ing students’ differences, and (4) applying tools that are
and mathematics achievement (e.g., Nortvedt 2011; Nort- appropriate for diverse students and that are applied with
vedt et al. 2016). The issue of language as a ‘gatekeeper’ the intention of improving learning for all students who
to accessing test content, as well as the consequences of participate in the assessment. Care should be taken in all
not mastering the language of assessment to a sufficient phases of assessment development and implementation to
extent, are also illustrated by the strong negative effect of allow for a valid assessment of student competence, which
late arrival (and consequently shorter exposure) to language could be taken into account by utilising tasks with fewer
instruction in the host country (OECD 2015). An alterna- language barriers or introducing computer-based assess-
tive explanation for the achievement gap is the efficiency ment. Computer-based assessment offers different means
(or lack thereof) of the educational system of the host coun- to contextualise and display mathematical tasks that are
try. Migrant students from the same origin country tend to more connected to the realistic situation the task is sup-
perform very differently in different educational systems; posed to present, such as datasets, video vignettes, graphi-
students from Arabic-speaking countries, for instance, per- cal displays, and other means of presenting content and
formed far better in the Netherlands than they did in Finland mathematical problems.
on PISA 2012, although the average scores for the two coun- Task aspects such as video and graphical displays might
tries were very similar (OECD 2015). also lessen the need for language fluency to comprehend
Finally, the gap between majority and minority students the mathematical problem at hand (Sangwin 2013; Hoog-
might also be partially due to how we assess students’ com- land and Tout 2018). For example, in national low-stake
petence or whose curricula we assess (Stobart 2008). Herit- tests, supplementary test items might be developed for
age and Wylie (2018), who present a case study on forma- students with special needs or language barriers in order
tive assessment in mathematics education, discuss a sample to achieve higher measurement accuracy and to ensure
lesson taught by a teacher who has implemented assessment that test items have appropriate difficulty levels and are
for learning in a multicultural classroom. Heritage and Wylie generally understood by the students (Institut zur Qual-
address the challenges and benefits connected to assessment itätsentwicklung im Bildungswesen IQB 2017). Students
for learning; in particular, they address identity and equity might also be empowered if they learn how to self-regu-
questions while highlighting the challenges that language late, as this capacity is crucial to mathematics learning
requirements offer within the mathematics classroom. They (Semana and Santos 2018). Students who participate
conclude that effective assessment for learning practices in mathematics teaching with teachers who help them
can support both language and concept development among understand assessment criteria and participate in class
minority students, even when these students are instructed discussions, will develop their capacity to self-regulate in
in a language they do not speak. mathematics learning situations, but to different degrees.
While we often assume that mathematics education is Semana and Santos propose that some students might need
culture-neutral, research indicates that the way in which we more support than others to develop this capacity, and that
express ourselves and view mathematics is in fact highly differentiation is necessary to achieve equity with regard
cultural (Klenowski 2009). A primary question about equity to students’ opportunities to learn mathematics.
then relates to the opportunities that both majority and To judge if an assessment is fair, we need a theoreti-
minority students have to use and display culture-specific cal framework comprising relevant factors. While some
knowledge in assessment situations, and whether this knowl- have advocated a socio-cultural perspective for framing
edge or competence is generally acknowledged as valid and assessments (e.g. Klenowski 2009), others have called for
important mathematical knowledge (Klenowski 2009). The culturally responsive assessments (e.g. Montenegro and
lack of culturally responsive assessment could result in une- Jankowski 2017). Research literature focussed on how
qual educational opportunities for migrant students (Hopson equity for migrant and majority students might be achieved
and Hood 2005). The outcome could be that migrant stu- in high-stakes testing remains scarce, however, as is cul-
dents will be more likely to leave school early (Bradshaw turally responsive mathematics assessment initiated by the
et al. 2008) or that fewer migrant students will be admitted teacher. In addition, previous research has revealed chal-
to higher education (Hopson and Hood 2005). lenges for teachers in relation to using rich tasks in assess-
The concept of culturally fair or responsive assessment situations (e.g., Siemon et al. 2004). Further, Wong and
ment is challenging to define, as doing so necessitates Glass (2005) identified challenges in designing professional
a broad discussion of the term ‘culture’. Montenegro development to sensitise teachers to culturally responsive
and Jankowski (2017) describe culturally responsive assessment.
13
Fig. 2 Relations between

assessment, policy, and equity
6 Issues in assessment in mathematics issues as well as the opportunity to use assessment for pol-
education icy development. Ideally, the assessments used for policy-
making should provide important information necessary to
According to Niss (2007), ‘mathematics for all’ may be shape educational policies that can improve mathematics
viewed as the goal of mathematics education, since it offers education (Lin et al. 2018). In addition, policy-makers must
equal opportunities to develop mathematical competence to look past surface output (e.g., average scores) to identify
all students. Educational systems should educate citizens the crucial messages (Auld and Morris 2016; Burkhardt and
who can contribute to democracy and add to the technical Schoenfeld 2018).
and financial development of society; such citizens must also
possess the mathematical competence necessary for their 6.1 Relations between methodology, policy,
professional and everyday lives. To achieve these goals, we and equity
need to develop assessments that can assess all aspects of
mathematical competence, not only certain aspects that are In this section of the review, we do not treat issues related
easier to assess (Hoogland and Tout 2018; Burkhardt and to methodology, policy, and equity as three separate issues
Schoenfeld 2018). Research in the field should pay more because in reality, strong links are visible between the three
attention to the theoretical foundations of what is assessed areas of concern. Figure 2 displays the reciprocal relation-
and the development of the test instruments (Neubrand ships between methodology, equity, and policy. We should
2018; Martinovic and Manizade 2018; Yang; Hansen and note that assessment is related both to teaching practice and
Strietholt 2018; Ubuz and Aydin 2018; Pankow et al. 2018). to research. In the previous sections we identified relation-
Much research and development remains to be done ships between the assessment methodology and policy and
before equity issues in assessment in mathematics educa- between assessment methodology and equity.
tion can be properly dealt with. To better target assessments The reciprocal relationships displayed in Fig. 2 are visible
to individual levels of performance, we need more richness also in the literature we have reviewed in this paper, where
and variety in assessment formats (Leder and Forgasz 2018; authors often discuss mathematics teaching and learning in
Buchholtz et al. 2018). Not only culture and language need relation to assessment and assessment outcomes, or in rela-
to be taken into consideration but also how students respond tion to policy implementation. Researchers might decide to
to feedback (Heritage and Wylie 2018; Fujita et al. 2018; conduct research on how different measures influence the
Semana and Santos 2018). As the discussion in this review inferences we may draw from assessments; they might also
has shown, much still remains to be done 10 years after use output from international studies to inform educational
Niss’s (2007) statement about current issues connected to authorities about changes to teaching, learning, and assess-
assessing mathematical competence. The current assessment ment in mathematics that could be included in new policies
practices influence both methodological aspects and equity (Lin et al. 2018). In fact, a reciprocal relationship also exists
13
between research and practice, which reveals the complex possible intentional or unintentional consequences of the argu-
relationships between the areas we have discussed in this ment into account.
review (see Fig. 2). What we assess, and how we assess, Both intentional and unintentional consequences have
influence practice and policy. For instance, when large-scale previously been visible in educational systems that have
studies reveal an achievement gap between migrant and strong accountability practices and external controls. In these
majority students, that gap might influence both educational cases, studies have found that teachers often teach to the test,
policy and equity, thus showing the connection between low-performing children might be asked not to participate
these two fields and how we develop or deliver assessments. in assessments such as national tests, and teachers are often
evaluated based on students’ test scores (Baker et al. 2010;
Seeley 2006). Although the educational field has attempted
6.2 Assessment validity to improve teaching and teacher education, national and other
government-based assessments can carry the risk that their
The mutual influences of methodological, policy, and equity results will be used primarily to rank educational systems or
issues on one another are not least characterised by questions schools (Auld and Morris 2016). Equity issues such as gender
of assessment validity. Validity in general is the most funda- differences in mathematics, or the achievement gap between
mental, but also the most complex, quality criterion of any majority and minority students, are also strongly aligned to the
assessment. Despite the many debates about conceptions of consequential aspects of validity, especially when assessments
different types of validity in the history of educational and should inform measures to reduce inequality or are used to
psychological measurement (Newton and Shaw 2014), the identify at-risk students. Those who analyse assessment data
current standards for educational and psychological testing from computer-based assessments or learning analytics should
portray validity as a ‘unitary concept’ (AERA, APA, and be aware of the temporality of data and that today’s assessment
NCME 2014). Following this understanding, those in the results will be yesterday’s results tomorrow. Adding to the dan-
field distinguish various sources of evidence that they might ger of neglecting this transitory nature are the simplifications
use to support the interpretation of assessment outcomes so and the implicit assumptions that both individual people and
that these interpretations are not only convincing and plau- society as a whole make when response data are coded into
sible but also empirically and methodically justified and algorithms. There is thus the risk that these kinds of assess-
acceptable to society. Sources for validity evidence (such ment practices may reproduce and entrench existing biases
as content or construct representativeness) affect methodo- such as class, gender, or ethnicity (Wilson et al. 2017).
logical issues in several ways (Messick 1995). For exam- A multitude of issues are connected to validity, such as
ple, the content of an assessment influences how valid the the danger of using a single measurement point, the reli-
conceptualisation of important constructs in the assessment ability of the outcome, or the validity of inferences made
framework is, while the operationalisation of an assessment for different purposes. Pellegrino et al. (2001), for instance,
involves questions about meaningful and construct-valid claim that when “a single assessment is used for multiple
tasks and test formats. purposes…the more purposes a single assessment aims to
The validity of the measurement itself is also influenced serve, the more each purpose will be compromised” (p. 2).
by the technical quality of the assessment instrument as well Further, Newton (2007) points to the risks of ambiguous
as how the assessment is delivered to students, preservice allocations of assessment purposes, since policy-makers
teachers and teachers. Other sources of validity (such as the could be misled if the complexities of the assessment design
consequences of participating in an assessment) are more are over-simplified, for example when an assessment for a
relevant to policy and equity debates because these validity particular purpose (such as short-term system monitoring) is
sources are concerned with the interpretation of assessment wrongly used for long-term system monitoring. Therefore, to
results or plausible consequences of this interpretation to a scrutinise any potential consequences connected to the use
greater extent. Validity thus relates to decisions made by vari- of a particular assessment, different argument procedures
ous stakeholders at different levels in the educational system, might be applied, depending on whether the assessment was
from the teacher making informed decisions in the classroom, designed to assess individual students or for research or pro-
to the teacher educator or teacher education institution making gramme evaluation purposes at the institutional or national
decisions about teacher education programmes or professional level.
development programmes, to policy-makers shaping a novel
educational policy. The data these stakeholders apply come 6.3 Future directions for assessment
from different sources, from classroom discussions, observa- in mathematics education
tions, and teacher-made tests, to examinations and large-scale
national assessments or international comparative studies. Numerous researchers within the research community deal
Validity arguments must take the quality of the data and any with specific questions related to assessment in mathematics
13
education. In doing so, only isolated aspects of assessment 3. Efforts to improve equity. We propose that culturally
development or implementation are taken into consideration, responsive assessment represents a decisive further
leaving other aspects in the dark. The strong relationships development in the area of addressing heterogeneity
between teaching, learning, and assessment and between in assessment that takes into account the relationship
methodology, equity, and policy are challenging to disentan- between equity issues and educational policy. We must
gle without oversimplifying. For instance, ongoing revision extend the notion of cultural responsiveness by inves-
of curricula, the further development of school practices, tigating how large-scale assessment in mathematics
beliefs about the role of mathematics education, content, can allow students from diverse cultural backgrounds
and teaching methods all contribute to further complexity. to participate. Further investigations might include new
Correspondingly, assessment must be constantly adapted to approaches to researching the influence of language
new circumstances, such as increased heterogeneity in the on mathematics achievement in assessment situations,
classroom or new technological possibilities. We might see or the use of formative assessment in classrooms and
this as a daunting task, or it might encourage the research teacher education.
community to continue to work on the improvement of 4. Applying technology to develop better measures of math-
assessment practices in mathematics education. With regard ematical competence. The further development of com-
to assessment practices in mathematics education, we see puters, software, and digital tools has pushed forward
four key areas for future development, as described below. the question of whether, how, and to what extent we
can use technology to assess mathematical knowledge,
1. Efforts to develop more refined methods to improve the thinking, or skills. Recent developments have strongly
quality of educational assessment in mathematics edu- influenced assessment practice; we propose that those
cation. We propose that the high degree of complexity in the field should use this knowledge to develop better
found in the assessment process should be taken into measures of students’ mathematical competence, while
account to a larger extent than has previously been taking into consideration not only the multiple possibili-
feasible. For instance, multiple assessment formats or ties but also the technical and methodological challenges
multiple groups of test takers increase complexity and involved.
call for multiple approaches to validation. In the past,
mixed-methods research has developed into an applica- In reality, we cannot look at these four key areas for future
tion-oriented research methodology, which is suitable development independently. In order to work towards equity,
for addressing validity questions when these questions we might utilise technology to develop assessments where
are transferred to assessment practice. Applying mixed- language is understood more broadly; such assessments
methods research methods offers the possibility of man- might incorporate animations and visual displays, for exam-
aging this complexity in new ways. ple. Applying mixed methods to learn more from today’s
2. Efforts to improve the relation between research and assessment formats and practices is a key to further develop-
practice. We propose that the research community ing better tests and practices that can provide teachers and
should emphasise to a larger extent the relevance of policy-makers with the insights they need to improve their
assessment research to teaching practice and educational practices. Further, we might utilise a wider range of assess-
policy. Many see educational research and teaching prac- ment formats to enable both equity in mathematics educa-
tice as different reference systems that coexist indepen- tion and more suitable assessment data for policy-making.
dently of each other and that have different orientations, Finally, educational policies might affect what is assessed
which could explain why recent research results from and how; thus, such policies may contribute to more equi-
international large-scale studies or from effectiveness table practices in mathematics education. A stronger link
research often have little practical significance or trans- between research and teaching might be facilitated by care-
ferability (Burkhardt and Schoenfeld 2003, 2018). By fully considering the issues related to methodology, policy,
taking into account the increased heterogeneity found and equity discussed in this paper; such considerations
in classrooms and student backgrounds when analysing should concern each issue individually as well as the rela-
assessment data, those in the education assessment field tionships between the three issues and how they influence
will be able to provide more applicable advice for prac- one another and assessment validity. Because we use assess-
tice and policy. By doing so, we can increase the scope ment outcomes to inform teaching, select students for further
by which the respective stakeholders involved can view education, certify professionals, and shape educational poli-
the outcomes of both assessment and research studies cies, it is vital that we discuss the technical affordances and
as being valid, thus contributing to the quality of math- possibilities that each assessment format offers. Clearly, a
ematics education. special issue on assessment in mathematics education can
13
help us to address important challenges in our field of sci- Burkhardt, H., & Schoenfeld, A. (2003). Improving educational
entific inquiry. research: Toward a more useful, more influential, and better-
funded enterprise. Educational Researcher, 32(9), 3–14.
Burkhardt, H., & Schoenfeld, A. (2018). Assessment in the service
of learning: Challenges and opportunities. ZDM Mathematics
Open Access This article is distributed under the terms of the Crea- Education, 50(4), 1–15.
tive Commons Attribution 4.0 International License (http://creativeco Cai, J., Hwang, S., & Middleton, J. A. (2015). The role of large-scale
mmons.org/licenses/by/4.0/), which permits unrestricted use, distribu- studies in mathematics education. In J. A. Middleton, S. Hwang
tion, and reproduction in any medium, provided you give appropriate & J. Cai (Eds.), Large-scale studies in mathematics education
credit to the original author(s) and the source, provide a link to the (pp. 405–414). Cham: Springer.
Creative Commons license, and indicate if changes were made. Cai, J., Mok, I. A. C., Reddy, V., & Stacey, K. (2016). International
comparative studies in mathematics: Lessons for improving stu-
dents learning. In ICME-13 topical surveys (pp. 1–36). Cham
(Switzerland): Springer.
References Cotton, C., McIntyre, F., & Price, J. (2010). Gender differences dis-
appear with exposure to competition. Working paper 2010–11.
Abedi, J., & Lord, C. (2001). The language factors in mathematics University of Miami, Department of Economics. http://moya.
tests. Applied Measurement in Education, 14(3), 219–234. bus.miami.edu/~ccotton/papers/cotton_mcintyre_price_2009.
Auld, E., & Morris, P. (2016). PISA, policy and persuasion: Translating pdf. Accessed 9 July 2017.
complex conditions into education ‘best practice’. Comparative Elstad, E., Nortvedt, G. A., & Turmo, A. (2009). The Norwegian
Education, 52(2), 202–229. assessment system: An accountability perspective. CADMO,
Australian Association of Mathematics Teachers Inc. (2008). Position 17(1), 89–103.
paper on the practice of assessing mathematical learning. http:// Ernest, P. (2014). Policy debates in mathematics education. In S. Ler-
www.aamt.edu.au/content/download/9895/126744/file/Asses man (Ed.), Encyclopedia of mathematics education. Dordrecht:
sment_position_paper_2017.pdf. Accessed 9 July 2017. Springer.
Ayalon, H., & Livneh, I. (2013). Educational standardization and Fischer, R. (2004). Standardization to account for cross-cultural
gender differences in mathematics achievement: A comparative response bias: A classification of score adjustment procedures
study. Social Science Research, 42(2), 432–445. and review of research. Journal of Cross-Cultural Psychology,
Baird, J.-A., Johnson, S., Hopfenbeck, T. H., Isaacs, T., Sprague, T., 35(3), 263–282.
Stobart, G., & Yu, G. (2016). On the supranational spell of PISA Fujita, T., Jones, K., & Miyazaki, M. (2018). Learners’ use of domain-
in policy. Educational Research, 58(2), 121–138. specific computer-based feedback to overcome logical circularity
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, in deductive proving in geometry. ZDM Mathematics Education,
H. F., Linn, R. L., Ravitch, D., et al. (2010). Problems with the 50(4), 1–15.
use of student test scores to evaluate teachers. Economic Policy Gaber, S., Cankar, G., Umek, L. M., & Tašner, V. (2012). The danger
Institute Briefing Paper #278. http://www.epi.org/publication/ of inadequate conceptualisation in PISA for education policy.
bp278/. Accessed 9 July 2017. Compare, 42(4), 647–663.
Biesta, G. (2009). Good education in an age of measurement: On the Grant, M., & Booth, A. (2009). A typology of reviews: An analysis of
need to reconnect with the question of purpose in education. 14 review types and associated methodologies. Health Informa-
Educational Assessment, Evaluation and Accountability, 21(1), tion and Libraries Journal, 26(2), 91–108.
33–46. Groß Ophoff, J. (2013). Lernstandserhebungen: Reflexion und Nutzung.
Black, P., & Wiliam, D. (2005). Inside the black box: Raising standards Münster: Waxmann.
through classroom assessment. The Phi Delta Kappan, 80(2), Hallinger, P., & Heck, R. H. (2010). Collaborative leadership and
139–148. school improvement: Understanding the impact on school capac-
Black, P., & Wiliam, D. (2012). Assessment for learning in the class- ity and student learning. School Leadership & Management,
room. In J. Gardner (Ed.), Assessment and learning (pp. 11–32). 30(2), 95–110.
London: Sage. Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn,
Bradshaw, C. P., O’Brennan, L. M., & McNeely, C. A. (2008). Core A., Russell, J. L., et al. (2007). Standards-based accountability
competencies and the prevention of school failure and early under no child left behind: Experiences of teachers and adminis-
school leaving. New Directions for Child and Adolescent Devel- trators in three states. Santa Monica: RAND Corporation.
opment, 122, 19–32. Hannon, B. (2012). Test anxiety and perfomance-avoidance goals
Brown, G. T. L., & Harris, L. R. (2009). Unintended consequences explain gender differences in SAT-V, SAT-M, and overall SAT
of using tests to improve learning: How improvement-oriented scores. Personality and Individual Differences, 53(7), 816–820.
resources heighten conceptions of assessment as school account- Hattie, J. A. C., & Timperley, H. (2007). The power of feedback.
ability. Journal of Multidisciplinary Evaluation, 6(12), 68–91. Review of Educational Research, 77(1), 81–112.
Buchholtz, N., Kaiser, G., & Blömeke, S. (2014). Measuring pedagogi- Heritage, M., & Wylie, C. (2018). Reaping the benefits of assessment
cal content knowledge in mathematics—conceptualizing a com- for learning: Achievement, identity and equity. ZDM Mathemat-
plex domain. Journal für Mathematik-Didaktik, 35(1), 101–128. ics Education, 50(4), 1–13.
Buchholtz, N., Krosanke, N., Orschulik, A. B., & Vorhölter, K. (2018). Hoogland, K., & Tout, D. (2018). Computer-based assessment of
Combining and integrating formative and summative assessment mathematics in the 21st century: Pressures and tensions. ZDM
in mathematics teacher education. ZDM Mathematics Education, Mathematics Education, 50(4), 1–12.
50(4), 1–14. Hopfenbeck, T. H., & Görgen, K. (2017). The politics of PISA: The
Buchholtz, N., Leung, F. K. S., Ding, L., Kaiser, G., Park, K., & media, policy and public responses in Norway and England.
Schwarz, B. (2013). Future mathematics teachers’ professional European Journal of Education, 52(2), 195–205.
knowledge of elementary mathematics from an advanced stand- Hopson, R., & Hood, S. (2005). An untold story in evaluation roots:
point. ZDM, 45(1), 107–120. Reid E. Jackson and his contribution toward culturally responsive
evaluation at three quarters of a century. In S. Hood, R. Hopson
13
& H. Frierson (Eds.), The role of culture and cultural context S. Hwang (Eds.), Large-scale studies in mathematics education
(pp. 87–104). Greenwich: Information Age Publishing. (pp. 1–3). Cham: Springer.
Hoth, J., Döhrmann, M., Kaiser, G., Busse, A., König, J., & Blömeke, Miller, J., & Mitchell, J. (2006). Interrupted schooling and the acquisi-
S. (2016). Diagnostic competence of primary school mathemat- tion of literacy: Experiences of Sudanese refugees in Victorian
ics teachers during classroom situations.. ZDM Mathematics secondary schools. Australian Journal of Language and Literacy,
Education, 48(1), 41–53. 29(2), 150–162.
Hsieh, F.-J., Chu, C.-T., Hsieh, C.-J., & Lin, P.-J. (2014). In-depth Montenegro, E., & Jankowski, N. A. (2017). Equity and assessment:
analyses of different countries’ responses to MCK items: A Moving towards culturally responsive assessment. National Insti-
view on the differences within and between East and West. In tute for Learning Outcomes Assessment. http://learningoutcome
S. Blömeke, F.-J. Hsieh, G. Kaiser & W. H. Schmidt (Eds.), sasses sment .org/docume nts/Occasi onalP
aper2 9.pdf. Accessed 9.
International perspectives on teacher knowledge, beliefs and July 2017.
opportunities to learn (pp. 115–140). Dordrecht: Springer. Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2016). TIMSS
Hyde, J. S., & Mertz, J. E. (2009). Gender, culture, and mathematics 2015 international results in mathematics. Boston College
performance. Proceedings of the National Academy of Sciences TIMSS & PIRLS International Study Center website: http://
of the United States of America, 106(22), 8801–8807. timssandpirls.bc.edu/timss2015/international-results/. Accessed
Institut zur Qualitätsentwicklung im Bildungswesen (IQB). (2017). 9 July 2017.
Erprobungsstudie 2017 zu den Bildungsstandards Mathematik Museus, S. D., Palmer, R. T., Davis, R. J., & Maramba, D. (2011). Spe-
in der Sekundarstufe I. https://www.iqb.hu-berlin.de/bt/BT201 cial issue: Racial and ethnic minority student success in STEM
8/Erprobungsstudie2017. Accessed 27 Apr 2018. education. ASHE Higher Education Report, 36, 1–140.
Jerrim, J. (2016). PISA 2012: How do results for the paper and com- National Council of Teachers of Mathematics (NCTM). (2016). Large-
puter tests compare? Assessment in Education: Principles, scale mathematics assessments and high-stakes decisions: A
Policy & Practice, 23(4), 495–518. position of the National Council of Teachers of Mathematics.
Kaarstein, H. (2014). Norwegian mathematics teachers’ and edu- Reston: NCTM.
cational researchers’ perception of MPCK items used in the National Council on Measurement in Education (NCME). (2014).
TEDS-M study. Nordisk Matematikkdidaktikk, 19(3–4), 57–82. Standards for educational and psychological testing. Washing-
Kaiser, G., Blömeke, S., König, J., Busse, A., Döhrmann, M., & ton, DC: AERA.
Hoth, J. (2017). Professional competencies of (prospective) Neubrand, M. (2018). Conceptualizations of professional knowledge
mathematics teachers: Cognitive versus situated approaches. for teachers of mathematics. ZDM Mathematics Education,
Educational Studies in Mathematics, 94(2), 161–182. 50(4), 1–12.
Kilpatrick, J. (2014). History of research in mathematics education. Newton, P. E. (2007). Clarifying the purpose of educational assess-
In S. Lerman (Ed.), Encyclopedia of mathematics education. ment. Assessment in Education: Principles, Policy & Practice,
Dordrecht: Springer. 14(2), 149–170.
Klenowski, V. (2009). Australian indigenous students: Addressing Newton, P. E., & Shaw, S. D. (2014). Validity in educational and psy-
equity issues in assessment. Teacher Education, 20(1), 77–93. chological assessment. London: Sage.
Leder, G., & Forgasz, H. J. (2018). Measuring who counts: Gender Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high-
and mathematics assessment. ZDM Mathematics Education, stakes testing corrupts America’s schools. Cambridge: Harvard
50(4), 1–11. Education Press.
Lester, F. Jr. (Ed.). (2007). Second handbook of research on math- Niss, M. (1993). Assessment in mathematics education and its effects:
ematics teaching and learning. Charlotte: Information Age An Introduction. In M. Niss (Ed.), Investigations into assessment
Publishing. in mathematics education. An ICMI Study (pp. 1–30). Dordrecht:
Lin, F.-L., Wang, T.-Y., & Chang, Y.-P. (2018). Effects of large-scale Springer.
studies on mathematics education policy on Taiwan through the Niss, M. (2007). Reflections on the state of and trends in research on
lens of societal and cultural characteristics. ZDM Mathematics mathematics teaching and learning. In F. K. J. Lester (Ed.), Sec-
Education, 50(4), 1–14. ond handbook of research on mathematics teaching and learning
Lindberg, S. M., Hyde, J. S., Petersen, J. L., & Linn, M. C. (2010). New (pp. 1293–1312). Charlotte: Information Age Publishing.
trends in gender and mathematics performance: A meta-analysis. Niss, M. (2015). Mathematical competencies and PISA. In R. Turner
Psychological Bulletin, 136(6), 1123–1135. & K. Stacey (Eds.), Assessing mathematical literacy: The PISA
Liu, O. L., & Wilson, M. (2009). Gender differences in large-scale experience (pp. 35–55). Cham: Springer.
math assessments: PISA trend 2000 and 2003. Applied Measure- Nortvedt, G. A. (2011). Coping strategies applied to comprehend mul-
ment in Education, 22(2), 164–184. tistep arithmetic word problems by students with above-average
Lynch, K., & Star, J. R. (2014). Teachers’ views about multiple strat- numeracy skills and below-average reading skills. Journal for
egies in middle and high school mathematics. Mathematical Mathematical Behavior, 30(3), 255–269.
Thinking and Learning, 16(2), 85–108. Nortvedt, G. A. (2018). Policy impact of PISA on mathematics educa-
Ma, X. (1999). A meta-analysis of the relationship between anxiety tion: The case of Norway. European Journal for Psychology in
towards mathematics and achievement in mathematics. Journal Education, 33(3), 427–444.
for Research in Mathematics Education, 30(5), 520–540. Nortvedt, G. A., Gustafsson, J.-E., & Lehre, A.-C. W. G. (2016). The
Martinovic, D., & Manizade, A. G. (2018). The challenges in the importance of InQua for the relation between achievement in
assessment for knowledge for teaching geometry. ZDM Math- reading and mathematics. In T. Nilsen & J.-E. Gustafsson (Eds.),
ematics Education, 50(4), 1–17. Teacher quality, instructional quality and student outcome: Rela-
Messick, S. (1995). Validity of psychological assessment: Validation tionships across countries, cohorts and time (pp. 97–113). Cham:
of inferences from persons’ responses and performances as sci- Springer.
entific inquiry into score meaning. American Psychologist, 50, OECD. (2013a). PISA 2012 results: Student performance in mathemat-
741–749. ics, reading, science. Volume I. Paris: OECD Publishing.
Middleton, J. A., Cai, J., & Hwang, S. (2015). Why mathematics edu- OECD. (2013b). PISA 2012 results: Ready to learn. Students’ engage-
cation needs large-scale research. In J. A. Middleton, J. Cai & ment, drive and self-beliefs. Volume III. Paris: OECD Publishing.
13
OECD. (2013c). PISA 2012 assessment and analytical framework: Speer, N. M., King, K. D., & Howell, H. (2015). Definitions of math-
Mathematics, reading, science, problem solving and financial ematical knowledge for teaching: Using these constructs in
literacy. Paris: OECD Publishing. research on secondary and college mathematics teachers. Journal
OECD. (2015). Helping immigrant students to succeed at school—and of Mathematics Teacher Education, 18(2), 105–122.
beyond. Paris: OECD Publishing. Stobart, G. (2008). Testing times: The uses and abuses of assessment.
OECD. (2016). PISA 2015 results: Excellence and equity in education Oxford: Routledge.
(Vol I). Paris: OECD Publishing. Suurtamm, C., & Neubrand, M. (2015). Assessment and testing in
Pajares, F., & Miller, M. D. (1995). Mathematics self-efficacy and mathematics education. In S. J. Cho (Ed.), Proceedings of
mathematics performances: The need for specificity of assess- the 12th International Congress on Mathematical Education
ment. Journal of Counseling Psychology, 42(2), 190–198. (pp. 557–562). Cham: Springer.
Palm, T., Boesen, J., & Lithner, J. (2011). Mathematical reasoning in Suurtamm, C., Thompson, D. R., Kim, R. Y., Moreno, L. D., Sayac,
Swedish upper secondary level assessments. Mathematics Think- N., Schukajlow, S., et al. (2016). Assessment in mathematics
ing and Learning, 13(3), 221–246. education: Large-scale assessment and classroom assessment.
Pankow, L., Kaiser, G., & König, J. (2018). Perception of students’ Cham: Springer.
errors under time limitation: Are teachers better than mathemati- Ubuz, B., Aydin. (2018). Geometry knowledge test about triangles:
cians or students? Results of a validation study. ZDM Mathemat- Development and validation. ZDM Mathematics Education,
ics Education, 50(4), 1–12. 50(4).
Paxton, G., Smith, N., Win, A. K., Mulholland, N., & Hood, S. (2011). van den Heuvel-Panhuizen, M., & Becker, J. (2003). Towards a didactic
Refugee status report: A report on how refugee children and model for assessment design in mathematics education. In A. J.
young people in Victoria are faring. Melbourne: Department of Bishop, M. A. Clements, C. Keitel, J. Kilpatrick & F. K. S. Leung
Education and Early Childhood Development (DEECD). (Eds.), Second international handbook of mathematics education
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what (pp. 689–716). Dordrecht: Springer.
students know: The science and design of educational assess- Wang, S., Jiao, H., Young, M., Brooks, T., & Olson, J. (2007). A meta-
ment. Washington, DC: National Academy Press. analysis of testing mode effects in grade K–12 mathematics tests.
Rowland, T., & Ruthven, K. (2010). Mathematical knowledge in teach- Educational and Psychological Measurement, 67(2), 219–238.
ing. Dordrecht: Springer. Wiliam, D. (2003). The impact of educational research on mathematics
Sälzer, C., & Prenzel, M. (2014). Looking back at five rounds of PISA: education. In A. J. Bishop, M. A. Clements, C. Keitel, J. Kilpat-
Impacts on teaching and learning in Germany. Solsko Polje, rick & F. K. S. Leung (Eds.), Second international handbook
25(5/6), 53–72. of mathematics education (pp. 471–490). Dordrecht: Springer
Sangwin, C. J. (2013). Computer aided assessment of mathematics. Netherlands.
Oxford: Oxford University Press. Wiliam, D. (2007). Keeping learning on track. In F. K. J. Lester (Ed.),
Semana, S., & Santos, L. (2018). Self-regulation of learning in stu- Second handbook of research on mathematics teaching and
dent participation in mathematics assessment. ZDM Mathematics learning (pp. 1053–1098). Charlotte: Information Age.
Education, 50(4), 1–13. Wilson, A., Watson, C., Thompson, T. L., Drew, V., & Doyle, S.
Scherer, P., Beswick, K., DeBois, L., Healy, L., & Opitz, E. M. (2016). (2017). Learning analytics: Challenges and limitations. Teach-
Assistance of students with mathematical learning difficulties: ing in Higher Education, 22(8), 991–1007.
How can research support practice? ZDM, 48, 633–649. Wong, P. A., & Glass, R. D. (2005). Assessing a professional develop-
Schoenfeld, A. (2007). Issues and tensions in the assessment of math- ment school approach to preparing teachers for urban schools
ematical proficiency. In A. Schoenfeld (Ed.), Assessing mathe- serving low-income, culturally and linguistically diverse com-
matical proficiency (pp. 3–16). New York: Cambridge University munities. Teacher Education Quarterly, 32(3), 63–77.
Press. Wößmann, L. (2005). The effect heterogeneity of central examinations:
Seeley, C. (2006). Teaching to the test. NCTM News Bulletin. http:// Evidence from TIMSS, TIMSS-Repeat and PISA. Education
www.nctm.org/News-and-Calendar/Messages-from-the-Presi Economics, 13(2), 143–169.
dent/Archive/Cathy-Seeley/Teaching-to-the-Test/. Accessed 9 Wuttke, J. (2007). Uncertainties and bias in PISA. In S. T. Hopmann,
July 2017. G. Brinek & M. Retzl (Eds.), PISA according to PISA: Does
Shen, C., & Tam, H. P. (2008). The paradoxical relationship between PISA keep what it promises? Vienna: LIT-Verlag.
student achievement and self-perception: A cross-national analy- Hansen, K. Y., & Strietholt, R. (2018). Does schooling actually per-
sis based on three waves of TIMSS data. Educational Research petuate educational inequality in mathematics performance? A
and Evaluation, 14(1), 87–100. question of validity. ZDM Mathematics Education, 50(4), 1–6.
Siemon, D., Enilane, F., & McCarty, J. (2004). Supporting indigenous
students’ achievement in numeracy. Australian Primary Math-
ematics Classroom, 9(4), 50–53.
13

Assessment in Mathematics Education Responding To Issues

Uploaded by

Copyright:

Available Formats

Assessment in Mathematics Education Responding To Issues

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Assessment in Mathematics Education Responding To Issues

Uploaded by

Copyright:

Available Formats

ZDM (2018) 50:555–570