Chapter 14
14.1 Introduction
Imagine you have designed a new shared web space intended for advertising
second-hand goods. How would you find out whether householders would be
able to use it to find what they wanted and whether it was a reliable and effective
service? What evaluation methods would you employ?
Examples of the tasks that are given to users include searching for information,
reading different typefaces (e.g. Helvetica and Times), and navigating through
different menus. Time and number are the two main performance measures: the time it takes typical users to complete a task, such as finding a website, and the number of errors that participants make, such as selecting wrong menu options when creating a spreadsheet. The quantitative performance
measures that are obtained during the tests produce the following types of data
(Wixon and Wilson, 1997):
A key concern is the number of users that should be involved in a usability study:
five to twelve is considered an acceptable number (Dumas and Redish, 1999), but
sometimes it is possible to use fewer when there are budget and schedule
constraints. For instance, quick feedback about a design idea, such as the initial
placement of a logo on a website, can be obtained from only two or three users.
Many companies, such as Microsoft and IBM, used to test their products in
custom-built usability labs (Lund, 1994). These facilities comprise a main testing
laboratory, with recording equipment and the product being tested, and an
observation room where the evaluators watch what is going on and analyze the
data. There may also be a reception area for testers, a storage area, and a viewing
room for observers. The space may be arranged to superficially mimic features of
the real world. For example, if the product is an office product or for use in a
hotel reception area, the laboratory can be set up to look like those environments.
Soundproofing and lack of windows, telephones, co-workers, and other
workplace and social artifacts eliminate most of the normal sources of distraction
so that the users can concentrate on the tasks set for them to perform.
Figure 14.1 A usability laboratory in which evaluators watch participants on a
monitor and through a one-way mirror
Typically there are two to three wall-mounted video cameras that record the
user's behavior, such as hand movements, facial expression, and general body
language. Microphones are also placed near where the participants will be sitting
to record their utterances. Video and other data is fed through to monitors in the
observation room. The observation room is usually separated from the main
laboratory or workroom by a one-way mirror so that evaluators can watch participants being tested, but the participants cannot see the evaluators. The observation room can be a small auditorium with rows of seats at different levels or, more simply, a small back room with a row of chairs facing the monitors. Either way, it is designed so that evaluators and others can watch the tests as they take place, both on the monitors and through the mirror. Figure 14.1 shows a typical arrangement.
Usability labs can be very expensive and labor-intensive to run and maintain. A
less expensive alternative, which started to become popular in the early to mid-1990s, is the use of mobile usability testing equipment. Video cameras, laptops,
and other measuring equipment are temporarily set up in an office or other
space, converting it into a makeshift usability laboratory. Another advantage is
that equipment can be taken into work settings, enabling testing to be done on
site, making it less artificial and more convenient for the participants.
Figure 14.4 The Tracksys system being used with a camera that attaches to a flexible arm, which mounts on a mobile device and is tethered to the lab
Another trend has been to conduct remote usability testing, where users perform
a set of tasks with a product in their own setting and their interactions with the
software are logged remotely. A popular example is UserZoom, which enables
users to participate in usability tests from their home, as illustrated in Figure
14.6. An advantage of UserZoom is that many users can be tested at the same time
and the logged data is automatically compiled into statistical packages for data
analysis. For example, the number of clicks per page and the tracking of clicks
when searching websites for specified tasks can be readily obtained.
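The click data mentioned above is compiled automatically by the remote-testing service, but the underlying processing is simple aggregation. The following sketch assumes a made-up log format (a participant id, a page URL, and a timestamp per event) rather than UserZoom's actual data format, and shows how clicks per page could be counted in Python:

# Minimal sketch (hypothetical log format, not UserZoom's): each event is a
# dict with a participant id, the page visited, and a timestamp. We count
# clicks per page overall and per participant.
from collections import Counter, defaultdict

def summarize_clicks(events):
    clicks_per_page = Counter()                    # total clicks on each page
    clicks_per_participant = defaultdict(Counter)  # per-participant breakdown
    for e in events:
        clicks_per_page[e["page"]] += 1
        clicks_per_participant[e["participant"]][e["page"]] += 1
    return clicks_per_page, clicks_per_participant

# Example with invented log entries
events = [
    {"participant": "P1", "page": "/home",   "time": "10:00:01"},
    {"participant": "P1", "page": "/search", "time": "10:00:07"},
    {"participant": "P2", "page": "/home",   "time": "10:02:13"},
]
per_page, per_participant = summarize_clicks(events)
print(per_page.most_common())   # e.g. [('/home', 2), ('/search', 1)]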
Usability specialists Budiu and Nielsen (2010) from the Nielsen Norman Group
conducted a usability test of websites and apps specific to the iPad. They wanted to understand how interactions with the device affected people, and to give feedback to their clients and developers, as well as to people who were eager to know whether the iPad lived up to the hype surrounding it when it came to market. They used two methods: usability testing with think-aloud, in which users said what they were doing and thinking as they did it, and an expert review. Here we describe the usability testing they conducted. A key question they asked was: ‘Are user expectations different for the iPad compared with the iPhone?’ A previous study they had conducted of the iPhone showed that people preferred using apps to browsing the web, because the latter was slow and cumbersome. Would the same be true for the iPad, where the screen is larger and web pages look more similar to how they appear on laptops or desktop computers? Budiu and Nielsen also hoped that their study would address a question many companies were considering at the time: whether it is worth developing specific websites for the iPad (as some companies were doing for smartphones) or whether desktop versions would be acceptable when interacted with using the iPad's multitouch interface.
Figure 14.5 The mobile head-mounted eye tracker
The usability testing was carried out in two cities in the United States, Fremont,
California, and Chicago. The test sessions were similar; the aim of both was to
understand the typical usability issues that people encounter when using
applications and accessing websites on the iPad. Seven participants were
recruited: all were experienced iPhone users who had owned their phones for at
least 3 months and who had used a variety of apps. One participant was also an
iPad owner. One reason for selecting participants who used iPhones was that they would already have experience of using apps and the web with an interaction style similar to the iPad's.
The participants were considered typical users. They varied in age and
occupation. Two participants were in their 20s, three were in their 30s, one in
their 50s, and one in their 60s. Their occupations were: food server, paralegal,
medical assistant, retired food driver, account rep, and homemaker. Three were
males and four were females.
Before taking part, the participants were asked to read and sign an informed
consent form agreeing to the terms and conditions of the study. This form
described:
The session started with participants being invited to explore any application
they found interesting on the iPad. They were asked to comment on what they
were looking for or reading, what they liked and disliked about a site, and what
made it easy or difficult to carry out a task. A moderator sat next to each
participant and observed and took notes. The sessions were video recorded and
lasted about 90 minutes. Participants worked on their own.
After exploring the iPad they were asked by the evaluator to open specific apps or
websites and to explore them and then carry out one or more tasks as they would
if they were on their own. Each was assigned the tasks in a randomized order. All
the apps that were tested were designed specifically for the iPad, but for some
tasks the users were asked to do the same task on a website that was not
specifically designed for the iPad. For these tasks the evaluators took care to
balance the presentation order so that the app would be first for some
participants and the website would be first for others. Over 60 tasks were chosen
from over 32 different sites. Examples are shown in Table 14.1.
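Randomizing task order and balancing which version (app or website) comes first are routine steps in setting up such a test. A minimal sketch of how an assignment like this could be generated is shown below; the task names and the simple alternation scheme are illustrative assumptions, not the procedure Budiu and Nielsen actually used.

# Illustrative sketch: randomize task order per participant and counterbalance
# whether the app or the website version of a task is presented first.
import random

tasks = ["find a recipe", "buy a book", "plan a trip"]   # made-up task names
participants = ["P1", "P2", "P3", "P4"]

for i, p in enumerate(participants):
    order = random.sample(tasks, k=len(tasks))       # randomized task order
    first = "app" if i % 2 == 0 else "website"       # alternate which version is first
    print(p, order, "starts with:", first)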
ACTIVITY 14.1
1. To find out how participants interacted with apps and websites on the iPad.
The findings were intended to help developers determine whether specific
websites need to be developed for the iPad.
2. Our definition of usability suggests that the iPad should be: efficient, effective,
safe, easy to learn, easy to remember, and have good utility (i.e. good
usability). It should also support creativity and be motivating, helpful, and
satisfying to use (i.e. good user experience). The iPad is designed for the
general public so the range of users is broad in terms of age and experience
with technology.
3. The tasks are a small sample of the total set prepared by the evaluators. They
cover shopping, reading, planning, and finding a recipe – which are common
activities people engage in during their everyday lives.
The Equipment
The testing was done using a setup similar to the mobile usability kit shown
in Figure 14.7. A camera recorded the participant's interactions and gestures
when using the iPad and streamed the recording to a laptop computer. A webcam
was also used to record the expressions on the participants' faces and their think-
aloud commentary. The laptop ran software called Morae, which synchronized
these two data streams. Up to three observers (including the moderator sitting
next to the participant) watched the video streams (rather than observing the
participant directly) on their laptops situated on the table – meaning they did not
have to invade the participants' personal space.
Table 14.1 Examples of some of the tasks used in the iPad evaluation (adapted from Budiu and Nielsen, 2010).
Usability Problems
The main findings from the study showed that the participants were able to
interact with websites on the iPad but that it was not optimal. For example, links
on the pages were often too small to tap on reliably and the fonts were sometimes
difficult to read. The various usability problems identified in the study were
classified according to a number of well-known interaction design principles and
concepts, including: mental models, navigation, the quality of images, problems
of using a touchscreen with small target areas, lack of affordances, getting lost in
the application, the effects of changing orientations, working memory, and the
feedback that they received.
Figure 14.7 The setup used in the Chicago usability testing sessions
Figure 14.8 The cover of Time magazine showing the contents carousel
An example of a navigation problem identified during the evaluation is shown
in Figure 14.8. When the cover of Time magazine appears on the iPad, it doesn't
contain any hyperlinks. The contents page does but it is not easily accessible. In
order to access it, users must first tap the screen to reveal the controls at the
bottom, then select the Contents button to display the contents carousel, and then
select the contents page in the carousel.
Another problem they identified was that participants sometimes did not know where to tap on the iPad to select options such as buttons and menus. The affordances of an interface usually make it obvious where such options are and how to select them. However, the evaluators found that in many cases the participants had to tap the iPad interface repeatedly in order to initiate an action, such as when trying to select an option from a menu.
Based on the findings of their study, Budiu and Nielsen made a number of recommendations, including supporting standard navigation. The results of the study were written up as a report that was made publicly available to app
developers and the general public (it is available from www.nngroup.com). It
provided a summary of key findings for the general public as well as specific
details of the problems the participants had with the iPad, so that developers
could decide whether to make specific websites and apps for the iPad.
While revealing how usable websites and apps are on the iPad, the usability testing could not show how the device would be used in people's everyday lives. That would require an in the wild study in which observations are made of how people use it in their own homes and when traveling.
ACTIVITY 14.2
1. Was the selection of participants for the iPad study appropriate? Justify your
comments.
2. What might have been the problems with asking participants to think out loud
as they completed the tasks?
Comments
When more rigorous testing is needed, a set of standards can be used for
guidance – such an approach is described in Case Study 14.1 for the development
of the US Government's Recovery.gov website. For this large website, several
methods were used including: usability testing, expert reviews (discussed
in Chapter 15), and focus groups.
Ensuring Accessibility and Section 508 Compliance for the Recovery.gov Website
(Lazar et al, 2010b)
The American Recovery and Reinvestment Act (informally known as ‘the Stimulus Bill’)
became law on February 17, 2009, with the intention of infusing $787 billion into the US
economy to create jobs and improve economic conditions. The Act established an
independent board, the Recovery Accountability and Transparency Board, to oversee the
spending and detect, mitigate, and minimize any waste, fraud, or abuse. The law
required the Board to establish a website to provide the public with information on the
progress of the recovery effort. A simple website was launched the day that the Act was
signed into law, but one of the immediate goals of the Board was to create a more
detailed website, with data, geospatial features, and Web 2.0 functionality, including data
on every contract related to the Act. The goal was to provide political transparency at a
scale not seen before in the US federal government so that citizens could see how money
was being spent.
A major goal in the development of the Recovery.gov website was meeting the
requirement that it be accessible to those with disabilities, such as perceptual (visual,
hearing) and motor impairments. It had to comply with guidelines specified in Section
508 of the Rehabilitation Act (see the id-book.com website for details). At a broad level,
three main approaches were used to ensure compliance:
Usability testing with individual users, including those with perceptual and
motor impairments.
Routine testing for compliance with Section 508 of the Rehabilitation Act, done
every 3 months, using screenreaders such as JAWS, and automated testing
tools such as Watchfire.
Providing an online feedback loop, listening to users, and rapidly responding
to accessibility problems.
During development, ten 2-hour focus groups with users were convened in five cities. An
expert panel was also convened with four interface design experts, and usability testing
was performed, specifically involving 11 users with various impairments. Several weeks
before the launch of Recovery.gov 2.0, the development team visited the Department of
Defense Computer Accommodations Technology Evaluation Center (CAPTEC) to get
hands-on experience with various assistive technology devices (such as head-pointing
devices) which otherwise would not be available to the Recovery Accountability and
Transparency Board in their own offices.
Approaches were developed to meet each compliance standard, including situations
where existing regulations don't provide clear guidance, such as with PDF files. A large
number of PDF files are posted each month on Recovery.gov, and those files also undergo
Section 508 compliance testing. The files undergo automated accessibility inspections
using Adobe PDF accessibility tools, and if there are minor clarifications needed,
the Recovery.gov web managers make the changes; but if major changes are needed, the
PDF file is returned to the agency generating the PDF file, along with the Adobe-
generated accessibility report. The PDF file is not posted until it passes the Adobe
automated accessibility evaluation. Furthermore, no new modules or features are added
to the Recovery.gov site until they have been evaluated for accessibility using both
expert evaluations and automated evaluations. Because of the large number of visitors to
the Recovery.gov web site (an estimated 1.5 million monthly visitors), ensuring
accessible, usable interfaces is a high priority. The full case study is available on www.id-
book.com.
Hypotheses Testing
You might ask why you need a null hypothesis, since it seems the opposite of
what the experimenter wants to find out. It is put forward to allow the data to
contradict it. If the experimental data shows a big difference between selection
times for the two menu types, then the null hypothesis that menu type has no
effect on selection time can be rejected. Conversely, if there is no difference
between the two, then the null hypothesis cannot be rejected (i.e. the claim that
context menus are faster to select options from is not supported).
In order to test a hypothesis, the experimenter has to set up the conditions and
find ways to keep other variables constant, to prevent them from influencing the
findings. This is called the experimental design. Examples of other variables that
need to be kept constant for both types of menus include size and screen
resolution. For example, if the text is in 10 pt font size in one condition and 14 pt
font size in the other then it could be this difference that causes the effect (i.e.
differences in selection speed are due to font size). More than one condition can
also be compared with the control, for example:
Condition 3 = Scrolling
Sometimes an experimenter might want to investigate the relationship between
two independent variables: for example, age and educational background. A
hypothesis might be that young people are faster at searching on the web than
older people and that those with a scientific background are more effective at
searching on the web. An experiment would be set up to measure the time it
takes to complete the task and the number of searches carried out. The analysis
of the data would focus on both the effects of the main variables (age and
background) and also look for any interactions among them.
Hypothesis testing can also be extended to include even more variables, but it
makes the experimental design more complex. An example is testing the effects of
age and educational background on user performance for two methods of web
searching: one using a search engine and the other a browser. Again, the goal is
to test the effects of the main variables (age, educational background, web
searching method) and to look for any interactions among them. However, as the
number of variables increases in an experimental design, it makes it more
difficult to work out from the data what is causing the results.
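To make the idea of main effects and interactions concrete, the sketch below runs a two-way analysis of variance on invented web-search times using the statsmodels library; the data, column names, and the choice of ANOVA are assumptions for illustration, not an analysis from the studies described here.

# Hedged sketch: two-way ANOVA on invented data, testing the main effects of
# age group and educational background on web search time, plus their interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "age_group":   ["young"] * 4 + ["older"] * 4,
    "background":  ["science", "science", "arts", "arts"] * 2,
    "search_time": [35, 38, 44, 47, 52, 55, 58, 61],   # seconds, made up
})

# search_time modeled by both factors and their interaction
model = ols("search_time ~ C(age_group) * C(background)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F and p values for each effect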
Experimental Design
The names given for the different designs are: different-participant design, same-
participant design, and matched-pairs design. In different-participant design, a
single group of participants is allocated randomly to each of the experimental
conditions, so that different participants perform in different conditions. Another
term used for this experimental design is between-subjects design. An advantage
is that there are no ordering or training effects caused by the influence of
participants' experience of one set of tasks on their performance in the next, as
each participant only ever performs in one condition. A disadvantage is that large
numbers of participants are needed so that the effect of any individual
differences among participants, such as differences in experience and expertise,
is minimized. Randomly allocating the participants, and pre-testing to identify any participants who differ strongly from the others, can help.
There are many statistical tests that can be used to assess the probability of a result occurring by chance, but the t-test is the most widely used in HCI and related fields, such as psychology. The scores, e.g. the time taken by each participant to select items from a menu in each condition (i.e. context and cascading menus), are used to compute the means (x̄) and standard deviations (SDs). The standard deviation is a statistical measure of the spread or variability
around the mean. The t-test uses a simple equation to test the significance of the
difference between the means for the two conditions. If they are significantly
different from each other we can reject the null hypothesis and in so doing infer
that the alternative hypothesis holds. A typical t-test result that compared menu
selection times for two groups with 9 and 12 participants each might be: t =
4.53, p < 0.05, df = 19. The t-value of 4.53 is the score derived from applying the t-
test; df stands for degrees of freedom, which represents the number of values in the conditions that are free to vary. This is a complex concept that we will not explain here, other than to note how it is derived and that it is always reported as part of a t-test result. For a two-sample t-test, df = (Na − 1) + (Nb − 1), where Na and Nb are the numbers of participants in the two conditions. In our example, df = (9 − 1) + (12 − 1) = 19. p is the probability that a difference at least as large as the one observed would occur by chance if the null hypothesis were true. So, when p < 0.05, there is less than a 5% chance that the observed difference is due to chance; in other words, there most likely is a real difference between the two conditions. Typically, p < 0.05 is considered good enough to reject the null hypothesis, although lower values are more convincing, e.g. p < 0.01, where there is less than a 1% chance that the difference is due to chance.
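To show how such a result is computed in practice, the sketch below runs an independent-samples t-test on invented selection times for groups of 9 and 12 participants using scipy; the numbers are made up for illustration and will not reproduce the exact t-value quoted above.

# Hedged sketch: independent-samples (different-participant) t-test on invented
# menu-selection times (seconds) for two conditions.
from scipy import stats

context_menu   = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.3, 4.4]                  # 9 participants
cascading_menu = [5.6, 5.9, 6.1, 5.4, 6.3, 5.8, 6.0, 5.7, 6.2, 5.5, 5.9, 6.4]   # 12 participants

t, p = stats.ttest_ind(context_menu, cascading_menu)        # assumes equal variances
df = (len(context_menu) - 1) + (len(cascading_menu) - 1)    # 8 + 11 = 19

print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
# If p < 0.05, reject the null hypothesis that menu type has no effect on selection time.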
Table 14.2 The advantages and disadvantages of different allocations of
participants to conditions
Field studies can range in time from just a few minutes to a period of several
months or even years. Data is collected primarily by observing and interviewing
people; collecting video, audio, and field notes to record what occurs in the
chosen setting. In addition, participants may be asked to fill out paper-based diaries, or electronic diaries that run on cell phones or other handheld devices, at particular points during the day, such as when they are interrupted during an ongoing activity, when they encounter a problem while interacting with a product, or when they are in a particular location (Figure 14.9). This technique is based on the experience sampling method (ESM) used in healthcare (Csikszentmihalyi and Larson, 1987). Data on the frequency and patterns of certain daily activities,
such as the monitoring of eating and drinking habits, or social interactions like
phone and face-to-face conversations, are recorded. Software running on the cell
phones triggers messages to study participants at certain intervals, requesting
them to answer questions or fill out dynamic forms and checklists. These might
include recording what they are doing, what they are feeling like at a particular
time, where they are, or how many conversations they have had in the last hour.
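As a rough sketch of the triggering logic such software uses, the snippet below prompts a participant at random intervals and stores timestamped answers; the questions, interval range, and file format are assumptions for illustration, not a description of any particular ESM tool.

# Hedged sketch of an experience sampling loop: wait a random interval, prompt the
# participant, and store timestamped answers. Real ESM tools run on the participant's
# phone and use notifications; here the prompt is simply console input.
import json
import random
import time
from datetime import datetime

QUESTIONS = [
    "What are you doing right now?",
    "How are you feeling (1-5)?",
    "How many conversations have you had in the last hour?",
]

def run_sampling(n_prompts=3, min_gap_s=5, max_gap_s=15, out_file="esm_log.json"):
    log = []
    for _ in range(n_prompts):
        time.sleep(random.uniform(min_gap_s, max_gap_s))   # random interval between prompts
        entry = {"timestamp": datetime.now().isoformat()}
        for q in QUESTIONS:
            entry[q] = input(q + " ")                      # participant's answer
        log.append(entry)
    with open(out_file, "w") as f:
        json.dump(log, f, indent=2)                        # persist the responses

if __name__ == "__main__":
    run_sampling()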
When conducting a field study, deciding whether to tell the people being
observed, or asked to record information, that they are being studied and how
long the study or session will take is more difficult than in a laboratory situation.
For example, when studying people's interactions with an ambient display such
as the Hello.Wall, telling them that they are part of a study will likely change the
way they behave. Similarly, if people are using a dynamic town map in the High
Street, their interactions may only take a few seconds and so informing them that
they are being studied would disrupt their behavior.
Figure 14.9 An example of a context-aware experience sampling tool running on
a mobile device
The system for helping skiers to improve their performance (discussed in Chapter
12) was evaluated with skiers on the mountains to see how they used it and
whether they thought it really did help them to improve their ski performance. A
wide range of other studies have explored how new technologies have been
appropriated by people in their own cultures and settings. By appropriated we
mean how the participants use, integrate, and adapt the technology to suit their
needs, desires, and ways of living. For example, the drift table, an innovative
interactive map table described in Chapter 6, was placed in a number of homes in
London for a period of weeks to see how the home owners used it. The study
showed how the different homeowners interacted with it in quite different ways,
providing a range of accounts of how they understood it and what they did with
it. Another study, mentioned in Chapter 4, was of the Opinionizer system that was
designed as part of a social space where people could share their opinions
visually and anonymously, via a public display. The Opinionizer was intended to
encourage and facilitate conversations with strangers at a party or other social
gatherings. Observations of it being interacted with at a number of parties
showed a honey-pot effect: as the number of people in the immediate vicinity of
the system increased, a sociable buzz was created, where a variety of
conversations were started between the strangers. The findings from these and
other studies in natural settings are typically reported in the form of vignettes,
excerpts, critical incidents, patterns, and narratives to show how the products are
being appropriated and integrated into their surroundings.
Figure 14.10 UbiFit Garden's glanceable display: (a) at the beginning of the week
(small butterflies indicate recent goal attainments; the absence of flowers means
no activity this week); (b) a garden with workout variety; (c) the display on a
mobile phone (the large butterfly indicates this week's goal was met)
Two evaluation methods were used in this study: an in the wild field study over 3
weeks with 12 participants and a survey with 75 respondents from 13 states
across the USA that covered a range of attitudes and behaviors with mobile
devices and physical activity. The goals of the field study were to identify
usability problems and to see how this technology fitted into the everyday lives of
the six men and six women, aged 25–35, who volunteered to participate in the
study. Eleven of these people were recruited by a market research firm and the
twelfth was recruited by a member of the research team. All were regular phone
users who wanted to increase their physical activity. None of the participants
knew each other. They came from a wide range of occupations, including
marketing specialist, receptionist, elementary school employee, musician,
copywriter, and more. Eight were full-time employed, two were homemakers,
one was employed part time, and one was self-employed. Six participants were
classified as overweight, five as normal, and one as obese, based on body mass
calculations.
The study lasted for 21 to 25 days, during which the participants were
interviewed individually three times. The first interview session focused on their
attitudes to physical activity and included setting their own activity goals. In the
second interview session (day 7), participants were allowed to revise their
weekly activity schedule. The last interview session took place on day 21. These
interviews were recorded and transcribed. The participants were compensated
for their participation.
Figure 14.11 shows the data that the evaluators collected for each participant for
the various exercises. Some of the data was inferred by the system, some was
manually written up in a journal, and some was a combination of the two. The
way in which activities were recorded varied over time and across participants (participants are represented by numbers on the vertical axis and the day of the study on the horizontal axis). One reason for this variation is that the system sometimes inferred activities incorrectly, and the participants then corrected them. An
example was housework, which was inferred as bicycling. Manually written up
activities (described as ‘journaled’ in the figure) included such things as
swimming and weightlifting, which the system could not or was not trained to
record.
From the interviews, the researchers learned about the users' reactions to the
usability of UbiFit Garden, how they felt when it went wrong, and how they felt
about it in general as a support for helping them to be fitter. Seven types of errors
with the system were reported. These included: making an error with the start
time, making an error with the duration, confusing activities in various ways,
failing to detect an activity, and detecting an activity when none occurred. These
were coded into categories backed up by quotes taken from the participants'
discussions during the focus groups. Two examples are:
Figure 14.11 Frequency of performed activities and how they were recorded for
each participant
What was really funny was, um, I did, I did some, um a bunch of housework one night. And
boom, boom, boom, I'm getting all these little pink flowers. Like, ooh, that was very
satisfying to get those. (P9, Consolvo et al, 2008, p. 1803)
… it's not the end of the world, [but] it's a little disappointing when you do an activity and it
[the fitness device] doesn’t log it [the activity] … and then I think, ‘am I doing something
wrong?’ (P2, Consolvo et al, 2008, p. 1803)
The silly flowers work, you know? … It's right there on your wallpaper so every time you
pick up your phone you are seeing it and you're like, ‘Oh, look at this. I have all those
flowers. I want more flowers.’ It's remarkable, for me it was remarkably like, ‘Oh well, if I
walk there it's just 10 minutes. I might get another flower. So, sure, I'll just walk.’ (P5,
Consolvo et al, 2008, p. 1804)
Overall the study showed that participants liked the system (i.e. the user
experience). Some participants even commented about how the system motivated
them to exercise. However, there were also technical and usability problems that
needed to be improved, especially concerning the accuracy of activity recording.
ACTIVITY 14.3
1. Why was UbiFit Garden evaluated in the wild rather than in a controlled
laboratory setting?
2. Two types of data are presented from the field study. What does each
contribute to our understanding of the study?
Comment
1. The researchers wanted to find out how UbiFit Garden would be used in
people's everyday lives, what they felt about it, and what problems they
experienced over a long period of use. A controlled setting, even a living lab,
would have imposed too many restrictions on the participants to achieve this.
2. Figure 14.11 provides a visualization of the activity data collected for each
participant, showing how it was collected and recorded. The anecdotal quotes
provide information about how the participants felt about their experiences.
Case Study 14.2
Case study 14.2 (on the website) is about developing cross-cultural children's book
communities and is another example of a field study. It describes how a group of
researchers worked with teachers and school children to evaluate paper and technical
prototypes of tools for children to use in online communities.
The ICDL Communities project explored the social context surrounding next generation
learners and how they share books. This research focused on how to support an online global community of children who do not speak the same languages but who want to share the same digital resources, interact with each other socially, learn about each other's cultures, and make friends. Using specially
developed tools, children communicated inter-culturally, created and shared stories, and
built cross-cultural understanding.
This case study reports the results of three field studies during the iterative development
of the ICDL Communities software with children in pairs of countries: Hungary/USA,
Argentina/USA, and Mexico/USA (Komlodi et al, 2007). In the early evaluations
the researchers investigated how the children liked to represent themselves and their
team using paper (Figure 14.13). In later prototypes the children worked online in pairs
using tablet PCs (Figure 14.14).
Figure 14.13 American children make drawings to represent themselves and their
community
Figure 14.14 Mexican children working with an early prototype using a tablet PC
The findings from each field study enabled the researchers to learn more about the children's needs, allowing them to extend the functionality of the prototype, refine its usability, and improve the children's social experiences. As the system was
developed, it became clear that it was essential to support the entire context of use,
including providing team-building activities for children and support for teachers before
using the online tools.
From these evaluations researchers learned that: children enjoy interacting with other
children from different countries and a remarkable amount of communication takes
place even when the children do not share a common language; identity and
representation are particularly important to children when communicating online;
drawing and sharing stories is fun; providing support for children and teachers offline as
well as online is as essential for the success of this kind of project as developing good
software.
Field studies may be conducted when a behavior the researchers are interested in only reveals itself after prolonged use of a tool, such as the tools designed for knowledge discovery developed in information visualization. Here, the expected
changes in user problem-solving strategies may only emerge after days or weeks
of active use (Shneiderman and Plaisant, 2006). To evaluate the efficacy of such
tools, users are best studied in realistic settings of their own workplaces, dealing
with their own data, and setting their own agenda for extracting insights relevant
to their professional goals. An initial interview is usually carried out to ensure the
participant has a problem to work on, available data, and a schedule for
completion. Then the participant will get an introductory training session,
followed by 2–4 weeks of novice usage, followed by 2–4 weeks of mature usage,
leading to a semi-structured exit interview. Additional assistance may be
provided as needed, thereby reducing the traditional separation between
researcher and participant, but this close connection enables the researcher to
develop a deeper understanding of the users' struggles and successes with the
tools. Additional data such as daily diaries, automated logs of usage, structured
questionnaires, and interviews can also be used to provide a multidimensional
understanding of weaknesses and strengths of the tool.
Communicability evaluation
De Souza (2005) and her colleagues have developed a theory of HCI – semiotic
engineering – that provides tools for HCI design and evaluation. In semiotics the
fundamental entity is the sign, which can be a gesture, a symbol, or words, for example.
One way or another all of our communication is through signs; even when we don't
intend to produce signs, our mere existence conveys messages about us – how we dress,
how old we are, our gender, the way we speak, and so on are all signs that carry
information.
Recent work by these researchers uses the same theoretical basis – i.e. semiotics – for
developing an inspection method (de Souza et al, 2010). Chapter 15 describes inspection
methods in general, and details of this method are included in Case Study 14.3 on id-
book.com.
Typically, studies in the field, particularly those done in the wild, are useful when
evaluators want to discover how new products and prototypes will be used
within their intended social and physical context of use. Routines and other types
of activities are analyzed as they unfold in their natural settings, describing and
conceptualizing the ways artifacts are used and appropriated. Interventions by
evaluators – other than the placement of the prototype or product in the setting,
and questions and/or probes to discover how the system is learned, used, and
adopted – are limited. In contrast, evaluations in laboratories tend to focus on
usability and how users perform on predefined tasks.
With the development of a wide variety of mobile, ambient, wearable, and other
kinds of systems during the past few years, evaluators have to be creative in
adapting the methods that they use to meet the challenges of participants on the
move and in unusual environments.
DILEMMA
How many users should I include in my evaluation study?
A question students always ask is: how many users do I need to include in my study?
Deciding on how many to use for a usability study is partly a logistical issue that depends
on schedules, budgets, representative users, and facilities available. As already
mentioned, many professionals recommend that 5–12 testers is enough for many types of
studies such as those conducted in controlled or partially controlled settings (Dumas and
Redish, 1999), although a handful of users can provide useful feedback at early stages of
a design. Others say that as soon as the same kinds of problems start being revealed and
there is nothing new, it is time to stop. The more participants there are, the more
representative the findings will be across the user population but the study will also be
more expensive and time-consuming, so there is a trade-off to be made.
For field studies the number of people being studied will vary, depending on what is of
interest: it may be a family at home, a software team in an engineering firm, children in
a playground. The problem with field studies is that their findings may not be representative of
how other groups would act. However, the detailed findings gleaned from these studies
about how participants learn to use a technology and appropriate it over time can be
very revealing.
Assignment
This assignment continues work on the web-based ticket reservation system introduced at
the end of Chapter 10 and continued in Chapter 11. Using either the paper or software
prototype, or the HTML web pages developed to represent the basic structure of your
website, follow the instructions below to evaluate your prototype:
1. Based on your knowledge of the requirements for this system, develop a standard
task (e.g. booking two seats for a particular performance).
2. Consider the relationship between yourself and your participants. Do you need
to use an informed consent form? If so, prepare a suitable informed consent
form. Justify your decision.
3. Select three typical users, who can be friends or colleagues, and ask them to do
the task using your prototype.
4. Note the problems that each user encounters. If you can, time their
performance. (If you happen to have a camera, you could film each
participant.)
5. Since the system is not actually implemented, you cannot study it in typical
settings of use. However, imagine that you are planning such a study, i.e. a field study. How would you do it? What kinds of things would you need to take into
account? What sort of data would you collect and how would you analyze it?
6. What are the main benefits and problems with doing a controlled study versus
studying the product in a natural setting?
Summary
Key differences between usability testing, experiments, and field studies include the
location of the study – usability or makeshift usability lab, research lab, or natural
environment – and how much control is imposed. At one end of the spectrum is
laboratory testing and at the other are in the wild studies. Most studies use a
combination of different methods and evaluators often have to adapt their methods to
cope with unusual new circumstances created by the new systems being developed.
Key points
Further Reading
BUDIU, R. and NIELSEN, J. (2010) Usability of iPad Apps and Websites: First
research findings. Nielsen Norman Group, downloadable
from: www.nngroup.com/reports/mobile/ipad/ (accessed August, 2010). This
report discusses the usability testing of the iPad described in this chapter.
Can you tell us about your research and what motivates you?
I am an ethnographer who examines the interplay between technology and sociality. For
the past 6 years, I've primarily focused on how American teens integrate social media
into their daily practices. Because of this, I've followed the rise of many popular social
media services – MySpace, Facebook, YouTube, Twitter, etc. I examine what teens do on
these services, but I also consider how these technologies fit into teens' lives more
generally. Thus, I spend a lot of time driving around the United States talking to teens
and their parents, educators and youth ministers, law enforcement and social workers,
trying to get a sense of what teens' lives look like and where technology fits in.
How would you characterize good ethnography? And please include example(s)
from your own work.
Many people ask me why I bother driving around the United States talking to teens when
I can see everything that they do online. Unfortunately, what's visible online is only a
small fraction of what they do and it's easy to misinterpret why teens do something
simply by looking at the traces of their actions. Getting into their lives, understanding
their logic, and seeing how technology connects with daily practice are critically
important, especially because teens don't have distinct online versus offline lives. It's all
intertwined so it's necessary to see what's going on from different angles. Of course, this
is just the data collection process. I tend to also confuse people because I document a lot
of my thinking and findings as I go, highlighting what I learned publicly for anyone to
disagree with me. I find that my blog provides a valuable feedback loop and I'm
especially fond of the teen commenters who challenge me on things. I've hired many of
them.
I know you have encountered some surprises – or maybe even a revelation – would
you tell us about it please?
From 2006 through 2007, I was talking with teens in different parts of the country and I
started noticing that some teens were talking about MySpace and some teens were
talking about Facebook. In Massachusetts, I met a young woman who uncomfortably told
me that the black kids in her school were on MySpace while the white kids were on
Facebook. She described MySpace as like ‘ghetto.’ I didn't enter into this project expecting
to analyze race and class dynamics in the United States but, after her comments, I
couldn't avoid them. I started diving into my data, realizing that race and class could
explain the difference between which teens preferred which sites. Uncomfortable with
this and totally afar from my intellectual strengths, I wrote a really awkward blog post
about what I was observing. For better or worse, the BBC picked this up as a ‘formal
report from UC Berkeley’ and I received over 10 000 messages over the next week. Some
were hugely critical, with some making assumptions about me and my intentions. But the teens who wrote consistently agreed. And then two teens started pointing out to me
that it wasn't just an issue of choice, but an issue of movement, with some teens moving
from MySpace to Facebook because MySpace was less desirable and Facebook was safe.
Anyhow, recognizing the racist and classist roots of this, I spent a lot of time trying to
unpack the different language that teens used when talking about these sites in a paper
called ‘White Flight in Networked Publics? How Race and Class Shaped American Teen
Engagement with MySpace and Facebook.’