Chapter 14
14.1 Introduction
Imagine you have designed a new shared web space intended for advertising
second-hand goods. How would you find out whether householders would be
able to use it to find what they wanted and whether it was a reliable and effective
service? What evaluation methods would you employ?
Examples of the tasks that are given to users include searching for information,
reading different typefaces (e.g. Helvetica and Times), and navigating through
different menus. Time and number are the two main performance measures: the time it takes typical users to complete a task, such as finding a website, and the number of errors that participants make, such as selecting wrong menu options when creating a spreadsheet. The quantitative performance
measures that are obtained during the tests produce the following types of data
(Wixon and Wilson, 1997):
A key concern is the number of users that should be involved in a usability study:
five to twelve is considered an acceptable number (Dumas and Redish, 1999), but
sometimes it is possible to use fewer when there are budget and schedule
constraints. For instance, quick feedback about a design idea, such as the initial
placement of a logo on a website, can be obtained from only two or three users.
Many companies, such as Microsoft and IBM, used to test their products in
custom-built usability labs (Lund, 1994). These facilities comprise a main testing
laboratory, with recording equipment and the product being tested, and an
observation room where the evaluators watch what is going on and analyze the
data. There may also be a reception area for testers, a storage area, and a viewing
room for observers. The space may be arranged to superficially mimic features of
the real world. For example, if the product is an office product or for use in a
hotel reception area, the laboratory can be set up to look like those environments.
Soundproofing and lack of windows, telephones, co-workers, and other
workplace and social artifacts eliminate most of the normal sources of distraction
so that the users can concentrate on the tasks set for them to perform.
Figure 14.1 A usability laboratory in which evaluators watch participants on a
monitor and through a one-way mirror
Typically there are two to three wall-mounted video cameras that record the
user's behavior, such as hand movements, facial expression, and general body
language. Microphones are also placed near where the participants will be sitting
to record their utterances. Video and other data is fed through to monitors in the
observation room. The observation room is usually separated from the main
laboratory or workroom by a one-way mirror so that evaluators can watch participants being tested, but the participants cannot see the evaluators. The observation room can be a small auditorium with rows of seats at different levels or, more simply, a small back room with a row of chairs facing the monitors. Either way, it is designed so that evaluators and others can watch the tests as they take place, both on the monitors and through the mirror. Figure 14.1 shows a typical arrangement.
Usability labs can be very expensive and labor-intensive to run and maintain. A
less expensive alternative, which started to become popular in the early to mid-1990s, is the use of mobile usability testing equipment. Video cameras, laptops,
and other measuring equipment are temporarily set up in an office or other
space, converting it into a makeshift usability laboratory. Another advantage is
that equipment can be taken into work settings, enabling testing to be done on
site, making it less artificial and more convenient for the participants.
Figure 14.4 The Tracksys system being used with a camera that attaches to a flexible arm, which mounts on a mobile device and is tethered to the lab
Another trend has been to conduct remote usability testing, where users perform
a set of tasks with a product in their own setting and their interactions with the
software are logged remotely. A popular example is UserZoom, which enables
users to participate in usability tests from their home, as illustrated in Figure
14.6. An advantage of UserZoom is that many users can be tested at the same time
and the logged data is automatically compiled into statistical packages for data
analysis. For example, the number of clicks per page and the tracking of clicks
when searching websites for specified tasks can be readily obtained.
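The click data mentioned above is compiled automatically by the remote-testing service, but the underlying processing is simple aggregation. The following sketch assumes a made-up log format (a participant id, a page URL, and a timestamp per event) rather than UserZoom's actual data format, and shows how clicks per page could be counted in Python:

# Minimal sketch (hypothetical log format, not UserZoom's): each event is a
# dict with a participant id, the page visited, and a timestamp. We count
# clicks per page overall and per participant.
from collections import Counter, defaultdict

def summarize_clicks(events):
    clicks_per_page = Counter()                    # total clicks on each page
    clicks_per_participant = defaultdict(Counter)  # per-participant breakdown
    for e in events:
        clicks_per_page[e["page"]] += 1
        clicks_per_participant[e["participant"]][e["page"]] += 1
    return clicks_per_page, clicks_per_participant

# Example with invented log entries
events = [
    {"participant": "P1", "page": "/home",   "time": "10:00:01"},
    {"participant": "P1", "page": "/search", "time": "10:00:07"},
    {"participant": "P2", "page": "/home",   "time": "10:02:13"},
]
per_page, per_participant = summarize_clicks(events)
print(per_page.most_common())   # e.g. [('/home', 2), ('/search', 1)]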
Usability specialists Budiu and Nielsen (2010) from the Nielsen Norman Group
conducted a usability test of websites and apps specific to the iPad. They wanted to understand how interactions with the device affected people, and to give feedback to their clients and developers, as well as to people who were eager to know whether the iPad lived up to the hype surrounding it when it came to market. They used two methods: usability testing with think-aloud, in which users said what they were doing and thinking as they did it, and an expert review. Here we describe the usability testing they conducted. A key question they asked was: ‘Are user expectations different for the iPad compared with the iPhone?’ A previous study they had conducted of the iPhone showed that people preferred using apps to browsing the web, because the latter was slow and cumbersome. Would the same be true for the iPad, where the screen is larger and web pages look more similar to how they appear on laptops or desktop computers? Budiu and Nielsen also hoped that their study would address a question many companies were considering at the time: whether it is worth developing specific websites for the iPad (as some companies were doing for smartphones) or whether desktop versions would be acceptable when interacted with using the iPad's multitouch interface.
Figure 14.5 The mobile head-mounted eye tracker
The usability testing was carried out in two cities in the United States, Fremont,
California, and Chicago. The test sessions were similar; the aim of both was to
understand the typical usability issues that people encounter when using
applications and accessing websites on the iPad. Seven participants were
recruited: all were experienced iPhone users who had owned their phones for at
least 3 months and who had used a variety of apps. One participant was also an
iPad owner. One reason for selecting participants who used iPhones was that they would already have experience of using apps and the web with an interaction style similar to the iPad's.
The participants were considered typical users. They varied in age and
occupation. Two participants were in their 20s, three were in their 30s, one in
their 50s, and one in their 60s. Their occupations were: food server, paralegal,
medical assistant, retired food driver, account rep, and homemaker. Three were
males and four were females.
Before taking part, the participants were asked to read and sign an informed
consent form agreeing to the terms and conditions of the study. This form
described:
The session started with participants being invited to explore any application
they found interesting on the iPad. They were asked to comment on what they
were looking for or reading, what they liked and disliked about a site, and what
made it easy or difficult to carry out a task. A moderator sat next to each
participant and observed and took notes. The sessions were video recorded and
lasted about 90 minutes. Participants worked on their own.
After exploring the iPad they were asked by the evaluator to open specific apps or
websites and to explore them and then carry out one or more tasks as they would
if they were on their own. Each was assigned the tasks in a randomized order. All
the apps that were tested were designed specifically for the iPad, but for some
tasks the users were asked to do the same task on a website that was not
specifically designed for the iPad. For these tasks the evaluators took care to
balance the presentation order so that the app would be first for some
participants and the website would be first for others. Over 60 tasks were chosen
from over 32 different sites. Examples are shown in Table 14.1.
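Randomizing task order and balancing which version (app or website) comes first are routine steps in setting up such a test. A minimal sketch of how an assignment like this could be generated is shown below; the task names and the simple alternation scheme are illustrative assumptions, not the procedure Budiu and Nielsen actually used.

# Illustrative sketch: randomize task order per participant and counterbalance
# whether the app or the website version of a task is presented first.
import random

tasks = ["find a recipe", "buy a book", "plan a trip"]   # made-up task names
participants = ["P1", "P2", "P3", "P4"]

for i, p in enumerate(participants):
    order = random.sample(tasks, k=len(tasks))       # randomized task order
    first = "app" if i % 2 == 0 else "website"       # alternate which version is first
    print(p, order, "starts with:", first)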
ACTIVITY 14.1
1. To find out how participants interacted with apps and websites on the iPad.
The findings were intended to help developers determine whether specific
websites need to be developed for the iPad.
2. Our definition of usability suggests that the iPad should be: efficient, effective,
safe, easy to learn, easy to remember, and have good utility (i.e. good
usability). It should also support creativity and be motivating, helpful, and
satisfying to use (i.e. good user experience). The iPad is designed for the
general public so the range of users is broad in terms of age and experience
with technology.
3. The tasks are a small sample of the total set prepared by the evaluators. They
cover shopping, reading, planning, and finding a recipe – which are common
activities people engage in during their everyday lives.
The Equipment
The testing was done using a setup similar to the mobile usability kit shown
in Figure 14.7. A camera recorded the participant's interactions and gestures
when using the iPad and streamed the recording to a laptop computer. A webcam
was also used to record the expressions on the participants' faces and their think-
aloud commentary. The laptop ran software called Morae, which synchronized
these two data streams. Up to three observers (including the moderator sitting
next to the participant) watched the video streams (rather than observing the
participant directly) on their laptops situated on the table – meaning they did not
have to invade the participants' personal space.
Table 14.1 Examples of some of the tasks used in the iPad evaluation (adapted from Budiu and Nielsen, 2010).
Usability Problems
The main findings from the study showed that the participants were able to
interact with websites on the iPad but that it was not optimal. For example, links
on the pages were often too small to tap on reliably and the fonts were sometimes
difficult to read. The various usability problems identified in the study were
classified according to a number of well-known interaction design principles and
concepts, including: mental models, navigation, the quality of images, problems
of using a touchscreen with small target areas, lack of affordances, getting lost in
the application, the effects of changing orientations, working memory, and the
feedback that they received.
Figure 14.7 The setup used in the Chicago usability testing sessions
Figure 14.8 The cover of Time magazine showing the contents carousel
An example of a navigation problem identified during the evaluation is shown
in Figure 14.8. When the cover of Time magazine appears on the iPad, it doesn't
contain any hyperlinks. The contents page does but it is not easily accessible. In
order to access it, users must first tap the screen to reveal the controls at the
bottom, then select the Contents button to display the contents carousel, and then
select the contents page in the carousel.
Another problem they identified was that participants sometimes did not know where to tap on the iPad to select options such as buttons and menus. The affordances of an interface usually make it obvious where such options are and how to select them. However, the evaluators found that in many cases the participants had to tap the iPad interface repeatedly in order to initiate an action, such as when trying to select an option from a menu.
Based on the findings of their study, Budiu and Nielsen made a number of recommendations, including supporting standard navigation. The results of the study were written up as a report that was made publicly available to app
developers and the general public (it is available from www.nngroup.com). It
provided a summary of key findings for the general public as well as specific
details of the problems the participants had with the iPad, so that developers
could decide whether to make specific websites and apps for the iPad.
While revealing how usable websites and apps are on the iPad, the usability testing could not show how the device would be used in people's everyday lives. That would require an in the wild study in which observations are made of how people use it in their own homes and when traveling.
ACTIVITY 14.2
1. Was the selection of participants for the iPad study appropriate? Justify your
comments.
2. What might have been the problems with asking participants to think out loud
as they completed the tasks?
Comments
When more rigorous testing is needed, a set of standards can be used for
guidance – such an approach is described in Case Study 14.1 for the development
of the US Government's Recovery.gov website. For this large website, several
methods were used including: usability testing, expert reviews (discussed
in Chapter 15), and focus groups.
Ensuring Accessibility and Section 508 Compliance for the Recovery.gov Website
(Lazar et al, 2010b)
The American Recovery and Reinvestment Act (informally known as ‘the Stimulus Bill’)
became law on February 17, 2009, with the intention of infusing $787 billion into the US
economy to create jobs and improve economic conditions. The Act established an
independent board, the Recovery Accountability and Transparency Board, to oversee the
spending and detect, mitigate, and minimize any waste, fraud, or abuse. The law
required the Board to establish a website to provide the public with information on the
progress of the recovery effort. A simple website was launched the day that the Act was
signed into law, but one of the immediate goals of the Board was to create a more
detailed website, with data, geospatial features, and Web 2.0 functionality, including data
on every contract related to the Act. The goal was to provide political transparency at a
scale not seen before in the US federal government so that citizens could see how money
was being spent.
A major goal in the development of the Recovery.gov website was meeting the
requirement that it be accessible to those with disabilities, such as perceptual (visual,
hearing) and motor impairments. It had to comply with guidelines specified in Section
508 of the Rehabilitation Act (see the id-book.com website for details). At a broad level,
three main approaches were used to ensure compliance:
Usability testing with individual users, including those with perceptual and
motor impairments.
Routine testing for compliance with Section 508 of the Rehabilitation Act, done
every 3 months, using screenreaders such as JAWS, and automated testing
tools such as Watchfire.
Providing an online feedback loop, listening to users, and rapidly responding
to accessibility problems.
During development, ten 2-hour focus groups with users were convened in five cities. An
expert panel was also convened with four interface design experts, and usability testing
was performed, specifically involving 11 users with various impairments. Several weeks
before the launch of Recovery.gov 2.0, the development team visited the Department of
Defense Computer Accommodations Technology Evaluation Center (CAPTEC) to get
hands-on experience with various assistive technology devices (such as head-pointing
devices) which otherwise would not be available to the Recovery Accountability and
Transparency Board in their own offices.
Approaches were developed to meet each compliance standard, including situations
where existing regulations don't provide clear guidance, such as with PDF files. A large
number of PDF files are posted each month on Recovery.gov, and those files also undergo
Section 508 compliance testing. The files undergo automated accessibility inspections
using Adobe PDF accessibility tools, and if there are minor clarifications needed,
the Recovery.gov web managers make the changes; but if major changes are needed, the
PDF file is returned to the agency generating the PDF file, along with the Adobe-
generated accessibility report. The PDF file is not posted until it passes the Adobe
automated accessibility evaluation. Furthermore, no new modules or features are added
to the Recovery.gov site until they have been evaluated for accessibility using both
expert evaluations and automated evaluations. Because of the large number of visitors to
the Recovery.gov web site (an estimated 1.5 million monthly visitors), ensuring
accessible, usable interfaces is a high priority. The full case study is available on www.id-
book.com.
Hypotheses Testing
You might ask why you need a null hypothesis, since it seems the opposite of
what the experimenter wants to find out. It is put forward to allow the data to
contradict it. If the experimental data shows a big difference between selection
times for the two menu types, then the null hypothesis that menu type has no
effect on selection time can be rejected. Conversely, if there is no difference
between the two, then the null hypothesis cannot be rejected (i.e. the claim that
context menus are faster to select options from is not supported).
In order to test a hypothesis, the experimenter has to set up the conditions and
find ways to keep other variables constant, to prevent them from influencing the
findings. This is called the experimental design. Examples of other variables that
need to be kept constant for both types of menus include size and screen
resolution. For example, if the text is in 10 pt font size in one condition and 14 pt
font size in the other then it could be this difference that causes the effect (i.e.
differences in selection speed are due to font size). More than one condition can
also be compared with the control, for example:
Condition 3 = Scrolling
Sometimes an experimenter might want to investigate the relationship between
two independent variables: for example, age and educational background. A
hypothesis might be that young people are faster at searching on the web than
older people and that those with a scientific background are more effective at
searching on the web. An experiment would be set up to measure the time it
takes to complete the task and the number of searches carried out. The analysis
of the data would focus on both the effects of the main variables (age and
background) and also look for any interactions among them.
Hypothesis testing can also be extended to include even more variables, but it
makes the experimental design more complex. An example is testing the effects of
age and educational background on user performance for two methods of web
searching: one using a search engine and the other a browser. Again, the goal is
to test the effects of the main variables (age, educational background, web
searching method) and to look for any interactions among them. However, as the
number of variables increases in an experimental design, it makes it more
difficult to work out from the data what is causing the results.
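To make the idea of main effects and interactions concrete, the sketch below runs a two-way analysis of variance on invented web-search times using the statsmodels library; the data, column names, and the choice of ANOVA are assumptions for illustration, not an analysis from the studies described here.

# Hedged sketch: two-way ANOVA on invented data, testing the main effects of
# age group and educational background on web search time, plus their interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "age_group":   ["young"] * 4 + ["older"] * 4,
    "background":  ["science", "science", "arts", "arts"] * 2,
    "search_time": [35, 38, 44, 47, 52, 55, 58, 61],   # seconds, made up
})

# search_time modeled by both factors and their interaction
model = ols("search_time ~ C(age_group) * C(background)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # F and p values for each effect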
Experimental Design
The names given for the different designs are: different-participant design, same-
participant design, and matched-pairs design. In different-participant design, a
single group of participants is allocated randomly to each of the experimental
conditions, so that different participants perform in different conditions. Another
term used for this experimental design is between-subjects design. An advantage
is that there are no ordering or training effects caused by the influence of
participants' experience of one set of tasks on their performance in the next, as
each participant only ever performs in one condition. A disadvantage is that large
numbers of participants are needed so that the effect of any individual
differences among participants, such as differences in experience and expertise,
is minimized. Randomly allocating the participants, and pre-testing to identify any participants who differ strongly from the others, can help.
There are many statistical tests that can be used to assess the probability of a result occurring by chance, but the t-test is the most widely used in HCI and related fields, such as psychology. The scores, e.g. the time taken by each participant to select items from a menu in each condition (i.e. context and cascading menus), are used to compute the means (x̄) and standard deviations (SDs). The standard deviation is a statistical measure of the spread or variability
around the mean. The t-test uses a simple equation to test the significance of the
difference between the means for the two conditions. If they are significantly
different from each other we can reject the null hypothesis and in so doing infer
that the alternative hypothesis holds. A typical t-test result that compared menu
selection times for two groups with 9 and 12 participants each might be: t =
4.53, p < 0.05, df = 19. The t-value of 4.53 is the score derived from applying the t-
test; df stands for degrees of freedom, which represents the number of values in the conditions that are free to vary. This is a complex concept that we will not explain here, other than to note how it is derived and that it is always reported as part of a t-test result. For a two-sample t-test, df = (Na − 1) + (Nb − 1), where Na and Nb are the numbers of participants in the two conditions. In our example, df = (9 − 1) + (12 − 1) = 19. p is the probability that a difference at least as large as the one observed would occur by chance if the null hypothesis were true. So, when p < 0.05, there is less than a 5% chance that the observed difference is due to chance; in other words, there most likely is a real difference between the two conditions. Typically, p < 0.05 is considered good enough to reject the null hypothesis, although lower values are more convincing, e.g. p < 0.01, where there is less than a 1% chance that the difference is due to chance.
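To show how such a result is computed in practice, the sketch below runs an independent-samples t-test on invented selection times for groups of 9 and 12 participants using scipy; the numbers are made up for illustration and will not reproduce the exact t-value quoted above.

# Hedged sketch: independent-samples (different-participant) t-test on invented
# menu-selection times (seconds) for two conditions.
from scipy import stats

context_menu   = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.3, 4.4]                  # 9 participants
cascading_menu = [5.6, 5.9, 6.1, 5.4, 6.3, 5.8, 6.0, 5.7, 6.2, 5.5, 5.9, 6.4]   # 12 participants

t, p = stats.ttest_ind(context_menu, cascading_menu)        # assumes equal variances
df = (len(context_menu) - 1) + (len(cascading_menu) - 1)    # 8 + 11 = 19

print(f"t = {t:.2f}, df = {df}, p = {p:.4f}")
# If p < 0.05, reject the null hypothesis that menu type has no effect on selection time.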
Table 14.2 The advantages and disadvantages of different allocations of
participants to conditions
Field studies can range in time from just a few minutes to a period of several
months or even years. Data is collected primarily by observing and interviewing
people; collecting video, audio, and field notes to record what occurs in the
chosen setting. In addition, participants may be asked to fill out paper-based diaries, or electronic diaries that run on cell phones or other handheld devices, at particular points during the day, such as when they are interrupted during an ongoing activity, when they encounter a problem while interacting with a product, or when they are in a particular location (Figure 14.9). This technique is based on the experience sampling method (ESM) used in healthcare (Csikszentmihalyi and Larson, 1987). Data on the frequency and patterns of certain daily activities,
such as the monitoring of eating and drinking habits, or social interactions like
phone and face-to-face conversations, are recorded. Software running on the cell
phones triggers messages to study participants at certain intervals, requesting
them to answer questions or fill out dynamic forms and checklists. These might
include recording what they are doing, what they are feeling like at a particular
time, where they are, or how many conversations they have had in the last hour.
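As a rough sketch of the triggering logic such software uses, the snippet below prompts a participant at random intervals and stores timestamped answers; the questions, interval range, and file format are assumptions for illustration, not a description of any particular ESM tool.

# Hedged sketch of an experience sampling loop: wait a random interval, prompt the
# participant, and store timestamped answers. Real ESM tools run on the participant's
# phone and use notifications; here the prompt is simply console input.
import json
import random
import time
from datetime import datetime

QUESTIONS = [
    "What are you doing right now?",
    "How are you feeling (1-5)?",
    "How many conversations have you had in the last hour?",
]

def run_sampling(n_prompts=3, min_gap_s=5, max_gap_s=15, out_file="esm_log.json"):
    log = []
    for _ in range(n_prompts):
        time.sleep(random.uniform(min_gap_s, max_gap_s))   # random interval between prompts
        entry = {"timestamp": datetime.now().isoformat()}
        for q in QUESTIONS:
            entry[q] = input(q + " ")                      # participant's answer
        log.append(entry)
    with open(out_file, "w") as f:
        json.dump(log, f, indent=2)                        # persist the responses

if __name__ == "__main__":
    run_sampling()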
When conducting a field study, deciding whether to tell the people being
observed, or asked to record information, that they are being studied and how
long the study or session will take is more difficult than in a laboratory situation.
For example, when studying people's interactions with an ambient display such
as the Hello.Wall, telling them that they are part of a study will likely change the
way they behave. Similarly, if people are using a dynamic town map in the High
Street, their interactions may only take a few seconds and so informing them that
they are being studied would disrupt their behavior.
Figure 14.9 An example of a context-aware experience sampling tool running on
a mobile device
The system for helping skiers to improve their performance (discussed in Chapter
12) was evaluated with skiers on the mountains to see how they used it and
whether they thought it really did help them to improve their ski performance. A
wide range of other studies have explored how new technologies have been
appropriated by people in their own cultures and settings. By appropriated we
mean how the participants use, integrate, and adapt the technology to suit their
needs, desires, and ways of living. For example, the drift table, an innovative
interactive map table described in Chapter 6, was placed in a number of homes in
London for a period of weeks to see how the home owners used it. The study
showed how the different homeowners interacted with it in quite different ways,
providing a range of accounts of how they understood it and what they did with
it. Another study, mentioned in Chapter 4, was of the Opinionizer system that was
designed as part of a social space where people could share their opinions
visually and anonymously, via a public display. The Opinionizer was intended to
encourage and facilitate conversations with strangers at a party or other social
gatherings. Observations of it being interacted with at a number of parties
showed a honey-pot effect: as the number of people in the immediate vicinity of
the system increased, a sociable buzz was created, where a variety of
conversations were started between the strangers. The findings from these and
other studies in natural settings are typically reported in the form of vignettes,
excerpts, critical incidents, patterns, and narratives to show how the products are
being appropriated and integrated into their surroundings.
Figure 14.10 UbiFit Garden's glanceable display: (a) at the beginning of the week
(small butterflies indicate recent goal attainments; the absence of flowers means
no activity this week); (b) a garden with workout variety; (c) the display on a
mobile phone (the large butterfly indicates this week's goal was met)
Two evaluation methods were used in this study: an in the wild field study over 3
weeks with 12 participants and a survey with 75 respondents from 13 states
across the USA that covered a range of attitudes and behaviors with mobile
devices and physical activity. The goals of the field study were to identify
usability problems and to see how this technology fitted into the everyday lives of
the six men and six women, aged 25–35, who volunteered to participate in the
study. Eleven of these people were recruited by a market research firm and the
twelfth was recruited by a member of the research team. All were regular phone
users who wanted to increase their physical activity. None of the participants
knew each other. They came from a wide range of occupations, including
marketing specialist, receptionist, elementary school employee, musician,
copywriter, and more. Eight were full-time employed, two were homemakers,
one was employed part time, and one was self-employed. Six participants were
classified as overweight, five as normal, and one as obese, based on body mass
calculations.
The study lasted for 21 to 25 days, during which the participants were
interviewed individually three times. The first interview session focused on their
attitudes to physical activity and included setting their own activity goals. In the
second interview session (day 7), participants were allowed to revise their
weekly activity schedule. The last interview session took place on day 21. These
interviews were recorded and transcribed. The participants were compensated
for their participation.
Figure 14.11 shows the data that the evaluators collected for each participant for
the various exercises. Some of the data was inferred by the system, some was
manually written up in a journal, and some was a combination of the two. The
way in which activities were recorded varied over time and across participants (participants are represented by numbers on the vertical axis and the day of the study on the horizontal axis). One reason for this variation is that the system sometimes inferred activities incorrectly, and the participants then corrected them. An
example was housework, which was inferred as bicycling. Manually written up
activities (described as ‘journaled’ in the figure) included such things as
swimming and weightlifting, which the system could not or was not trained to
record.
From the interviews, the researchers learned about the users' reactions to the
usability of UbiFit Garden, how they felt when it went wrong, and how they felt
about it in general as a support for helping them to be fitter. Seven types of errors
with the system were reported. These included: making an error with the start
time, making an error with the duration, confusing activities in various ways,
failing to detect an activity, and detecting an activity when none occurred. These
were coded into categories backed up by quotes taken from the participants'
discussions during the focus groups. Two examples are:
Figure 14.11 Frequency of performed activities and how they were recorded for
each participant
What was really funny was, um, I did, I did some, um a bunch of housework one night. And
boom, boom, boom, I'm getting all these little pink flowers. Like, ooh, that was very
satisfying to get those. (P9, Consolvo et al, 2008, p. 1803)
… it's not the end of the world, [but] it's a little disappointing when you do an activity and it
[the fitness device] doesn’t log it [the activity] … and then I think, ‘am I doing something
wrong?’ (P2, Consolvo et al, 2008, p. 1803)
The silly flowers work, you know? … It's right there on your wallpaper so every time you
pick up your phone you are seeing it and you're like, ‘Oh, look at this. I have all those
flowers. I want more flowers.’ It's remarkable, for me it was remarkably like, ‘Oh well, if I
walk there it's just 10 minutes. I might get another flower. So, sure, I'll just walk.’ (P5,
Consolvo et al, 2008, p. 1804)
Overall the study showed that participants liked the system (i.e. the user
experience). Some participants even commented about how the system motivated
them to exercise. However, there were also technical and usability problems that
needed to be improved, especially concerning the accuracy of activity recording.
ACTIVITY 14.3
1. Why was UbiFit Garden evaluated in the wild rather than in a controlled
laboratory setting?
2. Two types of data are presented from the field study. What does each
contribute to our understanding of the study?
Comment
1. The researchers wanted to find out how UbiFit Garden would be used in
people's everyday lives, what they felt about it, and what problems they
experienced over a long period of use. A controlled setting, even a living lab,
would have imposed too many restrictions on the participants to achieve this.
2. Figure 14.11 provides a visualization of the activity data collected for each
participant, showing how it was collected and recorded. The anecdotal quotes
provide information about how the participants felt about their experiences.
Case Study 14.2
Case study 14.2 (on the website) is about developing cross-cultural children's book
communities and is another example of a field study. It describes how a group of
researchers worked with teachers and school children to evaluate paper and technical
prototypes of tools for children to use in online communities.
The ICDL Communities project explored the social context surrounding next generation
learners and how they share books. This research focused on how to support an online global community of children who do not speak the same languages but who want to share the same digital resources, interact with each other socially, learn about each other's cultures, and make friends. Using specially
developed tools, children communicated inter-culturally, created and shared stories, and
built cross-cultural understanding.
This case study reports the results of three field studies during the iterative development
of the ICDL Communities software with children in pairs of countries: Hungary/USA,
Argentina/USA, and Mexico/USA (Komlodi et al, 2007). In the early evaluations
the researchers investigated how the children liked to represent themselves and their
team using paper (Figure 14.13). In later prototypes the children worked online in pairs
using tablet PCs (Figure 14.14).
Figure 14.13 American children make drawings to represent themselves and their
community
Figure 14.14 Mexican children working with an early prototype using a tablet PC
The findings from each field study enabled the researchers to learn more about the children's needs, allowing them to extend the functionality of the prototype, refine its usability, and improve the children's social experiences. As the system was
developed, it became clear that it was essential to support the entire context of use,
including providing team-building activities for children and support for teachers before
using the online tools.
From these evaluations researchers learned that: children enjoy interacting with other
children from different countries and a remarkable amount of communication takes
place even when the children do not share a common language; identity and
representation are particularly important to children when communicating online;
drawing and sharing stories is fun; providing support for children and teachers offline as
well as online is as essential for the success of this kind of project as developing good
software.
Field studies may be conducted when a behavior the researchers are interested in only reveals itself after prolonged use of a tool, such as the tools designed for knowledge discovery developed in information visualization. Here, the expected
changes in user problem-solving strategies may only emerge after days or weeks
of active use (Shneiderman and Plaisant, 2006). To evaluate the efficacy of such
tools, users are best studied in realistic settings of their own workplaces, dealing
with their own data, and setting their own agenda for extracting insights relevant
to their professional goals. An initial interview is usually carried out to ensure the
participant has a problem to work on, available data, and a schedule for
completion. Then the participant will get an introductory training session,
followed by 2–4 weeks of novice usage, followed by 2–4 weeks of mature usage,
leading to a semi-structured exit interview. Additional assistance may be
provided as needed, thereby reducing the traditional separation between
researcher and participant, but this close connection enables the researcher to
develop a deeper understanding of the users' struggles and successes with the
tools. Additional data such as daily diaries, automated logs of usage, structured
questionnaires, and interviews can also be used to provide a multidimensional
understanding of weaknesses and strengths of the tool.
Communicability evaluation
De Souza (2005) and her colleagues have developed a theory of HCI – semiotic
engineering – that provides tools for HCI design and evaluation. In semiotics the
fundamental entity is the sign, which can be a gesture, a symbol, or words, for example.
One way or another all of our communication is through signs; even when we don't
intend to produce signs, our mere existence conveys messages about us – how we dress,
how old we are, our gender, the way we speak, and so on are all signs that carry
information.
Recent work by these researchers uses the same theoretical basis – i.e. semiotics – for
developing an inspection method (de Souza et al, 2010). Chapter 15 describes inspection
methods in general, and details of this method are included in Case Study 14.3 on id-
book.com.
Typically, studies in the field, particularly those done in the wild, are useful when
evaluators want to discover how new products and prototypes will be used
within their intended social and physical context of use. Routines and other types
of activities are analyzed as they unfold in their natural settings, describing and
conceptualizing the ways artifacts are used and appropriated. Interventions by
evaluators – other than the placement of the prototype or product in the setting,
and questions and/or probes to discover how the system is learned, used, and
adopted – are limited. In contrast, evaluations in laboratories tend to focus on
usability and how users perform on predefined tasks.
With the development of a wide variety of mobile, ambient, wearable, and other
kinds of systems during the past few years, evaluators have to be creative in
adapting the methods that they use to meet the challenges of participants on the
move and in unusual environments.
DILEMMA
How many users should I include in my evaluation study?
A question students always ask is: how many users do I need to include in my study?
Deciding on how many to use for a usability study is partly a logistical issue that depends
on schedules, budgets, representative users, and facilities available. As already
mentioned, many professionals recommend that 5–12 testers is enough for many types of
studies such as those conducted in controlled or partially controlled settings (Dumas and
Redish, 1999), although a handful of users can provide useful feedback at early stages of
a design. Others say that as soon as the same kinds of problems start being revealed and
there is nothing new, it is time to stop. The more participants there are, the more
representative the findings will be across the user population but the study will also be
more expensive and time-consuming, so there is a trade-off to be made.
For field studies the number of people being studied will vary, depending on what is of
interest: it may be a family at home, a software team in an engineering firm, children in
a playground. The problem with field studies is that their findings may not be representative of
how other groups would act. However, the detailed findings gleaned from these studies
about how participants learn to use a technology and appropriate it over time can be
very revealing.
Assignment
This assignment continues work on the web-based ticket reservation system introduced at
the end of Chapter 10 and continued in Chapter 11. Using either the paper or software
prototype, or the HTML web pages developed to represent the basic structure of your
website, follow the instructions below to evaluate your prototype:
1. Based on your knowledge of the requirements for this system, develop a standard
task (e.g. booking two seats for a particular performance).
2. Consider the relationship between yourself and your participants. Do you need
to use an informed consent form? If so, prepare a suitable informed consent
form. Justify your decision.
3. Select three typical users, who can be friends or colleagues, and ask them to do
the task using your prototype.
4. Note the problems that each user encounters. If you can, time their
performance. (If you happen to have a camera, you could film each
participant.)
5. Since the system is not actually implemented, you cannot study it in typical
settings of use. However, imagine that you are planning such a study, i.e. a field study. How would you do it? What kinds of things would you need to take into
account? What sort of data would you collect and how would you analyze it?
6. What are the main benefits and problems with doing a controlled study versus
studying the product in a natural setting?
Summary
Key differences between usability testing, experiments, and field studies include the
location of the study – usability or makeshift usability lab, research lab, or natural
environment – and how much control is imposed. At one end of the spectrum is
laboratory testing and at the other are in the wild studies. Most studies use a
combination of different methods and evaluators often have to adapt their methods to
cope with unusual new circumstances created by the new systems being developed.
Key points
Further Reading
BUDIU, R. and NIELSEN, J. (2010) Usability of iPad Apps and Websites: First
research findings. Nielsen Norman Group, downloadable
from: www.nngroup.com/reports/mobile/ipad/ (accessed August, 2010). This
report discusses the usability testing of the iPad described in this chapter.
Can you tell us about your research and what motivates you?
I am an ethnographer who examines the interplay between technology and sociality. For
the past 6 years, I've primarily focused on how American teens integrate social media
into their daily practices. Because of this, I've followed the rise of many popular social
media services – MySpace, Facebook, YouTube, Twitter, etc. I examine what teens do on
these services, but I also consider how these technologies fit into teens' lives more
generally. Thus, I spend a lot of time driving around the United States talking to teens
and their parents, educators and youth ministers, law enforcement and social workers,
trying to get a sense of what teens' lives look like and where technology fits in.
How would you characterize good ethnography? And please include example(s)
from your own work.
Many people ask me why I bother driving around the United States talking to teens when
I can see everything that they do online. Unfortunately, what's visible online is only a
small fraction of what they do and it's easy to misinterpret why teens do something
simply by looking at the traces of their actions. Getting into their lives, understanding
their logic, and seeing how technology connects with daily practice are critically
important, especially because teens don't have distinct online versus offline lives. It's all
intertwined so it's necessary to see what's going on from different angles. Of course, this
is just the data collection process. I tend to also confuse people because I document a lot
of my thinking and findings as I go, highlighting what I learned publicly for anyone to
disagree with me. I find that my blog provides a valuable feedback loop and I'm
especially fond of the teen commenters who challenge me on things. I've hired many of
them.
I know you have encountered some surprises – or maybe even a revelation – would
you tell us about it please?
From 2006 through 2007, I was talking with teens in different parts of the country and I
started noticing that some teens were talking about MySpace and some teens were
talking about Facebook. In Massachusetts, I met a young woman who uncomfortably told
me that the black kids in her school were on MySpace while the white kids were on
Facebook. She described MySpace as like ‘ghetto.’ I didn't enter into this project expecting
to analyze race and class dynamics in the United States but, after her comments, I
couldn't avoid them. I started diving into my data, realizing that race and class could
explain the difference between which teens preferred which sites. Uncomfortable with
this and totally afar from my intellectual strengths, I wrote a really awkward blog post
about what I was observing. For better or worse, the BBC picked this up as a ‘formal
report from UC Berkeley’ and I received over 10 000 messages over the next week. Some
were hugely critical, with some making assumptions about me and my intentions. But the teens who wrote consistently agreed. And then two teens started pointing out to me
that it wasn't just an issue of choice, but an issue of movement, with some teens moving
from MySpace to Facebook because MySpace was less desirable and Facebook was safe.
Anyhow, recognizing the racist and classist roots of this, I spent a lot of time trying to
unpack the different language that teens used when talking about these sites in a paper
called ‘White Flight in Networked Publics? How Race and Class Shaped American Teen
Engagement with MySpace and Facebook.’