A Survey On Conversational Recommender Systems
Recommender systems are software applications that help users to find items of interest in situations of
information overload. Current research often assumes a one-shot interaction paradigm, where the users’
preferences are estimated based on past observed behavior and where the presentation of a ranked list of
suggestions is the main, one-directional form of user interaction. Conversational recommender systems (CRS)
take a different approach and support a richer set of interactions. These interactions can, for example, help
to improve the preference elicitation process or allow the user to ask questions about the recommendations
and to give feedback. The interest in CRS has significantly increased in the past few years. This development
is mainly due to the significant progress in the area of natural language processing, the emergence of new
voice-controlled home assistants, and the increased use of chatbot technology. With this paper, we provide a
detailed survey of existing approaches to conversational recommendation. We categorize these approaches in
various dimensions, e.g., in terms of the supported user intents or the knowledge they use in the background.
Moreover, we discuss technological approaches, review how CRS are evaluated, and finally identify a number
of gaps that deserve more research in the future.
CCS Concepts: • Information systems → Recommender systems; • General and reference → Surveys
and overviews; • Human-centered computing → Interactive systems and tools;
1 INTRODUCTION
Recommender systems are among the most visible success stories of AI in practice. Typically, the
main task of such systems is to point users to potential items of interest, e.g., in the context of an
e-commerce site. In this way, they not only help users in situations of information overload [126], but they can also significantly contribute to the business success of the service providers [57].
In many of these practical applications, recommending is a one-shot interaction process. Typically,
the underlying system monitors the behavior of its users over time and then presents a tailored set
of recommendations in pre-defined navigational situations, e.g., when a user logs in to the service.
Although such an approach is common and useful in various domains, it can have a number of
potential limitations. There are, for example, a number of application scenarios where the users' preferences cannot be reliably estimated from their past interactions. This is often the case with
high-involvement products (e.g., when recommending a smartphone), where we might even have
no past observations at all. Furthermore, what to include in the set of recommendations can be
highly context-dependent, and it might be difficult to automatically determine the user’s current
situation or needs. Finally, another common assumption is that users already know their preferences
when they arrive at the site. This might, however, not necessarily be true. Users might also construct
their preferences only during the decision process [152], when they become aware of the space of
the options. In some cases, they might also learn about the domain and the available options only
during the interaction with the recommender [154].
ACM Computing Surveys, Vol. 54, No. 4, Article 105. Publication date: May 2021.
105:2 Dietmar Jannach et al.
The promise of Conversational Recommender Systems (CRS) is that they can help to address
many of these challenges. The general idea of such systems, broadly speaking, is that they support
a task-oriented, multi-turn dialogue with their users. During such a dialogue, the system can elicit
the detailed and current preferences of the user, provide explanations for the item suggestions, or
process user feedback on the suggestions made. Given the significant potential of such systems, research on CRS has a long tradition. As early as the late 1970s, Rich [127] envisioned a computerized librarian that makes reading suggestions to users by interactively asking them
questions, in natural language, about their personality and preferences. Besides interfaces based on
natural language processing (NLP), a variety of form-based user interfaces1 were proposed over the
years. One of the earlier interaction approaches in CRS based on such interfaces is called critiquing,
which was proposed as a means for query reformulation in the database field already in 1982 [144].
In critiquing approaches, users are presented with a recommendation early in the dialogue and can then apply pre-defined critiques to the recommendations, e.g., "less $$" [15, 49].
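To illustrate, a unit critique like "less $$" can be seen as a filter over the remaining candidate items relative to the currently shown recommendation. The following sketch is illustrative only; the catalog and its attributes are invented for this example and do not come from any specific system.

```python
# Illustrative sketch of unit critiquing: a critique such as "less $$" filters the
# candidate items relative to the currently recommended item. The catalog and its
# attributes are hypothetical examples.

def apply_critique(items, reference, attribute, direction):
    """Keep items whose attribute value improves on the reference item."""
    if direction == "less":
        return [i for i in items if i[attribute] < reference[attribute]]
    return [i for i in items if i[attribute] > reference[attribute]]

catalog = [
    {"name": "Camera A", "price": 900, "zoom": 10},
    {"name": "Camera B", "price": 450, "zoom": 8},
    {"name": "Camera C", "price": 300, "zoom": 5},
]

recommended = catalog[0]                                         # item shown to the user
cheaper = apply_critique(catalog, recommended, "price", "less")  # critique: "less $$"
```

In a real system, critiques may also be compound (e.g., cheaper and more zoom) or dynamically generated from the remaining item set.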
Form-based approaches can generally be attractive as the actions available to the users are
pre-defined and non-ambiguous. However, such dialogues may also appear non-natural, and users
might feel constrained in the ways they can express their preferences. NLP-based approaches, on the other hand, long suffered from technical limitations, e.g., in the context of processing
voice commands. In recent years, however, major advances were made in language technology. As
a result, we are nowadays used to issuing voice commands to our smartphones and digital home
assistants, and these devices have reached an impressive level of recognition accuracy. In parallel
to these developments in the area of voice assistants, we have observed a fast uptake of chatbot
technology in recent years. Chatbots, both rather simple and more sophisticated ones, are usually
able to process natural language and are nowadays widely used in various application domains,
e.g., to deal with customer service requests.
These technological advances have led to increased interest in CRS in recent years. In contrast to many earlier approaches, today's technical proposals are now more often based on machine learning technology instead of following pre-defined dialogue paths. Often, however, there still remains a gap between the capabilities of today's voice assistants and chatbots
compared to what is desirable to support truly conversational recommendation scenarios [117], in
particular when the system is voice-controlled [161, 165].
In this paper, we review the literature on CRS in terms of common building blocks of a typ-
ical conceptual architecture of CRS. Specifically, after providing a definition and a conceptual
architecture of a CRS in Section 2, we discuss (i) interaction modalities of CRS (Section 3), (ii) the
knowledge and data they are based upon (Section 4), and (iii) the computational tasks that have to
be accomplished in a typical CRS (Section 5). Afterwards, we discuss evaluation approaches for
CRS (Section 6) and finally give an outlook on future directions.
One fundamental characteristic of CRS is their task orientation, i.e., they support recommendation-specific tasks and goals. The main task of the system is to provide recommendations to the users, with the goal of supporting the users' decision-making process or helping them find relevant
information. Additional tasks of CRS include the acquisition of user preferences or the provision of
explanations. This specific task orientation distinguishes CRS from other dialogue-based systems,
such as the early ELIZA system [158] or similar chat robot systems [151].
The other main feature of a CRS according to our definition is that there is a multi-turn conversa-
tional interaction. This stands in contrast to systems that merely support question answering (Q&A
tools). Providing one-shot Q&A-style recommendations is a common feature of personal digital
assistants like Apple’s Siri and similar products. While these systems already today can reliably
respond to recommendation requests, e.g., for a restaurant, they often face difficulties maintaining a
multi-turn conversation. A CRS therefore explicitly or implicitly implements some form of dialogue
state management to keep track of the conversation history and the current state.
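A minimal form of such dialogue state management can be sketched as follows; the slot names and the simple "ask the next missing slot" policy are illustrative assumptions, not a description of any particular system.

```python
# Minimal sketch of dialogue state tracking in a CRS: the state keeps the turn
# history and the preferences elicited so far, and a simple policy picks the next
# action. Slot names and the policy are illustrative assumptions.

class DialogueState:
    def __init__(self):
        self.history = []        # list of (speaker, utterance) turns
        self.preferences = {}    # elicited slot -> value pairs

    def update(self, speaker, utterance, slots=None):
        self.history.append((speaker, utterance))
        if slots:
            self.preferences.update(slots)

    def next_action(self, required_slots):
        """Ask for the first missing slot; recommend once all slots are filled."""
        for slot in required_slots:
            if slot not in self.preferences:
                return ("ask", slot)
        return ("recommend", None)

state = DialogueState()
state.update("user", "I want an Italian restaurant", {"cuisine": "italian"})
action = state.next_action(["cuisine", "price_range"])   # -> ("ask", "price_range")
```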
Note that our definition does not make any assumptions regarding the modality of the inputs
and the outputs. CRS can be voice controlled, accept typed text, or obtain their inputs via form
fields, buttons, or even gestures. Likewise, the output is not constrained and can be speech, text, or multimedia content. Similarly, no assumptions are made regarding who drives the dialogue.
Generally, conversational recommendation shares a number of similarities with conversational
search [115]. In terms of the underlying tasks, search and recommendation have in common that
one main task is to rank the objects according to their assumed relevance, either for a given
query (search) or the preferences of the user (recommendation). Furthermore, in terms of the
conversational part, both types of systems have to interpret user utterances and disambiguate
user intents in case natural language interactions are supported. In conversational search systems,
however, the assumption often is that the interaction is based on “written or spoken form” [115],
whereas in our definition of CRS various types of input modalities are possible. Overall, the boundary
between (personalized) conversational search and recommendation systems often seems blurry, see
[86, 139, 172], in particular as often similar technological approaches are applied. In this survey, we
limit ourselves to works that explicitly mention recommendation as one of their target problems.
Computational Elements. One central part of such an architecture usually is a Dialogue Management System (also called a "state tracker" or similar in some systems). This component drives the
process flow. It receives the processed inputs, e.g., the recognized intents, entities and preferences,
and correspondingly updates the dialogue state and user model. After that, using a recommendation
and reasoning engine and background knowledge, it determines the next action and returns appro-
priate content like a recommendation list, an explanation, or a question to the output generation
component.
The User Modeling System can be a component of its own, in particular when long-term user preferences have to be considered. In some cases, the current preference profile is implicitly
part of the dialogue system. The Recommendation and Reasoning Engine is responsible for retrieving
a set of recommendations, given the current dialogue state and preference model. This component
might also implement other complex reasoning functionality, e.g., to generate explanations or to
compute a query relaxation (see later). Besides these central components, typical CRS architectures
[Figure: conceptual CRS architecture, showing a Dialogue Management System connected to a User Modeling System (with User Models), a Recommendation and Reasoning Engine, and an Item Database.]
comprise modules for input and output processing. These can, for example, include speech-to-text
conversion and speech generation. On the input side—in particular in the case of natural language
input—additional tasks are usually supported, including intent detection and named entity recognition
[66, 99], for identifying the users' intentions and entities (e.g., attributes of items) in their utterances.
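As a simple illustration of the intent detection step, a keyword-based detector might look as follows; production systems typically use trained classifiers instead, and the intent names and patterns here are invented for the example.

```python
# Illustrative keyword-based intent detection. Real CRS typically use trained
# classifiers, but the interface is similar: map an utterance to one intent label.
# The intent names and patterns are invented for this example; note that with this
# naive approach, pattern order matters when several patterns would match.
import re

INTENT_PATTERNS = {
    "request_recommendation": r"\b(recommend|suggest|looking for)\b",
    "provide_preference":     r"\b(i like|i want|i prefer)\b",
    "ask_explanation":        r"\b(why|how come)\b",
    "restart":                r"\b(start over|reset)\b",
}

def detect_intent(utterance):
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if re.search(pattern, text):
            return intent
    return "chit_chat"   # fallback for utterances matching no task intent

intent = detect_intent("Can you suggest a comedy?")   # -> "request_recommendation"
```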
Knowledge Elements. Various types of knowledge are used in CRS. The Item Database is something
that is present in almost all solutions, representing the set of recommendable items, sometimes
including details about their attributes. In addition to that, different types of Domain and Background
Knowledge are often leveraged by CRS. Many approaches explicitly encode dialogue knowledge
in different ways, e.g., in the form of pre-defined dialogue states, supported user intents, and the
possible transitions between the states. This knowledge can be general or specific to a particular
domain. The knowledge can furthermore either be encoded by the system designers or automatically
learned from other sources or previous interactions. Typical examples of learning approaches are those that use machine learning to build statistical models from corpora of recorded dialogues.
Generally, domain and background knowledge can be used by all computational elements. Input
processing may need information about entities to be recognized or knowledge about the pre-
defined intents. The user modeling component may be built on estimated interest weights regarding
certain item features, and the reasoning engine may use explicit inference knowledge to derive the
set of suitable recommendations.
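For example, a user model based on estimated interest weights over item features might score candidate items by weighted feature overlap. The sketch below is illustrative; the feature names and weights are invented for this example.

```python
# Illustrative feature-weight user model: the profile stores an estimated interest
# weight per item feature, and candidate items are scored by summing the weights
# of their features. Feature names and weights are invented for this example.

def score_item(item_features, interest_weights):
    return sum(interest_weights.get(f, 0.0) for f in item_features)

interest_weights = {"action": 0.8, "comedy": 0.1, "90s": 0.4}   # elicited/estimated
movie = {"title": "Movie X", "features": ["action", "90s"]}

score = score_item(movie["features"], interest_weights)   # 0.8 + 0.4 = 1.2
```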
2 We looked at Springer Link, the ACM Digital Library, IEEE Xplore, ScienceDirect, arXiv.org, and ResearchGate
that we considered in this work.3 Looking at the type of these papers, the majority of the works
described technical proposals for one of the computational components of a CRS architecture. A
smaller set of papers described demo systems. Another set comprised analytical works which, for example, reviewed certain general characteristics of CRS.
Generally, we only included papers that are compliant with our definition of a CRS given above.
We therefore did not include papers that discussed one-shot or multi-step question-answering
systems [133, 166], even when the question or task was about a recommendation. We also did not
consider general dialogue systems like chatbot systems, which are not task-oriented, or systems that
only support a query-response interaction process like a search engine without further dialogue
steps, e.g., [31]. Furthermore, we did not include dialogue-based systems, which were task-oriented,
but not on a recommendation task, e.g., the end-to-end learning approaches presented in [159] and
[76], which focus on restaurant search and movie-ticket booking. Furthermore, we excluded a few
works like [50] or [174], which use the term "interactive recommendation"; there, however, the term refers to a system that addresses observed changes in user interests over time but is not designed to support a dialogue with the user. Other works like [138] or [174] mainly focus on finding good
strategies for acquiring an initial set of ratings for cold-start users. While these works can be seen
as supporting an interactive process, there is only one type of interaction, which is furthermore
mostly limited to a profile-building phase. Finally, there are a number of works where users of
a recommender system are provided with mechanisms to fine-tune their recommendations,
which is sometimes referred to as “user control” [61]. Such works, e.g., [163], in principle support
user actions that can be found in some CRS, for example to give feedback on a recommendation.
The interaction style of such approaches is however not a dialogue with the system.
Alexa or Google Home, e.g., [4, 36]. Compared to form-based approaches, these solutions usually
offer more flexibility in the dialogue and sometimes support chit-chat and mixed-initiative dia-
logues. Major challenges can, however, lie in the understanding of the users’ utterances and the
identification of their intents. The presentation of the recommendations can also be difficult, in
particular when more than one option should be provided at once.
Hybrid approaches that combine natural language with other modalities are, therefore, not
uncommon. For example, systems that support written natural language dialogues often rely on
list-based or other visual approaches to present their results [73, 172]. The work presented in
[167], on the other hand, supports a hybrid visual/natural language interaction mechanism, where
recommendations are displayed visually, and users can provide feedback to certain features in a
critiquing-like form in natural language. Yet other systems support voice input, but present the
recommendations in textual form [47, 142], because it can be difficult to present more than one
recommendation at a time through spoken language without overwhelming the users. Chatbot
applications, finally, often combine natural language input and output with structured form elements
(e.g., buttons) and a visually-structured representation of the recommendations [53, 62, 100, 114].
Besides written or spoken language and fill-out forms, a few other alternative and application-
specific modalities for inputs and outputs can be found. The dialogue system presented in [150],
for example, supports multiple types of inputs, including visual inputs on a geographic map, pen
gestures like zooming, or handwritten input. The work proposed in [18] furthermore tries to process
non-verbal input, like body postures, gestures, facial expressions, as well as speech prosody to estimate
the user’s emotions and attitudes in order to acquire implicit feedback and preferences.
In terms of the outputs, several approaches use interactive geographic maps, often as part of a
multi-modal output strategy [5, 39, 73, 150]. The applicability of map-based approaches is limited to
certain application domains, e.g., travel and tourism, but can help to overcome various challenges
regarding the user experience with conversational systems [125]. The use of embodied conversational
agents (ECAs) [19] as an additional output mechanism is also not uncommon in the literature
[41, 52] because of the assumed general persuasive potential of human-like avatars [2, 38]. Various
factors can impact the effectiveness of such ECAs. In [43], for example, the authors analyze the
effects of non-verbal behavior (e.g., the facial expressions) on the effectiveness of an ECA in the
context of a dialogue-based recommender system. Research on the specific effects of using different
variants of an ECA in the context of recommender systems is, however, generally rare.
Finally, a few works exist where users interact with a recommendation system within a virtual,
three-dimensional space. In [33, 34], the authors describe a virtual shopping environment where
users interact with a critiquing-based recommender and can, in addition, collaborate with other
users. Supporting group decisions is also the goal of the work presented in [1]. In this work, however
no 3D visualization is supported, and the focus of the work is mostly to enable the conversation
between a group of users supported by a recommender system. Figure 2 provides an overview of
common input and output modalities found in the literature.
[Figure 2: common input and output modalities, including interactive maps (including gestures), embodied agents (including non-verbal acts), and interaction in 3D space, among others.]
context is the use of a CRS on voice-based home assistants (smart speakers) [4, 36]. In such settings,
providing recommendations is only one of many functionalities the device is capable of. Users
might therefore not actually perceive the system as primarily being a recommender.
Supported Devices. An orthogonal aspect regarding the application environment of a CRS is
that of the supported devices. This is particularly important, because the specific capabilities and
features of the target device can have a significant impact on the design choices when building a
CRS. The mentioned smart speaker applications, for example, are specifically designed for hardware
devices that often only support voice-based interactions. This can lead to specific challenges, e.g.,
when it comes to determining the user’s intent or when a larger set of alternatives should be
presented to the users. The interaction with chatbot applications, on the other hand, is typically
not tied to specific hardware devices. Commonly, they are either designed as web applications or
as smartphone and tablet applications. However, the choice of communication modality
can still depend on the device characteristics. Typing on small smartphone screens may be tedious
and the limited screen space in general requires the development of tailored user interfaces.
The applicability of CRS is not limited to the mentioned devices. Alternative approaches were, for
example, investigated in [18, 37]. Here, the idea is that the CRS is implemented as an application on
an interactive wall that could be installed in a real store. A camera is furthermore used to monitor
and interpret the user’s non-verbal communication actions, in particular facial expressions and
gestures. An alternative on-site environment was envisioned in [170]. Here, the ultimate goal is
to build a CRS running on a service robot, in this case one that is able to elicit a customer’s food
preferences in a restaurant. Yet another application scenario, that of future in-car recommender
systems, is sketched in [83]. Given the specific situation in a driving scenario, the use of speech
technology often is advisable [22], which almost naturally leads to conversational recommendation
approaches, e.g., for driving-related aspects like navigation or entertainment [8, 9].
pre-defined or dynamically determined critiques to further refine their preferences. While the users
in such applications have some choices regarding the dialogue flow, e.g., they can decide to accept a
recommendation or further apply critiques, these choices are typically very limited and the available
critiques are determined by the system. Another class of mostly system-driven applications are
the form-based interactive advisory systems discussed in [41]. Here, the system guides the user
through a personalized preference elicitation dialogue until enough is known about the user. Only
after the initial recommendations are displayed can the user influence the dialogue by selecting
from pre-defined options like asking for an explanation or by relaxing some constraints.
The other extreme would be a user-driven system, where the system takes no proactive role.
The resulting dialogue therefore consists of "user-asks, system-responds" pairs, and it is questionable whether we would call such an exchange a conversational recommendation. Such conversation
patterns are rather typical for one-shot query-answering, search and recommendation systems that
are not in the scope of our survey. As a result, in the papers considered relevant for this study, we
did not find any paper that aimed at building an entirely user-driven system in which the system
never actively engages in a dialogue, i.e., one that never asks any questions. A special case in
that context is the recommender system proposed in [82], which monitors an ongoing group chat
and occasionally makes recommendations to the group based on the observed communication.
This observation is not surprising because every CRS is a task-oriented system aiming to achieve
goals like obtaining enough reliable information about the user’s preferences. As a result, almost
all approaches in the literature are mixed-initiative systems, although with different degrees of
system guidance. Typical chatbot applications, for example, often guide users through a series of
questions with pre-defined answer options (using forms and buttons), and at the same time allow
them to type in statements in natural language. In fully NLP-based interfaces, users typically have
even more freedom to influence how the dialogue continues. Still, also in these cases, the system
typically has some agenda to move the conversation forward.
Technically, even a fully NLP-based dialogue can almost entirely be system-driven and mostly
rely on a “system asks, user responds” [172] conversation pattern. Nonetheless, the provision of a
natural language user interface might leave the users disappointed when they find out that they
can never actively engage in the conversation, e.g., by asking a clarifying question or requesting an explanation regarding the system's question.
3.4 Discussion
A variety of ways exist in which the user’s interaction with a CRS can be designed, e.g., in terms
of the input and output modalities, the supported devices, or the level of user control. In most
surveyed papers, these design choices are, however, rarely discussed. One reason is that in many
cases the proposed technical approach is mostly independent of the interaction modality, e.g., when
the work is on a new strategy to determine the next question to ask to the user. In other cases, the
modalities are pre-determined by the given research question, e.g., how to build a CRS on a mobile device.
More research therefore seems required to understand how to make good design choices in
these respects and what the implications and limitations of each design choice are. Regarding the
chosen form of inputs and outputs, it is, for example, not always entirely clear if natural language
interaction makes the recommendation more efficient or effective compared to form-based inputs.
Pure natural language interfaces in principle provide the opportunity to elicit preferences in a
more natural way. However, these interfaces have their limitations as well. The accuracy of the
speech recognizer, for example, can have a major impact on the system’s usability. In addition, some
users might also be better acquainted and feel more comfortable with more traditional interaction
mechanisms (forms and buttons). According to the study in [54], a mix of a natural language
interface and buttons led to the best user experience. Moreover, in [102], it turned out that in
situations of disambiguation, i.e., when a user has to choose among a set of multiple alternatives,
a mixed-interaction mode (an NLP interface with buttons) can make the task easier for users. Overall,
while in some cases the choice of the modalities is predetermined through the device, finding an
optimal combination of interaction modalities remains challenging, in particular as individual user
preferences might play a role here.
More studies are also needed to understand how much flexibility in the dialogue is required
by users or how much active guidance by the system is appreciated in a certain application.
Furthermore, even though language-based and in particular voice-based conversations have become
more popular in recent years, certain limitations remain. It is, for example, not always clear
how one would describe a set of recommendations when using voice output. Reading out more
than one recommendation seems impractical in most cases and something that we could call
“recommendation summarization” might be needed.
Despite these potential current limitations, we expect a number of new opportunities where CRS
can be applied in the future. With the ongoing technological developments, more and more devices
and machines are equipped with CPUs and are connected to the internet. In-store interactive
walls, service robots and in-car recommenders, as discussed above, are examples of visions that are
already pursued today. These new applications will, however, also come with their own general
challenges (e.g., privacy considerations, aspects of technology acceptance) and application-specific
ones (e.g., safety considerations in an in-car setting).
Research on which user intents are relevant is generally scarce, and we only found 11 papers that explicitly discussed user intents. Among these, only a few, e.g., [16, 100, 164], considered
the majority of the domain-independent intents shown in Table 1. Others like [65, 105, 142] only
discuss certain subsets of them. Yet another set of papers focused on very application-specific
intents in the context of group recommendation [1, 103].
Table 1. High-level overview of selected domain-independent user intents found in the literature.
Starting, re-starting, and ending the dialogue. In NLP-based CRS, either the system or the user can
initiate the dialogue. In a user-driven conversation, the recommendation seeker might, for example,
explicitly ask for help [100] or make a recommendation request [162] to start the interaction. One
typical difficulty in this context is to recognize such requests when the dialogue starts with chit-chat.
Once the recommendation dialogue is underway, it is not uncommon that users want to start over, i.e., begin the session from scratch and "reset their profile" [100]. Previous studies observed such an intent in 5.2% of the dialogues [16], or found that 36.4% of the users had this intent in a conversation [65]. Finally, at the end of the conversation, the user has either found a
recommendation useful and accepts it in some form (e.g., by purchasing or consuming an item)
or not. In either case, the CRS has to react to the intent in some form by redirecting the user
accordingly, e.g., to the shopping basket, or by saying goodbye.
Chit-chat. Many NLP-based systems support chit-chat in the conversation. In the study in [164],
nearly 80% of the recorded user utterances were considered chit-chat. This number indicates that
supporting chit-chat conversations can be a valuable means to create an engaging user experience.
Furthermore, the study in [164] showed that chit-chat can also help to reduce user dissatisfaction,
even though this part of the conversation is irrelevant to achieving the interaction goal.
Preference Elicitation. Understanding the user’s preferences is a key task for any CRS. Preference
information can be provided by the user in different ways. In an initial phase of the dialogue, the
user might specify some of the desired characteristics of the item that she or he is interested in or
even provide strict filtering constraints. In [105], this process is termed "give criteria". In later
phases, the user might however also want to revise the previously stated preferences. Note that
some authors also consider answering—to a system-provided question or proposal for a constraint
[25, 145]—as a dialogue intent during preference elicitation [156]. Since in NLP-based systems a
user may respond in an arbitrary way, it is clearly important for the system to disambiguate an
answer by the user from other utterances. Such an “Answer” intent nonetheless is different from
the other intents discussed here, as the intent is a response to the system’s initiative of asking.
Also later in the process, preferences can be stated by the user in different ways after an initial
recommendation is made by the system. In critiquing-based approaches, the users can, for example,
add additional constraints in case the choice set is too large, relax some of the previously stated
ones, or state that they already know the item [56, 123, 142]. Generally, a system might also allow
the user to inspect, modify, and delete the current profile (supporting a “show profile” intent) [100].
By analyzing the interaction logs of a prototypical voice-controlled movie recommender, e.g., in
[65], the authors found that many users (41.1%) at some stage try to refine their initially stated
preferences. In particular in the case of unsatisfactory system responses, some users might also have the intent to "reject" [156] a recommendation or "restate" their preferences. In the study
presented in [16], this however happened only in 1.5% of the interactions.
Obtaining Recommendations and Explanations. There are various ways in which users might ask
for recommendations and additional information about the items. Asking for recommendations
often happens at the very beginning of a dialogue, but it can also occur after the user has revised their preferences. In case a currently displayed list of options is not satisfactory, users also
might ask the system to “show more” options [16] or ask for a similar item for comparison. For
each of the items, the user might want to learn more about its details or ask for an explanation, e.g.,
why it was recommended [100]. Finally, an alternative form of requesting a recommendation is to
ask the system about its opinion (“how about”) regarding a certain item, see e.g., [164].
Besides such ephemeral user models that are constructed during the ongoing session, some
approaches in the literature also maintain long-term preference profiles [4, 123, 142, 154]. In the
critiquing approach in [123], for example, the system tries to derive long-term and supposedly
more stable preferences (e.g., for non-smoking rooms in restaurants) from multiple sessions. In
the content-based recommendation approach adopted in [142], a probabilistic model is maintained
based on past user preferences for items. In general, a key problem when recommending based
on two types of models (long-term and short-term) is to determine the relative importance of the
individual models. One so far unexplored option could lie in the consideration of contextual factors
such as seasonal aspects, the user’s location, or the time of the day.
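As a simple illustration of such a combination, the following sketch blends the two model types with a context-dependent weight; the blending scheme and the session-length heuristic are our own invented assumptions, not taken from any of the cited systems.

```python
def blended_score(item, short_term, long_term, alpha):
    """Combine short-term (session) and long-term (profile) preference
    scores for an item; alpha in [0, 1] weights the session model."""
    return alpha * short_term.get(item, 0.0) + (1 - alpha) * long_term.get(item, 0.0)

def session_weight(feedback_steps, base=0.5, step=0.1, cap=0.9):
    """Illustrative heuristic: trust the session model more the more
    in-session feedback the user has given; contextual factors (time of
    day, location) could be folded in analogously."""
    return min(cap, base + step * feedback_steps)

# After three feedback steps, the session model dominates slightly.
alpha = session_weight(3)  # 0.8
blended_score("item42", {"item42": 1.0}, {"item42": 0.2}, alpha)  # ~0.84
```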
Finally, there are also approaches that try to leverage information about the collective preferences
of a user community, in particular for cold-start situations [101]. If nothing or little is known yet
about the user’s preferences, a common strategy is to recommend popular items, where item
popularity can be determined based on user ratings, reviews, or past sales numbers as in [47]. The
feedback obtained for these popular items can then be used to further refine the user model.
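A popularity-based cold-start strategy of this kind can be sketched in a few lines; the interaction log format is invented for illustration.

```python
from collections import Counter

def popular_items(interactions, k=3):
    """Return the k most popular items, where popularity is simply the
    number of observed interactions (ratings, reviews, or purchases)."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

# (user, item) interaction log
logs = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "c"), ("u4", "a")]
popular_items(logs, k=1)  # ['a']
```

Feedback on the items recommended this way can then be fed into the short-term user model as described above.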
Fig. 3. Pre-defined dialogue states in the Advisor Suite system (adapted from [58]).
Figure 3 shows a schematic overview of such a dialogue model. It consists of (i) a number of
dialogue steps that serve to acquire the user’s preferences through questions, and (ii) special dialogue
states in which the system presents the results, provides explanations, or shows a comparison
between different alternatives. The possible transitions are defined at design time, but which path is
taken during the dialogue is determined dynamically based on decision rules. Another example of
a work based on a pre-defined set of states and possible transitions is the interactive tourism
recommender system proposed in [86]. In their case, the transitions at run-time are not determined
based on manually engineered decision rules, but learned from the data using reinforcement
learning techniques, where one goal is to minimize the number of required interaction steps.
Technically, there are different ways of explicitly representing such state machines. Some tools,
as the one mentioned above, use visual representations, others rely on textual and declarative
representations like “dialogue grammars” [13] and case-frames [12]. Google’s DialogFlow, as an
example of a commercial service, uses a visual tool to model linear and non-linear conversation
flows, where non-linear means that there are different execution paths, depending on the user’s
responses or contextual factors. Finally, in some cases, the possible states are simply hard-coded as
part of the general program logic of the application.
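As a minimal illustration of such an explicitly represented (here: hard-coded) state machine, consider the following sketch; the states, user events, and the fallback behavior are invented examples, not taken from a specific system.

```python
# Pre-defined dialogue states and transitions; unknown (state, event)
# pairs fall back to a clarification state.
TRANSITIONS = {
    ("elicit_genre", "answered"): "elicit_era",
    ("elicit_era", "answered"): "present_results",
    ("present_results", "ask_explanation"): "explain",
    ("present_results", "refine"): "elicit_genre",
    ("explain", "back"): "present_results",
}

def next_state(state, user_event):
    """Return the follow-up dialogue state for a user event."""
    return TRANSITIONS.get((state, user_event), "clarify")

next_state("present_results", "ask_explanation")  # 'explain'
next_state("present_results", "goodbye")          # 'clarify'
```

The transitions are fixed at design time, while the path actually taken emerges at run-time from the sequence of user events, mirroring the dialogue models discussed above.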
In some works, and in particular in early critiquing-based ones which are based on forms and
buttons [122, 123, 134], only a few generic dialogue states exist, which means that no complex flow
has to be designed. After an initial preference elicitation stage, recommendations are presented,
and the system offers a number of critiques that the user can apply until a recommendation is
accepted or rejected. Dialogue state management is therefore in some ways relatively light-weight.
The main task of the system in terms of dialogue management is to keep track of the user responses
and, in case of dynamic critiquing, make inferences regarding the next critiques to offer.
Similarly, in some NLP-based conversational preference elicitation systems such as [29, 172],
there are mainly two phases: asking questions, in this case in an adaptive way, and presenting a
recommendation list. In other NLP-based systems, the possible dialogue states are not modeled
explicitly as such, but implicitly result from the implemented intents. For example, whether or not
there is a dialogue state “provide explanation” depends on whether a corresponding intent was
considered in the design phase of the system.
Finally, in the NLP-based end-to-end learning CRS proposed in [75], the dialogue states are in
some ways also modeled implicitly, but in a different way. This system is based on a corpus of
recorded human conversations (between crowdworkers) centered around movie recommendations.
This corpus is used to train a complex neural model, which is then used to react to utterances
by users. Looking at the conversation examples, these conversations, besides some chit-chat,
mainly consist of interactions where one communication partner asks the other if she or he likes a
certain movie. The sentiment of the answer of the movie seeker is then analyzed to make another
recommendation, again mostly in the form of a question. The dialogue model is therefore relatively
simple and encoded in the neural model. It seemingly does not support many other types of intents
or information requests that do not contain movie names (e.g., “I would like to see a sci-fi movie”).
created their own datasets or relied on preexisting datasets from different domains. In Table 2, we
provide examples of datasets containing item-related information. It can be observed that it is not
uncommon, e.g., in critiquing-based applications, that researchers solely rely on datasets which
they created or collected for the purpose of their studies, i.e., there is limited reuse of datasets by
other researchers. One main underlying reason is that in most papers we analyzed, researchers did
not publicly share their datasets.
Table 2. Examples of datasets containing item-related information.
Domain Description
Movies Traditional movie rating databases from MovieLens, EachMovie, and Netflix, used
for example in [75, 174].
Electronics A product database with more than 600 distinct products, collected from
various retailers [47].
A smartphone database consisting of 1,721 products with multiple features [34].
An Amazon electronics review dataset containing millions of products, user
reviews, and product meta-data [172].
A dataset consisting of 120 personal computers, each with 8 features [134].
Travel More than 100 sightseeing spots in Japan with 25 different features [53].
A database of restaurants in the San Francisco area covering 1,900 items with
multiple features like cuisine, ratings, price, location, or parking [142].
Search logs and reviews of 3,549 users of a restaurant review provider, focusing
on locations in Cambridge [29].
A travel destinations dataset, crawled from online platforms, containing
5,723,169 venues in 180 cities around the globe [39].
A restaurants dataset crawled for Dublin city, which consists of 632 restaurants
with 28 different features [92].
Food Recipes A food recipe dataset containing dishes and their ingredients [170].
E-commerce A product database of 11M products and logged data from the search engine
of an e-commerce website. The logged data consists of 3,146,063
unique questions [164].
Music A music dataset crawled from multiple online sources, containing 2,778 songs
with 206k explanatory statements and 22 user tags [173].
4.4.2 Dialogue Corpora Created to Build CRS. NLP-based dialogue systems are usually based
on training data that consist of recorded and often annotated conversations between humans
(interaction histories). A number of initiatives were therefore devoted to creating such datasets
that can be used to build CRS. Other researchers, in contrast, rely on dialogue datasets that were
created or collected for other purposes. Generally, these corpora can be obtained with the help of
crowdworkers [64, 75, 139], by annotating interviews [16, 18, 109], or by logging interactions with
a chatbot like in [63]. Table 3 shows examples of such datasets used in recent research.
Note that in some cases when building a CRS, these dialogue corpora are combined with other
knowledge bases [75, 170]. In [75], for example, both a dialogue corpus and MovieLens data are
used for the purposes of sentiment analysis and rating prediction. Such a combination of datasets
can be necessary when there is not enough relevant information in the dialogues.
4.4.3 Logged Interaction Histories. Building an effective CRS requires understanding the conversational needs of the users, e.g., how they prefer to provide their preferences, which intents they
might have, and so on. One way to better understand these needs is to log and analyze interactions
between users and a prototypical system. These logs then serve as a basis for further research.
Differently from the dialogue corpora discussed above, these datasets were often not primarily
created to build a CRS, but to better understand the interaction behavior of users. In [154, 155],
for example, the interactions of the user with a specific NLP-based CRS were analyzed regarding
dialogue quality and dialogue strategies. In [16, 18], user studies were conducted prior to developing
the recommender system to understand and classify possible feedback types by users. In some
approaches like [18, 114], researchers annotated and labeled such datasets for the purpose of model
training and system initialization. However, such logged histories are—except for [114]—typically
much smaller in size than the dialogue corpora discussed above, mostly because they were collected
during studies with a limited number of participants. Examples of datasets obtained by logging
system interactions and user studies are shown in Table 4.
4.4.4 Lexicons and World Knowledge. Researchers often use additional knowledge bases to
support the entity recognition process in NLP-based systems. In [73, 80], for instance, information
was harvested from online sources such as Wikipedia or Wikitravel to develop dictionaries for the
purpose of entity-keyword mapping. Similarly, the WordNet corpus was used in [73] to determine
the semantic distance of an identified keyword in a conversation with predefined entities. More
examples for the use of lexicons and world knowledge are shown in Table 5.
105:16 Dietmar Jannach et al.
Table 4. Examples of datasets obtained from logged system interactions and user studies.
Domain Description
Movies A dialogue dataset involving 347 users was collected in [65] during the
experimental evaluation of a recommender system.
A subset of the ReDial dataset was analyzed and annotated in [16] to classify
the user feedback types in 200 dialogues at the utterance level.
A dialogue corpus was collected in [154] for the purpose of dialogue quality
analysis consisting of 226 complete dialogue turns with 20 users.
A user study was conducted in [155], where a movie seeker and a human
recommender converse with each other. The dialogue corpus consists of 2,684
utterances and 24 complete dialogues.
Travel A dataset containing preferences for hotel, flight, and car rental searches was
collected in [4] involving 200 users of a content-based recommender system that
supports multiple tasks (i.e., hotel, car, flight booking) in the same dialogue.
Fashion A user study was conducted using a virtual shopping system. A non-verbal
feedback (e.g., gestures, facial expressions, voices) dataset involving 345 subjects
was collected and then annotated for model training [18].
E-commerce A dataset containing conversation logs of users with a chatbot of an online
customer service center (Alibaba.com) was collected in [114]. It consists of over
91,000 Q&A pairs as a knowledge base used for the information retrieval task.
4.5 Discussion
Our discussions show that CRS can be knowledge-intensive or data-intensive systems. Differently
from the traditional recommendation problem formulation, where the goal is to make relevance
predictions for unseen items, CRS often require much more background information than just a
user-item rating matrix, in particular in the context of dialogue management.
Pre-defined Knowledge vs. Learning Approaches. In CRS approaches that use forms and buttons as
the only interaction mechanism, the interaction flow is typically pre-defined in the form of the
possible dialogue states, the set of supported user intents, and the user profile attributes to acquire.
NLP-based systems, in contrast, are usually more dynamic in terms of the dialogue flow, and they
rely on additional knowledge sources like dialogue corpora and answer templates as well as lexicons
and world knowledge bases. Nonetheless, these systems typically require the manual definition of
additional background knowledge, e.g., with respect to the supported user intents.
Pure “end-to-end” learning only from recorded dialogues seems challenging. In most existing
approaches the set of supported interaction patterns is implicitly or explicitly predefined, e.g., in the
form of “user provides preferences, system recommends”. To a certain extent, the collection
of human-to-human dialogues can also be designed to support possible system responses like in [75],
where the crowdworkers were given specific instructions regarding the expected dialogues. As a
result, the range of supported dialogue utterances can be relatively narrow. The system presented
in [75], for example, cannot handle a query like “good sci-fi movie please”.
Intent Engineering and Dialogue States. In case a richer dialogue and additional functionalities
are desirable, the definition of the supported user intents is usually a central and often manual task
during CRS development. Compared to general-purpose dialogue systems and home assistants,
however, the set of user intents that will be supported is often relatively small. We have identified
some common intent categories in Section 4.1. Depending on the domain, very specific intents can
also be supported, e.g., asking for a style tip in a fashion recommender system [105]. Furthermore,
yet another set of possible user intents has to be supported in CRS that are designed for group
decision scenarios. Typical user intents can, for example, relate to the invitation of collaborators
[103] or to a request for a group recommendation. Furthermore, there might be user utterances
that relate to the resolution of preference conflicts and voting among group members [1, 91, 103].
Generally, the set of user intents that the system supports determines how rich and varied the
resulting conversations can be. Not being able to appropriately react to user utterances can be
highly detrimental to the quality perception of the system. For example, being able to explain the
recommendations that the system makes is often considered as a key feature to make decision-
making easier or to increase user trust in a recommender system. A user of an NLP-based system
might therefore be easily disappointed by the conversation if the system fails to recognize and
respond to a request for an explanation.
A key challenge therefore is to anticipate or learn over time which intents the users might
have. Depending on the application and used technology, the design and implementation of an
intent database (e.g., using Google’s DialogFlow engine) can lead to substantial manual effort
and require the involvement of professional writers to achieve a certain naturalness and richness
of the conversation. At the same time, the rule-based modeling approach (“if-this-then-that”) as
implemented by major solution providers can easily lead to large knowledge bases that are difficult
to maintain, leading to a need for alternative modeling approaches [140].
5 COMPUTATIONAL TASKS
Having discussed possible user intents in recommendation dialogues, we will now review common
computational tasks and technical approaches for CRS. We distinguish between (i) main tasks, i.e.,
those related more directly to the recommendation process, e.g., compute recommendations or
determine the next question to ask, and (ii) additional, supporting tasks.
[Figure omitted: overview of system-driven, user-driven, and mixed-initiative CRS and the associated tasks Request, Respond, Recommend, and Explain.]
goal to increase dialogue efficiency, i.e., to minimize the number of required interactions (see also
Section 6). Various methods to determine the order of the facets were proposed in the literature
[20, 132, 142, 170]. In an early system [142], specific weights were used to rank the item attributes
for which the user has not expressed preferences yet. Entropy-based methods also consider the
potential effects of each attribute on the remaining item space. They aim to identify the next question
(attribute) that helps most to narrow down the candidate (item) space [20, 96, 104, 132, 166],
sometimes including feature popularity information [96]. Considerations like this are also the
foundation of typical dynamic and compound critiquing systems [25, 90, 111, 121, 134, 148, 171].
In compound critiquing systems, in particular, the user is not asked about feedback for one single
attribute, but for more than one within one interaction, e.g., “Different Manufacturer, Lower Processor
Speed and Cheaper”. Finally, in some systems, possible sequences of questions asked to the users are
pre-defined in the form of state machines [55, 58]. At run-time, the dialogue path is then chosen
based on the users’ inputs in the ongoing session.
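The entropy-based idea of selecting the next question can be sketched as follows; the item catalog and attribute names are invented for illustration.

```python
import math
from collections import Counter

def attribute_entropy(candidates, attribute):
    """Shannon entropy of an attribute's value distribution over the
    remaining candidate items."""
    counts = Counter(item[attribute] for item in candidates)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def next_question(candidates, attributes, asked):
    """Ask next about the not-yet-asked attribute with maximal entropy,
    i.e., the one whose answer narrows the candidate space the most."""
    open_attrs = [a for a in attributes if a not in asked]
    return max(open_attrs, key=lambda a: attribute_entropy(candidates, a))

phones = [
    {"brand": "A", "color": "black"},
    {"brand": "A", "color": "white"},
    {"brand": "A", "color": "red"},
]
# 'brand' carries no information here (all items share one value), so the
# system asks about 'color' next.
next_question(phones, ["brand", "color"], asked=set())  # 'color'
```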
Instead of using heuristics for attribute selection and static dialogue state transition rules, a
number of more recent systems rely on learning-based approaches, e.g., using reinforcement
learning [86, 139, 146]. In [139], for example, the authors use a deep policy network to decide on
the system action. Based on the current dialogue state, as modeled by a belief tracker, the system
either makes a request for a pre-defined facet or generates a recommendation to be shown to the
user. An alternative learning-based way to determine the question order was proposed in [30]. In
their work, the authors design a recommender for YouTube that leverages past watching histories
of the user community and a Recurrent Neural Network architecture to rank the questions (topics)
that are shown to the user in a conversational step.
An alternative to asking users about attribute-based preferences is to ask them to give feedback on
selected items. This can be done either by asking them to rate individual items (e.g., by like/dislike
statements) or by asking them to express their preference for item pairs or entire sets of items
[81]. The computational task in this context is to determine the most informative item(s) to present
to the user. Possible strategies include the selection of popular or diverse items in the cold-start
phase, items that are different in terms of their past ratings or attributes, or itemsets that represent
a balance of popularity and diversity [17, 93, 101, 120]. However, not only item features might be
relevant for the selection of the items. In [17], the authors found that a user’s willingness to give
feedback on an item can depend on additional factors. Specifically, they identified several situations
in which the feedback probability may be higher, e.g., when the system’s predicted rating deviates
from the user’s past experience of the item. In more recent works, again learning-based approaches
are more common. The authors of [29, 174], for example, employed bandit-based approaches to
either (i) determine the next item to be shown for eliciting the user’s absolute feedback (i.e., like or
dislike), or (ii) to select a pair of items for obtaining the user’s relative preference regarding these
two items.
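A heavily reduced epsilon-greedy variant of such a bandit-style strategy for eliciting absolute (like/dislike) feedback could look like the following sketch; the reward bookkeeping is a deliberate simplification, not the actual method of [29] or [174].

```python
import random

class EpsilonGreedyElicitor:
    """Pick the next item to show for like/dislike feedback: mostly
    exploit items with high estimated reward, sometimes explore."""

    def __init__(self, items, epsilon=0.1, seed=0):
        self.items = list(items)
        self.epsilon = epsilon
        self.rewards = {i: 0.0 for i in self.items}  # running mean reward
        self.counts = {i: 0 for i in self.items}
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.items)               # explore
        return max(self.items, key=lambda i: self.rewards[i])  # exploit

    def update(self, item, liked):
        """Incorporate the user's feedback (like = 1, dislike = 0)."""
        self.counts[item] += 1
        r = 1.0 if liked else 0.0
        self.rewards[item] += (r - self.rewards[item]) / self.counts[item]
```

The pairwise variant mentioned above would instead select two items and update based on which of the two the user prefers.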
5.1.2 Recommend. The recommendation of items is the core task of any CRS. From a technical
perspective, we can find collaborative, content-based, knowledge-based, and hybrid approaches
in the literature. Differently from non-conversational systems, the majority of the analyzed CRS
approaches relies solely on short-term preference information. However, there are also
approaches that additionally consider long-term preferences of a user, e.g., to speed up the elicitation
process [82, 103, 125, 130, 139, 142, 154].
In the context of critiquing-based and knowledge-based systems, different strategies are applied
to filter and rank the items. For the filtering task, often constraint-based techniques [42] are applied
that remove items from the candidate set which do not (exactly) match the current user’s preferences.
The items that remain can then be sorted in different ways [169]. In the system proposed in [171],
for example, the user preference model is updated after a user critique by adjusting the weights of
the attributes that are involved in the critique. Then, Multi-Attribute Utility Theory (MAUT) [67]
was used to calculate the utility of each candidate item for generating top-K recommendations for
the user. An alternative ranking approach was applied in [130], where a history-guided critiquing
system was proposed that aims to retrieve recommendation candidates from other users’ critiquing
sessions that are similar to the one of the current user. In [39], a critiquing-based travel recommender
system was implemented that computes recommendations based on the relevance of item attributes
to user preferences, measured using the Euclidean distance.
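The MAUT-based ranking with critique-driven weight updates can be illustrated as follows; the attribute names, scores, and the boost factor are invented, and [171] uses its own weight-adjustment scheme.

```python
def maut_utility(item, weights):
    """Multi-Attribute Utility Theory score: weighted sum of normalized
    attribute values."""
    return sum(weights[a] * item[a] for a in weights)

def apply_critique(weights, attribute, boost=1.5):
    """After a critique on an attribute (e.g., 'cheaper'), increase its
    weight and renormalize; one simple variant of the weight-update idea."""
    updated = dict(weights)
    updated[attribute] *= boost
    total = sum(updated.values())
    return {a: w / total for a, w in updated.items()}

laptops = [
    {"name": "L1", "price_score": 0.9, "speed_score": 0.4},
    {"name": "L2", "price_score": 0.3, "speed_score": 0.9},
]
w = {"price_score": 0.5, "speed_score": 0.5}
w = apply_critique(w, "price_score")  # user asked for cheaper options
top = max(laptops, key=lambda it: maut_utility(it, {k: w[k] for k in w}))
top["name"]  # 'L1'
```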
Some works consider both long-term and short-term preferences of users when making recom-
mendations [4, 82, 123, 130]. The Adaptive Place Advisor system [142] represents an early example
of combining short-term and long-term preferences. Here, the user’s current query is expanded
by considering the probability distribution of the user’s past preference for item attributes, based
on her/his short-term constraints (within a conversation) and long-term constraints (over many
conversations). This expanded query was then used to retrieve and rank the items for recommen-
dation. In [130], the authors proposed to leverage the successful recommendation sessions in the
previous conversations to improve the efficiency of the current session (i.e., to shorten its length).
More recent works rely on machine learning models and background datasets for the recom-
mendation task. One common approach is to train a model on the traditional user-item interaction
matrix, e.g., based on probabilistic matrix factorization [29], and to then combine the user’s current
interactions with the trained user and item embeddings. In another approach [4], the authors rely
on a content-based method based on item features and the user profile in the cold-start stage, and
then switch to a Restricted Boltzmann Machine collaborative filtering method once a sufficient
number of preference signals is available. In [172], a hybrid multi-memory network with attention
mechanism was trained to find suitable recommendations based on item embeddings and the user’s
query embedding. Here, the item embedding was based on the item’s textual description, and
the user’s query embedding encoded the user’s initial request and the follow-up conversations
during the interaction. A hybrid model was also proposed in [139], which used Factorization
Machines to combine the dialogue state—represented with an LSTM-based belief tracker for each
item facet—user information, and item information to train the recommendation model. In the
video recommender system presented in [30], finally, an RNN-based model was built for making
recommendations, based on the topics selected by the users and their watching history.
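The common core of these embedding-based approaches, scoring items by the similarity between the embedding of the current query or conversation and pre-trained item embeddings, can be reduced to a few lines; the two-dimensional vectors below are toy values, not output of a trained model.

```python
def dot(u, v):
    """Inner product as a simple similarity measure between embeddings."""
    return sum(a * b for a, b in zip(u, v))

def score_items(query_embedding, item_embeddings):
    """Rank item ids by the similarity of their embedding to the
    embedding of the user's current query/conversation."""
    return sorted(item_embeddings,
                  key=lambda i: dot(query_embedding, item_embeddings[i]),
                  reverse=True)

q = [0.9, 0.1]  # e.g., encodes a request for an affordable phone
items = {"budget_phone": [1.0, 0.0], "flagship": [0.1, 1.0]}
score_items(q, items)  # ['budget_phone', 'flagship']
```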
In some cases, application-specific techniques were applied for the recommendation task. In
[167, 168], for example, the CRS features a visual dialogue component, where users can give
feedback based on the images, e.g., “I prefer blue color”. To implement this functionality, the system
proposed in [167] implemented a component that encoded item images and user feedback using a
convolutional neural network, and then combined these encodings as an input to both a response
encoder and a state tracker. Furthermore, various types of user behaviors (i.e., viewing, commenting,
clicking) on the visually represented recommendation were considered in a bandit approach to
balance exploration and exploitation.
5.1.3 Explain. The value of explanations in general recommender systems is widely recognized
[51, 106, 143]. Explanations can increase the system’s perceived transparency, user trust and
satisfaction, and they can help users make faster and better decisions [45]. However, according to
our survey, few papers so far have studied the explanation issue specific to CRS.
In the context of critiquing-based systems, [110] examined the trust-building nature of explana-
tions. In this work, an “organization-based” explanation approach was evaluated, where the system
showed multiple recommendation lists to the user, each of them labelled based on the critiquing-
based selection criteria, e.g., “cheaper but heavier”. A more recent interactive explanation approach
for a mobile critiquing-based recommender was proposed in [69], where the textual explanations
to be shown to the user were determined based on the user’s preferences and constructed from
pre-defined templates.
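A template-based explanation component of this kind can be sketched as follows; the templates, attribute names, and wording are invented for illustration and are not those used in [69].

```python
# One explanation clause template per attribute the user may critique.
TEMPLATES = {
    "price": "it is {value} cheaper than the current recommendation",
    "weight": "it weighs {value} less",
}

def explain(critiqued_attributes):
    """Compose an explanation sentence from pre-defined templates, one
    clause per (attribute, value) pair derived from the user's critiques."""
    clauses = [TEMPLATES[a].format(value=v) for a, v in critiqued_attributes]
    return "This item matches your feedback: " + " and ".join(clauses) + "."

explain([("price", "$50"), ("weight", "200 g")])
# 'This item matches your feedback: it is $50 cheaper than the current
#  recommendation and it weighs 200 g less.'
```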
Providing more information about a recommended item, e.g., in the form of pros and cons, is a
typical approach when providing explanations. Generating such item descriptions in a user-tailored
way in the context of CRS was proposed in [43] and [150]. In such approaches, the users’ feedback
during the conversation can influence which attributes are mentioned in the item descriptions shown
to the user in the recommendation phase. Furthermore, the user preferences can be considered to
order the arguments and to help determine which adjectives and adverbs to use in the explanation
[43].
In [101], two kinds of explanations were implemented in a CRS for movies. One was simply
based on the details of a given movie, whereas the other connects the given user preferences
with item features through a graph-based approach to create a personalized explanation. Another
graph-based approach following similar ideas was proposed in [97], where a knowledge-augmented
dialogue system for open-ended conversations was discussed. In this approach, relevant entities
and attributes in a dialogue context were retrieved by walking over a common fact knowledge
graph, and the walk path was used to create explanations for a given recommendation. In [109],
finally, a human-centered approach was employed. By analyzing a human-human dialogue dataset,
the authors identified different social strategies for explaining movie recommendations. They then
incorporated these social explanation strategies into a conversational recommender system to improve
the users’ perception of the quality of the system.
Natural Language Understanding. In NLP-based CRS, it is essential that the system understands
the users’ intents behind their utterances, as this is the basis for the selection of an appropriate
system action [118]. Two main tasks in this context are intent detection and named entity recognition,
and typical CRS architectures have corresponding components for these tasks. In principle, intent
detection can be seen as a classification task (dialogue act classification), where user utterances
are assigned to one or multiple intent categories [137]. Named entity recognition aims at identifying
entities in a given utterance and classifying them into pre-defined categories such as product names,
product attributes, and attribute values [164].
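A toy version of these two tasks, keyword-based intent detection and dictionary-based entity recognition, is shown below. Real systems use trained classifiers and sequence labellers, but the input/output contract is similar; the cue lists and entity dictionary are invented.

```python
# Cue phrases per intent and a small entity dictionary (both invented).
INTENT_CUES = {
    "recommend": ["recommend", "suggest", "how about"],
    "explain": ["why", "explain"],
}
KNOWN_ENTITIES = {"huawei": "brand", "cellphone": "category", "phone": "category"}

def parse(utterance):
    """Return (intent, entities) for an utterance; unknown intents fall
    back to 'unknown'."""
    text = utterance.lower()
    intent = next((i for i, cues in INTENT_CUES.items()
                   if any(c in text for c in cues)), "unknown")
    entities = {KNOWN_ENTITIES[t]: t for t in text.split() if t in KNOWN_ENTITIES}
    return intent, entities

parse("Recommend me a Huawei phone")
# ('recommend', {'brand': 'huawei', 'category': 'phone'})
```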
Although intent detection and named entity recognition have been extensively studied in general
dialogue systems [137], there are few studies specific to CRS according to our survey, possibly due
to the lack of a well-established taxonomy and large-scale annotated recommendation dialogue
data. In an early approach [142], manually-defined recognition grammars were used to map user
utterances to pre-defined dialogue situations, which is comparable to using pre-defined intents as
described above in the context of the Respond task. An example for a more recent approach can be
found in [164]. Here, a natural language understanding component for intent detection, product
category detection, and product attribute extraction was implemented in a dialogue system for
online shopping. For instance, from the utterance “recommend me a Huawei phone with 5.2 inch
screen” the system should derive the intent recommendation, the product category cellphone, as well
as the brand and the display size. To solve these tasks, the authors first collected product-related
questions from queries posted on a community site, and then extracted intent phrases (e.g., “want
to buy” and “how about”) by using two phrase-based algorithms. A multi-class classifier was trained
for intent detection of new user questions. As for product category detection, the authors employed
a CNN-based approach that took the detected intent into account to identify the category of a
mentioned product in a given utterance.
Neural networks were also used in other recent intent and entity recognition approaches [105,
146]. For example, a Multilayer Perceptron (MLP) was used to predict the probability distribution
on a set of pre-defined intent categories in [105]. A sequence-to-sequence model was used in [166]
to reframe the user’s query (e.g., “How to protect my iphone screen”) into keywords (e.g., “iphone
screen protector”) that are then used in the recommendation process to identify candidate items.
Another supporting task in some applications is sentiment analysis, see, e.g., [54, 75, 100, 105, 173].
One typical goal in the context of CRS is to understand a user’s opinion about a certain item. For
example, whenever an item—e.g., a movie—is mentioned in an utterance, the sentiment of the
sentence can be used to approximate the user’s feelings about the item. This sentiment can then be
considered as an item rating, which can subsequently be used for recommending other items using
established recommendation techniques.
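A minimal stand-in for such a sentiment-to-rating mapping is shown below; the word lists are invented, and real systems would use a trained sentiment classifier instead.

```python
POSITIVE = {"loved", "great", "liked", "awesome"}
NEGATIVE = {"hated", "boring", "disliked", "bad"}

def sentiment_rating(utterance):
    """Map the sentiment of an utterance about a mentioned item to a
    pseudo-rating: 1 = like, 0 = dislike, None = no signal."""
    tokens = set(utterance.lower().strip(".!?").split())
    if tokens & POSITIVE:
        return 1
    if tokens & NEGATIVE:
        return 0
    return None

sentiment_rating("I loved The Matrix")  # 1
```

The resulting pseudo-ratings can then be fed into any standard rating-based recommendation technique, as described above.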
5.3 Discussion
Our analysis shows that a wide range of technical approaches are used in the literature to support
the main computational tasks of a CRS. For the problem of computing recommendations, for
example, all sorts of approaches—collaborative, content-based, hybrid—can be used within CRS.
However, for the main task of explaining, we found that little CRS-specific research exists so far,
and only a smaller set of the proposed CRS in the literature support such a functionality.
(1) Effectiveness of Task Support: This category refers to the ability of the CRS to support its main
task, e.g., to help the users make a decision or find an item of interest.
(2) Efficiency of Task Support: In many cases, researchers are also interested in understanding how
quickly a user finds an item of interest or makes a decision.
(3) Quality of the Conversation and Usability Aspects: Analyses in this category focus on the quality
of the conversation itself and on the usability (ease-of-use) of the CRS as a whole.
(4) Effectiveness of Subtask: A number of studies investigated in our survey focus on certain subtasks
like intent or entity recognition.
In each of these dimensions, a number of different measurements are considered in the literature.
Task effectiveness, for example, can be measured both objectively (through accuracy measures,
acceptance or rejection rates) and subjectively (through surveys related to choice satisfaction or
perceived recommendation quality). Task efficiency is very often measured objectively through
the number of required interaction steps, where shorter dialogues are usually considered favorable.
The quality of the conversation is most often analyzed in terms of subjective assessments, e.g., with
respect to fluency, understandability, or the quality of the responses. Finally, specific measurements
for subtasks include intent recognition rates or the accuracy of the state recognition process.
From a methodological perspective, we found works that entirely relied on offline experiments,
works that relied exclusively on user studies, and studies that combined offline experiments
with user studies. Reports on fielded systems and A/B tests are rare. Examples of works that
discuss deployed systems include [20, 30, 32, 55, 60, 104, 114, 164]. However, the level of detail
provided for these tests is often limited, partially informal, or only considers certain aspects like
processing times. Finally, we also found works without any evaluation or where the evaluation
was mostly qualitative or anecdotal [4, 73, 160].
In the experimental evaluations, all sorts of materials—in particular prototype applications—and
datasets were used. As discussed in Section 4, at least an item database is needed. Depending
on the technical approach, additional types of knowledge and data are also used, such as logged
conversations between humans or explicit dialogue-related knowledge like the supported intents.
In Figure 5, we provide an overview of the most common evaluation dimensions and evaluation
approaches, and give examples for typical measurements and datasets. In the following sections,
we will discuss some of the more typical evaluation approaches in more detail.
Fig. 5. Overview of CRS evaluation: the four evaluation dimensions (effectiveness of task support; efficiency of task support; quality of the conversation and usability; effectiveness of subtask) with example measures for each (e.g., task completion, acceptance and rejection rates, hit rate, decision accuracy, reward, BLEU score, choice satisfaction; number of interactions, interaction time, perceived effort; fluency, response time, usability score (SUS), perceived quality of responses, user control; intent or entity recognition performance, accuracy of proposed critiques, state recognition accuracy), the evaluation approaches (offline experiment, simulation, user study, mixed methods, no or only anecdotal evaluation), and typical data sources (item and rating databases, logged dialogs, background knowledge).
profiles and report the average precision of the recommendations after each question-answering
round. Similarly, a user simulator was used for the evaluation of a dialogue-based facet-filling
recommender system based on deep reinforcement learning and end-to-end memory networks in
[146] and [139]. The simulator in [146] was based on real user utterances extracted from a dataset
about restaurant reservations [63]. The objective measures included the recommendation accuracy
(median of ranking and success rate), as well as the proportion of the simulated users who accepted
the recommendations. In [139], the “online” experiments were based on a dataset collected through
crowdworkers and the objective measures included Average Reward of the reinforcement learning
strategy and the Success Rate (conversion rate), i.e., the fraction of successful dialogues.
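As an illustration, the two objective measures just mentioned can be computed from simulated dialogues as follows. The Python sketch below is not the implementation used in [139] or [146]; the dialogue records and their fields are invented for illustration.

```python
# Sketch (illustrative only): Success Rate (fraction of dialogues in which the
# simulated user accepted a recommendation) and Average Reward of a
# reinforcement learning strategy, computed over a list of dialogue records.

def success_rate(dialogues):
    """Fraction of dialogues that ended with an accepted recommendation."""
    return sum(1 for d in dialogues if d["accepted"]) / len(dialogues)

def average_reward(dialogues):
    """Mean of the total reward accumulated per dialogue."""
    return sum(sum(d["rewards"]) for d in dialogues) / len(dialogues)

# Invented dialogue records: an acceptance flag and per-turn rewards.
dialogues = [
    {"accepted": True,  "rewards": [0.0, 0.2, 1.0]},   # successful dialogue
    {"accepted": False, "rewards": [0.0, -0.1, -0.1]}, # user quit
    {"accepted": True,  "rewards": [0.5, 1.0]},
]
print(success_rate(dialogues))   # 2 of 3 dialogues were successful
print(average_reward(dialogues))
```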
The authors of [101] present a domain-independent CRS framework, and they use the Hit Rate to
assess the effectiveness of different system components such as the recommendation algorithm or
the intent recognizer. To make the measurements, they use the above-mentioned bAbI dataset as a
ground truth, where each example contains the user preferences, the recommendation request and
the recommended item. A similar evaluation approach based on ground truth information derived
from different real-world dialogues and accuracy measures (RMSE, Recall, Hit Rate) was adopted in
[27, 64, 75]. In such approaches, the system typically analyzes (positive) mentions of items (movies)
in the ongoing natural language dialogue and uses these preferences for the prediction task.
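The Hit Rate used in such ground-truth-based evaluations can be sketched as follows (illustrative Python with invented item IDs and ranked lists; this is not the code of [101]):

```python
# Sketch: Hit Rate over test dialogues, where each case carries a ranked
# recommendation list and the ground-truth item from the dialogue corpus.

def hit_rate(test_cases, k=10):
    """Fraction of cases whose ground-truth item appears in the top-k list."""
    hits = sum(1 for ranked, target in test_cases if target in ranked[:k])
    return hits / len(test_cases)

# Invented (ranked_list, ground_truth_item) pairs.
test_cases = [
    (["m3", "m7", "m1"], "m7"),  # hit
    (["m2", "m5", "m9"], "m4"),  # miss
    (["m8", "m4", "m6"], "m8"),  # hit
]
print(hit_rate(test_cases, k=3))  # 2 of 3 cases are hits
```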
The focus of [18] was on implicit feedback in CRS, where this feedback was obtained from
non-verbal communication acts. To assess the effectiveness of using such signals, the accuracy of
rating predictions by a content-based recommender was evaluated using MAE and RMSE. In their
approach, the ground truth for the evaluation was previously collected in a user study. In some
ways, this approach is similar to [101] in that the effects of the performance of a side task—here,
the interpretation of non-verbal communication acts—on the system’s overall recommendation
quality are investigated.
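The error measures mentioned above can be computed as follows (a minimal sketch with made-up rating values, not the evaluation code of [18]):

```python
import math

# Sketch: MAE and RMSE between predicted and ground-truth ratings, as used
# to assess the accuracy of rating predictions.

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error (penalizes large errors more strongly)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Invented ground-truth and predicted ratings on a 1-5 scale.
truth = [4.0, 2.0, 5.0, 3.0]
preds = [3.5, 2.5, 4.0, 3.0]
print(mae(truth, preds))   # 0.5
print(rmse(truth, preds))
```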
105:26 Dietmar Jannach et al.
Given the possible limitations of pure offline experiments in the context of CRS, user studies
are also frequently applied to gauge the effectiveness of a system. In the context of a critiquing-
based system [23, 24], for example, decision accuracy was objectively measured by the fraction of
users who changed their mind when they were presented with all available options after they had
previously made their selection with the help of the CRS. In [86], in contrast, the authors used task
completion rates and add-to-cart actions as proxies, which measure how often users had at least one
item in their cart and how many items they added on average, respectively.
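Such proxy measures can be sketched as follows (illustrative Python; the session logs are invented and this is not the setup of [86]):

```python
# Sketch: task completion rate (fraction of sessions with at least one item
# in the cart) and average number of add-to-cart actions per session.

def task_completion_rate(sessions):
    """Fraction of sessions in which the user put at least one item in the cart."""
    return sum(1 for cart in sessions if len(cart) > 0) / len(sessions)

def avg_add_to_cart(sessions):
    """Average number of items added to the cart per session."""
    return sum(len(cart) for cart in sessions) / len(sessions)

# Invented session logs: each session is the list of items added to the cart.
sessions = [["itemA", "itemB"], [], ["itemC"]]
print(task_completion_rate(sessions))  # 2 of 3 sessions completed the task
print(avg_add_to_cart(sessions))       # 1.0 item added on average
```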
Subjective Measures. In contrast to objective measures, which, e.g., record the user’s decision
behavior when interacting with the system or determine prediction accuracy using simulations,
subjective measures assess the user’s quality perception of the system. Such measurements can be
important because even common accuracy measures are often not predictive of the quality of the
recommendations as perceived by the users.6 In the reviewed literature on CRS, various quality
factors were examined that are also commonly used for non-conversational recommenders, e.g.,
those discussed in the evaluation frameworks in [68] and [113].
For the critiquing-based systems discussed in [23, 24], the authors therefore not only used decision
accuracy (as an objective measure) but also assessed different factors such as decision confidence
and purchase and return intentions. User satisfaction, either with the system’s recommendations
or the system as a whole, was additionally investigated in earlier critiquing approaches such as
[112, 125] and in other comparative evaluations [101, 165]. The perceived recommendation quality
was assessed in the speech-controlled critiquing system in [47], and in [152] the authors looked at
user acceptance rates. In [62, 81] and [109], finally, the authors considered several dimensions in
their questionnaire like the match of the recommendations with the preferences (interest fit), the
confidence that the recommendations will be liked, and trust.
6.2.2 Efficiency of Task Support. Traditionally, critiquing-based CRS approaches in particular
are often evaluated in terms of the efficiency of the recommendation process. Specifically, one goal
of generating dynamic critiques is to minimize the number of required interactions until the user
finds the needed item or accepts the recommendation. Such evaluations are often done offline with
simulated user profiles. One assumption, also in approaches that are not based on critiquing, is
that the simulated users act rationally and consistently, i.e., they will not revise their preferences
during the process.
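A minimal sketch of such a simulation-based efficiency evaluation is shown below. The binary-search-like critiquing loop, the single price attribute, and the item values are all invented for illustration; actual critiquing systems operate on multiple item features.

```python
# Sketch: offline efficiency evaluation with a rational, consistent simulated
# user. The user has a fixed target price, critiques each recommendation with
# "cheaper" or "more expensive", and never revises the preference. The measure
# of interest is the number of interaction cycles until the target is found.

def simulate_session(item_prices, target_price):
    """Return the number of critiquing cycles until the target item is shown."""
    candidates = sorted(item_prices)
    cycles = 0
    while candidates:
        cycles += 1
        rec = candidates[len(candidates) // 2]   # recommend a median-priced item
        if rec == target_price:                  # rational user accepts the target
            return cycles
        if rec > target_price:                   # critique: "cheaper"
            candidates = [p for p in candidates if p < rec]
        else:                                    # critique: "more expensive"
            candidates = [p for p in candidates if p > rec]
    return cycles  # target not reachable in the catalog

prices = [100, 200, 300, 400, 500, 600, 700]
print(simulate_session(prices, 400))  # the median item is accepted immediately
print(simulate_session(prices, 100))  # an extreme target needs more cycles
```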
Examples of works that measure interaction cycles in critiquing approaches include [47, 86, 89,
91, 122, 125, 147, 171]. The number of required interaction stages was also one of usually multiple
evaluation criteria for chatbot-like applications, e.g., [53, 62, 101, 154], and a shopping decision-aid
in [152]. In the context of learning-based systems, the number of dialogue turns in a two-stage
interaction model was measured in [139]. Such measures are, however, rather uncommon
for natural language, learning-based dialogue systems.
Besides the number of interaction stages, task completion time is sometimes used as an alternative
or complementary way of objectively measuring efficiency, e.g., in [62, 86]. In [54], the authors,
among other aspects, compared the efficiency of different interaction modes with a chatbot: NLP-
based, button-based, and mixed. They measured the number of questions, the interaction time
and the time per question in the dialogue. A main outcome of their work was that pure natural
language interfaces led to less efficient recommendation sessions, in part due to problems
with correctly interpreting the natural language utterances.
In the mentioned papers, shorter interaction or task completion times are generally considered
favorable. Note, however, that in some cases longer sessions are desirable. In particular, longer
6 See, e.g., [10, 35, 44, 87, 128].
interaction times might reflect higher user engagement and, as in [62], correspond to a larger
number of listened songs in a music application. In [28], the authors compared a voice-based and
visual output system and measured the number of options that were explored by the users. In
this context, note that the exploration of more items can, depending on the application, both be a
sign that the user found more interesting options to inspect and a sign that the user did not find
something immediately and had to explore more options. In [165], the effects of using a voice
interface for a podcast recommender were analyzed. The results showed that users were slower,
explored fewer options, and chose fewer long-tail items, which can be detrimental for discovery.
In some works, finally, subjective measures regarding the efficiency of the process are used,
typically as a part of usability assessments. In [23, 81, 109, 152] and [86], the authors asked the
study participants about their perceived cognitive effort.
6.2.3 Quality of the Conversation and Usability Aspects. In a number of works, the focus of the
evaluation is put on certain aspects of the dialogue quality and on usability aspects regarding the
system as a whole. The general ease-of-use of the system was, for example, examined in [47, 62, 112,
122]; the more specific concept of task ease was part of the user questionnaire in [154].
Regarding quality aspects of the conversation itself, various aspects are investigated in the
literature. From the perspective of the conversation initiative, the authors of [81] and [109] measured
the perceived level of user control. Whether or not the desire for control is dependent on personal
characteristics was investigated in [62]. In addition to user control, perceived transparency was
considered as a quality factor in [109]. A common way to establish transparency is through the
use of explanations. Questions of how to design explanations for a recommender chatbot were
investigated in [108]. The quality factors used in [154] were based on an early framework for
evaluating spoken dialogue systems in [78]. They, for example, include adaptation (i.e., how fast
the system adapts to the user’s preferences), expected behavior (i.e., how intuitive and natural the
dialogue interaction is), or the entertainment value. Furthermore, in [109] coordination, mutual
attentiveness, positivity, and rapport were considered as additional desired factors of a conversation.
Looking closer at the content and linguistic level of the dialogues, many recent proposals based
on natural language rely on the BLEU [107] score to assess the system’s responses, e.g., [64, 75–
77, 105]. With the help of this score, which was developed in the context of machine translation, one
can compare the responses generated by the system with ground-truth responses from real human
conversations in an automated way. As an alternative, the NIST score can be used, e.g., in [105].
Additional objective linguistic aspects that are measured in the literature include lexical diversity
[46], perplexity (corresponding to fluency), and distinct n-grams (to assess diversity) [27]. In addition
to these objective linguistic measures, researchers sometimes consider subjective assessments of
the quality of the system responses in their evaluations, e.g., with respect to fluency, appropriateness,
consistency, engagingness, relevance, informativeness, and the overall dialogue quality and generation
performance [27, 46, 64, 76, 77, 105, 154].
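To make these automatic metrics concrete, the following sketch implements a simplified sentence-level BLEU (single reference, n-grams up to length two) and a distinct-n diversity measure; real evaluations rely on full toolkit implementations, and the example sentences are invented.

```python
import math
from collections import Counter

# Sketch: simplified sentence-level BLEU (clipped n-gram precision with a
# brevity penalty) and distinct-n (ratio of unique n-grams, a diversity proxy).

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

def distinct_n(utterances, n=1):
    """Ratio of unique n-grams to all n-grams across generated utterances."""
    grams = [g for u in utterances for g in ngrams(u, n)]
    return len(set(grams)) / max(len(grams), 1)

ref = "i can recommend a good sci fi movie".split()  # ground-truth response
hyp = "i can recommend a sci fi movie".split()       # generated response
print(bleu(ref, hyp))
print(distinct_n([hyp, ref], n=2))
```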
6.2.4 Effectiveness of Subtask. In some works, finally, researchers focus on the evaluation of the
performance of certain subtasks. Again, such measurements can be either objective or subjective.
As objective measurements, the reward is often computed in approaches that rely on reinforcement
learning [86]. In a critiquing system, the number of times a proposed critique was applied was
investigated in [122]. In NLP-based systems, in contrast, researchers often evaluate the performance
of the entity and intent recognition modules [77, 101]. In the particular multi-modal CRS in [105],
Recall was used for assessing the image selection performance. In terms of subjective measures,
the interpretation performance, i.e., how well the system understands the input, was, for
example, considered in [154].
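A subtask evaluation of this kind can be as simple as computing the accuracy of an intent recognizer on labeled utterances (a sketch with hypothetical intent labels, not tied to any particular system):

```python
# Sketch: intent recognition accuracy on a labeled evaluation set.

def intent_accuracy(gold_labels, predicted_labels):
    """Fraction of utterances whose intent was recognized correctly."""
    correct = sum(1 for g, p in zip(gold_labels, predicted_labels) if g == p)
    return correct / len(gold_labels)

# Invented gold and predicted intent labels for four utterances.
gold = ["ask_recommendation", "give_feedback", "ask_explanation", "give_feedback"]
pred = ["ask_recommendation", "give_feedback", "give_feedback", "give_feedback"]
print(intent_accuracy(gold, pred))  # 3 of 4 intents recognized correctly
```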
6.3 Discussion
Our review shows that a wide range of different evaluation methodologies and metrics are used to
evaluate CRS. In principle, general user-centric evaluation frameworks for recommender systems
as proposed in [68] and [113] can be applied for CRS as well. So far, however, while user-centric
evaluation is common, these frameworks are not widely used, and no standards or extensions
to them have been proposed in the literature. In terms of objective measurements, typical accuracy
measures are used by several researchers. Still, the individual CRS proposals in the literature
are quite diverse, e.g., in terms of the application domain, interaction strategy, and background
knowledge, and a comparison between existing systems remains challenging.
In NLP-based systems, the BLEU score is widely used for automatic evaluation. However, accord-
ing to [79], the BLEU score, at least at the sentence level, can correlate poorly with user perceptions,
see also [46]. In general, the evaluation of language models is often considered difficult [88] and
task-oriented systems like CRS might be even more challenging to assess. These observations
therefore suggest that BLEU scores alone cannot inform us well about the quality of the generated
system utterances and that in addition subjective evaluations should be applied.
Researchers therefore often resort to offline experiments with simulated users or user stud-
ies, where study participants have to accomplish a certain task. In offline studies, often a target
(preferred) item is randomly selected, and then a rationally-behaving user is simulated, which
interacts with the CRS by answering questions about preferences or by providing feedback on
explanations. Such a design, however, assumes that users have fixed a priori preferences towards
items or item features. In reality, users may also construct or change their preferences during
the conversation as they learn about the space of options. Therefore, it is not always fully
clear to what extent such simulations reflect real-world situations. In user studies, in contrast, often
realistic decision situations are explored and participants have to accomplish tasks like selecting a
product in a shop or finding musical tracks for a birthday party. While such studies to some extent
remain artificial as usually no real purchase is made, such evaluations seem more realistic than the
offline experiments described above. In general, relying solely on offline experimentation seems
too limited, except for certain subtasks, given that any CRS is a system that has to support complex
user interactions.
Finally, more research seems needed to understand (i) how humans make recommendations
to each other in a conversation, and (ii) how users interact with intelligent assistants,
e.g., what kind of intelligence they attribute to them and what their expectations are. Some aspects
related to these questions are discussed, e.g., in [29, 65, 108, 165]. With respect to how humans
talk with each other, some analyses were done in [13] and [29]. In [13], the authors based their
research on insights from the field of Conversation Analysis and correspondingly implemented
typical real-world conversation patterns, albeit in a somewhat restricted form, in their technical
proposal. In general, more work also needs to be done to understand the effects on the quality
perception of a system when certain communication patterns, like explanations for a system
recommendation, are not supported, as is the case for many investigated systems.
7 OUTLOOK
Our study reveals a strong increase of interest in the area of CRS in the past few years, where the most recent
approaches rely on machine learning techniques, in particular deep learning, and natural language
based interactions. Despite these advances, a number of research questions remain open, as outlined
in the discussion sections throughout the paper. In this final section, we briefly discuss four more
general research directions.
One first question is “Which interaction modality supports the user best in a given task?”.
While voice and written natural language have become more popular recently, more research
is required to understand which modality is suited for a given task and situation at hand or if
alternative modalities should be offered to the user. An interesting direction of research also lies in
the interpretation of non-verbal communication acts by users. Furthermore, entirely voice-based
CRS have limitations when it comes to presenting an entire set of recommendations in one interaction
cycle. In such a setting, a summarization of the set of recommendations might be needed, as it is
in most cases not practical for the CRS to read out several options to the user.
Second, we ask: “What are challenges and requirements in non-standard application environ-
ments?” Today, most existing research focuses on interactive web or mobile applications, either with
forms and buttons or with natural language input in chatbot applications. Some of the discussed
works go beyond such scenarios and consider alternative environments where CRS can be used, e.g.,
within physical stores, in cars, on kiosk solutions, or as a feature of (humanoid) robots. However,
little is known so far about the specific requirements, challenges, and opportunities that come
with such application scenarios and regarding the critical factors that determine the adoption and
value of such systems. Regarding the usage scenarios, most research works discussed in our survey
focus on one-to-one communication. However, there are additional scenarios that have not been
explored much yet, for example, where the CRS supports group decision processes [1, 103].
A third question is “What can we learn from theories of conversation?”, see also [141]. Regarding
the underpinnings and adoption factors of CRS, only very few works are based on concepts and
insights from Conversation Analysis, Communication Theory or related fields. In some works,
at least certain communication patterns in real-world recommendation dialogues were discussed
at a qualitative or anecdotal level. What seems to be mostly missing so far, however, is a clearer
understanding of what makes a CRS truly helpful, what users expect from such a system, what
makes them fail [95], and which intents we should or must support in a system. Explanations are
often considered a key feature of a convincing dialogue, but these aspects have not been explored
much so far. In addition, more research is required to understand the mechanisms that increase the
adoption of CRS, e.g., by increasing the user’s trust and developing intimacy [72], or by adapting
the communication style (e.g., with respect to the initiative and language) to the individual user.
Finally, from a technical and methodological perspective, we ask: “How far do we get with pure
end-to-end learning approaches?”, i.e., by creating systems where, besides the item database, only a
corpus of past conversations serves as input. Tremendous advances were made in NLP technology
in recent years, but it remains an open question whether today’s learning-based CRS are actually useful, see [59].
In part, the problem of assessing this aspect is tied to how we evaluate such systems. Computational
metrics like BLEU can only answer certain aspects of the question. Moreover, the human evaluations
in the reviewed papers are sometimes not very insightful, in particular when a newly proposed
system is evaluated relative to a previous system by only a few human judges. We should therefore revisit
our evaluation practice and also investigate what users actually expect from a CRS, how tolerant
they are with respect to misunderstandings or poor recommendations, how we can influence
these expectations, and how useful the systems are considered on an absolute scale. Technically,
combining learning techniques with other sorts of structured knowledge seems to be key to more
usable, reliable and also predictable conversational recommender systems in the future.
REFERENCES
[1] J. O. Álvarez Márquez and J. Ziegler. Hootle+: A group recommender system supporting preference negotiation. In
Collaboration and Technology, pages 151–166, 2016.
[2] E. André and C. Pelachaud. Interacting with embodied conversational agents. In Speech Technology: Theory and
Applications, pages 123–149. Springer US, 2010.
[3] P. Angara, M. Jiménez, K. Agarwal, H. Jain, R. Jain, U. Stege, S. Ganti, H. A. Müller, and J. W. Ng. Foodie Fooderson: A
Conversational Agent for the Smart Kitchen. In CASCON ’17, page 247–253, 2017.
[4] A. Argal, S. Gupta, A. Modi, P. Pandey, S. Shim, and C. Choo. Intelligent travel chatbot for predictive recommendation
in Echo platform. In CCWC’18, pages 176–183, 2018.
[5] D. Arteaga, J. Arenas, F. Paz, M. Tupia, and M. Bruzza. Design of information system architecture for the recommen-
dation of tourist sites in the city of Manta, Ecuador through a chatbot. In CISTI ’19, pages 1–6, 2019.
[6] Z. Ashktorab, M. Jain, Q. V. Liao, and J. D. Weisz. Resilient chatbots: Repair strategy preferences for conversational
breakdowns. In CHI’19, page 254, 2019.
[7] O. Averjanova, F. Ricci, and Q. N. Nguyen. Map-based interaction with a conversational mobile recommender system.
In UBICOMM ’08, pages 212–218, 2008.
[8] R. Bader, O. Siegmund, and W. Woerndl. A study on user acceptance of proactive In-Vehicle recommender systems.
In AutomotiveUI ’11, page 47–54, 2011.
[9] T. Becker, N. Blaylock, C. Gerstenberger, I. Kruijff-Korbayová, A. Korthauer, M. Pinkal, M. Pitz, P. Poller, and J. Schehl.
Natural and intuitive multimodal dialogue for In-Car applications: The SAMMIE System. In ECAI ’06, page 612–616,
2006.
[10] J. Beel and S. Langer. A comparison of offline evaluations, online evaluations, and user studies in the context of
research-paper recommender systems. In TPDL ’15, pages 153–168, 2015.
[11] H. Blanco and F. Ricci. Acquiring user profiles from implicit feedback in a conversational recommender system. In
RecSys ’13, page 307–310, 2013.
[12] D. G. Bobrow, R. M. Kaplan, M. Kay, D. A. Norman, H. S. Thompson, and T. Winograd. GUS, A Frame-Driven Dialog
System. Artificial Intelligence, 8:155–173, 1977.
[13] D. G. Bridge. Towards conversational recommender systems: A dialogue grammar approach. In ECCBR ’02, pages
9–22, September 2002.
[14] R. Burke. The Wasabi personal shopper: A case-based recommender system. In AAAI ’99, pages 844–849, 1999.
[15] R. D. Burke, K. J. Hammond, and B. C. Young. The FindMe approach to assisted browsing. IEEE Expert, 12(4):32–40,
1997.
[16] W. Cai and L. Chen. Towards a taxonomy of user feedback intents for conversational recommendations. In RecSys’ 19
Late-Breaking Results, pages 572–573, 2019.
[17] G. Carenini, J. Smith, and D. Poole. Towards more conversational and collaborative recommender systems. In IUI ’03,
pages 12–18, 2003.
[18] B. D. Carolis, M. de Gemmis, P. Lops, and G. Palestra. Recognizing users feedback from non-verbal communicative
acts in conversational recommender systems. Pattern Recognition Letters, 99:87–95, 2017.
[19] J. Cassell. Embodied conversational agents: Representation and intelligence in user interfaces. AI Magazine, 22(4):
67–83, 2001.
[20] S. R. Chakraborty, M. Anagha, K. Vats, K. Baradia, T. Khan, S. Sarkar, and S. Roychowdhury. Recommendence and
fashionsence: Online fashion advisor for offline experience. In CoDS-COMAD ’19, 2019.
[21] A. A. Chandrashekara, R. K. M. Talluri, S. S. Sivarathri, R. Mitra, P. Calyam, K. Kee, and S. Nair. Fuzzy-based
conversational recommender for data-intensive science gateway applications. In BigData ’18, pages 4870–4875, 2018.
[22] F. Chen, I.-M. Jonsson, J. Villing, and S. Larsson. Application of speech technology in vehicles. In Speech Technology:
Theory and Applications, pages 195–219. Springer, 2010.
[23] L. Chen and P. Pu. Evaluating critiquing-based recommender agents. In AAAI ’06, pages 157–162, July 2006.
[24] L. Chen and P. Pu. Hybrid critiquing-based recommender systems. In IUI ’07, page 22–31, 2007.
[25] L. Chen and P. Pu. Preference-based organization interfaces: Aiding user critiques in recommender systems. In
International Conference on User Modeling, pages 77–86, 2007.
[26] L. Chen and P. Pu. Critiquing-based recommenders: survey and emerging trends. User Modeling and User-Adapted
Interaction, 22(1-2):125–150, 2012.
[27] Q. Chen, J. Lin, Y. Zhang, M. Ding, Y. Cen, H. Yang, and J. Tang. Towards knowledge-based recommender dialog
system. In EMNLP-IJCNLP ’19, pages 1803–1813, 2019.
[28] W. Chen, H. Huang, and S. T. Chou. Understanding consumer recommendation behavior in a mobile phone service
context. In ECIS ’08, pages 1022–1033, 2008.
[29] K. Christakopoulou, F. Radlinski, and K. Hofmann. Towards conversational recommender systems. In KDD ’16, pages
815–824, 2016.
[30] K. Christakopoulou, A. Beutel, R. Li, S. Jain, and E. H. Chi. Q&R: A two-stage approach toward interactive recommen-
dation. In KDD ’18, pages 139–148, 2018.
[31] F. Clarizia, F. Colace, M. Lombardi, and F. Pascale. A context aware recommender system for digital storytelling. In
AINA ’18, pages 542–549, 2018.
[32] F. Colace, M. De Santo, F. Pascale, S. Lemma, and M. Lombardi. BotWheels: A petri net based chatbot for recommending
tires. In DATA ’17, pages 350–358, 2017.
[33] D. Contreras, M. Salamó, I. Rodríguez, and A. Puig. A 3D visual interface for critiquing-based recommenders:
Architecture and interaction. IJIMAI, 3:7–15, 2015.
[34] D. Contreras, M. Salamo, I. Rodriguez, and A. Puig. Shopping decisions made in a virtual world: Defining a state-based
model of collaborative and conversational user-recommender interactions. IEEE Consumer Electronics Magazine, 7(4):
260–35, 2018.
[35] P. Cremonesi, F. Garzotto, and R. Turrin. Investigating the persuasion potential of recommender systems from a
quality perspective: An empirical study. Transactions on Interactive Intelligent Systems, 2(2):1–41, 2012.
[36] J. Dalton, V. Ajayi, and R. Main. Vote Goat: Conversational movie recommendation. In SIGIR ’18, pages 1285–1288,
2018.
[37] B. De Carolis, M. de Gemmis, and P. Lops. A multimodal framework for recognizing emotional feedback in conversa-
tional recommender systems. In EMPIRE Workshop at ACM RecSys, page 11–18, 2015.
[38] D. M. Dehn and S. van Mulken. The impact of animated interface agents: A review of empirical research. International
Journal of Human-Computer Studies, 52(1):1–22, 2000.
[39] L. W. Dietz, S. Myftija, and W. Wörndl. Designing a conversational travel recommender system based on data-driven
destination characterization. In ACM RecSys Workshop on Recommenders in Tourism, pages 17–21, 2019.
[40] J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. H. Miller, A. Szlam, and J. Weston. Evaluating prerequisite
qualities for learning end-to-end dialog systems. In ICLR’16, 2016.
[41] A. Felfernig, G. Friedrich, D. Jannach, and M. Zanker. An integrated environment for the development of knowledge-
based recommender applications. International Journal of Electronic Commerce, 11(2):11–34, 2006.
[42] A. Felfernig, G. Friedrich, D. Jannach, and M. Zanker. Constraint-based recommender systems. In Recommender
Systems Handbook, volume 1, pages 161–190. Springer, 2015.
[43] M. E. Foster and J. Oberlander. User preferences can drive facial expressions: Evaluating an embodied conversational
agent in a recommender dialogue system. User Modeling and User-Adapted Interaction, 20(4):341–381, 2010.
[44] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news
recommender systems at swissinfo.ch. In RecSys ’14, 2014.
[45] F. Gedikli, D. Jannach, and M. Ge. How should I explain? A comparison of different explanation types for recommender
systems. International Journal of Human-Computer Studies, 72(4):367–382, 2014.
[46] M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley. A Knowledge-Grounded neural
conversation model. In AAAI’18, pages 5110–5117, 2018.
[47] P. Grasch, A. Felfernig, and F. Reinfrank. ReComment: Towards critiquing-based recommendation with speech
interaction. In RecSys ’13, pages 157–164, 2013.
[48] C. Greco, A. Suglia, P. Basile, and G. Semeraro. Converse-Et-Impera: Exploiting deep learning and hierarchical
reinforcement learning for conversational recommender systems. In AI*IA 2017, pages 372–386, 2017.
[49] K. J. Hammond, R. Burke, and K. Schmitt. Case-based approach to knowledge navigation. In AAAI ’94, 1994.
[50] N. Hariri, B. Mobasher, and R. Burke. Context adaptation in interactive recommender systems. In RecSys ’14, pages
41–48, 2014.
[51] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In CSCW ’00, pages
241–250, 2000.
[52] Z.-W. Hong, R.-T. Huang, K.-Y. Chin, C.-C. Yen, and J.-M. Lin. An interactive agent system for supporting knowledge-
based recommendation: A case study on an e-Novel recommender system. In ICUIMC’10, pages 53:1–53:8, 2010.
[53] Y. Ikemoto, V. Asawavetvutt, K. Kuwabara, and H.-H. Huang. Tuning a conversation strategy for interactive
recommendations in a chatbot setting. Journal of Information and Telecommunication, 3(2):180–195, 2019.
[54] A. Iovine, F. Narducci, and G. Semeraro. Conversational recommender systems and natural language: A study through
the ConveRSE framework. Decision Support Systems, 131:113250–113260, 2020.
[55] D. Jannach. ADVISOR SUITE – A knowledge-based sales advisory system. In ECAI ’04, pages 720–724, 2004.
[56] D. Jannach. Finding preferred query relaxations in content-based recommenders. In IS ’06, pages 355–360, 2006.
[57] D. Jannach and M. Jugovac. Measuring the business value of recommender systems. ACM TMIS, 10(4):1–23, 2019.
[58] D. Jannach and G. Kreutler. Rapid development of knowledge-based conversational recommender applications with
Advisor Suite. Journal of Web Engineering, 6(2):165–192, June 2007.
[59] D. Jannach and A. Manzoor. End-to-end learning for conversational recommendation: A long way to go? In IntRS
Workshop at ACM RecSys 2020, Online, 2020.
[60] D. Jannach, M. Zanker, M. Jessenitschnig, and O. Seidler. Developing a conversational travel advisor with ADVISOR
SUITE. In ENTER ’07, pages 43–52, 2007.
[61] D. Jannach, S. Naveed, and M. Jugovac. User control in recommender systems: Overview and interaction challenges.
In EC-Web ’16, 2016.
[62] Y. Jin, W. Cai, L. Chen, N. N. Htun, and K. Verbert. MusicBot: Evaluating critiquing-based music recommenders with
conversational interaction. In CIKM ’19, pages 951–960, 2019.
[63] C. K. Joshi, F. Mi, and B. Faltings. Personalization in goal-oriented dialog. In NeurIPS ’17 Workshop on Conversational
AI, 2017.
[64] D. Kang, A. Balakrishnan, P. Shah, P. Crook, Y.-L. Boureau, and J. Weston. Recommendation as a communication
game: Self-supervised bot-play for goal-oriented dialogue. In EMNLP-IJCNLP ’19, pages 1951–1961, 2019.
[65] J. Kang, K. Condiff, S. Chang, J. A. Konstan, L. Terveen, and F. M. Harper. Understanding how people use natural
language to ask for recommendations. In RecSys ’17, pages 229–237, 2017.
[66] E. Kapetanios, D. Tatar, and C. Sacarea. Natural Language Processing: Semantic Aspects. CRC Press, 2013.
[67] R. L. Keeney and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Trade-Offs. Cambridge UP, 1993.
[68] B. Knijnenburg, M. Willemsen, Z. Gantner, H. Soncu, and C. Newell. Explaining the user experience of recommender
systems. User Modeling and User-Adapted Interaction, 22(4):441–504, 2012.
[69] B. Lamche, U. Adigüzel, and W. Wörndl. Interactive explanations in mobile shopping recommender systems. In
RecSys ’14 IntRS Workshop, pages 14–21, 2014.
[70] H. Lee, Y. Ahn, H. Lee, S. Ha, and S.-g. Lee. Quote recommendation in dialogue using deep neural network. In SIGIR
’16, pages 957–960, 2016.
[71] M. K. Lee, S. Kiesler, J. Forlizzi, S. Srinivasa, and P. Rybski. Gracefully mitigating breakdowns in robotic services. In
HRI ’10, pages 203–210, 2010.
[72] S. Lee and J. Choi. Enhancing user experience with conversational agent for movie recommendation: Effects of
self-disclosure and reciprocity. International Journal of Human-Computer Studies, 103:95–105, 2017.
[73] S. Lee, R. J. Moore, G. Ren, R. Arar, and S. Jiang. Making personalized recommendation through conversation:
Architecture design and recommendation methods. In AAAI ’18, pages 727–730, 2018.
[74] Y. Lee, J.-e. Bae, S. Kwak, and M. Kim. The effect of politeness strategy on human-robot collaborative interaction on
malfunction of robot vacuum cleaner. In RSS Workshop on HRI, 2011.
[75] R. Li, S. E. Kahou, H. Schulz, V. Michalski, L. Charlin, and C. Pal. Towards deep conversational recommendations. In
NIPS ’18, pages 9725–9735, 2018.
[76] X. Li, Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz. End-to-end task-completion neural dialogue systems. In IJCNLP
’17, 2017.
[77] L. Liao, R. Takanobu, Y. Ma, X. Yang, M. Huang, and T.-S. Chua. Deep conversational recommender in travel. ArXiv,
abs/1907.00710, 2019.
[78] D. J. Litman and S. Pan. Empirically evaluating an adaptable spoken dialogue system. In UM ’99, pages 55–64, 1999.
[79] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau. How NOT to evaluate your dialogue system:
An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP ’16, pages
2122–2132, 2016.
[80] J. Liu, S. Seneff, and V. Zue. Dialogue-oriented review summary generation for spoken dialogue recommendation
systems. In ACL ’10, pages 64–72, 2010.
[81] B. Loepp, T. Hussein, and J. Ziegler. Choice-based preference elicitation for collaborative filtering recommender
systems. In CHI ’14, pages 3085–3094, 2014.
[82] S. Loh, D. Lichtnow, A. J. C. Kampff, and J. P. M. de Oliveira. Recommendation of complementary material during
chat discussions. Knowledge Management & E-Learning, 2(4), 2010.
[83] J. Luettin, S. Rothermel, and M. Andrew. Future of In-Vehicle recommendation systems @ Bosch. In RecSys ’19, page
524, 2019.
[84] K. Luo, S. Sanner, G. Wu, H. Li, and H. Yang. Latent linear critiquing for conversational recommender systems. In
WWW ’20, pages 2535–2541, 2020.
[85] T. Mahmood and F. Ricci. Learning and adaptivity in interactive recommender systems. In ICEC ’07, pages 75–84,
2007.
[86] T. Mahmood and F. Ricci. Improving recommender systems with adaptive conversational strategies. In HT ’09, pages
73–82, 2009.
[87] A. Maksai, F. Garcin, and B. Faltings. Predicting online performance of news recommender systems through richer
evaluation metrics. In RecSys ’15, pages 179–186, 2015.
[88] G. Marcus. GPT-2 and the Nature of Intelligence. https://thegradient.pub/gpt2-and-the-nature-of-intelligence/, Jan.
2020.
[89] K. McCarthy, J. Reilly, L. McGinty, and B. Smyth. On the dynamic generation of compound critiques in conversational
recommender systems. In AH ’04, pages 176–184, 2004.
[90] K. McCarthy, J. Reilly, L. McGinty, and B. Smyth. Thinking positively – explanatory feedback for conversational
recommender systems. In ECCBR ’04, pages 115–124, 2004.
[91] K. McCarthy, M. Salamó, L. Coyle, L. McGinty, B. Smyth, and P. Nixon. Group recommender systems: A critiquing
based approach. In IUI ’06, pages 267–269, 2006.
[92] K. McCarthy, Y. Salem, and B. Smyth. Experience-based critiquing: Reusing critiquing experiences to improve
conversational recommendation. In ICCBR ’10, pages 480–494, 2010.
[93] L. McGinty and B. Smyth. On the role of diversity in conversational recommender systems. In ICCBR ’03, pages
276–290, 2003.
[94] D. McSherry. Incremental relaxation of unsuccessful queries. In ECCBR ’04, pages 331–345, 2004.
[95] M. S. B. Mimoun, I. Poncin, and M. Garnier. Case study – embodied virtual agents: An analysis on reasons for failure.
Journal of Retailing and Consumer Services, 19(6):605–612, 2012.
[96] N. Mirzadeh, F. Ricci, and M. Bansal. Feature selection methods for conversational recommender systems. In EEE ’05,
pages 772–777, 2005.
[97] S. Moon, P. Shah, A. Kumar, and R. Subba. OpenDialKG: Explainable conversational reasoning with attention-based
walks over knowledge graphs. In ACL ’19, pages 845–854, 2019.
[98] C. Myers, A. Furqan, J. Nebolsky, K. Caro, and J. Zhu. Patterns for how users overcome obstacles in voice user
interfaces. In CHI ’18, 2018.
[99] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):
3–26, 2007.
[100] F. Narducci, P. Basile, A. Iovine, M. de Gemmis, P. Lops, and G. Semeraro. A domain-independent framework for
building conversational recommender systems. In RecSys ’18 KaRS Workshop, pages 29–34, 2018.
[101] F. Narducci, M. de Gemmis, P. Lops, and G. Semeraro. Improving the user experience with a conversational recom-
mender system. In AI*IA ’18, pages 528–538, 2018.
[102] F. Narducci, P. Basile, M. de Gemmis, P. Lops, and G. Semeraro. An investigation on the user interaction modes of
conversational recommender systems for the music domain. User Modeling and User-Adapted Interaction, pages 1–34, 2019.
[103] T. N. Nguyen and F. Ricci. A chat-based group recommender system for tourism. In ENTER ’17, pages 17–30, 2017.
[104] I. Nica, O. A. Tazl, and F. Wotawa. Chatbot-based tourist recommendations using model-based reasoning. In
Configuration Workshop ’18, pages 25–30, 2018.
[105] L. Nie, W. Wang, R. Hong, M. Wang, and Q. Tian. Multimodal dialog system: Generating responses via adaptive
decoders. In MM ’19, pages 1098–1106, 2019.
[106] I. Nunes and D. Jannach. A systematic review and taxonomy of explanations in decision support and recommender
systems. User Modeling and User-Adapted Interaction, 27(3–5):393–444, 2017.
[107] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In
ACL ’02, pages 311–318, 2002.
[108] S. Park and Y.-k. Lim. Design considerations for explanations made by a recommender chatbot. In IASDR ’19,
2019.
[109] F. Pecune, S. Murali, V. Tsai, Y. Matsuyama, and J. Cassell. A model of social explanations for a conversational movie
recommendation system. In HAI ’19, pages 135–143, 2019.
[110] P. Pu and L. Chen. Trust building with explanation interfaces. In IUI ’06, pages 93–100, 2006.
[111] P. Pu, P. Viappiani, and B. Faltings. Increasing user decision accuracy using suggestions. In CHI ’06, pages 121–130,
2006.
[112] P. Pu, M. Zhou, and S. Castagnos. Critiquing recommenders for public taste products. In RecSys ’09, pages 249–252,
2009.
[113] P. Pu, L. Chen, and R. Hu. A user-centric evaluation framework for recommender systems. In RecSys ’11, pages
157–164, 2011.
[114] M. Qiu, F.-L. Li, S. Wang, X. Gao, Y. Chen, W. Zhao, H. Chen, J. Huang, and W. Chu. AliMe Chat: A sequence to
sequence and rerank based chatbot engine. In ACL ’17, pages 498–503, 2017.
[115] F. Radlinski and N. Craswell. A theoretical framework for conversational search. In CHIIR ’17, pages 117–126, 2017.
[116] F. Radlinski, K. Balog, B. Byrne, and K. Krishnamoorthi. Coached conversational preference elicitation: A case study
in understanding movie preferences. In SIGDIAL ’19, 2019.
[117] D. Rafailidis and Y. Manolopoulos. Can virtual assistants produce recommendations? In WIMS ’19, 2019.
[118] D. Rafailidis and Y. Manolopoulos. The technological gap between virtual assistants and recommendation systems.
ArXiv, abs/1901.00431, 2019.
[119] A. Rana and D. Bridge. Navigation-by-preference: A new conversational recommender with preference-based
feedback. In IUI ’20, pages 155–165, 2020.
[120] A. M. Rashid, I. Albert, D. Cosley, S. K. Lam, S. M. McNee, J. A. Konstan, and J. Riedl. Getting to know you: Learning
new user preferences in recommender systems. In IUI ’02, pages 127–134, 2002.
[121] J. Reilly, K. McCarthy, L. McGinty, and B. Smyth. Dynamic critiquing. In ECCBR ’04, pages 763–777, 2004.
[122] J. Reilly, J. Zhang, L. McGinty, P. Pu, and B. Smyth. A comparison of two compound critiquing systems. In IUI ’07,
pages 317–320, 2007.
[123] F. Ricci and Q. N. Nguyen. Acquiring and revising preferences in a critique-based mobile recommender system.
Intelligent Systems, 22(3):22–29, 2007.
[124] F. Ricci, A. Venturini, D. Cavada, N. Mirzadeh, D. Blaas, and M. Nones. Product recommendation with interactive
query management and twofold similarity. In ICCBR ’03, pages 479–493, 2003.
[125] F. Ricci, Q. N. Nguyen, and O. Averjanova. Exploiting a map-based interface in conversational recommender systems
for mobile travelers. In Tourism Informatics, pages 73–79. IGI, 2010.
[126] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. Recommender Systems Handbook. Springer-Verlag, 2nd edition, 2015.
[127] E. Rich. User modeling via stereotypes. Cognitive Science, 3(4), 1979.
[128] M. Rossetti, F. Stella, and M. Zanker. Contrasting offline and online results when evaluating recommendation
algorithms. In RecSys ’16, pages 31–34, 2016.
[129] A. Saha, M. M. Khapra, and K. Sankaranarayanan. Towards building large scale multimodal domain-aware conversation
systems. In AAAI ’18, 2018.
[130] Y. Salem, J. Hong, and W. Liu. History-guided conversational recommendation. In WWW ’14, pages 999–1004, 2014.
[131] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages
257–297. Springer US, 2011.
[132] H. Shimazu. ExpertClerk: A conversational case-based reasoning tool for developing salesclerk agents in E-Commerce
webshops. Artificial Intelligence Review, 18(3-4):223–244, 2002.
[133] N. Siangchin and T. Samanchuen. Chatbot implementation for ICD-10 recommendation system. In ICESI ’19, pages
1–6, 2019.
[134] B. Smyth, L. McGinty, J. Reilly, and K. McCarthy. Compound critiques for conversational recommender systems. In
WI ’04, pages 145–151, 2004.
[135] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J.-Y. Nie. A hierarchical recurrent encoder-decoder
for generative context-aware query suggestion. In CIKM ’15, pages 553–562, 2015.
[136] V. Srinivasan and L. Takayama. Help me please: Robot politeness strategies for soliciting help from humans. In CHI
’16, pages 4945–4955, 2016.
[137] A. Stolcke, N. Coccaro, R. Bates, P. Taylor, C. Van Ess-Dykema, K. Ries, E. Shriberg, D. Jurafsky, R. Martin, and
M. Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational
Linguistics, 26(3):339–373, 2000.
[138] M. Sun, F. Li, J. Lee, K. Zhou, G. Lebanon, and H. Zha. Learning multiple-question decision trees for cold-start
recommendation. In WSDM ’13, pages 445–454, 2013.
[139] Y. Sun and Y. Zhang. Conversational recommender system. In SIGIR ’18, pages 235–244, 2018.
[140] P. R. Telang, A. K. Kalia, M. Vukovic, R. Pandita, and M. P. Singh. A conceptual framework for engineering chatbots.
IEEE Internet Computing, 22(06):54–59, 2018.
[141] P. Thomas, M. Czerwinski, D. McDuff, and N. Craswell. Theories of conversation for conversational IR. In International
Workshop on Conversational Approaches to Information Retrieval, 2020.
[142] C. A. Thompson, M. H. Göker, and P. Langley. A personalized system for conversational recommendations. Journal
of Artificial Intelligence Research, 21(1):393–428, 2004.
[143] N. Tintarev and J. Masthoff. Designing and evaluating explanations for recommender systems. In Recommender
Systems Handbook, volume 1, pages 479–510. Springer, 2011.
[144] F. N. Tou, M. D. Williams, R. Fikes, D. A. Henderson Jr., and T. W. Malone. RABBIT: An intelligent database assistant. In AAAI
’82, pages 314–318, 1982.
[145] W. Trabelsi, N. Wilson, D. G. Bridge, and F. Ricci. Comparing approaches to preference dominance for conversational
recommenders. In ICTAI ’10, pages 113–120, 2010.
[146] D. Tsumita and T. Takagi. Dialogue based recommender system that flexibly mixes utterances and recommendations.
In WI ’19, pages 51–58, 2019.
[147] P. Viappiani and C. Boutilier. Regret-based optimal recommendation sets in conversational recommender systems. In
RecSys ’09, pages 101–108, 2009.
[148] P. Viappiani, P. Pu, and B. Faltings. Conversational recommenders with adaptive suggestions. In RecSys ’07, pages
89–96, 2007.
[149] J. Vig, S. Sen, and J. Riedl. Navigating the tag genome. In IUI ’11, pages 93–102, 2011.
[150] M. Walker, S. Whittaker, A. Stent, P. Maloor, J. Moore, M. Johnston, and G. Vasireddy. Generation and evaluation of
user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840, 2004.
[151] R. S. Wallace. The Anatomy of A.L.I.C.E. In Parsing the Turing Test, pages 181–210. Springer, 2009.
[152] W. Wang and I. Benbasat. Research Note—A contingency approach to investigating the effects of user-system
interaction modes of online decision aids. Information Systems Research, 24(3):861–876, 2013.
[153] P. Wärnestål. Modeling a dialogue strategy for personalized movie recommendations. In IUI ’05 Beyond Personalization
Workshop, pages 77–82, 2005.
[154] P. Wärnestål. User evaluation of a conversational recommender system. In IJCAI ’05 Workshop on Knowledge and
Reasoning in Practical Dialogue Systems, 2005.
[155] P. Wärnestål, L. Degerstedt, and A. Jönsson. Interview and delivery: Dialogue strategies for conversational recom-
mender systems. In NODALIDA ’07, pages 199–205, 2007.
[156] P. Wärnestål, L. Degerstedt, and A. Jönsson. PCQL: A formalism for human-like preference dialogues. In IJCAI ’07
Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 2007.
[157] B. Wei, J. Liu, Q. Zheng, W. Zhang, C. Wang, and B. Wu. DF-Miner: Domain-specific facet mining by leveraging the
hyperlink structure of Wikipedia. Knowledge-Based Systems, 77:80–91, 2015.
[158] J. Weizenbaum. ELIZA – Computer program for the study of natural language communication between man and
machine. Communications of the ACM, 9(1):36–45, 1966.
[159] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, and S. Young. A network-based
end-to-end trainable task-oriented dialogue system. In ACL ’17, pages 438–449, 2017.
[160] D. H. Widyantoro and Z. Baizal. A framework of conversational recommender system based on user functional
requirements. In ICoICT ’14, pages 160–165, 2014.
[161] J. Wissbroecker and F. M. Harper. Early lessons from a voice-only interface for finding movies. In RecSys ’18
Late-Breaking Results, 2018.
[162] C.-S. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung. Transferable multi-domain state generator
for task-oriented dialogue systems. In ACL ’19, 2019.
[163] D. J. Xu, I. Benbasat, and R. T. Cenfetelli. A Two-Stage model of generating product advice: Proposing and testing the
complementarity principle. Journal of Management Information Systems, 34(3):826–862, 2017.
[164] Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou, and Z. Li. Building task-oriented dialogue systems for online shopping.
In AAAI ’17, pages 4618–4626, 2017.
[165] L. Yang, M. Sobolev, C. Tsangouri, and D. Estrin. Understanding user interactions with podcast recommendations
delivered via voice. In RecSys ’18, pages 190–194, 2018.
[166] Z. Yin, K.-h. Chang, and R. Zhang. DeepProbe: Information directed sequence understanding and chatbot design via
recurrent neural networks. In KDD ’17, pages 2131–2139, 2017.
[167] T. Yu, Y. Shen, and H. Jin. A visual dialog augmented interactive recommender system. In KDD ’19, pages 157–165,
2019.
[168] T. Yu, Y. Shen, R. Zhang, X. Zeng, and H. Jin. Vision-language recommendation via attribute augmented multimodal
reinforcement learning. In MM ’19, pages 39–47, 2019.
[169] M. Zanker and M. Jessenitschnig. Case-studies on exploiting explicit customer requirements in recommender systems.
User Modeling and User-Adapted Interaction, 19(1-2):133–166, 2009.
[170] J. Zeng, Y. I. Nakano, T. Morita, I. Kobayashi, and T. Yamaguchi. Eliciting user food preferences in terms of taste and
texture in spoken dialogue systems. In MHFI ’18, pages 1–5, 2018.
[171] J. Zhang and P. Pu. A comparative study of compound critique generation in conversational recommender systems.
In AH ’06, pages 234–243, 2006.
[172] Y. Zhang, X. Chen, Q. Ai, L. Yang, and W. B. Croft. Towards conversational search and recommendation: System ask,
user respond. In CIKM ’18, pages 177–186, 2018.
[173] G. Zhao, H. Fu, R. Song, T. Sakai, Z. Chen, X. Xie, and X. Qian. Personalized reason generation for explainable song
recommendation. ACM Transactions on Intelligent Systems and Technology, 10(4):1–21, 2019.
[174] X. Zhao, W. Zhang, and J. Wang. Interactive collaborative filtering. In CIKM ’13, pages 1411–1420, 2013.