1 Introduction
Seeking, accessing, and using relevant information is a fundamental human activity and arguably crucial to the workings of societies around the globe. While the process and tools for
Information Access (IA) have changed considerably across human history, what we have witnessed in the past few decades has been astounding to say the least. With the explosion of digitized and online information, tools and processes for accessing it have had to evolve rapidly, resulting in many advancements in a short period of time when viewed against the history of human information production, storage, and dissemination. While the goals of these advancements in IA tools and technologies have centered on retrieval, filtering, and accessibility of information, the recent focus has shifted more toward what might be considered the generation of information.
As we see the proliferation of putative information generation systems such as Google’s LaMDA/Bard or Gemini [
85], OpenAI’s ChatGPT [
66], Microsoft’s New Bing or Bing Copilot [
59], and Baidu’s Wenxin Yiyan (ERNIE Bot) [
64], it is important for
Information Retrieval (IR)/IA scholars and developers to ask how such systems address human needs for IA, where they are falling short, and what we should want from them going forward.
This conceptual article is an attempt to draw attention to these issues and questions. We will do this with a three-pronged approach: (1) taking a step back and reviewing what we already know about what users and society want/need from an IA system, (2) asking what we do not know about building user-focused systems, and (3) thinking through ways to study and evaluate such systems. Our methodologies are a systematic analysis of past and existing scholarship, a critical examination of gaps in our knowledge, and a careful presentation of ideas for new research. We will go broad by considering IA as a part of a larger context of Information Seeking (IS) and Information Behavior (IB), and go narrow by looking at how recent advancements in Large Language Model (LLM)-based IA systems help with or hinder the progress we want. A primary purpose of these delineations is to offer ideas, opinions, and guidance that help students, scholars, and developers interested in this area learn about the challenges and opportunities that may shape their work.
This is neither a literature review nor an opinion article. Our contribution, instead, is an envisioning process for the future of IA systems in the era of ‘generative AI’. Specifically, we reassert a broad view of the study of IA and invite the scholarly community working in the area of IA to look at the problem holistically. This broader view allows us to consider generative AI systems as one candidate approach rather than a ‘solution,’ a framing that narrows and trivializes the problem. In doing so, we provide a framework for thinking about IA systems in our current landscape, suggest a broad range of Research Questions (RQs) to pursue, and provide guidance as to how they might be engaged.
The rest of the article is organized as follows. We begin in Section
2 with an overview of fundamental concepts to contextualize the discussion, with special attention to algorithmically mediated IA. Our purpose with this overview is to create a frame within which to situate and contrast generative AI systems (and LLMs) with other types of IA systems. In Section
3, we briefly review what we know about what users want from an IA system. This sets the stage for how LLM-based IA systems are helping or falling short in addressing those user needs as detailed in Section
4. In this analysis, we compare generative models to traditional discriminative models, enumerate risks associated with generative systems, and explore necessary conditions for beneficial use cases. Given the shortcomings of LLMs for addressing various types of information needs identified in Section
4, we turn our attention to the broader question of what we should and could study in this area and how. In Section
5, we present that as a proposal and a call for action, along with RQs and methods. Building on this analysis of LLMs and the future of IA systems generally, in Section
6, we turn our attention to what it means for IA on the Web. We consider the Web as an ecosystem of information and, through examples of how that ecosystem is being harmed, reflect on how acts of IA impact the broader society. We conclude in Section
7 with suggestions to researchers for how to navigate our current environment, where corporate incentives are funneling resources toward LLMs as the next incarnation of Web-based IA.
3 What Do Users Want from IA Systems?
Croft [
27] asked back in 1995, “What do people want from information retrieval?” Some of the attributes he identified have continued to be relevant, but new generations of IA systems have been shaping the user’s behaviors and expectations, while being influenced by the same behaviors and expectations. We expect that people generally want IA to be
easy, but what counts as easy and just how much willingness or ability the user has to persist with less-than-easy interfaces depends on the situation. Elderly users looking for medical information online, for instance, are people we might expect not to persist very long in the face of difficult or non-intuitive interfaces, despite a strong information need [5].
Relevance is a persistent goal throughout IA system development, but the meaning of ‘relevant’ and how to measure ‘relevance’ are often debated. Similarly, while people usually want information that is of
high quality,
authoritative, and
trustworthy [
45], how exactly one thinks about and operationalizes these attributes varies. In the following, we briefly review some of the findings about which features users desire from IA systems.
Relevance. This is one of the most important attributes that users desire from an IA/IR/IF system [
42]. Many scholars have taken a closer and deeper look at what ‘relevance’ means for users and how to operationalize it in different kinds of IA systems (e.g., [
19,
63]). Often evaluation frameworks characterize relevance as binary, but scholars such as Saracevic [
74,
75] have argued that relevance is subjective and should be considered a multi-faceted quantity to measure for evaluation, including facets such as situational, affective, and cognitive relevance.
Novelty and Diversity. While users clearly want relevance from accessed information, they also do not want to see the same kind of relevant information multiple times. In other words, users desire novelty and diversity [
86,
92]. For example, Chavula et al. [
26] studied a task about creating new ideas through IA. While it was not surprising that the users wanted more novelty and diversity in the information they encountered in this task, other works (e.g., [
32]) have found that even in cases where relevance is clearly the most important aspect of an IA task, user satisfaction was strongly correlated with novelty.
Speed. Studies have shown that speed is an important factor for user satisfaction with search results, regardless of their satisfaction with the quality of results themselves. Zhang et al. [
95] tested user satisfaction with the academic search engine Baidu Scholar, finding that respondents ranked “responsiveness” as the most important system attribute. Slow system response times had the most significant impact on user satisfaction regardless of the search results. Teevan et al. [
84] advocated for a shift toward
slow search, pointing out how search result quality may be compromised for the sake of speed. However, their survey of more than 1,300 crowd workers about their perception of search times and satisfaction with results using Bing queries found that the majority (61%) were unable to envision a search engine that sacrificed speed for quality, with almost a third stating that they would like to “see fast results always.” Only the minority whose information needs were not time sensitive and who sought a “perfect answer” were willing to wait longer, but many users doubted that search engines would be able to provide significantly improved results even if given extra time.
Personalization/Contextualization. Users appreciate IA that is personalized to their search because it provides data that is most relevant to their preferences and inquiry needs. This is amply demonstrated in studies of populations which experience oppression, such as LGBTQ+ people [
47]. In a study by Kitzie [
47], participants were pleased with how they could appropriate the search engine’s technological features toward their desired information outcomes but faced significant built-in sociocultural hetero/cisnormative barriers. Personalization is also an important factor in a study by Pretorius et al. [
69] on user preferences for searches related to mental health information. Once again, it was made clear that users who experience marginalization—whether due to their demographics, socio-economic situation, or disability—need highly personalized information. But even those who do not experience marginalization still find personalization a very important characteristic of an IA system [
51].
Interactivity. Interactivity in information searching can take many forms, with various approaches resulting in improved user engagement and satisfaction with results. Allen et al. [
4] conducted a participatory design session with six adults and seven children (ages 6–11) to create a new search engine results page interface, finding that interactive elements such as larger icons and navigation buttons, as well as the option to like, dislike, or bookmark results, improved children’s ability to navigate results and retrieve the desired information successfully. Another study of graduate students and postdocs by Liu et al. [
52] found that providing interactive keyword facets to allow users to refine their search queries resulted in greater recall with more diverse results, with equal precision to the regular text-search interface they tested. A study of 89 adults with mixed levels of health information literacy found that using a conversation agent to help guide health information search queries led users to report overall greater satisfaction with the results [
15]. When users were tasked with finding a clinical trial fitting certain criteria, 33% of users with low health information literacy were able to do so with the assistance of the conversation agent, whereas none were able to while using only the regular search engine. Interactivity can be a powerful method of improving users’ satisfaction with the quality of results, although the effectiveness of iterative query refinement, interactive design components, and conversational agents varies based on the user’s specific information needs.
Transparency. Transparency is another key aspect of user satisfaction, particularly with regard to users’ ability to trust the information provided by chatbot and search algorithm technologies. Shin and Park [
79] conducted a survey of 100 adults to determine how users’ perceptions of fairness, transparency, and accountability with respect to informational systems impacted the extent to which they trusted algorithms. They found that perceptions of fairness, accountability, and transparency supported trust in systems (and also that people more likely to trust systems were more likely to perceive them as fair, accountable,
and transparent), and that trust in turn was important for user satisfaction. Another example of the importance of transparency within these systems is the recent increased focus on so-called explainable artificial intelligence, or XAI. However, Diefenbach et al. [
31] echo Shin and Park in arguing that the value of transparency lies in responsiveness to the individual user’s needs. To balance users’ desires to delegate tasks to ‘invisible’ technology while also having technology that is transparent and therefore trustworthy, these programs must be able to respond to each user’s desired degree of transparency.
Other. In addition, scholars (e.g., [
3]) have discovered or proposed several other characteristics that users expect or desire from a good IA system, including fairness, lack of bias and misinformation in content, and recency of information (especially in the case of news and social media). One may not get all of these desired characteristics in every situation, and some factors may be more important than others, given a context or an application.
In summary, we already know a lot about what users want in an IA system, but there remain many open questions about how to achieve these design goals and how to measure how well they have been met. A prominent approach that is taking shape currently involves using generative models, specifically LLMs, to address various information needs. While this approach has shown some impressive results, we have not examined its appropriateness, applicability, and suitability for IA. In the next section, we compare discriminative and generative IA systems in the context of the user-focused dimensions described previously.
4 Framing and Positioning of LLM-based IA Systems
In this section we will review how and where new advancements in IA systems, stemming from ‘generative AI,’ have fulfilled or come up short for users and their information needs, as they are currently understood. Specifically, we will review LLM-based generative IA in comparison to classical discriminative systems with an eye toward the dangers of the generative systems as well as future directions for system development.
4.1 Assessment of IA Systems along the User-Focused Dimensions
In Section
3, we identified several characteristics that users want from an IA system, with six of them elaborated: relevance, novelty, speed, personalization/contextualization, interactivity, and transparency. Let us now examine how discriminative and generative IA systems do on these six dimensions. Note that both these categories fall under algorithmically mediated IA. Discriminative systems classify or rank existing content, whereas generative systems create new content based on the underlying LLM trained on large corpora. Both of these categories of IA systems cover IR (search) and IF (recommender systems) as shown in Figure
1. A quick comparison of discriminative and generative systems is presented in Table
1.
As far as relevance goes, both types of systems are able to produce impressive results, but the path to achieving that relevance varies. Search engines and other discriminative models achieve relevance through matching candidate retrieval results to the input queries and then ranking (although topical relevance is not the only factor in either of these processes). Generative models, however, achieve relevance either by running a discriminative process first and then using the generative model to synthesize a summary or just through their text synthesis process: with sufficiently large models, the training objective of plausibility will keep the text largely on topic and thus relevant. Similarly, novelty is produced in very different ways: for discriminative models, it is a question of what is available and how it is ranked. For generative models, it is dependent on pre-training and fine-tuning, and risks producing something that is both novel and false, through stochastic generation [
13]. One of the interesting ways the notion of novelty is manifested with generative models is through the ‘temperature’ parameter. This parameter can allow the user to control how
creative the model should get during content generation. For example, Bing provides three levels of conversation styles: Precise, Balanced, and Creative, going from more factual and retrieval-based output to more creative or novel content. Even though the user gets to control such temperature or style parameters, they are presented without thorough documentation. For example, a user may think that using the setting
precise (temperature=0) would guarantee accurate answers, but of course this is not how LLMs work.
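To make the temperature point concrete, the toy sketch below shows how a temperature parameter typically enters decoding; the token strings and logit values are invented for illustration and do not come from any real model or vendor API. Setting the temperature to 0 merely makes the choice deterministic; the chosen token is still only the most plausible one under the model, not a verified fact.

```python
# Toy sketch of temperature in next-token sampling (illustrative, not any
# vendor's implementation). Temperature rescales the logits before sampling;
# it controls variability, not factual accuracy.
import math
import random

def sample_next_token(logits: dict, temperature: float) -> str:
    """Sample one token from a softmax over logits / temperature."""
    if temperature == 0:
        # 'Precise' mode: greedy decoding is deterministic, but it still
        # returns the most *plausible* token, not a checked answer.
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    norm = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / norm for tok, v in scaled.items()}
    r, cumulative = random.random(), 0.0
    for tok, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return tok
    return tok  # guard against floating-point rounding

# Invented next-token scores for the prefix "The capital of Australia is":
logits = {"Sydney": 2.1, "Canberra": 1.9, "Melbourne": 0.3}
print(sample_next_token(logits, temperature=0))    # always "Sydney": deterministic, and wrong
print(sample_next_token(logits, temperature=1.0))  # varies from run to run
```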
Both discriminative and generative systems provide opportunities for users to interact with the information surfaced, but in different ways: classical discriminative systems facilitate user interaction with the source documents. These documents themselves are typically static, but the user can examine them directly and also explore how they are situated (where they are hosted, what other sources of information they point to in turn, etc.). Generative systems provide a new kind of possibility for interaction, namely conversational chat. However, this comes at the cost of direct access to sources. In the scenario where answers are provided directly from an LLM trained on a large dataset, there either is no source for the information (it is a recombination of words or word parts from the training data that does not exist in its entirety or match the information in any source document) or it is not traceable.
In the scenario where the LLM is used to summarize information from a duly linked set of source documents, the chat interface still seems likely to discourage exploration of those source documents, by foregrounding an appealing, even if possibly incorrect, summary.
Comparing discriminative and generative systems, we also see a tradeoff between personalization/contextualization and transparency. LLMs are masters of mimicry and can, for example, be prompted to use many different styles of language. An LLM chatbot-based IA interface could have preset (‘hidden’) addenda to prompts
along the lines of ‘Provide the answer using simple language’ that might be effective at producing output perceived as easier to understand for specific audiences (e.g., children, second language learners). However, this comes at the cost of the interface consisting solely of synthetic text. Furthermore, the more need there is for something like text simplification, the less well positioned the user would be to check the accuracy of the output they see. Similarly, such prompt addenda could possibly be used to help populations experiencing oppression (like LGBTQ+ users mentioned in Section
3) access output that contains less of the discriminatory framing that the results of general searches might contain. In this case, a good point of comparison in studies of user satisfaction would be search engine techniques that help users find community forums and similar spaces where other people with similar experiences share knowledge—but note that shaping query responses to avoid discriminatory language, while potentially quite desirable, is a far cry from providing connections to community.
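Such a hidden addendum is trivial to implement, which is part of why it can be invisible to the user. The following is a hypothetical sketch: the STYLE_ADDENDA presets and the call_llm stand-in are our assumptions, not any product’s actual configuration.

```python
# Hypothetical sketch of a preset ('hidden') prompt addendum for audience
# adaptation. The user sees only their own question in the interface; the
# style instruction is concatenated behind the scenes.

STYLE_ADDENDA = {
    "child": "Provide the answer using simple language suitable for a ten-year-old.",
    "second_language_learner": "Provide the answer using short sentences and common words.",
    "default": "",
}

def build_prompt(user_query: str, audience: str = "default") -> str:
    # Only user_query is visible in the UI; the addendum is invisible to the user.
    addendum = STYLE_ADDENDA.get(audience, "")
    return f"{user_query}\n\n{addendum}".strip()

def answer(user_query: str, audience: str, call_llm) -> str:
    # call_llm is a stand-in for whatever text-generation backend is used.
    # Simplifying the style does nothing to verify the content, and the readers
    # most in need of simplification are least positioned to catch errors in it.
    return call_llm(build_prompt(user_query, audience))
```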
4.2 Potential and Projected Harms and Costs for LLM-Based IA Systems
In the current narrative around ‘generative AI’ and specifically LLM-based models, several issues have been raised [
13,
77,
88] about their potential costs and harms. In the following, we examine a selection of these issues, starting with the most immediate and common ones and moving toward longer-term harms and less frequently discussed problems.
Ungrounded Answers. Despite all of the hype, from industry labs and elsewhere, that LLMs are not only a step on the path toward ‘artificial general intelligence’ [
2], but in fact showing the first ‘sparks’ of it (phrasing from the title of the unscientific report released by Microsoft Research [
22]), there is in fact no evidence or reason to believe that ‘intelligence’ or ‘reasoning’ will emerge from systems designed to extrude plausible sequences of word forms. With a large enough corpus for training, such a system can find sufficiently rich associations among words to generate appealing passages of text. But the only information an LLM can be said to have is information about the distribution of linguistic tokens [
13,
14]. Just because a system is able to store and connect some representation of online texts does not make it knowledgeable or capable of producing reliable answers. Information is not knowledge, and that is even more true when the information is only information about the distribution of word forms.
Bias in Answers. Friedman and Nissenbaum [
38] define bias as systematic and unfair discrimination. We know that any IA system can represent and reproduce biases resulting from data, algorithms, and content presentation [
8,
65]. And LLM-based generative systems could make things even worse: it is well established that LLMs absorb and amplify biases around gender, race, and other sensitive attributes [
1,
16,
17,
18,
53,
78]. What is worse is that when these biases are reproduced in the synthetic text output by the machine, they are stripped of their original context, where they are more clearly situated as the ideas of people, and reified into something that seems to have come from an omniscient and impartial machine. This effect is not new—Noble [
65] showed how Google’s presentation of search results had the same effect—but we believe there is reason to fear it will be even more pernicious when machines seem to be speaking our language.
Lack of Transparency. While most algorithmically mediated IA systems lack transparency, the issue becomes amplified when the system is LLM based and generating responses without a clear indication of how it sourced or created such responses. What data or information was used to train it? What was not used and why? Why was a given response generated? What signals and sources did it use? What confidence or guarantee can it provide? We should also consider the connection between transparency and accountability: if the response is incorrect, what recourse would be available to the user? What recourse would be available to non-users about whom incorrect information was displayed?
Lack of Agency for Users. As described earlier, one of the issues with algorithmically mediated IA is that the user often does not have enough agency. This is amplified when the system eliminates an interface with multiple options (e.g., a set of search results on a page, or a set of recommendations in a widget) and provides a single response. It becomes difficult, if not impossible, for users to confirm or control these responses. They are not able to question or change the process through which their responses are generated, other than specifying, via prompt engineering, the type or length of the response they want. Nor are they able to locate the sources in a broader information ecosystem and integrate that into their sense-making process [
77].
Lack of Appropriate ‘Friction’. Most systems are designed with the idea that they are supporting users who prefer to put in as little effort as possible when accessing information. Ease of use is certainly a desired characteristic of an IA system. However, we argue that it is not always advisable to minimize this effort, especially if it takes away the user’s ability to learn, for instance due to lack of transparency. We believe, as others have argued [
68], that a certain amount of ‘friction’ in an IA process is a good thing to have. It allows the user to understand and question the process, as well as providing the ability to refine and even retract their original question or information need. LLM-based generative IA systems often go too far in reducing both the active participation of the user and this potentially beneficial friction, as they aim to reduce the process of discovering information to simply getting ‘the’ answer.
Labor Exploitation and Environmental Costs. Apart from the issues discussed previously arising from the
use of LLM-driven chatbots, there are also serious issues with the way that they are presently produced. The performance of LLMs relies on extremely large datasets which are collected without consent, credit, or compensation [
57].
Current practice in improving the ‘safety’ of models like ChatGPT (read: reducing the chances that they output harmful content) involves a technique called
reinforcement learning from human feedback [
67], a data-intensive task which requires a human workforce to label data that is frequently extremely troubling. Investigative reporting by Billy Perrigo, Karen Hao, Josh Dzieza, and others has found that this work is largely outsourced to poorly paid workers in locales such as Kenya
—a finding in keeping with what is known about the human labor behind so-called ‘AI’ more generally [
89]. Finally, there is the environmental impact of these compute-intensive systems (at both train and test time), which includes not only energy usage (and the associated carbon impact) [
20,
76,
81] but also intensive water usage.
4.3 Potential Beneficial Use Cases
There have been plenty of applications of generative IA systems in recent months, including in education [
10], healthcare [
50], and commerce [
94]. However, insufficient attention is given to making such applications both beneficial and safe. We argue that to be beneficial and safe, such a use case would have to be one where:
(A) what matters is language form (content is unimportant),
(B) the ersatz fluency and coherence of LLM output would not be misleading,
(C) problematic biases and hateful content could be identified and filtered, and
(D) the LLM was created without data theft, exploitative labor practices, or profligate energy and water use.
Even setting aside conditions (B) through (D), what kind of IA use case could satisfy (A)?
The most compelling use cases of ChatGPT that we have seen reported are for accessing information about how to express certain programs in different programming languages [
80]. We note that this use case is actually best understood as a kind of machine translation, which is different from open-ended conversational replies. The answers (suggested computer code) are grounded in the questions (descriptions of what the computer code should do). Furthermore, the accuracy of the output can be relatively thoroughly checked, by compiling and running the code. We note, however, that issues remain, such as the possibility of security vulnerabilities in the generated code and lack of clarity around licensing/copyright of the training data [
33]. Furthermore, as discussed further in Section
6.1, StackOverflow found that users with easy access to a system that provided good-looking answers did not reliably test them before passing them along, causing headaches for the site.
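As a concrete illustration of the kind of check this use case affords, the sketch below runs a candidate snippet against a small test before it is trusted; the snippet (a deliberately plausible-looking but buggy string-reversal function) and the test are invented for the example, not output from any particular model.

```python
# Sketch: checking suggested code by executing it against known cases, rather
# than trusting fluent-looking output. The candidate snippet is an invented
# example of plausible-but-wrong output.

candidate_code = """
def reverse_string(s):
    # Looks reasonable, but drops the last character.
    return s[-2::-1]
"""

def passes_basic_tests(source: str) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)  # never run untrusted code outside a sandbox
        fn = namespace["reverse_string"]
        return fn("abc") == "cba" and fn("") == ""
    except Exception:
        return False

print(passes_basic_tests(candidate_code))  # False: the fluent answer fails a trivial check
```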
Another possible use case
involves tip-of-the-tongue phenomena: cases where the user is looking for the name of something like a movie or an obscure term, and can describe it. In this case, the results can be immediately verified with an ordinary search on the retrieved term. Condition (B) is certainly met as well. Condition (C) remains an issue: the chatbot might well output either something directly toxic or an answer that, when paired with the query, reflects damaging biases (akin to the examples discussed in Section
6.1). And condition (D) remains unresolved. Does a handy machine to turn to for tip-of-the-tongue queries merit the environmental and social impact of its construction?
4.4 Do LLMs Belong in IA Systems at All?
A critical perspective on LLMs in IA systems is open to the possibility that the tech simply does not fit the task. Accordingly, in this section we ask whether LLMs belong in IA systems at all.
In answering this question, the first step is to distinguish among ways that the LLMs might be used. Previous generation LLMs, such as BERT [
30], were largely used as a source of ‘word embeddings’ or vector-space representations of words in terms of what other words they co-occur with. These representations are more informative than the word spellings themselves and thus proved extremely beneficial in many natural language processing classification tasks [
61]. Classification tasks that serve as components of IA systems include named entity recognition, topic clustering, and disambiguation, among others. This use of LLMs does not seem to reduce user agency or transparency.
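For concreteness, here is a brief sketch of the kind of embedding extraction this refers to, using the Hugging Face transformers library with the public bert-base-uncased checkpoint; the example sentences and the choice to read off a single token’s vector are illustrative assumptions rather than a prescribed recipe.

```python
# Sketch: using BERT as a source of contextual embeddings for downstream IA
# components (e.g., entity recognition, clustering, disambiguation).
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(text: str, word: str) -> torch.Tensor:
    """Return the contextual vector for `word` within `text` (assumes the
    tokenizer keeps the word as a single wordpiece)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# The same spelling gets different vectors in different contexts, which is
# what makes these representations useful for disambiguation and related tasks.
v_river = word_vector("they walked along the river bank", "bank")
v_money = word_vector("she deposited the check at the bank", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0).item())  # well below 1.0
```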
LLMs might also be used in query expansion, rephrasing the user’s query into multiple alternatives that can then in turn match more documents [
9,
96]. This level of indirection comes at some cost to transparency (although systems could presumably make the expanded queries explorable). However, it still leaves the user with the ability to directly interact with the returned documents in their source context.
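A minimal sketch of this pattern follows; the prompt wording and the llm and search callables are placeholders we assume for illustration, not any system’s actual interface. The design point worth noting is that the expanded queries can be returned alongside the results, making the indirection explorable as suggested above.

```python
# Sketch: LLM-assisted query expansion. llm() and search() are placeholder
# callables standing in for a text-generation backend and a conventional
# (discriminative) retrieval system; neither is a real API.
from typing import Callable, Iterable, List

def expand_query(query: str, llm: Callable[[str], str], n: int = 3) -> List[str]:
    prompt = (
        f"Rephrase the search query '{query}' in {n} different ways, "
        "one per line, without adding new constraints."
    )
    rephrasings = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + rephrasings[:n]

def expanded_search(query: str, llm: Callable[[str], str],
                    search: Callable[[str], Iterable[dict]]) -> dict:
    queries = expand_query(query, llm)
    results, seen = [], set()
    for q in queries:
        for doc in search(q):
            if doc["id"] not in seen:  # de-duplicate across rephrasings
                seen.add(doc["id"])
                results.append(doc)
    # Returning the expanded queries alongside the results keeps the
    # indirection inspectable, which addresses part of the transparency cost.
    return {"queries": queries, "results": results}
```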
Finally, LLMs could be used to synthesize answers to user queries, either based on a specific set of documents returned (as done, e.g., in Google snippets [
82]) or off of their entire training data (as with ChatGPT [
66]). While the latter case can be called ungrounded generation, the former is aimed at being grounded on relevant content. Some of the recent efforts related to retrieval-augmented generation provide potential solutions for provenance and precision or accuracy but are still vulnerable to the shortcomings of text generation [
24,
49]. In short, for either case, there is a high risk that the synthetic text, once interpreted, is false or misleading. This risk is somewhat mitigated if the source documents are surfaced along with the summary, although user studies would be needed to verify in what contexts and to what extent users click through to check the answers.
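A sketch of the grounded variant appears below; the retrieve and llm callables and the prompt template are assumptions for illustration rather than any product’s implementation. The returned structure keeps the source documents alongside the synthetic summary so that an interface can surface them, supporting the mitigation just discussed, though nothing in this arrangement prevents the summary itself from being wrong.

```python
# Sketch: retrieval-augmented answer synthesis that preserves provenance.
# retrieve() and llm() are placeholder components assumed for illustration.
from typing import Callable, Iterable

def grounded_answer(
    query: str,
    retrieve: Callable[[str, int], Iterable[dict]],  # yields dicts with 'id', 'url', 'text'
    llm: Callable[[str], str],
    k: int = 5,
) -> dict:
    docs = list(retrieve(query, k))
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    prompt = (
        "Answer the question using ONLY the passages below, citing passage ids "
        f"in brackets. If they do not contain the answer, say so.\n\n{context}\n\n"
        f"Question: {query}"
    )
    summary = llm(prompt)
    # Return the sources with the summary so the interface can link back to
    # them; the instruction to cite does not guarantee the summary is faithful.
    return {
        "summary": summary,
        "sources": [{"id": d["id"], "url": d["url"]} for d in docs],
    }
```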
Many detrimental use cases of ungrounded generation have been proposed, including LLMs as robo-lawyers (to be used in court [
71]), LLMs as psychotherapists [
21], LLMs as medical diagnosis machines [
40], and LLMs as stand-ins for human subjects in surveys [
6,
39]. In all of these cases, a user has a genuine and often life- or livelihood-critical information need. Turning to a system that presents information about word co-occurrence as if it were information about the world could be harmful or even disastrous.
4.5 Summary
In summary, LLMs have several important and often alarming shortcomings as they have been applied as technologies for IA systems. To understand whether and how LLMs could be effectively used in IA, a lot more work needs to be done to understand user needs, IB, and how the affordances of both LLMs and other tools respond to those needs. As a step toward beginning that work, we turn to an exploration of the kinds of questions that can be asked in research programs centered not on promoting a specific technology but rather on supporting IB.
5 What Should We Study and How?
In light of what we know about what the users of IA systems want as well as the recent advancements and foci in this space, we see a broad range of fruitful RQs, itemized in the following. These RQs are organized in four categories based on what aspect of IA is most in focus: user needs, societal needs, technical solutions, or transparency in user interfaces. These questions vary in their specificity and are not meant to be an exhaustive listing of possible research directions, but rather to suggest questions that are still very much open and urgent given the advent of LLMs.
Design Questions Based on Supporting User Learning.
The RQs in this category center around envisioning and designing of new systems, services, and modalities, with a particular focus on the human activity the systems are meant to support:
RQ1: What would users need to be able to do effective sense making when working with a generative model?
RQ2: What design choices support continual development of information literacy, and how might these differ for different types of users?
RQ3: How can we support users in learning to identify and contextualize synthetic text?
RQ4: How do we provide proper agency and ‘friction’ to the user? ‘Friction’ here refers to a small amount of effort needed by the user to accomplish their IA task.
While systems are often designed to be as frictionless as possible, we argue, with Pickens [
68], that ‘friction’ is often useful. A small amount of effort can be what allows the users to have more control and a possibility to learn, while also being able to do better assessment of information being accessed and used.
These questions could be investigated using various ethnographic studies (RQ1), design sessions (RQ2), and user studies with cognitive methods (RQ3, RQ4).
Design Questions Based on Societal Needs.
As previously, these questions center around envisioning new systems, but this time with a focus on the societal impact of IA infrastructure:
RQ5: How do we fashion IA systems that are understood as public goods rather than profit engines? Are there distributed peer-to-peer conceptualizations that would support this, even without massive public investment?
RQ6: How might we structure IA systems such that shared governance structures could slow or resist the injection of hateful content or other misinformation?
We believe that these questions are very broad and complex and will require multiple methods to address them. A good place to start may be the framework and methodologies of value sensitive design [
34,
36].
Questions Centered on Technical Innovations.
The questions in this category focus on technical innovations that might be evaluated quantitatively and are responsive to the broader range of desiderata laid out in Section
3:
RQ7: How can we detect potential biases in responses generated by an IA system? How can we mitigate them or position system users to mitigate them?
RQ8: In working to mitigate bias in IA systems, how do we navigate tensions with measures of relevance and other desirable characteristics?
RQ9: What methods can be devised to combine discriminative and generative models of IA systems to improve performance along desired characteristics?
The RQs in this set could be effectively addressed using appropriate methodologies from value sensitive design, as well as empirical user studies or randomized trials.
Questions Centered on Interface Design and Transparency.
Finally, we turn to questions which center on user interface design and especially questions about how UI design choices can support transparency:
RQ10: How do we balance summarization, which provides access to information across very large data collections, with transparency into the sources of information and its original context?
RQ11: What interface design options better encourage users to connect information in summarized output to its antecedent in the source documents?
RQ12: How do we provide transparency in personalized and application-specific ways?
RQ13: How can we effectively integrate temporal information (when documents were written/updated) into the presentation of search results?
These RQs are suited to methods that involve designing and evaluating interactive interfaces, primarily using lab and field studies.
Even though these questions vary in their foci—ranging from society, to the user, to system internals, to UI—all are framed within a viewpoint that considers IA systems in their societal context. Starting from this framing, we consider first how human IA activities proceed and how they can be supported. Work on developing and evaluating algorithms would then largely be in service to these user- and society-centered goals.
If LLMs might serve as components in such systems, they would be evaluated against the goals and in the context of these use cases. In other words, it is not in the interest of information science to be a proving ground for so-called ‘AI.’ IR, and information science in general, should be positioned to ask, of any system, called ‘AI’ or not, if it is an effective match for IA needs.
7 Conclusion
IA systems, and the way we access information on the Web, have come a long way from their initial modern incarnations in the late 20th century.
For a long time, IR researchers have argued (e.g., [
56]) for a better modality than a small search box that encourages short queries with a few keywords rather than longer queries, questions, or natural language interactions. With the development of ChatGPT, Microsoft’s integration of similar technology into Bing, and Google’s Bard/Gemini, we have what look like steps toward that vision. Not only do they encourage longer queries, they also provide responses in the form of paragraphs and elicit multi-turn search activities. However, this is also the time when we must ask again—what is it that IA users really want, do the new modalities and interactions address all their needs, and how do they impact society as a whole? If we optimize toward some imagined ideal, perhaps one inspired by science fiction representations of ship-board computers, we fail to design for actual users in the actual world. In other words, the goal of IR should not be to get users the ‘right’ answer as quickly and easily as possible, but rather to support users’ IA, sense making, and information literacy. We also need to ensure that these systems provide exposure to information that is as diverse and comprehensive as is available, while being mindful of the societal context of fairness, equity, and accessibility.
Thus, regarding the turn toward LLM-based IA interfaces, we ask: What are we sacrificing when we shape systems in this way? We argue that these systems take away transparency and user agency, further amplify the problems associated with bias in IA systems, and often provide ungrounded and/or toxic answers that may go unchecked by a typical user. However, given the current corporate incentives in this space, we expect to continue to see resources poured into developing and promoting LLM and chatbot-based IA systems. Given that, we urge researchers and developers to exercise some responsibility and caution. We list a few suggestions here to wrap up our discussion:
• Focus on user processes. For example, previous work [
77] demonstrated how we could use IS strategies by Belkin et al. [
11] for designing more comprehensive and capable IA systems.
• Involve not only potential users but also other stakeholders throughout the process, starting with design. Other stakeholders here are people who are potentially impacted when a third party accesses inaccurate, bigoted, or otherwise harmful information. Methodologies from value sensitive design can help in the identification of stakeholders (e.g., [
93]) and structured consultation with them (e.g., [
54]).
• Given that these are complex systems with multiple stakeholders, aim for a multi-pronged approach to evaluation. Simply focusing on relevance could bias the search results. Simply focusing on diversity could alienate users. Simply focusing on efficiency could take away user agency and friction.
Finally, we urge IR as a field to strengthen and maintain its focus on the study of how to support people when they engage in IB. IR is not a subfield of AI, nor a set of tasks to be solved by AI. It is an interdisciplinary space that seeks to understand how technology can be designed to serve, ultimately, human needs relating to information. In this article, we have engaged in an envisioning process, to lay out the kinds of RQs that we believe will strengthen the field of IR in this way. We invite the reader to take up these questions and work on them directly or to take up their spirit and propose more.