Article 5

Applied Corpus Linguistics 3 (2023) 100066
Contents lists available at ScienceDirect
Applied Corpus Linguistics

journal homepage: www.elsevier.com/locate/acorp
Generative AI and the end of corpus-assisted data-driven learning? Not so fast!
Peter Crosthwaite a,*, Vit Baisa b
a School of Languages and Cultures, University of Queensland, Australia
b Masaryk University, Faculty of Informatics, Natural Language Processing Centre, Czech Republic
ARTICLE INFO

Keywords: Data-driven learning; generative AI; ChatGPT; DDL; Corpora

ABSTRACT

This article explores the potential advantages of corpora over generative artificial intelligence (GenAI) in understanding language patterns and usage, while also acknowledging the potential of GenAI to address some of the main shortcomings of corpus-based data-driven learning (DDL). One of the main advantages of corpora is that we know exactly the domain of texts from which the corpus data is derived, something that we cannot track from current large language models underlying applications like ChatGPT. We know the texts that make up large general corpora such as BNC2014 and BAWE, and can even extract full texts from these corpora if needed. Corpora also allow for more nuanced analysis of language patterns, including the statistics behind multi-word units and collocations, which can be difficult for GenAI to handle. However, it is important to note that GenAI has its own strengths in advancing our understanding of language-in-use that corpora, to date, have struggled with. We therefore argue that by combining corpus and GenAI approaches, language learners can gain a more comprehensive understanding of how language works in different contexts than is currently possible using only a single approach.
* Corresponding author.
E-mail address: p.cros@uq.edu.au (P. Crosthwaite).
https://doi.org/10.1016/j.acorp.2023.100066
Received 18 April 2023; Received in revised form 25 June 2023; Accepted 12 July 2023
Available online 13 July 2023
2666-7991/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Introduction

The late 2022 release of OpenAI’s ChatGPT and the subsequent explosion in generative artificial intelligence (GenAI) applications have already fundamentally changed the perception of the general public towards the possibilities of human interaction with large language data – something corpus linguists have been attempting to do for decades. Certainly, as corpus linguists dedicated to popularising what we do outside of academia, we’ve often had to describe what corpus linguists do with corpora and what corpus consultation can bring, for people who might not consider themselves corpus linguists. It generally goes something along the lines of the following:

“We use corpus tools to query large authentic, principled collections of electronically searchable text to discover the patterns of language-in-use across a wide variety of general and specific registers, so that we can better understand (and possibly teach) those patterns”

After the usual cursory nods of the head or the occasional “how interesting?”, the conversation usually ends there, and the subject is changed.

But not so much these days. Recently, any talk of large language data is met with fervent interest – “oh, are you using that ChatGPT?”, “I guess we are all out of a job soon”, “my friend uses it to write all her work e-mails”, etc. More astute observers by now even understand the underlying processes behind how such models and their user interfaces work, essentially by calculating – very quickly and at unprecedented scale – those underlying patterns of language-in-use that corpus linguists are so familiar with. Suddenly, language data – corpora – are back in vogue. Yet, the field of corpus linguistics is at a crossroads. Despite our best efforts, our field risks being overshadowed by GenAI researchers who are essentially just doing what we as corpus linguists already do, but in a way that has finally captured the imagination of the public.

This is certainly the case for those who work on corpus-based data-driven learning (DDL). Corpora have now been used for over two decades to enhance the teaching and learning of languages, whether indirectly through the creation of dictionaries, wordlists or integration of corpus data into teaching materials, or directly through learners’ hands-on corpus consultation (more commonly termed DDL). DDL is purported to promote language acquisition through exposure to the frequency and salience of patterns of language in use, whether individually through constructivist learning, or socioculturally when used in classroom contexts as learners, peers and teachers complete scaffolded activities
involving corpus data (O’Keeffe, 2021). A wealth of empirical studies has found advantages in using DDL across a variety of linguistic targets in experimental research, with generally positive accompanying perceptions (Boulton & Vyatkina, 2021), albeit tempered with criticisms about corpus tools and corpus data that we outline in more detail below.

Therefore, having just released our own DDL corpus tool in early 2023 (https://corpusmate.com), we were concerned that our work was all in vain – who would want to use corpus tools to consult corpora for DDL, when the world’s largest corpus (in a sense) was now publicly available for query, with the latest interactive chatbot available to quickly and recursively query that corpus, and all (currently) for free? Does this mean the end of DDL as we know it? Upon reflection, it certainly does mean the end of DDL as we know it if corpus linguists and DDL practitioners do not take steps to rectify ongoing issues that have continually plagued the field and begin to consider how GenAI may help us overcome them. Crosthwaite and Boulton (2023) described these issues for DDL at some length, including (amongst other factors) expanding the boundaries of what constitutes DDL outside of concordancing, the complexity of most publicly available corpus datasets used for DDL, and the high level of Technological, Pedagogical and Content Knowledge (TPACK – Koehler & Mishra, 2009; see Meunier, 2019 for a description of its potential for DDL) required for practicing teachers to make corpus-based DDL work for them. Now the genie is out of the proverbial bottle when it comes to ChatGPT/GenAI, the field of DDL needs to address these issues if more ‘traditional’ corpus consultation is to remain viable (funnily enough, in most DDL studies, corpora are seen as the trendy alternative to ‘traditional’ teaching). Alternatively, can DDL researchers now harness the power of GenAI to bring the kind of learning espoused under DDL to a wider audience? In what follows, we briefly make the case for both possibilities.

The continued case for corpora in DDL

The potential affordances of GenAI for language learning include real-time conversation, immediate formative and corrective feedback, natural language explanations of vocabulary in contexts, instant generation of texts of specific registers and genres, dictionary definitions and examples, and machine translation (Kohnke et al., 2023). However, given the short time since the release of GenAI applications such as ChatGPT, such affordances remain as yet largely empirically untested at the actual classroom level. We do, however, have a lengthy body of research pointing out the learning gains for genre/register awareness, vocabulary, grammar and error correction that corpus consultation can bring to end-users, as suggested in numerous meta-analyses and bibliometric reviews (e.g., Boulton and Cobb, 2017; Lee et al., 2019; Dong et al., 2022). At a fundamental level, we also know that principled, curated consultation of corpus data can lead to meaningful learning at both individual and sociocultural levels of engagement (O’Keeffe, 2021).

There is absolutely no reason why this should not continue in the GenAI age. Let us consider here some of the advantages that corpora and corpus tools still hold over chatbot-like GenAI applications:

1) Knowing the data – One of the main ‘selling points’ of corpora as a research and later teaching tool was that we know, exactly, the domain of texts from which the corpus data is derived, something that we cannot track from current large language models underlying applications such as ChatGPT. We know the texts that make up large general corpora such as the BNC2014 and the BAWE, and can even go to the trouble of extracting the full texts from these corpora should we need to (see Crosthwaite et al., 2021 for a pedagogical example). CorpusMate, for example, has a ‘citation’ function whereby one can click any concordance result to see which corpus the text was derived from, the title of that text, and a link to said corpus. The ability to use DIY corpora (e.g., Charles, 2012) is a considerable advantage in this regard, with learners able to claim complete ownership over the corpus used for subsequent queries.

2) Authenticity – Together with the first point above, there is a significant difference between corpus data as produced by humans, and GenAI output as generated through a statistical procedure. GenAI may generate sentences which are grammatically correct, but which might rarely be used in actual writing or conversation, or which may not be contextually or register-appropriate. Subsequent prompt tweaking may be required to ensure issues of this nature are minimised when using GenAI, but the authenticity of corpus data – that is, language data actually produced by humans – should be seen as a more reliable indicator of real language-in-use, particularly for second language learners who do not have the benefit of being able to easily authenticate whether a given output would match what a native speaker of the target language might produce in the same context.

3) Replicability – While GenAI applications ‘generate’ text through complex statistical procedures, an end-user currently cannot see, nor replicate, the statistical procedures that lead to that generated text. Even if you could, the answers are randomly sampled, leading to a unique answer for each subsequent identical query. The benefit of corpora in this regard is that one can easily replicate a given finding with the same query on the same data, in this respect producing ‘hard evidence’ that word X belongs with word Y, for example. Such evidence is incredibly powerful for language learners (and their teachers) – even better if you can replicate a finding in Corpus X in Corpus Y. You can also do this time and time again, without limitation.

4) Multimodality – Several recent corpus tools present multiple pathways into corpus data, be this in the form of (coloured) concordances, statistical tables (e.g., collocation scores), and, increasingly, visual charts and maps of relationships between words and lexico-grammatical units (e.g., Voyant Tools, Sinclair and Rockwell, 2016). These improvements to corpus tool functionality work to specifically target and highlight patterns in corpus data for improved visibility and learnability for groups of users for whom more traditional concordancing may be difficult. Additionally, some pedagogical corpora such as SACODEYL (Pérez-Paredes & Alcaraz-Calero, 2009) utilise video and audio files in tandem with concordancers. At the time of writing, most GenAI tools do have the ability to generate tables or detailed images from text prompts, but it can be difficult to operationalise this within a chatbot context and usually requires integration of one tool (e.g., ChatGPT) with another (e.g., Midjourney), making this difficult for non-technical users.

5) Safety – In working closely with educators at the primary and secondary levels, an oft-reported concern is the lack of clarity about how GenAI tools and companies use user data. Any data provided by pre-tertiary age users is commonly subject to numerous ethical and legal safeguards, as is sensitive personal data, data on curricula, assessments, and many more. Educational bodies are as yet hesitant to allow even staff access to ChatGPT for these reasons, never mind younger learners. As most corpus tools only require very little in terms of user data (perhaps only initial registration details), corpus linguists looking to use this opportunity to push corpus use in schools can stress that corpus consultation is currently a ‘safe’ option.

6) Hallucinations – Due to the way GenAI works as a predictive language model, the accuracy of its output can often leave a lot to be desired. For example, recent tests suggest that ChatGPT’s ability to generate non-Latin script languages is poorer than its ability to understand them (Bang et al., 2023), and it has also been shown to invent terms that lie outside of its training data (Shen et al., 2023). While corpus data can also be rife with typos, punctuation and spacing issues, and lexico-grammatical errors and innovations (albeit this is what you want in learner corpora, for example), many of these issues can be overcome through appropriate data cleaning and preparation, unlike GenAI where no such access to the data is available and where users are slaves to the algorithm.

7) Active vs. passive learning – The literature on the ‘L’ in DDL suggests that the type of constructivist, usage-based learning that takes
place during corpus consultation requires much intention on the part of the learner to succeed, with concordancing a cognitively demanding task that requires significant inductive learning processes (Sun and Wang, 2003). So far, ChatGPT claims its use promotes inductive learning: “When a user interacts with GPT, they are essentially making use of the patterns and associations learned by the model to generate new text or provide relevant information. The process is inductive because it involves drawing conclusions based on the patterns and associations observed in the input data, rather than applying pre-existing knowledge or rules (as in deductive reasoning)” (OpenAI, 2023). However, to what extent is ChatGPT or the learner doing the induction? Without the learners’ knowledge of these ‘patterns’ or ‘associations’ in the data, it appears that GPT is the one doing the induction, and there is a significant risk that users simply copy-paste ChatGPT output into their own work with little actual ‘learning’ taking place.

To summarise, corpora and DDL still offer advantages over large language models such as ChatGPT when it comes to language learning and teaching, which should come as a relief for the field. That said, there are still many reasons why the field needs to embrace this technology if we are to stay relevant, which we outline in the following section.

The case for bringing GenAI into DDL

DDL researchers are acutely aware of the shortcomings of the current state-of-the-art related to available corpora and corpus tools. Following DDL training sessions, we hear the same qualitative comments from teacher trainees and language learners again and again, summarised in the form of “Corpora and DDL sound really cool, but…”. Crosthwaite and Boulton (2023) note some of these “buts” include:

• The level of technical knowledge needed to use the tool (for DDL).
• Complicated or unintuitive user interfaces at odds with how modern learners typically access digital information through resources such as Google (and now GenAI).
• Unsuitable corpus data for the target learners (e.g., COCA for younger or less proficient L2 learners).
• A general inability to track users’ corpus use (and any associated learning gains) over time.

GenAI stands as a potential solution to almost all these concerns if we can take the kind of inductive, discovery learning approach that we currently employ for corpus-based DDL and apply that to our use of GenAI. In other words, if we work on “bringing a DDL mindset to our learners, rather than expecting them to come to corpus linguistics” (Crosthwaite & Boulton, 2023), we can open the DDL field to new possibilities rather than gatekeeping DDL behind concordancing and concordancers. While this was necessary before the advent of GenAI, it is now crucial for the DDL field to take this step if we are to remain relevant going forward.

Let us discuss the possibilities in turn.

1) Required technical knowledge – Already GenAI chatbots have significantly reduced the levels of technical knowledge required to successfully consult large language data. The simple ability to use natural language inputs to receive (generated) natural language output is really the main game-changer, even before the release of OpenAI’s ChatGPT chatbot. For example, we no longer need to use complex corpus query language syntax to isolate parts of speech within our corpus retrievals, as we can now simply ‘ask’ for what we want, e.g., “can you list a few example sentences including ‘phrase X’?”, “what is a more formal equivalent of word Y?”, or “what are some words similar to the word ‘Z’?”. This significantly reduces the degree of metalinguistic knowledge required to successfully query a corpus platform for DDL, and for those with such knowledge, more advanced queries can unlock even more patterns (e.g., “list 10 sentences containing the past participle of the word ‘do’”).

2) User experience – The single biggest reported complaint we hear following DDL training sessions is related to the complexity of the corpus tools, with most comments reserved for the user interface (UI). So much DDL training has been done with highly complex UIs such as BYU/English-corpora.org, which accounted for 31% of all DDL studies up to 2019 (Boulton & Vyatkina, 2021). Admittedly, such corpus tools commonly used for DDL are research rather than teaching tools, but even attempts to streamline complex corpus tool UIs have done little to help (e.g., SketchEngine’s recent ‘overhaul’ of its UI). Interacting with GenAI applications such as ChatGPT could not be easier in form or function, hence its current popularity amongst the general population.

3) Differentiation – While we already discussed the rigidity of corpus data as an advantage earlier in this paper, such rigidity is also a major problem for users without the required proficiency to understand the concordance results they are getting. What GenAI applications do very well at present is ‘differentiate’ language: they can take text intended for, say, advanced users of the target language and reproduce it using simpler structures and vocabulary (e.g., “re-write this text for a 9th grader”). English as an additional language/dialect teachers we work with in Australia see this function as game-changing for their everyday work, particularly for mixed-proficiency classes. The potential impact of GenAI’s ability to generate results from almost any register, domain or even language cannot be overstated, and can greatly widen the scope of DDL from its current almost invariable focus on tertiary academic English language.

4) Data size – Another factor to consider is simply the size of current large language models such as ChatGPT (using GPT-4), with training data token counts in the billions or trillions. Being able to query models of this size online and at speed was not previously possible with even the best available corpus tools.

5) Prior user input as input – What separates OpenAI’s success from rivals with older large language models is the ability to take previous inputs to the model and use these to refine future outputs, with these ‘chats’ saved for future, currently unlimited use. While numerous DDL studies have sought to track how learners engage with corpora in the form of screen recordings (e.g., Kotamjani et al., 2017) and query logs (e.g., Pérez-Paredes et al., 2011), easily capturing this longitudinal data using existing corpus tools is currently not possible. Revisiting user interactions with these language models and using chatlogs as evidence of ‘learning’ will likely be a major research methodology in the coming years as researchers investigate to what extent GenAI use promotes language acquisition and even ‘better learning’ in general, something DDL research has largely yet to achieve.

6) Translation to teaching materials – It generally takes a teacher with a high degree of corpus literacy, content and pedagogical knowledge to convert corpus findings into actual teaching materials, as seen in a number of recent studies exploring DDL lesson planning (e.g., Ma et al., 2022). However, GenAI/ChatGPT’s ability to take a previous language-focused query (e.g., “what are some nouns that fill slot X in this sentence”) and seamlessly convert that finding into a teaching task or assessment item (e.g., “take the list of nouns you generated and create a sequence of multiple-choice questions based on them”) or even a full lesson plan (e.g., “build a lesson plan for 2nd graders based on acquiring these nouns”) is simply amazing, and a complete game-changer for mainstream pre-tertiary teachers with little time or resources.

7) Funding – Typically, most corpus builders and corpus tools creators are one-person shows or small university-based teams, working on limited grants, donations, or in many cases for free as a hobby. This can leave tools unfinished or with no foreseeable upgrades in data or functionality, leading to a limited shelf-life beyond which end-users move on. Current major GenAI companies such as OpenAI are
already multi-billion-dollar enterprises, with new updates and upgrades coming out on a weekly or even daily basis. We, as a field, cannot hope to compete with the speed and scale of GenAI, but if you can’t beat them, join them.

Conclusion

To conclude, it may be time to consider whether we continue to gatekeep DDL behind concordancers, or open up the ‘D’ in DDL to new, GenAI-assisted possibilities. We are keen to stress that this does not have to be a zero-sum game – there are situations where corpora will continue to be advantageous over GenAI for some time to come, while GenAI can already solve certain issues that have continued to plague corpus-based DDL for years. Combining corpus-based DDL with GenAI appears to be ‘a useful methodological synergy’, similar to when corpora began to be used for critical discourse analysis, for example (Baker et al., 2008). We are also keen to stress, however, that not leveraging GenAI risks leaving DDL practitioners and DDL as an enterprise behind, given the current depth of interest in GenAI for education. DDL researchers are well-placed to take advantage of this renewed mainstream interest in language data as we understand the power of such data for language teaching as well as the conditions required for meaningful learning using such data to proceed. Let’s therefore put our money where our mouth is and get started.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., Wodak, R., 2008. A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse Society 19 (3), 273–306. https://doi.org/10.1177/0957926508088962.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q., Xu, Y., Fung, P., 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
Boulton, A., Cobb, T., 2017. Corpus use in language learning: a meta-analysis. Lang. Learn. 67 (2), 348–393. https://doi.org/10.1111/lang.12224.
Boulton, A., Vyatkina, N., 2021. Thirty years of data-driven learning: taking stock and charting new directions over time. Lang. Learn. Technol. 25 (3), 66–89. http://hdl.handle.net/10125/73450.
Charles, M., 2012. ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English Specific Purposes 31 (2), 93–102. https://doi.org/10.1016/j.esp.2011.12.003.
Crosthwaite, P., Boulton, A., 2023. DDL is dead? Long live DDL! Expanding the boundaries of data-driven learning. In: Pérez-Paredes, P., Tyne, H. (Eds.), Discovering Language: Learning and Affordance in press.
Crosthwaite, P., Sanhueza, A.G., Schweinberger, M., 2021. Training disciplinary genre awareness through blended learning: An exploration into EAP students’ perceptions of online annotation of genres across disciplines. J. Engl. Acad. Purp. 53, 101021. https://doi.org/10.1016/j.jeap.2021.101021.
Dong, J., Zhao, Y., Buckingham, L., 2022. Charting the landscape of data-driven learning using a bibliometric analysis. ReCALL 1–17. https://doi.org/10.1017/S0958344022000222.
Koehler, M., Mishra, P., 2009. What is technological pedagogical content knowledge? Contemp. Issues Technol. Teacher Educ. 9 (1), 60–70.
Kohnke, L., Moorhouse, B.L., Zou, D., 2023. ChatGPT for language teaching and learning. RELC J. https://doi.org/10.1177/00336882231162868, 00336882231162868.
Kotamjani, S.S., Razavi, O.F., Hussin, H., 2017. Online Corpus Tools in Scholarly Writing: a Case of EFL Postgraduate Student. English Lang. Teach. 10 (9), 61–68. https://doi.org/10.5539/elt.v10n9p61.
Lee, H., Warschauer, M., Lee, J.H., 2019. The effects of corpus use on second language vocabulary learning: a multilevel meta-analysis. Appl. Linguist. 40 (5), 721–753. https://doi.org/10.1093/applin/amy012.
Ma, Q., Yuan, R., Cheung, L.M.E., Yang, J., 2022. Teacher paths for developing corpus-based language pedagogy: a case study. Comput. Assist. Lang. Learn. 1–32. https://doi.org/10.1080/09588221.2022.2040537.
Meunier, F., 2019. A case for constructive alignment in DDL: rethinking outcomes, practices and assessment in (data-driven) language learning. In: Crosthwaite, P. (Ed.), Data-driven Learning for the Next Generation: Corpora and DDL for Pre-Tertiary Learners. Taylor & Francis, Routledge, pp. 13–31. https://doi.org/10.4324/9780429425899-2.
OpenAI, 2023. ChatGPT (19/4/23) [Large language model].
O’Keeffe, A., 2021. Data-driven learning: a call for a broader research gaze. Lang. Teach. 54 (2), 259–272. https://doi.org/10.1017/S0261444820000245.
Pérez-Paredes, P., Alcaraz-Calero, J.M., 2009. Developing annotation solutions for online data driven learning. ReCALL 21 (1), 55–75. https://doi.org/10.1017/S0958344009000093.
Pérez-Paredes, P., Sánchez-Tornel, M., Alcaraz Calero, J.M., Jiménez, P.A., 2011. Tracking learners’ actual uses of corpora: guided vs non-guided corpus consultation. Comput. Assist. Lang. Learn. 24 (3), 233–253. https://doi.org/10.1080/09588221.2010.539978.
Shen, Y., Heacock, L., Elias, J., Hentel, K.D., Reig, B., Shih, G., Moy, L., 2023. ChatGPT and other large language models are double-edged swords. Radiology 307 (2), e230163.
Sinclair, S., Rockwell, G., 2016. Voyant Tools. http://voyant-tools.org/.
Sun, Y.C., Wang, L.Y., 2003. Concordancers in the EFL classroom: cognitive approaches and collocation difficulty. Comput. Assist. Lang. Learn. 16 (1), 83–94. https://doi.org/10.1076/call.16.1.83.15528.
