Corpus Typology
3.1 Introduction
For the past fifty years or so, corpus linguistics has been recognized as one of
the mainstays of linguistics for various reasons. At various points of time,
scholars have discussed the methods of generating corpora and the techniques
of processing them and using information from them in linguistic works,
ranging from mainstream linguistics to applied linguistics and language
technology. However, these discussions have often ignored an
important aspect, namely the classification of corpora, although they have
sporadically attempted to discuss the form, formation and function of corpora of
various types. People have avoided this aspect because it is difficult
to classify corpora within a single frame of type. Any scheme that attempts
to put various corpora within a single frame is likely to turn out to be unscientific
and unreliable.
Electronic corpora are designed to be used in various linguistic works.
Sometimes, these are used for general linguistics research and application;
at other times these are utilized for works of language technology and
computational linguistics. The general assumption is that a corpus developed
for specific types of work is not so useful for works of other types. Such an
assumption is fallacious in the sense that a corpus developed for a specific
kind of work can be fruitfully used for similar works. Therefore, it is sensible
to assume that the function and utilization of a corpus is multidimensional
and multidirectional. For instance, a corpus developed for compiling a
dictionary may be equally used for writing grammar books, developing
language teaching materials and writing reference books. For such
reasons, people are often hesitant to classify corpora into any scheme.
from others in form, content, feature, and function. Taking these factors into
to the factors related to their form, content and function. The reasons for
corpora they think are useful for their works. For this they do not need
• If corpora are not classified, users will have to refer to all corpus types
before selecting the required one. This consumes much time, energy
types of corpus.
• If corpora are intermixed, any comparative study becomes highly
be considered a corpus
• Separation of corpora of ordinary language use from the corpora
Both these factors maintain a perfect balance. If the criteria proposed below are
not meet these conditions. Also there are some corpora that record special
and artificial language samples. Besides, the branch of corpus linguistics
Electronic corpora are of various types with regard to texts, modes of data
sampling, methods of generation, manners of processing, nature of utilization,
• One corpus may contain samples of written text, while another may
contain samples of spoken texts.
This implies that there are numerous needs and factors that control the
content, type and use of a corpus. It also signifies that the kind of texts
included as well as the combination of various text types may vary among
• Genre of text
• Nature of data
• Type of text
• Purpose of design
• Nature of application
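The five criteria listed above can be read as independent dimensions along which any one corpus is described. The following is a minimal sketch of that idea in Python, not a scheme proposed in this chapter. The class name, field names, helper function and the example values are hypothetical; only the three 'genre of text' labels (written, speech, spoken) are taken from the classification used here.

```python
from dataclasses import dataclass

# The three genre labels come from the text; everything else is illustrative.
GENRES = {"written", "speech", "spoken"}


@dataclass(frozen=True)
class CorpusProfile:
    """One corpus described along the five classification criteria."""
    name: str
    genre_of_text: str
    nature_of_data: str
    type_of_text: str
    purpose_of_design: str
    nature_of_application: str

    def __post_init__(self):
        # Reject genre labels outside the three-way split used in the text.
        if self.genre_of_text not in GENRES:
            raise ValueError(f"unknown genre of text: {self.genre_of_text!r}")


def group_by(profiles, criterion):
    """Group corpus names by the value they take on one criterion."""
    groups = {}
    for profile in profiles:
        groups.setdefault(getattr(profile, criterion), []).append(profile.name)
    return groups


# Hypothetical corpora with made-up labels, for illustration only.
profiles = [
    CorpusProfile("corpus_a", "written", "general", "monolingual",
                  "dictionary making", "lexicography"),
    CorpusProfile("corpus_b", "speech", "special", "monolingual",
                  "speech analysis", "language technology"),
]
by_genre = group_by(profiles, "genre_of_text")
```

Because each criterion is a separate field, the same corpus can be grouped one way by genre and another way by purpose, which is the sense in which a single frame of classification is hard to impose.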
In the following sections, the first two types of corpus are discussed briefly
Under the 'genre of text', corpora are classified broadly into written corpus,
speech corpus, and spoken corpus. Each one is discussed in separate
subsections below.
A written corpus, by virtue of its genre, contains only language data collected
from various written, printed, published and electronic sources. In case of
printed materials, it collects texts from published books, papers, journals,
magazines, periodicals, notices, circulars, documents, reports, manifestos,
advertisements, bulletins, placards, festoons, etc. In case of non-published
materials, it collects texts from personal letters, personal diaries, written family
records, old manuscripts, ancient legal deeds and wills, etc.
Thus, samples of various text types obtained from both published and
non-published sectors constitute the central body of a written corpus. Some
well-known examples of written corpora are the Bank of English, American
National Corpus, Brown Corpus, LOB Corpus, Australian Corpus of English,
Wellington Corpus of Written New Zealand English, Kolhapur Corpus of Indian
English, FLOB Corpus, British National Corpus, Bank of Swedish, MIT Corpus of
Indian Languages and others. These are made with text samples derived only
from written sources.
In the early years of corpus generation, there was very little scope
for including text samples from electronic sources in a corpus because such
text samples were not easily available. However, the situation has greatly
altered within the past three to four decades. Now, we can find a huge
amount of written text from various electronic sources to be included in a
written corpus. There are many websites from which we can collect data for
generating a corpus of written texts. Moreover, there are electronic journals
and newsletters of various types from which text samples are collected for
generating a written corpus.[1] The following figure (Figure 3.1) presents a
sample of a written text from the Kolhapur Corpus of Indian English (KCIE):
**[txt. a01**]
0010A01 **«*3Politics of Job Reservations*'!)*'^ S^lbegin leader comment, begin
0020A01 underscoring^*"] *"3AThe Bihar Government did not foresee or forestall
0030A01 the complications that_ followed its decision to_ reserve jobs for
0031A01 backward
A
0040A01 classes. The present violence in the State has raised the controversy
0050A01 over the criterion for backwardness— whether it should be caste or
0060A01 economic conditions.^O^fend underscoring, end leader comment*'*!
0070A01 $A WHY has the Bihar Government's decision to_ reserve jobs for backward
A
0080A01 classes led to a violent outburst? It is not such an original idea
0090A01 that it should have triggered demonstrations and riots or attracted all-India
the status of texts used in scripts and plays with relation to their inclusion
in a written corpus. Should we include samples from these sources into a
written corpus? It is really a difficult question because it is almost impossible
notes dictated in class or office, etc., although meant for listening, are
actually composed following the general norms of writing. Moreover, such
texts, although delivered in spoken form, do not have the features of normal
dialogue or conversation. A public speech such as 'Dear ladies and gentlemen! It
is a great delight to inform you that the government has decided to implement a mass
literacy programme for the benefit of the nation' does not contain the features
typical of impromptu speech. It is quite rational to include it in a written
corpus because it is generated first in written form. Written text may be read
out, but its expression changes due to a change in medium. Therefore, it is
we may argue that these texts should belong to a speech corpus only. The
general argument is that these texts are composed not for reading but for
speaking. The scripts composed for films and plays are made in such a way
that they are suitable for the characters to communicate verbally. Similarly,
lectures composed for public oration are made in such a way that they are
suitable for open verbal deliberation before an audience. These texts should
not be put within written texts. A similar argument also stands for the
notes dictated and delivered in class. However, before we take any decision
regarding the actual status of such texts, we need to analyse the materials
from various angles, with serious consideration of the linguistic and non-
texts written in Indian languages are rarely found. In fact, due to specific
orthographic problems related to Indian scripts, written texts in Indian
languages are very difficult to procure from websites. Because of this
speech corpus, it is always kept in mind that samples are natural, informal,
conversational, and impromptu in nature. By its default value, a speech
corpus is entitled to contain samples of private and personal talks, formal
and informal discussions, debates, instant talks, impromptu analysis,
casual speech, face-to-face as well as telephonic conversations, dialogues,
monologues, online dictations, instant addressing, etc. There is no scope for
any external involvement because the aim of a speech corpus is to display the
basic characteristics of a speech act in the most faithful manner (Chafe 1982).
A speech corpus, for example, may contain text samples from various
types of speech events occurring in regular normal life and living, such as
common talks; telephonic exchanges; casual speeches; proceedings of courts;
this, we will not only fail to account for the peculiarities observed in different
speech patterns of people but also deprive a large number of common speakers
of the chance to have their speech data represented in a corpus.
Therefore, spoken text samples should be taken from all possible domains
of spoken interactions to represent people coming from all walks of life,
irrespective of profession, education, class, ethnicity, age and sex. It should
contain equal amounts of data spoken by children, by students of
schools and colleges, by workers of offices and courts, and by people of various
other professions. Similarly, text samples should come from interrogations
conducted at police stations, debates held in parliaments, quarrels taking
place in markets and on roads, etc. In sum, a language as spoken by all varieties of
people should have proportional representation in the corpus. Only then will the
corpus reflect the internal form and nature of speech through maximum representation.
In our view, both formal and informal speech data should be included
in a speech corpus to make it maximally representative. While formal speech
will include texts from radio and television newscasts, public announcements,
audio advertisements, dialogic interactions, interviews, verbal surveys,
pre-recorded dialogues and scripts of films and plays, informal data will include
samples of texts obtained from various verbal interactions casually enacted
in the regular course of life. Thus, equal representation of spoken texts will
make a speech corpus balanced, non-skewed, and properly representative.
A speech corpus is made in such a way that it balances
demographic and contextual varieties. While demographic variety accounts
for the age, gender, profession, birthplace, education, economic condition,
ethnicity, etc., of speakers, contextual variety accounts for all types of variation
observed in speech events that take place at different times and places, with
different agents and in different events. A speech corpus made in this way faithfully
represents the actual nature and form of a speech event. Thus, it builds a basis for
repetitiveness and diversion, two important features of speech that provide reliable
information and clues for proper analysis and interpretation of discourse.
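One way to make the idea of balance between demographic and contextual varieties concrete is to cross-tabulate the samples and flag combinations that are thin or missing. The sketch below is only an illustration under stated assumptions: each sample is a dictionary of metadata, the key names ("age_group", "setting") are hypothetical, and an even split across cells is used as the baseline with an arbitrary tolerance of 0.5. None of these choices is prescribed by the text.

```python
from collections import Counter


def cell_shares(samples, demographic_key, context_key):
    """Share of samples falling in each (demographic, context) cell."""
    counts = Counter((s[demographic_key], s[context_key]) for s in samples)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def underrepresented(samples, demographic_key, context_key, tolerance=0.5):
    """Cells whose share is below tolerance * (share under an even split).

    Cells with no samples at all are reported as well.
    """
    if not samples:
        return []
    demos = sorted({s[demographic_key] for s in samples})
    contexts = sorted({s[context_key] for s in samples})
    even_share = 1 / (len(demos) * len(contexts))
    shares = cell_shares(samples, demographic_key, context_key)
    return [
        (d, c)
        for d in demos
        for c in contexts
        if shares.get((d, c), 0.0) < tolerance * even_share
    ]


# Hypothetical metadata: no recordings of children in the market setting.
samples = (
    [{"age_group": "child", "setting": "home"}] * 4
    + [{"age_group": "adult", "setting": "home"}] * 4
    + [{"age_group": "adult", "setting": "market"}] * 4
)
gaps = underrepresented(samples, "age_group", "setting")
```

A compiler could run such a check after each round of recording and direct further collection towards the flagged cells. A proportional baseline drawn from population figures could replace the even split if that is the notion of balance being aimed at.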
In the following figure (Figure 3.2), a small sample of the Corpus of London
Teenagers (COLT) is reproduced to show how a speech corpus
is designed and developed.
54 Corpus Linguistics
Sharon: Oh, don't start on me, you know, saying I can't be there on Tuesday! (...)
Susie: <laughing> I said nothing. [I'm talking about me</>!]
Sharon: [<nv> laugh </nv>] Don't start because I'll... I'll smash your face in! (...)
Sharon: I say, I've got friends.
Susie: <nv> laugh </nv> (...)
Sharon: And I'm gonna make them come over, and I'm gonna make them beat
the shit out of you! (...)
Susie: Oh, shut up!
a speech corpus for Standard Colloquial Bengali (SCB). Let us also hope that
it will preserve all the salient features required for a speech corpus to be
In this case, the method employed for other speeches may be useful (Hary
2003), if necessary. Normally, the methods that are used for collecting spoken
(IPA)
Stage 5: Annotating texts with phonetic, orthographic, grammatical,
important variety of all because it has the closest representation of the core
of speech in a reliable and lively way that no other variety can probably do.
The controversies related to the selection of spoken text samples need
texts for either oral deliberation or silent reading or for both. The truth is,
informal and impromptu speeches are the most difficult and expensive
things to acquire and are highly complicated to classify and manage. Also,
speech texts are preserved in written form without changing the texts at the
time of transcription.
Spoken corpora are annotated with phonetic transcriptions. If spoken
then a single text exists in two versions to generate a special kind of parallel
corpora exist, they are a useful addition to the class of annotated corpora
for linguists who lack technological expertise for analysing recorded speech
(McEnery and Wilson 1996: 26). In the figure below (Figure 3.3), a sample of
A
10 1 1 B 11 ((of Spanish)). graph \ology# /
A
20 1 1 A 11 w=ell#. /
A
30 1 1 A 11 ((if)) did y/ou _set _that# - /
A
40 1 1 B 11 well !I\oe and _I# /
A
50 1 1 B 11 set it betw \ een _us# /
A
60 1 1 B 11 actually! Joe 'set the :p\aper# /
80 1 1 A 11 *Aw=eIl# /
(ICAME: http://www.hit.uib.no/icame/lolu-eks.html)
corpora, the works related to spoken corpora generation and annotation have
not become simplified. Spoken texts involve many aspects that need to be
taken care of at the time of text collection and annotation. The transient nature
corpora have brought speech technologists and linguists onto one platform.
Ideally, a spoken corpus addresses the needs of these people, although there
are conflicts of interest. For example, the quality of recording of spontaneous
week by...
reasons that children acquire speech first and illiterate people use language
without having the skill to write and read. Thus, primacy of speech
over writing clearly shows that speech is the basic medium of linguistic
expression, without regard to how language evolved and how children
acquire language.
However, in some recent works, it is argued that both writing and speech
are different but equal manifestations of language (Crystal 1997). Writing is
neither an 'other state' of speech, nor a degraded manifestation of speech. The
strategies and processes of production and comprehension of written texts
are autonomous from those used for speech. In other words, writing has an
independent identity and functional role in language, just as speech does (Sasaki
2003: 91). Yet, we must accept the fact that speech is the most natural and
spontaneous state of language, which is used when two persons are within
hearing distance of each other's voice and when language is mutually
understandable to both the participants.
The differences between written and spoken texts are observed in the fact
that while written texts are processed as graphically codified, completed and
monologue-like products, spoken texts are normally transmitted as sound
waves. Spoken texts are developed through time and usually in the form
of a series of utterances that make a dialogic interaction characterized by
formal and organizational mechanisms of communication with one or more
participants. In other words, spoken texts, because these are developed as
parts of interactions, cannot be analysed adequately without proper reference
to the situations. These differences between spoken and written texts lead us
to develop special systems for transcription of spoken texts.
The unique features that constitute a spoken text are diversion and
interaction, which can only be made visible and accessible if an instance
of verbal interaction is not removed from its actual setting or contextual
background during the task of representing the spoken text in written form.
That means, to devise an appropriate written form for spoken text, it is
necessary to develop a whole range of theoretical categories to adequately
describe the interactional nature of spoken texts.[5]
The fundamental differences between spoken and written texts have
naturally and inevitably led to methodological and theoretical differences
between conversation analysis and grammatical description of language in
the following ways (Uhmann 2001: 377):
• Spoken language in verbal interaction follows its own rules, and thus
there is a linguistic system outside sentential grammar.
• The analysis inventory needed for theoretical description of this
system is not separate from the inventory of sentential grammar.
• For the speaker, this means that applying the rules referred to in (b)
While we describe the texture of a spoken text, we need to describe the patterns
of lexical relations, conjunction and reference because all these patterns are
moves.
We can characterize the basic contrasts between spoken and written texts
by following register variables, which have linguistic consequences both
in speech and writing.[6] The situations in which we use spoken texts are
but no visual contact between the speakers involved) with interactants. Here
we use language in a typical way to achieve some ongoing social actions (for
example, get work done, get consent over a point, etc.). In such situations
Corpus Typology: Part One 59
Because spoken situations are often 'informal and everyday', we are normally
any kind of aural or visual contact with our intended audiences. The language
we use deals with some issues related to our research. Here we never try to
write a commentary on our actions, feelings and thoughts we experienced
This implies that language in written form calls for rehearsal at various
stages of its final culmination: we make drafts, edit them, rewrite them,
correct them and finally re-copy our texts. The truth underlying it is that for
most of us, writing is not an easy and casual activity. We need peaceful and
Spoken text                              Written text
+ interactive                            - interactive
  (two or more participants)               (1 participant)
+ face-to-face                           - face-to-face
  (in the same place and time)             (on your own)
+ spontaneous                            - spontaneous
  (without rehearsing what is              (done with planning, drafting
  going to be said)                        and rewriting)
+ casual                                 - casual
  (informal and everyday)                  (formal and special occasions)
Table 3.1: Spoken Text vs. Written Text (Eggins 1994: 55)
There are some obvious implications of the contrast between spoken and
written texts. The texts used in spoken situations are typically organized
tends to accompany action, the structure of talk is dynamic, with one sentence
leading to another. Written text, on the other hand, is produced as a monologic
block (Eggins 1994: 57). It needs to stand more or less by itself. It needs to
and an end, with a generic structure determined before the text is complete.
Table 3.2 summarizes the differences that correspond to the two polar ends of
critically compare a piece of spoken text with a piece of written text. We find
everyday words, including slang, provincialism and dialects, which are rarely
used in written text. Spoken text often includes unique structures of sentences
between the spoken and the written texts are not new to us. It is, however,
important to appreciate that the differences are not accidental. They are
Table 3.2: Spoken Text vs. Written Text (Eggins 1994: 56)
There are also features that are highly sensitive to each type of text. They
include factors such as degree of grammatical complexity, patterns of lexical
since spoken and written texts are characteristically different from each other
with regard to their form, function and composition, corpora developed from
these two different types of text should not be merged together to produce a
general corpus. Rather, each type of text should be kept in a separate corpus
so that its future use in linguistic studies and application is more useful and
trouble free.
From the perspective of the nature of data, corpora are classified into several
broad types such as general corpus, special corpus, controlled language
corpus, etc. Each type of corpus is discussed below with reference to text
samples used for building it.
disciplines, genres, subject fields and registers. With regard to form and
utility, a general corpus is infinite in number of text samples. That means,
the number of text types and words included in this corpus is really vast and
open. However, it has little scope to grow with time because appending a
general corpus with new text samples is hardly permitted. A general corpus
is large in size, rich in variety, wide in text representation and reliable with
include all kinds of linguistic data and information in it. Therefore, whenever
general corpus.
The minimal criteria for selecting texts for a general corpus include the
texts (Sinclair 1991: 20). Also, there should be markers for identification of
marks to distinguish between formal and informal texts and the factors that
control the use of texts based on age, gender, education, profession, origin
National Corpus, Swedish National Corpus, etc., are considered faithful examples
of general corpus.
A special corpus is designed from the texts already stored in a general corpus.
of language, dialect and subject, with emphasis on special aspects and properties
of the language that investigators want to explore. That means a special corpus
its functional relevance and referential purpose. Because of its unique nature
of composition, it usually fails to contribute towards the description of the
one or the other variety of a normal and authentic language. Corpora made
from the language used by children, non-native speakers, dialect groups and
people belonging to specialized areas of profession and works (for example,
auction, medicine, music, play, cooking, law, the underworld, gambling, etc.)
advantage of this is that text samples are selected in such a way that particular
phenomena we are looking for in the language variety occur more frequently
in it than in a general corpus. A special corpus, made and enriched in this
manner, is smaller in size than a general corpus that contains small samples
numbers of native speakers and the varieties that, for one reason or another,
deviate from the central core of a language. Therefore, a special corpus may fail
language variety is retained at the helm of the corpus with a label without
transferring the data into a general category. CHILDES Database is a unique
example of a special corpus as is Corpus of the Times of India, 2000.[8]
In essence, a special corpus is made with texts that do not overlap much
with the central pool of a language. However, to be clearly 'within the frame
vocabulary of a total of 850 words, from back in the early 1970s, as a way
lexical items that may be used throughout the documents. Although specific
is used that checks for adherence to vocabulary items rather than the overall
grammatical structure of the language.
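A check for adherence to a list of approved vocabulary items, as opposed to a full grammatical analysis, can be sketched in a few lines. This is only an illustration and does not reproduce any actual controlled-language tool. The function names, the approved word list, and the crude sentence and word splitting are all assumptions made for the example.

```python
import re


def out_of_vocabulary(sentence, approved):
    """Return the words in a sentence that are not in the approved list."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in approved]


def check_document(text, approved):
    """Check a text one sentence at a time and collect the flagged words.

    Returns a list of (sentence_number, flagged_words) pairs. Only the
    vocabulary is checked; grammatical structure is not analysed.
    """
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    report = []
    for number, sentence in enumerate(sentences, start=1):
        flagged = out_of_vocabulary(sentence, approved)
        if flagged:
            report.append((number, flagged))
    return report


# Hypothetical approved vocabulary and draft text.
approved = {"open", "close", "the", "valve"}
report = check_document("Open the valve. Shut the valve!", approved)
```

A real checker would also need lemmatization, so that inflected forms of approved words are not flagged, and a way to handle approved multi-word terms. Both are left out here for brevity.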
terms selected from a total of approximately one million terms). The database
English that may be mapped into about ten other languages. As indicated in
a recent article on the subject (Kamprath et al. 1998), new technical terms are
constantly added to the database for approval and then submitted to human
goes through each sentence one at a time and notifies the author of potential
spelling mistakes, ambiguity pitfalls for translation, etc. Researches are
assist authors who are writing technical texts. This works in a similar fashion
decades and has taken form in different applications for different purposes.
Scholars have taken the general concept to customize it within their own
controlled language and training principles that will allow for cross-language
the linguistic spectrum, while a reference corpus lies at the other extreme end.
are. Thus, a sublanguage corpus is defined with the help of its internal and
internal criteria actually match in practice. The study of language used for
Under this scheme, corpora consisting of sublanguage materials will fall under
English) is a broad set that may contain all conceivable utterances, including
slang, poetry and what we call 'standard' language. On the other hand, a
this definition, then weather reports, stock market reports, computer manuals
People working in the area of language processing realize that they need access
of addition or change in any way because that may disturb the balance of
its composition and distort the actual image of the data required for special
research purposes. Because the number of samples is small and the size is
constant, they usually do not qualify as general texts. Zurich Corpus of English
Newspapers is a fine example of a sample corpus.
A special variety of a sample corpus is the literary corpus, which may
Ulysses, etc.)
and varieties of text samples from all possible sources of both written and
spoken language. It has an exclusive criterion of constant augmentation
to trace novel changes creeping into the language. Thus, it enables rare tokens
to become large in number and allows old and common tokens to be stored in
an archive. Gradually, over time, the volume of a monitor corpus is enlarged
corpus may change. In that case, the actual rate of flow of data may need to
be readjusted to address the trade of the time. Bank of English, Bank of Swedish,
multimodal corpora
corpora. However, since this is a new area of corpus research, we are yet to get
of corpora for the past fifty years helps us classify corpora in a tentative
way with a chance for future modification. It is noted that a corpus may be
what should belong to a particular corpus and how the selection criteria
and investigation.
Endnotes
[1] Some people are interested in including text samples from personal e-mail
messages in a written corpus. However, people are highly sensitive in this regard.
According to their arguments, texts composed in personal mails should not be
included in a general written corpus because samples derived from these sources
possess specific criteria that are hardly observed in texts composed in imaginative
and informative writings. E-mail message texts are inherently skewed and deviate
greatly from the actual form and texture of general written texts. Therefore, we
identify a special category, namely 'E-mail Corpus', in which such texts should be
preserved for special types of investigation and analysis.
[2] Technically, however, a 'speech corpus' refers to texts that are available in oral
form. That is, speakers involved in a speech corpus behave in oral mode. An
important type of a speech corpus is an 'experimental corpus', which is assembled
for studying fine details of spoken language. Such a corpus is small in size and is
produced by asking informants to read out passages in an anechoic chamber.
[3] A speech corpus is a collection of spoken data typically recorded in a specific
setting for a specific purpose by specific users (for example, the speech corpus
Speech DatCar is designed for developing an interactive system for direct
consumer application). Usually such a corpus lacks the richness of linguistic
features normally found in spoken language.
[4] The method and the standard proposed by Greenbaum and Quirk (1990) while
developing the London-Lund Speech Corpus of English were greatly revised and
modified at the time of developing the Swedish Speech Corpus, Chinese Speech Corpus,
Speech Corpus of American English, and Hebrew Speech Corpus. As a result, we have
no definite guideline to follow for collecting speech data. However, the present
trend of corpus research implies that linguists have the liberty to select the type
and amount of speech data independently taking into consideration the need of
specific research and application potential of the work.
[5] It is here that the discipline 'Conversation Analysis' establishes itself as an
independent area of linguistic research. By using central concepts such as 'turn',
'adjacency pair', 'recipient design', 'reflexivity', etc., the field has developed a
necessary theoretical basis and methodological procedures for conversation
analysis and interpretation.
[6] According to the established norms of speech analysis, there are three register
variables: 'field' (what the language is being used to talk about), 'mode' (the
role language is playing in an interaction) and 'tenor' (the role of relationships
between interactants) (Halliday 1987).
[7] For instance, a corpus that contains text samples from normal dialogues and
conversations is not a special corpus because it partially reflects on the regular
spoken form of a language. Similarly, a corpus that represents texts from
newspapers (even of a single one) is not a special corpus because its content is
actually a small representation of the normal variety of a language.
[8] A special category of a special corpus is a 'literary corpus', of which there are many
kinds. Biblical and literary scholarship began the discipline of corpus linguistics
long ago, and there is a lot of expertise available in literary circles on such things
as establishing a canon of an author's works. Classification criteria include 'author'
(for example, Shakespeare, James Joyce, Rabindranath Tagore, etc.), 'genre' (for
example, odes, short stories, limericks, etc.), 'period' (for example, fourteenth
century, twentieth century, etc.), 'group' (for example, Augustan prose, Victorian
novels, etc.) and 'theme' (for example, revolutionary writings, renaissance prose,
post-modern novels, etc.).
[9] This is similar to the work of Ogden's Basic English in the 1930s. More information
is available at marshallnet.com/~manor/basiceng/ramble.html.
[10] A distinction is made between the texts 'destined for translation' (i.e. decided
before writing starts that the original text will be translated) and texts that are
'chosen for translation' after writing the source text. When an organization
decides that all manuals that are produced are destined to be translated from
their very inception, it becomes easier to persuade management and technical
authoring staff to implement writing principles that will improve translatability
of the texts. If a text is meant to be produced and read only in the source language
(for example, Times Report, etc.), and someone decides to take such a document
and feed it through a machine translation system, the resulting text will most
likely be quite unsatisfactory because the text was not written with the intent that
it would be translated, especially not by a machine.
[11] Zellig Harris was probably the first to apply the term 'sublanguage' to natural
language, using algebra as an underlying formalism. He defined a sublanguage as a
set of sentences of a language that is closed under a given set of linguistic
operations. For example, the conjunction of two sentences yields
another sentence.
[12] A 'subcorpus' compiles text samples selected and ordered according to a set of
linguistic criteria defined beforehand to characterize a particular
linguistic variety. Components of a subcorpus, to an extent, illustrate a particular
type of a language.
[13] Also, the concept of 'sublanguage' needs to be distinguished from that of 'artificial
language' or 'reduced language'. The latter two terms are designed intentionally,
whereas a sublanguage evolves naturally (although at the terminology level,
there may be some deliberate acts of creation). It is argued that sublanguages and
controlled languages are not mutually exclusive (Kittredge 2003).
[14] Recently, European Networks of Excellence has launched integrated projects (for
example, HUMAINE, SIMILAR, CHIE, AMI, etc.), which are solely dedicated to
multimodal human communication. This testifies to the growing interest in this area
and addresses the general need for data on multimodal behaviours.