Corpus Typology

Uploaded by

HUN Teng

3

Corpus Typology: Part One

3.1 Introduction

For the past fifty years or so, corpus linguistics has been attested as one of
the mainstays of linguistics for various reasons. At various points of time,
scholars have discussed the methods of generating corpora and techniques
of processing them and using information from them in linguistic works—
starting from mainstream linguistics to applied linguistics and language
technology. However, in general, these discussions have often ignored an
important aspect related to the classification of corpora, although they
sporadically attempted to discuss form, formation and function of corpora of
various types. People have avoided this aspect because it is difficult to
classify corpora within a single frame of types. Any scheme that attempts
to put various corpora within a single frame will turn out to be unscientific
and unreliable.
Electronic corpora are designed to be used in various linguistic works.
Sometimes, these are used for general linguistics research and application;
at other times these are utilized for works of language technology and
computational linguistics. The general assumption is that a corpus developed
for specific types of work is not so useful for works of other types. Such an
assumption is fallacious in the sense that a corpus developed for a specific
kind of work can be fruitfully used for similar works. Therefore, it is sensible
to assume that the function and utilization of a corpus is multidimensional
and multidirectional. For instance, a corpus developed for compiling a
dictionary may be equally used for writing grammar books, developing
language teaching materials and writing reference books. Due to such
reasons, people are often hesitant to classify corpora into any scheme.

3.2 Why Classify Corpora?

Each language corpus is developed following some principles of text

collection, text representation and application. These make a corpus distinct
48 Corpus Linguistics

from others in form, content, feature, and function. Taking these factors into

consideration, we propose to classify corpora into various types according

to the factors related to their form, content and function. Classifying

corpora in this way provides some advantages that are not possible to achieve in any

other way:

• Systematic classification of corpora helps us to identify the fields where

they are suitable for use.

• Prior classification of corpora guides linguists to select particular

corpora they think are useful for their works. For this they do not need

to grope in the dark.

• Dictionary makers wanting to compile a general dictionary need not

be in a dilemma with regard to the selection of corpora. They can select

both general as well as special corpora if they find prior information

about the type of corpus they need for their works.

• If general corpora can satisfy application-specific requirements, people

will try to procure only those corpora.

• Special corpora may satisfy the need of extracting relevant lexical

information necessary for jargon and technical terms.

• Classification of corpora helps investigators to retrieve necessary

linguistic information from specific corpora. Investigators wanting to

study normal speech patterns of native people will access a general

speech corpus rather than other corpora.

• If corpora are not classified, users will have to refer to all corpus types
before selecting the required one. This consumes much time, energy

and labour due to the internal complexities involved.

• Systematic classification of corpora enhances speed and accuracy of

comparative studies across corpus types. For instance, if speech and

text corpora are kept separate, any kind of comparative study between

the two becomes more robust and effective.

• Systematic classification of corpora makes us comfortable in

comparing data stored in each corpus type. We can systematically

observe the traits of similarities and differences between the two

types of corpus.
• If corpora are intermixed, any comparative study becomes highly

complicated, and observations become defective.

Taking into consideration the advantages of various types, we present here

a tentative scheme of classification of corpora. In this context, the most

important factors are:

• Minimum conditions required for any collection of language data to

be considered a corpus
• Separation of corpora of ordinary language use from the corpora

recording specialized language use


Both these factors maintain a perfect balance. If the criteria proposed below are

considered adequate, we assume that considerable progress is made because

there are large collections of language databases called corpora that do

not meet these conditions. Also there are some corpora that record special
and artificial language samples. Besides, the branch of corpus linguistics

is developing rapidly. As a result of this, regular norms and assumptions

are revised in quick succession. Therefore, classification of corpora is made
maximally flexible to meet such unstable conditions.

Electronic corpora are of various types with regard to texts, modes of data
sampling, methods of generation, manners of processing, nature of utilization,

etc. For instance:

• One corpus may contain samples of written text, while another may
contain samples of spoken texts.

• One corpus may preserve text samples from present-day language

use, while another may preserve samples compiled from age-old

texts and ancient documents.

• One corpus may be monolingual by way of collecting data from a

single language, another may be bilingual by way of including texts

from two languages and a third corpus may be multilingual by way of
including samples from more than two languages.

• Texts included in a corpus may be collected from a particular source,

from a whole range of sources belonging to a particular field or across

the fields and subjects of a language.

• Text samples may be obtained from newspapers, magazines, journals,

periodicals and similar other forms.

• Texts may be compiled from extracts of impromptu conversations,

spontaneous dialogues, made-up monologues, interactive discourses

of varying lengths, etc.

This implies that there are numerous needs and factors that control the
content, type and use of a corpus. It also signifies that the kind of texts
included as well as the combination of various text types may vary among

corpora. Taking all these issues into consideration, we broadly classify

corpora according to the following criteria:

• Genre of text

• Nature of data
• Type of text

• Purpose of design
• Nature of application

In the following sections, the first two types of corpus are discussed briefly

with reference to the corpora developed so far in various languages of the

world. In the next chapter (Chapter 4), the remaining three types are discussed

with adequate examples and explanations.
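
The five criteria listed above can be pictured as the fields of a catalogue entry. The following is only a minimal sketch: the class and function names are hypothetical, the three genre values are those named in Section 3.3, and the remaining criteria are left as free text because their values are discussed elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Genre(Enum):
    """Genre of text: the three classes named in Section 3.3."""
    WRITTEN = "written"
    SPEECH = "speech"
    SPOKEN = "spoken"


@dataclass(frozen=True)
class CorpusRecord:
    """Hypothetical catalogue entry with one field per classification criterion."""
    name: str
    genre: Genre
    nature_of_data: str = "unspecified"         # placeholder, not a real value
    type_of_text: str = "unspecified"           # placeholder, not a real value
    purpose_of_design: str = "unspecified"      # placeholder, not a real value
    nature_of_application: str = "unspecified"  # placeholder, not a real value


def by_genre(records, genre):
    """Return the names of all catalogued corpora of the given genre."""
    return [r.name for r in records if r.genre is genre]


catalogue = [
    CorpusRecord("Brown Corpus", Genre.WRITTEN),
    CorpusRecord("Korean Speech Corpus", Genre.SPEECH),
    CorpusRecord("Lancaster/IBM Spoken English Corpus", Genre.SPOKEN),
]
print(by_genre(catalogue, Genre.WRITTEN))
```

A user who needs only written data can then filter the catalogue by genre instead of inspecting every corpus in turn.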


3.3 Genre of Text

Under the 'genre of text', corpora are classified broadly into written corpus,
speech corpus, and spoken corpus. Each one is discussed in separate

subsections below.

3.3.1 Written Corpus

A written corpus, by virtue of its genre, contains only language data collected
from various written, printed, published and electronic sources. In case of
printed materials, it collects texts from published books, papers, journals,
magazines, periodicals, notices, circulars, documents, reports, manifestos,
advertisements, bulletins, placards, festoons, etc. In case of non-published
materials, it collects texts from personal letters, personal diaries, written family
records, old manuscripts, ancient legal deeds and wills, etc.
Thus, samples of various text types obtained from both published and
non-published sectors constitute the central body of a written corpus. Some
very well-known examples of written corpora are Bank of English, American
National Corpus, Brown Corpus, LOB Corpus, Australian Corpus of English,
Wellington Corpus of Written New Zealand English, Kolhapur Corpus of Indian
English, FLOB Corpus, British National Corpus, Bank of Swedish, MIT Corpus of
Indian Languages and others. These are made with text samples derived only
from written sources.
In the early years of corpus generation, there was little scope
for including text samples from electronic sources in a corpus because such
text samples were not easily available. However, the situation has greatly
altered within the past three to four decades. Now, we can find a huge
amount of written texts from various electronic sources to be included in a
written corpus. There are many Web sites from where we can collect data for
generating a corpus of written texts. Moreover, there are electronic journals
and newsletters of various types from where text samples are collected for
generating a written corpus.[1] The following figure (Figure 3.1) presents a
sample of a written text from the Kolhapur Corpus of Indian English (KCIE):

**[txt. a01**]
0010A01 **«*3Politics of Job Reservations*'!)*'^ S^lbegin leader comment, begin
0020A01 underscoring^*"] *"3AThe Bihar Government did not foresee or forestall
0030A01 the complications that_ followed its decision to_ reserve jobs for
0031A01 backward
0040A01 classes. The present violence in the State has raised the controversy
0050A01 over the criterion for backwardness— whether it should be caste or
0060A01 economic conditions.^O^fend underscoring, end leader comment*'*!
0070A01 $A WHY has the Bihar Government's decision to_ reserve jobs for backward
0080A01 classes led to a violent outburst? It is not such an original idea
0090A01 that it should have triggered demonstrations and riots or attracted all-India

Figure 3.1: Example of a Written Text Corpus (KCIE)

(ICAME: http://www.hit.uib.no/icame/kol-eks.html)
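
Each line of the sample in Figure 3.1 begins with a reference code such as 0040A01. The sketch below splits such a line into its parts. The layout is an assumption read off the sample, not a documented specification: a four-digit line number, then a category letter and a two-digit text number, then the text itself. The function name is hypothetical.

```python
import re

# Assumed layout (read off the sample): 4-digit line number, category letter,
# 2-digit text number, whitespace, then the running text.
LINE_RE = re.compile(r"^(\d{4})([A-Z])(\d{2})\s+(.*)$")


def parse_line(line):
    """Split one corpus line into reference fields and text, or return None."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout
    number, category, text_no, text = match.groups()
    return {
        "line": int(number),
        "category": category,
        "text_no": int(text_no),
        "text": text,
    }


print(parse_line("0040A01 classes. The present violence in the State"))
```

Lines that do not match the assumed pattern are reported as None rather than guessed at, so damaged lines can be checked by hand.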

There is a debate regarding inclusion of texts written to be delivered in

speech (i.e. oration) in a written corpus. Also, a debate arises with regard to

the status of texts used in scripts and plays with relation to their inclusion
in a written corpus. Should we include samples from these sources in a
written corpus? It is really a difficult question because it is almost impossible

to decide in a definite way in which group these texts actually belong.

If we take into consideration the basic linguistic modality used in the

generation of these texts, we find that these texts have a right to be included
in a written corpus. Also, read-out writings, lectures delivered in seminars,

notes dictated in class or office, etc., although meant for listening, are
actually composed following the general norms of writing. Moreover, such

texts, although delivered in spoken form, do not have the features of normal
dialogue or conversation. Public speech such as, 'Dear ladies and gentlemen! It

is a great delight to inform you that the government has decided to implement a mass
literacy programme for the benefit of the nation', does not contain the features
typical of impromptu speech. It is quite rational to include it in a written

corpus because it is generated first in written form. Written text may be read
out, but its expression changes due to a change in medium. Therefore, it is

primarily a written text.

On the contrary, if we take the purpose of composition into consideration,

we may argue that these texts should belong to a speech corpus only. The
general argument is that these texts are composed not for reading but for
speaking. The scripts composed for films and plays are made in such a way

that they are suitable for the characters to communicate verbally. Similarly,

lectures composed for public oration are made in such a way that they are
suitable for open verbal deliberation before an audience. These texts should
not be put within written texts. A similar argument also stands for the

notes dictated and delivered in class. However, before we take any decision
regarding the actual status of such texts, we need to analyse the materials
from various angles, with serious consideration of the linguistic and

non-linguistic factors interlinked with these events.

Although texts written in English, German, Spanish, French and some

other Western languages are easily available in huge amounts on the Internet,

texts written in Indian languages are rarely found. In fact, due to specific
orthographic problems related to Indian scripts, written texts in Indian
languages are very difficult to procure from websites. Because of this

technological snag, Indian corpus linguists are not in a position to generate a

written corpus by way of quick collection of data from Web sources. However,
the situation is rapidly changing. At present, some resources are indeed

available on the Internet for Indian languages, thanks to the development

done in the area of putting Indian texts in 'Cyberia'.

3.3.2 Speech Corpus

A speech corpus,[2] in contrast to a written corpus, usually contains text

samples obtained from verbal interactions. At the time of developing a

speech corpus, it is always kept in mind that samples are natural, informal,
conversational, and impromptu in nature. By its default value, a speech
corpus is entitled to contain samples of private and personal talks, formal
and informal discussions, debates, instant talks, impromptu analysis,
casual speech, face-to-face as well as telephonic conversations, dialogues,
monologues, online dictations, instant addressing, etc. There is no scope for
any external involvement because the aim of a speech corpus is to display the
basic characteristics of a speech act in the most faithful manner (Chafe 1982).
A speech corpus, for example, may contain text samples from various
types of speech events occurring in regular normal life and living, such as
common talks; telephonic exchanges; casual speeches; proceedings of courts;

interrogations at the police station; quarrels on roads; bargaining at markets;

talks in social functions, festivals and celebrations; exchange of talks in
classrooms; gossip among friends at malls; love-talks between lovers; curtain
lectures of couples; etc. Texts collected from such sources will properly attest
the actual form and nature of normal speeches. The London-Lund Corpus of
Spoken English, Korean Speech Corpus, Cantonese Speech Database, Dutch and
Flemish Speech Database, American Speech Corpus, Machine-Readable Corpus of
Spoken English, Edinburgh University Speech Corpus of English, Dialogue Diversity
Corpus, West Point Arabic Speech Corpus, Smart-Kom Multimodal Corpus, Speech
Corpus of London Teenagers, etc., are authentic examples of this type.[3]
The most important questions related to speech corpus generation are
how such corpora should be designed and developed and which community's
language they will represent. These are tricky questions, which have no
straightforward answers. To solve problems of representing the speech
of a community, we argue to lay emphasis on the generation of a speech
corpus for each language variety, including standard and regional ones.
Practical constraints such as lack of financial support, technical know-how,
trained manpower, linguistic motivation, social inspiration and political
encouragement may act as barriers on the path of such projects in the Indian
context. Therefore, considering the facilities and conditions available, we
argue for developing speech corpora for the standard variety of each national
language included in the Eighth Schedule of the Indian Constitution. Priority
may be diverted towards other language varieties after the generation of
corpora in each Indian language.
The next question related to this is, from which sections, sectors and
domains are spoken data to be collected? Experts have furnished various
arguments for this particular issue.[4] According to some scholars (Sinclair
1991: 132), speech samples should be taken from those sources and domains
that are considered standard and universally accepted by most of the
people of the speech community. For instance, texts from news broadcasting
and telecasting and language used in official and formal situations, court
proceedings, college and university lectures and classroom teachings may
be included in a speech corpus.
The reasons behind the selection of texts from these sources are
that these samples are suitable to reveal the actual standard form of the

spoken version of a language used by people. Moreover, analysis of these

standard speech databases will produce almost all the salient features
of the spoken form. In addition, if required, these databases may be
used in classroom teaching for teaching discourse patterns in spoken
interactions, pronunciation of sounds, words and sentences to language
learners. Also, these corpora may equally be used for teaching language to
foreign learners.
However, these arguments are strongly contradicted by others (Leech
1993). According to them, if a speech corpus is designed with data of standard
form only, there will be no scope for variety in the corpus. Moreover, it will
fail to represent numerous varieties normally found in regular speech events.
It is, therefore, not logical to generalize the speech habit of an entire speech
community with a small set of text samples of standard spoken form. If we do

this, we will not only fail to account for the peculiarities observed in different
speech patterns of people but also deprive a large number of common speakers
of the chance to have their speech data represented in a corpus.
Therefore, spoken text samples should be taken from all possible domains
of spoken interactions to represent people coming from all walks of life,
irrespective of profession, education, class, ethnicity, age and sex. It should
contain equal amount of data spoken by children as well as by students of
schools and colleges, workers of offices and courts, and people of various
other professions. Similarly text samples should come from interrogations
conducted at police stations, debates held in parliaments, quarrels taking
place in markets and on roads, etc. In sum, a language spoken by varieties of
people should have proportional representation in it. Only then will it reflect
on the internal form and nature of speech by maximum representation.
According to us, both formal and informal speech data should be included
in a speech corpus to make it maximally representative. While formal speech
will include texts from radio and television newscasts, public announcements,
audio advertisements, dialogic interactions, interviews, verbal surveys, pre-
recorded dialogues and scripts of films and plays, informal data will include
samples of texts obtained from various verbal interactions casually enacted
in regular courses of life. Thus equal representation of spoken texts will
make a speech corpus balanced, non-skewed, and properly representative.
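
The idea of balance argued for above can be checked mechanically once each sample carries a label. This is only a rough sketch under stated assumptions: samples are plain dictionaries, the key names and the tolerance are hypothetical, and the check counts samples rather than words or recording time.

```python
from collections import Counter


def shares(samples, key):
    """Return the share of samples falling under each value of `key`."""
    counts = Counter(sample[key] for sample in samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def is_balanced(samples, key, tolerance=0.1):
    """True if every category is within `tolerance` of an equal share."""
    distribution = shares(samples, key)
    if not distribution:
        return True  # an empty collection has nothing to skew
    expected = 1 / len(distribution)
    return all(abs(share - expected) <= tolerance
               for share in distribution.values())


samples = [
    {"id": 1, "formality": "formal"},
    {"id": 2, "formality": "informal"},
    {"id": 3, "formality": "formal"},
    {"id": 4, "formality": "informal"},
]
print(shares(samples, "formality"))
print(is_balanced(samples, "formality"))
```

The same two functions can be run over any other label, such as age group or profession, to see which groups of speakers are under-represented.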
A speech corpus is made in such a way that it is able to balance between
demographic and contextual varieties. While demographic variety accounts
for age, gender, profession, birthplace, education, economic condition,
ethnicity etc., of speakers, contextual variety accounts for all types of variations
observed in speech events taking place at different times, spaces, agents and
events. A speech corpus made in this process faithfully represents the actual
nature and form of a speech event. Thus, it builds bases for repetitiveness
and diversion—two important features of speech for providing reliable
information and clues for proper analysis and interpretation of discourse.
In the following figure (Figure 3.2), a small sample of the Corpus of London
Teenagers (COLT) is reproduced to show how a speech corpus
is designed and developed.

Sharon: Oh, don't start on me, you know, saying I can't be there on Tuesday! (...)
Susie: <laughing> I said nothing. [I'm talking about me</>!]
Sharon: [<nv> laugh </nv>] Don't start because I'll... I'll smash your face in! (...)
Sharon: I say, I've got friends.
Susie: <nv> laugh </nv> (...)
Sharon: And I'm gonna make them come over, and I'm gonna make them beat
the shit out of you! (...)
Susie: Oh, shut up!

Figure 3.2: Speech Corpus (Stenstrom and Andersen 2002: 203)

For the convenience of understanding, let us assume that we want to develop

a speech corpus for Standard Colloquial Bengali (SCB). Let us also hope that

it will preserve all the salient features required for a speech corpus to be

maximally balanced, representative and useful for studying Bengali speech.

Now the question is how we are going to develop such a speech corpus.

In this case, the method employed for other speeches may be useful (Hary

2003), if necessary. Normally, the methods that are used for collecting spoken

texts in electronic form include the following:

Stage 1 : Recording spoken texts in digital tape recorders

Stage 2 : Recording spoken interactions in videotapes

Stage 3 : Transcribing spoken texts into written form

Stage 4 : Transcribing spoken texts into the International Phonetic Alphabet

(IPA)

Stage 5 : Annotating texts with phonetic, orthographic, grammatical,

demographic and contextual information

Stage 6 : Preparing a detailed database about extralinguistic information

related to spoken texts and interactants

Stage 7 : Preparing a detailed glossary of spoken texts

Stage 8 : Translating texts into another widely known language
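
The eight stages listed above can be read as the fields that a single spoken text accumulates as it moves through the process. The record below is a hypothetical sketch, not a description of any existing tool: all field names are assumptions, and each field simply holds the output of one stage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpokenTextRecord:
    """Hypothetical record: one field per stage of speech corpus development."""
    audio_path: str                                   # Stage 1: digital recording
    video_path: Optional[str] = None                  # Stage 2: video recording
    orthographic: str = ""                            # Stage 3: written transcription
    ipa: str = ""                                     # Stage 4: IPA transcription
    annotations: dict = field(default_factory=dict)   # Stage 5: annotation layers
    metadata: dict = field(default_factory=dict)      # Stage 6: extralinguistic data
    glossary: dict = field(default_factory=dict)      # Stage 7: glossary entries
    translation: str = ""                             # Stage 8: translation

    def pending_stages(self):
        """Return the numbers of the stages that have produced no data yet."""
        values = [self.audio_path, self.video_path, self.orthographic, self.ipa,
                  self.annotations, self.metadata, self.glossary, self.translation]
        return [stage for stage, value in enumerate(values, start=1) if not value]


record = SpokenTextRecord(audio_path="rec01.wav", orthographic="Oh, shut up!")
print(record.pending_stages())
```

Keeping all stages in one record makes it easy to see which texts are ready for analysis and which still await transcription or annotation.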

It is normally argued that informal and impromptu speech is the most

important variety of all because it has the closest representation of the core

of a language. An informal speech corpus, in principle, contains texts from

informal and impromptu conversations to reveal all the characteristic features

of speech in a reliable and lively way that no other variety can probably do.
The controversies related to the selection of spoken text samples need

urgent clarification. We are not sure how a speech is considered impromptu or

identified as informal. In fact, these questions need to be addressed first before

we actually tag spoken samples. We are also not sure whether one composes

texts for either oral deliberation or silent reading or for both. The truth is,
informal and impromptu speeches are the most difficult and expensive

things to acquire and are highly complicated to classify and manage. Also,

complexities are involved in transcription of spoken texts because there is no

consensus about the conventions of transcription.

3.3.3 Spoken Corpus

The term 'spoken corpus' is deliberately used to distinguish it from a speech

corpus. A spoken corpus, in principle, is a technical extension of a speech

corpus. Definitely it contains texts of a spoken language but in a different mode

and formation. Text samples in a spoken corpus are stored in written form,

transcribed directly from spoken texts. Also, sometimes, it is tagged with

various annotations related to normal utterance of speech. Some examples of

a spoken corpus are Lancaster/IBM Spoken English Corpus, Emotional Prosody

Speech and Transcripts Corpus, London-Lund Corpus, Wellington Corpus of Spoken

New Zealand English, International Corpus of English, etc. In these corpora,

speech texts are preserved in written form without changing the texts at the

time of transcription.
Spoken corpora may be annotated with phonetic transcriptions. If spoken

corpora are preserved as sound waves as well as transcribed versions,

then a single text exists in two versions to generate a special kind of parallel

corpora. Although not many examples of phonetically transcribed spoken

corpora exist, they are a useful addition to the class of annotated corpora
for linguists who lack technological expertise for analysing recorded speech

(McEnery and Wilson 1996: 26). In the figure below (Figure 3.3), a sample of

a transcribed spoken corpus from the London-Lund Corpus (LLC) is given.

10 1 1 B 11 ((of Spanish)). graph\ology# /
20 1 1 A 11 w=ell#. /
30 1 1 A 11 ((if)) did y/ou _set _that# - /
40 1 1 B 11 well !J\oe and _I# /
50 1 1 B 11 set it betw\een _us# /
60 1 1 B 11 actually! Joe 'set the :p\aper# /
70 1 1 B 20 and ^((3 to 4 sylls))* /
80 1 1 A 11 *^w=ell# /

Figure 3.3: Example of a Spoken Corpus (LLC)

(ICAME: http://www.hit.uib.no/icame/lolu-eks.html)

Despite the wide experience gained in compilation and annotation of written

corpora, the works related to spoken corpora generation and annotation have

not become simplified. Spoken texts involve many aspects that need to be
taken care of at the time of text collection and annotation. The transient nature

of spoken texts is offered as an explanation for justifying the complexities

involved with collection of spoken texts. Therefore, capturing spoken texts is

not a trivial task.

After the audio data is collected and stored in electronic form, the texts

have to be transcribed in both orthographic and

phonetic forms before they can be used. That means processing of spoken texts

involves text segmentation, orthographic annotation, prosodic annotation,

part-of-speech tagging, lemmatization, parsing, etc., which are built upon

transcription. The problems that are often encountered in processing spoken

texts are the following:

• As there is little experience available for transcription of spoken texts,

procedures and guidelines need to be developed for it.

• Tools for automatic or manual annotation of spoken texts need to be

designed and implemented.

• The experiences gathered from working with written corpora have

marginal value to deal with the idiosyncrasies of spoken texts.

• The schemes for spoken text transcription and annotation have to be

designed separately to be functionally useful.

• The standards of annotation developed for spoken texts of some

languages may be used to cater to the needs of spoken corpora of

other languages.

Due to the complexities involved in compilation and annotation, spoken

corpora have brought speech technologists and linguists onto one platform.

Ideally, a spoken corpus addresses the needs of both groups, although there
are conflicts of interest. For example, a recording of spontaneous

conversation in a noisy environment is highly interesting and useful for

linguists, but it appears to be useless to researchers of speech recognition and

speaker identification. Given below (see Figure 3.4) is an annotated spoken

corpus, tagged with features of spontaneous speech and syntax.

Orthographic version of a spoken text:

Good morning. More news about the Reverend Sun Myung

Moon, founder of the Unification church, who's currently in

jail for tax evasion; he was awarded an honorary degree last

week by...

Annotated version of the spoken text:

A01 2 (_(In_IN Perspective_NP)_)

A01 3 (_( Rosemary_NP Hartill_NP)_)

A01 5 good_JJ morning_NN ._. more_AP news_NN about_IN the_ATI

A01 5 Reverend_NPT Sun_NP Myung_NP Moon_NP ,_, founder_NN

A01 6 of_IN the_ATI Unification_NNP church_NN ,_, who_WP 's_BEZ

A01 6 currently_RB in_IN jail_NN for_IN tax_NN evasion_NN :_:

Figure 3.4: Lancaster/IBM Spoken Tagged English Corpus

(ICAME: http://www.hit.uib.no/icame/lanspeks.html)
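
In the annotated version shown in Figure 3.4, each word is joined to its part-of-speech tag by an underscore. The sketch below separates the two. It is a minimal illustration with a hypothetical function name: it assumes tokens are separated by spaces, skips the reference fields at the start of each line, and does not try to interpret the bracketed heading markup.

```python
def parse_tagged(line):
    """Return (word, tag) pairs for every word_TAG token in a line."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so that a token like "._." still works.
        word, separator, tag = token.rpartition("_")
        if separator and word and tag:
            pairs.append((word, tag))
    return pairs


line = "A01 5 good_JJ morning_NN ._. more_AP news_NN"
for word, tag in parse_tagged(line):
    print(word, tag)
```

Once words and tags are separated in this way, simple counts such as the frequency of each tag become straightforward.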

3.3.4 Text Corpus vs. Speech Corpus

It is well known that speech came historically prior to writing (Halliday

1987). We know that speech is primary, while writing is secondary, for the

reasons that children acquire speech first and illiterate people use language
without having the skill to write and read. Thus, primacy of speech
over writing clearly shows that speech is the basic medium of linguistic
expression, without regard to how language evolved and how children
acquire language.
However, in some recent works, it is argued that both writing and speech
are different but equal manifestations of language (Crystal 1997). Writing is
neither an 'other state' of speech, nor a degraded manifestation of speech. The
strategies and processes of production and comprehension of written texts
are autonomous from those used for speech. In other words, writing has an
independent entity and functional role in language as speech does (Sasaki
2003: 91). Yet, we must abide by the fact that speech is the most natural and
spontaneous state of language, which is used when two persons are within
hearing distance of each other's voice and when language is mutually
understandable to both the participants.
The differences between written and spoken texts are observed in the fact
that while written texts are processed as graphically codified, completed and
monologue-like products, spoken texts are normally transmitted as sound
waves. Spoken texts are developed through time and usually in the form
of a series of utterances that make a dialogic interaction characterized by
formal and organizational mechanisms of communication with one or more
participants. In other words, spoken texts, because these are developed as
parts of interactions, cannot be analysed adequately without proper reference
to the situations. These differences between spoken and written texts lead us
to develop special systems for transcription of spoken texts.
The unique features that constitute a spoken text are diversion and
interaction, which can only be made visible and accessible if an instance
of verbal interaction is not removed from its actual setting or contextual
background during the task of representing the spoken text in written form.
That means, to devise an appropriate written form for spoken text, it is
necessary to develop a whole range of theoretical categories to adequately
describe the interactional nature of spoken texts.[5]
The fundamental differences between spoken and written texts have
naturally and inevitably led to methodological and theoretical differences
between conversation analysis and grammatical description of language in
the following ways (Uhmann 2001: 377):

• Spoken language in verbal interaction follows its own rules, and thus
there is a linguistic system outside sentential grammar.
• The analysis inventory needed for theoretical description of this
system is not separate from the inventory of sentential grammar.

A theory of spoken language must therefore refer to concepts that

genuinely belong to sentential grammar.

• For the speaker, this means that applying the rules referred to in (b)

can necessitate the simultaneous activation of knowledge that properly

belongs to sentential grammar.

While we describe the texture of a spoken text, we need to describe the patterns

of lexical relations, conjunction and reference because all these patterns are

drawn on in speech as in writing. However, texture in spoken interaction also

involves additional cohesive patterns of the conversational structures. The

conversational structure describes how interactions actually negotiate the

exchange of meanings in case of dialogues. The conversational structure has

two components (Eggins 1994:109):

• Speech function: Negotiation, which characterizes spoken texts, is

achieved through the sequencing of moves each of which performs a

speech function or a speech act. The basic initiating speech functions

include an 'offer' (for example, Would you like another chocolate?), a

'command' (for example, Pass the chocolates.), a 'statement' (for example,

I love chocolates.) and a 'question' (for example, Which chocolate do you

like best?). Responding speech functions either support or confront the

initiating speech function. Thus, we have accepting an offer versus

declining it, complying with a command versus refusing to comply,

acknowledging or agreeing with a statement versus disagreeing and

answering a question versus disavowing.

• Exchange structure: Sequences of these speech functions also

constitute jointly negotiated exchanges. The minimal exchange is two

speech functions (for example, offer + accept or question + answer).

But exchanges can be of many moves: an exchange may include both

preparatory moves and the following exchange—the core sequence of

an offer and its acceptances are surrounded by those initial sounding-

out moves and their extensive politeness-motivated following-up

moves.

We can characterize the basic contrasts between spoken and written texts
by following the register variables, which have linguistic consequences
both in speech and writing.[6] The situations in which we use spoken texts
are typically interactive, while in the case of writing, the situation is
more or less static, without any direct visible interaction among the
participants. We do not usually deliver monologues to ourselves, although
we do often interact with ourselves by imagining a respondent to our
remarks (Eggins 1994: 57). In spoken situations we are usually in immediate
face-to-face contact with interactants (telephone conversations, where
there is aural but no visual contact between the speakers, are an
exception). Here we use language in a typical way to achieve some ongoing
social action (for example, get work done, get consent over a point, etc.).
In such situations we usually act spontaneously so that our linguistic
output is unrehearsed. Because spoken situations are often 'informal and
everyday', we are normally relaxed and casual in mood during the course of
spoken interaction.

We can clearly observe a contrast between this situation and a typical
situation in which we use written language. For instance, we can think of
a situation in which we write an article on the results obtained from a
research experiment. Here we find ourselves alone, not in a face-to-face
situation, without any kind of aural or visual contact with our intended
audience. The language we use deals with issues related to our research.
Here we never try to write a commentary on the actions, feelings and
thoughts we experienced while we conducted the experiment.

This implies that language in written form calls for rehearsal at various
stages before its final culmination: we make drafts, edit them, rewrite
them, correct them and finally re-copy our texts. The truth underlying it
is that for most of us, writing is not an easy and casual activity. We need
peaceful and quiet surroundings, and we need to concentrate to gather our
thoughts in a systematic manner before we put them through the formal
process of composition.

Thus, spoken and written texts, the two basic language types, differ along
several dimensions, as presented below (see Table 3.1):

Spoken Language                          Written Language

+ interactive                            - interactive
  (two or more participants)               (1 participant)
+ face-to-face                           - face-to-face
  (in the same place and time)             (on your own)
+ language as action                     - language as action
  (language used to accomplish tasks)      (using language to reflect)
+ spontaneous                            - spontaneous
  (without rehearsing what is going        (done with planning, drafting
  to be said)                              and rewriting)
+ casual                                 - casual
  (informal and everyday)                  (formal and special occasions)

Table 3.1: Spoken Text vs. Written Text (Eggins 1994: 55)

There are some obvious implications of the contrast between spoken and
written texts. The texts used in spoken situations are typically organized
according to turn-by-turn sequences of talk. Because a spoken interaction
tends to accompany action, the structure of talk is dynamic, with one
sentence leading to another. Written text, on the other hand, is produced
as a monologic block (Eggins 1994: 57). It needs to stand more or less by
itself. It needs to be context independent. Because it is intended to
encode our considered reflections on a topic, it is organized synoptically.
It has a beginning, a middle and an end, with a generic structure
determined before the text is complete. Table 3.2 summarizes the
differences that correspond to the two polar ends of spoken and written
texts.

Further differences between spoken and written texts are observed if we
critically compare a piece of spoken text with a piece of written text. We
find that spoken text contains spontaneity phenomena, such as hesitations,
false starts, repetitions, non-beginnings, non-ends, interruptions, etc.,
whereas written text has all such traces removed. Spoken text contains a
large stock of everyday words, including slang, provincialisms and dialect
words, which are rarely used in written text. Spoken text also often
includes sentence structures that standard grammatical conventions do not
admit.

In written text we find highly prestigious vocabulary, selected diction
and standard grammatical constructions. The differences thus noted between
the spoken and the written texts are not new to us. It is, however,
important to appreciate that the differences are not accidental. They are
functional consequences of situational differences in the mode of
linguistic communication (Eggins 1994: 58).

Spoken Language                          Written Language

• Turn-taking organization               • Monologic organization
• Context dependent                      • Context independent
• Dynamic structure (interactive         • Synoptic structure (rhetorical
  staging and open ended)                  staging, closed ended and finite)
• Spontaneity phenomena (false           • Polished final draft (indications
  starts, hesitations,                     of earlier drafts removed)
  non-beginnings, interruptions,
  overlaps, incomplete clauses,
  non-ends, etc.)
• Everyday lexis                         • 'Prestige' lexis
• Non-standard grammar                   • Standard grammar
• Grammatical complexity                 • Grammatical simplicity
• Lexically sparse                       • Lexically dense

Table 3.2: Spoken Text vs. Written Text (Eggins 1994: 56)

There are also features that are highly sensitive to each type of text.
They include factors such as the degree of grammatical complexity, patterns
of lexical density, the process of text composition, nominalization, the
use of function and content words, forms of address in texts, etc. Analysis
of these factors will illustrate the major differences between spoken and
written texts in various ways. In sum, while spoken text is concerned with
human actors carrying out action processes in dynamically linked sequences
of clauses, written text is chiefly concerned with abstract ideas and
reasons and, therefore, is functionally linked by several relational
processes in condensed sentences (Eggins 1994: 53-68).
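
Lexical density, one of the factors named above, can be approximated as the
share of content words among all word tokens in a text. The following is
only a minimal sketch: the function-word list FUNCTION_WORDS and the two
sample strings are our own illustrative assumptions, and the tokenization
is a simple regular expression, whereas a real study would rely on a
part-of-speech tagger.

```python
import re

# Illustrative (incomplete) set of English function words and fillers;
# an assumption made for this sketch, not a standard inventory.
FUNCTION_WORDS = {
    "a", "an", "the", "i", "you", "we", "it", "he", "she", "they",
    "is", "are", "was", "were", "be", "do", "did", "have", "has",
    "and", "but", "or", "so", "of", "in", "on", "to", "for", "with",
    "that", "this", "not", "well", "um", "uh",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_density(text):
    """Return the proportion of content words among all word tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


# Invented samples: one speech-like, one writing-like.
spoken = "well um I did it and it was so so good you know"
written = "Experimental results confirm significant improvement in accuracy"

print(round(lexical_density(spoken), 2))
print(round(lexical_density(written), 2))
```

On these two invented samples the writing-like sentence scores higher than
the speech-like one, in line with the 'lexically sparse' versus 'lexically
dense' contrast in Table 3.2.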
All these arguments are furnished here to substantiate our claim that
since spoken and written texts are characteristically different from each
other with regard to their form, function and composition, corpora
developed from these two different types of text should not be merged
together to produce a general corpus. Rather, each type of text should be
kept in a separate corpus so that its future use in linguistic studies and
application is more useful and trouble free.

3.4 Nature of Data

From the perspective of the nature of data, corpora are classified into
several broad types such as general corpus, special corpus, controlled
language corpus, sublanguage corpus, sample corpus, monitor corpus,
multimodal corpus, etc. Each type of corpus is discussed below with
reference to the text samples used for building it.

3.4.1 General Corpus

A general corpus contains texts of general type belonging to various
disciplines, genres, subject fields and registers. With regard to form and
utility, a general corpus is infinite in number of text samples. That
means, the number of text types and words included in this corpus is
really vast and open. However, it has little scope to grow with time
because appending a general corpus with new text samples is hardly
permitted. A general corpus is large in size, rich in variety, wide in text
representation and reliable with regard to information. Any information
that is not available in a special sample corpus is available in a general
corpus because it has the authority to include all kinds of linguistic data
and information in it. Therefore, whenever we require, we can easily
retrieve necessary data and information from a general corpus.

The minimal criteria for selecting texts for a general corpus include the
markers for drawing lines of distinction between fictional and
non-fictional texts (Sinclair 1991: 20). Also, there should be markers for
identification of texts obtained from books, journals, periodicals,
newspapers, etc. In the case of both spoken and written text samples, we
should use special identification marks to distinguish between formal and
informal texts and the factors that control the use of texts based on age,
gender, education, profession, origin and other similar demographic
variables. British National Corpus, American National Corpus, Swedish
National Corpus, etc., are considered faithful examples of a general
corpus.
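
The identification markers described above can be pictured as a metadata
record attached to every text sample. The sketch below is only one possible
arrangement: the class TextSample, its field names and the helper select
are hypothetical and do not reproduce the header scheme of any of the
corpora named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextSample:
    """Identification markers for one text sample (hypothetical schema)."""
    sample_id: str
    mode: str            # "spoken" or "written"
    fictional: bool      # fiction versus non-fiction
    source: str          # e.g. "book", "journal", "periodical", "newspaper"
    formal: bool         # formal versus informal text
    age: Optional[int] = None          # demographic variables, if known
    gender: Optional[str] = None
    education: Optional[str] = None
    profession: Optional[str] = None
    origin: Optional[str] = None


def select(samples, **criteria):
    """Return the samples whose markers match every given criterion."""
    return [s for s in samples
            if all(getattr(s, key) == value
                   for key, value in criteria.items())]


# Invented sample records for illustration.
samples = [
    TextSample("w001", "written", True, "book", True),
    TextSample("w002", "written", False, "newspaper", True),
    TextSample("s001", "spoken", False, "conversation", False, age=34),
]

non_fiction = select(samples, fictional=False)
print([s.sample_id for s in non_fiction])
```

With markers of this kind in place, a subset such as all informal spoken
samples can be retrieved with a single call, for example
select(samples, mode="spoken", formal=False).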

3.4.2 Special Corpus

A special corpus is designed from the texts already stored in a general
corpus. Obviously, text samples included in a special corpus belong to
specific varieties of language, dialect and subject, with emphasis on
special aspects and properties of the language that investigators want to
explore. That means a special corpus is usually assembled for a special
purpose in a specific manner with specific goals. In fact, the very nature
of specificity of a special corpus makes it highly flexible to vary in
size, content, composition and representation, depending on its functional
relevance and referential purpose. Because of its unique nature of
composition, it usually fails to contribute towards the description of the
general features of a language, although its content reflects on the
presence of a high proportion of unusual features of a language and
projects on a few peripheral properties of a language.

In general, a special corpus is not reliable for general linguistic
description because it records text samples from people not behaving in a
normal manner or situation. Moreover, it is not balanced in its composition
except within the scope of its own purpose. Therefore, if used for other
purposes, it will present a distorted and skewed image of a language or its
segments. It is different from a general corpus in the sense that it aims
at reflecting on features of one or the other variety of a normal and
authentic language. Corpora made from the language used by children,
non-native speakers, dialect groups and people belonging to specialized
areas of profession and work (for example, auction, medicine, music, play,
cooking, law, the underworld, gambling, etc.) are designated as special
corpora because they closely represent the language varieties they include.
Special corpora contain texts sampled from particular varieties of a
language. For example, a dialect corpus is identified as a special corpus
if samples are obtained from a particular dialect or speech variety. The
main advantage of this is that text samples are selected in such a way that
the particular phenomena we are looking for in the language variety occur
more frequently in it than in a general corpus. A special corpus, made and
enriched in this manner, is smaller in size than a general corpus that
contains small samples of the same data.[7]
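
The claim that a phenomenon occurs 'more frequently' in a special corpus
than in a general corpus only makes sense once raw counts are normalized
for corpus size. A minimal sketch of that normalization follows; the
counts and corpus sizes are invented for illustration, and per_million is
our own helper name.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Invented figures: a dialect form counted in a small special corpus
# and in a much larger general corpus.
special = per_million(count=120, corpus_size=200_000)
general = per_million(count=500, corpus_size=100_000_000)

print(special, general, special / general)
```

In this invented case the general corpus holds more raw occurrences of the
form, yet the special corpus is far richer in it once size is taken into
account.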

Distinctions may be made between the varieties within the limits of
reasonable expectation of the kind of language in daily use by substantial
numbers of native speakers and the varieties that, for one reason or
another, deviate from the central core of a language. Therefore, a special
corpus may fail to contribute towards the general description of an
ordinary language, either because it contains a high proportion of unusual
features or because its origin is not reliable as a record of people
behaving normally. Each component illustrates a particular kind of
language, and for each component, there is a descriptive label that
identifies the homogeneity of the materials stored. The particularity of
language variety is retained at the helm of the corpus with a label without
transferring the data into a general category. CHILDES Database is a unique
example of a special corpus, as is Corpus of the Times of India, 2000.[8]

In essence, a special corpus is made with texts that do not overlap much
with the central pool of a language. However, to be clearly 'within the
frame of a language', it shows a number of grammatical and lexical features
of that language. Even then, the 'markedness' of patterns unique to it will
serve to differentiate it clearly from general varieties of a language.

3.4.3 Controlled Language Corpus

A 'controlled language' is an exclusive concept because it puts special
restrictions on the grammar, style and vocabulary for the writers of
documents in special domains. Typically, a controlled language is formally
defined so that conformity to a controlled language standard can be
verified. There is much discussion on controlled language among scholars
working in the areas of language teaching, text editing and translation.
Caterpillar Fundamental English (CFE) is a unique example of a controlled
language. Introduced in the early 1970s, it uses a restricted vocabulary of
850 words to simplify technical English so that non-native
English-speaking clients can read the documents easily.[9]

Various industries and research organizations have now started to build
upon the original work of the CFE. Emphasis is laid on creating a core of
lexical items that may be used throughout the documents. Although a
specific number of general technical writing rules (for example, short
sentences, single-sense terms, etc.) are promoted, strict enforcement of
grammatical rules is not usual here. Sometimes, a 'conformance checker' is
used that checks for adherence to vocabulary items rather than the overall
grammatical structure of the language.
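
A vocabulary-based conformance checker of the kind just described can be
sketched in a few lines. This is an illustration under stated assumptions
and not a description of any existing checker: the word list APPROVED, the
limit MAX_WORDS for the 'short sentences' rule and both function names are
hypothetical.

```python
import re

# Hypothetical approved word list; a real controlled language would
# define a much larger core vocabulary.
APPROVED = {"remove", "the", "cover", "and", "check", "oil", "level",
            "replace", "filter", "before", "you", "start", "engine"}

MAX_WORDS = 20  # assumed limit for the "short sentences" rule


def check_sentence(sentence, approved=APPROVED, max_words=MAX_WORDS):
    """Return a list of conformance problems found in one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    problems = [f"unapproved word: {w}" for w in words if w not in approved]
    if len(words) > max_words:
        problems.append(f"sentence too long: {len(words)} words")
    return problems


def check_text(text):
    """Check a text sentence by sentence; report only flagged sentences."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    report = {}
    for sentence in sentences:
        problems = check_sentence(sentence)
        if problems:
            report[sentence] = problems
    return report


report = check_text("Remove the cover. Inspect the oil level.")
print(report)
```

Here the first sentence passes and the second is flagged because 'inspect'
is not in the approved list. Note that the sketch looks only at words and
sentence length; it makes no attempt to judge grammatical structure.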

Conformance checkers are a new means of supporting controlled language
writing. For instance, the Simplified English Checker/Corrector (SECC)
project, completed in 1994, resulted in the creation of a basic conformance
checker, which checks for grammatical structures that do not conform to
standard English examples. It is mostly interactive in the sense that it
indicates where deviance occurs in controlled language writing samples.

The process of controlled technical English writing starts with a reduced
vocabulary (for example, 8000 general terms and nearly 50,000 technical
terms selected from a total of approximately one million terms). The
database is supported with a set number of constrained syntactic
constructions in English that may be mapped into about ten other languages.
As indicated in a recent article on the subject (Kamprath et al. 1998), new
technical terms are constantly added to the database for approval and then
submitted to human translators who can provide translations in their
respective languages and add them to multilingual databases. The number of
English technical terms, at present, is approximately 70,000. The objective
is to achieve better standardization of English terminology, improve
comprehension of English documents by both native and non-native English
readers and facilitate translations into target languages.

Most of the Controlled Language Authoring Systems (CLAS) are called
'Stop-and-Go' or 'Red light/Green light' systems. An author works on an
entire piece of text and then submits it to a conformance checker. The
checker goes through the text one sentence at a time and notifies the
author of potential spelling mistakes, ambiguity pitfalls for translation,
etc. Research is underway on the development of interactive authoring
systems that may assist authors who are writing technical texts. These work
in a similar fashion to how computer-aided translation tools assist a human
translator to produce the target translation of a source text.[10]

The objective of controlled language applications for technical writing is
to foresee the need of document translation and to create structural
paradigms that allow a computational system to optimally retrieve
equivalents in the target language for texts written in a controlled source
language. In essence, a controlled language is not a single, immutable
entity. It has evolved over decades and has taken form in different
applications for different purposes. Scholars have taken the general
concept to customize it within their own environments to make it profitable
for their specific needs. It is only now in the late 1990s that different
controlled language systems are starting to work together by forming the
National Consortium to Advance Controlled Language and Computer-Aided
Translation Tools (Fields 1998). The focus is to create general controlled
language and training principles that will allow for cross-language
standards in the emerging field.

3.4.4 Sublanguage Corpus

A sublanguage[11] corpus contains only one text type of a particular
variety of a language. A sublanguage is a subset of a language that is
closed under some or all operations of a language. In essence, it lies at
one extreme end of the linguistic spectrum, while a reference corpus lies
at the other extreme end. The homogeneity of structure and the highly
specialized lexicon allow it to be quantitatively small in the amount of
data and yet demonstrate typically good closure properties. Thus, a
sublanguage corpus is defined with the help of its internal and external
criteria. However, it remains to be seen whether the external and internal
criteria actually match in practice. The study of language used for special
purposes shows that writers often conform to quite an elaborate
prescription when composing in a technical or professional context.
Therefore, it is not surprising if we find many similarities among
sublanguage corpora. Under this scheme, corpora consisting of sublanguage
materials will fall under the head of a subcorpus.[12]

It is, however, necessary to keep in mind that 'language' (for example,
English) is a broad set that may contain all conceivable utterances,
including slang, poetry and what we call 'standard' language. On the other
hand, a sublanguage is not merely an arbitrary subset of sentences because
it may differ in structure as well as in vocabulary. For example, in
medicine, a telegraphic sentence such as 'patient improved' is considered
grammatically correct due to an operation that permits dropping of articles
and auxiliaries. If we follow this definition, then weather reports, stock
market reports, computer manuals and controlled languages, all stand as
examples of a sublanguage.[13]
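
The telegraphic operation mentioned above, which drops articles and
auxiliaries, can be illustrated with a toy function. The two word lists and
the name telegraphic are our own assumptions; an actual sublanguage grammar
is far richer than a simple word filter.

```python
# Toy word lists assumed for this sketch; they are not exhaustive.
ARTICLES = {"a", "an", "the"}
AUXILIARIES = {"is", "are", "was", "were", "has", "have", "had",
               "be", "been"}
DROPPED = ARTICLES | AUXILIARIES


def telegraphic(sentence):
    """Drop articles and auxiliaries, keeping the other words in order."""
    words = sentence.rstrip(".").split()
    kept = [w for w in words if w.lower() not in DROPPED]
    return " ".join(kept).lower()


print(telegraphic("The patient has improved."))
```

Applied to 'The patient has improved.', the function yields the telegraphic
form 'patient improved' cited in the text.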

A sublanguage corpus is an important resource in language technology.
People working in the area of language processing realize that they need
access to corpora containing sublanguage materials in order to develop
systems capable of handling specialized texts. It is also assumed that by
narrowing the subvariety, in a highly specialized communicative context,
the actual structure of language will be simplified and, thus, become more
amenable to automatic processing. The vocabulary, too, is restricted and
specialized to correspond with the constraints at semantic, conceptual and
cognitive levels.

3.4.5 Sample Corpus

A sample corpus is one of the major offshoots of a special corpus. It
contains a small collection of text samples chosen with great care and
attention to be studied in minute detail. According to some scholars
(Sinclair 1991: 24), after a sample corpus is designed and developed, it
should not be open to any kind of addition or change because that may
disturb the balance of its composition and distort the actual image of the
data required for special research purposes. Because the number of samples
is small and the size is constant, its texts usually do not qualify as
general texts. Zurich Corpus of English Newspapers is a fine example of a
sample corpus.
A special variety of a sample corpus is the literary corpus, which may be
further subcategorized based on the type of text included in it. This
incidentally draws attention to Biblical and literary scholarship that
began the work of corpus generation centuries ago (for details see Dash
2006). In fact, there is a lot of expertise available in literary circles
on such things as establishing the canon of an author's works. In the case
of a literary corpus, the criteria considered for classifying a corpus may
include various parameters such as:

• A particular author (for example, a corpus of literary works of
Shakespeare, Milton, Eliot, Hemingway, Tagore and others)
• Text type of a single author (for example, a corpus of Shakespearean
plays, Keats's Odes, Tagore's short stories, etc.)
• Particular text of a single author (for example, a corpus of Paradise
Lost, Ulysses, etc.)
• A particular genre of text (for example, a corpus of odes, fiction,
dramas, short stories, poetry, etc.)
• A particular period (for example, a corpus of fifteenth-century prose
texts, eighteenth-century novels, etc.)
• A particular group (for example, a corpus of Romantic poets, Augustan
prose writers, Victorian novelists, etc.)
• A particular theme (for example, a corpus of revolutionary writings,
family narration, industrialization, etc.)

3.4.6 Monitor Corpus

A monitor corpus grows continuously with time to include an infinite number
and variety of text samples from all possible sources of both written and
spoken language. It has an exclusive criterion of constant augmentation of
language databases to reflect on changes occurring within a language that
throbs with life. Because constant growth gives a monitor corpus scope to
reflect on the passage of language change, it marks the meanders of growth
and modification through the lens of a diachronic view. However, scholars
argue that this growth should leave untouched the relative balance of its
components defined previously by specific parameters (Sinclair 1991: 21).
This implies that the same principles of composition should be followed
strictly year after year.

The basic purpose of a monitor corpus is to refer to texts, spoken or
written, in a language within a particular time span. Such a corpus has
high functional relevance in lexicography because it is constantly
replenished with new data to trace novel changes creeping into the
language. Thus, it enables rare tokens to become large in number and allows
old and common tokens to be stored in an archive. Gradually, over time, the
volume of a monitor corpus is enlarged with the coverage of data spreading
across decades or centuries.

A monitor corpus allows us to identify new words, track variation in usage,
observe change in meanings, establish long-term norms on frequency of
distribution, and derive wide-range lexical and syntactic information.
However, with the introduction of new texts, the overall balance of a
monitor corpus may change. In that case, the actual rate of flow of data
may need to be readjusted to address the trend of the time. Bank of
English, Bank of Swedish, etc., are examples of a monitor corpus.
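
The first use listed above, identifying new words, can be sketched as a
comparison between the latest slice of a monitor corpus and everything
collected before it. The three tiny 'slices', their year labels and the
threshold min_count are invented for illustration; new_words is our own
helper and not part of any corpus toolkit.

```python
from collections import Counter


def new_words(earlier_slices, latest_slice, min_count=2):
    """Return words frequent in the latest slice but unseen earlier."""
    seen = set()
    for tokens in earlier_slices:
        seen.update(tokens)
    counts = Counter(latest_slice)
    return sorted(w for w, c in counts.items()
                  if c >= min_count and w not in seen)


# Invented time slices, each already tokenized into lowercase words.
slice_1990 = "the fax arrived and the letter was sent".split()
slice_2000 = "the email arrived and the fax was sent".split()
slice_2010 = "the blog post and the email mention the blog".split()

print(new_words([slice_1990, slice_2000], slice_2010))
```

The threshold min_count is a simple guard against treating one-off tokens
as new vocabulary; lowering it to 1 would also report 'mention' and 'post'
in this invented example.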

3.4.7 Multimodal Corpus

A recent offshoot in the field of corpus typology is the multimodal corpus,
which aims to record and annotate several modalities of human
communication, including spoken texts, written texts, gestures, hand
movements, facial expressions, body postures, etc. Obviously, the scheme of
work involves several theoretical and practical issues to make this corpus
maximally useful in composition and representation. Also involved are
several physical aspects, which are directly related to designing a corpus
of this type.

Due to some limitations (see Chapter 7) of the language corpora made so
far, there is an increasing interest among scientists in the formation of a
multimodal corpus, which, according to scientists of information and
communication technology, can contribute in a robust manner towards the
exploration of the techniques normally used in multimodal communication
that involves various modes of human communication and cognition.[14] The
basic focus of the enterprise for multimodal corpus generation is directed
towards the techniques of non-verbal communication studies and their
contribution to a definition of collection protocols, coding schemes,
inter-coder agreement measures, and reliable models of exploring multimodal
behaviours. These can be built up from multimodal corpora and compared with
results found in the literature.
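
The text does not say which inter-coder agreement measure is meant. One
common choice is Cohen's kappa, which corrects the observed agreement
between two coders for the agreement expected by chance. The sketch below
assumes two coders who each assign one gesture label to the same sequence
of video segments; the labels and data are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each coder's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations of six video segments by two coders.
coder_1 = ["nod", "nod", "point", "shrug", "nod", "point"]
coder_2 = ["nod", "point", "point", "shrug", "nod", "nod"]

print(round(cohen_kappa(coder_1, coder_2), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates
agreement no better than chance. Other measures may be preferable when
more than two coders are involved.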

There are questions about how such corpora should be built up in order to
provide useful and usable answers to research questions related to
linguistics and information technology. This implies that the question of
generation, processing and utilization of multimodal corpora is actually
related to several issues stated below:

• Building up of models of behaviour from multiple sources of human
knowledge (for example, manual annotation, image processing,
motion capture, literature studies, etc.)
• Coding of schemes for annotation of multimodal video corpora
• Validation of multimodal annotations and metadata descriptions of
multimodal corpora
• Exploitation of multimodal corpora for various applications (for
example, information extraction and retrieval, meeting transcription,
multimodal interfaces, summarization, translation, Internet services,
communication, clinical studies, etc.)
• Benchmarking of systems and products generated from the use of
multimodal corpora
• Use of multimodal corpora for evaluation of systems developed in
computational linguistics and language technology
• Automated fusion of resources (for example, coordinated speeches,
gazes, gestures, facial expressions, movements, etc.) retrieved from
one or more multimodal corpora

These issues require deeper understanding of theoretical issues and
research questions related to verbal and non-verbal communication in
multimodal corpora. However, since this is a new area of corpus research,
we are yet to get an updated view of the state of the art of research on
multimodal corpora.

Because electronic corpora are new things, we are yet to reach a consensus
on what counts as a corpus and how corpora should be classified. However,
our experience in dealing with different types of corpora for the past
fifty years helps us classify corpora in a tentative way with a chance for
future modification. It is noted that a corpus may be designed and
developed based on a number of parameters according to the requirement of
investigators. That means a scheme of corpus classification is bound to
evoke controversies and counterarguments. The decisions about what should
belong to a particular corpus and how the selection criteria should be made
virtually control every aspect of subsequent analysis and investigation.

Endnotes

[1] Some people are interested in including text samples from personal e-mail
messages in a written corpus. However, people are highly sensitive in this regard.
According to their arguments, texts composed in personal mails should not be
included in a general written corpus because samples derived from these sources
possess specific criteria that are hardly observed in texts composed in imaginative
and informative writings. E-mail message texts are originally skewed and greatly
distracted from actual form and texture of general written texts. Therefore, we
identify a special category, namely 'E-mail Corpus' in which such texts should be
preserved for special type of investigation and analysis.
[2] Technically, however, a 'speech corpus' refers to texts that are available in oral
form. That is, speakers involved in a speech corpus behave in oral mode. An
important type of a speech corpus is an 'experimental corpus', which is assembled
for studying fine details of spoken language. Such a corpus is small in size and is
produced by asking informants to read out passages in an anechoic chamber.
[3] A speech corpus is a collection of spoken data typically recorded in a specific
setting for specific purpose by specific users (for example, the speech corpus
Speech DatCar is designed for developing an interactive system for direct
consumer application). Usually such a corpus lacks the richness of linguistic
features normally found in spoken language.
[4] The method and the standard proposed by Greenbaum and Quirk (1990) while
developing the London-Lund Speech Corpus of English were greatly revised and
modified at the time of developing Swedish Speech Corpus, Chinese Speech Corpus,
Speech Corpus of American English, and Hebrew Speech Corpus. As a result, we have
no definite guideline to follow for collecting speech data. However, the present
trend of corpus research implies that linguists have the liberty to select the type
and amount of speech data independently taking into consideration the need of
specific research and application potential of the work.
[5] It is here that the discipline 'Conversation Analysis' establishes itself as an
independent area of linguistic research. By using central concepts such as 'turn',
'adjacency pair', 'recipient design', 'reflexivity', etc., the field has developed a
necessary theoretical basis and methodological procedures for conversation
analysis and interpretation.
[6] According to the established norms of speech analysis, there are three register
variables: 'field' (what the language is being used to talk about), 'mode' (the
role language is playing in an interaction) and 'tenor' (the role of relationships
between interactants) (Halliday 1987).
[7] For instance, a corpus that contains text samples from normal dialogues and
conversations is not a special corpus because it partially reflects on the regular
spoken form of a language. Similarly, a corpus that represents texts from
newspapers (even of a single one) is not a special corpus because its content is
actually a small representation of the normal variety of a language.
[8] A special category of a special corpus is a 'literary corpus' of which there are many
kinds. Biblical and literary scholarship began the discipline of corpus linguistics
long ago, and there is a lot of expertise available in literary circles on such things
as establishing a canon of an author's works. Classification criteria include 'author'
(for example, Shakespeare, James Joyce, Rabindranath Tagore, etc.), 'genre' (for
example, odes, short stories, limericks, etc.), 'period' (for example, fourteenth
century, twentieth century, etc.), 'group' (for example, Augustan prose, Victorian
novels, etc.) and 'theme' (for example, revolutionary writings, renaissance prose,
post-modern novels, etc.).
[9] This is similar to the work of Ogden's Basic English in the 1930s. More information
is available at marshallnet.com/~manor/basiceng/ramble.html.
[10] A distinction is made between texts 'destined for translation' (i.e. it is decided
before writing starts that the original text will be translated) and texts that are
'chosen for translation' after the source text is written. When an organization
decides that all manuals that are produced are destined to be translated from
their very inception, it becomes easier to persuade management and technical
authoring staff to implement writing principles that will improve translatability
of the texts. If a text is meant to be produced and read only in the source language
(for example, Times Report, etc.), and someone decides to take such a document
and feed it through a machine translation system, the resulting text will most
likely be quite unsatisfactory because the text was not written with the intent that
it would be translated, especially not by a machine.
[11] Zellig Harris was probably the first to apply the term 'sublanguage' to natural
language, using algebra as an underlying formalism. He defined a sublanguage as a
subset of the sentences of a language that is closed under a given set of linguistic
operations. For example, the conjunction of two sentences of the sublanguage yields
another sentence of the sublanguage.
[12] A 'subcorpus' compiles text samples selected and ordered according to a set of
linguistic criteria defined beforehand to characterize a particular
linguistic variety. Components of a subcorpus, to an extent, illustrate a particular
type of language.
[13] Also, the concept of 'sublanguage' needs to be distinguished from that of 'artificial
language' or 'reduced language'. The latter two are designed intentionally,
whereas a sublanguage evolves naturally (although at the terminology level,
there may be some deliberate acts of creation). It is argued that sublanguages and
controlled languages are not mutually exclusive (Kittredge 2003).

[14] Recently, European Networks of Excellence have launched integrated projects (for
example, HUMAINE, SIMILAR, CHIE, AMI, etc.) dedicated solely to
multimodal human communication, which testifies to the growing interest in this area
and addresses the general need for data on multimodal behaviours.
