Corpus Linguistics 1

Corpus Linguistics
Linguistics being the scientific study of language
and its structure, ‘corpus linguistics’ is the study
of language “on the basis of text corpora.”
The analysis does not stop at the description of
those texts; the contexts in which they occur are
also examined.
Place for Corpus Linguistics in Applied Linguistics
 A means to explore actual patterns of language use.

 A tool for developing materials for classroom language
instruction.
 To explore different questions about language use.
 To provide powerful tools for analysis of
natural languages.
 To give an insight about how language use varies
in different situations.
Corpora
 A ‘corpus’ (plural ‘corpora’) is a large and structured set of texts
(nowadays usually electronically stored and
processed).
 In corpus linguistics, corpora are used for
statistical analysis and hypothesis testing,
checking occurrences or validating linguistic
rules within a specific language territory.
General Corpora
Contain texts that do not belong to a single text type,
subject field, or register.
May include written or spoken language, or both.
May include texts produced in one country or
many.
They aim to represent language in its broadest
sense and to serve as a widely available resource
for baseline or comparative studies of general
linguistic features.
May be used to produce reference materials
for language learning or translation.
Often used as a baseline in comparison with more
specialized corpora.
Also sometimes known as ‘reference corpora’.

Examples
Brown Corpus – 1 million words.
LOB Corpus – 1 million words.
BNC (British National Corpus) – 100 million
words.
Specialized Corpora
 Collections of texts designed with more specific research goals
in mind – register-specific descriptions and
investigations of language.
 They aim to be representative of a given type of text.
 Used to investigate a particular type of language.
 The kinds of texts included are limited by:
 A time frame – such as a particular century.
 A social setting – such as conversations taking place in
a bookshop.
 A given topic – such as newspaper articles dealing
with a particular subject.
Examples
 Cambridge and Nottingham Corpus
of Discourse in English
(CANCODE) (informal registers of
British English) – 5 million words.
 Michigan Corpus of Academic
Spoken English (MICASE) (spoken
registers in a US academic setting) –
approximately 1.8 million words.
Historical or Diachronic Corpora
Texts from different
periods of time.
Aim at representing an
earlier stage or stages of a
language.
They help to trace the
development of a language
over time.
Example
Helsinki Corpus – texts dating from about 700 to 1700;
1.5 million words.
Regional Corpora
Aim at representing a regional variety of a
language, such as a dialect.
Learner Corpora
Aim at representing the language as produced by the
learners of a language, and they include spoken or
written language samples produced by non-native
speakers.
They are used to identify differences in learners’
frequency of word use and types of mistakes.
 They show in what respects learners differ from each other
and from native speakers of the language.
Examples
Louvain Corpus of
Native English Essays
(LOCNESS), used as a native-speaker comparison
International Corpus of Learner
English (ICLE)
20,000 words
Multilingual Corpora
 Any systematic collection of empirical language data
enabling linguists to carry out analyses of multilingual
individuals, multilingual societies or multilingual
communication.
Comparable Corpora
 Two (or more) corpora in different languages (e.g. English
and Spanish) or in different varieties of a language (e.g.
Indian English and Canadian English).
 They are designed along the same lines – will contain the
same proportions of newspaper texts, novels, casual
conversation, etc.
 Comparable corpora of varieties of the same language can be
used to compare those varieties.
 Comparable corpora of different languages can be used by
translators to identify differences and equivalences in
each language.
Example
 The International Corpus of English (ICE) consists of
comparable corpora of 1 million words each, representing
different varieties of English.
Parallel Corpora
Two (or more) corpora in different languages, each
containing texts that have been translated from one
language into the other, or texts that have been produced
simultaneously in two or more languages.
 Can be used by translators and by learners to find
potential equivalent expressions in each language and to
investigate differences between languages.
Size
Representativeness
Registers / modes / topics
Demographics
Production / reception
Research goals
Funding
Time
Staff/students
Written Corpora
 Obtaining/creating, Storing, Organizing
Materials Required:
-scanner, OCR software
Process:
-converting paper documents into electronic text files
Types:
-newspapers, periodicals
-small specialized corpora
-informal writings (travel diaries, e-mail,
discussion, blogs, news groups)
Spoken Corpora
Deciding on a transcription system
I. prosodic/non-prosodic
II. representing interactional characteristics of speech
(overlapping speech, back channels, pauses, non-verbal
contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data
Markup
1. Structural markup:
-written corpus: titles, authors, paragraphs, subheadings,
chapters, etc.
-spoken corpus: contextual events, paralinguistic features
2. Header:
-written corpus:
Classification into categories (register, genre, topic domain, discourse
mode, formality)
-spoken corpus:
Demographic information about the speaker (gender, social
class, occupation, age, native language/dialect)
Relationship among the participants
Linguistic Annotation
Part-of-Speech Tagging:
Grammatical category, case assignment
Prosodic Annotation
Phonetic Annotation
Syntactic Parsing
Advantages of Tagging
Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable
Concordance Lines
Concordance lines are a useful tool for
investigating corpora, but their use is limited by
the ability of the human observer to process
information.
Statistical calculations of collocation and corpus annotation
are methods that go beyond simple concordance lines.
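A concordance line shows each occurrence of a search word with a stretch of context on either side. The following is a minimal sketch of how such lines can be produced; the function name `kwic`, the `width` parameter and the sample sentences are assumptions made for illustration and are not taken from any particular concordancer.

```python
import re

def kwic(text, node, width=30):
    """Return one concordance line per occurrence of `node`.

    The left context is right-aligned so that the node word lines up
    in a column, as in a typical key-word-in-context display.
    """
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = ("She turned on the light. The bag was light enough to carry. "
          "They light a fire every evening.")

for line in kwic(sample, "light", width=20):
    print(line)
```

Even in this tiny sample the reader has to inspect every line by hand, which is the limitation noted above once a search returns thousands of hits.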
Frequency and Key-word Lists
 A frequency list is a list of all the types in a corpus
together with the number of occurrences of each type.
 Comparing the frequency lists for two corpora can give
interesting information about the differences between
the two sets of texts.
 e.g. Kennedy (1998):
a comparison between a corpus of Economics texts
and one of general academic English → the words price,
cost, demand, curve, firm… are frequently found in the
Economics corpus.
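A frequency list of this kind can be sketched in a few lines. The two short texts below are invented stand-ins for an Economics corpus and a general academic corpus, and the simple tokenizer is an assumption; real tools make their own decisions about what counts as a word.

```python
import re
from collections import Counter

def frequency_list(text):
    """Map each type (distinct word form) to its number of occurrences."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Invented miniature "corpora", for illustration only.
economics = "The price rose as demand rose. The firm cut the price."
general = "The study shows that the results vary. The method is general."

econ_freq = frequency_list(economics)
gen_freq = frequency_list(general)

# Print the most frequent types in the first text next to their
# frequency in the second text.
for word, count in econ_freq.most_common(5):
    print(word, count, gen_freq[word])
```

Grammatical words such as "the" appear in both lists, while topic words such as "price" appear in only one, which is the kind of difference the comparison is meant to reveal.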
Keywords
 A useful starting point in investigating a
specialized corpus.
They can be lexical items which reflect the topic
of a particular text but also grammatical words
which convey more subtle information.
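The slides do not say how keywords are calculated. One common approach compares each word's frequency in the specialized corpus with its frequency in a reference corpus using the log-likelihood statistic. The sketch below assumes that measure; the function names and the two token lists are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only words over-represented in the study corpus
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Invented token lists, for illustration only.
study = "the price of the good and the cost to the firm".split()
reference = "the aim of the study and the method of the paper".split()
print(keywords(study, reference))
```

With corpora this small the scores are not meaningful; the point is only that a word equally frequent in both corpora, such as "the", is never returned as a keyword.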
Collocation
 The tendency of words to be biased in the
way they co-occur.
 Statistical measurements of collocation are more
reliable than intuition, and for this reason a corpus is essential.
Measurements of Collocation
Computer programs that calculate collocation
take a node word and count the instances of all words
occurring within a particular span.
Note: the count
 Ignores punctuation marks.
 Counts ’s as a separate word.
 Ignores sentence boundaries.
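The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of any specific program: the function names, the default span of four words on each side, and the sample text are invented, and the tokenizer only recognises the straight apostrophe.

```python
import re
from collections import Counter

def tokenize(text):
    """Drop punctuation and treat the clitic 's as a separate token."""
    return re.findall(r"'s|[a-z]+", text.lower())

def collocates(tokens, node, span=4):
    """Count every word within `span` tokens to the left or right of
    each occurrence of `node`. Because punctuation has already been
    removed, the window runs across sentence boundaries."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sample text, for illustration only.
text = "The dog's bone. The cat saw the dog."
tokens = tokenize(text)
print(tokens)
print(collocates(tokens, "dog", span=2))
```

Raw counts like these are only the first step; statistical measures of collocation then compare them with how often each word would be expected near the node by chance.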

Tagging and Parsing
Tagging is allocating a part of speech (POS) label
to each word in a corpus.
 e.g. the word light is tagged as a verb, a
noun or an adjective each time it occurs in the
corpus.
 Parsing is analyzing the sentences in a
corpus into their constituent parts, that is, doing
a grammatical analysis.
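Once a corpus is tagged, occurrences of a word can be retrieved by part of speech. The sketch below uses a small hand-tagged sample in word_TAG format; the sentences, the tag names and the function names are assumptions for illustration and do not follow the tagset of any particular corpus.

```python
from collections import Counter

# Hand-tagged sample in word_TAG format (illustrative tagset).
tagged_text = (
    "She_PRON switched_VERB on_ADP the_DET light_NOUN ._PUNCT "
    "They_PRON light_VERB a_DET fire_NOUN ._PUNCT "
    "The_DET bag_NOUN is_VERB light_ADJ ._PUNCT"
)

def parse_tagged(text):
    """Split word_TAG items into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in text.split()]

def tags_for(word, tagged_tokens):
    """Count how often `word` carries each tag."""
    return Counter(tag for w, tag in tagged_tokens if w.lower() == word)

tokens = parse_tagged(tagged_text)
print(tags_for("light", tokens))
```

A search restricted to one tag, for example light as a verb, then becomes a simple filter over the pairs.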
Annotation
General term for tagging and parsing, and also
used to describe other kinds of categorisation that
may be performed on a corpus, e.g.
 The annotation of a spoken corpus for
prosodic features.
 The annotation of a corpus of learner English for
types of error.
 Annotation of anaphora and semantic annotation.
Software
Special software is used to analyze a corpus and to search it for
certain words or phrases.
For example
• SARA for the BNC
• ICECUP for ICE-GB (the British component of ICE).
• Concordancers can be used for the analysis of almost
any corpus.
Concordancer
 One of the most frequently used concordancers is
‘WordSmith Tools’.
 Its two most important tools are:
 Concord and WordList
 As an alternative to WordSmith, you can also use a
concordancer called ‘AntConc’, which can be
downloaded for free.
WordSmith Concord
 Click on the WordSmith icon on the desktop to
open the program. Select Concord in order to
search a corpus for a certain word or phrase. You
can now choose a corpus and select those files of
the corpus you want to analyse.
Some further options for entering a search word or
phrase:
 By using the asterisk *, you can widen the scope of your
search. For example, entering going as a search word will
provide you only with instances of going; entering going
to, only with instances of going to. If you type in go*, on the
other hand, you will get all words beginning with go-, e.g.
going, goes, gold. Searching for *ing, you will get all words
ending in -ing, e.g. swimming, dancing, sing.
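The effect of the asterisk can be imitated with Python's standard `fnmatch` module. This is only an approximation for illustration: the function name and word list are invented, and WordSmith's own wildcard syntax may differ in details (`fnmatch` also gives special meaning to `?` and `[...]`).

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, words):
    """Return the words matching a pattern in which * stands for any
    sequence of characters, including none."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# Invented word list, for illustration only.
words = ["going", "goes", "gold", "go", "swimming", "dancing", "sing", "ago"]

print(wildcard_search("going", words))  # exact form only
print(wildcard_search("go*", words))    # everything beginning with go-
print(wildcard_search("*ing", words))   # everything ending in -ing
```

As the examples gold and sing show, a wildcard matches spelling, not grammar, so the results usually need manual filtering.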
WordSmith WordList
The tool WordList generates word lists of the
selected text files and enables you to compare the
length of text files or corpora.
Moreover, you can use WordList to compare the
frequency of a word in different text files or
across genres and to identify common clusters.
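Two of these tasks can be sketched briefly: comparing a word's frequency across files of different length, and finding clusters (recurring word sequences). The sketch assumes that frequencies are normalized per 1,000 words and that a cluster is a fixed-length sequence of adjacent words; the function names and sample texts are invented, and WordList's own settings may differ.

```python
from collections import Counter

def per_thousand(word, tokens):
    """Frequency of `word` per 1,000 tokens, so that texts of different
    length can be compared."""
    return 1000 * tokens.count(word) / len(tokens)

def clusters(tokens, n=2):
    """Count every sequence of n adjacent words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample "files", for illustration only.
file_a = "i am going to go and you are going to stay".split()
file_b = "we went home".split()

print(len(file_a), len(file_b))          # length of each text in tokens
print(per_thousand("going", file_a))     # normalized frequency in file_a
print(per_thousand("going", file_b))     # normalized frequency in file_b
print(clusters(file_a, n=2).most_common(1))
```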
AntConc Concordance tool
 This tool shows the words or word strings you want
to analyse in their textual context.
 Select the files you want to analyse: File > Open file(s)
 Choose the tab "Concordance"
 Type in a search word (“Search Term”, bottom left-hand
corner)
More reliable than intuition.
Language patterns are easily identified.
Deconstruct texts to discover patterns.
Track the development of specific features in the
history of English.
Test hypotheses on specific language features
empirically.
Follow language acquisition properly.
Draw conclusions on large amounts of linguistic
data.
Shows what is frequent rather than what is possible.
Not always a complete picture.
 More communicative modes:
 spoken corpora, interactional corpora (classroom
interactions, authentic interactions, etc.), multimodal
corpora, corpora of textbook materials, etc.
 More text types and genres, to cover text types which are
less represented in corpora (letters, emails, leaflets, TV
programs, book synopses, recipes, short notes, chat room
logs, etc.).
 More longitudinal language data:
 from beginners to advanced levels, from children to adults, from L1
to L2.
 More variables:
 more language learning variables should be collected and encoded
at the time of corpus collection (proficiency, language aptitude,
motivation, more precise description of the task, of temporal, social
or situational settings, etc).
 More languages:
 to counterbalance the predominance of Anglo-Saxon native and
learner corpora and to foster the computer-aided analysis of
different languages and language families.
 Prior to Corpus Linguistics it was difficult to note patterns of
use in language, since observing and tracking usage patterns
was a monumental task.
 Scholars have used various types of corpora to gain insights
into changes related to language development, both in first
and second language situations.
 Corpus Linguistics can help in telling about language use and
how it varies in different situations.
