Corpus Linguistics 1
Texts from different
periods of time.
They aim to represent an
earlier stage (or stages) of a
language.
They help to trace the
development of a language
over time.
Example
Research goals
Funding
Time
Staff/students
Written Corpora
Obtaining/creating, Storing, Organizing
Materials Required:
-scanner, OCR software
Process:
-paper document into electronic text file
Types:
-newspapers, periodicals
-small specialized corpora
-informal writings (travel diaries, e-mail,
discussion, blogs, news groups)
Spoken Corpora
deciding on a transcription system
I. prosodic/non prosodic
II. representing interactional characteristics of speech
(overlapping speech, back channels, pauses,
non-verbal contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data
Markup
1. Structural markups:
-written corpus: Titles, authors, paragraphs, subheadings,
chapters etc.
-spoken corpus: Contextual events, paralinguistic features
2. Header:
-written corpus:
Classification into categories (register, genre, topic domain, discourse
mode, formality)
-spoken corpus:
Demographic information about the speaker (gender, social
class, occupation, age, native language/dialect)
Relationship among the participants
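As a rough illustration of why header information matters, the sketch below stores speaker metadata separately from the transcribed utterances and uses it to select a sub-corpus. The class names, field names, and sample data are all hypothetical, and speakers are identified by codes rather than names in keeping with the anonymity point above. It does not reproduce the header format of any particular corpus.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    # Hypothetical header fields; real corpora define their own schemes.
    speaker_id: str
    gender: str
    age: int
    native_language: str


@dataclass
class Utterance:
    speaker_id: str  # links the utterance back to the header entry
    text: str


# Invented sample data, for illustration only.
speakers = {
    "S1": Speaker("S1", "female", 34, "English"),
    "S2": Speaker("S2", "male", 61, "Urdu"),
}

utterances = [
    Utterance("S1", "so what do you think"),
    Utterance("S2", "mm I am not sure"),
    Utterance("S1", "right okay"),
]


def utterances_where(predicate):
    """Return the text of every utterance whose speaker satisfies predicate."""
    return [u.text for u in utterances if predicate(speakers[u.speaker_id])]


# Select only the speech of speakers over 50.
older = utterances_where(lambda s: s.age > 50)
print(older)
```

Keeping the header apart from the text means the same transcription can be regrouped by any demographic variable without editing the transcript itself.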
Linguistic Annotation
Prosodic Annotation
Phonetic Annotation
Syntactic Parsing
Advantages of Tagging
Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable
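The advantages listed above can be sketched with a toy part-of-speech-tagged corpus. The example assumes the corpus is already available as (word, tag) pairs; the tag labels, the sample sentences, and the function names are invented for illustration. It shows how tags make frequency counts, co-occurrence counts, and the different uses of one word form retrievable automatically.

```python
from collections import Counter

# Hypothetical tagged corpus: (word, part-of-speech tag) pairs.
tagged = [
    ("the", "DET"), ("bank", "NOUN"), ("raised", "VERB"), ("rates", "NOUN"),
    ("they", "PRON"), ("bank", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("river", "NOUN"), ("bank", "NOUN"),
]


def tag_frequencies(tokens, word):
    """Count how often `word` occurs with each tag (multiple-meaning studies)."""
    return Counter(tag for w, tag in tokens if w == word)


def cooccurrences(tokens, word, window=2):
    """Count the words found within `window` tokens either side of `word`."""
    counts = Counter()
    for i, (w, _) in enumerate(tokens):
        if w != word:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j][0]] += 1
    return counts


print(tag_frequencies(tagged, "bank"))
print(cooccurrences(tagged, "bank").most_common(3))
```

Without the tags, the noun and verb uses of "bank" would be counted together; with them, each use can be retrieved and counted separately.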
Concordance Lines
Concordance lines are a useful tool for
investigating corpora, but their use is limited by
the ability of the human observer to process
information.
For example
• SARA for the BNC (British National Corpus)
• ICECUP for ICE-GB (the British component of the International Corpus of English)
• Concordancers can be used for the analysis of almost
any corpus.
Concordancer
One of the most frequently used concordancers is
‘WordSmith Tools’.
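To make the idea of a concordance line concrete, here is a minimal keyword-in-context (KWIC) sketch. It is not how WordSmith Tools, SARA, or ICECUP work internally; the function name `kwic`, the `width` parameter, and the sample text are assumptions made for illustration. Each hit is shown with a few words of context on either side, aligned on the search word.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, already split into tokens by whitespace.
text = "The corpus is large . A corpus can be small . Each corpus has a purpose ."

for left, node, right in kwic(text.split(), "corpus"):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>15}  {node}  {right}")
```

Aligning the node word in a single column is what lets a reader scan for recurring patterns on either side, and it is also why the method is limited by how many lines a person can read.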
More text types and genres, to cover text types which are
less represented in corpora (letters, emails, leaflets, TV
programs, book synopses, recipes, short notes, chat room
logs, etc.).
More longitudinal language data:
from beginners to advanced levels, from children to adults, from L1
to L2.
More variables:
more language learning variables should be collected and encoded
at the time of corpus collection (proficiency, language aptitude,
motivation, more precise description of the task, of temporal, social
or situational settings, etc).
More languages:
to counterbalance the predominance of Anglo-Saxon native and
learner corpora and to foster the computer-aided analysis of
different languages and language families.
Before corpus linguistics, it was difficult to note patterns of
use in language, since observing and tracking usage patterns
by hand was a monumental task.
Scholars have used various types of corpora to gain insights
into changes related to language development, both in first
and second language situations.
Corpus linguistics can show how language is used and
how that use varies across different situations.