Corpora

The document compares and contrasts specialized and generalized corpora as well as written and spoken corpora. It outlines six common features that both specialized and generalized corpora share, as well as four distinctive features that differentiate them. Examples of each type of corpus are also provided.

Types of corpora Common features Distinctive features Examples

specialised vs 1. Textual Content: 1. Specialization: Examples of specialized corpora


generalised Both specialized and - Specialized Corpus: include the Michigan Corpus of
generalized corpora Focused on a specific Academic Spoken English
consist of written or domain, industry, or (MICASE), which contains only
spoken language data. subject area. Examples spoken language from a university
include medical texts, setting; the CHILDES Corpus
2. Diversity: Both types legal documents, or (MacWhinney, 1992), which
can include a diverse scientific publications. contains language used by children;
range of topics, genres, - Generalized Corpus: the MICUSP, Michigan Corpus of
and styles of language Encompasses a broad Upper-level Student Papers, a
use. range of topics and is not collection of papers from a range of
restricted to any university disciplines; and a medical
3. Linguistic Elements: particular domain. corpus containing language used by
Both corpora contain nurses and hospital staff.
linguistic elements such 2. Vocabulary: Specialized corpora are often used
as words, phrases, - Specialized Corpus: in ESP settings. The AWL, for
sentences, and discourse Contains domain- example, was generated from a
structures. specific vocabulary and specialized corpus of academic
terminology relevant to texts.
4. Size: They can vary its focused subject area.
in size, ranging from - Generalized Corpus: The British National Corpus (BNC)
small-scale datasets to Encompasses a more and the American National Corpus
large-scale corpora. general vocabulary that (ANC) are examples of large,
covers a wide spectrum generalized corpora. The COCA is
of topics. also an example of a generalized
corpus. These large, generalized
3. Purpose: corpora contain written texts such as
- Specialized Corpus: newspaper and magazine articles,
Often created for works of fiction and nonfiction, as
research, analysis, or well as writing from scholarly
applications within a journals; these corpora also contain
specific field or industry. spoken transcripts such as informal
- Generalized Corpus: conversations, government
Used for general proceedings, and business meetings.
language understanding, If generalizations about language as
machine learning a whole are to be drawn, a large,
training, and a wide general corpus should be consulted.
range of applications.

4. Domain-Specific
Knowledge:
- Specialized Corpus:
Requires domain-
specific knowledge to
fully understand and
interpret the content.
- Generalized Corpus:
Content is designed to be
accessible to a broader
audience without the
need for specialized
knowledge.

5. Annotation and
Metadata:
- Specialized Corpus:
May have specific
annotations or metadata
tailored to the needs of
the domain, such as
annotations for medical
entities in a biomedical
corpus.
- Generalized Corpus:
Annotations, if present,
are more likely to be
general linguistic
features rather than
domain-specific ones.

6. Usage:
- Specialized Corpus:
Often used for training
models in specific
domains, developing
domain-specific
applications, or
conducting research
within a particular field.
- Generalized Corpus:
Widely used for training
general-purpose
language models,
understanding language
patterns, and building
applications that require
a broad understanding of
language.

written vs spoken American National Corpus


1. Linguistic Elements: 1. Medium of
- Both types of corpora Communication: Bank of English
consist of linguistic - Written Corpus:
elements such as words, Comprises text that is BookCorpus
phrases, sentences, and typically produced for
discourse structures. reading, such as books,
articles, essays, or online British National Corpus
2. Contextual Variation: content.
- Both written and - Spoken Corpus: Bergen Corpus of London Teenage
spoken corpora can Consists of transcriptions Language (COLT)
exhibit variations in of spoken language,
formality, register, and including conversations, Brown Corpus, forming part of the
style based on the interviews, speeches, and "Brown Family" of corpora,
context in which the other oral interactions. together with LOB, Frown and F-
language is used. LOB
2. Orthographic
3. Grammar and Syntax: Representation: Corpus of Contemporary American
- Both types adhere to - Written Corpus: English (COCA) 425 million
grammatical and Represents language in a words, 1990–2011. Freely
syntactic rules, although standardized, searchable online
spoken language may be orthographic form with
more prone to punctuation,
colloquialisms, ellipses, capitalization, and Corpus Resource Database (CoRD),
and informal structures. proper spelling. more than 80 English-language
- Spoken Corpus: corpora.
4. Diversity: Captures the
- Both written and spontaneous nature of Examples include the British
spoken corpora can spoken language, National Corpus, the Corpus of
encompass a wide range including features like Contemporary American English
of topics, genres, and hesitations, repetitions, (COCA), and the Leipzig Corpora
communication styles. and colloquial Collection.
expressions.

3. Prosody and
Paralinguistic Features:
- Spoken Corpus:
Includes prosodic
features like intonation,
rhythm, pitch, and
paralinguistic elements
such as tone of voice,
laughter, or pauses—
crucial for conveying
meaning in spoken
language.

4. Interaction Dynamics:
- Written Corpus: Lacks
the interactive dynamics
inherent in spoken
language, where turn-
taking, interruptions, and
back-and-forth
exchanges are common.
- Spoken Corpus:
Reflects the dynamic
nature of conversations
and interactions,
including the influence
of context, speaker
intention, and listener
feedback.

5. Formality and
Register:
- Written Corpus: Tends
to exhibit more formal
language, adhering to
established conventions
and standards.
- Spoken Corpus: Can
range from formal to
informal, with features
like conversational tone,
slang, and casual
language use.

6. Production and
Reception:
- Written Corpus:
Produced with the
expectation of being read
and processed visually.
- Spoken Corpus:
Produced for oral
communication and is
often accompanied by
non-verbal cues, making
it a more interactive
form of language.

static vs dynamic 1. Textual Content: - Dynamic: The Bank of English


corpora Both static and dynamic 1. Nature of Data: (BoE) (see discussion at
corpora consist of - Static Corpus: http://corpus.byu.edu/coca/compare-boe.asp);
written or spoken Represents a fixed the News on the Web (NOW)
language data. snapshot of language corpus, which dates from 2010 and grows
data at a specific point in by about 10,000 web articles each
2. Linguistic Elements: time. The content does day.
- Both types include not change, and it
linguistic elements such remains constant for An example of a static corpus is the
as words, phrases, analysis. British National Corpus.
sentences, and discourse - Dynamic Corpus:
structures. Evolves over time,
reflecting changes in
3. Variety of Topics: language use, emerging
- Both static and trends, and
dynamic corpora can developments. New data
cover a diverse range of is added, and older data
topics, genres, and may be updated or
styles of language use. replaced.

2. Temporal Aspect:
- Static Corpus: Lacks a
temporal dimension,
making it suitable for
studying language at a
specific moment in
history or a particular
context.
- Dynamic Corpus:
Captures the temporal
aspect of language,
allowing researchers to
analyze language
changes and trends over
time.

3. Applications:
- Static Corpus: Often
used for historical
linguistics, linguistic
research that does not
require real-time data, or
applications where a
stable dataset is
sufficient.
- Dynamic Corpus:
Valuable for applications
that require up-to-date
information, such as
sentiment analysis on
social media, tracking
language trends, or
studying language
evolution in
contemporary contexts.

4. Maintenance and
Updating:
- Static Corpus: Does not
require frequent updates,
as the content remains
unchanged after the
initial compilation.
- Dynamic Corpus:
Requires regular updates
to stay relevant and
reflective of current
language use.
Continuous monitoring
and addition of new data
are necessary.

5. Availability of
Metadata:
- Static Corpus:
Metadata, if available,
may be fixed and not
subject to frequent
changes.
- Dynamic Corpus:
Metadata may evolve
along with the corpus,
providing insights into
contextual changes over
time.

6. Research Focus:
- Static Corpus: Suitable
for studies that focus on
a specific period or
context, where a stable
dataset is sufficient for
analysis.
- Dynamic Corpus: Ideal
for research exploring
language dynamics,
trends, and changes in
response to evolving
societal, cultural, or
technological factors.

monolingual vs 1. Textual Content: Monolingual Corpora: Multilingual: European Corpus


multilingual corpora - Both monolingual and 1. Single Language: Initiative (ECI) Multilingual
multilingual corpora - Monolingual corpora Corpus, containing texts (mainly
consist of written or focus exclusively on one newspapers and fiction) in over 20
spoken language data. language, providing in- languages
depth insights into the
2. Diversity: linguistic features, One of the best known large-scale
- Both types can include structures, and patterns monolingual corpora is the British
a diverse range of of that specific language. National Corpus (BNC), a 100
topics, genres, and million-word collection of samples
styles of language use. 2. Language-Specific of written and spoken language
Analysis: from a wide range of sources.
3. Linguistic Elements: - Monolingual corpora
- Both corpora contain are used for tasks such as Brown Corpus:
linguistic elements such studying syntax, Language: English
as words, phrases, semantics, and other Description: A well-known corpus
sentences, and discourse linguistic properties of American English, originally
structures. within the confines of a created in the 1960s, widely used
particular language. for linguistic research.
4. Size: Linguistic Data Consortium
- Both monolingual and 3. Vocabulary Depth: (LDC) Corpora:
multilingual corpora can - Monolingual corpora Language: Various
vary in size, ranging allow for a deeper Description: The LDC provides a
from small-scale exploration of the variety of monolingual corpora in
datasets to large-scale vocabulary and idiomatic different languages for research
collections. expressions specific to a purposes. It includes corpora in
particular language. languages such as Chinese, Arabic,
Spanish, and many others.
4. Native Speaker
Considerations:
- Monolingual corpora
are often created with a
focus on content
produced by native
speakers of the language,
ensuring linguistic
authenticity.

Multilingual Corpora:

1. Multiple Languages:
- Multilingual corpora
include texts or speech
samples from more than
one language, enabling
researchers to study
language interactions,
translation, and cross-
linguistic phenomena.

2. Cross-Linguistic
Studies:
- Multilingual corpora
facilitate comparative
analysis and research on
language universals and
differences across
multiple languages.

3. Code-Switching and
Language Mixing:
- Multilingual corpora
may include instances of
code-switching or
language mixing,
reflecting the dynamic
use of multiple
languages in
communication.

4. Translation
Resources:
- Multilingual corpora
are valuable for
developing and
improving translation
models, as they provide
aligned texts in different
languages.

5. Global Applications:
- Multilingual corpora
are particularly useful in
applications where a
global perspective is
needed, such as
developing models for
machine translation,
cross-lingual information
retrieval, or multilingual
natural language
processing tasks.
synchronic vs 1. Textual Content: Synchronic Corpora: Diachronic: Hansard corpus (1803–
diachronic corpora - Both synchronic and 2005); Time Magazine corpus
diachronic corpora 1. Snapshot in Time: (1923–2006); the Siena-
consist of written or - Synchronic corpora Bologna/Portsmouth Modern
spoken language data. represent a snapshot of Diachronic Corpus (SiBol/Port)
language use at a contains 787,000 articles (385
2. Diversity: specific point in time. million tokens) from three UK
- Both types can include The data is collected or broadsheet newspapers in 1993,
a diverse range of compiled to provide 2005 and 2010
topics, genres, and insights into the
styles of language use. language as it exists Synchronic corpora :
within a particular
3. Linguistic Elements: period. 1. British National Corpus (BNC):
- Both corpora contain - Language: English (British)
linguistic elements such 2. Contemporary - Description: A large synchronic
Analysis:
- Synchronic corpora are
used for contemporary
linguistic analysis,
studying language
structures, meanings, and
usage patterns within a
specific time frame.

3. Single Time Frame:


- Synchronic corpora corpus representing written and
focus on language data spoken British English from the late
from a single time frame, 20th century, covering a wide range
allowing researchers to of genres.
analyze the language as
it is used synchronously. 2. Corpus of Contemporary
American English (COCA):
4. Example: - Language: English (American)
- A corpus of English - Description: A substantial
news articles from the synchronic corpus of American
year 2020 is an example English, covering a diverse set of
of a synchronic corpus. written and spoken texts from the
1990s to the present.
Diachronic Corpora:
3. Corpus de Referencia del Español
as words, phrases, 1. Temporal Evolution: Actual (CREA):
sentences, and discourse - Diachronic corpora - Language: Spanish
structures. span multiple time - Description: A synchronic corpus
periods, allowing of contemporary Spanish,
4. Size: researchers to track containing a broad spectrum of
- Both synchronic and changes and evolution in written and spoken texts from the
diachronic corpora can language use over time. late 20th century to the present.
vary in size, ranging
from small-scale 2. Historical Analysis: 4. Lancaster-Oslo/Bergen Corpus
datasets to large-scale - Diachronic corpora are (LOB):
collections. valuable for historical - Language: English (British)
linguistic analysis, - Description: A synchronic corpus
studying how language of written British English texts from
structures, meanings, and the early 1960s, offering a historical
usage patterns evolve perspective on language use.
and change over
different epochs. 5. Toronto Corpus of Old Japanese
(TCOJ):
3. Comparison Across - Language: Japanese (Old
Time: Japanese)
- Diachronic corpora - Description: A synchronic corpus
enable researchers to focusing on Old Japanese texts,
compare linguistic providing insights into the language
features across different as it existed during specific
time periods, identifying historical periods.
trends, linguistic shifts,
and historical
developments.

4. Example:
- A corpus containing
English literature
samples from the 17th,
18th, and 19th centuries
is an example of a
diachronic corpus.
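
The comparison across time periods described in this row can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the two "periods" are invented toy sentences, not samples from any of the corpora named above, and the helper name `per_million` is hypothetical. Counts are normalised per million tokens so that sub-corpora of different sizes can be compared.

```python
from collections import Counter


def per_million(tokens, word):
    """Return the frequency of `word` per million tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    counts = Counter(token.lower() for token in tokens)
    return counts[word.lower()] / len(tokens) * 1_000_000


# Hypothetical toy sub-corpora keyed by period (invented for illustration only).
diachronic = {
    "1700s": "thou art welcome and thou shalt stay".split(),
    "1900s": "you are welcome and you should stay".split(),
}

# A diachronic query compares the same measure across periods.
trend = {period: per_million(tokens, "thou") for period, tokens in diachronic.items()}

# A synchronic query looks at one period only.
snapshot = per_million(diachronic["1900s"], "you")
```

Here `trend` holds one normalised value per period, which is the kind of series a diachronic corpus makes possible, while `snapshot` uses a single period, as a synchronic corpus would.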
parallel corpora vs 1. Textual Content: Parallel Corpora: EUR-LEX is a multilingual parallel
comparable corpora - Both parallel and corpus of European Union
comparable corpora 1. Bilingual or documents translated into the
consist of written or Multilingual: official European languages
spoken language data. - Parallel corpora contain
texts in two or more Very large corpora like the
2. Diversity: languages that are Canadian Hansard and the EC
- Both types can include translations of each database of texts are parallel
a diverse range of other. The texts are corpora. Smaller general corpora,
topics, genres, and aligned at the sentence or complete with tagging, like the
styles of language use. segment level. ICAME corpora (Brown, LOB etc)
and similar corpora in other
3. Linguistic Elements: 2. Translation languages, offer the possibility of
- Both corpora contain Resources: more controlled comparative and
linguistic elements such - Parallel corpora are contrastive research into general
as words, phrases, often used as valuable language at all levels. Newspaper
sentences, and discourse resources for training corpora are a popular solution in the
structures. and evaluating machine quest for ‘concurrent corpora’.
translation systems, as Examples we have been involved
4. Size: they provide with at FLUP are corpora of war
- Both parallel and corresponding texts in reports, and football during the
comparable corpora can multiple languages. World Cup, and another possibility
vary in size, ranging would be political texts during
from small-scale 3. Example: election campaigns. One can also
datasets to large-scale - A parallel corpus might compare styles of journalism by
collections. include English and comparing individual journalists.
French versions of the
same documents, such as
European Union
documents that are
translated into multiple
languages.

Comparable Corpora:

1. No Direct Translation:
- Comparable corpora
consist of texts in
different languages or
dialects that do not have
direct translations. The
texts are related in some
way, such as addressing
similar topics or genres.

2. Cross-Linguistic
Studies:
- Comparable corpora
are used for cross-
linguistic studies,
contrastive analysis, and
exploring linguistic
variation across
languages.

3. Example:
- A comparable corpus
might include news
articles on similar topics
in English and Spanish,
where the articles are not
translations of each other
but share a thematic
similarity.

4. Genre or Topic
Similarity:
- The similarity in
comparable corpora is
often based on thematic
content, such as
documents discussing
the same subject matter
or belonging to the same
genre.

Purpose and Use:

- Parallel Corpora:
- Primarily used for
machine translation
research, bilingual
lexicography, and cross-
language studies where
translations are crucial
for the analysis.

- Comparable Corpora:
- Useful for contrastive
analysis, studying
language variation, and
exploring differences
and similarities in usage
across languages or
dialects.
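
The structural difference between the two corpus types in this row can be sketched in Python. This is a minimal illustration, not how any named corpus is actually stored: the sentences are invented, `AlignedPair` and `align_by_position` are hypothetical names, and the sketch assumes the translations are already split into the same number of sentences in the same order. Real alignment usually needs a dedicated aligner because translations are not always one-to-one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence and its translation in a parallel corpus."""
    source: str
    target: str


def align_by_position(src_sentences, tgt_sentences):
    """Pair sentences by position; assumes a strict one-to-one translation."""
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("sentence counts differ; positional alignment is not possible")
    return [AlignedPair(s, t) for s, t in zip(src_sentences, tgt_sentences)]


# Parallel corpus: the French text is a translation of the English text.
english = ["The session is open.", "The vote will take place tomorrow."]
french = ["La séance est ouverte.", "Le vote aura lieu demain."]
parallel = align_by_position(english, french)

# Comparable corpus: texts share a topic (here, football) but are not
# translations of each other, so there is nothing to align.
comparable = {
    "en": ["The home side won the match in extra time."],
    "es": ["El portero detuvo dos penaltis en la final.", "La afición celebró en la plaza."],
}
```

The parallel data supports lookups such as `parallel[0].target`, whereas the comparable data can only be compared at the level of whole collections, for example by vocabulary or frequency.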
native vs. (non- 1. Textual Content: Native Corpora: A well-known learner corpus is the
native) learner - Both native and International Corpus of Learner
corpora learner corpora consist 1. Produced by Native English (ICLE) (Granger, 2003),
of written or spoken Speakers: which contains essays written by
language data. - Native corpora are English language learners with 14
composed of texts or different native languages. While
2. Diversity: speech produced by the ICLE is more generalized,
- Both types can include individuals who are containing writings from learners
a diverse range of native speakers of the with 14 different native languages,
topics, genres, and language. The language other learner corpora are more
styles of language use. use is considered specialized; for example, the
authentic and Standard Speaking Test Corpus
3. Linguistic Elements: representative of natural (SST), comprised of oral interview
- Both corpora contain language patterns. tests of Japanese learners. Targeted
linguistic elements such instruction can be developed for
as words, phrases, 2. Dialectal Variations: general language teaching or for
sentences, and discourse - Native corpora may specific language groups depending
structures. reflect dialectal on the type of learner corpus.
variations within the Chapter 7 will look at corpus-
4. Size: native language, designed activities created from a
- Both native and capturing regional or learner corpus.
learner corpora can vary sociolinguistic nuances.
in size, ranging from Native corpora:
small-scale datasets to 3. Vocabulary Depth:
large-scale collections. - Native corpora provide 1. British National Corpus (BNC):
in-depth exploration of - Language: English (British)
vocabulary, idiomatic - Description: A large corpus of
expressions, and cultural written and spoken British English,
references inherent to the representing the language as used
native language. by native speakers across various
genres and contexts.
4. Example:
- A native English 2. Corpus Diacrónico del Español (CORDE):
corpus might include - Language: Spanish
texts written or spoken - Description: A corpus of written
by native English Spanish texts, including literature,
speakers across different journalism, and other genres,
regions and contexts. produced by native speakers of
Spanish.
Learner Corpora:

1. Produced by
Language Learners:
- Learner corpora consist
of texts or speech
produced by individuals
learning the language.
These texts often reflect
the linguistic challenges
and developmental
stages of language
acquisition.

2. Error Analysis:
- Learner corpora are
valuable for error
analysis, as they can
highlight common
mistakes, syntactic
structures, and lexical
choices made by
language learners.
3. Developmental
Stages:
- Learner corpora may
capture language use at
different proficiency
levels, providing insights
into how language skills
evolve over time.

4. Example:
- A learner English
corpus might include
writings or speech
samples from individuals
at various proficiency
levels, from beginners to
advanced learners.

Purpose and Use:

- Native Corpora:
- Used for linguistic
analysis, studying
natural language
patterns, and training
models for applications
requiring an
understanding of
authentic language use.

- Learner Corpora:
- Used for second
language acquisition
research, developing
language learning
materials, and assessing
the linguistic challenges
faced by learners.

Native corpora (continued):

3. Deutsches Textarchiv (DTA):
- Language: German
- Description: A collection of
written German texts from the early
modern period to the early 20th
century, reflecting the language as
used by native speakers over time.

4. Corpus of Contemporary
American English (COCA):
- Language: English (American)
- Description: A large corpus of
written and spoken American
English, comprising texts produced
by native speakers across different
genres and time periods.
raw vs. annotated 1. Textual Content: Raw Corpora: Raw :
corpora - Both raw and
annotated corpora 1. Unprocessed Text: 1. OpenSubtitles:
consist of written or - Raw corpora consist of - Language: Various
spoken language data. unprocessed, - Description: A large collection of
unannotated text. The movie and television show subtitles
2. Diversity: content is presented in its in multiple languages, providing
- Both types can include natural state without raw text data extracted from video
a diverse range of additional linguistic content.
topics, genres, and information.
styles of language use. 2. Wikimedia Dumps:
2. Lack of Markup: - Language: Various
3. Linguistic Elements: - Raw corpora lack - Description: Periodic dumps of
- Both corpora contain linguistic annotations, content from Wikipedia and other
linguistic elements such tags, or metadata that Wikimedia projects, providing raw
as words, phrases, provide information textual data from articles,
sentences, and discourse about specific linguistic discussions, and other contributions.
structures. features, such as part-of-
speech tags, named 3. Brown Corpus:
4. Size: entities, or syntactic - Language: English
- Both raw and structures. - Description: A classic corpus of
annotated corpora can American English that includes raw
vary in size, ranging 3. Flexibility: text data from a variety of sources,
from small-scale - Raw corpora offer such as fiction, news, and academic
datasets to large-scale flexibility for various writing.
collections. linguistic analyses and
applications, but the 4. European Parliament Proceedings
absence of annotations Parallel Corpus 1996-2011:
may require additional - Language: Multiple European
processing for specific languages
tasks. - Description: A collection of raw
texts from the proceedings of the
4. Example: European Parliament, available in
- A collection of news multiple languages, reflecting
articles in their original unprocessed official records.
form without any added
linguistic annotations is 5. Open American National Corpus
an example of a raw (OANC):
corpus. - Language: English
- Description: A collection of raw
Annotated Corpora: text data in American English,
covering various genres such as
1. Linguistic Markup: fiction, newspaper articles, and
- Annotated corpora conversation transcripts.
have additional linguistic
annotations or markup, Annotated:
providing information
about specific linguistic 1. CoNLL 2003 NER Corpus:
features. Annotations - Language: English
can include part-of- - Annotation: Named entity
speech tags, named recognition
entity tags, syntactic - Description: A dataset from the
structures, sentiment CoNLL 2003 shared task, annotated
labels, etc. for named entity recognition, widely
used for training and evaluating
2. Enhanced models in this domain.
Information:
- Annotations enhance 2. Universal Dependencies (UD):
the corpus by providing - Language: Multiple languages
detailed information that - Annotation: Dependency parsing
facilitates more specific - Description: A project aiming to
linguistic analyses, such create cross-linguistically consistent
as training machine treebanks with annotations for
learning models or dependency parsing in multiple
conducting targeted languages.
linguistic research.
3. Task-Specific
Annotations:
- Annotations can be
tailored to specific tasks,
such as sentiment
analysis, named entity
recognition, or machine
translation, making
annotated corpora
valuable for training and
evaluating models in
these areas.

4. Example:
- A corpus of movie
reviews with sentiment
annotations indicating
positive or negative
sentiments associated
with each review is an
example of an annotated
corpus.

Purpose and Use:

- Raw Corpora:
- Used for general
linguistic analysis,
exploring language
patterns, and providing
flexibility for various
research applications.

- Annotated Corpora:
- Used for training and
evaluating models in
natural language
processing tasks,
developing linguistic
resources, and
conducting targeted
analyses requiring
specific linguistic
information.

Annotated (continued):

3. GUM (Georgetown University
Multilayer Corpus):
- Language: English
- Annotation: Morphosyntactic,
semantic role labeling
- Description: A corpus of diverse
English texts with annotations for
morphosyntactic features and
semantic role labeling.

4. Chinese Treebank (CTB):
- Language: Chinese
- Annotation: Part-of-speech
tagging, syntactic tree structures
- Description: Annotated corpus for
Chinese, including annotations for
part-of-speech tags and syntactic
tree structures.
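
The difference between raw and annotated data described in this row can be shown with a short Python sketch. It is an illustration only: the sentence is invented, the part-of-speech tags were assigned by hand using Universal POS-style labels (an assumption, not the tag set of any corpus listed above), and the helper names `tag_distribution` and `words_with_tag` are hypothetical.

```python
from collections import Counter

# Raw corpus: plain text with no linguistic markup.
raw_text = "The cat sat on the mat ."

# Annotated corpus: the same tokens, each paired with a hand-assigned
# part-of-speech tag (Universal POS-style labels, assumed for illustration).
annotated = [
    ("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("mat", "NOUN"), (".", "PUNCT"),
]


def tag_distribution(tagged_tokens):
    """Count how often each tag occurs in an annotated corpus."""
    return Counter(tag for _, tag in tagged_tokens)


def words_with_tag(tagged_tokens, tag):
    """Return the words carrying a given tag, in corpus order."""
    return [word for word, token_tag in tagged_tokens if token_tag == tag]


# With raw text, only surface operations such as tokenising are possible.
tokens = raw_text.split()

# With annotations, queries over linguistic categories become possible.
nouns = words_with_tag(annotated, "NOUN")
distribution = tag_distribution(annotated)
```

The raw version supports word counts and string searches; the annotated version additionally answers questions such as "which nouns occur?" without further processing.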
