Corpora
4. Domain-Specific
Knowledge:
- Specialized Corpus:
Requires domain-
specific knowledge to
fully understand and
interpret the content.
- Generalized Corpus:
Content is designed to be
accessible to a broader
audience without the
need for specialized
knowledge.
5. Annotation and
Metadata:
- Specialized Corpus:
May have specific
annotations or metadata
tailored to the needs of
the domain, such as
annotations for medical
entities in a biomedical
corpus.
- Generalized Corpus:
Annotations, if present,
are more likely to be
general linguistic
features rather than
domain-specific ones.
6. Usage:
- Specialized Corpus:
Often used for training
models in specific
domains, developing
domain-specific
applications, or
conducting research
within a particular field.
- Generalized Corpus:
Widely used for training
general-purpose
language models,
understanding language
patterns, and building
applications that require
a broad understanding of
language.
3. Prosody and
Paralinguistic Features:
- Spoken Corpus:
Includes prosodic
features such as intonation,
rhythm, and pitch, as well as
paralinguistic elements
such as tone of voice,
laughter, and pauses, all of
which are crucial for
conveying meaning in
spoken language.
4. Interaction Dynamics:
- Written Corpus: Lacks
the interactive dynamics
inherent in spoken
language, where turn-
taking, interruptions, and
back-and-forth
exchanges are common.
- Spoken Corpus:
Reflects the dynamic
nature of conversations
and interactions,
including the influence
of context, speaker
intention, and listener
feedback.
5. Formality and
Register:
- Written Corpus: Tends
to exhibit more formal
language, adhering to
established conventions
and standards.
- Spoken Corpus: Can
range from formal to
informal, with features
like conversational tone,
slang, and casual
language use.
6. Production and
Reception:
- Written Corpus:
Produced with the
expectation of being read
and processed visually.
- Spoken Corpus:
Produced for oral
communication and is
often accompanied by
non-verbal cues, making
it a more interactive
form of language.
2. Temporal Aspect:
- Static Corpus: Lacks a
temporal dimension,
making it suitable for
studying language at a
specific moment in
history or a particular
context.
- Dynamic Corpus:
Captures the temporal
aspect of language,
allowing researchers to
analyze language
changes and trends over
time.
3. Applications:
- Static Corpus: Often
used for historical
linguistics, linguistic
research that does not
require real-time data, or
applications where a
stable dataset is
sufficient.
- Dynamic Corpus:
Valuable for applications
that require up-to-date
information, such as
sentiment analysis on
social media, tracking
language trends, or
studying language
evolution in
contemporary contexts.
4. Maintenance and
Updating:
- Static Corpus: Does not
require frequent updates,
as the content remains
unchanged after the
initial compilation.
- Dynamic Corpus:
Requires regular updates
to stay relevant and
reflective of current
language use.
Continuous monitoring
and addition of new data
are necessary.
5. Availability of
Metadata:
- Static Corpus:
Metadata, if available,
may be fixed and not
subject to frequent
changes.
- Dynamic Corpus:
Metadata may evolve
along with the corpus,
providing insights into
contextual changes over
time.
6. Research Focus:
- Static Corpus: Suitable
for studies that focus on
a specific period or
context, where a stable
dataset is sufficient for
analysis.
- Dynamic Corpus: Ideal
for research exploring
language dynamics,
trends, and changes in
response to evolving
societal, cultural, or
technological factors.
Multilingual Corpora:
1. Multiple Languages:
- Multilingual corpora
include texts or speech
samples from more than
one language, enabling
researchers to study
language interactions,
translation, and cross-
linguistic phenomena.
2. Cross-Linguistic
Studies:
- Multilingual corpora
facilitate comparative
analysis and research on
language universals and
differences across
multiple languages.
3. Code-Switching and
Language Mixing:
- Multilingual corpora
may include instances of
code-switching or
language mixing,
reflecting the dynamic
use of multiple
languages in
communication.
4. Translation
Resources:
- Multilingual corpora
are valuable for
developing and
improving translation
models, especially when
they provide aligned texts
in different languages.
5. Global Applications:
- Multilingual corpora
are particularly useful in
applications where a
global perspective is
needed, such as
developing models for
machine translation,
cross-lingual information
retrieval, or multilingual
natural language
processing tasks.
Synchronic vs. Diachronic Corpora:

Similarities:
1. Textual Content:
- Both synchronic and diachronic corpora consist of written or spoken language data.
2. Diversity:
- Both types can include a diverse range of topics, genres, and styles of language use.
3. Linguistic Elements:
- Both corpora contain linguistic elements such as words, phrases, sentences, and discourse structures.

Examples:
Diachronic corpora:
- Hansard corpus (1803–2005)
- Time Magazine corpus (1923–2006)
- The Siena Bologna/Portsmouth Modern Diachronic Corpus (SiBol/Port), which contains 787,000 articles (385 million tokens) from three UK broadsheet newspapers in 1993, 2005 and 2010
Synchronic corpora:
1. British National Corpus (BNC):
- Language: English (British)
- Description: A large synchronic corpus of written and spoken British English.

Differences:
Synchronic Corpora:
1. Snapshot in Time:
- Synchronic corpora represent a snapshot of language use at a specific point in time. The data is collected or compiled to provide insights into the language as it exists within a particular period.
2. Contemporary Analysis:
- Synchronic corpora are used for contemporary linguistic analysis, studying language structures, meanings, and usage patterns within a specific time frame.
Diachronic Corpora:
4. Example:
- A corpus containing English literature samples from the 17th, 18th, and 19th centuries is an example of a diachronic corpus.
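Sub-corpora from different periods rarely have the same size, so diachronic comparisons are usually made on normalized frequencies rather than raw counts. The following is a minimal Python sketch of that idea. The two-period toy corpus, its texts, and the helper name `per_million` are all invented for illustration and do not come from any of the corpora named above.

```python
from collections import Counter

# Toy diachronic corpus: period label -> list of texts (invented examples).
corpus = {
    "17th century": ["thou art kind", "thou hast spoken"],
    "19th century": ["you are kind", "you have spoken to you"],
}

def per_million(word, texts):
    """Relative frequency of `word` per million tokens in `texts`."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000

# Compare the same two pronouns across periods on a common scale.
for period, texts in corpus.items():
    print(period, round(per_million("thou", texts)), round(per_million("you", texts)))
```

With real data the same comparison would be run over thousands of texts per period, and whitespace splitting would be replaced by a proper tokenizer.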
Parallel vs. Comparable Corpora:

Similarities:
1. Textual Content:
- Both parallel and comparable corpora consist of written or spoken language data.
2. Diversity:
- Both types can include a diverse range of topics, genres, and styles of language use.
3. Linguistic Elements:
- Both corpora contain linguistic elements such as words, phrases, sentences, and discourse structures.
4. Size:
- Both parallel and comparable corpora can vary in size, ranging from small-scale datasets to large-scale collections.

Examples:
EUR-Lex is a multilingual parallel corpus of European Union documents translated into the official European languages.
Very large corpora like the Canadian Hansard and the EC database of texts are parallel corpora. Smaller general corpora, complete with tagging, like the ICAME corpora (Brown, LOB, etc.) and similar corpora in other languages, offer the possibility of more controlled comparative and contrastive research into general language at all levels. Newspaper corpora are a popular solution in the quest for ‘concurrent corpora’. Examples we have been involved with at FLUP are corpora of war reports, and football during the World Cup, and another possibility would be political texts during election campaigns. One can also compare styles of journalism by comparing individual journalists.

Differences:
Parallel Corpora:
1. Bilingual or Multilingual:
- Parallel corpora contain texts in two or more languages that are translations of each other. The texts are aligned at the sentence or segment level.
2. Translation Resources:
- Parallel corpora are often used as valuable resources for training and evaluating machine translation systems, as they provide corresponding texts in multiple languages.
3. Example:
- A parallel corpus might include English and French versions of the same documents, such as European Union documents that are translated into multiple languages.
Comparable Corpora:
1. No Direct Translation:
- Comparable corpora
consist of texts in
different languages or
dialects that do not have
direct translations. The
texts are related in some
way, such as addressing
similar topics or genres.
2. Cross-Linguistic
Studies:
- Comparable corpora
are used for cross-
linguistic studies,
contrastive analysis, and
exploring linguistic
variation across
languages.
3. Example:
- A comparable corpus
might include news
articles on similar topics
in English and Spanish,
where the articles are not
translations of each other
but share a thematic
similarity.
4. Genre or Topic
Similarity:
- The similarity in
comparable corpora is
often based on thematic
content, such as
documents discussing
the same subject matter
or belonging to the same
genre.
- Parallel Corpora:
- Primarily used for
machine translation
research, bilingual
lexicography, and cross-
language studies where
translations are crucial
for the analysis.
- Comparable Corpora:
- Useful for contrastive
analysis, studying
language variation, and
exploring differences
and similarities in usage
across languages or
dialects.
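The structural difference between the two corpus types can be made concrete with a small data-structure sketch in Python. It assumes a toy English-French parallel corpus and a toy English-Spanish comparable corpus. All sentences, the `AlignedPair` class, and the `translation_lookup` helper are hypothetical and only illustrate the layout, not any real corpus format.

```python
from dataclasses import dataclass

@dataclass
class AlignedPair:
    """One sentence-level alignment unit of a parallel corpus."""
    source: str  # sentence in the source language (here English)
    target: str  # its translation (here French)

# Parallel corpus: every unit links a sentence to its translation.
parallel = [
    AlignedPair("The session is open.", "La séance est ouverte."),
    AlignedPair("Thank you.", "Merci."),
]

# Comparable corpus: documents share a topic but are not translations,
# so there is nothing to align below the document level.
comparable = {
    "football": {
        "en": ["The match ended in a draw after extra time."],
        "es": ["El portero fue la figura del partido."],
    },
}

def translation_lookup(pairs, sentence):
    """Return the aligned target sentence, or None if the source is absent."""
    for pair in pairs:
        if pair.source == sentence:
            return pair.target
    return None

print(translation_lookup(parallel, "Thank you."))
print(sorted(comparable["football"]))
```

A lookup like this is only possible on the parallel side. With the comparable corpus, analysis has to work on aggregate properties of each language's documents, such as vocabulary or style within the shared topic.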
Native vs. (Non-Native) Learner Corpora:

Similarities:
1. Textual Content:
- Both native and learner corpora consist of written or spoken language data.
2. Diversity:
- Both types can include a diverse range of topics, genres, and styles of language use.
3. Linguistic Elements:
- Both corpora contain linguistic elements such as words, phrases, sentences, and discourse structures.
4. Size:
- Both native and learner corpora can vary in size, ranging from small-scale datasets to large-scale collections.

Examples:
A well-known learner corpus is the International Corpus of Learner English (ICLE) (Granger, 2003), which contains essays written by English language learners with 14 different native languages. While the ICLE is more generalized, other learner corpora are more specialized; for example, the Standard Speaking Test Corpus (SST) is composed of oral interview tests of Japanese learners. Targeted instruction can be developed for general language teaching or for specific language groups depending on the type of learner corpus. Chapter 7 will look at corpus-designed activities created from a learner corpus.
Native corpora:
1. British National Corpus (BNC):
- Language: English (British)
- Description: A large corpus of written and spoken British English, representing the language as used by native speakers across various genres and contexts.
2. Corpus Diacrónico del Español (CORDE):
- Language: Spanish
- Description: A corpus of written Spanish texts, including literature, journalism, and other genres, produced by native speakers of Spanish.
3. Deutsches Textarchiv (DTA):
- Language: German
- Description: A collection of written German texts from the early modern period to the early 20th century, reflecting the language as used by native speakers over time.
4. Corpus of Contemporary American English (COCA):
- Language: English (American)
- Description: A large corpus of written and spoken American English, comprising texts produced by native speakers across different genres and time periods.

Differences:
Native Corpora:
1. Produced by Native Speakers:
- Native corpora are composed of texts or speech produced by individuals who are native speakers of the language. The language use is considered authentic and representative of natural language patterns.
2. Dialectal Variations:
- Native corpora may reflect dialectal variations within the native language, capturing regional or sociolinguistic nuances.
3. Vocabulary Depth:
- Native corpora provide in-depth exploration of vocabulary, idiomatic expressions, and cultural references inherent to the native language.
4. Example:
- A native English corpus might include texts written or spoken by native English speakers across different regions and contexts.
Learner Corpora:
1. Produced by Language Learners:
- Learner corpora consist of texts or speech produced by individuals learning the language. These texts often reflect the linguistic challenges and developmental stages of language acquisition.
2. Error Analysis:
- Learner corpora are valuable for error analysis, as they can highlight common mistakes, syntactic structures, and lexical choices made by language learners.
3. Developmental Stages:
- Learner corpora may capture language use at different proficiency levels, providing insights into how language skills evolve over time.
4. Example:
- A learner English corpus might include writings or speech samples from individuals at various proficiency levels, from beginners to advanced learners.
- Native Corpora:
- Used for linguistic
analysis, studying
natural language
patterns, and training
models for applications
requiring an
understanding of
authentic language use.
- Learner Corpora:
- Used for second
language acquisition
research, developing
language learning
materials, and assessing
the linguistic challenges
faced by learners.
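Error analysis on a learner corpus often comes down to counting error categories per proficiency level. The Python sketch below shows one way this could look. It assumes errors have already been marked inline as `[TYPE: erroneous text]`, which is a made-up tagging scheme for this example and not the format of ICLE, SST, or any other real learner corpus. The sentences and the tag names (AGR, ART, PL) are likewise invented.

```python
import re
from collections import Counter

# Invented learner sentences; errors are marked as [TYPE: erroneous text].
learner_texts = [
    ("beginner", "She [AGR: go] to school and buy [ART: a] apples."),
    ("beginner", "He [AGR: have] two [PL: child]."),
    ("advanced", "The results [AGR: shows] a clear trend."),
]

# Captures the error type, e.g. "AGR" from "[AGR: go]".
ERROR_TAG = re.compile(r"\[([A-Z]+):[^\]]*\]")

def error_counts(texts):
    """Count error tags per proficiency level."""
    counts = {}
    for level, text in texts:
        counts.setdefault(level, Counter()).update(ERROR_TAG.findall(text))
    return counts

for level, counter in error_counts(learner_texts).items():
    print(level, counter.most_common())
```

Grouping by level is what connects error analysis to developmental stages: the same count, run per level, shows which error types fade as proficiency rises.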
Raw vs. Annotated Corpora:

Similarities:
1. Textual Content:
- Both raw and annotated corpora consist of written or spoken language data.
2. Diversity:
- Both types can include a diverse range of topics, genres, and styles of language use.
3. Linguistic Elements:
- Both corpora contain linguistic elements such as words, phrases, sentences, and discourse structures.
4. Size:
- Both raw and annotated corpora can vary in size, ranging from small-scale datasets to large-scale collections.

Examples:
Raw:
1. OpenSubtitles:
- Language: Various
- Description: A large collection of movie and television show subtitles in multiple languages, providing raw text data extracted from video content.
2. Wikimedia Dumps:
- Language: Various
- Description: Periodic dumps of content from Wikipedia and other Wikimedia projects, providing raw textual data from articles, discussions, and other contributions.
3. Brown Corpus:
- Language: English
- Description: A classic corpus of American English that includes raw text data from a variety of sources, such as fiction, news, and academic writing.
4. European Parliament Proceedings Parallel Corpus 1996-2011:
- Language: Multiple European languages
- Description: A collection of raw texts from the proceedings of the European Parliament, available in multiple languages, reflecting unprocessed official records.
5. Open American National Corpus (OANC):
- Language: English
- Description: A collection of raw text data in American English, covering various genres such as fiction, newspaper articles, and conversation transcripts.
Annotated:
1. CoNLL-2003 NER Corpus:
- Language: English
- Annotation: Named entity recognition
- Description: A dataset from the CoNLL-2003 shared task, annotated for named entity recognition, widely used for training and evaluating models in this domain.
2. Universal Dependencies (UD):
- Language: Multiple languages
- Annotation: Dependency parsing
- Description: A project aiming to create cross-linguistically consistent treebanks with annotations for dependency parsing in multiple languages.
3. GUM (Georgetown University Multilayer Corpus):
- Language: English
- Annotation: Morphosyntactic, semantic role labeling
- Description: A corpus of diverse English texts with annotations for morphosyntactic features and semantic role labeling.
4. Chinese Treebank (CTB):
- Language: Chinese
- Annotation: Part-of-speech tagging, syntactic tree structures
- Description: Annotated corpus for Chinese, including annotations for part-of-speech tags and syntactic tree structures.

Differences:
Raw Corpora:
1. Unprocessed Text:
- Raw corpora consist of unprocessed, unannotated text. The content is presented in its natural state without additional linguistic information.
2. Lack of Markup:
- Raw corpora lack linguistic annotations, tags, or metadata that provide information about specific linguistic features, such as part-of-speech tags, named entities, or syntactic structures.
3. Flexibility:
- Raw corpora offer flexibility for various linguistic analyses and applications, but the absence of annotations may require additional processing for specific tasks.
4. Example:
- A collection of news articles in their original form without any added linguistic annotations is an example of a raw corpus.
Annotated Corpora:
1. Linguistic Markup:
- Annotated corpora have additional linguistic annotations or markup, providing information about specific linguistic features. Annotations can include part-of-speech tags, named entity tags, syntactic structures, sentiment labels, etc.
2. Enhanced Information:
- Annotations enhance the corpus by providing detailed information that facilitates more specific linguistic analyses, such as training machine learning models or conducting targeted linguistic research.
3. Task-Specific Annotations:
- Annotations can be tailored to specific tasks, such as sentiment analysis, named entity recognition, or machine translation, making annotated corpora valuable for training and evaluating models in these areas.
4. Example:
- A corpus of movie reviews with sentiment annotations indicating positive or negative sentiments associated with each review is an example of an annotated corpus.
Purpose and Use:
- Raw Corpora:
- Used for general linguistic analysis, exploring language patterns, and providing flexibility for various research applications.
- Annotated Corpora:
- Used for training and
evaluating models in
natural language
processing tasks,
developing linguistic
resources, and
conducting targeted
analyses requiring
specific linguistic
information.
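To make the contrast concrete, the sketch below shows one sentence as it might appear in a raw corpus and again with an annotation layer. It uses Python and a tab-separated, one-token-per-line layout loosely modelled on CoNLL-style files. The sentence, the hand-assigned tags, and the `read_annotated` helper are assumptions made for illustration, not an excerpt from any corpus listed above.

```python
# Raw corpus entry: plain text, no linguistic information attached.
raw_sentence = "Marie Curie worked in Paris ."

# Annotated version of the same sentence, one token per line with
# hand-assigned part-of-speech and named-entity (BIO) tags.
annotated_sentence = """\
Marie\tPROPN\tB-PER
Curie\tPROPN\tI-PER
worked\tVERB\tO
in\tADP\tO
Paris\tPROPN\tB-LOC
.\tPUNCT\tO
"""

def read_annotated(block):
    """Parse tab-separated 'token, POS, NER' lines into tuples."""
    return [tuple(line.split("\t")) for line in block.splitlines() if line]

tokens = read_annotated(annotated_sentence)

# The annotation layer answers questions the raw text cannot answer directly.
proper_nouns = [tok for tok, pos, _ in tokens if pos == "PROPN"]
entity_starts = [(tok, ner[2:]) for tok, _, ner in tokens if ner.startswith("B-")]

print(proper_nouns)
print(entity_starts)
```

The raw string supports only surface operations such as word counts, while the annotated version lets a query select tokens by category. This is why annotated corpora are the usual choice for training and evaluating taggers and entity recognizers.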