During the last century, many definitions of corpus have been given. All of them underline that a corpus is a collection of texts, or of parts of texts, which are representative of the state of a language, or of a variety of it, with the purpose of obtaining an overall description of it. One of the most complete definitions was given by Stefan Gries, who states that a corpus is a text or a collection of texts (written or spoken) produced in a natural communicative setting, compiled with the intention of being representative and balanced, and analysed linguistically. What matters is that the texts are in digital form, typically in TXT format, a format with no formatting. Therefore, both a poem by Shakespeare and all his works could be two examples of corpora. A corpus can be written, like a collection of journalistic articles, or spoken, like an interview, a recorded conversation, or other spoken exchanges between people. What is required in both cases is a natural communicative setting: the texts cannot be produced deliberately for the corpus, whether written or spoken. A corpus has to be representative of a language, or of a variety of it, and of a certain time period. For example, the articles of a newspaper form a corpus which represents that type of editorial prose, in that language and in that period. Corpora must also be balanced: if they contain both written and spoken material, both components must be taken into account in the analysis. Finally, a corpus must be analysed linguistically, which means that if we are analysing journalistic articles we should not consider only editorial English but also other varieties of the language; likewise, analysing morphology does not mean ignoring syntax or phonetics. Furthermore, corpus linguistics is closely connected to computational linguistics: the former borrows from the latter the technological and mathematical instruments needed to analyse language. Collecting different texts into a corpus requires explicit criteria, for example chronological order.
Corpus linguistics is frequently associated with a particular outlook on language: the rules of language are seen as usage-based, emerging and changing as speakers use the language to communicate with each other. The idea of corpus linguistics is that, when studying Italian for example, it is a good idea to study it in use.
As noted above, corpus linguistics depends on computational linguistics for its technological and mathematical instruments. The method of corpus linguistics is also tied to the birth of computers: before 1960 there were only paper-based corpora. Otto Jespersen, for example, describes shoeboxes containing thousands of paper slips on which he noted examples of English sentences from his readings of English literature. He used these slips as a corpus when writing his grammar, and at the end of every section he gives authentic examples picked out of these boxes.
After 1960 came the first electronic corpora, small in terms of amount of data. Over the last sixty years corpora have grown fast, and we now have very large corpora. This was made possible by the evolution of technology, in particular of computers. Computers are so important in corpus linguistics because, first of all, they facilitate the collection and storage of data, even in large amounts; they enable the development of the software used to analyse corpus data; data analysis is fast and automated, since it is carried out by the computer itself; and it is possible to repeat a study in order to check the reliability of the results. The Survey of English Usage, started by Randolph Quirk at the end of the 1950s, illustrates some early drawbacks of using computers to analyse a corpus: computers were unreliable and expensive, so the Survey relied on paper slips with detailed grammatical annotations. The first example of a digital corpus is the Brown Corpus, the first digital collection of data compiled for linguistic analysis. In those years, a search in a large corpus could take so much time that it tied up the computing resources of a whole department. With modern computers it is far easier and faster.
5. What are the main advantages and disadvantages of using corpora for linguistic research?
The great success of generative linguistics made corpus linguists feel that they had to work almost in secret, and this was the situation in which the ICAME organisation was founded. One of its major figures was Jan Svartvik, who spelled out the pros and cons of corpus linguistics. Among the pros: corpus data are more objective than data based on introspection; they can easily be verified by other researchers, who can share the analysed data; they can be used to study variation between registers, styles and dialects; they provide the frequency of linguistic items; they supply both illustrative examples and material for theory; they give essential information for a number of applied areas; and they are ideal for non-native speakers of the language analysed. But there are also cons: corpora will never be big enough to show all the infinite sentence possibilities of a language; some of the results may be insignificant or useless; corpora also contain performance errors, which the analyst may need to disregard; and you always need a theory of language to explain what you find in the corpus.
Corpora are also used by many linguists who are not corpus linguists. The methodology has a strongly interdisciplinary character, since it is also applicable to non-linguistic disciplines, and the relationship between corpus linguistics and other branches of linguistics has improved. We see this with cognitive linguistics, whose usage-based view of grammar is compatible with corpus studies. Moreover, corpora can be used in functional linguistics, historical linguistics, semantics, grammaticalization studies, second language acquisition and translation studies.
Different manifestations of language can be treated as the language data we want to investigate. As we said, there are two ways of doing linguistics according to Fillmore: armchair linguistics and corpus linguistics, and in the two approaches data is seen differently. Armchair linguistics is the approach according to which language can be analysed through introspection, so its data is available through intuition. In most other cases we either have to collect our material ourselves or use data that someone else has previously collected. The latter approach is that of corpus linguistics, in which the data are the texts contained in corpora. It is not as easy as it seems to extract all the available material for analysis, and data can be distinguished according to truthfulness, data source and the type of language represented. In terms of truthfulness we distinguish natural language data, i.e. language produced naturally in the real world, from artificial data, i.e. any kind of language which is not natural (programming languages and mark-up languages). Data can also be classified by source: attested (or authentic) data is recorded without any intervention by the researcher; when attested data is edited, for instance to exclude extraneous material, it becomes modified data; finally, there is intuitive data, which is invented to illustrate a particular linguistic point. Data can also be distinguished according to the type of language it represents: elements such as medium, genre and register are important here. In terms of medium, for example, linguistic data can come in two forms, written or spoken, but there are also intermediate categories, such as texts that are written to be spoken.
The difference between corpora proper and text databases (often called text archives) is that the latter are collections of texts put together for their own sake and not meant to be balanced. They can nevertheless be used for linguistic investigation, and researchers can draw on them when compiling their own corpora.
Synchronic corpora are those in which the collected texts come from roughly the same period, so that they represent a language, or a language variety, at a particular point in time, while diachronic corpora are those in which texts are carefully collected from different periods in order to make change over time visible. An example of a synchronic corpus is the BNC.
11. Can corpora be both synchronic and diachronic at the same time?
General corpora are used by linguists to find out something general about the language. Examples are the COLT, the BNC, the LLC, the Bank of English (BoE) and the Corpus of Contemporary American English (COCA). Within the bigger corpora we can also find subcorpora and subgenres, and in that case we speak of specialised corpora: the analysis is more detailed and oriented in a particular direction. Examples of specialised corpora are the MICASE (Michigan Corpus of Academic Spoken English), whose data comes from academic English spoken at the University of Michigan, and the International Corpus of Learner English (ICLE), which contains subcorpora of written English produced by learners from different countries.
Both the specialised and the general corpora mentioned above are monolingual corpora, since they contain only English. There are also parallel corpora, which contain source and target texts (texts in the original language and their translations). Finally, multilingual corpora are those that contain similar text types in different languages.
Over the years two views of corpus linguistics research have arisen. The first is the corpus-driven view, according to which the researcher should work from scratch, with as few preconceived ideas as possible, when doing the analysis. In the corpus-based view, on the other hand, corpora are used to test hypotheses based on already existing theories.
According to Singleton, for corpus linguistics and for a computer program a word is a "sequence of letters bounded on either side by a blank space". We also talk about orthographic words, i.e. words that have a blank space at each end but no space in the middle. For example, "armchair" is one orthographic word, while "point of view" is three. Yet even if "point of view" consists of three orthographic words, intuitively it is a single word; the same tension arises with forms such as "rail-road", "John's", "don't" and "mother-in-law". In the sentence "My dog loves cats and cats love my dog" there are nine occurrences of word-forms (tokens), six word-form types (types) and five base-forms (lemmas).
A token is an individual occurrence of a linguistic unit, most often a word, i.e. a string between white spaces or punctuation symbols. A single orthographic word can be split into more than one token: "she's" is one orthographic word, but in terms of tokens it is two, and can be analysed as she + 's. Repeated words are counted each time they occur: in the sentence "My mother loves being a mother for me" the word "mother" occurs twice and is counted twice, so the sentence has eight tokens. The process through which running text is separated into tokens is called tokenization. It also handles contracted verb forms (she's → she + 's) and the Anglo-Saxon genitive of nouns (John's → John + 's).
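As an illustration, here is a minimal tokenization sketch in Python; the regular expressions and the example sentence are simplifying assumptions, not the rules of any particular corpus tool:

    import re

    def tokenize(text):
        # Split off clitics such as "'s" and "n't" before extracting word tokens.
        text = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", text)
        text = re.sub(r"n't\b", " n't", text)
        return re.findall(r"[\w']+", text)

    print(tokenize("She's sure that John's dog doesn't bark."))
    # ['She', "'s", 'sure', 'that', 'John', "'s", 'dog', 'does', "n't", 'bark']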
In a sentence like “My mother loves my cousin” we have 5 tokens (5 occurrences of word-forms) but 4 types
(word-form types). This happens because a type is the abstract class of which tokens are members. It’s the
number of distinct words, not counting repetitions, and grouping occurrences of a word together as
representatives of a single type. It’s the total number of unique words: in the sentence above the two “my”
are considered two tokens, but at the same time one type. Even if the word “my” occurs eighty times in a
text, there will be eighty tokens but one type.
In a corpus we therefore have types and tokens. The number of tokens is an estimate of the overall size of the corpus, while the number of types is an estimate of the size of its vocabulary, which gives us an idea of its lexical richness. To compute the type/token ratio we divide the number of types by the number of tokens and express the result as a percentage. A high type/token ratio indicates a lexically rich and varied text; a low percentage indicates a lot of repetition of lexical items.
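A minimal sketch of the type/token ratio calculation, assuming a simple whitespace-tokenised sentence (real corpora would of course be tokenised more carefully):

    def type_token_ratio(tokens):
        types = set(t.lower() for t in tokens)     # distinct word-forms, ignoring case
        return 100 * len(types) / len(tokens)      # expressed as a percentage

    tokens = "my dog loves cats and cats love my dog".split()
    print(f"{type_token_ratio(tokens):.1f}%")      # 6 types / 9 tokens ≈ 66.7%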
A commonly cited ideal size for a corpus is 100 million words; an example is the British National Corpus. This does not mean that all corpora must fit this size: more and more corpora are getting bigger, but there are also small corpora for special areas. The size of a corpus depends on the purpose of the research. The question of ideal size is also connected to representativeness: the researcher must ask whether the texts being investigated are sufficient for the research and whether they adequately represent the object of study. Considering the dimension of a corpus does not mean focusing only on its length but also on its internal structure, so that it represents, for example, all the works of an author, or texts from a certain period. It is not always possible to represent every feature completely, and in that case the aim is to capture enough of the language for an accurate representation. A specialised corpus is usually a small one, which can be useful for exploring common grammatical features, while bigger corpora, like the general corpora, can be used to study rarer aspects of the language and uncommon features.
20. By using a corpus, the linguist can investigate a large amount of linguistic data. How are the results
represented? Explain your answer.
Compiling corpora is very expensive and time-consuming, so the results must offer considerable gains to researchers. What corpus investigation offers is speed and reliability: by using a corpus the linguist can investigate more material and obtain more reliable calculations of frequencies. The results of this fast and reliable investigation are presented as frequency figures or as concordances.
21. What is a concordance and what is a concordance line?
A concordance is a way through which the corpus result can pe presented. We refer to concordances of
tokens (the individual occurrence of linguistic unit), and it is a list of all the contexts in which a word occurs
in a particular text. Concordance lines are the way in which words in context are presented, specifically we
say presented in KWIC (keyword-in-context). In terms of concordances of tokens, it’s important to analyse
three aspects: the syntagmatic pattern (the context in which words are used), syntagmatic word clusters (the
expressions that are common) and collocations (the words that go together). In the case of the syntagmatic
pattern we analyse the context in which a word is used by seeing if it’s followed by a preposition, a noun, if
it’s transitive, intransitive or ditransitive. Let’s look at the situation of the verb “to send”: it could be
transitive (send an email), intransitive (they send what they want) or ditransitive (send me a text). We
cannot know in which situation the verb is if not in context. Then, we move to syntagmatic word clusters. In
this case, a common case is when nouns are preceded by classifiers, as in “a piece of cake”, “a bunch of
bananas” etc… With collocations we wonder which words go together the most: verbs + nouns, verbs +
prepositions etc…
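A minimal sketch of how KWIC concordance lines can be produced; the window width and the toy text are assumptions for illustration, not the output format of any specific concordancer:

    def kwic(tokens, keyword, span=4):
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - span):i])
                right = " ".join(tokens[i + 1:i + 1 + span])
                lines.append(f"{left:>30}  [{tok}]  {right}")
        return lines

    text = "send me a text or send an email and then send for help".split()
    for line in kwic(text, "send"):
        print(line)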
Like concordances, frequency figures are another way of presenting corpus results, and frequency is a major factor in language change. Thanks to computers we can obtain frequency data very easily, given the reliability and speed of the machine. Frequency is used to check how many times a word occurs in a corpus, which forms of the word are most frequent, and which forms co-occur. Frequencies can be compared across different types of language: medium, genre, register, geographical variety and so on.
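A minimal sketch of how a frequency list can be produced with a standard counter; the toy sentence is an assumption:

    from collections import Counter

    tokens = "the dog saw the cat and the cat saw the dog".lower().split()
    freq = Counter(tokens)
    for word, count in freq.most_common(5):
        print(word, count)      # the 4 / dog 2 / saw 2 / cat 2 / and 1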
In the British National Corpus the most frequent words are "the" (the most frequent word overall), "a" and "that" among the determiners, "of", "to" and "in" among the prepositions, "and" as a conjunction, "it", "I" and "you" among the pronouns, and "is" and "was" as auxiliary verb forms.
Linguistics is compared to hard sciences such as chemistry, biology or geology because, even though it is not one of them, it uses a scientific method. Moreover, as in the hard sciences, we can distinguish two methods in linguistics: qualitative and quantitative. The qualitative method is a close analysis of the text in terms of grammatical constructions known from introspection; it is based on intuition, and through such analyses one arrives at theories about language by induction. This is the approach of armchair linguists and of scholars working in Chomsky's generative linguistics. The quantitative method, on the other hand, is an analysis based on counting things and on using frequencies and percentages to describe language and to formulate hypotheses. Unlike the qualitative method, it is based on deduction.
25. How can quantitative methods be combined with qualitative analysis?
Even if the two methods seem far apart, there are many cases in which they are combined in linguistic analysis, because the use of one implies the use of the other. When we want to analyse something quantitatively, by counting and using percentages, we also need to employ an intuition-based, qualitative method to categorise what we are about to study. If we want to count the words per sentence in a text, we need to know what counts as a word and what counts as a sentence. It is at this point that quantitative and qualitative methods meet and work together in the analysis.
We can see the importance of frequency in the many applications in language technology that make use of frequency information. Think of the spell checker on a mobile phone which, while you are typing, guesses the word you intend to type. We can also think of the order in which teachers and textbooks decide to present new words to students. This, too, is a matter of frequency.
28. The most frequent body-part nouns in the BNC are hand, head, and eye. What do they have in common?
The words "hand", "head" and "eye" are the most frequent body-part noun lemmas in the British National Corpus. They are all content words, and the frequency of content words depends more on the type of text and topic than the frequency of function words does. Moreover, they are not always used to refer to body parts in a literal way, but also metaphorically: "the head of an organisation", "the hands of a clock".
29. What are the most frequent verbs and adjectives in the BNC? Why?
Among the fifty most frequent verbs in the BNC, six of the top seven are auxiliaries. Among the most frequent lexical verbs we find "say", "get", "make" and "go". Most of these are so-called light verbs, verbs that can occur in a great many contexts. Perhaps surprisingly, the most frequent adjective in the BNC is "other", rather than quality adjectives like "bad", "good", "new" or "old".
30. What arguments are relevant when we discuss whether a corpus is representative or not?
According to the definition discussed earlier, a corpus, whether written or spoken, must be analysed linguistically, balanced, and above all representative. Representativeness is a key concept for corpora. The BNC is meant to give a fair picture of the English language in Britain around 1990. But there are problems: first, as already noted, spoken corpora lag behind written corpora, so some genres or text types may be left out. Second, an utterance broadcast by a radio speaker to 100,000 listeners has the same weight in the corpus as a sentence said by someone in an everyday conversation, and a sentence from a best-seller has the same weight as sentences from flyers. The conclusion is that representativeness varies between corpora: it is not absolute but relative.
31. Why is it important to normalise frequencies when you compare results from different corpora?
Normalisation is required in order to compare data from corpora of different sizes: raw counts are converted into relative frequencies, typically tokens per million words. This process should always be applied when sizes differ, and thanks to it we can compare two corpora or subcorpora, for example British and American English, written and spoken material, or a 300-million-word corpus with a 100-million-word corpus.
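A minimal sketch of normalisation to tokens per million words; the counts used here are hypothetical:

    def per_million(raw_count, corpus_size):
        # Convert a raw count into a frequency per million words.
        return raw_count / corpus_size * 1_000_000

    # Hypothetical counts: 4,500 hits in a 300-million-word corpus vs
    # 2,000 hits in a 100-million-word corpus.
    print(per_million(4500, 300_000_000))   # 15.0 per million words
    print(per_million(2000, 100_000_000))   # 20.0 per million words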
According to the amount of information they provide, corpora are divided into raw corpora and annotated corpora. Raw corpora are unannotated: there is no code, no tagging, no information about the text; the corpus is in its raw state of plain text. Annotated corpora provide additional information, called annotation or tagging; these tags are of different kinds, depending on the information to be encoded, and constitute an important tool for research. One kind of annotation is textual markup and metadata, which give additional information about the text itself and about the writer, such as the age or sex of the speakers. Besides metadata there is linguistic annotation, where the extra information is linguistic in nature. Among the types of linguistic annotation we find POS tagging, lemmatisation and parsing.
Suppose we want to check how many times the noun "kick" occurs in a certain corpus. Once we have a result, we must ask whether it really reflects what we were looking for, because we may also have retrieved tokens of the verb "to kick". This is where part-of-speech tagging (POS tagging) comes in: with POS annotations it is easy for a researcher to search for the nouns and leave out the verbs. POS tagging is considered such an important technique because it speeds up the linguist's work enormously. The same problem arises with "can", which may be a can of Coke or the verb "can": if the object of the research is the verb, POS annotations make it easy to retrieve the verbs and exclude all the irrelevant nouns. POS tagging helps to achieve good precision and recall, both expressed as percentages; ideally both should be 100%. Precision measures how large a proportion of the hits you have retrieved actually consists of what you were looking for. Recall measures how many of the relevant tokens in the corpus you actually manage to retrieve with your search. There are, however, some problems with POS tagging: first, it is not always obvious what the tag should be, as with "to put off", where it is unclear whether to tag it as verb + particle or as a single particle verb; second, not every researcher tags in the same way. There are tagging teams, and every member should tag in the same way, trying not to categorise words according to different grammatical theories. In practice, tagging is usually entrusted to automatic taggers, i.e. computer programs.
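A minimal sketch of the two measures as percentages; the figures are hypothetical, chosen only to illustrate the calculation:

    # Hypothetical search: 200 hits retrieved, 180 of which really are the noun
    # "kick" (true positives); the corpus actually contains 240 noun tokens of "kick".
    def precision(true_positives, retrieved):
        return 100 * true_positives / retrieved

    def recall(true_positives, relevant_in_corpus):
        return 100 * true_positives / relevant_in_corpus

    print(precision(180, 200))   # 90.0 %
    print(recall(180, 240))      # 75.0 %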
Among the most frequent adjectives in the BNC is "big". With a certain frequency the list will also contain the superlative "biggest" and the comparative "bigger". Clearly we are dealing with the same adjective in different uses: "bigger" and "biggest" are just inflected forms of "big", and the base form is what corpus linguistics calls the lemma, conventionally written in small capitals (BIG). The process of lemmatisation groups the inflected forms of the same word-class under one lemma. Lemmatisation is another type of linguistic annotation; note that if a corpus contains both the verb and the noun "can", they cannot be grouped in the same lemma because they belong to two different word-classes, so we write CANv (verb) and CANn (noun). A natural consequence is that lemmatisation changes frequency counts. In a table listing the most frequent verb forms, "to be" may have a different rank from the one it has in a table of verb lemmas, because in the lemma table the lemma BE groups all the inflected forms: "was", "are", "am" and so on.
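A minimal sketch of lemmatisation as a lookup from word-forms to lemmas; the tiny table is an assumption, since real lemmatisers rely on full lexicons and POS information:

    from collections import Counter

    LEMMAS = {"was": "be", "were": "be", "is": "be", "are": "be", "am": "be",
              "bigger": "big", "biggest": "big"}

    tokens = "is was bigger are biggest am".split()
    lemma_freq = Counter(LEMMAS.get(t, t) for t in tokens)   # fall back to the form itself
    print(lemma_freq)    # Counter({'be': 4, 'big': 2})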
The last type of linguistic annotation is parsing. Parsing is a more difficult method, and it is also more closely tied to the language theory that the researcher adopts for the analysis. It consists of labelling the constituents of a sentence functionally: adverbial clauses, adjective phrases, temporal noun phrases and so on. Parsed corpora are typically organised as treebanks, collections of tree diagrams that show these functional labels. Automatic parsing is important for several technical applications in the field of natural language processing.
36. What are the main advantages of corpus-based lexicography compared with the old way of collecting
examples by hand?
Traditionally a sharp line is drawn between grammar and vocabulary (lexis) in a language, which explains why we are used to learning a language through grammar books and dictionaries as separate texts. For scholars of generative linguistics, words were inserted into preconceived grammatical structures, which existed prior to and independently of the words. More recent studies claim, on the contrary, that words not only have their meaning but also their own "local" grammar, and thus assume a central rather than a peripheral role. Before the advent of computers, lexicographers collected slips of paper with annotations and text excerpts for all the words they wanted to include in their dictionaries. This is what Otto Jespersen did with his slips of paper in shoeboxes when writing his grammar, and what the editors of the OED (Oxford English Dictionary) did in order to compile the dictionary. This work was cumbersome and slow, so the invention of computers was a big step forward for lexicographers and dictionary-makers. Editors now use concordances to find out in which constructions each word is used and to evaluate which of them are worth mentioning in the dictionary. Dictionaries quote authentic examples from corpora, and specialised corpora and subcorpora give important information about differences of register or genre. Finally, corpora also give rich information about seemingly ordinary verbs such as "take", "bring" and "put", revealing meanings and examples that introspection would miss.
37. Identify the different parts of Sinclair’s system of lexical relations and describe them.
Sinclair suggested a system of lexical relations in which a word does not convey meaning on its own, but only in context, when it co-occurs with other words. The system is made up of four parameters: collocation, colligation, semantic preference and semantic prosody. Collocation is the relation between a node and a collocate, i.e. two words that frequently co-occur. Take the words OPEN and MIND: they co-occur in many examples, "he has an open mind", "he opened his mind", "open-minded", etc. Colligation is the relation between a node and a grammatical category, for instance the co-occurrence of a verb with a particular grammatical pattern, such as "announce" followed by a that-clause ("She announced that she was pregnant"). Semantic preference is the relation between a node and a set of semantically related words. For instance, the adjective "large" often precedes words denoting measures and quantities: "a large space", "a large amount", etc. Finally, there is semantic prosody, which concerns the communicative purpose and discourse function of a word; together with collocation, colligation and semantic preference it is one of the four parameters of Sinclair's system. Sinclair's example is the verb "set in": although its general meaning is simply "to begin", it typically occurs with unpleasant subjects and so carries a negative meaning. Speakers and writers can use semantic prosody to convey positive or negative evaluation without stating their view explicitly. It is called semantic prosody because the meaning is spread over several words, in a way similar to intonation in prosody.
39. What are the pros and cons of using newspapers for corpus investigations?
Using newspapers to investigate collocations, colligations or semantic preferences is a rich way of studying language, and of comparing, for example, AmE and BrE as two varieties of English. This richness comes from the fact that journalists constantly pick up new trends, new words and new concepts, so journalistic prose is an important source for corpus linguists. There are also drawbacks: before newspapers could easily be accessed on the web, they were distributed on CD-ROM, and the publishers of these collections did not provide total word counts, so frequency statistics had to be treated with caution.
40. The text shows the frequency of use of greenhouse effect and global warming in Time corpus (1950s-
2000s). Describe the results of the analysis.
It is normal for the lexicon of a language to change over time. New concepts, new words, new ways of conceptualising things and borrowings from other languages are needed to talk about changes in society, new inventions and so on. Expressions like "greenhouse effect" and "global warming" sound familiar to us, but we may not know how recent they are. In a graph, researchers plotted the frequency of these two expressions in Time magazine decade by decade. Up to the 1940s there were no occurrences of either expression. From the 1970s "greenhouse effect" grows, reaching about 5 tokens per million words, and then decreases again until it disappears in the 2000s. As "greenhouse effect" starts decreasing, "global warming" rises, reaching about 25 tokens per million words in the 2000s.
41. The text shows the frequency of use of maybe and perhaps in Time corpus (1950s-2000s). Describe the
results of the analysis.
As with "greenhouse effect" and "global warming", corpus linguists have also analysed the frequency of two near-synonyms in Time magazine, from 1920 to 2000: "perhaps" and "maybe". The curve for "perhaps" is jagged rather than linear: it starts in 1920 at about 200 tokens per million words and ends at roughly the same level in 2000. The curve for "maybe" is different and shows a very rapid rise, starting at 0 tokens in 1920 and arriving close to "perhaps" in 2000, at about 150 tokens per million words. From this first analysis we can see that "maybe" entered this kind of prose later than "perhaps", but that by the 2000s both were used a great deal. We may then ask: who prefers which? Who is keener on "perhaps" rather than "maybe"? A study of the frequency of "maybe" and "perhaps" in COCA, distinguishing spoken language, fiction, academic prose, magazines and newspapers, shows that "maybe" is more frequent in speech and in fiction, while "perhaps" is more frequent in academic prose and magazines. In newspapers the two occur with a similar number of tokens per million words.
The fact that "maybe" is typical of speech and fiction suggests that Time magazine has started to include more fiction and reported speech: spoken-like language has entered the written dimension. This seems to confirm the existence of a process called the "colloquialisation of English", which can also be seen in the increasing use in print of contracted forms like "isn't" (from "is not") and "doesn't" (from "does not"), and in the replacement of formal prepositions like "upon" with more neutral ones like "on".
43. The chapter gives some examples of how corpus methods can be used in literary studies. Describe them.
We have seen how corpora can be used to analyse English as a whole, or particular registers of English, such as newspaper language. But it is also possible to use corpora to investigate the language or the genre of individual authors. According to Stubbs, electronic text analysis can add an empirical basis to further interpretation. For example, scholars have long noted that in Conrad's short novel Heart of Darkness one of the most visible semantic fields is that of vagueness. With electronic text analysis Stubbs can also show how many tokens per million words express the idea of vagueness, both lexical words (dark, smoke, fog) and indefinite words (somewhere, somebody, somehow). This method works with the frequency of groups of words, and that is the important contribution of corpus linguistics to literary studies.
Beyond grouping words in terms of frequency, another method can be used in literary investigation: keyword analysis. This is the method used by Scott, who wanted to identify which words are special in a certain text compared with a norm, i.e. a reference corpus. In particular, he investigated which words are more frequent or less frequent in a text like Romeo and Juliet than would be expected according to the norm.
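A minimal sketch of the idea behind keyword analysis, comparing relative frequencies in a text with those in a reference corpus; the simple ratio used here (with +1 smoothing) is a simplification of the log-likelihood or chi-square statistics used by real keyword tools, and both word lists are toy examples:

    from collections import Counter

    text_tokens = "love love death night love banished".split()
    reference_tokens = "the king walked and the night passed and love faded".split()

    text_freq, ref_freq = Counter(text_tokens), Counter(reference_tokens)
    n_text, n_ref = len(text_tokens), len(reference_tokens)

    for word, count in text_freq.items():
        # Relative frequency in the text divided by relative frequency in the reference.
        ratio = (count / n_text) / ((ref_freq[word] + 1) / n_ref)  # +1 avoids division by zero
        print(word, round(ratio, 2))   # a high ratio marks a candidate keyword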
The concept of collocation arises when we consider the phenomenon whereby words are grouped together in multi-word units or phrases. This is a tendency of every speaker, particularly of native speakers, who manage to do it fluently. According to Wray, this means using words and strings of both regular and irregular construction. Basically, as Sinclair explained through the four parameters of his system of lexical relations, collocations are tendencies of words to co-occur. Take the noun "dog", analysed as a node with collocates preceding or following it. "Dog" can be preceded by an adjective collocate, as in "a happy dog", but a collocate can also be a verb that either precedes or follows the noun: "to have a dog", "the dog barks", etc.
There are three influential accounts of collocation, proposed by Palmer, Firth and Sinclair. According to Palmer, a collocation is a succession of two or more words that must be learnt as a whole and not as separate parts, and each part of speech enters into different patterns. Take an expression like "to have a hard time of it", which has the fixed pattern VERB + SPECIFIC NOUN + PREP + NOUN. That is a fixed expression; by contrast, in expressions like "to think nothing" the pattern is not completely fixed, since we can replace "nothing" with "something", "little", etc.: the pattern has an open slot. Firth spoke of collocations in terms of knowing words by "the company they keep", emphasising that a word's meaning is influenced by the words near it. The concept takes on a quite different aspect with Sinclair, who defines a collocation as "a more-frequent-than-average co-occurrence of two lexical items within five words of text". Comparing these three definitions shows that they represent different visions of the topic.
46. What are the two main types of collocations and what is the difference between them?
We have seen that Palmer, Firth and Sinclair give three different and partly contrasting definitions of collocation, and on this basis two types of collocation can be distinguished. On the one hand there are window collocations, which correspond to the Firth and Sinclair view; on the other hand there are adjacent collocations, which correspond to the Palmer view. In window collocation, the collocates are words that occur in the vicinity of the keyword but do not necessarily stand in a grammatical relation to it. In adjacent collocation, the collocates occur immediately to the right or left of the keyword. This latter kind of collocation is closer to linguistic structure, whereas window collocation is more of a statistical phenomenon. Each time we say or write a word, we are influenced by the words we have just said or written.
In window collocation, the window is the span of text to the left and to the right of the keyword that is searched for collocates. The window span can vary in size, but commonly it is four or five words on each side; five words to the left is written -5, five words to the right +5. Sinclair observed that the larger the span, the lower the significance of the collocations found. The collocates we analyse can belong to different parts of speech: we can look for the most frequent verb collocates of the noun "dog", but also for its noun collocates.
Looking for the noun collocates of "dog", we find that the most frequent collocate is the word "dog" itself. This is because, when we write or speak about something, we tend to repeat the keyword whose collocates we are searching for, and this is what happens with the noun "dog".
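A minimal sketch of window collocation: collect and count the words that occur within a span of four words on either side of the node; the toy text and the span are assumptions:

    from collections import Counter

    def window_collocates(tokens, node, span=4):
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok.lower() == node.lower():
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(w.lower() for w in window)
        return counts

    text = "the happy dog barks and the dog wags its tail at the other dog".split()
    print(window_collocates(text, "dog").most_common(5))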
Studying collocations also means dealing with collocates that co-occur with the node by sheer chance, simply because they are very frequent overall. This problem, known as the frequency effect, is handled with statistical measures that relate the frequency of words near the keyword to their total frequency in the corpus. Thanks to these calculations we obtain lists of words that occur near the keyword more often than expected. Three statistical measures are commonly used to weigh the frequency and exclusivity of particular collocates: mutual information (MI), the z-score and the log-likelihood. MI measures collocational strength, i.e. how tightly the node and each collocate are tied together: it compares the probability of finding the two words together with the probability of finding them independently of each other. The problem with this measure is that it gives great weight to rare combinations. With the node "educational", for example, MI ranks "non-broadcast" as the top collocate, even though its overall frequency in the corpus is much lower than that of other adjectives. The z-score is a statistical measure that expresses how unlikely it is that the co-occurrence of two words is due to chance; with "educational" the top collocate by z-score is "psychologist". The log-likelihood (LL) compares observed and expected values for two datasets.
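A minimal sketch of the mutual information score in its common textbook formulation, MI = log2(observed / expected); some tools also adjust for window size, which is omitted here, and the frequencies are hypothetical:

    import math

    def mutual_information(f_pair, f_node, f_collocate, corpus_size):
        # Expected co-occurrence frequency if node and collocate were independent.
        expected = f_node * f_collocate / corpus_size
        return math.log2(f_pair / expected)

    # Hypothetical frequencies, for illustration only.
    print(round(mutual_information(f_pair=50, f_node=2000,
                                   f_collocate=300, corpus_size=100_000_000), 2))  # about 13.02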
50. Lists of collocates can be based on absolute (raw) frequency or one of a number of statistical measures.
What is the difference between the kind of collocates that appear on these lists?
The difference is that a list based on a statistical measure may have a top collocate that would not rank as highly in a list based on absolute (raw) frequency. Consider the verb collocates of "dog": in the statistical ranking the top collocate is "wag", because it occurs almost exclusively in the vicinity of "dog" within five words to the left or right, whereas a verb like "bark", which has a higher overall frequency, does not come out on top even though it also occurs often near the node "dog".
If we analyse the prepositional collocates of a noun, we see that the preposition has little meaning on its own, but together with the node it forms a multi-word unit. We can see this with "hand": at hand, in hand, on hand, to hand. Even though "hand" has a specific literal meaning, when it combines with a preposition there can be a shift from the literal and concrete to the abstract. "In hand", for instance, can refer both to holding something in one's hand, like a glass of water, and to having something assigned, like a task in hand. This process, by which the combination acquires a unitary meaning of its own, is called lexicalisation. Interestingly, the different multi-word units (preposition + hand) tend to be used with particular categories of referent: "in hand" and "at hand" are mostly used about abstract things; "on hand" mainly about humans; "to hand" about inanimate concrete items.
Sinclair was one of the scholars who held that phraseology is central to language, not a peripheral matter. He argues that, as speakers, we have preconstructed phrases available in our minds and that they constitute our first option when we speak. This is what Sinclair calls the idiom principle. There are also situations in which it is not easy to retrieve such preconstructed phrases, because they do not come to mind, and in that case we construct a phrase from scratch using our lexical and grammatical skills. Sinclair calls this other process the open-choice principle. Wray presents a model in which lexical items of various sizes, from single words to whole phrases, are stored and retrieved according to the needs of the individual speaker.
55. What is the difference between idioms like “a red herring” and recurrent phrases like “at the moment”?
What is their importance for language learning?
A first difference between an idiom like "a red herring" and a recurrent phrase like "at the moment" is that the idiom cannot be understood from its individual parts: it is a fixed expression, fully comprehensible only to native speakers or to those who know English very well. "At the moment", on the contrary, is a recurrent phrase: it occurs very frequently in English and its meaning is transparent. Recurrent phrases are simply strings of words that recur frequently in texts; Altenberg chose to study recurrent phrases consisting of at least three words, since investigating two-word strings such as "in a" or "and the" is less useful, as they are merely fragments of something bigger. Idioms matter for language learning because knowing them is part of a good knowledge of the language and improves comprehension. Recurrent phrases, on the other hand, are important because of their role and function in discourse, referring to time, space or the organisation of the text: "at the moment", for example, functions as an adverbial providing a temporal reference point and helping to organise the discourse.
56. Define the notion of idiom and explain the results of the analysis of the idiom “storm in a teacup”.
An idiom is a fixed or semi-fixed expression whose meaning cannot be deduced from its individual parts. We say semi-fixed because not all idioms are 100% fixed, as in "pay through the nose" or "to bend the rules". Most idioms are hard to interpret from their single words, and the meaning is not immediately comprehensible. In an idiom like "turning the tables", even if the literal image is that of moving a table, we can work out that there is a reference to a reversal, a change of situation; but with an idiom like "high and dry" it is difficult to arrive at the meaning of the phrase from the single words. Knowing idioms is therefore an advantage for speakers of a language. Some scholars investigated the idiom "a storm in a teacup" on Google, choosing different web domains (UK, American, New Zealand, etc.), and found results from both native and non-native speakers of English in those regions. To find variants of the idiom, they typed only "a storm in a" in the search box. Only a few of the results showed the same structure as "a storm in a teacup", such as "a storm in a glass of water"; in the other cases modifiers emerged that changed the structure, as in "bellydance storm in a teacup".
57. Idioms like "a bird in the hand is worth two in the bush" are often used in an abbreviated or manipulated
form like “bird in the hand”. What does this imply about how these idioms are stored in the mental
lexicon?
In his investigation of recurrent phrases in the London-Lund Corpus, Altenberg also found that many of these phrases were incomplete. What is missing in most cases is a lexical word: for example, incomplete phrases such as "out of the" or "a sort of" are recurrent, but something is clearly lacking. Lindquist and Levin searched for the most frequent collocates of these incomplete phrases in the BNC. The results showed that, for example, the most frequent collocate of "out of the" is the noun "window", whereas in the case of "a sort of" the most frequent collocate is the determiner "a".
59. What are n-grams? Why are they interesting from a linguistic point of view?
Recurrent phrases in the BNC can be studied by means of a dedicated database, Phrases in English (PIE). What we find in this database are n-grams, i.e. recurring strings of words of the same length (the same number of words). "N" can take a value from 1 to 32, which means that if we want to look for strings containing 16 words, the object of our analysis will be the 16-grams in the corpus. N-grams are very useful for linguistic research because they highlight recurrent strings and structures in a language. If we search for 5-grams, the most common one in the BNC has the structure preposition + article + noun + of + article ("at the end of the"), which is what Altenberg would call an incomplete phrase. Other frequent 5-grams seem to come from the spoken part of the corpus, for example "I don't know what" and "you don't have to".
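A minimal sketch of n-gram extraction from a token list; PIE itself is a precompiled database, so this is only an illustration of the underlying idea, with a toy sentence:

    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous strings of n tokens, with their frequencies.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "at the end of the day at the end of the week".split()
    print(ngrams(tokens, 5).most_common(2))
    # [(('at', 'the', 'end', 'of', 'the'), 2), ...]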
60. What are POS-grams? Why can it be interesting to search for POS strings (tag sequences) as well as strings
of words in a corpus?
Besides n-grams there are also POS-grams (part-of-speech grams): another way of using the PIE database, which lets you search for the most frequent strings of tags in the corpus. POS tagging, as we saw, is a type of linguistic annotation that makes it possible to search a corpus for words of a specific class while excluding the classes that are not needed. One of the most frequent 5-grams in the BNC is "at the end of the", and correspondingly the most frequent POS-gram is preposition + article + noun + of + article, the beginning of a complex prepositional phrase. Like n-grams, POS-grams are important in linguistic research because they reveal the most frequent grammatical sequences in the language.
61. How can searches for recurring n-grams be interesting for the analysis of literary works?
As with keyword analysis, recurrent phrases can also be used to investigate literature. The interesting point is that, through these recurrent clusters, it is possible to analyse the stylistic trajectory of a novelist and to compare it with that of other writers.
62. Describe the results of the analysis of the most frequent 8-grams appearing in the Dickens corpus
(Mahlberg 2007).
Mahlberg analysed the recurrent phrases in 23 novels by Dickens, all downloaded from Project Gutenberg, in order to compare Dickens' style with a corpus of other 19th-century novelists. She searched for 8-grams. The results show that Dickens recycled phrases across different novels, since he used certain phrases more than once in more than one novel. Moreover, some hits represent ordinary everyday expressions, while others are expressions typical of his time.
63. What is the difference between metaphor, metonymy and simile? Provide examples. Why are metaphors
considered to be more powerful than similes?
Metaphors, similes and metonymies are related concepts, but they differ in important ways. Metaphors use vocabulary from one semantic sphere to describe something in another sphere: even though the two domains are different, there must also be some similarity between them. A metaphor has three parts: the vehicle (the expression used, e.g. "hard" in "a hard lesson"), the tenor (the intended meaning: the lesson is difficult) and the ground (the connection between the two spheres, i.e. what they have in common). Metaphors are typically used to describe something abstract in terms of something more concrete, often in a striking and sometimes amusing way. Take the example "the body was invaded by several bacteria": the verb "invade" clearly recalls the military sphere and is unusual in the sphere of the body, but the idea that the bacteria enter the body in a harmful way is exactly what makes the metaphor work. This type of metaphor is a conventional one, because it is easily understood by native and non-native speakers alike. There are also creative metaphors, like "a tower of cheese", coined by a novelist to produce a striking effect on the reader, and in fact harder to interpret. The first metaphor mentioned above, "a hard lesson", is a dead metaphor, because in everyday language everyone uses the adjective "hard" metaphorically to say that something is difficult. Similes are very similar to metaphors, but they include an explicit marker of comparison between the two spheres: "he's a fox" is a metaphor, while "he's sly like a fox" is a simile. A metaphor has a stronger impact, as this last example shows: calling someone a fox is more direct and forceful than saying that someone resembles a fox in their slyness. Finally, there are metonymies. If a metaphor is a passage from one domain to a different one, a metonymy is an association within the same domain or between closely related ones: "the White House" can refer, by metonymy, to the US president, and "the stage" may refer to the theatre. Besides this kind of association, there are two other types of metonymy, often also treated as synecdoche: part-for-whole ("the wheels" for "the car") and whole-for-part ("Italy" for the Italian national football team).
Lakoff and Johnson argued that metaphors are not only a literary matter, something found in writing, but a means through which we understand reality. They use the term "conceptual metaphor" to describe general structures of our cognitive system which lie behind series of more specific metaphorical expressions. Take the conceptual metaphor LOVE IS MADNESS, which underlies metaphorical linguistic expressions such as "I'm crazy about you". A conceptual metaphor has a source domain and a target domain: in LOVE IS MADNESS, MADNESS is the source, the sphere on which the metaphor is based, and LOVE is the target, the sphere described in terms of the source domain. What is called the ground in traditional metaphor analysis is called the mapping in conceptual metaphor theory.
65. What are the three different ways to extract metaphors from a corpus? Which method do you think seems
most viable?
At first, metaphor research was based on manual study of texts and on introspection; later it became possible to use corpora. The problem is that a metaphor is a matter of meaning, and meanings are not annotated in corpora, so metaphors are difficult to search for directly. Deignan describes three methods by which metaphors can be investigated with corpora. The first is to start from the source domain: in LOVE IS MADNESS, MADNESS is the source domain, and we can study the metaphor by searching for words from that domain and seeing which of their occurrences are metaphorical. The second is to start from the target domain, i.e. to search for patterns containing the word used to describe that domain: in THE MIND IS A MACHINE, MIND is the target domain and the word we search for in the corpus. The third and final method is manual analysis: reading through a small corpus by hand to find certain metaphors, or one particular metaphor. I think that even though manual analysis cannot cover as much material as automatic corpus searches, it is the most viable method in terms of the reliability of the results.
d. Dead metaphors were born as creative metaphors and grew up to become conventional metaphors before they ended up as dead metaphors. (true / false)
67. Studying metaphors and metonymy can tell us something about how our minds work. Do you agree?
Why?
I think that metaphors are ways of interpreting the reality that surrounds us. First, linguistically speaking, knowing metaphors and metonymies is part of a good knowledge of a language; secondly, moving from one sphere to another, from the concrete to the abstract, when we read a metaphor or a metonymy requires a cognitive effort. Think of creative metaphors and the effort our mind makes in trying to understand them; conventional metaphors, by contrast, are easier to understand and demand much less concentration.