Michal Auersperger: Master Thesis
Michal Auersperger
Prague 2017
I declare that I carried out this master thesis independently, and only with the
cited sources, literature and other professional sources.
I understand that my work relates to the rights and obligations under the Act
No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the
Charles University has the right to conclude a license agreement on the use of
this work as a school work pursuant to Section 60 subsection 1 of the Copyright
Act.
Title: English grammar checker and corrector: the determiners
Supervisor: RNDr. Pavel Pecina, Ph.D., Institute of Formal and Applied Lin-
guistics
Contents
Introduction
3 Experimental Setup
3.1 Data
3.1.1 Penn Treebank
3.1.2 British National Corpus
3.1.3 One Billion Word Benchmark
3.2 Features
3.2.1 Original features
3.2.2 Extended original features
3.2.3 Countability feature
3.2.4 Word embeddings
3.2.5 Language model prediction
3.3 Feature post-processing
4 Evaluation
4.1 Logistic regression models
4.1.1 Hyperparameter tuning
4.1.2 Replication of a reported experiment
4.1.3 Cut-off value for categorical variables
4.1.4 Features
4.1.5 Approaches to multinomial classification
4.1.6 The best model
4.2 Gradient boosted trees
4.2.1 Hyperparameter tuning
4.2.2 The best model
4.3 Language model
4.4 Human performance
4.4.1 Comparison to automatic methods
Conclusion
Bibliography
List of Figures
List of Tables
Attachments
4.5 Manual annotation texts
4.5.1 Original text
4.5.2 Annotator A
4.5.3 Annotator B
4.5.4 Annotator C
4.5.5 Annotator D
Introduction
The use of the indefinite and definite articles (a, an, the) in English poses a
significant problem for many speakers with a different first language. English
grammars analyze the use of the articles in great detail but they acknowledge the
rules can sometimes be contradictory (e.g. the conflict between the first mention (an)
and cataphora (the) in the sentence: I had the / an impression that something was
wrong. [Dušková et al., 2006, p.81]). As is often the case with language, grammars
provide analyses of different tendencies affecting the language phenomenon rather
than a fixed set of rules to follow. The aim of the thesis is to create an automatic
grammar checker targeted specifically at the use of the, a and an.
limited by an n-gram window provides more freedom in feature engineering. On
the other hand, Sun et al. [2015] advocate the use of their approach by potential
“propagation of errors in the existing tools of NLP” such as taggers and parsers
that are necessary for extracting the noun phrases from the raw text. To facilitate
comparison with what seems to be the larger body of literature on the subject, the former
approach was selected for most of the experiments. However, a language model
is also trained and evaluated for comparison.
As mentioned above, some researchers decided to train their models on texts
written by students of English, others relied on standard corpora, i.e. texts
written by English native speakers. Intuitively, the former approach makes more
sense for practical grammar correction systems. However, for this project, the
latter option was chosen for the following reasons:
a) The student corpora are less accessible and contain less data.
b) Texts written by non-native speakers contain other kinds of errors that can in-
fluence performance of some preprocessing steps (tagging, parsing). Avoiding
these types of errors makes it possible to focus solely on the articles.
c) Utilizing the articles already present in the input text seems practical (one
could, for example, make more corrections in certain contexts while staying
conservative in others). However, this means the resulting system would model
not only the nature of the grammatical phenomenon, but also some systematic
linguistic behavior in non-native speakers of English. This is problematic for
two reasons. Firstly, there is not much consensus on the regularity of such
behavior. Gressang [2010] contrasts different studies finding in turn that the
single main source of article errors is omission, predominant use of the or
predominant use of a/an. Rozovskaya and Roth [2010] show large variability
in article error types based on the first language of the speaker. Secondly,
the researchers who make use of the original articles in the text sometimes find they
need to introduce artificial errors since the true errors are sparse [Rozovskaya
et al., 2013], [Lee, 2004].
English articles come in three forms: the, a and an. In accord with previous
research, in this thesis the two forms of the indefinite article are treated as one
entity. The choice between a and an is a matter of simple phonological rules
rather than a question of noun phrase determination. As such, the proper form
can be chosen in a post-processing step.
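The post-processing step mentioned above could look roughly like the following sketch. It approximates the phonological rule by spelling only, which is an assumption made here for brevity; the function name and the small exception lists are hypothetical, incomplete, and not taken from the thesis.

```python
# Crude orthographic approximation of the a/an choice.
# The exception lists are illustrative assumptions and far from complete.
SILENT_H_PREFIXES = ("hour", "honest", "honor", "honour", "heir")
CONSONANT_SOUND_PREFIXES = ("univers", "unit", "uniq", "unif", "unic",
                            "use", "usu", "euro", "once")
CONSONANT_SOUND_WORDS = {"one"}


def indefinite_article(next_word: str) -> str:
    """Return 'a' or 'an' for the word that follows the article."""
    word = next_word.lower()
    if word.startswith(SILENT_H_PREFIXES):
        return "an"  # written consonant, pronounced vowel
    if word in CONSONANT_SOUND_WORDS or word.startswith(CONSONANT_SOUND_PREFIXES):
        return "a"   # written vowel, pronounced consonant
    return "an" if word.startswith(tuple("aeiou")) else "a"
```

A production system would more likely consult a pronunciation dictionary, since the rule depends on the first sound of the following word rather than on its first letter.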
Summing up the above, in this project, the article correction is understood as
a classification task. Each noun phrase in the input text is assigned one of
the three categories irrespective of the potential original article: definite article
(the), indefinite article (a/an) and zero article (no surface realization).
Thesis structure
Chapter 1 provides linguistic perspective on the use of articles and determiners in
general. In Chapter 2, two machine learning algorithms used for the experiments
in this thesis are described, namely logistic regression and gradient boosted trees.
The motivation for this choice of the methods is also mentioned. Chapter 3
deals with the experimental setup; namely, the processing of the data and the
features used for the experiments. The evaluation of different approaches as well
as comparison to human performance is given in Chapter 4.
1. English Articles from the
Linguistic Perspective
The following chapter provides an introduction to the use of articles from the
point of view of English grammars. In particular, the information is based on
two established sources: a very thorough grammar description by Quirk et al. [1985]
and a more theory-based work by Dušková et al. [2006].
Generally, there are four specific forms of articles in English: two forms of
the indefinite article: a [ə], an [ən] and two forms of the definite article, whose
distinction is, however, apparent only in the spoken language: the [ðə] and the
[ðiː]. For both the definite and indefinite articles, the longer variants, i.e. an and
the [ðiː], occur in front of a (pronounced) vowel.1
The articles are one way of expressing the grammatical category of definite-
ness, which is a property of English noun phrases. Similarly to countability,
another such grammatical category, it does not have a counterpart in the system
of the Czech (and generally Slavic) noun. Presumably, this is why the correct use
of English articles poses a large problem for Czech speakers.
From the semantic perspective, definiteness conveys the information about the
nature of the reference of the noun. In other words, it helps distinguish different
types of relationship between a noun phrase and the entity being referred to.
This notion can be illustrated by the etymology of the two basic forms: the
definite article originates from the Old English demonstrative ‘þæt’ (that) and
the indefinite article from the Old English numeral ‘ān’ (one). This closeness
can be observed in some contemporary uses such as for the moment (≃ for this
moment), Don’t say a word (≃ not a single word). [Dušková et al., 2006, p. 59]
In greater detail, the relationship between definiteness and the reference of the
noun will be discussed in Section 1.3.
ordinal numerals and others.
Typically, all determiners occur at the beginning of the corresponding noun
phrase. However, more than one determiner can be used simultaneously. With
respect to the possible co-occurrence of different sets of determiners, Quirk et al.
[1985, p. 253] describe the following small system of three determiner classes:
When more than one determiner is used within a given noun phrase, they cannot be
from the same category and they must follow the provided order of the categories.
Thus, the example all the five boys is valid because it contains
a predeterminer, a central determiner and a postdeterminer in the given order; while the
example the five all boys is invalid because the order is broken. [Quirk et al., 1985,
p. 253] In practice, noun phrases often occur with only one (if any) determiner.
Apart from a possible preceding predeterminer (once a week, twice as much,
half the money), Dušková et al. [2006, p. 60] give the following situations when
an article is moved from its typical position at the beginning of the noun phrase:
• rather, quite
rather an unexpected result, quite a long time
1.2 Countability
The usage of the definite and indefinite articles as well as other determiners is
heavily influenced by another grammatical category of the English noun – count-
ability. Semantically, countability reflects the distinction between the referred
entities along the lines of continuous – discrete. In other words, countable nouns
refer to classes whose instances are conceptually distinguishable from one another,
while uncountable nouns (also called mass nouns) refer to homogeneous entities
with no distinct units.
Understanding what is countable and what is not differs from language to
language in certain situations. The same is true for the extent to which this
distinction is manifested in the grammar of a language. Dušková et al. [2006,
p. 50] contrast English with Czech, where the notion of countability is manifested
only in the presence/absence of plurals, e.g. vzduch — *vzduchy vs. pes —
psi. On the other hand, in English the countability of nouns further influences
the choice of other co-occurring words within the sentence, most notably the
determiners. This is what makes countability a grammatical category.
1.3 Reference
As mentioned earlier, the grammatical category of definiteness (and its realization
in the form of articles and other determiners) carries information about the type
of reference of the noun phrase. This section deals with the relationship between
different reference types and the corresponding types of determination.
The two main reference categories are the generic and specific reference. The
first one is represented by such sentences as: Men are from Mars, women are from
Venus. Here, both men and women refer to all the members of the given class.
In contrast, a stranger in Then a stranger came in refers to a particular man,
a specific instance of the class of all strangers. Therefore it expresses the specific
reference. This type can be further divided into the definite and indefinite sub-
types. The indefinite one is illustrated by the previous example. The stranger as
the member of the class of strangers is not uniquely defined in the situation. The
definite specific reference is illustrated by the sentence I want to see my doctor,
where the doctor “is referring to something which can be identified uniquely in
the contextual or general knowledge shared by speaker and hearer.” [Quirk et al.,
1985, p. 265] Table 1.1 shows the means by which these types of reference are
expressed for both countable and uncountable nouns.
Reference type         Countable      Uncountable
Generic                the cat        music
                       a cat          milk
                       cats
Specific, definite     the cat        the music
                       the cats       the milk
Specific, indefinite   a cat          music, some music
                       cats           milk, some milk
                       some cats
Table 1.1: Means of expressing different types of reference for countable and
uncountable nouns. The table is adopted from [Dušková et al., 2006, pp. 61–2]
However, not all the variants are equally likely in each context. The most con-
strained variant is the indefinite article. For example, the noun phrase in the
object position of the sentence Nora has been studying a medieval history play
does not have generic interpretation. [Quirk et al., 1985, p. 281] Another factor
limiting the use of the indefinite article for generic reference is the semantic class
of the predicate. Dušková et al. [2006, p. 63] enumerate the following verbs with
which the generic a does not occur: abound, be rare, increase, decrease, . . . .
Other factors can be inter-textual. The anaphoric reference refers to situ-
ations where the referent of the noun is defined based on the previous discourse.
This can be based on direct correspondence between two noun phrases: We went
to see a play, . . . , the play was not very good ; or more free correspondence: I
went down to pet her dog, but the beast almost bit me. Sometimes the relationship
between the noun and its antecedent is less straightforward and can be based on
association: I saw there an old guitar. I wanted to play it but found out the strings
were broken. Here the definite reference of the strings arises from its association
to the guitar. (This situation is called associative anaphoric reference in [Dušková
et al., 2006]).
The cataphoric reference is similar, but the noun is specified by what fol-
lows rather than what precedes it in the discourse. Not surprisingly, the distance
between the noun phrase and the part of the text that makes it definite is much
more limited than in the case of the anaphoric reference. Mostly, it is expressed
by different kinds of post-modification within the same sentence. Based on the
syntactic properties of the post-modification, Dušková et al. [2006, pp. 67–9] list
the following options for the cataphoric reference (the examples are taken from
the same source):
b) Postmodification by an of-phrase
Similarly to the previous case, postmodification by an of-phrase expresses the
cataphoric reference if it is specific enough to identify a single member of the
class denoted by the head of the noun phrase: the roof of the house, the bottom
of the sea. This is not necessarily the case for all the of-phrases: a page of the
book.
The content clause refers to the same referent as its parent noun phrase and
is used to make the referent more specific. Often, this results in the use of
the definite article: The fact that he has won several tournaments won’t help
him to pass his exams. However, sometimes the content clause can be used
for qualifying the noun phrase in which case the is not used: Her father died
at a time when she was too young to fully feel his loss.
d) Apposition
Similarly to content relative clauses, apposition serves as a means of providing
additional information about the referent. Often, this additional information
leads to the referent being uniquely defined: the number seven, the river
Thames, the City of London.
Finally, Quirk et al. [1985] mention two other types of specific definite refer-
ence: sporadic and logical. The first type includes specific nouns that often
occur with the definite article, such as the radio, the Internet, the theater, the
news, the train, the bus, . . . . What holds these examples together is the fact that
in these cases the “reference is made to an institution which may be observed
recurrently at various places and times” [Quirk et al., 1985, p. 269]. The other,
perhaps more straightforward, category consists of examples where the uniqueness
of the referent is inherently encoded in the meaning of a special modifier:
the next/previous/best/worst book, the only question I have, the final stop, . . . .
1.4 Exceptions and conflicting tendencies in article usage
Unfortunately, it is not possible to give a complete system of the use of En-
glish determiners as there are always many exceptions to the general tendencies
provided above. One such example is the class of proper nouns. Due to their
semantic nature, they express specific definite reference, but unlike the common
nouns, they often occur without a definite article: Agatha Christie, little Emily,
Europe, Norway, Oxford Street, Easter Monday, . . . ; others are used with the:
the Burtons, the United States, the Crimea, the Congo, the Alps, the Shetlands,
the City of New York, the Pacific Ocean, the Thames, the Times, . . . . Further-
more, some proper nouns can become common nouns depending on the context:
A Mr. March to see you; He is not a Mozart; while others can take an article
depending on the type of modification: He wrapped the trembling Emily in his
coat; the London I am talking about; the vision of a new Canada. [Dušková et al.,
2006, pp. 75–9]
Sometimes even common nouns do not occur with articles: go to school, in
court, go to bed, travel by train, go by car, from head to foot, hand in hand,
in contrast with, in support of, in fear, in trouble, take office, give way, . . . .
[Dušková et al., 2006, pp. 79–81]
As mentioned in the introduction, there are cases where there are more options
with respect to the choice of articles. This can result from different interpretations
of the situation: I discussed an interesting project with Jim last night. Afterwards
I went to discuss (a/the) project with Fred. [Dušková et al., 2006, p. 66] Similarly,
the use of an article can differ with respect to the countable/uncountable inter-
pretation of the noun: There was a short silence — There was absolute silence.
Yet in other cases, the options reflect virtually no modification in the meaning:
throw (a) new light on, at this time of (the) day, in (the) summer, take (a) pride
in sth., watch (the) TV . . . . [Dušková et al., 2006, pp. 81–2]
2. Machine Learning Algorithms
In the following chapter, two machine learning algorithms used for the experi-
ments in this thesis are described, namely logistic regression and gradient boosted
trees.
The motivation for this choice is mainly the ability of both algorithms to pro-
cess sparse data: logistic regression (in its form of the maximum entropy model)
is an established approach for many problems in the area of natural language
processing, where sparsity is inherent to most of them (for a review of maximum
entropy model use cases in NLP, see [Ratnaparkhi, 1998]); the other algorithm
— gradient boosted trees — has recently seen a new implementation that can
handle sparse data efficiently [Chen and Guestrin, 2016].
Moreover, logistic regression was used by Lee [2004], whose experiment is
replicated as part of this thesis (for motivation see Chapter 3). Another reason for
selecting gradient boosted trees is the fact that its implementation — XGBoost —
reportedly achieves very good results on many different tasks (as demonstrated by
Kaggle competitions) [Chen and Guestrin, 2016], while general gradient boosted
trees have proven effective in large scale production code [He et al., 2014].
Training
function, i.e. a function that measures the error a model is making. In the case
of logistic regression the cost can be expressed as
\[
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right] + \frac{1}{m}\lambda\sum_{j=1}^{n}\theta_j^2. \tag{2.3}
\]
Here, m stands for the number of training examples, n stands for the number
of features, $x_i$ and $y_i$ represent the feature vector and the true label of the i-th
example. The last term $\frac{1}{m}\lambda\sum_{j=1}^{n}\theta_j^2$ is a regularization term that controls the
complexity of the model. The strength of the regularization can be tuned by the
parameter λ.
By minimizing the cost function of the given form, one maximizes the likeli-
hood L(θ|x), which is defined as the probability of the observed data given the
model:
\[
L(\theta|x) = P(x|\theta) = \prod_{i=1}^{m} p_\theta(y_i|x_i) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i}\,(1 - h_\theta(x_i))^{1-y_i}. \tag{2.4}
\]
Now, the cost function given in (2.3) is equal to $-\frac{1}{m}\log L(\theta|x)$ plus the regularization term.
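The relationship between the cost (2.3) and the likelihood (2.4) can be checked numerically. The following is a minimal sketch on made-up toy data; the function names and all values are illustrative assumptions, and the hypothesis is assumed to be the usual sigmoid of a linear score with `theta[0]` as an unregularized intercept.

```python
import math


def hypothesis(theta, x):
    # theta[0] is the intercept; the remaining weights pair with the features.
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def regularizer(theta, lam, m):
    # (1/m) * lambda * sum_j theta_j^2, with j starting at 1 as in (2.3).
    return lam / m * sum(t * t for t in theta[1:])


def cost(theta, xs, ys, lam):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = hypothesis(theta, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m + regularizer(theta, lam, m)


def likelihood(theta, xs, ys):
    product = 1.0
    for x, y in zip(xs, ys):
        h = hypothesis(theta, x)
        product *= h ** y * (1 - h) ** (1 - y)
    return product


# Toy data (assumed values, for illustration only).
xs = [[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]]
ys = [1, 0, 1]
theta = [0.1, -0.3, 0.8]
lam = 0.5

from_likelihood = -math.log(likelihood(theta, xs, ys)) / len(xs) + regularizer(theta, lam, len(xs))
print(cost(theta, xs, ys, lam), from_likelihood)
```

Both printed numbers should agree up to floating-point rounding.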
For minimizing the cost function, gradient descent can be used; however, in
practice, more efficient algorithms are usually chosen. The implementation of
logistic regression used for this thesis uses the so-called LBFGS algorithm
[Nocedal, 1980]. It is based on Newton’s method, a procedure that iteratively
approaches a local minimum of a function2 by setting
\[
x_t := x_{t-1} - \alpha\,\frac{f'(x_{t-1})}{f''(x_{t-1})}, \tag{2.5}
\]
where xt and xt−1 are the values of x at the step t and t − 1, respectively;
and α ∈ (0, 1) is a learning rate, i.e. the parameter determining the speed of
convergence. This update rule comes from approximating the function f by its
second order Taylor expansion at the point xt−1 :3
\[
f(x_t) = f(x_{t-1} + \Delta x) \simeq f(x_{t-1}) + f'(x_{t-1})\Delta x + \frac{1}{2} f''(x_{t-1})\Delta x^2. \tag{2.6}
\]
Setting the first derivative of the approximation of the function with respect to
∆x to be zero and solving for ∆x, the update rule in (2.5) is derived as:
\[
\begin{aligned}
0 &= \frac{d}{d\Delta x}\left[f(x_{t-1}) + f'(x_{t-1})\Delta x + \frac{1}{2} f''(x_{t-1})\Delta x^2\right] \\
  &= f'(x_{t-1}) + f''(x_{t-1})\Delta x \\
\Delta x &= -\frac{f'(x_{t-1})}{f''(x_{t-1})}
\end{aligned} \tag{2.7}
\]
2 Since the cost function in (2.3) is convex, the algorithm finds the global minimum of the
function.
3 Taylor series approximates a function f(x) at a point a by $\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x - a)^n$, where
$f^{(n)}(a)$ is the n-th derivative of f evaluated at a.
where xt and xt−1 are vectors. ∇f (xt−1 ) and Hf (xt−1 ) represent the gradient and
the Hessian matrix of the function and, given the function f takes m parameters
x1 , x2 , . . . xm , are defined as:
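\[
\nabla f = \begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_m}
\end{bmatrix},
\qquad
H = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_m} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_m \partial x_1} & \frac{\partial^2 f}{\partial x_m \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_m^2}
\end{bmatrix}.
\]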
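To illustrate how the gradient and the Hessian enter the Newton update, here is a small sketch for a function of two variables. It assumes the standard multivariate form of the update, $x_t := x_{t-1} - \alpha H^{-1}\nabla f$; the example function, the names and the step count are illustrative choices. It is plain Newton's method, not the LBFGS algorithm, which only approximates the Hessian.

```python
def newton_minimize(grad, hess, x0, alpha=0.5, steps=60):
    """Iterate x_t = x_{t-1} - alpha * H^{-1} * grad for two variables."""
    x1, x2 = x0
    for _ in range(steps):
        g1, g2 = grad(x1, x2)
        (a, b), (c, d) = hess(x1, x2)
        det = a * d - b * c
        # Solve H * delta = g for the 2x2 case (Cramer's rule).
        delta1 = (d * g1 - b * g2) / det
        delta2 = (a * g2 - c * g1) / det
        x1, x2 = x1 - alpha * delta1, x2 - alpha * delta2
    return x1, x2


# Example (assumed): f(x1, x2) = x1^2 + x1*x2 + 2*x2^2 - 3*x1, a convex quadratic.
def grad(x1, x2):
    return 2 * x1 + x2 - 3, x1 + 4 * x2


def hess(x1, x2):
    return (2.0, 1.0), (1.0, 4.0)


minimum = newton_minimize(grad, hess, x0=(0.0, 0.0))
print(minimum)  # close to (12/7, -3/7), where the gradient vanishes
```

For a quadratic function the second-order Taylor expansion is exact, so each step shrinks the distance to the minimum by the factor $1 - \alpha$.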
multinomial classification:
\[
p(y|x) = \frac{\prod_{j=1}^{k} \alpha_j^{f_j(y,x)}}{\sum_{y'} \prod_{j=1}^{k} \alpha_j^{f_j(y',x)}}, \tag{2.9}
\]
where y is the target class and x is the example or the ‘context’. Each example is described
by k binary features $f_j(y, x)$. In the learning phase, each feature is assigned
its weight $\alpha_j$.
The notion of a feature is different from the common understanding of the
term in other machine learning methods. Here, a feature is a function f : A×B →
{0, 1}, where A denotes the set of output classes and B denotes the set of possible
contexts. After introducing the contextual predicate cp : B → {true, false},
which is a function indicating the presence of some information in the given
context b ∈ B, Ratnaparkhi [1998] defines a feature as
\[
f_{cp,a'}(a, b) = \begin{cases}
1 & \text{if } a = a' \text{ and } cp(b) = \text{true} \\
0 & \text{otherwise.}
\end{cases} \tag{2.10}
\]
Thus all the features of the model are binary, and all of them are functions of
both the context and the output variable.
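The feature definition (2.10) and the model (2.9) can be sketched in a few lines. The contextual predicates, the class labels and the weights below are hypothetical values chosen for illustration; in practice the weights would be estimated from data.

```python
def make_feature(contextual_predicate, target_class):
    """Build f_{cp,a'}(a, b): 1 if a == a' and cp(b) holds, else 0."""
    def feature(a, b):
        return 1 if a == target_class and contextual_predicate(b) else 0
    return feature


def maxent_probability(y, context, classes, features, alphas):
    """p(y | x) from (2.9): normalized product of alpha_j ** f_j(y, x)."""
    def score(label):
        product = 1.0
        for f, alpha in zip(features, alphas):
            product *= alpha ** f(label, context)
        return product
    return score(y) / sum(score(label) for label in classes)


# Hypothetical predicates over a context represented as a dict.
def head_is_singular(context):
    return not context["plural"]


def head_is_plural(context):
    return context["plural"]


classes = ["the", "a", "zero"]
features = [make_feature(head_is_singular, "a"),
            make_feature(head_is_plural, "zero")]
alphas = [2.0, 3.0]  # assumed weights, not learned

context = {"head": "impression", "plural": False}
p_a = maxent_probability("a", context, classes, features, alphas)
print(p_a)  # 2 / (1 + 2 + 1)
```

Only the first feature fires for this context, and only for the class a, so that class receives twice the unnormalized score of the other two.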
In order to estimate the parameters of the model ($\{\alpha_j\}_{j=1}^{k}$), Ratnaparkhi [1998]
uses an iterative algorithm known as generalized iterative scaling (GIS).
There are two main differences between (binary) logistic regression and the
maximum entropy framework: firstly, the number of classes being predicted and
secondly, the type of features used for describing the examples (because logistic
regression uses real-valued explanatory variables irrespective of the output class).
Ratnaparkhi [1998, pp. 27–28] shows that if one allows real-valued features
and limits the number of predicted classes, the maximum entropy model (2.9)
is equivalent to logistic regression (2.1). Assuming the new type of features is
defined as
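\[
f_0(a, b) = \begin{cases}
1 & \text{if } a = 1 \\
0 & \text{otherwise}
\end{cases}
\qquad
f_j(a, b) = \begin{cases}
x_j & \text{if } a = 1 \\
0 & \text{otherwise,}
\end{cases} \tag{2.11}
\]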
(2.1) can be derived as follows:
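\[
\begin{aligned}
p(y = 1|x) &= \frac{\prod_{j=0}^{k} \alpha_j^{f_j(1,x)}}{\prod_{j=0}^{k} \alpha_j^{f_j(0,x)} + \prod_{j=0}^{k} \alpha_j^{f_j(1,x)}} \\
&= \frac{\prod_{j=0}^{k} \alpha_j^{f_j(1,x)}}{1 + \prod_{j=0}^{k} \alpha_j^{f_j(1,x)}} \\
&= \frac{e^{\theta^T x}}{1 + e^{\theta^T x}} = \frac{1}{1 + e^{-\theta^T x}}.
\end{aligned} \tag{2.12}
\]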
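The third equality results from setting θ := (ln(α0 ), ln(α1 ), . . . , ln(αk )) and from
the fact that
\[
\prod_{j=0}^{k} \alpha_j^{f_j(1,x)} = \exp\Big(\sum_{j=0}^{k} \ln(\alpha_j) f_j(1, x)\Big) = \exp\Big(\sum_{j=0}^{k} \ln(\alpha_j) x_j\Big),
\]
where $x_0 := 1$, since $f_0(1, x) = 1$.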
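The equivalence derived in (2.12) can also be verified numerically. The sketch below uses the real-valued features from (2.11) and compares the maximum entropy probability with the logistic function after setting θ to the logarithms of the weights; the weights and the feature vector are arbitrary illustrative values.

```python
import math


def maxent_binary(alphas, x):
    # Features from (2.11): f_0(1, x) = 1, f_j(1, x) = x_j and f_j(0, x) = 0.
    values = [1.0] + list(x)
    score_1 = 1.0
    for alpha, value in zip(alphas, values):
        score_1 *= alpha ** value
    score_0 = 1.0  # every exponent is zero for the class 0
    return score_1 / (score_0 + score_1)


def logistic(theta, x):
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


alphas = [0.8, 1.5, 0.4]               # assumed positive weights
theta = [math.log(a) for a in alphas]  # theta := (ln(alpha_0), ..., ln(alpha_k))
x = [2.0, -1.0]

print(maxent_binary(alphas, x), logistic(theta, x))
```

The two printed probabilities should coincide up to rounding error.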
\[
h_\theta(x) = p(y = c|x) = \frac{\exp(\theta^{(c)\top} x)}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x)}. \tag{2.13}
\]
K denotes the number of classes, c an arbitrary class and θ(c) a set of parameters
associated with the class c. Unlike in the case of binary logistic regression, here
the set of model parameters forms a matrix:
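\[
\theta = \begin{bmatrix}
\theta_0^{(1)} & \theta_0^{(2)} & \cdots & \theta_0^{(K)} \\
\theta_1^{(1)} & \theta_1^{(2)} & \cdots & \theta_1^{(K)} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_n^{(1)} & \theta_n^{(2)} & \cdots & \theta_n^{(K)}
\end{bmatrix} \tag{2.14}
\]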
Each row of the matrix corresponds to a single feature and each column corre-
sponds to a specific output class.
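A possible way to evaluate the hypothesis (2.13) with the parameter matrix (2.14) is sketched below. The matrix layout (rows for features, columns for classes, row 0 for the intercept) follows the description above; the numeric values are assumptions, and subtracting the largest score is a common numerical safeguard that leaves the ratios unchanged.

```python
import math


def class_probabilities(theta, x):
    """theta[i][c] is the weight of feature i for class c (row 0: intercept)."""
    values = [1.0] + list(x)
    n_classes = len(theta[0])
    scores = [sum(theta[i][c] * values[i] for i in range(len(values)))
              for c in range(n_classes)]
    top = max(scores)  # shift the scores to avoid overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative 3 x 3 parameter matrix: intercept row plus two features, three classes.
theta = [[0.1, 0.0, -0.1],
         [1.0, -1.0, 0.0],
         [0.0, 0.5, -0.5]]
probs = class_probabilities(theta, [2.0, 1.0])
print(probs)
```

The class with the largest linear score, here the first one, receives the highest probability.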
In order to adapt the training phase to the new model, the cost function from
the binary model (2.3) is replaced by
\[
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} I(y_i = k)\,\log\frac{\exp(\theta^{(k)\top} x_i)}{\sum_{j=1}^{K}\exp(\theta^{(j)\top} x_i)} + \frac{1}{m}\lambda\Omega. \tag{2.15}
\]
I is the indicator function signaling whether its argument is true or not. By using
the indicator function together with the summation over all K classes, the cost
function considers for each example only the probability the model assigns to the
correct class y. The regularization term Ω also needs to be changed to include
all the weights of the model:
\[
\Omega = \sum_{i=1}^{n}\sum_{j=1}^{K} \theta_{ij}^2. \tag{2.16}
\]
Again, the form of the cost function is justified by its relationship to the
likelihood, which for this model is expressed as
\[
L(\theta|x) = P(x|\theta) = \prod_{i=1}^{m} \frac{\exp(\theta^{(y_i)\top} x_i)}{\sum_{j=1}^{K}\exp(\theta^{(j)\top} x_i)}. \tag{2.17}
\]
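The multinomial cost (2.15) with the regularization term (2.16) can be sketched as follows. Because of the indicator function, only the log-probability of the true class contributes for each example, so the data term is the negative log-likelihood (2.17) divided by m. The data, labels and parameter values are illustrative assumptions; classes are encoded as column indices.

```python
import math


def class_probabilities(theta, x):
    values = [1.0] + list(x)  # prepend 1 for the intercept row
    scores = [sum(theta[i][c] * values[i] for i in range(len(values)))
              for c in range(len(theta[0]))]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multinomial_cost(theta, xs, ys, lam):
    m = len(xs)
    # The indicator I(y_i = k) keeps only the term of the true class.
    log_likelihood = sum(math.log(class_probabilities(theta, x)[y])
                         for x, y in zip(xs, ys))
    # Omega sums the squared weights of rows 1..n, i.e. without the intercept row.
    omega = sum(w * w for row in theta[1:] for w in row)
    return -log_likelihood / m + lam * omega / m


xs = [[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 2.0]]
ys = [0, 1, 2, 0]  # class indices (assumed encoding)
theta = [[0.1, 0.0, -0.1],
         [1.0, -1.0, 0.0],
         [0.0, 0.5, -0.5]]
print(multinomial_cost(theta, xs, ys, lam=0.5))
```

With all parameters set to zero the model assigns probability 1/K to every class, so the cost equals log K regardless of the data.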
trained on different subsets of the general problem and then combined together
to produce the final outcome. This type of approach is not limited to logistic
regression and can be used with any set of binary classifiers. (In fact, it is often
used with support vector machines that are “inherently two-class classifiers.”
[Manning et al., 2008, p. 330])
In the one-vs-rest approach, a single binary classifier is trained for each class to
discriminate between that class on the one side and the other classes on the other
side. At the time of classification, all the models are run on the given example
and the label that achieves the highest score is selected as the final outcome.
Similarly to the multinomial logistic regression, this algorithm learns K sets
of parameters (where K is the number of classes). Here however, the parameters
are estimated separately on modified data, which can lead to different results as
shown in Figure 2.1. A disadvantage of this approach is that by dividing the
training examples into a single class and its complement, one can create highly
imbalanced data.
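The one-vs-rest procedure can be sketched as follows. The binary classifier here is an unregularized logistic regression trained by plain gradient descent, which is a simplification chosen for brevity and not the setup used in the experiments; the toy data, labels and hyperparameters are likewise assumptions.

```python
import math


def score(theta, x):
    values = [1.0] + list(x)
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, values))))


def train_binary(xs, ys, steps=3000, rate=0.1):
    """Gradient descent for an unregularized binary logistic regression."""
    theta = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = score(theta, x) - y
            for j, value in enumerate([1.0] + list(x)):
                grad[j] += error * value
        theta = [t - rate * g / len(xs) for t, g in zip(theta, grad)]
    return theta


def train_one_vs_rest(xs, labels):
    # One binary problem per class: that class (1) against all the others (0).
    return {c: train_binary(xs, [1 if label == c else 0 for label in labels])
            for c in sorted(set(labels))}


def predict_one_vs_rest(models, x):
    # Run every model and keep the label with the highest score.
    return max(models, key=lambda c: score(models[c], x))


xs = [(0, 0), (0.5, 0.5), (0, 1), (1, 0),
      (4, 0), (3.5, 0.5), (4, 1), (3, 0),
      (2, 4), (1.5, 3.5), (2.5, 3.5), (2, 3)]
labels = ["the"] * 4 + ["a"] * 4 + ["zero"] * 4
models = train_one_vs_rest(xs, labels)
print(predict_one_vs_rest(models, (3.8, 0.3)))
```

Each binary problem in this toy setup has four positive and eight negative examples, which illustrates the imbalance mentioned above.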
One vs. one approach
However, in the case of three classes, the number of models is the same as for the
one-vs-rest approach.
The final outcome of the set of binary models can in this case be determined
by a simple majority vote, i.e. selecting the class that was predicted most often.
The obvious disadvantage of this approach is that sometimes ties may occur.
In the case of a three-class problem, this happens when each model predicts a
different class. To mitigate this problem, Hastie and Tibshirani [1998] suggested
an algorithm known as pairwise coupling.
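The majority vote and the tie situation can be illustrated by a short sketch. It assumes one binary model per unordered pair of classes, i.e. K(K − 1)/2 models; the helper name and the use of `None` to signal a tie are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def majority_vote(pairwise_predictions):
    """Return the most frequently predicted class, or None in case of a tie."""
    counts = Counter(pairwise_predictions)
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    return winners[0] if len(winners) == 1 else None


classes = ["the", "a", "zero"]
pairs = list(combinations(classes, 2))  # one binary model per pair of classes
print(len(pairs))

print(majority_vote(["the", "the", "zero"]))  # two of the three models agree
print(majority_vote(["a", "the", "zero"]))    # every model predicts a different class
```

The second call shows the tie described above, which the vote alone cannot resolve.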
Unlike in the case of the one-vs-rest approach, the problem with a set of
pairwise classifiers is that the output they produce cannot be compared directly
as each output represents $P(A_i \mid A_i \text{ or } A_j)$, where $A_i$ and $A_j$ belong to a set of K
classes $A_1, A_2, \ldots, A_K$. Setting K := 3, the individual outputs for one observation
can be represented by a 3 by 3 matrix:
\[
\begin{bmatrix}
\cdot & r_{1,2} & r_{1,3} \\
r_{2,1} & \cdot & r_{2,3} \\
r_{3,1} & r_{3,2} & \cdot
\end{bmatrix}
\]
where each $r_{ij} = \frac{p_i}{p_i + p_j}$. In the matrix, each pair $(r_{ij}, r_{ji})$ sums to 1. The task
is to estimate the probabilities pi = P (Ai ). Setting nij to be the number of
observations based on which the probability rij was estimated, i. e. the number
of examples in the training data where the target class is either Ai or Aj , Hastie
and Tibshirani [1998] propose estimating the probabilities by finding parameters
of the binomial model
To find the probabilities pi , they suggest an iterative algorithm that starts from an
initial guess p̂i and at each step moves the corresponding µij closer to rij . This is done
by minimizing the Kullback-Leibler distance between the theoretical and empirical distribution:
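\[
\begin{aligned}
\ell(p) &= \sum_{i \neq j} n_{ij}\, r_{ij} \log\frac{r_{ij}}{\mu_{ij}} \\
&= \sum_{i<j} n_{ij}\left[r_{ij}\log\frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij})\log\frac{1 - r_{ij}}{1 - \mu_{ij}}\right]
\end{aligned} \tag{2.19}
\]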
By setting its derivative
\[
\frac{\partial \ell(p)}{\partial p_i} = \sum_{j \neq i} n_{ij}\left(-\frac{r_{ij}}{p_i} + \frac{1}{p_i + p_j}\right) \tag{2.20}
\]
Hastie and Tibshirani [1998] prove that the above procedure converges, which
results in stable estimates of P (Ai ) for each class A1 , A2 , . . . AK . Based on these
estimates, one can predict a new example by selecting the class with the highest
probability p̂.
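The iterative procedure can be sketched as follows. The multiplicative update used here, p̂i ← p̂i · Σj nij rij / Σj nij µij followed by renormalization, is the update commonly attributed to Hastie and Tibshirani [1998]; the function name, the number of sweeps and the toy input are assumptions made for illustration.

```python
def pairwise_coupling(r, n, sweeps=200):
    """Estimate class probabilities from pairwise estimates.

    r[i][j] estimates P(A_i | A_i or A_j); n[i][j] is the number of training
    examples behind that estimate. Diagonal entries are ignored.
    """
    k = len(r)
    p = [1.0 / k] * k  # initial guess: uniform distribution
    for _ in range(sweeps):
        for i in range(k):
            observed = sum(n[i][j] * r[i][j] for j in range(k) if j != i)
            # mu_ij = p_i / (p_i + p_j) under the current estimate
            expected = sum(n[i][j] * p[i] / (p[i] + p[j]) for j in range(k) if j != i)
            p[i] *= observed / expected
            total = sum(p)
            p = [value / total for value in p]  # keep the estimates summing to 1
    return p


# Consistent toy input generated from known probabilities (assumed values).
true_p = [0.5, 0.3, 0.2]
r = [[0.0 if i == j else true_p[i] / (true_p[i] + true_p[j]) for j in range(3)]
     for i in range(3)]
n = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

estimate = pairwise_coupling(r, n)
print(estimate)
```

When the pairwise estimates are mutually consistent, as in this toy input, the procedure recovers the probabilities that generated them.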
Besides the more involved process of combining the predictions of the individual
binary models, another disadvantage of the one-vs-one approach is the fact
that each classifier is also used for predicting examples of classes it did not see
during training (e.g. for an observation with true output Ak , a model trained to
distinguish between Ai and Aj also needs to be used). A related problem is a lower
number of training examples for each binary classifier when compared to the one-vs-rest
approach. On the other hand, the models in the one-vs-one approach do not suffer
from artificially introduced imbalance in the data. The way the one-vs-one method
can differ from the previous ones in terms of the extracted decision boundaries
can be seen in Figure 2.1.
While sometimes the one-vs-one and one-vs-rest approaches are considered “not
very elegant” [Manning et al., 2008, p. 330], in practice, they can work well as
will be shown by the experiments in a later section of the thesis.
implementation of gradient tree boosting, an algorithm introduced by Friedman
[2000]. The general idea of boosting is to build an ensemble of models, where
each model performs slightly better than chance (the models are often called
‘weak learners’). However, each model in the sequence is trained in such a way
that it learns to improve upon the prediction of the previous ones. Together, the
collection of such models produces a composite model with (hopefully) accurate
predictions (the so-called ‘strong learner’).
XGBoost has been shown to achieve state-of-the-art results in many machine
learning problems, as illustrated by top scoring solutions in competitions such as
those hosted on Kaggle. Moreover, the implementation is scalable thanks to parallel
tree learning and efficient handling of sparse data, which makes it a good choice
for natural language processing applications, determiner prediction included.
In general terms, the XGBoost model is defined as a collection of functions
whose outputs are summed to produce the final prediction:
\[
h(x) = \sum_{k=1}^{K} f_k(x), \quad f_k \in \mathcal{F}. \tag{2.22}
\]
Here, x stands for the feature vector corresponding to the example being pre-
dicted; h(x) is the model hypothesis or prediction for the particular example; f is
a function from a predefined functional space F. While the XGBoost implementation
is general enough to provide more options for F, the standard choice, taken also
in this thesis, is to define F as a set of possible regression trees.
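The additive model (2.22) and the general idea of boosting can be illustrated by a deliberately simplified sketch: each new one-split tree (a stump) is fitted to the residuals of the ensemble built so far, which corresponds to gradient boosting for the squared error. This is not the regularized objective optimized by XGBoost; the one-dimensional data, the shrinkage factor and all names are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under the squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - c_left) ** 2 for r in left)
                 + sum((r - c_right) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, c_left, c_right)
    _, threshold, c_left, c_right = best
    return lambda x: c_left if x <= threshold else c_right


def boost(xs, ys, rounds=50, shrinkage=0.5):
    ensemble, predictions = [], [0.0] * len(xs)
    for _ in range(rounds):
        # Each weak learner is trained on what the previous ones still get wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        ensemble.append(lambda x, s=stump: shrinkage * s(x))
        predictions = [p + shrinkage * stump(x) for p, x in zip(predictions, xs)]
    return ensemble


def predict(ensemble, x):
    return sum(f(x) for f in ensemble)  # h(x) = sum_k f_k(x), as in (2.22)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]
ensemble = boost(xs, ys)
print([round(predict(ensemble, x), 2) for x in xs])
```

Each stump alone is a poor model of the quadratic target, but the sum of the stumps follows the training data more closely with every round.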
— the output of the function — shared by all the observations belonging to the
corresponding region (Figure 2.2c).
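[Figure 2.2: a regression tree (root split x2 ≤ 2, followed by the splits x1 ≤ 3 and x1 ≤ 2), the corresponding partition of the (x1, x2) plane into the regions R1 to R4, and the resulting piecewise-constant output h(x).]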
Let ci denote the constant corresponding to the region (leaf) Ri out of the
total number of J such regions (leaves). Then the prediction of the regression
tree can be formally written as
$$h(x) = \sum_{i=1}^{J} c_i \, I(x \in R_i), \qquad (2.23)$$
where I is the indicator function signaling whether its argument is true or not.
Training
The parameters of the model described in (2.23) are the structure of the tree (i.e.
the nodes with the splitting criteria), which is represented by I in the equation,
and the set of output values $\{c_i\}_1^J$.
These parameters are estimated from the training data by a greedy algorithm.
The tree is built by a top–down approach, i.e. starting at the root and finishing
at the leaves. For each node, a splitting criterion is selected out of all possible
features and their values. The objective is to split the data into two sets that
are, with respect to the target variable, as homogeneous as possible. The quality
of the split is determined by the mean squared error:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - h'(x_i))^2, \qquad (2.24)$$
where N is the number of the training observations and h′ is the current model
(i.e. the tree built so far, including the split being evaluated). The output values
$\{c_i\}_1^J$ corresponding to the regions $\{R_i\}_1^J$ are computed as the mean of
the output variable of all the observations contained within the specific region:
$$c_i = \frac{1}{|R_i|} \sum_{j=1}^{|R_i|} (y_j \mid x_j \in R_i). \qquad (2.25)$$
The same procedure of finding the best data split is applied recursively to
the sets resulting from the previous steps (Figure 2.3) until a stopping criterion
is met. The criterion might be based on the number of observations in a leaf, or
on a threshold on a node’s mean squared error beyond which the node is not split
any further.
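The greedy procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation used in the thesis: candidate thresholds are the observed feature values, the stopping criterion is a maximum depth combined with a minimum number of observations, and the names `best_split`, `build_tree` and `predict` are hypothetical.

```python
from statistics import mean


def mse(ys):
    """Mean squared error of predicting the mean of ys for every element."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)


def best_split(X, y):
    """Return (feature index, threshold, MSE after the split) or None.

    An observation goes to the left child when x[feature] <= threshold.
    Each child predicts the mean of its targets, as in (2.25).
    """
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            if not left or not right:
                continue  # a split must produce two non-empty sets
            score = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
            if best is None or score < best[2]:
                best = (f, thr, score)
    return best


def build_tree(X, y, max_depth=2, min_samples=2, depth=0):
    """Grow the tree top-down; stop at max_depth or when a node is too small."""
    split = None
    if depth < max_depth and len(y) >= min_samples:
        split = best_split(X, y)
    if split is None:
        return {"value": mean(y)}  # leaf: constant c_i of the region
    f, thr, _ = split
    left = [i for i, row in enumerate(X) if row[f] <= thr]
    right = [i for i, row in enumerate(X) if row[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           max_depth, min_samples, depth + 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            max_depth, min_samples, depth + 1),
    }


def predict(tree, x):
    """Follow the splitting criteria down to a leaf and return its constant."""
    while "value" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]
```

Because every candidate split is evaluated exhaustively, the cost of this sketch grows quickly with the number of observations and features; practical implementations sort the feature values once and update the statistics incrementally.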
$$P^{*} = \sum_{m=0}^{M} p_m. \qquad (2.26)$$
Figure 2.3: First steps of the greedy algorithm minimizing the mean squared
error of a regression tree on an artificially generated dataset (500 observations of
two uniform random variables: X1 ∼ U(0, 4), X2 ∼ U(0, 4), Y = X1 X2 ). From
top left to bottom right the MSE values are 12.1, 7.8, 4.5 and 3.2.
At each step t, one tries to find a tree ft that makes the predictions of the
new ensemble as close to the observed values as possible. This is achieved by
minimizing a cost function, which can be, depending on the task, the mean
squared error (2.24), logistic loss (2.3), multinomial logistic loss (2.15) and many
others.
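For the mean squared error, fitting the new tree so that the ensemble moves as close as possible to the observed values amounts to fitting the tree to the current residuals. The following sketch illustrates this on a single feature with one-split trees (stumps). It is a toy illustration, not the thesis implementation; the function names and the `shrinkage` scaling factor are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree on a single feature (squared error)."""
    best = None
    # The largest value is skipped so that the right side is never empty.
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        c_left = sum(left) / len(left)
        c_right = sum(right) / len(right)
        err = (sum((r - c_left) ** 2 for r in left)
               + sum((r - c_right) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, c_left, c_right)
    _, thr, c_left, c_right = best
    return lambda x: c_left if x <= thr else c_right


def boost(xs, ys, n_trees=50, shrinkage=0.5):
    """Build the ensemble step by step; each tree is fit to the residuals."""
    trees = []
    preds = [0.0] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + shrinkage * tree(x) for p, x in zip(preds, xs)]
    # The final prediction is the (scaled) sum of the outputs of all trees.
    return lambda x: sum(shrinkage * t(x) for t in trees)
```

For other cost functions the residuals would be replaced by the negative gradient of the loss with respect to the current prediction; the loop itself stays the same.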
2.2.3 XGBoost
XGBoost extends the original tree boosting algorithm by adding a regularization
term. Thus the actual regularized cost function at step t is defined as:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)
= \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t} \Omega(f_i) \qquad (2.29)$$
where T is the number of tree leaves, $w_j$ is the predicted value for the j-th leaf
of the tree, and γ and λ are coefficients regulating the strength of the
regularization term with respect to the tree size and the prediction values. Both
coefficients serve as hyperparameters of the model. Unfortunately, the choice of
the specific form of the regularization term (2.32) is not justified by Chen and
Guestrin [2016].
5 See Section 2.1.1 for its definition.
By expanding the regularization term (2.32) and setting $I_j = \{i \mid x_i \in R_j\}$,
i.e. the set of indices of all examples belonging to the region (leaf) j, the cost
function can be further rewritten as:
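$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T \qquad (2.33)$$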
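For a given tree, the best estimate for the output value⁶ $w_j$ can be found
analytically by taking the derivative of the relevant part of the loss function with
respect to $w_j$ and setting it to zero:
$\left( (\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2 \right)' = 0$.
Solving the equation for $w_j$ gives
$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}. \qquad (2.34)$$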
The formula also illustrates the impact the regularization coefficient λ has on
lowering the predicted values of the tree.
After substituting the leaf values in (2.33) with their estimates from (2.34) and
simplifying, the ultimate regularized loss function of the whole tree is expressed
as
$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \qquad (2.35)$$
Here, I stands for all the examples belonging to the potential split node, $I_L$
and $I_R$ represent the examples belonging to its left and right child (given the
split is chosen). Thus the cost reduction of a split can be interpreted as the
difference between the cost of the node before the split and the cost of the two
new nodes after the split:
$$\left( -\frac{1}{2} \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} + \gamma \right) - \left( -\frac{1}{2} \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \gamma - \frac{1}{2} \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} + \gamma \right).$$
Additionally, this form of the split gain equation shows explicitly the role of the
regularization parameter γ. It prevents the algorithm from selecting splits that do
not result in cost reduction that is ‘large enough’, namely those whose reduction
is less than γ.
6 Chen and Guestrin [2016] use the term ‘weight’ for the output of a leaf.
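The quantities derived above are easy to compute once the first and second derivatives of the loss are available for each example. The sketch below is an illustration only; the function names are hypothetical, and the example uses the squared error loss $l = \frac{1}{2}(y - \hat{y})^2$, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf output: -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)


def leaf_cost(g, h, lam, gamma):
    """Cost contribution of a single leaf, including its gamma penalty."""
    return -0.5 * sum(g) ** 2 / (sum(h) + lam) + gamma


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Cost of the unsplit node minus the cost of its two children."""
    parent = leaf_cost(g_left + g_right, h_left + h_right, lam, gamma)
    children = (leaf_cost(g_left, h_left, lam, gamma)
                + leaf_cost(g_right, h_right, lam, gamma))
    return parent - children


# Targets [1, 1, 5, 5] with current predictions of 0 (squared error loss).
g = [-1.0, -1.0, -5.0, -5.0]
h = [1.0, 1.0, 1.0, 1.0]

# With lambda = 1 the left leaf outputs 2/3 instead of the unregularized mean 1.
w_left = leaf_weight(g[:2], h[:2], lam=1.0)

# The same split is accepted for gamma = 0 and rejected for gamma = 3.
gain_free = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0)
gain_penalized = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=3.0)
```

The two gains differ by exactly γ, which makes the thresholding role of the parameter visible: a split is only worthwhile when the reduction achieved without the penalty exceeds γ.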
The provided derivation of the split finding criterion illustrates the differences
between XGBoost and the original algorithm of gradient boosted trees. Using the
second order approximation of the original loss function l (2.30) makes it possible
to have a single implementation for any user-defined loss function (as long as l′
and l′′ are provided). Moreover, by adding the tree complexity penalty to the cost
function (2.32), regularization is embedded into the core of the tree construction.
3. Experimental Setup
The conducted experiments were designed first to replicate and then to improve
upon the experiment by Lee [2004], who, in turn, followed rather closely the
experiment by Minnen et al. [2000]. The reason for such strong adherence to
other research (i.e. the replication step) is the desire to produce truly comparable
results. Lee [2004] provides a rather clear description of the experimental settings
and, on the given corpus, reaches the best results known to the author. In this
chapter, the sources of data are described, as well as the features used by the
classifiers.
3.1 Data
3.1.1 Penn Treebank
Because of the above-mentioned replication step, the basis for the classifiers
trained for the thesis is the Penn Treebank corpus.¹ It contains a collection of
manually parsed newspaper articles from the business-focused Wall Street Journal.
There are about one million words in the corpus. The articles are divided
into 24 sections. The treebank has been especially popular in parsing research,
which settled on a de facto standard division of the data into training, held-out
and testing parts. This division was accepted by Lee [2004] and also by this
project. Sections 00 – 21 constitute the training, 22 the held-out, and 23 the
testing part.
Although the treebank data are manually tagged and parsed, Lee [2004] chose
to modify the training data by performing automatic preprocessing from the raw
text. For part of speech tagging, the tagger by Ratnaparkhi [1998] was used. Then
the data was parsed by the Collins parser [Collins, 2010]. Lee [2004] advocates
the use of the Penn Treebank data as a means for facilitating the comparison with
previous research, namely that by Knight and Chander [1994] and Minnen et al.
[2000]. However, it is not clear why the experiment differs from those articles by
not using manual parses. Since Lee [2004] outperforms the other two articles, the
same approach is adopted here in order to ensure that potential differences in the
results are not caused by the use of better training data.²
1 The Penn Treebank, version 3. 1999. Distributed by Linguistic Data Consortium. URL:
https://catalog.ldc.upenn.edu/ldc99t42
2 In fact, the Collins parser was trained on the Penn Treebank corpus, thus when it is used
to parse its own training data, the parses are likely to be more accurate than if different data
was used.
From the training data, all the base noun phrases are extracted. A base noun
phrase can be defined as a noun phrase that does not dominate any other
non-possessive noun phrase [Vadas and Curran, 2007]. Examples of base noun phrases
are labeled by the ‘NPB’ tag in the following phrases:
While base noun phrases are marked as such by the Collins parser in the training
data, they have to be specifically extracted from the test data.³ There is no
non-base noun phrase that would contain an article as its direct child in the test
data.
Apart from identifying noun phrases, the heads of phrases had to be extracted
because, as will be described later, the head is an important piece of information
for many features. Again, the Collins parser identifies the heads of phrases
automatically. For the test data, the heads were extracted using the head finding
rules given by Collins [2010].
Much syntactic ambiguity stems from the fact that the hierarchy of a sentence
is manifested only as a linear sequence of words.4 In the context of determiner
prediction, this can also play a role, for example in phrases containing a posses-
sive noun phrase. Considering the string “a Bigg’s hypermarket”, its syntactic
interpretation can be either of the following:
From the training set, about 263 000 noun phrases were extracted. The held-
out and test sets are formed by about 10 000 and 14 000 extracted phrases re-
spectively. The distribution of the articles over the data is shown in Table 3.1.
It also shows the baseline accuracy of a model that always predicts zero article –
0.711.
Table 3.1: Distribution of the articles (determiners) over the data sets.
3.2 Features
In this section, the features used for the classifier are described. Conceptually,
the features are divided into several groups that are then used for evaluation. The
later groups serve as an extension of the previous ones, not as a substitution for
them. In what follows, each group, its label for further reference, and a list of
corresponding features are given.
5 The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford
University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
• head form
lemma of the head of the noun phrase;
e.g. hypermarket — a Bigg’s hypermarket
• head number
grammatical number of the noun phrase head;
e.g. singular — a Bigg’s hypermarket
• parent
part of speech of the parent of the noun phrase;
e.g. PP — The average seven-day simple yield of NP(the 400 funds) was
8.12%.
set of lemmas of words that follow the head and belong to the noun phrase,
excluding articles;
e.g. None — NP(The average seven-day simple yield) of the 400 funds was
8.12%.
• words before NP
set of lemmas of two words before the noun phrase, excluding articles;
e.g. [yield, of ] — The average seven-day simple yield of NP(the 400 funds)
was 8.12%.
• words after NP
set of lemmas of two words after the noun phrase, excluding articles;
e.g. [be, <number>] — The average seven-day simple yield of NP(the 400
funds) was 8.12%.
• non-article determiner
other determiner (not including articles);
e.g. no — There’s NP(no question) about . . .
• hypernyms
the hypernyms for the first synset (sense) of the head, as extracted from
the WordNet [Miller, 1995];
e.g. questioning — There’s NP(no question) about . . .
• referent
an indicator whether the given head appeared in any of the five previous
sentences;
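The context features listed above can be illustrated with a short sketch. It assumes that the sentence is already lemmatized and that the boundaries of the noun phrase are known; the lemma list, the function names and the decision to skip articles before taking the two context words are assumptions of the example rather than details confirmed by the text.

```python
ARTICLES = {"a", "an", "the"}


def words_before_np(lemmas, np_start, n=2):
    """Set of up to n lemmas preceding the noun phrase, articles excluded."""
    context = [w for w in lemmas[:np_start] if w not in ARTICLES]
    return set(context[-n:])


def words_after_np(lemmas, np_end, n=2):
    """Set of up to n lemmas following the noun phrase, articles excluded."""
    context = [w for w in lemmas[np_end:] if w not in ARTICLES]
    return set(context[:n])


def referent(head, previous_sentences, window=5):
    """True if the head lemma occurs in any of the last `window` sentences."""
    return any(head in sentence for sentence in previous_sentences[-window:])


# Hypothetical lemmatization of the running example; the noun phrase
# "the 400 funds" occupies positions 6 to 8 (np_start=6, np_end=9).
lemmas = ["the", "average", "seven-day", "simple", "yield", "of",
          "the", "400", "fund", "be", "<number>", "."]

before = words_before_np(lemmas, np_start=6)
after = words_after_np(lemmas, np_end=9)
seen = referent("fund", [["the", "fund", "grow"], ["it", "rise"]])
```

Returning sets rather than lists mirrors the description of these features as unordered collections of lemmas.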
3.2.2 Extended original features
label: Ext
The features mentioned in this section are either a simple modification of the
previous features or they can be understood as directly inspired by them. There
is no additional source of information used for extracting the features, just the
text itself and syntactic parses.
• head proper
the information discarded in the previous feature: an indicator whether the
head of the noun phrase is a proper noun;
e.g. true — chairman of NP(the Prudential Insurance Co.)
• words before NP
• words after NP
The six features above copy the identically titled features from the original
set with the exception that the word sequences are not taken as sets of
separate words but as fixed n-grams.
e.g. (words before head feature) average seven-day simple — the average
seven-day simple yield
• referent
an extension of the same feature from the previous set, an indicator whether
the given head appeared in any of the five previous sentences or in the
current sentence up to the occurrence of the head;
this feature replaces the referent feature from the previous set, i.e. they are
not used together for any trained model;
• postmodification type
part of speech of the following sibling of the noun phrase in case the noun
phrase is a child of another noun phrase;
e.g. PP — NP($500 million) of Remic mortgage securities
• object - preposition
in case the noun phrase is identified as an object of a prepositional phrase
(its parent is a prepositional phrase), the feature corresponds to the prepo-
sition;
e.g. of — At the end of NP(the day)
Decision lists classifier
First, some head nouns are extracted from a parsed corpus, namely those that
can be identified as count or mass by applying a small set of rules (e.g. Is the
noun in plural? (count noun); Does it appear with an indefinite article? (count
noun); Is the noun modified by “much”? (mass noun); for the complete list
of rules, see [Nagata et al., 2005, p. 817]). The nouns are extracted together
with three types of contexts: all the words within the noun phrase, three words
preceding the noun phrase and three words following the noun phrase (articles
and some other function words are excluded; context words are lemmatized and
lowercased). Each word from each context then forms a rule together with the
corresponding context type, head noun and the extracted label (i.e. count/mass).
For example, for the head noun chicken in the sentence she ate a piece of fried
chicken for dinner, three decision rules would be extracted: $piece_{-3}$ → Mass;
$fry_{np}$ → Mass; $dinner_{+3}$ → Mass (the subscripts identify the context type)
[Nagata et al., 2005, p. 819]. All the rules are then sorted into a decision list by
their log-likelihood ratio defined by:
$$\log \frac{p(MC \mid w_c)}{p(\overline{MC} \mid w_c)}, \qquad (3.1)$$
where $MC$ is the variable denoting the mass or count label, $\overline{MC}$ denotes
the opposite label, and $p(MC \mid w_c)$ is the probability of the head noun occurring
with the $MC$ label with the word w in the context c. The probability is estimated
from the corpus by:
$$p(MC \mid w_c) = \frac{f(MC, w_c) + \alpha}{f(w_c) + 2\alpha}, \qquad (3.2)$$
where $f(MC, w_c)$ is the frequency of the head noun labeled as $MC$ while having
word w in the context c; α is a smoothing parameter and is set to 0.5. In addition,
each decision list is extended by one rule that assigns the head noun the most
frequent label $MC_{major}$ regardless of any context. The score of this default rule
is given by
$$\log \frac{p(MC_{major})}{p(\overline{MC_{major}})} \qquad (3.3)$$
where $p(MC_{major})$ is estimated by:
$$p(MC_{major}) = \frac{f(MC_{major}) + \alpha}{f(MC_{major}) + f(\overline{MC_{major}}) + 2\alpha} \qquad (3.4)$$
where $f(MC_{major})$ and $f(\overline{MC_{major}})$ correspond to the frequency of the
target noun appearing with the more and less frequent label respectively; the
smoothing parameter α is again set to 0.5.
An illustration of a possible decision list for the target noun chicken is given
in Table 3.2.

rule
context word    context type    label    log-likelihood ratio
piece           -3              Mass     1.49
count           -3              Count    1.49
peck            +3              Count    1.32
fish            -3              Mass     1.28
dish            -3              Mass     1.23
pig             np              Count    1.23
Table 3.2: Decision list for determining countability of the noun chicken; taken
from Nagata et al. [2005]
When a new noun phrase is encountered, the rules from the corresponding
decision list are checked from top to bottom. The first rule that can be applied
to the target noun and its context determines the final label of the noun. In
case more rules share the same score, the label predicted by more decision rules
is used. In case both labels are predicted by the same number of rules with the
same score, following rules in the list are taken into account. If the default rule
(assigning the target noun with a label regardless of context) is part of the tie,
its label is used. The default rule is the only rule that can be applied in any
context, therefore rules with lower scores are discarded and the decision must be
made when the default rule is met in the list.
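The construction and application of a decision list for a single target noun can be sketched as follows. The sketch follows equations (3.1) to (3.4) with α = 0.5, but it simplifies the procedure in two ways that are assumptions of the example: each context keeps only a rule for its more probable label, and ties between rules with the same score are resolved by list order instead of the voting scheme described above. All names are hypothetical.

```python
import math
from collections import Counter

ALPHA = 0.5  # smoothing parameter


def build_decision_list(examples):
    """Build a sorted rule list for one target noun.

    `examples` is a list of (label, contexts) pairs, where label is "count"
    or "mass" and contexts is a set of (word, context_type) pairs.
    Each rule is a tuple (score, context or None, label).
    """
    with_label = Counter()
    total = Counter()
    labels = Counter()
    for label, contexts in examples:
        labels[label] += 1
        for context in contexts:
            with_label[context, label] += 1
            total[context] += 1

    rules = []
    for context, freq in total.items():
        p_count = (with_label[context, "count"] + ALPHA) / (freq + 2 * ALPHA)
        label = "count" if p_count >= 0.5 else "mass"
        p = max(p_count, 1 - p_count)
        rules.append((math.log(p / (1 - p)), context, label))

    # Default rule: the most frequent label regardless of context.
    major = max(("count", "mass"), key=lambda lab: labels[lab])
    p_major = (labels[major] + ALPHA) / (sum(labels.values()) + 2 * ALPHA)
    rules.append((math.log(p_major / (1 - p_major)), None, major))

    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules


def classify(rules, contexts):
    """Return the label of the first rule applicable to the given contexts."""
    for _, context, label in rules:
        if context is None or context in contexts:
            return label
    return None
```

Since the default rule applies to every input, `classify` never reaches rules that score below it, which corresponds to the remark that such rules can be discarded.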
Data
As mentioned in the section on the data sources used, decision lists are trained on
the training section of the British National Corpus. Detailed information is given
in Section 3.1.2. There were about 260 000 different lemmas whose countability
could be guessed at least once based on the rules in [Nagata et al., 2005].
For each of those lemmas a decision list was created.
Motivation
A word embedding is a representation of a word that maps the word into a
multidimensional space. The mapping is trained in such a way that words occurring
within similar contexts are closer to each other in the vector space than words
occurring in different contexts. Moreover, a well-known property of this
representation is that some linguistic patterns can be expressed by simple linear
operations on these vectors, e.g. “vec(‘Madrid’) - vec(‘Spain’) + vec(‘France’) is
closer to vec(‘Paris’) than to any other word vector” [Mikolov et al., 2013b, p. 1].
Even though no one-to-one relationship between an embedding dimension
and a certain linguistic phenomenon has ever been described, we try to use the
individual dimensions of the embeddings as features in the classifier, hoping the
classifier could benefit at least from some of the features.
Word2vec
Word vectors can be obtained in several ways. For this thesis, we select
the architecture known as word2vec, proposed by Mikolov et al. [2013b]. It is
based on the Skip-gram model introduced earlier by Mikolov et al. [2013a], made
more efficient so that it enables learning the word vectors on large data. The
authors train a neural network with one hidden layer to predict the probabilities
of different words co-occurring with the input word within a context window.
The architecture of the neural network is shown in Figure 3.1.
Each node in the input layer corresponds to one word in the vocabulary.
The words are represented in one-hot encoding, i.e. by vectors whose length
is the same as the size of the vocabulary. In each vector, there is exactly one
value (corresponding to the word) set to 1, all other values are 0. The hidden
layer (sometimes also called a projection layer) consists of 300 nodes and when
trained, represents the actual word embeddings. Each node in the output layer
is associated with a word in the vocabulary and outputs the probability that the
given word appears in the context of the input word.
Formally, the objective of the Skip-gram model is to maximize:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (3.5)$$
where T is the size of the training data and c determines the size of the context
window. The probability of two words occurring in the same context is defined as:
[Figure 3.1: architecture of the neural network with an input layer
($x_1, \ldots, x_V$), a hidden layer ($a^2_1, \ldots, a^2_{300}$) and an output
layer ($a^3_1, \ldots, a^3_V$).]
where $v_w$ is the vector representation of word w (i.e. the output of the hidden
layer), $v'_w$ is the vector of weights associated with the node w in the output
layer and V is the size of the vocabulary [Mikolov et al., 2013b, pp. 2-3].
The parameters of the neural network can be represented by two matrices:
$\Theta^1 \in \mathbb{R}^{m \times V}$ for the weights needed for the transition from
the input (1) to the hidden (2) layer and $\Theta^2 \in \mathbb{R}^{V \times m}$ for
the transition from the hidden (2) to the output (3) layer. There is no activation
function between layers 1 and 2. This means the activation of the hidden layer is
given simply by $a^2 = \Theta^1 x$, where x is the one-hot encoded input vector.
Since only a single value in x is equal to 1 and all other values are zero, the
activation of layer 2 for word $w_i$ corresponds to $\Theta^1_{*,i}$, i.e. to the
i-th column of the parameter matrix. Therefore, by training the neural
network so that it outputs similar probabilities for words occurring in similar
contexts (3.6), it learns to represent such words with similar vectors.
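The observation that the hidden layer simply selects one column of the first parameter matrix can be checked on a toy network. The sketch below uses a vocabulary of four words and three hidden units with arbitrary weights; the values, the helper names and the use of a plain softmax in the output layer are assumptions made for the illustration.

```python
import math


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given position."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def matvec(matrix, vector):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalize scores into probabilities (shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters: m = 3 hidden units, V = 4 words (arbitrary values).
theta1 = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]          # m x V
theta2 = [[0.2, 0.0, 0.1],
          [0.0, 0.3, 0.1],
          [0.4, 0.1, 0.0],
          [0.1, 0.1, 0.1]]               # V x m

i = 2                                     # index of the input word
a2 = matvec(theta1, one_hot(i, 4))        # hidden layer, no activation
column_i = [row[i] for row in theta1]     # i-th column of theta1
a3 = softmax(matvec(theta2, a2))          # one probability per vocabulary word
```

In practice the multiplication by a one-hot vector is never carried out; the embedding is obtained by a direct lookup of the corresponding column.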
There are other ways to train word embeddings. Levy and Goldberg [2014]
show the effect of choosing different types of contexts on the nature of similarity
among the word vectors. As a response to the ‘linear bag of words contexts’ used
by word2vec, they use ‘dependency-based contexts’ that take into account not
only nearby words but also their dependency relationship to the input word. The
authors conclude that the embeddings based on the latter type of context “are
less topical and exhibit more functional similarity than the original Skip-gram
embeddings” [Levy and Goldberg, 2014, p. 302]. In word2vec-type embeddings,
the words most similar to florida include fla, alabama, gainesville and
tallahassee, while vectors trained on dependency-based contexts favor words like
texas, louisiana, georgia and california [Levy and Goldberg, 2014, p. 305].
However, implementing and experimenting with architectures for extracting
word vectors was beyond the scope of the thesis. Instead, we choose to use the
pre-trained 300-dimensional word vectors provided by Google.⁶ The data contain
3 000 000 embeddings and were trained on about 100 billion words.
Language model
Since estimating such a probability directly from data would be impossible due
to data sparsity, the above is further simplified by using the nth order Markov
property:
$$P(w_1, w_2, w_3, \ldots, w_t) \approx \prod_{i=1}^{t} P(w_i \mid w_{i-(n-1)}, w_{i-(n-2)}, \ldots, w_{i-1}). \qquad (3.8)$$
6 https://code.google.com/archive/p/word2vec/
The sequence $w_{t-(n-1)}, w_{t-(n-2)}, \ldots, w_{t-1}, w_t$ is called an n-gram.
Using the maximum likelihood estimate given a training dataset, the probabilities
are estimated by
where #(x) stands for the number of times the sequence x was observed in the
data. Even with the simplifying assumption (3.8), the problem of data sparsity
remains. When predicting the probability on some unseen data, some of the
observed n-grams are likely to have no corresponding counts in the training data,
making (3.9) unusable. To mitigate the problem, a number of smoothing approaches
have been designed. Generally, the approaches reserve some of the empirical
probability mass for unseen examples. One such approach, known as Kneser-Ney
smoothing, was used for the language model trained as part of this thesis.
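The sparsity problem is easy to reproduce with a count-based estimate. The sketch below assumes the usual relative-frequency form of the maximum likelihood estimate, i.e. the count of the whole n-gram divided by the count of its history; the corpus is a toy sentence and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_probability(tokens, ngram):
    """Relative frequency of the last word given its history (n >= 2).

    Returns None when the history itself was never observed, in which case
    the estimate is undefined.
    """
    n = len(ngram)
    numerator = ngram_counts(tokens, n)[ngram]
    denominator = ngram_counts(tokens, n - 1)[ngram[:-1]]
    return numerator / denominator if denominator else None


corpus = "the cat sat on the mat".split()

seen = mle_probability(corpus, ("the", "cat"))      # observed bigram
unseen = mle_probability(corpus, ("the", "dog"))    # unseen bigram: zero
undefined = mle_probability(corpus, ("dog", "sat"))  # unseen history
```

A single unseen n-gram assigns zero probability to the whole sentence under (3.8), which is why some probability mass has to be reserved for unseen events.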
Chen and Goodman [1999] motivate this approach by the simple San Francisco
example. If the bigram San Francisco occurs frequently in the training text, the
unigram probability of the word Francisco will be relatively high. When other
smoothing methods use this unigram probability as a back-off for higher order
n-grams, they produce a large probability estimate for the following bigram:
(-OOV-, Francisco), where -OOV- stands for any word not seen in the training
set. However, since Francisco occurs only with San, the probability of the given
bigram should be low.
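The effect described in the San Francisco example can be reproduced on a toy corpus by comparing how often a word occurs with how many distinct words precede it. The corpus below is made up for the illustration, and the function names are hypothetical.

```python
from collections import Counter

# Made-up corpus in which "francisco" and "is" are equally frequent.
tokens = ("san francisco is foggy . i like san francisco . "
          "san francisco is hilly . the fog is thick .").split()

unigrams = Counter(tokens)
bigram_types = set(zip(tokens, tokens[1:]))


def unigram_probability(word):
    """Relative frequency of the word in the corpus."""
    return unigrams[word] / len(tokens)


def continuation_probability(word):
    """Share of distinct bigram types that end in the given word."""
    endings = sum(1 for _, second in bigram_types if second == word)
    return endings / len(bigram_types)


same_frequency = unigram_probability("francisco") == unigram_probability("is")
francisco_contexts = continuation_probability("francisco")
is_contexts = continuation_probability("is")
```

Both words have the same unigram probability, but "francisco" follows only one distinct word while "is" follows two, so a back-off estimate based on the number of distinct contexts ranks "francisco" lower.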
Thus, instead of a simple unigram probability, Kneser-Ney smoothing uses
the probability of the word occurring within a novel context:
Language model training and evaluation
To use the trained language model as a feature for a classifier, the following
process was employed. For each noun phrase, three candidate sentences were
created by inserting the or a/an at the left boundary of the noun phrase, or by
leaving the position empty. The three candidates were then evaluated by perplexity
(3.13) and the one with the lowest perplexity was selected as the winner. Its
category, i.e. definite/indefinite/zero, became the value of the feature for the
given noun phrase. Furthermore, the suggestion was stored and used as part of
the sentence when predicting the article for the following noun phrase.
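The candidate generation and selection can be sketched as follows. The sketch assumes a sentence with the original articles removed, known noun phrase start positions, and a caller-supplied `perplexity` function; it always inserts `a` for the indefinite category instead of choosing between a and an. The scoring function used in the example is a toy stand-in, not a language model.

```python
ARTICLE_FORMS = {"definite": ["the"], "indefinite": ["a"], "zero": []}


def lm_article_feature(tokens, np_starts, perplexity):
    """Return (feature values, sentence with the suggested articles).

    `np_starts` are indices into `tokens` (in ascending order) marking the
    left boundaries of the noun phrases.
    """
    sentence = list(tokens)
    offset = 0  # number of articles inserted so far
    features = []
    for start in np_starts:
        pos = start + offset
        candidates = {
            category: sentence[:pos] + article + sentence[pos:]
            for category, article in ARTICLE_FORMS.items()
        }
        best = min(candidates, key=lambda cat: perplexity(candidates[cat]))
        features.append(best)
        # Keep the winning suggestion for the following noun phrases.
        offset += len(candidates[best]) - len(sentence)
        sentence = candidates[best]
    return features, sentence


def toy_perplexity(tokens):
    """Toy stand-in for a language model: rewards two hand-picked bigrams."""
    preferred = {("the", "dog"), ("a", "bone")}
    return 10.0 - len(preferred & set(zip(tokens, tokens[1:])))


features, suggested = lm_article_feature(
    ["dog", "ate", "bone"], [0, 2], toy_perplexity)
```

With a real language model, `toy_perplexity` would be replaced by a function that scores the whole candidate sentence.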
tures so that each distinct value in a list forms a new binary feature (bag-of-words
approach). Thus an example noun phrase broad market averages, with a
words-before-head feature value “broad market”, is represented by features
words-before-head-broad and words-before-head-market, whose value is 1, and other
features (words-before-head-{personal/his/decline/...}), which are set to 0.
Finally, for word embeddings vectors, each dimension of the vector is taken
as a separate new feature.
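The conversion of a list-valued feature into binary features can be sketched as below. The function name and the naming scheme of the resulting columns are assumptions that follow the example given in the text.

```python
def binarize(feature_name, examples):
    """Turn a list-valued feature into binary indicator features.

    `examples` holds one list of values per noun phrase. Returns the names
    of the new features and one 0/1 row per example.
    """
    vocabulary = sorted({value for values in examples for value in values})
    columns = [f"{feature_name}-{value}" for value in vocabulary]
    rows = [[1 if value in values else 0 for value in vocabulary]
            for values in examples]
    return columns, rows


columns, rows = binarize(
    "words-before-head",
    [["broad", "market"], ["his"], []],
)
```

Most entries of the resulting matrix are zero, so the rows are typically stored in a sparse format.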
4. Evaluation
In this chapter, the experiments conducted for this thesis are described and
evaluated. The first two sections deal with classifiers based on the two machine
learning algorithms described in Chapter 2. All the classifiers have been built and
evaluated on the Penn Treebank Wall Street Journal data (described in more detail
in Section 3.1.1), using the features presented in Section 3.2. The next section
provides a comparison of the classifier-based approach with the performance of a
language model. Finally, for reference, the human performance on the task is
given and compared to the best automatic method.
4.1.2 Replication of a reported experiment
The replication of the article generation experiment reported by Lee [2004] was
attempted by following its experimental setup, namely the choice and processing
of the data, employed explanatory variables and the choice of the machine learning
algorithm. A multinomial logistic regression model was fit on the training and
tested on the test data described in Section 3.1.1. The only difference from the
referenced section was the fact that for this experiment, discarding rare values of
categorical features was not performed as there is no mention of such a step in
the original article.
The resulting accuracy as compared to the baseline model (predicting the
most frequent category for all examples) and the result reported by Lee [2004] is
given in Table 4.1. As indicated by the results, the replication experiment failed
to exactly meet the target accuracy. The difference between the two results is
about half a percentage point. It is not clear why this discrepancy occurred.¹
Table 4.1: Evaluation of the replication experiment on the test data. Baseline
stands for simple major vote, i.e. predicting no (zero) article for all test examples;
Multinomial LR stands for the multinomial logistic regression model implemented as
part of the thesis; and Reported result stands for the model by Lee [2004]. The
second column indicates the regularization parameter used for training. The accuracy
scores measure the performance on the test dataset.
Similarly to the previous experiment, the models use only the original set of
features suggested by Lee [2004] (Section 3.2.1).
Table 4.2: Evaluation of the effect of the threshold for cutting off low-frequency
feature values on the accuracy and the number of final features. The models are
compared on the held-out data.
4.1.4 Features
Next, the effect of the newly designed features on the performance of the classifier
is evaluated. At first, a model is trained on each of the new feature sets together
with the original feature set. Then the feature sets are added together iteratively
and the performance is measured with respect to the growing number of features.
Finally, the model utilizing all the available features is trained. The corresponding
results are presented in Table 4.3.
The results show that when taken independently, each feature set presented
in Section 3.2 brings some new useful information for the model. Specifically, the
biggest improvement is achieved by the ‘extended’ feature set, which outperforms
all the feature sets that utilize additional sources of data. In that feature set, the
major role is played by the modification of the six list features, i.e. representing
the context of the noun phrase head as a fixed string rather than a bag of words (or
tags). Interestingly, utilizing each dimension of a pre-trained word embedding
as a separate explanatory variable can also improve performance. Other two
model            C (λ)       feature sets             accuracy
Multinomial LR   0.4 (2.5)   Orig                     87.81%
Multinomial LR   0.6 (1.7)   Orig-Ext                 88.39%
Multinomial LR   0.4 (2.5)   Orig-Cnt                 88.05%
Multinomial LR   0.4 (2.5)   Orig-Emb                 88.27%
Multinomial LR   0.4 (2.5)   Orig-Lm                  88.08%
Multinomial LR   0.2 (5.0)   Orig-Ext-Cnt             88.33%
Multinomial LR   0.2 (5.0)   Orig-Ext-Cnt-Emb         88.80%
Multinomial LR   0.3 (3.3)   Orig-Ext-Cnt-Emb-Lm      89.08%
Table 4.3: Evaluation of models trained on different sets of features. The accuracy
values represent the performance on the held-out data.
model             C (λ)                                                    accuracy
Multinomial LR    0.3 (3.3)                                                89.08%
One-vs-one        a|0: 1.25 (0.8); a|the: 0.4 (2.5); the|0: 0.7 (1.43)     89.33%
One-vs-rest oob   0.6 (1.7)                                                89.45%
One-vs-rest man   a: 0.6 (1.7); the: 0.5 (2); 0: 0.7 (1.43)                89.49%
[Figure: accuracy (vertical axis, 0.870 to 0.895) of the One-vs-rest man,
One-vs-rest oob, One-vs-one and Multinomial LR models.]
The results show that the multinomial logistic regression model was systematically
outperformed by the other approaches. The ‘out-of-box’ and ‘manual’
approaches towards the one-vs-rest classification are very close to each other
regardless of the data size. Despite their closeness, the ‘out-of-box’ model did not
achieve a higher score than its alternative in any experiment. Finally, the
one-vs-one approach starts being more effective (with respect to multinomial logistic
regression) only when trained on larger datasets. This is in accord with the
disadvantage of this approach mentioned in Section 2.1.3, namely the fact that the
individual binary models cannot use all the provided data.
• Objective Function
As explained in Section 2.2, XGBoost implementation is able to handle
different types of objective functions. For this experiment, the predefined
softmax function (2.15) was chosen.
• Number of trees
The actual number of weak learners in the resulting ensemble. The learning
algorithm assumes a fixed number of estimators, so it must be given by the
researcher.
• Gamma
The gamma parameter of the regularization term as shown in (2.36). It
represents another way of preventing the algorithm from learning too com-
plicated base learners. Each split in the tree must result in cost reduction
that is higher than the given value.
• Column subsampling
Column subsampling is a method often used for random forests. The pa-
rameter represents a proportion of features used for building a tree. For
each tree a sample of the corresponding size is randomly drawn from the
feature pool. The parameter value of 1 represents no subsampling. Ran-
domly limiting the number of features a tree can use results in a collection
of trees that are less correlated. Less correlated trees can better reduce the
variance of the model, i.e. limit the amount of over-fitting.
• Row subsampling
Similarly to the previous parameter, when the parameter value is set to
less than 1, a random subsample of the training examples is drawn from
the training data. Again, this is done for each new tree in the ensemble.
The motivation is the same as in the case of column subsampling.
• Shrinkage
Shrinkage can be seen as an analogy to learning rate. It takes a value from 0
to 1 which is then used to scale down the importance of each new tree. This
“leaves space for future trees to improve the model.” [Chen and Guestrin,
2016, p. 3]
Tuning process
As mentioned above, the objective function was set to softmax to perform multi-
nomial classification. For selecting the values for the following parameters, 5-fold
cross-validation on the training data was used to get the performance of the model
Further, in accord with [Hastie et al., 2009, p. 365], the shrinkage parameter
was set to a low value (0.1) and the number of trees was estimated. In this
step, the performance of the model is measured after a tree is added to the
ensemble. If there is no improvement of the performance in the course of the last
20 trees, the algorithm finishes and the number of trees corresponding to the best
performance is selected. During this step, the other parameters were set to their
default values: maximum tree depth: 5, minimum weight of a child: 1, gamma: 0,
column subsampling: 0.8, row subsampling: 0.8.
Then optimal values for maximum tree depth and minimum weight of a child
were estimated by performing a grid-search over the values (5, 7, 9, 11, 13) for
the former and (1, 3, 5) for the latter parameter.
In the next step, the gamma parameter was selected out of eight possible
values: (0, 0.1, 0.2, . . . , 0.7).
Finally, column subsampling and row subsampling were selected using
grid-search over the values (0.7, 0.8, 0.9, 1) for both parameters.
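The staged procedure described above, in which each grid-search fixes its winners before the next one starts, might be sketched like this. The parameter keys (`colsample`, `subsample` and so on) are illustrative labels, and `toy_score` is a stand-in for the cross-validated accuracy: its peak is arbitrary and does not reflect the values actually selected in the thesis.

```python
from itertools import product


def grid_search(evaluate, fixed, grid):
    """Try every combination in `grid` on top of `fixed`; return the best settings."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, values))}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params


def tune_in_stages(evaluate, defaults, stages):
    """Run the grids one after another, carrying the winners forward."""
    params = dict(defaults)
    for grid in stages:
        params = grid_search(evaluate, params, grid)
    return params


DEFAULTS = {"max_depth": 5, "min_child_weight": 1, "gamma": 0.0,
            "colsample": 0.8, "subsample": 0.8}
STAGES = [
    {"max_depth": [5, 7, 9, 11, 13], "min_child_weight": [1, 3, 5]},
    {"gamma": [round(0.1 * i, 1) for i in range(8)]},
    {"colsample": [0.7, 0.8, 0.9, 1.0], "subsample": [0.7, 0.8, 0.9, 1.0]},
]


def toy_score(params):
    """Stand-in for 5-fold cross-validated accuracy; NOT a real measurement."""
    return -(abs(params["max_depth"] - 9) + abs(params["min_child_weight"] - 3)
             + abs(params["gamma"] - 0.3) + abs(params["colsample"] - 0.9)
             + abs(params["subsample"] - 0.8))


print(tune_in_stages(toy_score, DEFAULTS, STAGES))
```

Tuning in stages evaluates 15 + 8 + 16 = 39 configurations instead of the 15 × 8 × 16 = 1920 of a full joint grid, at the price of ignoring interactions between parameters tuned in different stages.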
• gamma: 0.6
• Shrinkage: 0.1
With the given set of values the model achieves 91.86% accuracy on the held-
out data set and 91.84% accuracy on the test data. Table 4.5 summarizes the
most important models learned so far.
Model             Feature sets           Accuracy
Baseline          -                      71.1%
Reported result   Orig                   87.7%
Multinomial LR    Orig                   88.36%
One-vs-rest oob   Orig-Ext-Cnt-Emb-Lm    89.08%
Boosted trees     Orig-Ext-Cnt-Emb-Lm    91.84%
Table 4.5: Evaluation of the best logistic regression and gradient boosting models.
For reference, the baseline accuracy is given followed by the result reported by
Lee [2004] and the model attempting to replicate the result. The accuracy values
represent the performance on the test data.
[Plot omitted; vertical axis: accuracy, 0.90 to 0.98]
Figure 4.2: Training and held-out data accuracy for logistic regression and
boosted trees models
4.3 Language model
The main approach presented in the thesis is linguistically motivated in the sense
that it first predicts the linguistic structure of each sentence (by part-of-speech
tagging and parsing) and only then assigns an article to specific elements of that
structure. Although this seems analogous to what speakers of English do,
it also raises some problems. Namely, correct parses are
hard to guarantee in practice.
In this section, an alternative approach is evaluated: each position between
two words is considered as a potential place for an article. For each such position,
three options corresponding to three article forms (zero, the, a/an) are evaluated
by means of a language model. The winning alternative is kept and the next
position is evaluated until the end of the sentence is reached. The process is
illustrated in Table 4.6.
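The left-to-right procedure can be sketched as below. Several details are assumptions rather than the thesis implementation: the language model is abstracted as a `cost` function over a whole candidate sentence (lower is better), the position before the first word is also treated as a candidate, and `toy_cost` is a hand-written stand-in for a real model.

```python
ARTICLE_OPTIONS = ("", "the", "a/an")  # "" stands for the zero article


def insert_articles(words, cost):
    """Greedy left-to-right article generation.

    `words` is a sentence with its articles removed; `cost(tokens)` is any
    function that scores a token sequence, lower meaning more fluent.
    """
    output = []
    for position, word in enumerate(words):
        rest = words[position:]
        candidates = []
        for article in ARTICLE_OPTIONS:
            prefix = output + ([article] if article else [])
            candidates.append((cost(prefix + rest), article))
        # Keep the cheapest alternative; ties go to the earlier option.
        best_article = min(candidates, key=lambda pair: pair[0])[1]
        if best_article:
            output.append(best_article)
        output.append(word)
    return output


def toy_cost(tokens):
    """Stand-in for a language model cost; hand-written and NOT a real model."""
    preferred = {("the", "market"), ("a/an", "producer")}
    cost = 0.0
    for previous, current in zip(tokens, tokens[1:]):
        if (previous, current) in preferred:
            cost -= 1.0   # reward pairs the toy "model" likes
        elif previous in ("the", "a/an"):
            cost += 1.0   # penalise any other article
    return cost


print(" ".join(insert_articles(["producer", "uses", "market", "power"], toy_cost)))
```

Since each decision is kept before the next position is examined, the search is greedy: an early choice is never revised in the light of later words.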
the original and predicted texts are tokenized and compared. The ‘a/an’ token
is considered to be equal to both ‘a’ and ‘an’. The results on the test portion of
the Penn Treebank are given in Table 4.7. For reference, the logistic regression
and gradient boosting models are evaluated in the same manner. Both models
used automatic parses for the prediction.
Table 4.7: Evaluation of the language model (LM) with respect to the logistic
regression (Logreg) and gradient boosting models (Boosting). The models are
evaluated on the Penn Treebank test data consisting of about 57 000 words. The
main metric is the number of errors per 100 words. #correct represents the number
of correctly predicted articles (definite or indefinite, zero is not included). The
other columns represent the distribution of the error types.
The logistic regression and gradient boosting models, on the other hand, seem
to be too conservative, as they attempt a prediction quite rarely. This is demon-
strated by their main source of error, i.e. deletion. Interestingly, contrary to the
evaluation of the classification task, here the gradient boosting model performs
worse than logistic regression.
In order to compensate for the over-generating of the language model, two
parameters were introduced: perplexity threshold and decision margin. Both
parameters were supposed to restrict the language model to only such predictions
where it is ‘confident enough’. Perplexity threshold sets the limit in absolute
terms by setting a value above which a decision is not considered. Decision
margin attempts to control the confidence relatively, i.e. it sets the minimum
relative difference in the cost between the first and second best candidates for the
given position. However, when both parameters were tuned by grid-search on the
held-out dataset, the solution gravitated towards the baseline performance of not
predicting anything, as shown by the results of the LM-tun model in Table 4.7.
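One possible reading of the two parameters is sketched below. The exact definitions are assumptions: the costs are taken to be positive (as perplexities are), the relative difference is computed with respect to the second-best cost, and the fallback is the zero article, i.e. no prediction. The function name `confident_choice` is hypothetical.

```python
def confident_choice(costs, threshold, margin, fallback=""):
    """Pick the cheapest option only if the model looks confident enough.

    `costs` maps each article option to its positive language-model cost
    (lower is better). The fallback is returned when the best cost exceeds
    `threshold` or when the relative gap to the runner-up is below `margin`.
    """
    ranked = sorted(costs.items(), key=lambda item: item[1])
    (best, best_cost), (_, second_cost) = ranked[0], ranked[1]
    if best_cost > threshold:
        return fallback
    relative_gap = (second_cost - best_cost) / second_cost
    if relative_gap < margin:
        return fallback
    return best


# Invented costs for one position.
toy_costs = {"": 120.0, "the": 100.0, "a/an": 150.0}
print(repr(confident_choice(toy_costs, threshold=200.0, margin=0.1)))
print(repr(confident_choice(toy_costs, threshold=200.0, margin=0.2)))
```

Raising the margin or lowering the threshold makes the fallback more frequent, which is consistent with the observation that tuning drove the model towards predicting nothing.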
Table 4.8: Human performance on the article generation task. Measured on text
with 163 candidate noun phrases.
Annotators A, D In a monopoly situation, a producer may be able to use
his market power at the expense of the consumer, . . .
Here, the noun producer can be seen as having definite associative reference
(i.e. as being defined based on the association with the concept of the market:
{consumer, producer, demand, supply, . . . }, which has already been mentioned
at this point), or it can be seen as having singulative indefinite reference (i.e. any
producer/some producers). A different type of ambiguity is illustrated by another
example:
Original High prices were maintained from 1974 onwards in the face
of inelastic demand for oil from oil-importing countries,
but an oil glut in the 1980s put pressure on individual
countries to . . .
Annotators B, D . . . , but the oil glut in the 1980s put pressure on individ-
ual countries to . . .
By choosing the first option, one interprets the oil glut as one in a sequence of
such periods of oversupply, which happened to occur in the 1980s. The other
option suggests the glut of the 1980s is a particular period known to the reader.
The latter interpretation leaves no place for other such gluts during the 1980s.
Apparently, predicting (or correcting) an article in certain cases requires a very deep
understanding of the intended meaning, which cannot be recovered from the text
alone.
article   penn-test   manual-test
zero      71.1%       59%
the       19.7%       29%
a/an      9.3%        12%
Table 4.9: Distribution of the articles (determiners) over the Penn Treebank and
BNC manual test data
Table 4.11: Confusion matrix for the most successful manual annotation. The
rows correspond to the articles found in the original text, the columns correspond
to the predictions.
Out of each row of the matrix, the probability distribution over the predictions
given the true article was estimated by maximum likelihood estimation:
\[
p(x \mid y) = \frac{\#\{x \,\&\, y\}}{\#y}
\]
(x and y denote the predicted and true article, respectively).
The three probability distributions are then used to estimate the distribution
of errors that would be made on the Penn test set. Specifically, let $c_{x|y}$ denote the
number of times x is predicted for any noun phrase with a real article y on that
dataset. Then, the distribution of $c_{x|y}$ is estimated as $\hat{c}_{x|y} = c_y \, p(x \mid y)$, where $c_y$
is defined as the number of noun phrases whose true class is y. By computing
$\hat{c}_{x|y}$ for all the nine combinations of x and y, one creates an estimated confusion
matrix on the new dataset.
From this matrix, the final estimate of the accuracy can be expressed as
\[
\frac{\sum_x \hat{c}_{x|x}}{\sum_x \sum_y \hat{c}_{x|y}}.
\]
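The estimation can be sketched in a few lines. The confusion matrix cells below are invented for illustration; only the row totals (95, 48 and 20 observations) and the Penn test class proportions from Table 4.9 come from the text, so the printed number is not a result of the thesis. The function name `estimate_accuracy` is hypothetical.

```python
ARTICLES = ("zero", "the", "a/an")

# Rows: true article; columns: predicted article (same order as ARTICLES).
# The cell values are invented; only the row totals (95, 48, 20) follow
# the observation counts quoted in the text.
HYPOTHETICAL_CONFUSION = {
    "zero": (88, 5, 2),
    "the": (6, 40, 2),
    "a/an": (3, 2, 15),
}

# Class proportions of the Penn test data from Table 4.9, in percent.
PENN_CLASS_SHARE = {"zero": 71.1, "the": 19.7, "a/an": 9.3}


def estimate_accuracy(confusion, class_counts):
    """Project a confusion matrix onto a dataset with other class counts."""
    estimated = {}
    for true_article, row in confusion.items():
        row_total = sum(row)
        for predicted, count in zip(ARTICLES, row):
            p = count / row_total  # maximum likelihood estimate of p(x | y)
            estimated[(predicted, true_article)] = class_counts[true_article] * p
    correct = sum(estimated[(article, article)] for article in ARTICLES)
    return correct / sum(estimated.values())


print(round(estimate_accuracy(HYPOTHETICAL_CONFUSION, PENN_CLASS_SHARE), 4))
```

Because each estimated row sums to its class count, the denominator equals the total number of noun phrases, and the estimate reduces to the class-weighted average of the per-article hit rates p(y|y). Proportions can therefore be used in place of absolute counts.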
Table 4.12 gives the estimated accuracies for the best and the average human
performance together with the empirical accuracies of the best gradient boosting
and logistic regression models. The results show that when the models are tested
on the data they were prepared for, their comparison to human performance is
much more favorable. Both models perform slightly worse than the best human
annotator and above the average annotator performance.
Model                           Accuracy
Best annotator (estimated)      91.96%
Boosted trees                   91.84%
Logistic regression             89.08%
Average annotator (estimated)   85.98%
Of course, the above conclusion is valid only if the estimate of the manual
performance on the dataset is reliable. There are two issues with this assumption.
Firstly, the estimates of the three probability distributions $p_y(x) = p(x \mid y)$ are
based on limited data. Namely, $p_{\text{the}}$ is based on 48, $p_{\text{a/an}}$ on 20 and $p_{\text{zero}}$ on 95
observations. Secondly, even if $p_y$ is taken for granted, the accuracy estimation
assumes that for a given article, the distribution of its prediction errors is the
same for both the manual and automatic dataset. In other words, it assumes
there is the same level of uncertainty connected with an article no matter which
dataset is used.
Conclusion
In accord with previous research [Minnen et al., 2000], [Lee, 2004], [Turner and
Charniak, 2007], [Sun et al., 2015], the thesis interprets the problem of article
error correction as an article generation task. Firstly, by looking into the previ-
ous literature on the subject, a group of papers was identified, which provided
comparable results in terms of problem formulation, approach and the data used.
Out of this group, the article reporting the best results was selected.
As the first step, the experiment from the selected article was replicated (Section 4.1.2). A logistic
regression model was trained on the training section of the Penn Treebank and
evaluated on its test section. For each noun phrase, the model predicts one of the
three possible categories corresponding to the article choice: the for the definite
article, a/an for the indefinite articles and zero for noun phrases with no article.
The features used are the ones described in the paper. This leads to an accuracy
of 88.36% as compared to the original 87.7%. The source of this difference was
not identified.
This baseline experiment was further improved in several ways. Firstly, new
feature sets were designed that would hopefully improve the representation of the
examples. The first set of features used only the information already available in
the training data and achieved improvement from 87.81% to 88.39% as measured
on the held-out dataset. Then features using additional data sources were added:
prediction of countability category, prediction of a language model and word
embeddings representation of the head of the noun phrase. When used together,
the accuracy of the model further improved to 89.08% (Section 4.1.4).
Next, the effect of different approaches to extending binary logistic regression
to multi-class problems was evaluated. Apart from the original multinomial lo-
gistic regression using the softmax function, one-vs-one and one-vs-rest methods
of combining binary classifiers were evaluated. For the given problem, the
one-vs-rest approach provided the best result, raising the held-out accuracy to 89.49%
(Section 4.1.5).
The final attempt to improve the performance was made by replacing logistic
regression with gradient boosted trees. After tuning the model and running it on
the same data as the models discussed above, the accuracy of the classification
improved to 91.86% for the held-out dataset. When measured on the test data,
the performance is 91.84% (Section 4.2). This represents an increase in accuracy
of 4.14 percentage points compared to the best reported result on the task known to the
author, 87.7% [Lee, 2004]. This corresponds to about a 34% reduction in error.
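The relative error reduction follows directly from the two test accuracies quoted above. Spelled out, with error rates taken as 100% minus accuracy:

```latex
\[
\frac{(100 - 87.7) - (100 - 91.84)}{100 - 87.7}
  = \frac{12.3 - 8.16}{12.3}
  = \frac{4.14}{12.3}
  \approx 0.337
\]
```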
An alternative approach based solely on the predictions of a large language
model was also investigated. However, it was not found useful when it was eval-
uated on the Penn Treebank test data (Section 4.3).
As a reference, the performance of human annotators on the same task is given
(Section 4.4). Because the data were taken from another source, the comparison
is not straightforward. It turned out that evaluating the trained model on the
new dataset did not produce good results as the model was too conservative when
predicting article changes. On the other hand, when the automatic and manual methods
are compared on the dataset the models were prepared for, the performance
of the best model matches the estimated performance of the best annotator.
Although this work presents an improvement on the best result for the given
task by Lee [2004], there remains much to be done in terms of practical application
of the suggested approach. The model was trained on newspaper articles focused
on the world of finance, which suggests the generalization of such a model might
be limited. This suggestion is also supported by one of the experiments.
In recent years, recurrent neural networks, and long short-term memory neural
networks in particular, have received much attention in the NLP community [Mikolov
et al., 2013b], [Józefowicz et al., 2016]. While gradient boosted trees seem to work
well with the given problem, it would be interesting to see if the neural network
based models can improve the performance by utilizing less constrained context
of each noun phrase.
Bibliography
Leo Breiman, Jerome Friedman, Charles J. Stone, and R.A. Olshen. Classification
and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability
series. Taylor & Francis, 1984. ISBN 9780412048418.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan, editors.
Proceedings of the Sixth Workshop on Statistical Machine Translation.
Association for Computational Linguistics, Edinburgh, Scotland, July 2011.
ISBN 9781937284121.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and
Phillipp Koehn. One billion word benchmark for measuring progress in statis-
tical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.
org/abs/1312.3005.
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system.
CoRR, abs/1603.02754, 2016. URL http://arxiv.org/abs/1603.02754.
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. Building a large annotated
corpus of learner English: The NUS corpus of learner English. In Proceedings
of the Eighth Workshop on Innovative Use of NLP for Building Educational
Applications, pages 22–31, Atlanta, Georgia, June 2013. Association for Com-
putational Linguistics.
Jane E. Gressang. A frequency and error analysis of the use of determiners, the
relationships between noun phrases, and the structure of discourse in English
essays by native English writers and native Chinese, Taiwanese, and Korean
learners of English as a Second language. PhD thesis, The University of Iowa,
2010.
Na-Rae Han, Martin Chodorow, and Claudia Leacock. Detecting errors in English
article usage by non-native speakers. Natural Language Engineering, 12(02):
115–129, 2006.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, Second Edition. Springer
Series in Statistics. Springer New York, 2009. ISBN 9780387848587.
Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi,
Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela.
Practical lessons from predicting clicks on ads at Facebook. In Proceedings
of the Eighth International Workshop on Data Mining for Online Advertising,
ADKDD’14, pages 5:1–5:9, New York, NY, USA, 2014. ACM.
Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu.
Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL
http://arxiv.org/abs/1602.02410.
Daniel Jurafsky and James H. Martin. Speech and Language Processing (Sec-
ond Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009. ISBN
9780131873216.
AAAI ’94, pages 779–784, Menlo Park, CA, USA, 1994. American Association
for Artificial Intelligence.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. CoRR, abs/1301.3781, 2013a. URL
http://arxiv.org/abs/1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Dis-
tributed representations of words and phrases and their compositionality. In
Advances in neural information processing systems, pages 3111–3119, 2013b.
Guido Minnen, Francis Bond, and Ann Copestake. Memory-based learning for
article generation. In Proceedings of the 2nd workshop on Learning language
in logic and the 4th conference on Computational natural language learning -
Volume 7, CoNLL ’00, pages 43–48, Stroudsburg, PA, USA, 2000. Association
for Computational Linguistics.
Ryo Nagata, Takahiro Wakana, Fumito Masui, Atsuo Kawai, and Naoki Isu.
Detecting article errors based on the mass count distinction. In Proceedings
of the Second international joint conference on Natural Language Processing,
IJCNLP’05, pages 815–826, Berlin, Heidelberg, 2005. Springer-Verlag.
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel
Tetreault. The CoNLL-2013 shared task on grammatical error correction. In
Proceedings of the Seventeenth Conference on Computational Natural Language
Learning: Shared Task, Sofia, Bulgaria, 2013. Association for Computational
Linguistics.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy
Susanto, and Christopher Bryant. The CoNLL-2014 shared task on grammat-
ical error correction. In Proceedings of the Eighteenth Conference on Compu-
tational Natural Language Learning: Shared Task, Baltimore, Maryland, 2014.
Association for Computational Linguistics.
Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. A Com-
prehensive grammar of the English language. Longman, London, 1985. ISBN
9780582517349.
Alla Rozovskaya and Dan Roth. Annotating ESL errors: Challenges and rewards.
In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building
Educational Applications, 2010.
Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, and Dan Roth. The Univer-
sity of Illinois system in the CoNLL-2013 shared task. In Proceedings of the
Seventeenth Conference on Computational Natural Language Learning: Shared
Task, Sofia, Bulgaria, 2013. Association for Computational Linguistics.
Chengjie Sun, Xiaoqiang Jin, Lei Lin, Yuming Zhao, and Xiaolong Wang. Convolutional
neural networks for correcting English article errors. In Natural Language
Processing and Chinese Computing: 4th CCF Conference, pages 102–110,
Nanchang, China, 2015. Springer International Publishing.
Jennie Turner and Eugene Charniak. Language modeling for determiner selection.
In Human Language Technologies 2007: The Conference of the North American
Chapter of the Association for Computational Linguistics; Companion Volume,
Short Papers, NAACL-Short ’07, pages 177–180, Stroudsburg, PA, USA, 2007.
Association for Computational Linguistics.
David Vadas and James R. Curran. Parsing internal noun phrase structure with
Collins’ models. In Proceedings of the Australasian Language Technology Work-
shop 2007, pages 109–116, Melbourne, Australia, December 2007.
List of Figures
2.1 Comparison of decision boundaries of the one-vs-one, one-vs-rest
and multinomial approaches for logistic regression
2.2 Regression tree
2.3 Partitioning of the feature space by a regression tree
List of Tables
3.1 Distribution of the articles (determiners) over the data sets
3.2 An example of a decision list for predicting countability
Attachments
4.5 Manual annotation texts
4.5.1 Original text
Sell-through, as the retail sector of the video market is obscurely known, is at the
moment very much a bull market, and there has lately been a growing shift to
releasing more recent movies in this way. This has been capped by the well publi-
cised current appearance of no less than Rain Man (Warner) for sale (£14.99) as
well as for rent, when hitherto the two spheres have tacitly been regarded as mu-
tually exclusive. But for all the changes, the prime emphasis in the retail market,
so far as movies are concerned, still seems to be on the past, and predominantly
on the mainstream Hollywood past.
Current refinements, no doubt with Christmas present buying in mind, include
boxed sets of three or more related cassettes from Warners — for example, all
three of James Dean’s movies for £29.95, or for the same price, Marilyn Monroe
in The Prince and the Showgirl, The Misfits and (a bit of a scoop, since it has not
been on video before) Some Like It Hot. This kind of collection, though usually
available individually, is increasingly and intelligently proving to be a marketing
ploy. For random instance, CBS/Fox has recently released (£9.99 each) not only
Cronenberg’s 1986 The Fly and its far from negligible 1958 forerunner, but also
the 1959 quick-buck sequel to the latter, Return Of The Fly, a collector’s item,
though not necessarily in qualitative terms.
And to move from B to Z movies, connoisseurs of the bizarre can now lay in
their own copy of Edward Woods’ Plan 9 From Outer Space (Palace, £14.99),
once voted the most incompetent film of all time. The rental sector meanwhile
provides — along with all the box-office successes which nowadays transfer to tape
within a few months and probably need no further introduction — the chance
to catch up on a variety of (often more deserving) movies which have been less
widely seen in cinemas here.
Competitive markets are essential if the market system is to operate effectively.
In a monopoly situation, the producer may be able to use his market power at
the expense of the consumer, although the likelihood of this happening will be
moderated if close substitutes for the product exist or if the monopolist fears
entry of new, but similar, products into the market, attracted by high prices
and the profits being made. It is relatively rare for a firm to have an absolute
monopoly of the market, but there are many instances where a small number of
large firms dominate a market — this is called an oligopoly. Such firms may act
overtly (and illegally) or covertly (still illegally, but hard to detect) as if they
were in a monopoly situation.
Agreements or understandings could cover price-fixing and/or sharing out the
available work without resorting to competition (which would have resulted in
lower prices). In order for the agreement to stick, no single firm must break ranks,
encouraged by the prospect of greater market share by lowering its price, or else a
free-for-all may develop producing a competitive price, as if the market had been
operating smoothly. The operation of such a price-fixing agreement can be seen
in the way that the price of oil has been fixed by the Organisation of Petroleum
Exporting Countries (OPEC) cartel.
High prices were maintained from 1974 onwards in the face of inelastic demand
for oil from oil-importing countries, but an oil glut in the 1980s put pressure on
individual countries to reduce their prices in order to gain market share. Despite
attempts to prevent this by OPEC, the price of oil plummeted in 1986. Even if
there are no formal agreements to interfere with the market, implicit understand-
ings may be reached, in which case the behaviour of the firms involved might, in
practice, be the same as if a formal agreement had existed.
4.5.2 Annotator A
Sell-through, as a retail sector of the video market is obscurely known, is at
the moment very much a bull market, and there has lately been a growing shift
to releasing more recent movies in this way. This has been capped by the well
publicised current appearance of no less than Rain Man (Warner) for sale (£14.99)
as well as for rent, when hitherto the two spheres have tacitly been regarded as
mutually exclusive. But for all changes, a prime emphasis in the retail market,
so far as movies are concerned, still seems to be on the past, and predominantly
on the mainstream Hollywood past.
Current refinements, no doubt with Christmas present buying in mind, in-
clude boxed sets of three or more related cassettes from Warners — for example,
all three of James Dean’s movies for £29.95, or for same price, Marilyn Monroe
in Prince and Showgirl, Misfits and (a bit of a scoop, since it has not been on
video before) Some Like It Hot. This kind of collection, though usually available
individually, is increasingly and intelligently proving to be a marketing ploy. For
random instance, CBS/Fox has recently released (£9.99 each) not only Cronen-
berg’s 1986 Fly and its far from negligible 1958 forerunner, but also the 1959
quick-buck sequel to the latter, Return Of Fly, collector’s item, though not nec-
essarily in qualitative terms.
And to move from B to Z movies, connoisseurs of the bizarre can now lay in
their own copy of Edward Woods’ Plan 9 From Outer Space (Palace, £14.99), once
voted the most incompetent film of all time. Rental sector meanwhile provides
— along with all box-office successes which nowadays transfer to tape within a
few months and probably need no further introduction — the chance to catch up
on a variety of (often more deserving) movies which have been less widely seen
in cinemas here.
Competitive markets are essential if the market system is to operate effectively.
In a monopoly situation, a producer may be able to use his market power at
the expense of the consumer, although the likelihood of this happening will be
moderated if close substitutes for the product exist or if the monopolist fears the
entry of a new, but similar, products into market, attracted by high prices and
profits being made. It is relatively rare for firm to have an absolute monopoly
of market, but there are many instances where a small number of large firms
dominate the market — this is called an oligopoly. Such firms may act overtly
(and illegally) or covertly (still illegally, but hard to detect) as if they were in a
monopoly situation.
Agreements or understandings could cover price-fixing and/or sharing out
available work without resorting to competition (which would have resulted in
lower prices). In order for agreement to stick, no single firm must break ranks,
encouraged by the prospect of a greater market share by lowering its price, or
else a free-for-all may develop producing competitive price, as if the market had
been operating smoothly. The Operation of such a price-fixing agreement can
be seen in way that price of oil has been fixed by the Organisation of Petroleum
Exporting Countries (OPEC) cartel.
High prices were maintained from 1974 onwards in the face of inelastic demand
for oil from oil-importing countries, but oil glut in 1980s put pressure on individual
countries to reduce their prices in order to gain a market share. Despite attempts
to prevent this by OPEC, the price of oil plummeted in 1986. Even if there are
no formal agreements to interfere with the market, implicit understandings may
be reached, in which case the behaviour of the firms involved might, in practice,
be the same as if a formal agreement had existed.
4.5.3 Annotator B
Sell-through, as a retail sector of the video market is obscurely known, is at
the moment very much bull market, and there has lately been a growing shift to
releasing more recent movies in this way. This has been capped by well publicised
current appearance of no less than Rain Man (the Warner) for sale (£14.99) as
well as for rent, when hitherto the two spheres have tacitly been regarded as
mutually exclusive. But for all changes, the prime emphasis in the retail market,
so far as movies are concerned, still seems to be on the past, and predominantly
on the mainstream Hollywood past.
Current refinements, no doubt with Christmas present buying in mind, include
boxed sets of three or more related cassettes from the Warners — for example,
all three of the James Dean’s movies for £29.95, or for the same price, Marilyn
Monroe in the Prince and Showgirl, the Misfits and (a bit of a scoop, since it
has not been on video before) the Some Like It Hot. This kind of collection,
though usually available individually, is increasingly and intelligently proving to
be a marketing ploy. For random instance, CBS/Fox has recently released (£9.99
each) not only the Cronenberg’s 1986 Fly and its far from negligible the 1958
forerunner, but also the 1959 quick-buck sequel to the latter, the Return Of Fly,
a collector’s item, though not necessarily in qualitative terms.
And to move from B to Z movies, connoisseurs of bizarre can now lay in their
own copy of the Edward Woods’ Plan 9 From Outer Space (Palace, £14.99),
once voted the most incompetent film of all time. The rental sector meanwhile
provides — along with all box-office successes which nowadays transfer to tape
within few months and probably need no further introduction — a chance to
catch up on variety of (often more deserving) movies which have been less widely
seen in cinemas here.
Competitive markets are essential if the market system is to operate effectively.
In monopoly situation, the producer may be able to use his market power at
expense of the consumer, although likelihood of this happening will be moderated
if close substitutes for product exist or if the monopolist fears entry of new, but
similar, products into market, attracted by high prices and profits being made. It
is relatively rare for a firm to have the absolute monopoly of the market, but there
are many instances where a small number of large firms dominate the market —
this is called oligopoly. Such firms may act overtly (and illegally) or covertly (still
illegally, but hard to detect) as if they were in the monopoly situation.
Agreements or understandings could cover price-fixing and/or sharing out
available work without resorting to a competition (which would have resulted in
lower prices). In order for an agreement to stick, no single firm must break ranks,
encouraged by the prospect of a greater market share by lowering its price, or else
free-for-all may develop producing a competitive price, as if the market had been
operating smoothly. The operation of such a price-fixing agreement can be seen
in way that price of oil has been fixed by Organisation of Petroleum Exporting
Countries (OPEC) cartel.
High prices were maintained from 1974 onwards in face of the inelastic de-
mand for oil from oil-importing countries, but the oil glut in 1980s put pressure
on individual countries to reduce their prices in order to gain a market share.
Despite attempts to prevent this by OPEC, the price of oil plummeted in 1986.
Even if there are no formal agreements to interfere with the market, implicit
understandings may be reached, in which a case behaviour of the firms involved
might, in practice, be the same as if a formal agreement had existed.
4.5.4 Annotator C
Sell-through, as the retail sector of the video market is obscurely known, is at
the moment very much a bull market, and there has lately been a growing shift
to releasing more recent movies in this way. This has been capped by the well
publicised current appearance of no less than the Rain Man (Warner) for sale
(£14.99) as well as for rent, when hitherto the two spheres have tacitly been
regarded as mutually exclusive. But for all the changes, prime emphasis in the
retail market, so far as movies are concerned, still seems to be on the past, and
predominantly on mainstream Hollywood past.
Current refinements, no doubt with Christmas present buying in mind, include
the boxed sets of three or more related cassettes from Warners — for example, all
three of James Dean’s movies for £29.95, or for the same price, Marilyn Monroe
in Prince and the Showgirl , Misfits and (a bit of a scoop , since it has not
been on video before) Some Like It Hot. This kind of collection, though usually
available individually, is increasingly and intelligently proving to be a marketing
ploy. For random instance, CBS/Fox has recently released (£9.99 each) not only
Cronenberg’s 1986 Fly and its far from negligible 1958 forerunner, but also the
1959 quick-buck sequel to the latter, Return Of Fly, a collector’s item, though
not necessarily in qualitative terms.
And to move from B to Z movies, connoisseurs of the bizarre can now lay in
their own copy of Edward Woods’ Plan 9 From Outer Space (Palace, £14.99),
once voted the most incompetent film of all time. The rental sector meanwhile
provides — along with all box-office successes which nowadays transfer to tape
within a few months and probably need no further introduction — chance to catch
up on a variety of (often more deserving) movies which have been less widely seen
in cinemas here.
Competitive markets are essential if the market system is to operate effectively.
In a monopoly situation, the producer may be able to use his market power at
the expense of the consumer, although the likelihood of this happening will be
moderated if close substitutes for the product exist or if the monopolist fears
entry of new, but similar, products into the market, attracted by high prices and
profits being made. It is relatively rare for a firm to have absolute monopoly of
the market, but there are many instances where a small number of large firms
dominate the market — this is called oligopoly. Such firms may act overtly
(and illegally) or covertly (still illegally, but hard to detect) as if they were in a
monopoly situation.
Agreements or understandings could cover price-fixing and/or sharing out
available work without resorting to competition (which would have resulted in
lower prices). In order for an agreement to stick, no single firm must break ranks,
encouraged by the prospect of a greater market share by lowering its price, or
else a free-for-all may develop producing a competitive price, as if the market had
been operating smoothly. Operation of such a price-fixing agreement can be seen
in the way that the price of oil has been fixed by the Organisation of Petroleum
Exporting Countries (OPEC) cartel.
High prices were maintained from 1974 onwards in the face of an inelastic
demand for oil from oil-importing countries, but an oil glut in the 1980s put
pressure on individual countries to reduce their prices in order to gain a market
share. Despite attempts to prevent this by the OPEC, the price of oil plummeted
in 1986. Even if there are no formal agreements to interfere with the market,
implicit understandings may be reached, in which case the behaviour of firms
involved might, in practice, be the same as if a formal agreement had existed.
4.5.5 Annotator D
Sell-through, as the retail sector of the video market is obscurely known, is at
the moment very much a bull market, and there has lately been a growing shift
to releasing more recent movies in this way. This has been capped by the well
publicised current appearance of no less than Rain Man (Warner) for sale (£14.99)
as well as for rent, when hitherto two spheres have tacitly been regarded as
mutually exclusive. But for all the changes, a prime emphasis in the retail market,
so far as movies are concerned, still seems to be on the past, and predominantly
on the mainstream Hollywood past.
The current refinements, no doubt with Christmas present buying in mind,
include boxed sets of three or more related cassettes from Warners — for example,
all three of James Dean’s movies for £29.95, or for the same price, Marilyn Monroe
in Prince and Showgirl, the Misfits and (a bit of a scoop, since it has not
been on video before) Some Like It Hot. This kind of collection, though usually
available individually, is increasingly and intelligently proving to be a marketing
ploy. For random instance, CBS/Fox has recently released (£9.99 each) not only
Cronenberg’s 1986 the Fly and its far from negligible 1958 forerunner, but also
the 1959 quick-buck sequel to the latter, Return Of Fly, a collector’s item, though
not necessarily in qualitative terms.
And to move from B to Z movies, connoisseurs of the bizarre can now lay in
their own copy of Edward Woods’ Plan 9 From Outer Space (Palace, £14.99),
once voted the most incompetent film of all time. The rental sector meanwhile
provides — along with all box-office successes which nowadays transfer to tape
within a few months and probably need no further introduction — a chance to
catch up on a variety of (often more deserving) movies which have been less
widely seen in cinemas here.
Competitive markets are essential if a market system is to operate effectively.
In a monopoly situation, a producer may be able to use his market power at
the expense of the consumer, although the likelihood of this happening will be
moderated if close substitutes for the product exist or if the monopolist fears
entry of new, but similar, products into the market, attracted by high prices
and the profits being made. It is relatively rare for a firm to have an absolute
monopoly of a market, but there are many instances where a small number of
large firms dominate a market — this is called an oligopoly. Such firms may act
overtly (and illegally) or covertly (still illegally, but hard to detect) as if they
were in a monopoly situation.
Agreements or understandings could cover price-fixing and/or sharing out
available work without resorting to competition (which would have resulted in
lower prices). In order for an agreement to stick, no single firm must break the
ranks, encouraged by the prospect of greater market share by lowering its price,
or else a free-for-all may develop producing a competitive price, as if the market
had been operating smoothly. The operation of such a price-fixing agreement can
be seen in the way that the price of oil has been fixed by the Organisation of
Petroleum Exporting Countries (OPEC) cartel.
High prices were maintained from 1974 onwards in the face of inelastic demand
for oil from oil-importing countries, but the oil glut in the 1980s put pressure
on individual countries to reduce their prices in order to gain a market share.
Despite attempts to prevent this by OPEC, the price of oil plummeted in 1986.
Even if there are no formal agreements to interfere with the market, implicit
understandings may be reached, in which case the behaviour of firms involved
might, in practice, be the same as if a formal agreement had existed.