Corpus of The Canon of Western Literature Green 2017
Language and Literature
DOI: 10.1177/0963947017718996
Article
stylistics
Clarence Green
Nanyang Technological University, Singapore
Abstract
This paper introduces the Corpus of the Canon of Western Literature (Version 1.0), accompanied
by a demonstration of its potential uses. The canon of western literature has been an important
construct in the study of literature, long standing and long contested. It has been argued to
represent many of the greatest works produced in the history of western literature. This corpus
operationalizes the western canon based on Harold Bloom’s The Western Canon: The Books and
School of the Ages (1994). The paper describes the development of the corpus, its organization and
source material. Corpus procedures are applied to the corpus, such as word frequency analysis,
lemmatization and keyness, to demonstrate its potential uses in culturomics and corpus stylistics,
two interdisciplinary fields between the traditional and digital humanities, and the linguistic and
literary approaches to literature. Culturomics is the study of culture and social psychology via the
investigation of corpora of literature as cultural artefacts, while corpus stylistics is the application
of corpus linguistics to traditional literary scholarship. The corpus introduced in this paper is
open source and freely available.
Keywords
Corpus linguistics, culturomics, stylistics, western canon
Corresponding author:
Clarence Green, NIE3-03-118, National Institute of Education, Nanyang Technological University, 1
Nanyang Walk, 637616, Singapore.
Email: clarence.green@nie.edu.sg
1 Introduction
A relatively recent paper in Science, introducing the Google Books corpus with approximately
4% of books ever published, termed a new field of study: culturomics (Michel
et al., 2011). As originally framed, culturomics was the use of the Google Books corpus
to investigate the culture and social psychology of different times and places, with the
corpus considered as a collection of cultural artefacts. While culturomics is a new term,
widely cited (Acerbi et al., 2013; Greenfield, 2013; Pechenick et al., 2015), using cor-
pora for cultural studies is something corpus linguists have been doing for some time
(e.g. Baker, 2003). Parallel to the rise of culturomics has been the related field of corpus
stylistics. Corpus stylistics is the study of literary style via computational tools applied
to machine readable literary works. It combines the science of linguistics with literary
studies and, like culturomics, is one of the growing interdisciplinary fields between the
traditional and digital humanities.
This paper introduces the Corpus of the Canon of Western Literature (Version 1.0),
with a demonstration of its potential in culturomics and corpus stylistics. The canon of
western literature has been an important construct in the study of literature, long standing
and long contested (Beach et al., 2016; Guillory, 2013). Speaking broadly, traditional-
minded literature scholars have held the works of the canon to be the greatest literature
in the history of the West (Adler and Weismann, 2000; Bloom, 1994). By ‘greatest’, they
tend to mean that such literature exhibits qualities such as aesthetic beauty, profound
ideas, themes, notable characters and language, and impressive artistic skill. Canonical
works are also those that have influenced other literature; for example, by exhibiting
intertextuality and impacting culture, e.g. Aristotle’s Politics or Christendom. The
Corpus of the Canon of Western Literature (henceforth CCWL) is an attempt to opera-
tionalize the construct of the western canon as defined by Bloom (1994). The paper first
describes the development and organization of the CCWL. Next, to demonstrate its
applications to culturomics and stylistics, some standard corpus procedures are reported,
such as lemmatization, keyness, standardized type–token ratios (a measure of vocabu-
lary range), as well as word and sentence length estimates across genres, authors and
texts.
from the digital repository Project Gutenberg consisting of 3403 public domain literary
texts. They found that the frequency and dispersion of words associated with lexical
domains such as anger, fear, joy and surprise help predict publication periods of texts as
these words tap into the changing cultural milieus of different historical periods (see also
Hughes et al., 2012).
Corpus stylistics is concerned with how the literary style of an author, text or genre is
reflected in language, yet like culturomics it is also interested in broader issues of how
literature reflects culture, how ideas and themes pattern in texts, and how literature creates
psychological effects in readers and characters (McIntyre, 2015). Even though corpus
linguistics is advancing toward increasingly complex quantitative research
designs, the basic toolkit of the field has provided much insight into literature. Stubbs
(2005: 14), for example, shows that the application of what he calls ‘very simple frequency
stuff’ such as word lists and collocations captures important themes and style
markers in Conrad’s Heart of Darkness. Amongst the most frequent words are seem, like
and looked, as well as something, somebody, sometimes, somewhere, somehow, which
Stubbs (2005) argues reflect the vagueness and sense of the inscrutable that has long
been noted as a stylistic marker of Conrad’s novella (Leavis, 2011 [1948]).
Mahlberg and McIntyre (2011: 216) view corpus stylistics as ‘an approach that can
link in with the concerns in literary stylistics and criticism’, rather than as a field of study
that competes with traditional literary studies (see also McIntyre, 2015). They demon-
strate this in a corpus stylistic study of Fleming’s Casino Royale where, similar to Stubbs
(2005), frequency information functions as evidence for arguments about theme, style
and characterization. Besides raw frequency, they employ corpus linguistics procedures
such as lemmatization and keyword analysis, which identify lexis associated with core
themes (e.g. cards, casinos, spies), characters (Bond, Le Chiffre, Vesper) and the male
viewpoint (e.g. the subjective pronoun he). Mahlberg and McIntyre (2011: 221) report
that a key semantic domain in Fleming’s work is physicality, since there is high fre-
quency of lemmas associated with the body. Further, the representation of the body is
constructed differently according to gender. A collocational analysis of the n-gram his
body (i.e. Bond’s) compared to the central female character Vesper reveals Bond’s col-
locates emphasize his ability to separate his physical self from his mental and emotional
self, while Vesper’s body is presented either sexually, collocating with words such as
morals, bed, sheet, sensual, conquest, or from Bond’s point of view as unemotional,
cold, arrogant, remote.
Not only do the above studies indicate the wide range of research applications for
literary corpora once they are built, but also how the basic toolkit of corpus linguistics
can produce insights into literature, culture and social psychology (Greenfield, 2013).
The following sections describe a newly built literary corpus and, by way of introduc-
tion, apply some of the above procedures in the context of culturomics and corpus
stylistics.
the canon of western literature (Bloom, 1994). The canon of western literature has been
an influential idea in literary studies. It has been argued to consist of the core literary tradi-
tion of the west. Canonical literature has been defined as texts with great aesthetic beauty
and important influence in shaping other literature, as well as western thought and culture
in general. Leavis (2011 [1948]) argued it represents a ‘Great Tradition’ in which previous
great works shape the style and form of the literature that follows. Adler and Weismann
(2000) use a similar phrase: the ‘Great Conversation’. They conceive of the canon as an
intertextual conversation between authors across centuries, where ideas, styles, charac-
ters, philosophies and science are discussed, refined, rejected and renewed. The canon has
an overall coherence, they believe, as literature that does not participate in this ‘Great
Conversation’, either explicitly or implicitly via literary criticism, falls outside canonical
literature. Bloom (1994), author of the influential The Western Canon: The Books and
School of the Ages, presents a similar definition, though he largely excludes scientific
treatises as he argues that aesthetic beauty is a key inclusion criterion. Bloom (1994) is
one of the staunchest current defenders of the western canon, and also offers one of the
most cited taxonomies of canonical authors and texts.
The challenges and critiques of the canon are well known, part of the general culture
wars of recent academia (Gorak, 2013), and include that the canon overwhelmingly rep-
resents white male authors, characters and viewpoints, suppresses the voices of women,
the cultures of minorities, the spiritual beliefs of those not consistent with an era’s reign-
ing (and often brutally enforced) theology, etc. The canonicity of any text is debatable,
and literature related to the Greco-Roman tradition is overrepresented, which partly
reflects 19th century models of liberal arts education (Towheed and Owens, 2011). Further,
there is a debate over who gets to choose the works in the canon, as scholars who have
proposed lists of canonical literature tend to be much like the authors they include, i.e.
white, male, English speakers of European heritage. The current paper’s introduction of a
corpus of the canon of western literature is not meant as a defence of the construct itself.
Rather, the corpus is presented as an object of study for the empirical investigation of what
has been held up to be literature of great importance to western culture (cf. Google Books).
nested within the theocratic age are the Ancient Greeks and the Romans, while nested in
the democratic age are works from Great Britain and the United States.
The majority of texts in the canon are from the British Isles or the United States and
originally written in English. Indeed, one might suggest that Bloom’s (1994) western
canon is more specifically a western canon of the English-speaking peoples. Hundreds of
literary works not originally in English, from Homer to Proust, are listed by Bloom
(1994), and these have been included in the CCWL in translation. While Bloom (1994)
might hold that the works should be read in the original languages (though this is not
clear), others, such as Adler and Weismann (2000), argue that translations still represent
the ‘Great Conversation’, and so it was decided they have a place in the corpus. Of
course, the style of the translator and era of translation influence these texts, but the
CCWL has been designed for researchers to ignore translated texts if desired.
The development of the CCWL proceeded as follows. Every text listed in Bloom’s
(1994) Appendix A was searched for in Project Gutenberg (www.gutenberg.org/), a digi-
tal repository of public domain literature. Project Gutenberg texts are not copyrighted
and are available freely for research. Each text contains a licence statement, and scholars
who use this corpus should read the licence, as countries vary on copyright. The CCWL
is freely available under the standard licensing of Project Gutenberg upon request from
the author or via the download link in the notes section of this paper.1 The corpus was
tagged and cleaned to minimize non-target text. Licence statements were put behind the
XML tags <License>; footnotes, endnotes, indexes, introductions, appendices and con-
tents pages were tagged <notes>. Texts were also tagged for the genres <fiction>, <non-
fiction>, <play>, <poetry>, <prose>, <scripture>, <mixed genres>. When possible,
regex scripts were written to remove noise such as line break characters, page numbers,
etc. Plays presented a particular challenge as Gutenberg editions standardly have a period
immediately after a line-initial speaking character’s name. This skews estimates of mean
sentence length, and such repetition affects type–token ratios (TTRs). To minimize this,
all plays (and works such as Plato’s Dialogues) had the speakers’ names put behind
<character> tags. All files were UTF-8 encoded, which provides a standard and compact
formatting for all characters in text files.
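By way of illustration, a user wishing to exclude the tagged non-target text before counting words might proceed roughly as follows. This is a minimal sketch only: the paper names the tags but does not show their closing form, so the paired </tag> syntax, the function name strip_non_target and the sample string are all assumptions made for the example.

```python
import re

# Tag names are taken from the paper; the paired closing form (</tag>)
# is an assumption made for this illustration.
NON_TARGET_TAGS = ("License", "notes", "character")


def strip_non_target(text: str) -> str:
    """Remove tagged non-target spans, then collapse leftover whitespace."""
    for tag in NON_TARGET_TAGS:
        # Non-greedy match so that each tagged span is removed separately.
        text = re.sub(rf"<{tag}>.*?</{tag}>", " ", text, flags=re.DOTALL)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical fragment of a play file.
sample = "<character>HAMLET.</character> To be, or not to be. <notes>Footnote 1</notes>"
cleaned = strip_non_target(sample)
print(cleaned)
```

Removing the speaker names in this way keeps the period after each name from being counted as a sentence boundary.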
Text files were kept intact as much as possible; that is, sometimes a single volume in
Project Gutenberg contained multiple target texts from an author listed in Bloom (1994).
However, when a target text was only available in a collected volume, non-target texts
within that file were removed. Files in the corpus were named according to Bloom’s
Appendix (i.e. author/title), rather than given codes. This was done in an interdisciplinary
spirit, in the hopes that intuitive file names may make the corpus more accessible to
non-corpus linguists such as literary scholars. When there were multiple versions of the
same text available, it was decided to use the edition that had been most downloaded
from Project Gutenberg. This is arbitrary, but it is possible the most downloaded version
is more central to the canon than less read editions. Bloom (1994) operates similarly,
including only the King James version of the Bible. A supplementary part-of-speech
tagged version of the corpus was also developed, with tagging by TagAnt (Anthony,
2015). Checks of random samples suggested that tag accuracy varies, with performance
best on prose written after 1800. For example, within Chaucer’s Canterbury Tales the
tagger handled some archaic style with 100% accuracy, e.g. Thus _RB can _MD Fortune
_NP her _PP wheel _NN govern _VV, while it was inaccurate with other sequences, e.g.
He _PP which _WDT that _DT misconceiveth _NN oft _RB misdeemeth _VVZ. An examination
of the 100 most frequent NP tags in periods A3, A4 and A5 (see Table 1) indicated
an error rate of around 6%. Given time and resource constraints in this phase of the project,
machine tagging has not been checked by hand by independent raters nor errors
corrected.

Table 1. Organization and word counts of the CCWL by literary age and society/culture.

A. The theocratic age (2000 BCE to 1321 CE)     Word count
A1. Ancient Near East                            1,183,465
A2. Ancient India                                  618,326
A3. Ancient Greeks                               1,810,721
A4. Hellenistic Greeks                             951,025
A5. The Romans                                     805,486
A6. The Middle Ages                              1,307,171
Total:                                           6,676,230

B. The aristocratic age (1321 to 1832)          Word count
B1. Italy                                        2,062,782
B2. Portugal                                        74,835
B3. Spain                                          715,556
B4. England and Scotland                        14,416,044
B5. France                                       2,336,258
B6. Germany                                        585,929
Total:                                          20,191,304

C. The democratic age (1832 to 1900)            Word count
C1. Italy                                          279,505
C3. France                                       3,054,359
C4. Scandinavia                                    169,748
C5. Great Britain                               19,287,528
C6. Germany                                      1,124,197
C7. Russia                                       3,963,272
C8. United States                                7,734,357
Total:                                          35,612,966

D. The chaotic age (20th century)               Word count
D1. Italy                                           56,079
D4. Portugal                                         6,953
D5. France                                         331,477
D6. Great Britain and Ireland                    5,937,856
D7. Germany                                        470,454
D8. Russia                                         346,211
D9. Scandinavia                                    534,970
D15. Yiddish                                        96,361
D23. Australia and New Zealand                     212,723
D24. The United States                           1,889,639
Total:                                           9,882,723

Corpus of the canon of western literature word count: 72,363,224
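The token and tag format quoted in the Chaucer examples (a word followed by an underscore-prefixed tag) can be read back into pairs with a short script, which is one way a user might draw their own samples for accuracy checks. The sketch below assumes the layout is exactly as printed in this paper; the files in the tagged release may be laid out differently, and the names PAIR and parse_tagged are invented for the example.

```python
import re

# Matches "token _TAG" pairs as printed in the examples in this paper.
# The layout of the released tagged files is assumed, not confirmed.
PAIR = re.compile(r"(\S+)\s+_([A-Z]+)")


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Return (token, tag) pairs found in a tagged line."""
    return PAIR.findall(line)


tagged = "Thus _RB can _MD Fortune _NP her _PP wheel _NN govern _VV"
pairs = parse_tagged(tagged)

# Collect the tokens tagged NP (proper noun), the tag sampled in the paper.
proper_nouns = [token for token, tag in pairs if tag == "NP"]
print(pairs)
print(proper_nouns)
```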
The final corpus contains 805 individual files (many containing multiple works) in a
flat structure and, excluding non-target text, approximately 73 million words, which
compares favourably to large corpora such as the British National Corpus (BNC) at 100
million. Table 1 shows the organization of the corpus and the sample sizes (excluding
licence statements and edition notes) for each literary age, society and culture listed in
Bloom (1994).
Table 1 indicates significant word count differences exist in the representation of
times and places, but this reflects the canon as described by Bloom (1994). Approximately
25% of the corpus is British literature from the democratic age (1832–1900 CE). The
sample sizes for other periods and cultures/societies are quite good, nonetheless, with
288 Language and Literature 26(4)
around half of the nested subcorpora around or greater than one million words. Corpora
of a million words have been effectively used since the 1960s (e.g. Brown) until the cur-
rent era (e.g. International Corpus of English). It is worth noting that Bloom (1994) is not
strictly chronological in categorization, but considers also literary movement. For exam-
ple, the romantic poets are nested in the democratic age, as they were a reaction to neo-
classicism and a style he considers of the aristocratic age. Not every text listed in Bloom
(1994) was obtainable in Project Gutenberg. Literature from the chaotic age has the least
coverage, as many of the texts are still under copyright; yet, as Table 1 shows, the age
nevertheless has sizeable representation. Gaps in consecutive numbering (e.g. D2–3)
indicate no available texts. The exact coverage of the western canon as described by
Bloom (1994) can only be approximated. This is for two reasons. One is that Bloom is at
times vague about the texts that are canonical; for example, while the specific titles of
Dickens are listed, for other authors he simply notes ‘Selected Poems’ or ‘Short Novels’.
The second issue relating to coverage is that where Bloom specifies the complete works
of an author as canonical, Project Gutenberg did not always have all their work. If we
estimate representation by authors, from the theocratic age, the CCWL represents 48 of
63 (76%) canonical authors mentioned by Bloom (1994); from the aristocratic age, 88 of
139 (63%); from the democratic age, 125 of 159 (79%); and finally from the chaotic age,
where Bloom (1994) lists a total of 506 authors, only 58 (11%) are represented.
Representation bias is thus toward literature before 1900. Bloom (1994: 548) leaves open
whether chaotic age texts are technically canon, as he suggests they must also withstand
the test of time: ‘I am not as confident about this list… Not all of the works here can
prove to be canonical’.
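The author-coverage percentages above can be reproduced from the reported counts. The short sketch below does so, and also derives an overall figure that is not reported in the paper; that figure is simple arithmetic on the four pairs of counts and should be read as an illustration rather than a result of the study.

```python
# (authors represented in the CCWL, authors listed by Bloom), as reported above.
coverage = {
    "theocratic": (48, 63),
    "aristocratic": (88, 139),
    "democratic": (125, 159),
    "chaotic": (58, 506),
}

# Percentage of listed authors represented, per age.
percent = {
    age: round(100 * have / listed) for age, (have, listed) in coverage.items()
}

# Overall coverage is derived here for illustration; the paper does not report it.
total_have = sum(have for have, _ in coverage.values())
total_listed = sum(listed for _, listed in coverage.values())
overall = round(100 * total_have / total_listed)

print(percent, overall)
```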
5 Applications to culturomics
This section applies a few standard corpus procedures to the CCWL, and illustrates how
the corpus can be used for culturomics. Simple frequency has its interest, but to home in
on the lexis of literature, lemmatization and keyness procedures often provide more
insights (McIntyre, 2015; Stubbs, 2005). Keyness highlights lexis in a corpus that stands
out statistically in terms of relative frequency and dispersion compared to a larger refer-
ence corpus. Reported in Table 2 are the 20 highest ranked keywords in the CCWL,
computed against the BNC. The BNC is a far from perfect reference corpus (no currently
available corpus would be) as it is a contemporary, mixed-genre corpus of speech and
writing. Nevertheless, it is a well-known British corpus of a size larger than the CCWL,
and the comparison for the generation of keywords, while problematic, is not meaning-
less. Consider that when a school student encounters Shakespeare, the lexis that stands
out is that which is distinct from their everyday experience of English: e.g. Shall I com-
pare thee to a summer’s day?
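For readers unfamiliar with keyness, the sketch below shows one commonly used statistic, the log-likelihood (G2) comparison of a word's frequency in a study corpus and a reference corpus. The paper does not state which statistic was used, so this choice is an assumption; the sketch also covers relative frequency only, not dispersion, and the counts in the example are hypothetical.

```python
import math


def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-term log-likelihood (G2) for one word across two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: a word seen 1200 times in one million study-corpus
# words and 150 times in one million reference-corpus words.
score = log_likelihood(1200, 1_000_000, 150, 1_000_000)
print(score)
```

A word with the same relative frequency in both corpora scores zero, and the score grows as the two relative frequencies diverge.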
Table 2 shows that pronouns stand out as keywords in the CCWL. This likely reflects
a property of literature that Stockwell and Mahlberg (2015) call the textual trace of char-
acterization, i.e. characters display pronominal chains reflecting their participation in a
narrative. Note that masculine pronouns are more key than female ones. In the top 20
keywords, five male referents occur, four being pronominal, and one superordinate: man.
There is only one female referent, the pronoun her, which is not subjective case; indeed,
nominative she is only the 28th keyword of the CCWL, compared to he, ranked 4th. The
subject of a clause is typically the agent, one who does, acts, perceives, thinks or senses
(Givón, 1993), while the predicate is the part of the clause where propositions prototypically
package those who are recipients, instruments, acted upon or thought about
(Halliday, 2003). Thus, Table 2 suggests that gender representation in canonical literature
is qualitatively and quantitatively distinct. This observation is not necessarily true
only of canonical literature, but it demonstrates nonetheless how the CCWL can be used
to provide empirical support for long-standing criticisms of the canon, such
as that it is dominated by male characters, experience and viewpoints.
Table 2. Highest ranked keywords in the corpus of the canon of western literature.
As discussed, Mahlberg and McIntyre (2011) effectively used lemmatization to high-
light lexis associated with key themes, characters and semantic domains in their study of
Casino Royale. A function word stoplist and the Someya (1998) list of 4,762 lemmas
were therefore applied to the CCWL using Wordsmith v.6 (Scott, 2016). The Someya
(1998) list, derived from modern corpora, lacks coverage of archaisms such as those in the Chaucer
example above, but this seems a relatively minor limitation. Table 3 ranks the 25 most
frequent lemmas in the CCWL.
A few interesting observations can be drawn from Table 3. The first is that canonical
literature exhibits the Pollyanna Effect (Ingram et al., 2016). The Pollyanna Effect pro-
poses that although human languages tend to have a wider range of words for negative
experience, those for positive experience are much more frequent. In the CCWL, the
most frequent lemmas reflect recurrent themes of love and life, things that are great and
good, and discussions of the heart and God. This positivity bias is more marked than in
a general corpus. For example, good occurs 1276 times per million words in the BNC
(Leech et al., 2001), compared to 1430 p/m in the CCWL; great occurs 635 p/m words
in the BNC and 1524 p/m in the CCWL; heart 152 p/m in the BNC and 755 p/m in the
CCWL; and finally love occurs 150 times p/m in the BNC but 1200 times p/m words in
canonical literature. This suggests that even though canonical literature from Homer to
Hemingway addresses death, war, heartache and tragedy, the overall cultural preoccupations
of the western canon over history have been largely positive.
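The comparison above rests on normalizing raw counts to a common base of one million words. A minimal sketch of that normalization follows, together with the CCWL-to-BNC ratios implied by the figures reported in this section; the function name normalize is illustrative, and the ratios are derived here rather than taken from the paper.

```python
def normalize(count: int, corpus_size: int, base: int = 1_000_000) -> float:
    """Convert a raw count to a frequency per `base` words."""
    return count * base / corpus_size


# Frequencies per million words as reported above: (BNC, CCWL).
per_million = {
    "good": (1276, 1430),
    "great": (635, 1524),
    "heart": (152, 755),
    "love": (150, 1200),
}

# How many times more frequent each lemma is in the CCWL than in the BNC.
ratios = {word: round(ccwl / bnc, 2) for word, (bnc, ccwl) in per_million.items()}
print(ratios)
```

On these figures the gap is small for good and much larger for love, which supports reading the positivity bias word by word rather than as a uniform effect.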
The list also shows many lemmas for body parts. Some of these lemmas are physical,
such as hand, heart, eye, and others are for bodily sensory experience such as hear,
speak, feel. The reason why body part language plays such an important role is perhaps
the cognitive poetic one noted by Stockwell and Mahlberg (2015: 132); namely, that
effective characterization for mind-modelling requires more description of the body than
non-literary language since the author needs to communicate what characters look like,
how they move, what they are doing, in order to help readers create a cognitive representation.
Table 3 reflects the (not surprising) fact that human experience is a major focus of
canonical literature, and that this experience is embodied.
Table 3. Most frequent lemmas in the corpus of the canon of western literature.
5.1 The decline in influence of the Greco-Romans and the theocratic age
Michel et al. (2011) argue that culturomics can track the rise and fall of the cultural pre-
occupations of those who produced the texts in a corpus. This section explores two cul-
tural preoccupations of canonical literature, namely religion and the Greco-Romans.
Firstly, let us consider religion as a literary theme over time. As was reported in Table 3,
God is the 18th most frequent lemma in the CCWL, indicating that religion is a canonical
theme. Yet, the focus on religion wanes over time. Lemma lists computed for each age
indicate that in the theocratic age religion is a dominant topic, with God as the 2nd most
frequent lemma, lord 3rd, and soul 35th. The top four keywords, computed against the rest
of the corpus, are God, son, lord and king respectively. Bloom’s (1994) intuitive naming
of a theocratic age of canonical literature seems apt. However, in the aristocratic age,
God is only the 18th most frequent lemma, lord 15th and soul 81st. By the democratic age,
God has slipped to 50th, lord 77th, soul 87th; and by the chaotic age, God is 65th, lord
350th and soul 107th. While the influence and themes of the theocratic age decline, the
rise of humanism appears to take its place. For example, even though man is the most
frequent lemma in the theocratic age and all others, it is ranked seven places (i.e. eighth)
below God as a keyword for the era; however, by the democratic age, God is no longer
within even the top 500 keywords. Further, in the democratic and chaotic ages, the top
20 keywords and lemmas contain the following words which theocratic age literature
does not: eye, face, stand, sit, cry, feel, walk, laugh – all related to human (bodily) experience.
The data suggest a shift of focus in canonical literature across time from the spiritual
to the representation of human experience. Arguably, the decline in religion
evidenced in canonical literature is a reflection of the decline in its historical centrality to
western culture (i.e. a culturomic trend).
Note: function words God, King, and character names in plays were excluded.
Let us consider the intertextual question of the influence of classical literature on the
western canon. A long-standing claim has been that the influence of the Greco-Romans
has been unparalleled in terms of style, themes, philosophy, characters etc. (Highet, 2015
[1953]: 19). To compute literary connections to the classics, the Greco-Roman subcor-
pora of the CCWL were queried: approximately 3,567,232 words of texts nested within
A3: The Ancient Greeks, A4: The Hellenistic Greeks, and A5: The Romans. To create a
metric for tracking classical reference in subsequent literary eras, the 50 highest ranked
keywords (computed against remaining eras) and the 100 most frequent proper nouns
were extracted (from the Part of Speech tagged version, with tag accuracy checked by
hand) and used as batch searches in Wordsmith 6 (Scott, 2016). The cutoff ranks are
arbitrary (Mahlberg and McIntyre, 2011), but the procedure produced a list of characters,
places and historical figures central to Greco-Roman literature, as reflected in the sample
in Table 4.
The keywords and proper nouns in Table 4 capture important classical characters
(Achilles), places (Rome), gods (Zeus), people (Socrates), as well as characteristics of
the Greco-Romans such as the emphasis on the city, ships and citizens, and the valour of
the army and war. Reported in Table 5, normalized per million words, are the keywords
and proper nouns from Greco-Roman literature tracked across the literary ages.
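The tracking procedure described above, searching a fixed list of terms in each age and normalizing the hits per million words, can be sketched as follows. The study itself used batch searches in Wordsmith; the tokenizer, the function name per_million_hits and the toy strings standing in for whole subcorpora are assumptions made for the example.

```python
import re


def per_million_hits(text: str, terms: set[str]) -> float:
    """Occurrences of any search term per million tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits * 1_000_000 / len(tokens)


# Four of the items the paper mentions in connection with Table 4.
terms = {"achilles", "rome", "zeus", "socrates"}

# Toy strings standing in for whole subcorpora (hypothetical data).
subcorpora = {
    "aristocratic": "Rome and Achilles were praised by the poet",
    "chaotic": "the train left the station before dawn broke",
}

trend = {age: per_million_hits(text, terms) for age, text in subcorpora.items()}
print(trend)
```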
Table 5 suggests a general decline of the literary influence of the classics, or at least
of their literary preoccupations. Greco-Roman keywords steadily decline till the modern
era, as do literary references to Greco-Roman characters, people and places. However,
note how references to proper nouns from the classical period spike in the literature of
the aristocratic age. This age, which in Bloom’s (1994) estimation spans 1321 to 1832
AD, represents the late middle ages, Renaissance and the reestablishment of democracy.
One of the defining characteristics of this period of western history was looking back to the
classical world (Pitts and Versluys, 2014).
Table 5. Frequencies (per million) of Greco-Roman lexis across time in canonical literature.
Table 6. Mean word lengths in the corpus of the canon of western literature (CCWL).
verbose sentences of Joyce (O’Halloran, 2007). Table 7 reports the authors/texts in the
CCWL with the longest and shortest average sentence lengths.
In Table 7, again one can see both styles of authors and genres reflected in sentence
length. Plays have a much shorter mean sentence length than prose, though not it seems in
the era of Shakespeare and Marlowe, where the style was not intended to represent actual
speech. This is unlike modern playwrights, who use the shortest sentences, an imitation of
spoken utterances which tend to be shorter and lack the syntactic complexity of writing
(Greenbaum and Nelson, 1995). Ibsen’s style of realism, with its truncated utterances to
produce melancholic effects, is reflected in the fact that he has multiple plays within the 10
texts with the shortest mean sentence length in the corpus. Poetry has generally longer
sentences than prose, which one suspects reflects that a unit of scansion is more often offset
from other text lines by a comma, or (semi)colon, as in Milton (Fish, 2001), rather than
sentence punctuation. Table 7 also suggests that long sentences pattern with the Greco-
Roman or Aristocratic Ages. As the previous section indicated, the two periods appear to
be intertextually and culturally related. Note that Ulysses had one of the shortest sentence
lengths in the CCWL, despite having one of the longest sentences in the history of litera-
ture. The estimate here, however, accords with previous reported estimates (Borja, 2014),
and the novel did have the second highest standard deviation in the corpus.
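To make the sentence length measure concrete, the sketch below estimates mean sentence length by splitting naively on sentence-final punctuation. This is an assumed procedure, not the one implemented in the corpus tools used for the study, and the two sample strings are invented. It also shows why a period after a speaker's name, if left untagged in a play, pulls the mean down.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, splitting naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(sentence.split()) for sentence in sentences]


# Hypothetical play fragment with and without speaker names.
raw = "HAMLET. To be, or not to be, that is the question. HORATIO. My lord!"
untagged = "To be, or not to be, that is the question. My lord!"

mean_raw = statistics.mean(sentence_lengths(raw))
mean_clean = statistics.mean(sentence_lengths(untagged))

# The standard deviation shows how widely sentence lengths vary in a text.
spread_clean = statistics.pstdev(sentence_lengths(untagged))
print(mean_raw, mean_clean, spread_clean)
```

A text can combine a low mean with a high standard deviation, which is the pattern reported above for Ulysses.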
Scholars have often used the literary output of authors to estimate their vocabulary
size, Shakespeare being one frequently studied case (Craig, 2011). A common procedure
for the estimate is the type–token ratio, which calculates how many different types of
words there are in a text (i.e. lemmas) relative to how many actual words there are in the
text (i.e. tokens). If an author’s work has a higher number of types relative to the overall number
of tokens, this indicates it contains a wider vocabulary range (Holmes, 1994). Since text
length affects the type–token ratio (Baker, 2004), i.e. texts with more words will have
more words that occur only once, Table 8 reports a standardized TTR based on averages
per 1000 words for the authors/texts in the CCWL.
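A standardized TTR of this kind can be sketched as the mean of the type–token ratios of consecutive fixed-size chunks, so that long and short texts are compared on the same basis. The sketch below is an assumption about the general technique, not a description of the tool used in the study: it counts distinct word forms rather than lemmas, drops any final partial chunk, and returns a proportion rather than a percentage.

```python
def sttr(tokens: list[str], chunk_size: int = 1000) -> float:
    """Mean type-token ratio over consecutive full chunks of `chunk_size` tokens."""
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    # TTR of a chunk = distinct word forms / tokens in the chunk.
    return sum(len(set(chunk)) / len(chunk) for chunk in chunks) / len(chunks)


# Toy example with a chunk size of 4 so the arithmetic is easy to follow:
# the first chunk has 3 types in 4 tokens, the second 4 types in 4 tokens.
tokens = "the cat saw the dog and the cat".split()
value = sttr(tokens, chunk_size=4)
print(value)
```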
While Ulysses has one of the shorter average sentence lengths in canonical literature,
Table 8 indicates the novel has the highest standardised type token ratio (STTR) of any
prose work in the corpus. The finding is consistent with previous stylistic work that has
emphasized Joyce’s lexical complexity (O’Halloran, 2007). Generally, poets seem to have
the widest vocabulary range in the canon. There are several reasons for this. One is that
poetry relies more heavily than other literature on the artistic choices made in relation to
vocabulary, so rather than frequent words that come to mind easily, poets select words that
are less common. Further, a poem is usually short, and the demands of the form sacrifice
function words. A collection of poems also might not deal with the same characters, places and
things, thus increasing STTR. Lexical range appears to be an element of the style of Ibsen,
Synge and Oscar Wilde, at least in his plays, while authors such as Pushkin have a high
STTR regardless of the form they are working in. Children’s literature and religious prose,
which had shorter words and sentences, tend to have a higher rate of lexical repetition.
The previous data have indicated that there is variation style according to genre and
author across the three metrics of word length, sentence length and vocabulary range.
However, some authors, e.g. Defoe, Joyce and Coleridge, appear multiple times across
the measures, suggesting there may be a relationship across these elements of style. A
Pearson’s product-moment correlation was therefore computed for all texts in the CCWL, finding
the following general correlations: word and sentence length (r = .39, p < .01), word length
and STTR (r = .45, p < .01), STTR and sentence length (r = .07, p < .01). In other words,
canonical literature with longer sentences has a moderate tendency to also have longer
words; higher vocabulary ranges tend to pattern with an increased use of longer words,
and there is a weak but significant relationship between larger vocabulary ranges and
longer sentences.
Table 7. Mean sentence lengths in the corpus of the canon of western literature (CCWL).
Table 8. Vocabulary range in the corpus of the canon of western literature (CCWL).
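The correlations reported above can be reproduced in outline with a short sketch of Pearson's product-moment coefficient. The function names are illustrative assumptions, and the accompanying t statistic is the standard test of r against zero with n − 2 degrees of freedom; the paper does not state which software or test was actually used.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))
```

In use, each text would contribute one value per metric, for example `pearson_r(mean_word_lengths, sttr_values)`, where both lists are hypothetical per-text measurements. Since the t statistic grows with the number of texts, a small coefficient such as r = .07 can still reach significance when many texts are compared.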
7 Conclusion
This paper has introduced the Corpus of the Canon of Western Literature (Version 1.0), a
corpus of approximately 73 million words that represents the construct of the western canon
according to Bloom (1994). Future releases of the CCWL aim to add more markup to the
files, such as date of publication, more genre categories and, when required, the translators
and original languages. Further markup will help researchers disambiguate how such variables
affect canonical literature. A few limitations of the corpus and of the analysis presented
above are worth noting in closing. One general limitation of the corpus is the issue of translation
for non-English texts. In translation, there is often a blend of the language and style of an era
with that of the source material, the King James Bible being a good example. Also, as noted,
the CCWL does not have complete representation of the western canon described in Bloom
(1994). The open source nature of this corpus, however, allows for the CCWL to be updated
(by anyone) with other editions, perhaps beyond Project Gutenberg, to improve coverage
and quality. While much time and effort has been expended to reduce noise and thus
provide other researchers with accurate numbers and a useful corpus, some noise remains. It
should also be noted that different corpus tools can produce different estimates of word
count, sentence length, etc. Future releases will further reduce transcription errors, unwanted
characters and any other non-target text that may remain. While the culturomic and
stylistic analysis above has been introductory, future research can use this corpus for much
more complex quantifications of style and culture. For example, which authors in the canon
cluster together according to intertextuality or other style metrics? Are there differences by
country of origin in literary preoccupations? Do male and female canonical authors (only
approximately 7% of whom are female) differ in their construction of themes,
characters and narrative ideas? How have what Adler and Weismann (2000) termed the
‘great ideas’ contained in the western canon spread throughout literature across time and
place? The canon of western literature has been an important and contested idea in literary
studies, and it is hoped that the corpus introduced in this paper will be of use to scholars interested in
culturomics and stylistics.
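The caution above that different corpus tools can produce different word counts can be made concrete with a small sketch. The three counting rules below are illustrative assumptions and do not correspond to the settings of any particular tool; they show only that the definition of a "word" changes the total.

```python
import re


def count_whitespace(text):
    """Count tokens separated by whitespace (punctuation stays attached)."""
    return len(text.split())


def count_alpha_runs(text):
    """Count unbroken runs of letters (hyphens and apostrophes split words)."""
    return len(re.findall(r"[A-Za-z]+", text))


def count_with_joiners(text):
    """Count runs of letters, keeping internal hyphens and apostrophes."""
    return len(re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text))


sample = "Well-known tales, e.g. Crusoe's, don't end."
counts = {
    "whitespace": count_whitespace(sample),
    "alpha_runs": count_alpha_runs(sample),
    "with_joiners": count_with_joiners(sample),
}
print(counts)
```

Any per-1000-word rate or mean sentence length inherits this variation, so comparisons with the figures reported here are safest when the same tool and settings are used throughout.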
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this
article.
Notes
1. Corpus download link: https://www.dropbox.com/s/xtv2r37ytfc9pp7/Corpus%20of%20the%20Canon%20of%20Western%20Literature%20%281.0%29.rar?dl=0. Future releases
and any changes to the permanent online repository to be announced via Corpora-list: mailman.uib.no/listinfo/corpora.
2. Gibbon’s Decline and Fall is a single work across multiple volumes in the corpus. The
reported mean is for the single work as a whole. This was also done for Parzival, Lives of the
Artists and Don Quixote. It was not done for different works across multiple volumes by the
same author.
References
Acerbi A, Lampos V, Garnett P and Bentley RA (2013) The expression of emotions in 20th century
books. PloS One 8(3): e59030.
Adler MJ and Weismann M (2000) How to Think about the Great Ideas: From the Great Books of
Western Civilization. Chicago: Open Court Publishing.
Anthony L (2015) TagAnt (Version 1.2.0) [Computer Software]. Tokyo, Japan: Waseda University.
Available at: www.laurenceanthony.net/ (accessed 7 June 2016).
Baker P (2003) No effeminates please: A corpus-based analysis of masculinity via personal adverts
in Gay News/Times 1973–2000. The Sociological Review 51(1): 243–260.
Baker P (2004) Querying keywords: Questions of difference, frequency, and sense in keywords
analysis. Journal of English Linguistics 32(4): 346–359.
Beach R, Appleman D, Fecho B and Simon R (2016) Teaching Literature to Adolescents. London:
Routledge.
Bloom H (1994) The Western Canon: The Books and School of the Ages. New York: Harcourt.
Borja M (2014) How unreadable are James Joyce’s novels? Significance 11(3). Available at:
www.statslife.org.uk/culture/1572.
Craig H (2011) Shakespeare’s vocabulary: Myth and reality. Shakespeare Quarterly 62(1): 53–74.
Fish SE (2001) How Milton Works. Cambridge, MA: Harvard University Press.
Givón T (1993) English Grammar: A Function-based Introduction. Amsterdam: Benjamins.
Gorak J (2013) The Making of the Modern Canon: Genesis and Crisis of a Literary Idea. London:
Bloomsbury.
Greenbaum S and Nelson G (1995) Clause relationships in spoken and written English. Functions
of Language 2(1): 1–21.
Greenfield PM (2013) The changing psychology of culture from 1800 through 2000. Psychological
Science 24(9): 1722–1731.
Guillory J (2013) Cultural Capital: The Problem of Literary Canon Formation. Chicago:
University of Chicago Press.
Halliday MAK (2003) On Language and Linguistics. London: A&C Black.
Highet G (2015 [1953]) The Classical Tradition. New York: Oxford University Press.
Holmes DI (1994) Authorship attribution. Computers and the Humanities 28(2): 87–106.
Hughes JM, Foti N, Krakauer D and Rockmore D (2012) Quantitative patterns of stylistic influence
in the evolution of literature. Proceedings of the National Academy of Sciences 109(20):
7682–7686.
Ingram J, Hand C and Maciejewski G (2016) Exploring the measurement of markedness and its
relationship with other linguistic variables. PloS One 11(6): e0157141.
Leavis FR (2011 [1948]) The Great Tradition: George Eliot, Henry James, Joseph Conrad.
London: Faber & Faber.
Leech G, Rayson P and Wilson A (2001) Word Frequencies in Written and Spoken English: Based
on the British National Corpus. London: Longman.
McIntyre D (2015) Towards an integrated corpus stylistics. Topics in Linguistics 16(1): 59–68.
Mahlberg M and McIntyre D (2011) A case for corpus stylistics: Ian Fleming’s Casino Royale.
English Text Construction 4(2): 204–227.
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Google Books Team, Pickett JP, Hoiberg
D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA and Aiden EL (2011) Quantitative
analysis of culture using millions of digitized books. Science 331(6014): 176–182.
O’Halloran K (2007) The subconscious in James Joyce’s ‘Eveline’: A corpus stylistic analysis that
chews on the ‘Fish hook’. Language and Literature 16(3): 227–244.
Pechenick E, Danforth C and Dodds P (2015) Characterizing the Google Books corpus: Strong
limits to inferences of socio-cultural and linguistic evolution. PloS One 10(10): e0137041.
Pitts M and Versluys MJ (2014) Globalisation and the Roman World: World History, Connectivity
and Material Culture. Cambridge: Cambridge University Press.
Samothrakis S and Fasli M (2015) Emotional sentence annotation helps predict fiction genre. PloS
One 10(11): e0141922.
Scott M (2016) Wordsmith (Version 6) [Computer Software]. Liverpool: OUP.
Someya Y (1998) E-Lemma [Data file]. Available at: www.lexically.net/downloads/e_lemma.zip.
Stockwell P and Mahlberg M (2015) Mind-modelling with corpus stylistics in David Copperfield.
Language and Literature 24(2): 129–147.
Stubbs M (2005) Conrad in the computer: Examples of quantitative stylistic methods. Language
and Literature 14(1): 5–24.
Toolan M (2009) Narrative Progression in the Short Story: A Corpus Stylistic Approach.
Amsterdam: John Benjamins.
Towheed S and Owens WR (2011) The History of Reading: International Perspectives, c. 1550–1945.
London: Palgrave Macmillan.
Wierzbicka A (1997) Understanding Cultures through their Key Words: English, Russian, Polish,
German, and Japanese. Oxford: Oxford University Press.
Wood D (2012) Character synthesis in The Adventures of Huckleberry Finn. The Explicator 70(2):
83–86.
Author biography
Clarence Green holds a PhD in linguistics. His research interests include the psychology of language,
corpus linguistics, stylistics and cognitive-functional grammar, particularly from a quantitative
perspective. His research has appeared in journals such as Cognitive Linguistics, Functions
of Language and Literary and Linguistic Computing. He currently lectures in psycholinguistics,
corpus linguistics and research methods at the National Institute of Education, Nanyang
Technological University.