Natalia Levshina1
Abstract
1. Introduction: aims
Online film subtitles are an attractive source of data for corpus linguistics.
They are freely downloadable in many different languages from numerous
online repositories. Probably the most attractive feature of film subtitles in
comparison with other multilingual parallel corpora (e.g., translations of the
Bible, proceedings of the European Parliament, the European Union law and
the documents of the United Nations) is that film subtitles are, as a rule,
stylistically much closer to informal spoken dialogues than any of these
1
Leipzig University (IPF 141199), Nikolaistraße 6–10 04109 Leipzig, Germany.
Correspondence to: Natalia Levshina, e-mail: natalevs@gmail.com
Corpora 2017 Vol. 12 (3): 311–338
DOI: 10.3366/cor.2017.0123
© Edinburgh University Press
www.euppublishing.com/cor
312 N. Levshina
2
Accessed 21 April 2015 at: http://opus.lingfil.uu.se/.
3
Accessed 21 April 2015 at: www.korpus.cz/intercorp.
Online film subtitles 313
4
The ParTy corpus (Parallel corpus for Typology). See
www.natalialevshina.com/corpus.html.
5
Accessed 8 September 2015 at: https://www.sketchengine.co.uk/new-model-corpus/.
6
Accessed 8 September 2015 at: http://corpus.byu.edu/soap/.
At the same time, most researchers seem to stress similarity between film
or TV series dialogue and naturally occurring conversations, as far as their
pragmatic and lexico-grammatical features are concerned (for an overview,
see Dose, 2014: Chapter 4.3.4).
Finally, no one can guarantee the quality of film subtitles
downloaded from online repositories. Both non-translated and translated
subtitles may contain typos and errors. It is very difficult to find reliable
information about the subtitler of a specific film and his or her linguistic
background and expertise. Although most repositories have systems of
downvoting bad subtitles, this does not guarantee a high quality of subtitles
that have not been marked as poor. However, in my personal experience
based on linguistic analysis of subtitles and native speakers’ evaluation of text
samples in several languages, the vast majority of subtitles have acceptable
quality, although one can see occasional transcription and translation
errors (see Bednarek, 2010: 70), as well as mistakes in orthography and
punctuation. As can be seen from online comments or additional information
in the files, many subtitles have been corrected several times, which implies
a high level of dedication of the members of the online subtitling community.
Do all these peculiarities make film subtitles too risky to be used for
linguistic research? To answer this question, I adopt a quantitative approach,
treating English subtitles on a par with other registers of written and
spoken English. The main research questions of this paper are as follows:
2. Data
2.1 Corpora
The subtitles investigated in this study were collected from the online
repository OpenSubtitles.org.7 The films represent various fictional genres,
according to the genre classification from the Internet Movie Database:8
drama, adventure, fantasy, comedy, mystery, crime, etc. In total, I
selected twenty-three files for films that were originally in English,
twenty files representing films that were originally in French and fourteen
files representing films originally in languages other than English or
French (fourteen different languages). According to the Internet
Movie Database, most English films in the sample were either American
or produced by international teams with American participation (e.g.,
USA/UK/Australia). French was chosen because it is easy to find sufficient
data, due to the popularity of French films. It was considered separately from
the other languages because one of the goals was to try to identify linguistic
features that would reflect the properties of one specific source language
(see Section 5). The overwhelming majority of the films, English and non-
English, were created in the last twenty years. All files were in the SubRip
(.srt) format.
The subtitles were compared with samples from well-known corpora
of written and spoken British and American English. Both British and
American data were used because subtitles do not represent only one of
these two varieties exclusively. In fact, most files in the sample contain
the American spelling variants (e.g., color), but there are a few files where
the British variants are found (e.g., colour). Each national variety was
represented by two written registers (newspaper texts and fiction) and two
spoken registers (transcripts of informal conversations and radio and TV
broadcasts, which mostly represented unscripted conversations). The choice
of registers was motivated by their availability in large comparable corpora
of British and American English. The British data were taken from the
British National Corpus (BNC). The files were selected on the basis of the
meta-information. All files included 5,000 to 15,000 tokens, which made
them comparable to the subtitle files. The American data from newspapers,
fiction and media broadcasts were taken from the corresponding components
of the Corpus of Contemporary American English (COCA). For each of
the components, a local copy of the corpus contained twenty-three very
large files that represent each year from 1990 to 2012. A sample of 8,000
words was drawn from each of the twenty-three files for each register.9
The informal conversations in American English were taken from the
Santa Barbara Corpus of Spoken American English (SBCSAE). The richly
annotated dialogue scripts were stripped of the information about pauses,
background noises, coughing, etc., with the help of a Python script. The
number of files and the number of tokens in each subcorpus are shown in
Table 1.
7
Accessed 21 April 2015 at: http://opensubtitles.org.
8
Accessed 21 April 2015 at: http://www.imdb.com/.
Subcorpus                                  Number of files/samples   Number of tokens
Subtitles of films originally in English   23                        254,914
The next step was to extract n-grams with n of 1, 2 and 3 with the help of
Python scripts. It did not make much practical sense to try larger n because
of the relatively small sizes of the sub-corpora, which would produce too
many hapax legomena. Moreover, as Gries et al. (2011) demonstrate, the
longer n-grams do not add much new information for the purposes of register
classification. N-grams were defined as sequences of n tokens within a
sentence which were not separated by punctuation marks. Punctuation marks
themselves were not considered to be elements of n-grams. The difference
between lower and upper case was disregarded. The contracted forms (e.g.,
I’ll and don’t) were regarded as combinations of two grams (I + ’ll and
do + n’t). The possessive marker ’s was treated as a separate gram. I chose
this method of tokenisation because it had been used in the majority of the
corpora selected. In the corpora where this was not the case, the data were
first normalised automatically with the help of a Python script.
9
Although the fiction sub-corpus of COCA includes film scripts, the sample drawn for this
study did not contain them.
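The tokenisation and n-gram rules described above can be illustrated with a minimal Python sketch. It is not the script used in the study: the list of clitics, the treatment of every non-word character (including hyphens) as a punctuation boundary, and the function names split_token, tokenise and count_ngrams are all assumptions made for illustration.

```python
import re
from collections import Counter

# Clitics split off as separate grams; this particular list is an assumption.
_CLITIC = re.compile(r"^(.+?)('(?:ll|re|ve|s|m|d))$")
# Any run of characters that are not word characters, whitespace or
# apostrophes is treated as punctuation and therefore as an n-gram boundary.
_PUNCT = re.compile(r"[^\w\s']+")


def split_token(token):
    """Split contractions and the possessive marker into two grams."""
    if token.endswith("n't") and len(token) > 3:
        return [token[:-3], "n't"]           # don't -> do + n't
    match = _CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]  # i'll -> i + 'll
    return [token]


def tokenise(segment):
    """Turn one punctuation-free segment into a list of grams."""
    tokens = []
    for raw in segment.split():
        raw = raw.strip("'")                 # drop quotation-mark apostrophes
        if raw:
            tokens.extend(split_token(raw))
    return tokens


def count_ngrams(text, n):
    """Count n-grams that do not cross punctuation marks; case is ignored."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    counts = Counter()
    for segment in _PUNCT.split(text):
        tokens = tokenise(segment)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = "I'll call you. Don't worry, it's John's car!"
bigrams = count_ngrams(sample, 2)
```

In this sketch the pair (you, do) is never counted for the sample sentence, because a full stop separates the two tokens, whereas (do, n't) is counted once.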
These n-grams and their frequencies in each register served as an
input for all subsequent analyses, which are described in Sections 3 to 5.
Since the results based on the 2-grams were intermediate between the ones
based on the 1-grams and 3-grams, the 2-grams will not be discussed for
reasons of space.
3.1 Methodology
5 percent, 10 percent, 25 percent – all data without hapax legomena and all
data with hapax legomena. All these clustering models were nearly identical.
All statistical analyses and graphics presented in this paper were
performed or created in R, a free software environment for statistical
computing and graphics (R Core Team, 2014). Sections 3.2 and 3.3 present
the clustering solutions based on the 1-grams and 3-grams.
of subtitles are extremely high: they are, in fact, the highest coefficients
among all types of registers. This means that there is no principled difference
between the subtitles in original English and translations, so far as the
frequencies of the 1-grams are concerned. As one could expect on the basis
of the clustering model, the strongest correlations between the subtitles and
the other registers are observed in the case of British informal conversations,
followed by the American informal conversations (see the shaded rows in
Table 2). This holds for all three types of subtitles, although the translated
subtitles tend to have slightly lower coefficients than the original subtitles.
The next highest correlations are with the TV and radio broadcasts, which
are followed by the fiction. The lowest correlation coefficients are observed
between the subtitles and newspapers. For comparison, the lowest correlation
between all registers is found between the British conversations and the
British newspapers (r = 0.586). In the American data, the lowest correlation
is between the conversations and newspapers, too, although the correlation is
higher (r = 0.686). This suggests that the differences between the traditional
registers are greater than the difference between the subtitles and the informal
conversations.
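As an illustration of how such coefficients can be obtained, the following Python sketch computes a correlation between two registers from their n-gram frequency lists. The analyses in the paper were carried out in R; the choice of Pearson's r, the use of the union of both vocabularies with zero counts for missing n-grams, and the function names are assumptions, since the surviving text does not specify these details.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def register_correlation(freq_a, freq_b):
    """Correlate two registers over the union of their n-gram types.

    freq_a and freq_b map n-grams to frequencies; an n-gram absent from
    one register is given a frequency of 0 there (an assumption).
    """
    vocab = sorted(set(freq_a) | set(freq_b))
    xs = [freq_a.get(gram, 0) for gram in vocab]
    ys = [freq_b.get(gram, 0) for gram in vocab]
    return pearson_r(xs, ys)
```

Because Pearson's r is unaffected by multiplying either variable by a positive constant, dividing the raw frequencies by the size of each sub-corpus would leave the coefficient unchanged in this sketch.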
with the written registers, except the British newspapers, which merge
last.
The subtitles cluster with spoken registers: they first merge with the
informal conversations and then with the TV and radio broadcasts. This is
supported by the correlation coefficients displayed in Table 3. Again, the
strongest correlations are between the subtitles and conversations, followed
by the broadcasts and fiction. The original and translated subtitles are
again more similar to one another than to the other registers, although the
translations are again slightly less correlated with the other registers than the
subtitles in original English.
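The order in which registers merge can be illustrated with a small agglomerative clustering sketch in Python. The linkage method and distance measure used in the study are not stated in the surviving text, so average linkage and a generic pairwise distance (for example, 1 minus a correlation coefficient) are assumptions here, as is the function name average_linkage. The distances in the toy example are invented for illustration and are not values from the paper.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: list of item labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges in order, each as (cluster_1, cluster_2, distance).
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[frozenset((a, b))]
                         for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)   # mean pairwise distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented toy distances for illustration only; not taken from the paper.
toy = {
    frozenset(("subtitles", "conversations")): 0.1,
    frozenset(("subtitles", "broadcasts")): 0.3,
    frozenset(("conversations", "broadcasts")): 0.3,
    frozenset(("subtitles", "newspapers")): 0.6,
    frozenset(("conversations", "newspapers")): 0.7,
    frozenset(("broadcasts", "newspapers")): 0.5,
}
toy_merges = average_linkage(
    ["subtitles", "conversations", "broadcasts", "newspapers"], toy)
```

With these toy distances the first merge joins the two closest items and each later merge attaches the next closest item or cluster, which is the kind of merge order that a dendrogram displays.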
n-gram              a = the raw frequency of the n-gram in the subtitles          c = the raw frequency of the n-gram in the informal spontaneous conversations
all other n-grams   b = the raw frequency of all other n-grams in the subtitles   d = the raw frequency of all other n-grams in the informal spontaneous conversations
1-gram      Frequency in subtitles   Frequency in conversations   ORdisc
howard      137                      0                            207.6
malkovich   124                      0                            188
paro        109                      0                            165.3
gatsby      95                       0                            144.2
daisy       93                       0                            141.1
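The ORdisc values reported above are consistent with an odds ratio computed after adding 0.5 to each of the four cells a, b, c and d, which keeps the ratio finite when an n-gram does not occur in one of the sub-corpora. This is an inference from the reported figures rather than a definition quoted from the text, and the function name and the default constant in the following Python sketch are assumptions.

```python
def discounted_odds_ratio(a, b, c, d, k=0.5):
    """Odds ratio of a 2 x 2 table with a constant k added to every cell.

    a: frequency of the n-gram in the first sub-corpus
    b: frequency of all other n-grams in the first sub-corpus
    c: frequency of the n-gram in the second sub-corpus
    d: frequency of all other n-grams in the second sub-corpus
    k: discounting constant (0.5 is assumed here)
    """
    if min(a, b, c, d) < 0:
        raise ValueError("frequencies must be non-negative")
    return ((a + k) * (d + k)) / ((b + k) * (c + k))


# Hypothetical sub-corpus sizes, chosen only to keep the arithmetic simple.
example = discounted_odds_ratio(a=137, b=1000, c=0, d=1000)
```

Values well above 1 indicate n-grams that are distinctive of the first sub-corpus, and values close to 0 indicate n-grams that are distinctive of the second.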
The discussion below is based on the analysis of the top 100 most distinctive
1-grams in the subtitles with DPnorm < 0.5 (i.e., the 1-grams with the
highest OR) and an equivalent set of 1-grams in the informal spontaneous
conversations in British and American English (i.e., the 1-grams with the
lowest OR). For illustration, the top fifteen n-grams in both registers are
shown in Table 6. Note that the prominent positions of numerals (10, 20, 3;
eighty and twelve) in both lists are explained by the fact that numerals are
represented in different ways in the two sub-corpora: by digits in the subtitles
and by words in the conversations. The representation of numbers by digits in
the subtitles can be explained by space limitations. The 1-gram tape indicates
that the participants in the conversation often referred to the fact of being
recorded.
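The dispersion filter mentioned above (DPnorm < 0.5) can be sketched in Python as follows. The definition of DP is not given in the surviving text, so the sketch follows a commonly cited formulation of the deviation of proportions: half the sum of absolute differences between the observed share of an n-gram in each corpus part and the expected share of that part, normalised by 1 minus the smallest expected share. Whether the study used exactly this formulation, and what counted as a corpus part, are assumptions.

```python
def dp_norm(counts, part_sizes):
    """Normalised deviation of proportions for one n-gram.

    counts:     frequency of the n-gram in each corpus part (e.g., file)
    part_sizes: size of each corpus part in tokens, in the same order
    Returns a value between 0 (perfectly even) and 1 (maximally uneven).
    """
    if len(counts) != len(part_sizes) or len(counts) < 2:
        raise ValueError("need at least two parts of matching length")
    total_count = sum(counts)
    total_size = sum(part_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("the n-gram and the corpus must be non-empty")
    expected = [size / total_size for size in part_sizes]
    observed = [count / total_count for count in counts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    return dp / (1 - min(expected))


def is_well_dispersed(counts, part_sizes, threshold=0.5):
    """True if the n-gram passes the dispersion filter."""
    return dp_norm(counts, part_sizes) < threshold
```

A name that occurs many times in a single film, such as a character name, has a score close to 1 under this measure and would be excluded by the threshold.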
Table 6: Top fifteen most distinctive 1-grams in the subtitles and conversations.
More frequent in subtitles                                    More frequent in conversations
1-gram   Freq. subtitles   Freq. conversations   ORdisc       1-gram   Freq. subtitles   Freq. conversations   ORdisc
10       88                0                     133.6        erm      0                 850                   <0.001
Considering the top 100 1-grams that are the most representative of
the subtitles, one can observe that this sub-corpus contains a relatively high
number of direct addresses, attention signals, greetings and polite formulae,
which are exemplified by such 1-grams as Mr, Sir, gentlemen, kid, guys,
hey, thanks, excuse, welcome, sorry and pleasure (as in ‘it’s my pleasure’ or
‘with pleasure’). This observation is in line with the one made by Mittmann
(2006: 577), who found that TV series dialogues contain more greetings
and polite formulae than naturally occurring conversations (see also Freddi,
2012: 392). According to Freddi (2012), film dialogues try to mimic everyday
conversations by representing the ritualised acts of daily routine. A possible
explanation of this difference might be that films and TV drama series
represent more dynamic social situations than the informal conversations
in the BNC and SBCSAE, where the interlocutors in one recording session
usually know one another well and do not come and go often.
Another finding is that the subtitles contain a relatively high number
of words that describe a mental state (e.g., happy, sorry, scared, afraid and
crazy), evaluative adjectives (e.g., beautiful, perfect, dangerous, strange and
important) and expletives (bitch and damn). This corresponds to Quaglio’s
(2008) observation based on his analysis of the TV series Friends, where he
found the language of the TV series to be more emotional and dramatic than
that of normal conversations (see also Bednarek, 2011). This higher degree
of emotionality has to do with the entertainment function of films and series:
the viewers are supposed to be involved with and feel for the characters.
Importantly, the subtitles contain a large number of verbs in the base
form (e.g., promise, trust, act, protect, kill, stop, let, help and speak). Most
commonly, these verbs are either in the imperative, or in the future tense, or
part of an infinitival verbal complement. Consider Examples 1 to 3:
(1) Listen, word to the wise, stop dressing like you’re running for
Congress.
(Bad Teacher)
(actually). These elements may be less frequent in the subtitles because the
latter in fact represent prepared speech, where fewer overlaps, hesitations
and corrections can be expected than in spontaneous dialogues (see Dose,
2014: 97–8). Closely related to discourse markers are mental verbs, such as
wonder, suppose and mean, which can perform different discursive functions:
hedging (I suppose), introducing a question (I wonder) and clarification
(I mean). These verbs are also under-represented in the subtitles. Although
one may be inclined to think that discourse markers might be omitted from
the subtitles due to the limitations of space, it has in fact been observed
that some discourse markers are also under-represented in transcribed film
dialogues, where such limitations are absent (e.g., Mittmann, 2006: 578;
and Quaglio, 2008: 200). Note that the raw frequencies of the majority of
these discourse markers in the subtitles are different from 0. The difference
between the subtitles and the conversations is thus only a matter of degree.
In addition to the higher proportion of discourse markers, the
conversations have a larger number of past or perfective verb forms (e.g., had,
meant, used, thought, got, walked, bought, stuck, went and said). The spoken
sub-corpus also contains several –ing forms (driving, saying, putting, having
and sitting), which often describe the background situation or participants
(e.g., ‘they were having a biology lesson’ and ‘they’re like vultures sitting
on a rail there’). In addition, the top 100 distinctive 1-grams include two
third-person pronouns, she and they. These features are associated with
narrative discourse (Biber, 1988). Notably, Bednarek (2011) observes that
TV series are also less narrative than normal conversations. This difference
can be explained as follows. In film or TV series dialogues, characters
usually talk to one another and about their immediate actions and intentions,
rather than about past events and third (absent) parties, who may not be
immediately accessible to film viewers (see Pavesi, 2008: 84–5). Moreover,
a story is usually shown to develop in time with the help of visual means,
rather than being presented verbally by film characters. This conclusion is
also supported by the frequent occurrence of time and place adverbials – for
instance, yesterday, then, there, early and week (as in ‘this week’ or ‘next
week’) – which are normally used to refer to times and places outside the
current situation (Biber, 1988: 110).
Finally, the subtitles contain relatively few words and constructions
that can be described as instances of vague language (Channell, 1994) in
comparison with the conversations, where such words and constructions are
more frequent. Examples are elements of non-numerical vague quantifiers,
such as (a) bit (of), (a) lot (of) and (a) couple (of), the placeholders stuff
and ones, as well as the words might and probably. Vague language is also
under-represented in TV series in comparison with natural dialogue (e.g.,
Quaglio, 2008). The speaker can use these elements as hedges, or invite
the hearer to construct the meaning together, establishing the atmosphere of
informality. Obviously, film language has fewer vague expressions because
of its communicative limitations: the viewers may not always be able
to construct the meaning because they are ‘overhearers’ who have only
This section presents the results of an analysis of the top 100 most distinctive
3-grams in both sub-corpora with normalised DP scores below 0.5. The top
fifteen 3-grams are shown in Table 7.
The analysis has yielded a few interesting peculiarities. First, the
subtitles contain many questions or their elements (e.g., what is it, are you
sure, what do you, why don’t, how did you and are you doing), which mostly
express the speaker’s reaction to the hearer’s actions and help build the
conflict situations (Freddi, 2012: 391). There are also very many expressions
that contain the verbs of necessity or desire (e.g., I want you, I need to and I’d
like) with infinitival complements (e.g., I wanted to talk to you). Both features
were also observed by Bednarek (2011) when she compared dialogues in
TV series with other registers. These expressions propel the action forward.
They also reveal the characters’ motives and feelings and thus make the film
viewers identify with and feel for the characters.
As for the conversations, one can pinpoint the following
peculiarities. First, similar to what has been observed in the case of
1-grams, speakers in spontaneous conversations use various discourse
markers abundantly; these include (dis)fluency and clarification markers
(e.g., I said well and I mean I), hedges (e.g., I think they) and other
expressions (e.g., oh it’s and tell you what). The softening and involving
functions are also evident in tag questions (isn’t it and aren’t they) and the
downtoners only and just (e.g., it’s only and it’s just).
Again, many distinctive 3-grams in the conversations contain verb
forms and adverbials that refer to past events (e.g., he didn’t, and I was and
and then I). There are also elements that introduce reported speech (e.g.,
and she said and and I said). These elements are associated with narrative
function, which was discussed in the previous sub-section. One also finds
here a few instances of vague language (a couple of, a lot of, a little bit, a bit
of, it’s like and something like that).
Table 7: Top fifteen most distinctive 3-grams in the subtitles and conversations.
desires and emotions. These elements make the story more dramatic and
involving, and propel the plot forward. The subtitles also contain relatively
many greetings, terms of direct address and polite formulae, mainly because
the recorded conversations are more static in terms of the communicative
settings and the participants. At the same time, the subtitles have relatively
low frequencies of vague expressions, narrative elements and various
discourse markers. As for vague expressions, a possible explanation might be
that the film audience has only limited knowledge of the context and cannot
seek clarification. The relative scarcity of narrative elements in the subtitles
may be due to the fact that films usually tell a story by showing it developing
on the screen, rather than through someone’s monologue. Film characters
usually discuss their immediate situations and talk to one another, rather
than discuss third parties and past events. Finally, the subtitles also contain
fewer discourse markers than the conversations. This can be explained by the
lack of actual time pressure in the interaction between the characters, who
reproduce prepared text.
Notably, all these features are, in fact, shared by the subtitles with
film and TV series transcripts, which were studied by Mittmann (2006),
Quaglio (2008), Bednarek (2011), Freddi (2012) and others. Thus, film
subtitles represent a type of filmese/serialese.
This section compares the original English subtitles with the English
subtitles translated from French and then with the translations from the other
languages. I will employ the n-gram approach, which was introduced in
Section 4, to pinpoint the differences between the types of subtitles. If the
translations are strongly influenced by the source language(s), one can expect
this to be reflected at the level of the top distinctive n-grams.
First, I will discuss the most distinctive 1-grams in the original English
subtitles and those in the subtitles of French films translated into English.
As in Section 4, the analyses are based on the 100 most distinctive 1-grams in
each sub-corpus with the normalised DP scores below 0.5. The top fifteen
1-grams in each sub-corpus are shown in Table 8.
An analysis of the top 100 1-grams in the original English subtitles
shows that this sub-corpus contains less formal language than the translated
subtitles. Examples are colloquial contractions (wanna, gotta and gonna) and
informal exclamations (such as wow, Jesus and yeah). The original subtitles
also contain a relatively larger number of discourse markers (hmm, oh, uh,
actually, well and okay), as well as polite formulae, greetings, attention
signals and terms of address (thank and pleasure [as in ‘it’s my pleasure’
Table 8: Top fifteen most distinctive 1-grams in the original English subtitles (left) and the ones translated from French (right).
More frequent in original English subtitles              More frequent in subtitles translated from French
1-gram   Freq. original   Freq. transl.   ORdisc         1-gram   Freq. original   Freq. transl.   ORdisc
This section discusses the most distinctive 3-grams, which were retrieved by
using the same methodology. I will begin by comparing the original English
subtitles with those translated from French. The top fifteen 3-grams in each
sub-corpus are shown in Table 9.
An inspection of the top 100 most distinctive 3-grams in the original
English subtitles reveals the presence of polite formulae (e.g., ladies and
gentlemen, I’m sorry and to meet you [as in ‘Pleased/nice/. . . to meet you’])
and a relatively high frequency of hedges, downtoners and attention-getting
signals (e.g., I guess I, I think you, I’m just and you know what). There
are also several elements of expressions that challenge the addressee and
propel the plot forward (e.g., think about it and you talking about as part
of ‘What are you talking about?’). In addition, one can find several vague
expressions (some kind of, one of these and a lot of ). Finally, the original
subtitles have a relatively high proportion of 3-grams with the informal future
marker gonna (e.g., ’m gonna take), whereas the subtitles translated from
French more frequently contain the future marker ’ll (e.g., you’ll get and
I’ll call).
A corresponding comparison between the original English subtitles
and the subtitles translated from the languages other than French yields
highly similar results and is omitted due to limitations of space.
Table 9: Top fifteen most distinctive 3-grams in the original English subtitles (left) and the ones translated from French (right).
More frequent in original English subtitles              More frequent in subtitles translated from French
3-gram   Freq. original   Freq. transl.   ORdisc         3-gram   Freq. original   Freq. transl.   ORdisc
6. Conclusions
This study has compared online film subtitles with other registers of
spoken and written British and American English with the help of
n-grams with n from 1 to 3. I employed several statistical techniques and
measures (hierarchical cluster analysis, correlation coefficients, odds ratios
and deviations of proportions as a dispersion measure). The results of the
study can be summarised as follows.
dialogue remains for future research. Another question is whether the above-
mentioned linguistic characteristics of film subtitles in English can be
extrapolated to other languages and whether one can speak about universal
filmese.
To conclude, if film dialogue is a reflection of real dialogue,
subtitles are a reflection of a reflection. At the same time, they are
remarkably close to real informal language. The results are of high practical
significance for contrastive and typological studies of world languages,
since informal dialogical language is strongly under-represented in the
linguistic data currently used in those disciplines. However, due to the
peculiarities described above, it would be risky to use subtitles as data
for full-fledged conversational and discourse analyses as a replacement for
spoken language (see Chaume, 2004; and Valdeon, 2008) and filmese in
general. For this purpose, comparable original corpora produced in natural
settings are indispensable. For other purposes, however, there seem to be no
reasons to be overly sceptical, in particular, when one’s approach is based on
a quantitative analysis of a large corpus of subtitles.
Corpora
References