
Online film subtitles as a corpus: an n-gram approach

Natalia Levshina1

Abstract

In this paper, I investigate online film subtitles from a quantitative
perspective, treating them as a separate register of communication. Subtitles
from films in English and other languages translated into English are
compared with registers of spoken and written communication represented
by large corpora of British and American English. A series of quantitative
analyses based on n-gram frequencies demonstrates that subtitles are
not fundamentally different from other registers of English and that
they represent a close approximation of British and American informal
conversations. However, I show that the subtitles are different from the
conversations with regard to several functional characteristics, which are
typical of the language of scripted dialogues in films and TV series in
general. Namely, the language of subtitles is more emotional and dynamic,
but less spontaneous, vague and narrative than that of naturally occurring
conversations. The paper also compares subtitles in original English and
subtitles translated from other languages and detects variation that can be
explained by differences in communicative styles.
Keywords: cluster analysis, correlation, deviation of proportions, n-grams,
odds ratio, register, subtitles.

1. Introduction: aims

Online film subtitles are an attractive source of data for corpus linguistics.
They are freely downloadable in many different languages from numerous
online repositories. Probably the most attractive feature of film subtitles in
comparison with other multilingual parallel corpora (e.g., translations of the
Bible, proceedings of the European Parliament, the European Union law and
the documents of the United Nations) is that film subtitles are, as a rule,
stylistically much closer to informal spoken dialogues than any of these

1
Leipzig University (IPF 141199), Nikolaistraße 6–10 04109 Leipzig, Germany.
Correspondence to: Natalia Levshina, e-mail: natalevs@gmail.com
Corpora 2017 Vol. 12 (3): 311–338
DOI: 10.3366/cor.2017.0123
© Edinburgh University Press
www.euppublishing.com/cor

corpora. Fictional film language (original and translated) is characterised
by ‘prefabricated orality’ (Baños-Piñero and Chaume, 2009). This means
that screenplay writers try to create film dialogues such that viewers would
recognise them as true-to-life speech. Even a brief inspection of film subtitles
demonstrates that subtitlers make an effort to achieve this realism, too.
So far as the lexicon is concerned, film subtitles usually contain few, if
any, special terms and archaisms. From a grammatical perspective, subtitles
normally contain short simple sentences, questions, exclamations, commands
and other features of involved informal communication. As an illustration,
consider a fragment from subtitles of the film The Black Swan (2010), shown
below. This is an example of the SubRip format, which contains the text and
information about the time (up to milliseconds) when a caption should appear
on and disappear from the screen.
268
00:33:22,546 → 00:33:24,109
- Here, hold this. – Yeah, sure.
269
00:33:25,548 → 00:33:29,219
You must be so excited.
270
00:33:31,080 → 00:33:32,668
Are you freaking out?
271
00:33:32,703 → 00:33:33,740
- Yeah. . . – Yeah?
272
00:33:35,981 → 00:33:36,814
Oh, it’s okay.
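The structure of such a file can be illustrated with a short Python sketch. It is only an illustration, not the procedure used in this study: the names Caption, to_ms and parse_srt are hypothetical, and the sketch assumes that caption blocks are separated by blank lines and that the timing arrow is written either as --> or as the typeset arrow shown in the fragment above.

```python
import re
from typing import List, NamedTuple


class Caption(NamedTuple):
    index: int      # running number of the caption
    start_ms: int   # time at which the caption appears, in milliseconds
    end_ms: int     # time at which the caption disappears, in milliseconds
    text: str       # caption text, with line breaks replaced by spaces


# Timing line such as "00:33:22,546 --> 00:33:24,109".
# Assumption: the arrow is "-->" or the typeset "→".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*(?:-->|→)\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four fields of a time stamp to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(raw: str) -> List[Caption]:
    """Parse SubRip text into captions; malformed blocks are skipped."""
    captions = []
    # Assumption: caption blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3 or not lines[0].isdigit():
            continue
        match = TIMING.fullmatch(lines[1])
        if match is None:
            continue
        fields = match.groups()
        captions.append(
            Caption(int(lines[0]), to_ms(*fields[:4]), to_ms(*fields[4:]), " ".join(lines[2:]))
        )
    return captions


SAMPLE = """268
00:33:22,546 --> 00:33:24,109
- Here, hold this. - Yeah, sure.

269
00:33:25,548 --> 00:33:29,219
You must be so excited.
"""

for caption in parse_srt(SAMPLE):
    print(caption.index, caption.start_ms, caption.end_ms, caption.text)
```

Converting the time stamps to milliseconds makes captions from different files of the same film directly comparable, which is what an alignment based on timing information would rely on.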
Online film subtitles are particularly convenient for the purposes of language
comparison because one can easily find subtitles of the same film in many
different languages and create a parallel corpus, using the timing information
for alignment. Online film subtitles may be preferable to translations of
fiction, such as The Little Prince by Antoine de Saint-Exupéry or the Harry
Potter books by J.K. Rowling, because one can choose from a wide selection
of films of different genres.
There also exist several ready-made collections of film subtitles that
are available for download and/or online queries. Perhaps the earliest and
largest collection of film subtitles in different languages can be found in the
Opus corpus (Tiedemann, 2008).2 A sub-corpus of film subtitles with Czech
as the pivot language is also included in InterCorp.3 I am currently developing

2
Accessed 21 April 2015 at: http://opus.lingfil.uu.se/.
3
Accessed 21 April 2015 at: www.korpus.cz/intercorp.

a collection of subtitles that are simultaneously available in many diverse
languages for typology and areal linguistics.4 As for monolingual corpora,
English film subtitles represent a part of the New Model Corpus from the
Sketch Engine corpus family.5 One should also mention transcripts of serial
television dramas that constitute Mark Davies’s Corpus of American Soap
Operas.6
Monolingual and parallel corpora based on film subtitles have
already been used in linguistic research, although the number of studies
is still relatively small. Some pioneering work based on film subtitles in
ten and more languages has been done by Levshina (2015, 2016a,b). In
psycholinguistic studies, film subtitles in the original language have been
shown to be a reliable source of lexical norms, sometimes outperforming
other sources (Keuleers et al., 2010).
Despite these attractive features and promising first results, one
should issue several cautions in using film subtitles for theoretical and
applied linguistic research. A well-known problem is the possible influence
of the source language on the target language in translations (so-called
‘translationese’, see Johansson and Hofland, 1994) – although this problem is
shared by all parallel corpora. In online subtitles, the problem is exacerbated
by the fact that it is often impossible to tell which language was the source.
For example, a French film can be translated into German directly or through
English.
Second, professional subtitlers have rigid rules to follow with regard
to the maximum length of a line, the time during which a caption should
stay on screen, etc. (e.g., Deckert, 2013: Appendix 1; and Díaz Cintas and
Remael, 2014: Chapter 4). Many elements are omitted or reformulated with
a preference for shorter constructions. In some films, the percentage of
omitted elements can be quite high. For example, speech reduction in Spanish
subtitles of one of Woody Allen’s films was as high as 40 percent because the
dialogue was too fast and verbose to be represented fully on the screen (Díaz
Cintas and Remael, 2014: 199). As for examples of reformulation, subtitlers
tend to use will instead of be going to, replace light verb constructions with
simple verbs (e.g., feel instead of have the feeling), and use simple rather than
complex tenses (e.g., simple past instead of past perfect), see Díaz Cintas and
Remael (2014: 202–4).
Moreover, previous research into transcribed TV series dialogues
and films has revealed a few differences between scripted film or TV series
dialogue and spontaneous conversation. In particular, it has been observed
that narrative and ‘vague’ elements and some discourse markers are
under-represented in film and TV dialogues in comparison with naturally occurring
conversations (e.g., Bednarek, 2011; Mittmann, 2006; and Quaglio, 2008).

4
The ParTy corpus (Parallel corpus for Typology). See
www.natalialevshina.com/corpus.html.
5
Accessed 8 September 2015 at: https://www.sketchengine.co.uk/new-model-corpus/.
6
Accessed 8 September 2015 at: http://corpus.byu.edu/soap/.

At the same time, most researchers seem to stress similarity between film
or TV series dialogue and naturally occurring conversations, as far as their
pragmatic and lexico-grammatical features are concerned (for an overview,
see Dose, 2014: Chapter 4.3.4).
Finally, no one can guarantee the quality of film subtitles
downloaded from online repositories. Both non-translated and translated
subtitles may contain typos and errors. It is very difficult to find reliable
information about the subtitler of a specific film and his or her linguistic
background and expertise. Although most repositories have systems of
downvoting bad subtitles, this does not guarantee a high quality of subtitles
that have not been marked as poor. However, in my personal experience
based on linguistic analysis of subtitles and native speakers’ evaluation of text
samples in several languages, the vast majority of subtitles have acceptable
quality, although one can see occasional transcription and translation
errors (see Bednarek, 2010: 70), as well as mistakes in orthography and
punctuation. As can be seen from online comments or additional information
in the files, many subtitles have been corrected several times, which implies
a high level of dedication of the members of the online subtitling community.
Do all these peculiarities make film subtitles too risky to be used for
linguistic research? To answer this question, I adopt a quantitative approach,
treating English subtitles on a par with other registers of written and
spoken English. The main research questions of this paper are as follows:

(1) Do English film subtitles represent a language variety that is
fundamentally different from other varieties of spoken and written
English?
(2) How similar are film subtitles to naturally occurring conversations?
(3) What are the distinctive linguistic features of subtitles in
comparison with naturally occurring informal conversations? And,
(4) How similar are subtitles in original English and subtitles
translated from other languages? What are the linguistic
differences between these types of subtitles?

These questions will be answered with the help of an n-gram
approach. To answer Questions 1 and 2, I will perform a correlation analysis
and hierarchical clustering of registers based on n-gram frequencies. To
answer Question 3 about the distinctive linguistic features of subtitles
in comparison with naturally occurring conversations, I will analyse the
n-grams that are distinctive of subtitles and those that more frequently occur
in conversations. A similar procedure will be applied in order to answer
Question 4. For that purpose, n-grams that more frequently occur in original
English subtitles will be compared with those that can be found in subtitles
translated from French and several other languages.
The remaining part of the paper is organised as follows. Section 2
describes the data and methods. Section 3 presents the results of the
clustering models and correlation analyses. Section 4 discusses the distinctive

n-grams of subtitles in comparison with spontaneous conversations in British
and American English, whereas Section 5 contrasts the original English
subtitles with those translated from other languages. Finally, Section 6
summarises the results.

2. Data

2.1 Corpora

The subtitles investigated in this study were collected from the online
repository OpenSubtitles.org.7 The films represent various fictional genres,
according to the genre classification from the Internet Movie Database:8
drama, adventure, fantasy, comedy, mystery, crime, etc. In total, I
selected twenty-three files for films that were originally in English,
twenty files representing films that were originally in French and fourteen
files representing films originally in languages other than English or
French (fourteen different languages). According to the Internet
Movie Database, most English films in the sample were either American
or produced by international teams with American participation (e.g.,
USA/UK/Australia). French was chosen because it is easy to find sufficient
data, due to the popularity of French films. It was considered separately from
the other languages because one of the goals was to try to identify linguistic
features that would reflect the properties of one specific source language
(see Section 5). The overwhelming majority of the films, English and
non-English, were created in the last twenty years. All files were in the SubRip
(.srt) format.
The subtitles were compared with samples from well-known corpora
of written and spoken British and American English. Both British and
American data were used because subtitles do not represent only one of
these two varieties exclusively. In fact, most files in the sample contain
the American spelling variants (e.g., color), but there are a few files where
the British variants are found (e.g., colour). Each national variety was
represented by two written registers (newspaper texts and fiction) and two
spoken registers (transcripts of informal conversations and radio and TV
broadcasts, which mostly represented unscripted conversations). The choice
of registers was motivated by their availability in large comparable corpora
of British and American English. The British data were taken from the
British National Corpus (BNC). The files were selected on the basis of the
meta-information. All files included 5,000 to 15,000 tokens, which made
them comparable to the subtitle files. The American data from newspapers,
fiction and media broadcasts were taken from the corresponding components
of the Corpus of Contemporary American English (COCA). For each of
the components, a local copy of the corpus contained twenty-three very

7
Accessed 21 April 2015 at: http://opensubtitles.org.
8
Accessed 21 April 2015 at: http://www.imdb.com/.

Subcorpus                                          Number of files/samples   Number of tokens
Subtitles of films originally in English           23                        254,914
Subtitles of films originally in French            20                        132,159
Subtitles of films originally in other languages   14                        130,552
BrE informal conversations (BNC)                   29                        268,370
AmE informal conversations (SBCSAE)                19                        87,481
BrE radio and TV broadcasts (BNC)                  27                        234,399
AmE radio and TV broadcasts (COCA)                 23                        184,000
BrE newspapers (BNC, only national)                24                        237,080
AmE newspapers (COCA)                              23                        184,000
BrE fiction (BNC)                                  23                        192,528
AmE fiction (COCA)                                 23                        184,000
Total                                              248                       2,089,483

Table 1: The number of files/samples and the number of tokens in subcorpora.


large files that represent each year from 1990 to 2012. A sample of 8,000
words was drawn from each of the twenty-three files for each register.9
The informal conversations in American English were taken from the
Santa Barbara Corpus of Spoken American English (SBCSAE). The richly
annotated dialogue scripts were stripped of the information about pauses,
background noises, coughing, etc., with the help of a Python script. The
number of files and the number of tokens in each subcorpus are shown in
Table 1.

2.2 Extraction of n-grams

The next step was to extract n-grams with n of 1, 2 and 3 with the help of
Python scripts. It did not make much practical sense to try larger n because

9
Although the fiction sub-corpus of COCA includes film scripts, the sample drawn for this
study did not contain them.

of the relatively small sizes of the sub-corpora, which would produce too
many hapax legomena. Moreover, as Gries et al. (2011) demonstrate, the
longer n-grams do not add much new information for the purposes of register
classification. N-grams were defined as sequences of n tokens within a
sentence which were not separated by punctuation marks. Punctuation marks
themselves were not considered to be elements of n-grams. The difference
between lower and upper case was disregarded. The contracted forms (e.g.,
I’ll and don’t) were regarded as combinations of two grams (I + ’ll and
do + n’t). The possessive marker ’s was treated as a separate gram. I chose
this method of tokenisation because it had been used in the majority of the
corpora selected. In the corpora where this was not the case, the data were
first normalised automatically with the help of a Python script.
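The extraction rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scripts used in the study: the function name extract_ngrams is hypothetical, the list of clitics that are split off goes beyond the three examples given in the text, and every character that is not a letter, digit, apostrophe or whitespace (including hyphens) is treated as a punctuation boundary.

```python
import re
from collections import Counter

# Tokens: the clitic n't, apostrophe clitics, or a run of word characters.
# Assumption: the clitic list ('ll, 's, 're, 've, 'd, 'm) is illustrative;
# the text names only I'll, don't and the possessive 's.
TOKEN = re.compile(r"n't|'(?:ll|s|re|ve|d|m)\b|\w+?(?=n't)|\w+")

# Assumption: anything that is not a word character, an apostrophe or
# whitespace counts as punctuation and ends the current token sequence.
BOUNDARY = re.compile(r"[^\w'\s]+")


def extract_ngrams(text: str, n: int) -> Counter:
    """Count n-grams that do not cross punctuation; case is ignored."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    counts = Counter()
    for segment in BOUNDARY.split(text):
        tokens = TOKEN.findall(segment)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


example = "I'll go. Don't stop, please!"
print(extract_ngrams(example, 1))
print(extract_ngrams(example, 2))
```

In this sketch, don't is split into do + n't and I'll into i + 'll, and no 2-gram spans the full stop or the comma, so go + do and stop + please are not counted.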
These n-grams and their frequencies in each register served as an
input for all subsequent analyses, which are described in Sections 3 to 5.
Since the results based on the 2-grams were intermediate between the ones
based on the 1-grams and 3-grams, the 2-grams will not be discussed for
reasons of space.

3. Clustering models of English registers

3.1 Methodology

This section presents a series of correlation and cluster analyses based on
the frequencies of the 1-grams and 3-grams, with a focus on the relationships
between the subtitles and the other registers. The idea of using n-grams
for comparison of language varieties is not new (e.g., Biber et al., 1999:
Chapter 13; Gries et al., 2011; and Xiao and McEnery, 2005). This approach
was chosen because it is much less labour-intensive than traditional
multi-dimensional analysis (e.g., Biber, 1988). It is also more objective, as it does
not require an a priori selection of linguistic features.
The first step was to compute the correlation coefficients. The
Pearson correlation coefficient r was used because it was shown to be
useful for discrimination between the registers with the help of n-grams
(Gries et al., 2011). In addition, my own experiments with the data have
demonstrated that another popular correlation coefficient, the rank-based
Spearman rho, yields a much less interpretable picture of register variation
in English. The Pearson r ranges from –1 to 1, where –1 indicates a perfect
negative, or inverse correlation, and 1 stands for a perfect positive correlation;
0 indicates a lack of any relationship. All correlation coefficients that were
computed were greater than 0.
The next step was to transform the correlation coefficients into
distances by subtracting the former from 1. After that, a hierarchical
clustering analysis was performed on the basis of the Ward algorithm, which
usually produces compact clusters. A series of hierarchical agglomerative
clustering models was created for 1 percent of the most frequent grams,

Figure 1: A clustering model based on all 1-grams.

5 percent, 10 percent and 25 percent, as well as for all data without hapax legomena and all
data with hapax legomena. All these clustering models were nearly identical.
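The first two steps, computing Pearson's r between registers and turning it into a distance, can be sketched as follows. The analyses in this paper were carried out in R; the Python code below is only an illustration, the function names pearson_r and correlation_distances are hypothetical, and the toy frequencies are invented for the example rather than taken from the corpora.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long frequency vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def correlation_distances(freqs):
    """Map each pair of registers to the distance 1 - r.

    freqs maps a register name to a dict of n-gram frequencies; n-grams
    missing from a register are counted as 0.
    """
    vocab = sorted(set().union(*freqs.values()))
    vectors = {reg: [counts.get(g, 0) for g in vocab] for reg, counts in freqs.items()}
    names = sorted(vectors)
    return {
        (a, b): 1.0 - pearson_r(vectors[a], vectors[b])
        for a in names
        for b in names
        if a < b
    }


# Invented toy frequencies, for illustration only.
toy = {
    "subtitles": {"you": 50, "the": 30, "said": 5},
    "conversation": {"you": 45, "the": 35, "erm": 20, "said": 8},
    "newspaper": {"you": 5, "the": 80, "said": 30},
}

for pair, distance in sorted(correlation_distances(toy).items()):
    print(pair, round(distance, 3))
```

Because Pearson's r is unaffected by rescaling either vector, raw and relative frequencies give the same coefficient. The resulting pairwise distances are the kind of input that a hierarchical clustering routine would then take.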
All statistical analyses and graphics presented in this paper were
performed or created in R, a free software environment for statistical
computing and graphics (R Core Team, 2014). Sections 3.2 and 3.3 present
the clustering solutions based on the 1-grams and 3-grams.

3.2 A clustering model based on 1-grams

Figure 1 displays a hierarchical clustering model based on all 56,619
1-grams. The figure should be interpreted as follows. The tree ‘grows’ from
the ‘leaves’ (i.e., the registers) to the ‘root’ (the top merge). Pairs of leaves or
‘branches’ (i.e., clusters with several leaves) merge from bottom to top until
all leaves and branches are included in the tree. The smaller the distance
between two leaves or branches, the sooner they will merge. Therefore, one can
expect the registers with similar frequencies of the same n-grams to cluster,
and the registers with different frequencies to belong to different clusters.
One can see that the registers are sub-divided into two large clusters,
one representing the written British and American registers and the TV and
radio broadcasts, and the other containing the British and American informal
conversations and the subtitles. This can be interpreted as the distinction
between more formal and less formal registers. The translated subtitles form
a small cluster separate from the original subtitles, although all types of
subtitles merge very soon, which indicates a high level of similarity between
the original and translated subtitles.
A closer look at the correlation coefficients (see Table 2) provides
additional information. The correlation coefficients between the three types

Register            sub_original   sub_transl_other   sub_transl_french
am_broadcast        0.88           0.877              0.866
am_convers          0.927          0.904              0.903
am_fiction          0.821          0.82               0.817
am_newspaper        0.693          0.695              0.691
br_broadcast        0.891          0.883              0.871
br_convers          0.947          0.93               0.936
br_fiction          0.701          0.706              0.713
br_newspaper        0.635          0.637              0.629
sub_original        –              0.99               0.988
sub_transl_other    0.99           –                  0.991
sub_transl_french   0.988          0.991              –

Table 2: Correlations between subtitles and other registers based on
all 1-grams (Pearson’s r coefficients). (Note: the highest correlation
coefficients between the subtitles and the other registers are found in the
am_convers and br_convers rows, which are shaded in the original table.)

of subtitles are extremely high: they are, in fact, the highest coefficients
among all types of registers. This means that there is no principled difference
between the subtitles in original English and translations, so far as the
frequencies of the 1-grams are concerned. As one could expect on the basis
of the clustering model, the strongest correlations between the subtitles and
the other registers are observed in the case of British informal conversations,
followed by the American informal conversations (see the shaded rows in
Table 2). This holds for all three types of subtitles, although the translated
subtitles tend to have slightly lower coefficients than the original subtitles.
The next highest correlations are with the TV and radio broadcasts, which
are followed by the fiction. The lowest correlation coefficients are observed
between the subtitles and newspapers. For comparison, the lowest correlation
between all registers is found between the British conversations and the
British newspapers (r = 0.586). In the American data, the lowest correlation
is between the conversations and newspapers, too, although the correlation is
higher (r = 0.686). This suggests that the differences between the traditional
registers are greater than the difference between the subtitles and the informal
conversations.

3.3 Clustering models based on 3-grams

Figure 2 displays a clustering solution based on all 965,909 3-grams. The
clustering solution displays a large cluster with the spoken data and a cluster

Figure 2: A clustering solution based on all 3-grams.

with the written registers, except the British newspapers, which merge
last.
The subtitles cluster with spoken registers: they first merge with the
informal conversations and then with the TV and radio broadcasts. This is
supported by the correlation coefficients displayed in Table 3. Again, the
strongest correlations are between the subtitles and conversations, followed
by the broadcasts and fiction. The original and translated subtitles are
again more similar to one another than to the other registers, although the
translations are again slightly less correlated with the other registers than the
subtitles in original English.

3.4 Clustering models and correlations: interim conclusions

Section 3 has discussed several clustering solutions and correlation
coefficients that help us to answer the first two research questions. The
answer to the first question, namely, whether film subtitles represent a variety
of English that is fundamentally different from other registers, is negative.
In both clustering models, the subtitles do not form a cluster that would be
separate from the other registers. Moreover, the subtitles are more similar to
some traditional registers than these traditional registers are similar to one
another. From this, one can conclude that subtitles represent language that
does not differ fundamentally from English produced in more naturalistic
settings. In both analyses, the subtitles cluster together with the informal
spontaneous conversations. The subtitles also display the highest correlations
with this register, in particular with the British informal conversations. These
correlation coefficients are in fact the highest among all registers compared,
which suggests a close similarity between the subtitles and conversations (see
Research Question 2).

Register            sub_original   sub_transl_other   sub_transl_french
am_broadcast        0.578          0.505              0.489
am_convers          0.685          0.615              0.599
am_fiction          0.446          0.408              0.4
am_newspaper        0.305          0.26               0.256
br_broadcast        0.542          0.474              0.47
br_convers          0.696          0.651              0.63
br_fiction          0.428          0.402              0.395
br_newspaper        0.099          0.072              0.079
sub_original        –              0.78               0.756
sub_transl_other    0.78           –                  0.743
sub_transl_french   0.756          0.743              –

Table 3: Correlations between subtitles and other registers based on
all 3-grams (Pearson’s r coefficients). (Note: the highest correlation
coefficients between the subtitles and the other registers are found in the
am_convers and br_convers rows, which are shaded in the original table.)

In addition, one can draw some conclusions regarding the fourth
question about the relationships between film subtitles that are translated
from other languages and subtitles in original English. The analyses reveal
high positive correlations between the subtitles that are translations and
those which are not. The corresponding correlation coefficients are in fact
the highest observed coefficients between all registers in the data set. This
suggests a high level of similarity between different types of subtitles.
However, the translated subtitles tend to be slightly less strongly correlated
with the other spoken registers than the original ones. I will discuss possible
explanations for these differences in Section 5.

4. Distinctive n-grams in the subtitles and informal conversations

4.1 Methodology: distinctive n-gram analysis based on odds ratios and
deviations of proportions

This section investigates Research Question 3, which concerns the
linguistic differences between the film subtitles and the British and American
spontaneous informal conversations. There exist different methods of
identifying distinctive elements of sub-corpora and registers. One can, for
instance, use multivariate approaches, such as Principal Component Analysis
or Factor Analysis (e.g., Biber, 1988), and identify the factor loadings of
different linguistic features on the dimensions of register variation. Another

approach is to identify keywords in a sub-corpus in comparison with another
sub-corpus or a large reference corpus (Scott, 1997). Keywords are words
that occur in the sub-corpus of interest more frequently than one could expect
them to occur by chance alone. Whether the differences between the observed
and expected frequencies are statistically significant is determined with the
help of statistical measures, such as the log-likelihood ratio (Dunning, 1993),
the chi-squared statistic and the Fisher Exact Test p-value (see an overview
in Baron et al., 2009). Yet another possibility is to compare the rankings of
n-grams (e.g., Bednarek, 2011) in texts of different registers.
Here I will use an alternative method that compares the relative
frequencies of n-grams in the subtitles and the British and American
spontaneous conversations. This method was chosen over the traditional
multi-dimensional analysis because this paper focusses on the differences
between the subtitles and the spoken data, rather than on the differences
between all registers, most of which have been explored extensively. It is also
preferable to the ranking comparison approach because the latter involves
a loss of information when the level of measurement goes down from the
ratio scale to the ordinal scale. The keyword approach based on significance
testing has a few problems with underlying statistical assumptions. First, it
does not take into account the fact that many n-grams are sampled from
one and the same text written by a specific author, and this means that
the observations are not sampled randomly. In this situation, the keyword
approach based on the computation of a hypothesis testing statistic is
problematic. Another problem arises when n > 1. Consider a simple example:
if a bigram is in spite, the chances are high that the following bigram will be
spite of. The assumption of independence of observations is violated again
but in a different way.
Given all of these problems, I will use a descriptive measure of
effect size, rather than a hypothesis-testing statistic, for identification of the
n-grams that are the most distinctive of the subtitles and those that are the
most distinctive of the conversations. More specifically, I will use the odds
ratio, which is the ratio of the odds of an n-gram in one type of text to the
odds of the same n-gram in another type of text. The method is as follows. For every
n-gram, one needs four scores shown in Table 4.
The traditional odds ratio is computed according to the following
formula:
\[
OR = \frac{a/b}{c/d} = \frac{a \cdot d}{b \cdot c}
\]
If the odds of an n-gram are equal in both registers, the odds ratio will be 1.
If the odds of an n-gram are higher in the subtitles than in the conversations,
the odds ratio will be greater than 1. If the odds of an n-gram are higher in
the conversations, the odds ratio will be between 0 and 1. The greater the
OR, the more distinctive (over-represented) the n-gram is in the subtitles, and
the less representative it is of the conversations, and vice versa. Since the
frequency c may equal zero (i.e., when a given n-gram does not occur in the

                    Frequency in subtitles              Frequency in conversations
n-gram              a: the raw frequency of the         c: the raw frequency of the n-gram in the
                    n-gram in the subtitles             informal spontaneous conversations
all other n-grams   b: the raw frequency of all other   d: the raw frequency of all other n-grams in
                    n-grams in the subtitles            the informal spontaneous conversations

Table 4: Frequencies required for computation of odds ratios.

1-gram      Frequency in subtitles   Frequency in conversations   ORdisc
howard      137                      0                            207.6
malkovich   124                      0                            188
paro        109                      0                            165.3
gatsby      95                       0                            144.2
daisy       93                       0                            141.1

Table 5: Top five distinctive 1-grams in the subtitles (compared with
spontaneous conversations).

conversations), there is a danger of division by zero. To avoid this problem,
I will use a ‘discounted’ version of OR, adding a small number (0.5) to each
of the four frequencies.
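As a minimal sketch, the discounted odds ratio can be computed as shown below. The function name discounted_odds_ratio and the sub-corpus totals in the example are hypothetical; the totals are round illustrative numbers, not the counts used in the study, so the printed values are not expected to reproduce the scores reported in the tables.

```python
def discounted_odds_ratio(a, b, c, d, discount=0.5):
    """Odds ratio (a/b) / (c/d) after adding `discount` to each frequency.

    a: frequency of the n-gram in the subtitles
    b: frequency of all other n-grams in the subtitles
    c: frequency of the n-gram in the conversations
    d: frequency of all other n-grams in the conversations
    """
    a, b, c, d = (x + discount for x in (a, b, c, d))
    return (a / b) / (c / d)


# Hypothetical sub-corpus sizes, for illustration only.
TOTAL_SUBTITLES = 500_000
TOTAL_CONVERSATIONS = 350_000


def or_for(freq_sub, freq_conv):
    """Discounted OR of an n-gram, given its two raw frequencies."""
    return discounted_odds_ratio(
        freq_sub,
        TOTAL_SUBTITLES - freq_sub,
        freq_conv,
        TOTAL_CONVERSATIONS - freq_conv,
    )


print(or_for(137, 0))   # absent from the conversations: no division by zero
print(or_for(0, 850))   # absent from the subtitles: value between 0 and 1
print(or_for(100, 70))  # same relative frequency in both: value close to 1
```

Without the discount, the first call would divide by zero and the second would return exactly 0, so n-grams that are absent from one register could not be ranked against each other.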
Another concern is that some frequent n-grams may occur in one
text only. Such n-grams are not representative of the entire register, even
if their frequency is very high. Table 5 shows the top five 1-grams in the
subtitles data, ranked by their discounted OR. These
are proper names of film protagonists. Each of these names occurs only in one
subtitle file. Of course, such results are not particularly informative.
To solve this problem, one needs to take into account the dispersion
of n-grams in the sub-corpus. One could filter out the n-grams that occur
in fewer than a predetermined number of corpus documents. However, this
approach would not take into account the fact that an n-gram is more likely
to be detected in a large text than in a small text, based on chance alone.
In this paper, I will use an alternative approach, which was suggested in
Gries (2008) and Lijffijt and Gries (2012) and which takes into account the
differences in the probabilities of an n-gram occurrence depending on the
size of the corpus components. In this approach one computes a deviation of
proportions (DP) score for every word. This measure reflects how much the

relative frequencies of a word in different components of a corpus deviate
from what one could expect based on the size of each corpus component. The
greater the deviation, the more unevenly the word is dispersed and therefore
the less representative it is of the corpus as a whole. The formula for the
computation of DP is as follows:
\[
DP = 0.5 \times \sum \left| P_{obs} - P_{exp} \right|
\]

In this formula, Pobs is the proportion of all instances of a word that occur in a
given component of the corpus; Pexp is the relative size of that component,
that is, the number of words in the component as a proportion of the number
of words in the total corpus. Pexp thus takes into consideration the size
differences between the corpus components. DP is the sum of the absolute
differences between Pobs and Pexp over all components, divided by 2.
This score is then normalised in order to be distributed from 0 to 1
with the help of the following formula:
\[
DP_{norm} = \frac{DP}{1 - \min\left(P_{exp}\right)}
\]
If one computes the scores for the words in Table 5, they will range from
0.944 to 0.984. This indicates that the words are dispersed very unevenly.
Using an arbitrary cut-off point of DPnorm = 0.5, one can be sure that such
cases are filtered out and the remaining n-grams are truly representative of the
sub-corpus as a whole. A stricter (lower) cut-off value would be less practical
because the number of remaining n-grams becomes too small, especially in
the case of 3-grams.
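
The computation of DP and DPnorm described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name dp_scores and the toy component sizes and counts are hypothetical and were introduced here only to make the arithmetic concrete.

```python
from __future__ import annotations

from typing import Sequence


def dp_scores(word_counts: Sequence[int], part_sizes: Sequence[int]) -> tuple[float, float]:
    """Return (DP, DPnorm) for one word across the components of a corpus.

    word_counts[i] is the frequency of the word in component i;
    part_sizes[i] is the total number of words in component i.
    """
    if len(word_counts) != len(part_sizes):
        raise ValueError("word_counts and part_sizes must have the same length")
    if len(part_sizes) < 2:
        raise ValueError("at least two corpus components are needed")
    total_word = sum(word_counts)
    total_size = sum(part_sizes)
    if total_word == 0 or total_size == 0:
        raise ValueError("the word and the corpus must have non-zero frequency")

    # Pexp: relative size of each component; Pobs: share of the word's instances in it.
    p_exp = [size / total_size for size in part_sizes]
    p_obs = [count / total_word for count in word_counts]

    # DP: half the sum of absolute differences between Pobs and Pexp.
    dp = 0.5 * sum(abs(o - e) for o, e in zip(p_obs, p_exp))
    # DPnorm: DP divided by (1 - min(Pexp)), so that the score ranges from 0 to 1.
    dp_norm = dp / (1 - min(p_exp))
    return dp, dp_norm


# Hypothetical corpus with three components of unequal size.
sizes = [1000, 3000, 6000]
even_dp, even_dp_norm = dp_scores([1, 3, 6], sizes)     # spread in proportion to size
name_dp, name_dp_norm = dp_scores([10, 0, 0], sizes)    # found in one component only
print(round(even_dp, 3), round(name_dp, 3), round(name_dp_norm, 3))
```

In this toy case a word distributed in proportion to the component sizes receives a DP of 0, whereas a word confined to the smallest component receives the maximum normalised score of 1. Proper names that occur in a single subtitle file behave like the second case, which is why a cut-off of DPnorm = 0.5 removes them.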
In the remaining part of Section 4, I will discuss the results of the
n-gram analysis for the 1-grams and 3-grams. Since the results of a 2-gram
analysis are similar to the ones based on 1-grams and 3-grams, they will not
be discussed for reasons of space.

4.2 Subtitles versus conversations: 1-grams

The discussion below is based on the analysis of the top 100 most distinctive
1-grams in the subtitles with DPnorm < 0.5 (i.e., the 1-grams with the
highest OR) and an equivalent set of 1-grams in the informal spontaneous
conversations in British and American English (i.e., the 1-grams with the
lowest OR). For illustration, the top fifteen 1-grams in both registers are
shown in Table 6. Note that the prominent positions of numerals (10, 20, 3;
eighty and twelve) in both lists are explained by the fact that numerals are
represented in different ways in the two sub-corpora: by digits in the subtitles
and by words in the conversations. The representation of numbers by digits in
the subtitles can be explained by space limitations. The 1-gram tape indicates
that the participants in the conversation often referred to the fact of being
recorded.
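
The selection procedure described in this paragraph (discard unevenly dispersed n-grams, then rank the rest by their odds ratio) can be sketched as follows. The sketch rests on several assumptions. The discounted OR used in this paper is defined in an earlier section and is not reproduced here; the code substitutes a generic odds ratio with 0.5 added to every cell, so its values will not match the ORdisc column of Table 6. The names NgramStats, smoothed_odds_ratio and most_distinctive, the sub-corpus sizes, the DPnorm values and the NAME entry are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NgramStats:
    ngram: str
    freq_subs: int    # frequency in the subtitles
    freq_conv: int    # frequency in the conversations
    dp_norm: float    # normalised DP score in the sub-corpus of interest


def smoothed_odds_ratio(freq_a: int, freq_b: int, size_a: int, size_b: int, k: float = 0.5) -> float:
    """Odds ratio with k added to each cell (an assumed stand-in for the paper's discounted OR)."""
    return ((freq_a + k) * (size_b - freq_b + k)) / ((freq_b + k) * (size_a - freq_a + k))


def most_distinctive(stats: list[NgramStats], size_subs: int, size_conv: int,
                     top: int = 100, dp_cutoff: float = 0.5) -> tuple[list[str], list[str]]:
    """Return the n-grams with the highest and the lowest OR among those with DPnorm below the cut-off."""
    kept = [s for s in stats if s.dp_norm < dp_cutoff]
    ranked = sorted(
        kept,
        key=lambda s: smoothed_odds_ratio(s.freq_subs, s.freq_conv, size_subs, size_conv),
        reverse=True,
    )
    distinctive_subs = [s.ngram for s in ranked[:top]]         # highest OR
    distinctive_conv = [s.ngram for s in ranked[::-1][:top]]   # lowest OR
    return distinctive_subs, distinctive_conv


# Frequencies for promise, sir and erm follow Table 6; the DPnorm values,
# the NAME entry and the sub-corpus sizes are invented for illustration.
toy = [
    NgramStats("promise", 94, 3, 0.30),
    NgramStats("sir", 411, 22, 0.40),
    NgramStats("erm", 0, 850, 0.20),
    NgramStats("NAME", 120, 0, 0.97),   # a protagonist's name found in one file only
]
top_subs, top_conv = most_distinctive(toy, 1_000_000, 1_000_000, top=2)
print(top_subs, top_conv)
```

The dispersion filter is applied before the ranking, so an n-gram such as the hypothetical NAME never reaches either list, however extreme its odds ratio.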

| 1-gram (more frequent in subtitles) | Freq. subtitles | Freq. conversations | ORdisc | 1-gram (more frequent in conversations) | Freq. subtitles | Freq. conversations | ORdisc |
|---|---|---|---|---|---|---|---|
| 10 | 88 | 0 | 133.6 | erm | 0 | 850 | <0.001 |
| 20 | 69 | 0 | 104.9 | cos | 2 | 682 | 0.003 |
| 3 | 125 | 1 | 63.2 | er | 6 | 1343 | 0.003 |
| promise | 94 | 3 | 20.4 | eighty | 0 | 64 | 0.006 |
| gentlemen | 62 | 2 | 18.9 | mm | 35 | 1442 | 0.018 |
| secret | 85 | 3 | 18.4 | pound | 6 | 164 | 0.03 |
| pleasure | 56 | 2 | 17.1 | forty | 4 | 112 | 0.03 |
| crazy | 136 | 6 | 15.9 | fifty | 10 | 136 | 0.058 |
| sir | 411 | 22 | 13.8 | quarter | 4 | 50 | 0.067 |
| calm | 102 | 6 | 11.9 | twenty | 33 | 310 | 0.081 |
| trust | 112 | 7 | 11.3 | thirty | 17 | 158 | 0.083 |
| act | 65 | 4 | 11 | tape | 11 | 92 | 0.094 |
| protect | 49 | 3 | 10.7 | round | 29 | 235 | 0.094 |
| Mr | 593 | 47 | 9.4 | twelve | 12 | 99 | 0.095 |
| kill | 266 | 21 | 9.4 | nine | 30 | 225 | 0.102 |

Table 6: Top fifteen most distinctive 1-grams in the subtitles and conversations.

Considering the top 100 1-grams that are the most representative of
the subtitles, one can observe that this sub-corpus contains a relatively high
number of direct addresses, attention signals, greetings and polite formulae,
which are exemplified by such 1-grams as Mr, Sir, gentlemen, kid, guys,
hey, thanks, excuse, welcome, sorry and pleasure (as in ‘it’s my pleasure’ or
‘with pleasure’). This observation is in line with the one made by Mittmann
(2006: 577), who found that TV series dialogues contain more greetings
and polite formulae than naturally occurring conversations (see also Freddi,
2012: 392). According to Freddi (2012), film dialogues try to mimic everyday
conversations by representing the ritualised acts of daily routine. A possible
explanation of this difference might be that films and TV drama series
represent more dynamic social situations than the informal conversations
in the BNC and SBCSAE, where the interlocutors in one recording session
usually know one another well and do not come and go often.
Another finding is that the subtitles contain a relatively high number
of words that describe a mental state (e.g., happy, sorry, scared, afraid and
crazy), evaluative adjectives (e.g., beautiful, perfect, dangerous, strange and
important) and expletives (bitch and damn). This corresponds to Quaglio’s
(2008) observation based on his analysis of the TV series Friends, where he
found the language of the TV series to be more emotional and dramatic than
that of normal conversations (see also Bednarek, 2011). This higher degree
of emotionality has to do with the entertainment function of films and series:
the viewers are supposed to become involved with and feel for the characters.
Importantly, the subtitles contain a large number of verbs in the base
form (e.g., promise, trust, act, protect, kill, stop, let, help and speak). Most
commonly, these verbs are either in the imperative, or in the future tense, or
part of an infinitival verbal complement. Consider Examples 1 to 3:

(1) Listen, word to the wise, stop dressing like you’re running for
Congress.
(Bad Teacher)

(2) I’ll help you grab your rocks.
(Batman and Robin)

(3) Let him speak.
(The Hobbit: An Unexpected Journey)

Such commands and expressions of intentions create dynamism and propel
the plot.
Looking at the top 100 1-grams of the conversations, one can find a
high number of various discourse markers, such as expressions of solidarity
or attention (e.g., yeah and mm), (dis)fluency markers (e.g., erm and er),
indicators of topic shifts (e.g., well and anyway) or corrective markers
(actually). These elements may be less frequent in the subtitles because the
latter in fact represent prepared speech, where fewer overlaps, hesitations
and corrections can be expected than in spontaneous dialogues (see Dose,
2014: 97–8). Closely related to discourse markers are mental verbs, such as
wonder, suppose and mean, which can perform different discursive functions:
hedging (I suppose), introducing a question (I wonder) and clarification
(I mean). These verbs are also under-represented in the subtitles. Although
one may be inclined to think that discourse markers might be omitted from
the subtitles due to the limitations of space, it has in fact been observed
that some discourse markers are also under-represented in transcribed film
dialogues, where such limitations are absent (e.g., Mittmann, 2006: 578;
and Quaglio, 2008: 200). Note that the raw frequencies of the majority of
these discourse markers in the subtitles are different from 0. The difference
between the subtitles and the conversations is thus only a matter of degree.
In addition to the higher proportion of discourse markers, the
conversations have a larger number of past or perfective verb forms (e.g., had,
meant, used, thought, got, walked, bought, stuck, went and said). The spoken
sub-corpus also contains several –ing forms (driving, saying, putting, having
and sitting), which often describe the background situation or participants
(e.g., ‘they were having a biology lesson’ and ‘they’re like vultures sitting
on a rail there’). In addition, the top 100 distinctive 1-grams include two
third-person pronouns, she and they. These features are associated with
narrative discourse (Biber, 1988). Notably, Bednarek (2011) observes that
TV series are also less narrative than normal conversations. This difference
can be explained as follows. In film or TV series dialogues, characters
usually talk to one another and about their immediate actions and intentions,
rather than about past events and third (absent) parties, who may not be
immediately accessible to film viewers (see Pavesi, 2008: 84–5). Moreover,
a story is usually shown to develop in time with the help of visual means,
rather than being presented verbally by film characters. The more narrative
character of the conversations is also reflected in their frequent use of
time and place adverbials – for
instance, yesterday, then, there, early and week (as in ‘this week’ or ‘next
week’) – which are normally used to refer to times and places outside the
current situation (Biber, 1988: 110).
Finally, the subtitles contain relatively few words and constructions
that can be described as instances of vague language (Channel, 1994) in
comparison with the conversations, where such words and constructions are
more frequent. Examples are elements of non-numerical vague quantifiers,
such as (a) bit (of ), (a) lot (of ), and (a) couple (of ), placeholders stuff
and ones, as well as the words might and probably. Vague language is also
under-represented in TV series in comparison with natural dialogue (e.g.,
Quaglio, 2008). The speaker can use these elements as hedges, or invite
the hearer to construct the meaning together, establishing the atmosphere of
informality. Obviously, film language has fewer vague expressions because
of its communicative limitations: the viewers may not always be able
to construct the meaning because they are ‘overhearers’ who have only
restricted access to the contextual information ‘shared’ by the characters
on the screen (Dose, 2014: 94–7). Moreover, the viewers do not have an
opportunity to ask for clarification if they fail to construct the meaning. In
addition, in real conversations speakers may be under time pressure, stress,
fatigue, etc., and therefore resort to vague language when they fail to produce
an exact expression.

4.3 Distinctive 3-grams

This section presents the results of an analysis of the top 100 most distinctive
3-grams in both sub-corpora with normalised DP scores below 0.5. The top
fifteen 3-grams are shown in Table 7.
The analysis has yielded a few interesting peculiarities. First, the
subtitles contain many questions or their elements (e.g., what is it, are you
sure, what do you, why don’t, how did you and are you doing), which mostly
express the speaker’s reaction to the hearer’s actions and help build the
conflict situations (Freddi, 2012: 391). There are also very many expressions
that contain the verbs of necessity or desire (e.g., I want you, I need to and I’d
like) with infinitival complements (e.g., I wanted to talk to you). Both features
were also observed by Bednarek (2011) when she compared dialogues in
TV series with other registers. These expressions propel the action forward.
They also reveal the characters’ motives and feelings and thus make the film
viewers identify with and feel for the characters.
As for the conversations, one can pinpoint the following
peculiarities. First, similar to what has been observed in the case of
1-grams, speakers in spontaneous conversations use various discourse
markers abundantly; these include (dis)fluency and clarification markers
(e.g., I said well and I mean I), hedges (e.g., I think they) and other
expressions (e.g., oh it’s and tell you what). The softening and involving
functions are also evident in tag questions (isn’t it and aren’t they) and the
downtoners only and just (e.g., it’s only and it’s just).
Again, many distinctive 3-grams in the conversations contain verb
forms and adverbials that refer to past events (e.g., he didn’t, and I was and
and then I). There are also elements that introduce reported speech (e.g.,
and she said and and I said). These elements are associated with narrative
function, which was discussed in the previous sub-section. One also finds
here a few instances of vague language (a couple of, a lot of, a little bit, a bit
of, it’s like and something like that).

| 3-gram (more frequent in subtitles) | Freq. subtitles | Freq. conversations | ORdisc | 3-gram (more frequent in conversations) | Freq. subtitles | Freq. conversations | ORdisc |
|---|---|---|---|---|---|---|---|
| get out of | 84 | 1 | 39.6 | I du n | 0 | 82 | 0.004 |
| I’m here | 56 | 2 | 15.9 | cos it’s | 0 | 46 | 0.008 |
| it’s me | 48 | 2 | 13.6 | I said well | 0 | 43 | 0.008 |
| let’s go | 241 | 12 | 13.6 | well I’m | 0 | 43 | 0.008 |
| out of here | 81 | 7 | 7.6 | well I do | 0 | 41 | 0.008 |
| are you doing | 173 | 20 | 5.9 | well it’s | 1 | 77 | 0.014 |
| I was a | 46 | 5 | 5.9 | oh that’s | 1 | 76 | 0.014 |
| this is my | 46 | 5 | 5.9 | and I said | 2 | 121 | 0.014 |
| we have a | 53 | 6 | 5.8 | I mean I | 2 | 64 | 0.027 |
| I’m sorry | 216 | 27 | 5.5 | it’s alright | 1 | 38 | 0.027 |
| I love you | 99 | 13 | 5.2 | well that’s | 3 | 83 | 0.029 |
| take care of | 66 | 9 | 4.9 | no it’s | 2 | 53 | 0.033 |
| what kind of | 51 | 7 | 4.8 | haven’t got | 3 | 72 | 0.034 |
| where are you | 76 | 11 | 4.7 | oh it’s | 2 | 48 | 0.037 |
| I’m afraid | 47 | 7 | 4.4 | no I do | 1 | 28 | 0.037 |

Table 7: Top fifteen most distinctive 3-grams in the subtitles and conversations.

4.4 Interim conclusions

The analyses based on 1-grams and 3-grams converge and complement
each other. In comparison with the spontaneous conversations, the subtitles
contain many expressions that convey the speaker’s cognitive reactions,
desires and emotions. These elements make the story more dramatic and
involving, and propel the plot forward. The subtitles also contain relatively
many greetings, terms of direct address and polite formulae, mainly because
the recorded conversations are more static in terms of the communicative
settings and the participants. At the same time, the subtitles have relatively
low frequencies of vague expressions, narrative elements and various
discourse markers. As for vague expressions, a possible explanation might be
that the film audience has only limited knowledge of the context and cannot
seek clarification. The relative scarcity of narrative elements in the subtitles
may be due to the fact that films usually tell a story by showing it developing
on the screen, rather than through someone’s monologue. Film characters
usually discuss their immediate situations and talk to one another, rather
than discuss third parties and past events. Finally, the subtitles also contain
fewer discourse markers than the conversations. This can be explained by the
lack of actual time pressure in the interaction between the characters, who
reproduce prepared text.
Notably, all these features are, in fact, shared by the subtitles with
film and TV series transcripts, which were studied by Mittmann (2006),
Quaglio (2008), Bednarek (2011), Freddi (2012) and others. Thus, film
subtitles represent a type of filmese/serialese.

5. Focussing on the subtitles

This section compares the original English subtitles with the English
subtitles translated from French and then with the translations from the other
languages. I will employ the n-gram approach, which was introduced in
Section 4, to pinpoint the differences between the types of subtitles. If the
translations are strongly influenced by the source language(s), one can expect
this to be reflected at the level of the top distinctive n-grams.
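
As a reminder of what the n-gram approach counts, the extraction of n-gram frequencies from a tokenised text can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function name ngram_counts is hypothetical, and the token list is a hand-tokenised version of Example 2 in which the contracted form is split off as a separate token, as the 3-grams in the tables suggest.

```python
from __future__ import annotations

from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all contiguous sequences of n tokens, joined by single spaces."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Example 2 from Section 4.2, tokenised by hand (an assumption about the tokenisation).
tokens = ["I", "'ll", "help", "you", "grab", "your", "rocks"]

unigrams = ngram_counts(tokens, 1)
trigrams = ngram_counts(tokens, 3)
print(len(unigrams), len(trigrams), trigrams["I 'll help"])
```

Counts of this kind, collected separately for each sub-corpus, are the input to the odds ratio and DP computations introduced in Section 4.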

5.1 Distinctive 1-grams

First, I will discuss the most distinctive 1-grams in the original English
subtitles and those in the subtitles of French films translated into English.
As in Section 4, the analyses are based on 100 most distinctive 1-grams in
each sub-corpus with the normalised DP scores below 0.5. The top fifteen
1-grams in each sub-corpus are shown in Table 8.
An analysis of the top 100 1-grams in the original English subtitles
shows that this sub-corpus contains less formal language than the translated
subtitles. Examples are colloquial contractions (wanna, gotta and gonna) and
informal exclamations (such as wow, Jesus and yeah). The original subtitles
also contain a relatively larger number of discourse markers (hmm, oh, uh,
actually, well and okay), as well as polite formulae, greetings, attention
signals and terms of address (thank and pleasure [as in ‘it’s my pleasure’
or ‘with pleasure’], welcome, hi, hey, Mr and honey) in comparison with
the translated subtitles. Interestingly, among the most distinctive 1-grams
are also a few –ing forms (wondering, sitting, hoping, living, putting and
talking). One can also find a few instances of vague expressions (sort, thing,
guess, suppose, kind, lot, probably, sounds, seem and might). The language
of the original subtitles is thus more interactive, informal and vague than that
of the subtitles translated from French.

| 1-gram (more frequent in original English subtitles) | Freq. original | Freq. transl. | ORdisc | 1-gram (more frequent in subtitles translated from French) | Freq. original | Freq. transl. | ORdisc |
|---|---|---|---|---|---|---|---|
| wondering | 19 | 0 | 21.8 | Paris | 9 | 42 | 0.13 |
| uh | 214 | 6 | 18.5 | hiding | 5 | 15 | 0.199 |
| wow | 50 | 2 | 11.3 | several | 5 | 13 | 0.228 |
| Jesus | 47 | 2 | 10.6 | arrived | 10 | 23 | 0.25 |
| honey | 81 | 5 | 8.3 | normal | 11 | 24 | 0.263 |
| entire | 34 | 2 | 7.7 | months | 29 | 57 | 0.287 |
| sitting | 20 | 1 | 7.7 | boss | 28 | 50 | 0.316 |
| appreciate | 19 | 1 | 7.3 | hurry | 17 | 30 | 0.321 |
| actually | 68 | 5 | 7 | yesterday | 13 | 22 | 0.336 |
| hoping | 15 | 1 | 5.8 | calm | 29 | 47 | 0.348 |
| seriously | 25 | 2 | 5.7 | hours | 24 | 38 | 0.356 |
| oh | 799 | 82 | 5.4 | broken | 10 | 16 | 0.356 |
| wanna | 182 | 21 | 4.8 | dog | 19 | 30 | 0.358 |
| hey | 436 | 51 | 4.8 | hour | 30 | 47 | 0.359 |
| begin | 20 | 2 | 4.6 | empty | 11 | 17 | 0.368 |

Table 8: Top fifteen most distinctive 1-grams in the original English subtitles (left) and the ones translated from French (right).

The distinctive 1-grams in the translated French films, in contrast,
include several past or perfect verb forms (arrived, saw, stopped, changed,
sent, asked and kept) and third-person singular pronouns and verb forms
(he, she, him, his, wants, needs and thinks). This finding suggests that the
language of the translated subtitles is more narratorial than that of the original
subtitles. The list also includes the contracted future marker ’ll. Its higher
frequency in the translated subtitles may be explained by the preference for
the informal marker gonna in the original subtitles.
A corresponding comparison between the original English subtitles
and the subtitles translated from other languages (except French) has revealed
a very similar picture. In addition, the list of most distinctive 1-grams in
the translated subtitles contains the auxiliary shall, which is frequently used
as a future marker, and the conjunction although, which is typically used in
writing.

5.2 Distinctive 3-grams

This section discusses the most distinctive 3-grams, which were retrieved by
using the same methodology. I will begin by comparing the original English
subtitles with those translated from French. The top fifteen 3-grams in each
sub-corpus are shown in Table 9.
An inspection of the top 100 most distinctive 3-grams in the original
English subtitles reveals the presence of polite formulae (e.g., ladies and
gentlemen, I’m sorry and to meet you [as in ‘Pleased/nice/. . . to meet you’])
and a relatively high frequency of hedges, downtoners and attention-getting
signals (e.g., I guess I, I think you, I’m just and you know what). There
are also several elements of expressions that challenge the addressee and
propel the plot forward (e.g., think about it and you talking about as part
of ‘What are you talking about?’). In addition, one can find several vague
expressions (some kind of, one of these and a lot of ). Finally, the original
subtitles have a relatively high proportion of 3-grams with the informal future
marker gonna (e.g., ’m gonna take), whereas the subtitles translated from
French more frequently contain the future marker ’ll (e.g., you’ll get and
I’ll call).
A corresponding comparison between the original English subtitles
and the subtitles translated from languages other than French yields
highly similar results and is omitted due to limitations of space.

| 3-gram (more frequent in original English subtitles) | Freq. original | Freq. transl. | ORdisc | 3-gram (more frequent in subtitles translated from French) | Freq. original | Freq. transl. | ORdisc |
|---|---|---|---|---|---|---|---|
| one of those | 19 | 0 | 20.7 | you’ll get | 3 | 12 | 0.149 |
| I guess I | 15 | 0 | 16.5 | I’m scared | 3 | 9 | 0.196 |
| ’m gonna take | 12 | 0 | 13.3 | let me go | 8 | 21 | 0.21 |
| to meet you | 27 | 1 | 9.7 | I’ll call | 11 | 27 | 0.222 |
| and it’s | 20 | 1 | 7.3 | what’s wrong | 12 | 27 | 0.241 |
| I hope you | 20 | 1 | 7.3 | look at him | 4 | 9 | 0.251 |
| in the world | 37 | 3 | 5.7 | it’s for | 6 | 13 | 0.256 |
| thought you were | 15 | 1 | 5.5 | I’ll go | 10 | 21 | 0.259 |
| you and I | 15 | 1 | 5.5 | what’s that | 12 | 24 | 0.271 |
| I need you | 24 | 2 | 5.2 | won’t be | 11 | 20 | 0.298 |
| just don’t | 23 | 2 | 5 | want to see | 13 | 23 | 0.305 |
| ’s what you | 13 | 1 | 4.8 | it’s your | 11 | 18 | 0.33 |
| I can get | 13 | 1 | 4.8 | if it’s | 10 | 16 | 0.338 |
| some kind of | 13 | 1 | 4.8 | take care of | 21 | 31 | 0.362 |
| ladies and gentlemen | 18 | 2 | 3.9 | have to get | 12 | 16 | 0.402 |

Table 9: Top fifteen most distinctive 3-grams in the original English subtitles (left) and the ones translated from French (right).

5.3 Interim conclusions

This section focussed on the differences between the original English
subtitles and the English subtitles of films originally in French and in
other languages. The results of the distinctive n-gram analyses do not
provide evidence of strong translationese effects. Rather, the main difference
lies in the level of (in)formality and interactivity. The original English
subtitles contain more discourse markers of different types than the translated
subtitles. Moreover, the English original subtitles contain significantly more
instances of gonna, wanna and gotta, as well as other informal expressions,
while the translators from French and from other languages seem to prefer the
more formal form ’ll (or even shall). Vague language is used somewhat more
frequently, too, in the original English subtitles, whereas narrative discourse
elements are somewhat more frequently used in the non-original subtitles. It
seems that the differences between the n-grams reflect genre-related or even
cultural differences between the countries. It should be mentioned, however,
that the differences observed in these analyses are more subtle overall than
those between the subtitles in general and the spontaneous conversations
(see Section 4), as one can conclude from a relatively small number of the
corresponding distinctive n-grams. The majority of the n-grams in the top
100 lists are lexical units that seem to reflect the plot.

6. Conclusions

This study has compared online film subtitles with other registers of
spoken and written British and American English with the help of
n-grams with n from 1 to 3. I employed several statistical techniques and
measures (hierarchical cluster analysis, correlation coefficients, odds ratios
and deviations of proportions as a dispersion measure). The results of the
study can be summarised as follows.

(1) As the cluster analyses based on the frequencies of n-grams have
demonstrated, film subtitles are not fundamentally different from
other varieties of spoken and written British and American English.
The subtitles do not form a separate cluster and merge early with
the other varieties.
(2) As suggested by the results of the clustering and correlation
coefficients based on the frequencies of n-grams, film subtitles
are very similar to British and American informal spontaneous
conversations.
(3) In comparison with the informal spontaneous conversations,
the film subtitles exhibit a number of differences. First, they
contain many emotional expressions (including expletives), and
constructions expressing intentions, necessity and desire, which
make the viewers feel involved with and feel for the characters
and also propel the plot forward. The higher frequency of
greetings, polite formulae and direct addresses can be explained
by more dynamic social interaction in films than in the recorded
conversations. At the same time, film subtitles contain fewer
pause fillers, reformulations and other discourse markers, which
are typical of spontaneous discourse produced under real-time
constraints. Whether and to what extent the creators of film
subtitles can further reduce the number of discourse markers for
purposes of compactness, as pointed out by Díaz Cintas and
Remael (2014: 214–16), requires a separate investigation. The
language of subtitles is also less vague and narrative than that
of the informal conversations. These features come from the
specific characteristics of films as a medium of communication,
where clarity and accessibility of referents play an important role,
and where the story usually develops in time with the help of
visual means, rather than being explicitly told by the characters.
Notably, these results are strikingly similar to the results of
previous analyses of fictional TV series dialogue transcriptions
(e.g., Bednarek, 2011; Mittmann, 2006; and Quaglio, 2008).
(4) As one can judge from the inspection of the most distinctive
n-grams, the subtitles of films translated from other languages are
not fundamentally different from the subtitles of films that were
originally in English, so far as the distribution of the n-grams
is concerned. Most differences can be explained by the varying
degrees of (in)formality and interactivity. In particular, the original
subtitles contain more discourse markers, informal expressions
and vague language than the subtitles translated from French and
other languages. In this regard, the original subtitles are closer to
natural dialogue. However, the language of the translated subtitles
is somewhat more narrative, although the differences are very
subtle.

As mentioned above, the analyses presented in this paper corroborate
the results of previous studies of film and TV language. It has been found
that the distinctive linguistic features of film subtitles are strikingly similar
to those of fictional TV series dialogues, which were investigated previously
on the basis of TV series transcripts. This finding has two implications.
First, since the language of fictional films (this study) and that of TV series exhibit
very similar peculiarities when compared with the language of spontaneous
conversations, one can hypothesise that film subtitles and TV series dialogues
belong to one broad register of fictional TV/film dialogue (see Bednarek,
2011). Second, since most researchers agree that TV dialogues represent
naturally occurring conversations quite faithfully, in spite of these differences
(see an overview in Dose, 2014: Chapter 4.3.4), one can conclude that film
subtitles can be seen as an acceptable approximation of natural dialogue.
The important question of how subtitles are different from the actual film
dialogue remains for future research. Another question is whether the above-
mentioned linguistic characteristics of film subtitles in English can be
extrapolated to other languages and whether one can speak about universal
filmese.
To conclude, if film dialogue is a reflection of real dialogue,
subtitles are a reflection of a reflection. At the same time, they are
remarkably close to real informal language. The results are of high practical
significance for contrastive and typological studies of world languages,
since informal dialogical language is strongly under-represented in the
linguistic data currently used in those disciplines. However, due to the
peculiarities described above, it would be risky to use subtitles as data
for full-fledged conversational and discourse analyses as a replacement for
spoken language (see Chaume, 2004; and Valdeon, 2008) and filmese in
general. For this purpose, comparable original corpora produced in natural
settings are indispensable. For other purposes, however, there seem to be no
reasons to be overly sceptical, in particular, when one’s approach is based on
a quantitative analysis of a large corpus of subtitles.

Corpora

British National Corpus (BNC), version 3 (BNC XML Edition). 2007.
Distributed by Oxford University Computing Services on behalf of the
BNC Consortium. Available online at: http://www.natcorp.ox.ac.uk/.
Corpus of Contemporary American English (COCA): 450 million words,
1990–present. 2008–. By Mark Davies. Available online at:
http://corpus.byu.edu/coca/.
Santa Barbara corpus of spoken American English (SBCSAE), Parts
1–4. By John W. Du Bois, Wallace L. Chafe, Charles Meyer,
Sandra A. Thompson, Robert Englebretson, and Nii Martey.
Philadelphia: Linguistic Data Consortium. Available online at:
http://www.linguistics.ucsb.edu/research/santa-barbara-corpus.

References

Baños-Piñero, R. and F. Chaume. 2009. ‘Prefabricated orality: a challenge in
audiovisual translation’ in inTRAlinea Special Issue: The Translation
of Dialects in Multimedia. Accessed 24 September 2015 at:
http://www.intralinea.org/specials/article/1714.
Baron, A., P. Rayson and D. Archer. 2009. ‘Word frequency and key word
statistics in corpus linguistics’, Anglistik 20 (1), pp. 41–67.
Bednarek, M. 2010. The Language of Fictional Television: Drama and
Identity. London: Continuum.
Bednarek, M. 2011. ‘The language of fictional television: a case study of the
“dramedy” Gilmore Girls’, English Text Construction 4 (1), pp. 54–84.
Biber, D. 1988. Variation across Speech and Writing. Cambridge: Cambridge
University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999.
The Longman Grammar of Spoken and Written English. London:
Longman.
Channel, J. 1994. Vague Language. Oxford: Oxford University Press.
Chaume, F. 2004. ‘Discourse markers in audiovisual translating’, Meta 49
(4), pp. 843–55.
Deckert, M. 2013. Meaning in Subtitling: Toward a Contrastive Cognitive
Semantic Model. Frankfurt am Main: Peter Lang.
Díaz Cintas, J. and A. Remael. 2014. Audiovisual Translation: Subtitling.
London and New York: Routledge.
Dose, S. 2014. Describing and Teaching Spoken English: An Educational-
Linguistic Study of Scripted Speech. Unpublished PhD thesis. Giessen:
Justus-Liebig-Universität Giessen.
Dunning, T. 1993. ‘Accurate methods for the statistics of the surprise and
coincidence’, Computational Linguistics 19 (1), pp. 61–74.
Freddi, M. 2012. ‘What AVT can make of corpora: some findings from
the Pavia Corpus of Film Dialogue’ in A. Remael, P. Orero and
M. Carroll (eds) Audiovisual Translation and Media Accessibility at
the Crossroads, pp. 381–407. Amsterdam: Rodopi.
Gries, St.Th. 2008. ‘Dispersions and adjusted frequencies in corpora’,
International Journal of Corpus Linguistics 13 (4), pp. 403–37.
Gries, St.Th., J. Newman and C. Shaoul. 2011. ‘N-grams and the
clustering of registers’, Empirical Language Research 5 (1).
Accessed 24 September 2015 at: http://ejournals.org.uk/ELR/article/
2011/1.
Johansson, S. and K. Hofland. 1994. ‘Towards an English–Norwegian
parallel corpus’ in U. Fries, G. Tottie and P. Schneider (eds) Creating
and Using English Language Corpora, pp. 25–37. Amsterdam: Rodopi.
Keuleers, E., M. Brysbaert and B. New. 2010. ‘SUBTLEX-NL: a new
frequency measure for Dutch words based on film subtitles’, Behavior
Research Methods 42, pp. 643–50.
Levshina, N. 2015. ‘European analytic causatives as a comparative concept:
evidence from a parallel corpus of film subtitles’, Folia Linguistica
49 (2), pp. 487–520.
Levshina, N. 2016a. ‘Verbs of letting in Germanic and Romance languages:
a quantitative investigation based on a parallel corpus of film subtitles’,
Languages in Contrast 16 (1), pp. 84–117.
Levshina, N. 2016b. ‘Why we need a token-based typology: a case study of
analytic and lexical causatives in fifteen European languages’, Folia
Linguistica 50 (2), pp. 507–42.
Lijffijt, J. and St.Th. Gries. 2012. ‘Correction to “dispersions and adjusted
frequencies in corpora”’, International Journal of Corpus Linguistics
17 (1), pp. 147–9.
Mittmann, B. 2006. ‘With a little help from Friends (and others): lexico-
pragmatic characteristics of original and dubbed film dialogue’ in
C. Houswitschka, G. Knappe and A. Müller (eds) Anglistentag 2005
Bamberg. Proceedings of the Conference of the German Association of
University Teachers of English, pp. 573–85. Trier: Wissenschaftlicher
Verlag Trier.
Pavesi, M. 2008. ‘Spoken language in film dubbing: target language norms,
interference and translational routines’ in D. Chiaro, C. Heiss and
C. Bucaria (eds) Between Text and Image: Updating Research in
Screen Translation, pp. 79–99. Amsterdam: John Benjamins.
Quaglio, P. 2008. ‘Television dialogue and natural conversation: linguistic
similarities and functional differences’ in A. Ädel and R. Reppen
(eds) Corpora and Discourse: The Challenges of Different Settings,
pp. 189–210. Amsterdam: John Benjamins.
R Core Team. 2014. ‘R: a language and environment for statistical
computing’. Vienna, Austria: R Foundation for Statistical Computing.
Available online at: http://www.r-project.org/.
Scott, M. 1997. ‘PC analysis of key words – and key key words’, System
25 (2), pp. 233–45.
Tiedemann, J. 2008. ‘Synchronizing translated movie subtitles’ in
Proceedings of the 6th International Conference on Language
Resources and Evaluation (LREC’2008).
Valdeon, R.A. 2008. ‘Inserts in modern script-writing and their translation
into Spanish’ in D. Chiaro, C. Heiss and C. Bucaria (eds) Between
Text and Image: Updating Research in Screen Translation, pp. 117–32.
Amsterdam: John Benjamins.
Xiao, R.Z. and A. McEnery. 2005. ‘Two approaches to genre analysis: three
genres in modern American English’, Journal of English Linguistics
33 (1), pp. 62–82.
Copyright of Corpora is the property of Edinburgh University Press and its content may not
be copied or emailed to multiple sites or posted to a listserv without the copyright holder's
express written permission. However, users may print, download, or email articles for
individual use.
