Zhang Wenbin ISF2009 Paper
Abstract—Traditional movie gross predictions are based on numerical and categorical movie data. But since the 1990s, text sources such as news have been proven to carry extra, meaningful information beyond traditional quantitative finance data, and thus can be used as predictive indicators in finance. In this paper, we use the quantitative news data generated by Lydia, our system for large-scale news analysis, to help us predict movie grosses. By analyzing two different models (regression and k-nearest neighbor models), we find that models using only news data can achieve performance similar to models using numerical and categorical data from The Internet Movie Database (IMDB). Moreover, we achieve better performance by combining IMDB data and news data, and the improvement is statistically significant.

I. INTRODUCTION

The movie industry is of intense interest to both economists and the public because of its high profits and entertainment nature. In 2007, the total revenue of the U.S. movie market was $8.74 billion, and it continues to grow. An interesting question is how to forecast pre-release movie grosses, because investors in the movie market want to make wise decisions. These investors range from earlier-stage participants such as movie studios and distributors to later-stage ones such as movie retailers, exhibitors, home video makers, or even book and CD-ROM publishers.

Traditionally, people predict grosses from historical IMDB data on specific characteristics, e.g., the movie's genre, MPAA rating, budget, director, number of first-week theaters, etc., but with somewhat limited success. Nevertheless, recent publications ([1], [2], [3], et al.) have shown the media's power in forecasting financial markets, such as stock prices, volatilities, or earnings. Given these encouraging results, it is reasonable to infer that news has predictive power for movie grosses as well. We are unaware of any previous attempt to apply linguistic analysis to movie gross prediction. Therefore, we focus here on improving movie gross prediction through news analysis.

Our primary goal is to show that we can give better pre-release predictions of movie grosses if we use news data, because commercially successful movies, actors, and directors are always accompanied by media exposure. Our experiments use Lydia ([4], http://www.textmap.com), a high-speed text processing system, to analyze news publicity and produce the movie news data that supports our gross prediction.

In this paper, our particular contributions are: 1) We provide a comprehensive way to evaluate news data and linguistic sentiment indexes, and give a detailed analysis of movie news data; 2) We build k-nearest neighbor models for movie gross prediction, which have not been studied in the previous movie prediction literature; 3) Through large-scale analysis, we show that news data helps build models with better performance. We do not use any post-release data in the following experiments, and all predictions are out-of-sample. In practice, our approach provides a feasible and more accurate estimate of investment worthiness for some pre-release investors and almost all post-release investors.

The contents of this paper are organized as follows. First, we briefly review related work. Second, we describe the movie data sources, both traditional movie data and news data, and give a correlation analysis. We then set up models with traditional movie data, movie news data, and their combination, and evaluate their performance. Finally, we conclude that news analysis can improve traditional movie gross prediction.

II. RELATED WORK

Different people work on movie gross prediction from different perspectives. Most previous work ([5], [6], [7], [8], [9], et al.) forecasts movie grosses from IMDB data with regression or stochastic models. However, these models either work poorly or need post-release data to make reasonable predictions, which is not acceptable in practice. For example, Simonoff and Sparrow [6] gave three predictions for the movie The Horse Whisperer, which had an actual gross of $74.37 million: the pre-release, first-weekend, and Oscar models predicted $1.405 million, $63.932 million, and $59.391 million respectively, but both the first-weekend model and the Oscar model are post-release models. Sawhney and Eliashberg [7] also claimed that their model works well when it takes the first three weeks of gross data as input, but admitted that it is much more difficult to give sharp estimates of either the model parameters or the gross without any early-stage gross data. Although post-release models are useful in some situations, pre-release models are of more practical importance.

Moreover, there has been substantial interest in the NLP community in using movie reviews as a domain for testing sentiment analysis methods, e.g., [10], [11], et al. Broadly, this work applies information retrieval or machine learning techniques to classify movie reviews into categories such as “thumbs up” vs. “thumbs down”, “positive” vs. “negative”, or “like” vs. “dislike”, aiming for better classification accuracy than human beings. Pang and Lee [12] give a detailed review of this domain. However, to the best of our knowledge, news and sentiment analysis has not been previously studied as a predictor of movie grosses. In addition, Mishne and Glance [13] show that movie sales have some correlation with movie sentiment references, but they neither build prediction models nor demonstrate the value of the correlation, because they think the result is not good enough for accurate modeling.

III. MOVIE DATA AND CORRELATION ANALYSIS

There are two kinds of movie data used in this paper: movie-specific variables and movie news data. The movie-specific variables are collected from traditional movie websites like IMDB, while the movie news data is obtained from Lydia. We analyze the correlation between movie grosses and the traditional movie variables or news variables, and let it guide us in setting up reasonable models for movie gross prediction. The correlation between variables is measured
Movie Variables   Categories   Movies  Mean   Median  Min     Max     Corr    p-Value
Budget            All          1500    45.97  25.48   0.0008  600.79  0.672   <0.001
Opening Screens   All          1500    45.97  25.48   0.0008  600.79  0.647   <0.001
First Week Gross  All          1500    45.97  25.48   0.0008  600.79  0.841   <0.001
World Gross       All          1500    45.97  25.48   0.0008  600.79  0.936   <0.001
Release Date      Holiday      640     55.13  32.22   0.008   600.79  0.132   <0.001
Release Date      Non-holiday  860     39.15  21.44   0.0008  436.72  -0.132  <0.001
MPAA Rating       G            41      82.72  58.40   0.669   339.71  0.103   0.261
MPAA Rating       PG           201     65.60  42.27   0.119   436.72  0.128   0.035
MPAA Rating       PG-13        500     59.04  35.28   0.011   436.72  0.154   <0.001
MPAA Rating       R            646     30.68  16.98   0.0008  216.33  -0.221  <0.001
MPAA Rating       NC-17        17      18.18  7.4     0.030   70.10   -0.049  0.426
Source            Sequel       127     90.24  64.96   0.146   436.72  0.224   0.006
Source            Not Sequel   1373    41.88  22.73   0.0008  600.79  -0.224  <0.001
Origin Country    USA          1191    50.28  30.31   0.0008  600.79  0.141   <0.001
Origin Country    Not USA      309     29.38  11.55   0.009   317.56  -0.141  0.006
Table I: Correlation Coefficient of Movie Variables versus Movie Grosses. The Mean, Median, Min, and Max grosses are in millions of dollars. Bold numbers indicate correlations that are statistically significant at the 0.05 significance level.
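The Corr and p-Value columns above follow from the standard Pearson correlation coefficient and its t-test, which are defined in Section III. The sketch below illustrates that computation on made-up budget/gross figures (not our movie data set):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def t_statistic(r, n):
    """t = r * sqrt((N - 2) / (1 - r^2)), with N - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical budgets ($M) and grosses ($M) for eight movies.
budgets = [5.0, 12.0, 20.0, 35.0, 50.0, 80.0, 120.0, 150.0]
grosses = [3.1, 10.5, 24.0, 30.2, 61.7, 95.0, 140.3, 170.9]

r = pearson_r(budgets, grosses)
t = t_statistic(r, len(budgets))
# Significant at the 0.05 level if |t| exceeds the two-sided critical
# value of the t distribution with N - 2 = 6 degrees of freedom (2.447).
print(round(r, 3), round(t, 2), abs(t) > 2.447)
```

In practice a statistics library would also return the exact p-value; the fixed critical value here is only for illustration.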
Figure 1: Normal quantile-quantile plots used to test the normality of the movie budget and gross distributions. Panels (a), (b), and (c) show the results for all movies, low-grossing movies, and low-budget movies in our test data set respectively, indicating that movie budgets and grosses do not strictly follow a Gaussian distribution.
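A normality check like the Q-Q plots in Figure 1 can also be sketched numerically: pair each sample quantile with the matching theoretical Gaussian quantile and inspect the gaps. The gross figures below are invented for illustration:

```python
import math
import statistics

# Hypothetical grosses in $M; note many more low-grossing movies.
grosses = [0.5, 1.2, 2.0, 3.4, 5.1, 8.0, 15.0, 25.0, 90.0, 600.0]
sample = sorted(grosses)
n = len(sample)
nd = statistics.NormalDist(statistics.mean(sample), statistics.stdev(sample))

# (sample quantile, theoretical normal quantile) pairs; on a Q-Q plot
# these are the plotted points compared against the y = x reference line.
pairs = [(sample[i], nd.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Large gaps, especially in the tails, signal departure from normality.
for s, q in pairs:
    print(f"sample={s:7.1f}  normal={q:8.1f}  gap={s - q:8.1f}")
```

For this right-skewed sample the top quantile sits far above its Gaussian counterpart, which is exactly the departure the figure shows.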
Entities       Duration  Gross (Pre-rel)  Gross (Post-rel)  Budget (Pre-rel)  Budget (Post-rel)
Movie          1 week    0.707            0.781             0.497             0.480
Movie          1 month   0.672            0.779             0.463             0.474
Movie          4 months  0.629            0.749             0.437             0.455
Director       1 week    0.494            0.602             0.311             0.389
Director       1 month   0.371            0.495             0.218             0.389
Director       4 months  0.192            0.317             0.117             0.078
Top 3 Actors   1 week    0.640            0.726             0.476             0.528
Top 3 Actors   1 month   0.569            0.683             0.448             0.477
Top 3 Actors   4 months  0.493            0.618             0.413             0.424
Top 15 Actors  1 week    0.646            0.725             0.533             0.595
Top 15 Actors  1 month   0.575            0.686             0.477             0.530
Top 15 Actors  4 months  0.511            0.618             0.415             0.433
Table II: Correlation Coefficient of Logged News Article Counts versus Logged Grosses or Budgets under various scenarios. The rows indicate which entities are examined and over what duration, i.e., 1 week, 1 month, or 4 months. The columns indicate whether the correlation is with gross or budget, in terms of pre-release or post-release article counts.
by both strength and significance. The strength is further evaluated by the correlation coefficient r, or sometimes by the coefficient of determination r², while the significance can be verified by the t-test t = r√((N−2)/(1−r²)), in which r is the correlation coefficient, N is the sample size, and N−2 is the number of degrees of freedom of the t-test. In this paper, we use a significance level of 0.05 to test statistical significance, because 0.05 is the conventional threshold, although in our experiments most of the results are statistically significant at an even stricter level.

A. Traditional Movie Variables and Correlation Analysis

Traditional movie data is available at http://www.imdb.com and http://www.the-numbers.com. We wrote a spider program and downloaded data for all movies released from 1960 to 2008. Table I summarizes the relationship between some important movie variables and grosses by providing the correlation coefficients and other statistics. The most important movie variables include numerical variables like budget or opening screens, and categorical variables like source or MPAA rating. Another important variable is genre. IMDB defines 19 genres, and in our experiments we find that some genres like “Action” and “Adventure” are positively correlated with grosses, while other genres like “Biography” and “Documentary” are negatively correlated with grosses. We notice the correlation coefficient between movie gross and first-week gross is as high as 0.841, which explains why some decent models could be built with post-release data in the previous literature.

A particularly interesting question is the movie budget vs. gross distribution. Figure 1 shows the normal quantile-quantile plots of movie budget and gross. Figure 1(a) shows that neither budget nor gross is strictly Gaussian, because there are more low-grossing movies than high-grossing movies. If we study the low-grossing or low-budget movies in particular, Figures 1(b) and 1(c) show that a high budget may result in a low gross, and a low budget may result in a high gross as well. In this paper, we pay more attention to high-grossing movies because they generate more revenue and have more media exposure.

B. News Data and Correlation Analysis

Movie news data is generated from the Lydia system, which does high-speed analysis of online daily newspapers. The input to Lydia includes the coverage of around 1000 nationwide and local newspapers. One difficulty for movie news analysis is title matching, which causes many false positives and false negatives during the entity identification phase. For example, Lydia may fail to correctly identify movie names like “15 Minutes”, “Pride”, “Next”, “Interview”, etc. Our solution is to filter out this “bad” data before our analysis, based on three rules: 1) Common-word-named movies are removed; 2) Movies must have reasonable news coverage; 3) News close to a movie's opening date must reference the movie more often than news far from its opening date. Eventually, we get a data set of 498 movies, which we divide into two parts: 60% as the training set and the remaining 40% as the prediction set.

Lydia generates an entity database. For each entity, the Lydia data includes the daily article counts and daily frequency counts, as well as daily sentiment (both positive and negative) counts in seven categories: General, Business, Crime, Health, Politics, Sports, and Media.

Based on the above raw counts, we evaluated the accumulated news references for the first week (1-week data), the second through fourth weeks (1-month data), and the fifth through sixteenth weeks (4-month data) before the release of each movie. Our correlation analysis evaluates the media coverage of four different entities: movie titles, directors, top 3 actors, and top 15 actors. Table II shows the correlation analysis of logged pre-release news reference counts versus logged grosses or budgets under different scenarios. Table III shows the correlations between movie grosses and sentiment counts in seven categories.

1) Movie Grosses versus News Reference Counts: Some significant observations from our experiments are below.
• Article counts vs. frequencies: Grosses have a higher correlation with article counts than with total entity references.
• Time value of money: Higher correlations can be achieved if we take inflation into consideration, based on year-by-year interest rates, before the correlation is evaluated. Therefore, the time value of money is always applied hereafter in this paper.
• Raw correlation vs. logged correlation: The logarithm operation yields higher correlations between news reference counts and grosses (or budgets).
• Grosses vs. budgets: News references are more highly correlated with grosses than with budgets.
• Pre-release and post-release references: Table II shows that post-release data correlates with grosses better than pre-release data.
• Time periods: The 1-week data has the strongest correlation, and the correlations of the 1-month data and 4-month data decrease accordingly.
• News entities: Director references have the least correlation with grosses; movie titles and top actors have better correlations with grosses (or budgets).
• Seven sentiment categories: “General” and “Media” sentiment counts have the highest correlation with grosses among all seven sentiment categories.
• Negative references vs. positive references: From Table III, we can see that positive references are better correlated with grosses than negative ones for all sentiment categories except “Crime” and “Health”.
• Low-grossing movies vs. high-grossing movies: For low-grossing movies, the news references of the top 3 actors are better gross predictors than those of the top 15 actors. For high-grossing movies, we reach the opposite conclusion.

2) Movie Grosses versus Derived Sentiment Indexes: Based on raw sentiment references, we derive several sentiment measures, including polarity, subjectivity, positive references per reference, negative references per reference, and positive-negative differences per reference. They are defined as follows:
• polarity = pos_senti_refs / total_senti_refs
• subjectivity = total_senti_refs / total_refs
• pos_refs_per_ref = pos_senti_refs / total_refs
• neg_refs_per_ref = neg_senti_refs / total_refs
• senti_diffs_per_ref = (pos_senti_refs − neg_senti_refs) / total_refs

Figure 2 shows that the correlations between grosses and these five statistics are not strong. However, the correlation coefficients for several of them, such as polarity, negative references per reference, and positive-negative differences per reference, are still statistically significant at the 0.05 significance level.

3) Pairwise Correlation of Various News Statistical Measures: Figure 2 shows the pairwise correlation details. We notice that article count, frequency, positive frequency, and negative frequency are highly correlated with each other. To avoid multicollinearity, our prediction models preferably use only one of them. We can also use some derived sentiment indexes, because they are not strongly correlated with the raw references.

IV. PREDICTION MODELS AND COMPARISON

Two basic modeling methodologies are used in this paper: regression and k-nearest neighbor classifiers. Regression models forecast
Figure 2: The pairwise plot (using 1-week, pre-release data) of movie gross, news references, sentiment references, and derived sentiment indexes for the period from January 1990 to December 2007. Notes: 1) The ten variables are: movie gross (Gross), news frequencies (Freq), news article counts (ArtCnts), general positive counts (GenPos), general negative counts (GenNeg), polarity (Pola), subjectivity (Subj), positive references per reference (PosPer), negative references per reference (NegPer), and positive-negative differences per reference (DiffPer). 2) The correlation coefficient between two variables is shown at the bottom of each box. 3) Red indicates that two variables are positively correlated, while blue indicates a negative correlation. 4) Color saturation indicates the strength of the correlation. 5) A green box indicates that the corresponding correlation coefficient is statistically significant, while a black box indicates the opposite. Some significant observations: 1) News references are highly correlated with movie grosses. 2) Positive references have a higher correlation with grosses than negative references. 3) Frequencies, article counts, positive references, and negative references are highly correlated with each other. 4) Polarity is positively correlated with movie grosses; the correlation is not strong but is statistically significant, as is the positive-negative difference per reference. 5) Negative references per reference is negatively correlated with movie grosses; the correlation is not strong but still statistically significant. 6) Subjectivity is negatively correlated with movie grosses, but the correlation is not statistically significant. 7) Derived sentiment indexes are not highly correlated with news references, and thus provide new information beyond the raw counts.
Scenarios  Sentiment  General  Business  Crime  Health  Politics  Sports  Media
1 week     Positive   0.692    0.666     0.418  0.520   0.615     0.684   0.695
1 week     Negative   0.665    0.564     0.594  0.624   0.565     0.444   0.513
1 month    Positive   0.665    0.651     0.401  0.520   0.603     0.669   0.675
1 month    Negative   0.650    0.579     0.580  0.616   0.564     0.466   0.507
4 months   Positive   0.625    0.626     0.370  0.497   0.561     0.635   0.643
4 months   Negative   0.608    0.544     0.541  0.557   0.531     0.438   0.490
Table III: Logged Movie Grosses versus Logged Pre-release Positive or Negative Sentiment Counts in Seven Sentiment Categories, in terms of movie title coverage. The bold numbers show that positive references are better correlated with grosses than negative ones, except for the “Crime” and “Health” categories. One reason is that a movie may be more attractive due to excess violence.
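The derived sentiment indexes defined in Section III-B follow directly from the raw Lydia reference counts. A minimal sketch of that derivation (the counts below are invented, not Lydia output):

```python
def sentiment_indexes(pos_refs, neg_refs, total_refs):
    """Derive the five sentiment statistics from raw reference counts."""
    senti_refs = pos_refs + neg_refs  # total sentiment references
    return {
        "polarity": pos_refs / senti_refs,
        "subjectivity": senti_refs / total_refs,
        "pos_refs_per_ref": pos_refs / total_refs,
        "neg_refs_per_ref": neg_refs / total_refs,
        "senti_diffs_per_ref": (pos_refs - neg_refs) / total_refs,
    }

# A movie with 200 total references, 30 positive and 10 negative.
idx = sentiment_indexes(pos_refs=30, neg_refs=10, total_refs=200)
print(idx)  # polarity 0.75, subjectivity 0.2, diffs per ref 0.1, ...
```

These indexes are ratios rather than raw magnitudes, which is why Figure 2 shows them only weakly correlated with the raw counts.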
grosses by a regression equation. By contrast, k-NN models identify the most “similar” movies to a target movie in the training set by examining their similarities, because we believe that “similar” movies should have similar grosses.

Many measures have been proposed to evaluate the performance (or accuracy) of models; Hyndman and Koehler [14] give a detailed description of them. Here we make some adjustments for our purpose. Suppose G is the actual gross and P is the predicted gross; then we have the following evaluation measures:
1) AMAPE (Adjusted Mean Absolute Percentage/Relative Error): AMAPE = (Σᵢ |APEᵢ|) / n, where APE = max_abs((G−P)/G, (G−P)/P) is the adjusted percentage error and the operator max_abs chooses the element with the largest absolute value. For example, for a movie whose actual gross is $50 million, predictions of $75 million and $33.3 million are equally good because they have the same |APE| value of 0.5.
2) Score of models: Score = (Σᵢ (100 − min(100, |APEᵢ|))) / n
3) α% percentage coverage: PC_α% = (number of movies whose |APE| ≤ α%) / (total number of movies n)

A. Prediction from Traditional Movie Variables

Traditional movie models are our base models. We build separate models according to budget availability, i.e., the “budget” and “nobudget” cases.
1) Regression Models (Regbudget and Regnobudget): Model Regbudget uses the variables budget, holiday flag, MPAA rating, sequel flag, foreign flag, opening screens, and genres. Model Regnobudget is the same, but with the budget indicator removed. The regression model is: ln(G) = β₀ + β₁P₁ + β₂P₂ + ... + βₖPₖ + ε, where G is gross, the Pᵢ are predictors, the βᵢ are predictor coefficients, and ε is random noise.
2) k-Nearest Neighbor Models (kNNbudget and kNNnobudget): The similarity of movies can be measured by a “distance” evaluated in a multi-dimensional space. First, we define the distance for each dimension. For example, the distance between two budget values B₁, B₂ is defined as: dis(B₁, B₂) = (max(B₁, B₂) − min(B₁, B₂)) / min(B₁, B₂). The distances for the other variables are defined accordingly. We then obtain the complete distance formula by the Euclidean measure Dis = √(Σᵢ disᵢ²), by the Manhattan measure Dis = Σᵢ |disᵢ|, or by regressing on the training data set to determine the coefficients for all dimensions. Our experiments show that the regression approach works best among these three, basically because different variables have different scales of influence on movie grosses and only the regression method accounts for that difference. After this, we find the k movies in the training set that are the k nearest neighbors. In addition, our results show that the k-NN models work poorly when k = 1, but work well enough when k = 7. Table IV shows some nearest neighbor pairs.

The performance data shows that Regbudget is better than Regnobudget, which means that budget is capable of improving performance substantially in regression models. The performance of the k-NN models depends strongly on the training set size; with additional training data, and with increasing k (while still keeping it small), their performance will further improve. For all models, high-grossing movies are predicted significantly better than low-grossing movies. The overall performance of the k-NN models is similar to that of the regression models, but their high-grossing performance is better. The best prediction can therefore be expected by using regression models for low-grossing movies and k-NN models for high-grossing movies.

B. Prediction from News Variables

In this section, we predict movie grosses using news data only. Several models are built as follows.
1) Regression Models Using News References Only (nReg1w and nRegmov+act15): Model nReg1w takes three indicators: the pre-release 1-week news article counts in terms of movie titles, top 3 actors, and top 15 actors. By contrast, nRegmov+act15 takes six indicators: the pre-release 1-week, 1-month, and 4-month news article counts in terms of movie titles and top 15 actors. The simulation results show that models nReg1w and nRegmov+act15 have similar performance, and both perform better than other news-reference-based models, which means our predictors are chosen properly.
2) Regression Models Using News References plus Sentiment Data (nReg1w+senti1 and nReg1w+senti2): Based on nReg1w, nReg1w+senti1 adds raw sentiment counts, while nReg1w+senti2 adds derived sentiment statistics like polarity or subjectivity. However, neither their overall performance nor their high-grossing performance improves significantly over the base model nReg1w, because sentiment counts are highly correlated with the news article counts and thus carry little extra information in the regression.
3) k-Nearest Neighbor Models (nkNN1w, nkNNmov+act15, and nkNN1w+senti1): The three k-NN models use the same indicators as the corresponding regression models. The distance between two movies can easily be computed by normalizing the reference or sentiment counts. Surprisingly, the sentiment data shows some predictive power in the k-NN models, and the improvement is statistically significant. The basic reason is that the sentiment data helps identify more similar movie pairs. Moreover, the k-NN models have worse overall performance but better high-grossing performance than the corresponding regression models. In addition, k-NN models using news data can achieve performance similar to the IMDB models, especially for high-grossing movies.

C. Prediction from Combined Variables and Performance Comparison

We have shown that decent models can be built using either traditional IMDB data or news data. We now build models with the combination of IMDB data and news data, and they indeed yield even better prediction results. For example, in the “nobudget” case (“budget” is not an input variable), Regnobudget is the regression model with IMDB data, and it yields only a coefficient of determination R² of
No.  MPAA   Genre      Source                        Country  Screens  Budget($M)  Gross($M)  Date      Name
1    R      Comedy     Original Screenplay           USA      7        3.500       0.221      09/14/07  Ira and Abby
1    R      Comedy     Original Screenplay           USA      7        3.500       0.107      02/17/06  Winter Passing
2    PG-13  Adventure  Based on Book or Short Story  UK       4285     150.000     292.005    07/11/07  Harry Potter and the Order of the Phoenix
2    PG-13  Adventure  Based on Book or Short Story  UK       3858     150.000     290.013    11/18/05  Harry Potter and the Goblet of Fire
3    R      Action     Original Screenplay           USA      2        1.000       0.000884   04/21/06  In Her Line of Fire
3    R      Adventure  Original Screenplay           USA      2        1.000       9.015      12/02/05  Transamerica
Table IV: Examples of Nearest Neighbor Pairs Identified with Numerical and Categorical Indicators (Model kNNbudget). Pair 2 shows that the algorithm matches one “Harry Potter” movie with another “Harry Potter” movie as its nearest neighbor, which indicates a very good comparison. Pair 3 is a strange pair: it is almost perfectly matched, yet the grosses differ substantially. Generally speaking, however, prediction based on nearest neighbors achieves performance similar to regression models.
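Neighbor pairs like those in Table IV come from the per-dimension relative distance defined in Section IV-A. The sketch below uses two numerical dimensions and equal weights for brevity (our models actually learn the weights by regression, and the movie tuples here are illustrative, not our data):

```python
def dim_distance(a, b):
    """Relative distance of one numerical dimension, e.g. two budgets:
    (max - min) / min, so identical values give distance 0."""
    lo, hi = min(a, b), max(a, b)
    return (hi - lo) / lo

def movie_distance(m1, m2, weights):
    """Weighted combination of per-dimension distances. The paper learns
    these weights by regression; equal weights stand in here."""
    return sum(w * dim_distance(x, y) for w, x, y in zip(weights, m1, m2))

def k_nearest(target, training, k=7):
    """Return the k training movies closest to the target movie."""
    w = [1.0] * len(target)
    return sorted(training, key=lambda m: movie_distance(target, m, w))[:k]

# Hypothetical (budget $M, opening screens) tuples.
train = [(3.5, 7), (150.0, 4285), (1.0, 2), (150.0, 3858), (25.0, 1200)]
print(k_nearest((140.0, 4000), train, k=2))
# → [(150.0, 3858), (150.0, 4285)]
```

A big-budget, wide-release target is matched to the two big-budget, wide-release training movies, mirroring how Pair 2 in Table IV matches one blockbuster sequel with another.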
Regression Models
Source    Model                      AMAPE(Overall)  Score(Overall)  AMAPE(High)  Score(High)
IMDB      Regnobudget                7.83            92.8            8.97         92.41
IMDB      Regbudget                  3.53            96.47           2.03         97.97
News      nReg1w                     8.72            92.1            4.02         96.2
News      nRegmov+act15              10.46           92.07           2.87         97.13
Combined  Regnobudget+nReg1w         3.82            96.81           2.48         97.52
Combined  Regnobudget+nRegmov+act15  3.79            96.21           2.4          97.6
Combined  Regbudget+nReg1w           2.76            97.24           1.57         98.43
Combined  Regbudget+nRegmov+act15    2.63            97.37           1.54         98.46

k-Nearest Neighbor Models
Source    Model                      AMAPE(Overall)  Score(Overall)  AMAPE(High)  Score(High)
IMDB      kNNnobudget                18.66           89.9            2.44         97.56
IMDB      kNNbudget                  11.68           92.11           1.16         98.84
News      nkNN1w                     24.22           87.25           1.79         98.21
News      nkNNmov+act15              21.6            87.57           2.2          97.8
Combined  kNNnobudget+nkNN1w         11.13           92.03           1.14         98.87
Combined  kNNnobudget+nkNNmov+act15  10.89           92.17           1.16         98.84
Combined  kNNbudget+nkNN1w           3.37            96.88           1.06         98.91
Combined  kNNbudget+nkNNmov+act15    5.82            95.13           1.01         98.99
Table V: Performance Comparison for IMDB, News, and Combined Models. The bold numbers show the comparison of a group of experiments.
The data proves: 1) For regression methods, the news models have similar overall accuracy to, but better accuracy of high-grossing movies
than IMDB models. 2) For k-NN methods, the news models have worse overall accuracy, but yet still better accuracy of high-grossing movies
than IMDB models. 3) For both regression and k-NN methods, the combined models prove superior to both IMDB and news models for either
overall accuracy or accuracy of high-grossing movies. Other groups of experiments indicate the same results.
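The AMAPE and Score values in Table V, and the percentage coverage plotted against α%, follow the definitions in Section IV. A sketch of those measures with errors expressed in percent (the pairs below reuse Section IV's $50M example, not Table V's data):

```python
def ape(actual, predicted):
    """Adjusted percentage error: whichever of (G-P)/G and (G-P)/P has
    the larger absolute value, expressed in percent."""
    e1 = (actual - predicted) / actual
    e2 = (actual - predicted) / predicted
    worst = e1 if abs(e1) >= abs(e2) else e2
    return 100.0 * worst

def amape(pairs):
    """Mean |APE| over (actual, predicted) gross pairs."""
    return sum(abs(ape(g, p)) for g, p in pairs) / len(pairs)

def score(pairs):
    """Mean of 100 - min(100, |APE|)."""
    return sum(100.0 - min(100.0, abs(ape(g, p))) for g, p in pairs) / len(pairs)

def coverage(pairs, alpha):
    """PC_alpha: fraction of movies with |APE| <= alpha percent."""
    return sum(abs(ape(g, p)) <= alpha for g, p in pairs) / len(pairs)

# Section IV's example: for a $50M movie, predictions of $75M and
# $100/3 M are equally good, each with |APE| = 50%.
pairs = [(50.0, 75.0), (50.0, 100.0 / 3.0)]
print(amape(pairs), score(pairs), coverage(pairs, 50))
```

Plotting coverage(pairs, alpha) for a sweep of α values gives exactly the “Percentage vs. Percentage Coverage” curves of Figures 3 and 4.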
Figure 3: Comparison of Regression Models (“nobudget” case). These models use IMDB data, news data, and their combination respectively. The combined model works best among all three models, both for overall performance and high-grossing performance.

Figure 4: Comparison of k-NN Models (“nobudget” case). These models use IMDB data, news data, and their combination respectively. The combined model works best among all three models, both for overall performance and high-grossing performance.
0.448, which is almost the same as the result of the pre-release model of Simonoff and Sparrow [6]. By contrast, Regnobudget+nReg1w is the corresponding regression model with IMDB data plus news data, and it achieves an R² of 0.788, a big improvement.

We studied the adjusted percentage error (residual) plots for the IMDB models, news models, and combined models. The results show that some movies' grosses are highly overestimated and others highly underestimated if we use only IMDB data, and the error plots are neither symmetric nor zero-mean. However, the high deviations are smoothed by the news indicators, i.e., highly underestimated or overestimated grosses are eliminated in the news models and combined models. Furthermore, the combined models make the adjusted percentage error plots symmetric, a sign of another benefit of using news data.

The complete performance data of the IMDB models, news models, and combined models is listed in Table V. Compared to pure IMDB models or pure news models, the combined models yield a nice performance improvement, for both regression and k-NN models, as indicated by smaller AMAPE values and higher scores. Our t-test shows the improvement is statistically significant.

Figures 3 and 4 show the “Percentage vs. Percentage Coverage” comparison (“nobudget” case) of the IMDB, news, and combined models. The X-axis shows the α% percentage, while the Y-axis shows the corresponding α% percentage coverage. These plots show that both the overall performance and the high-grossing performance of the combined models are higher than those of the IMDB or news models. We reach exactly the same conclusion for the “budget” case. Furthermore, the comparison of Figures 3 and 4 also shows that regression models work better for overall performance, while k-NN models perform better on high-grossing movies. That is, regression models are more suitable for low-grossing movies, but k-NN models are more suitable for high-grossing movies.

V. CONCLUSIONS

We have discussed the correlation of movie grosses with both traditional IMDB data and movie news data, and built models with IMDB data, news data, and their combination respectively. Our experiments demonstrate the media's predictive power for movie gross prediction.

Detailed conclusions are as follows. First, movie news references are highly correlated with movie grosses, and sentiment measures, including the derived sentiment indexes, are also correlated with movie grosses. Second, movie gross prediction can be done with IMDB data, news data, or their combination. Prediction models using merely news data can achieve performance similar to models using IMDB data, especially for high-grossing movies, while the combined models using both IMDB and news data yield the best results. Therefore, our analysis shows that news data is capable of improving movie gross prediction. Third, both regression and k-nearest neighbor classifiers can be used for movie gross prediction. With the same indicators, regression models have better low-grossing performance, but k-NN models have better high-grossing performance. Finally, article counts for movie entities are good movie gross predictors. News sentiment data are good predictors for k-NN models, but not for regression models.

For future work, we plan to compare our results with large-scale analysis of blog data, web reviews, as well as news data, to determine which kinds of sources have greater predictive power over which time scales.

REFERENCES

[1] G. Fung, J. Yu, and W. Lam, “Stock prediction: Integrating text mining approach using real-time news,” in Proceedings of the IEEE Int. Conference on Computational Intelligence for Financial Engineering, 2003, pp. 395–402.
[2] W. S. Chan, “Stock price reaction to news and no-news: Drift and reversal after headlines,” Journal of Financial Economics, vol. 70, pp. 223–260, 2003.
[3] P. C. Tetlock, M. Saar-Tsechansky, and S. Macskassy, “More than words: Quantifying language to measure firms' fundamentals,” in Proceedings of the 9th Annual Texas Finance Festival, May 2007.
[4] L. Lloyd, D. Kechagias, and S. Skiena, “Lydia: A system for large-scale news analysis,” in Proceedings of the 12th String Processing and Information Retrieval (SPIRE 2005), vol. LNCS 3772, Buenos Aires, Argentina, 2005, pp. 161–166.
[5] A. Chen, “Forecasting gross revenues at the movie box office,” Working paper, University of Washington, Seattle, WA, June 2002.
[6] J. S. Simonoff and I. R. Sparrow, “Predicting movie grosses: Winners and losers, blockbusters and sleepers,” Chance, vol. 13(3), pp. 15–24, 2000.
[7] M. S. Sawhney and J. Eliashberg, “A parsimonious model for forecasting gross box-office revenues of motion pictures,” Marketing Science, vol. 15, no. 2, pp. 113–131, 1996.
[8] R. Sharda and E. Meany, “Forecasting gate receipts using neural network and rough sets,” in Proceedings of the International DSI Conference, 2000, pp. 1–5.
[9] R. Sharda and D. Delen, “Forecasting box-office success of motion pictures with neural networks,” Expert Systems with Applications, vol. 30, pp. 243–254, 2006.
[10] B. Pang and L. Lee, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 79–86.
[11] P. Chaovalit and L. Zhou, “Movie review mining: A comparison between supervised and unsupervised classification approaches,” in Proceedings of the Hawaii International Conference on System Sciences (HICSS), 2005.
[12] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.
[13] G. Mishne and N. Glance, “Predicting movie sales from blogger sentiment,” in AAAI Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW), 2006, pp. 155–158.
[14] R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast accuracy,” International Journal of Forecasting, vol. 22, pp. 679–688, 2006.