A Nonparanormal Approach to Combining Textual and Visual Information
for Predicting and Generating Popular Meme Descriptions
William Yang Wang and Miaomiao Wen
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Abstract
The advent of social media has brought Internet memes, a unique social phenomenon, to the front stage of the Web. Although memes are embodied in the simple form of images with text descriptions, we know little about the “language of memes”. In this paper, we statistically study the correlations among popular memes and their wordings, and generate meme descriptions from raw images. To do this, we take a multimodal approach: we propose a robust nonparanormal model to learn the stochastic dependencies among the image, the candidate descriptions, and the popular votes. In experiments, we show that combining text and vision helps identify popular meme descriptions; that our nonparanormal model is able to learn dense and continuous vision features jointly with sparse and discrete text features in a principled manner, outperforming various competitive baselines; and that our system can generate meme descriptions using a simple pipeline.

1 Introduction

In the past few years, Internet memes have become a new, contagious social phenomenon: it all starts with an image with a witty, catchy, or sarcastic sentence, and people circulate it from friends to friends, colleagues to colleagues, and families to families. Eventually, some of these memes go viral on the Internet.

A meme is not only about the funny picture, the Internet culture, or the emotion that passes along, but also about the richness and uniqueness of its language: it is often highly structured with a special written style, and forms interesting and subtle connotations that resonate among the readers. For example, the LOL cat memes (e.g., Figure 1) often include superimposed text with broken grammar and/or spelling.

Figure 1: An example of the LOL cat memes.

Even though memes are popular on the Internet, the “language of memes” is still not well understood: there are no systematic studies on predicting and generating popular Internet memes from the Natural Language Processing (NLP) and Computer Vision (CV) perspectives.

In this paper, we take a multimodal approach to predicting and generating popular meme descriptions. To do this, we collect a set of original meme images, a list of candidate descriptions, and the corresponding votes. We propose a robust nonparanormal approach (Liu et al., 2009) to model the multimodal stochastic dependencies among images, text, and votes. We then introduce a simple pipeline for generating meme descriptions that combines reverse image search and traditional information retrieval approaches. In empirical experiments, we show that our model outperforms strong discriminative baselines by very large margins in the regression/ranking experiments, and that in the generation experiment, the nonparanormal outperforms the second-best supervised baseline by 4.35 BLEU points and obtains a BLEU score improvement of 4.48 over an unsupervised recurrent neural network language model trained on a meme corpus that is almost 90 times larger. Our contributions are three-fold:

• We are the first to study the “language of memes” by combining NLP, CV, and machine learning techniques, and we show that combining the visual and textual signals helps identify popular meme descriptions;

• Our approach empowers Internet users to select better wordings and generate new memes automatically;

• Our proposed robust nonparanormal model outperforms competitive baselines for predicting and generating popular meme descriptions.

In the next section, we outline related work. In Section 3, we introduce the theory of copulas and our nonparanormal approach. In Section 4, we describe the datasets. We show the prediction and generation results in Sections 5 and 6. Finally, we conclude in Section 7.

2 Related Work

Although the language of Internet memes is a relatively new research topic, our work is broadly related to studies on predicting popular social media messages (Hong et al., 2011; Bakshy et al., 2011; Artzi et al., 2012). Most recently, Tan et al. (2014) studied the effects of wording on Tweets. However, none of the above studies investigated multimodal approaches that combine text and vision.

Recently, there has been growing interest in interdisciplinary research on generating image descriptions. Gupta et al. (2009) studied the problem of constructing plots from video understanding. The work by Farhadi et al. (2010) is among the first to generate sentences from images. Kulkarni et al. (2011) use linguistic constraints and a conditional random field model for the task, whereas Mitchell et al. (2012) leverage syntactic information and co-occurrence statistics, and Dodge et al. (2012) use a large text corpus and CV algorithms for detecting visual text. With the surge of interest in deep learning techniques in NLP (Socher et al., 2013; Devlin et al., 2014) and CV (Krizhevsky et al., 2012; Oquab et al., 2013), there have lately been several unrefereed manuscripts on parsing images and generating text descriptions with neural network models (Vinyals et al., 2014; Chen and Zitnick, 2014; Donahue et al., 2014; Fang et al., 2014; Karpathy and Fei-Fei, 2014). Although the above studies have shown interesting results, our task is arguably more complex than generating text descriptions: in addition to the visual and textual signals, we have to model the popular votes as a third dimension for learning. For example, we cannot simply train a convolutional neural network image parser on billions of images and use recurrent neural networks to generate text such as “There is a white cat sitting next to a laptop.” for Figure 1. Additionally, since not all images are suitable as meme images, collecting training images is also more challenging in our task.

In contrast to prior work, we take a very different approach: we investigate copula methods (Schweizer and Sklar, 1983; Nelsen, 1999), in particular the nonparanormals (Liu et al., 2009), for joint modeling of raw images, text descriptions, and popular votes. The copula is a statistical framework from Statistics for analyzing the dependencies among random variables (Liu et al., 2012), and is often used in Economics (Chen and Fan, 2006). Only very recently have researchers from the machine learning and information retrieval communities (Ghahramani et al., 2012; Han et al., 2012; Eickhoff et al., 2013) started to understand the theory and the predictive power of copula models. Wang and Hua (2014) are the first to introduce the semiparametric Gaussian copula (a.k.a. the nonparanormal) for text prediction. However, their approach may be prone to overfitting. In this work, we generalize Wang and Hua’s method to jointly model text and vision features with popular votes, while scaling up the model using effective dropout regularization.

3 Our Approach

A key challenge for joint modeling of text and vision is that textual features are often relatively sparse and discrete, while visual features are typically dense and continuous, so it is difficult to model them jointly in a principled way.

To avoid comparing “apples and oranges” in the same probabilistic space, we propose the nonparanormal approach, which extends the Gaussian graphical model by transforming its variables by smooth functions. More specifically, for each dimension of textual and visual features, instead of