A Nonparanormal Approach to Combining Textual and Visual Information
for Predicting and Generating Popular Meme Descriptions
William Yang Wang and Miaomiao Wen
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Abstract
The advent of social media has brought Internet memes, a unique social phenomenon, to the front stage of the Web. Although memes are embodied in the simple form of images with text descriptions, we know little about the “language of memes”. In this paper, we statistically study the correlations among popular memes and their wordings, and generate meme descriptions from raw images. To do this, we take a multimodal approach: we propose a robust nonparanormal model to learn the stochastic dependencies among the image, the candidate descriptions, and the popular votes. In experiments, we show that combining text and vision helps identify popular meme descriptions; that our nonparanormal model is able to learn dense and continuous vision features jointly with sparse and discrete text features in a principled manner, outperforming various competitive baselines; and that our system can generate meme descriptions using a simple pipeline.

1 Introduction

In the past few years, Internet memes have become a new, contagious social phenomenon: it all starts with an image with a witty, catchy, or sarcastic sentence, and people circulate it from friends to friends, colleagues to colleagues, and families to families. Eventually, some of these memes go viral on the Internet.

A meme is not only about the funny picture, the Internet culture, or the emotion that passes along, but also about the richness and uniqueness of its language: it is often highly structured with a special written style, and forms interesting and subtle connotations that resonate among the readers. For example, the LOL cat memes (e.g., Figure 1) often include superimposed text with broken grammar and/or spelling.

Figure 1: An example of the LOL cat memes.

Even though memes are popular on the Internet, the “language of memes” is still not well understood: there are no systematic studies on predicting and generating popular Internet memes from the Natural Language Processing (NLP) and Computer Vision (CV) perspectives.

In this paper, we take a multimodal approach to predicting and generating popular meme descriptions. To do this, we collect a set of original meme images, a list of candidate descriptions, and the corresponding votes. We propose a robust nonparanormal approach (Liu et al., 2009) to model the multimodal stochastic dependencies among images, text, and votes. We then introduce a simple pipeline for generating meme descriptions that combines reverse image search and traditional information retrieval approaches. In empirical experiments, we show that our model outperforms strong discriminative baselines by very large margins in the regression/ranking experiments, and that in the generation experiment, the nonparanormal outperforms the second-best supervised baseline by 4.35 BLEU points and obtains a BLEU score improvement of 4.48 over an unsupervised recurrent neural network language model trained on a meme corpus that is almost 90 times larger. Our contributions are three-fold:

• We are the first to study the “language of memes” by combining NLP, CV, and machine learning techniques, and we show that combining the visual and textual signals helps identify popular meme descriptions;

• Our approach empowers Internet users to select better wordings and generate new memes automatically;

• Our proposed robust nonparanormal model outperforms competitive baselines for predicting and generating popular meme descriptions.

In the next section, we outline related work. In Section 3, we introduce the theory of copulas and our nonparanormal approach. In Section 4, we describe the datasets. We show the prediction and generation results in Sections 5 and 6. Finally, we conclude in Section 7.

2 Related Work

Although the language of Internet memes is a relatively new research topic, our work is broadly related to studies on predicting popular social media messages (Hong et al., 2011; Bakshy et al., 2011; Artzi et al., 2012). Most recently, Tan et al. (2014) studied the effects of wording on Tweets. However, none of the above studies investigated multimodal approaches that combine text and vision.

Recently, there has been growing interest in interdisciplinary research on generating image descriptions. Gupta et al. (2009) studied the problem of constructing plots from video understanding. The work by Farhadi et al. (2010) is among the first to generate sentences from images. Kulkarni et al. (2011) use linguistic constraints and a conditional random field model for the task, whereas Mitchell et al. (2012) leverage syntactic information and co-occurrence statistics, and Dodge et al. (2012) use a large text corpus and CV algorithms for detecting visual text. With the surge of interest in deep learning techniques in NLP (Socher et al., 2013; Devlin et al., 2014) and CV (Krizhevsky et al., 2012; Oquab et al., 2013), there have lately been several unrefereed manuscripts on parsing images and generating text descriptions with neural network models (Vinyals et al., 2014; Chen and Zitnick, 2014; Donahue et al., 2014; Fang et al., 2014; Karpathy and Fei-Fei, 2014). Although the above studies have shown interesting results, our task is arguably more complex than generating text descriptions: in addition to the visual and textual signals, we have to model the popular votes as a third dimension for learning. For example, we cannot simply train a convolutional neural network image parser on billions of images and use recurrent neural networks to generate text such as “There is a white cat sitting next to a laptop.” for Figure 1. Additionally, since not all images are suitable as meme images, collecting training images is also more challenging in our task.

In contrast to prior work, we take a very different approach: we investigate copula methods (Schweizer and Sklar, 1983; Nelsen, 1999), in particular the nonparanormals (Liu et al., 2009), for joint modeling of raw images, text descriptions, and popular votes. The copula is a statistical framework from Statistics for analyzing the dependencies among random variables (Liu et al., 2012), and is often used in Economics (Chen and Fan, 2006). Only very recently have researchers from the machine learning and information retrieval communities (Ghahramani et al., 2012; Han et al., 2012; Eickhoff et al., 2013) started to understand the theory and the predictive power of copula models. Wang and Hua (2014) are the first to introduce the semiparametric Gaussian copula (a.k.a. the nonparanormal) for text prediction. However, their approach may be prone to overfitting. In this work, we generalize Wang and Hua’s method to jointly model text and vision features with popular votes, while scaling up the model using effective dropout regularization.

3 Our Approach

A key challenge for joint modeling of text and vision is that textual features are often relatively sparse and discrete, while visual features are typically dense and continuous, so it is difficult to model them jointly in a principled way.

To avoid comparing “apples and oranges” in the same probabilistic space, we propose the nonparanormal approach, which extends the Gaussian graphical model by transforming its variables by smooth functions. More specifically, for each dimension of textual and visual features, instead of