
Received March 7, 2020, accepted March 30, 2020, date of publication April 10, 2020, date of current version April 23, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2987281

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Deep Learning in Next-Frame Prediction: A Benchmark Review

YUFAN ZHOU, HAIWEI DONG, (Senior Member, IEEE), AND ABDULMOTALEB EL SADDIK, (Fellow, IEEE)
Multimedia Computing Research Laboratory, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada

Corresponding author: Haiwei Dong (haiwei.dong@ieee.org)

ABSTRACT As an unsupervised representation problem in deep learning, next-frame prediction is a new, promising direction of research in computer vision: predicting possible future images from historical image information. It provides extensive application value in robot decision making and autonomous driving. In this paper, we introduce recent state-of-the-art next-frame prediction networks and categorize them into two architectures: sequence-to-one and sequence-to-sequence. After comparing these approaches by analyzing their network architectures and loss function designs, their pros and cons are summarized. Based on off-the-shelf datasets and the corresponding evaluation metrics, the performance of the aforementioned approaches is quantitatively compared. Promising future research directions are pointed out at the end.

INDEX TERMS Frame prediction architecture, loss function design, state-of-the-art evaluation.

The associate editor coordinating the review of this manuscript and approving it for publication was Mehedi Masud.

I. INTRODUCTION
Next-frame prediction, that is, predicting what happens next in the form of an image or a few images, is an emerging field of computer vision and deep learning. This prediction is built on the understanding of the information in the historical images that have occurred so far. It refers to starting from continuous, unlabeled video frames and constructing a network that can accurately generate subsequent frames. The input of the network is the previous few frames, and the prediction is the next frame(s). These predictions can be not only of human motion but also of any object motion and background in the images. Modeling the contents and dynamics of videos or images is the main task of next-frame prediction, which is different from motion prediction: next-frame prediction predicts future image(s) from a few previous images or video frames, whereas motion prediction infers dynamic information, such as human motion and an object's movement trajectory, from a few previous images or video frames.

Many scenes in real life can be predicted since they satisfy physical laws (e.g., inertia), such as ball parabola prediction for a ping-pong robot [1]. The prediction of moving objects facilitates advance decisions by the machine. Similarly, images can also be predicted so that the machine can better understand the existing environment [2]. Many examples of predictive systems can be found where next-frame prediction is beneficial. For instance, predicting future frames enables autonomous agents to make smart decisions in various tasks. Kalchbrenner et al. [3] proposed a video pixel network that helps robots make decisions by understanding the current images and estimating the discrete joint distribution of the raw pixel values between images. Oh et al. [4] combined a predictive model with the deep Q-learning algorithm for better performance of an artificial intelligence (AI) agent in playing Atari games. Other approaches [5], [6] provided a visual predictive system for vehicles, which predicts the future positions of pedestrians in the image to guide the vehicle to slow down or brake. Benefiting from next-frame prediction, Klein et al. [7] forecasted weather by predicting radar cloud images. In summary, next-frame prediction enables artificial intelligence to create a better understanding of the surrounding environment and provides a huge potential to deal with many different tasks based on predictive ability.


Since deep learning has shown its effectiveness in image processing [8], deep learning for next-frame prediction is very powerful compared with traditional machine learning. Traditional machine learning methods often require the manual extraction of features and much preprocessing work. Time sequence predictions in machine learning use linear models such as ARIMA or exponential smoothing, which often fail to meet the needs of increasingly complex real-world environments. It is difficult for them to learn features from images efficiently. A large number of methods have proven that deep learning is more suitable for image representation learning [9]. Despite the great progress in deep-learning architectures, next-frame prediction remains a big challenge, which can be summarized from two aspects: image deblurring and long-term prediction. In this paper, we define a prediction system as a long-term prediction if it can generate more than 20 future frames. The details will be discussed in the next sections. To conclude, next-frame prediction is of great importance in the field of artificial intelligence by predicting future possibilities and making decisions in advance.

In this paper, we cover recent papers and discuss the ideas, contributions, and drawbacks of the previously proposed methods. Through the comparison of network structure, loss function design, and performance in experiments, the advantages and disadvantages of these methods are summarized, which inspires us to form a perspective on future research directions.

II. RELATED WORK
A. PREDICTIVE LEARNING
Predictive learning predicts future possibilities by understanding existing information. Generally, predictive learning is used to solve sequence problems, and it has several practical applications. Recurrent networks are suitable for seeking patterns in sequence data, such as video representation, traffic flow, and weather prediction. Song et al. [22] proposed a pyramid dilated bidirectional ConvLSTM to effectively detect significant regions in videos. Zhang et al. [23] predicted traffic flow by designing a spatiotemporal model. Shi et al. [24] predicted rainfall by the use of their proposed convolutional LSTM network (ConvLSTM), which combined the convolutional operation with a recurrent layer.

Additionally, as a way to create strong artificial intelligence, predictive learning has been applied in the fields of motion prediction, such as action prediction and trajectory prediction. Vu et al. [25] proposed a method to predict human actions from static scenes using the correlation information between actions and scenes. Ryoo et al. [26] implemented probabilistic action prediction and used an integral histogram of spatiotemporal features to model how the feature distribution changes over time. Lan et al. [27] developed a max-margin learning framework and proposed a new representation called ''human movemes'' for action prediction. Walker et al. [28] used the optical flow algorithm to mark a video and then trained an optical flow prediction model that can predict the motion of each pixel.

Furthermore, the performance and safety of self-driving cars can be improved since the behavior of vehicles and pedestrians can be estimated in advance by predictive learning systems. Walker et al. [29] tried to select the optimal target through a reward function to model the motion trajectory of a car. Deo et al. [30] presented a recurrent model based on LSTM [31] for the prediction of vehicle movement and its trajectory under freeway traffic conditions.

B. DEEP LEARNING IN IMAGE GENERATION
Image generation is generating new images based on an existing dataset. It is generally divided into two categories. One category is to generate images according to attributes, which are generally text descriptions. The other category is from image to image: taking historical images as the input of the network to generate image(s) for specific purposes, such as denoising [32], super-resolution [33] and image style transfer [34].

The most common network structures in the field of image generation are autoencoders and generative adversarial networks (GANs) [35]. Autoencoders are the most popular network architecture for generating images. An autoencoder is usually composed of two parts: an encoder and a decoder. The encoder encodes the data into a latent variable, and the decoder reconstructs the latent variable into the original data. There are several variants of autoencoders, including sparse autoencoders [36] and variational autoencoders (VAEs) [37]. As an example, Mansimov et al. [38] used a VAE to iteratively draw pictures based on the words of a caption, built on the recurrent neural network DRAW [39]. The feature of the text description is used as the input of the network to generate the required image.

A generative adversarial network (GAN) is a commonly used training model in image generation. There are two components in a GAN: a discriminative model and a generative model. Images are generated by the generative model, while the discriminative model is trained to maximize the probability of applying the correct label to both training examples and samples from the generative model. The discriminative model learns the boundary between classes, while the generative part models the distribution of individual classes. Using a GAN can make the generated picture clearer. Several approaches [33], [34] have been proposed to successfully generate sharper images with GANs. Furthermore, different types of GANs have been designed to generate images, such as ImprovedGAN [40] and InfoGAN [41]. There are also combinations of different GANs performing image generation tasks. Zhang et al. [42] proposed StackGAN to iteratively generate high-quality images. There are two-stage generators in their approach: the first stage produces low-resolution images, and the second stage produces high-resolution images based on the results of the first stage. ProgressiveGAN [43] trained 4 × 4 pixel generators and discriminators first and then gradually added additional layers to double the output resolution up to 1024 × 1024.
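To make these two generator families concrete, the following minimal PyTorch sketch (our illustration; the layer sizes, the 64 × 64 input resolution, and a discriminator that outputs a probability of shape (B, 1) are assumptions, not code from any surveyed paper) shows a small convolutional autoencoder and one adversarial training step in which the discriminator labels real samples 1 and generated samples 0:

```python
# Minimal sketch (PyTorch, illustration only): a convolutional
# autoencoder and one adversarial training step.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress a 64x64 RGB image into a latent feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        # Decoder: reconstruct the image from the latent feature map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def gan_step(generator, discriminator, g_opt, d_opt, real, gen_input):
    """One adversarial update: D labels real 1 / fake 0, G tries to fool D."""
    bce = nn.BCELoss()
    ones = torch.ones(real.size(0), 1)    # assumes D outputs shape (B, 1)
    zeros = torch.zeros(real.size(0), 1)
    fake = generator(gen_input)
    # Discriminator update: push real scores toward 1, fake scores toward 0.
    d_opt.zero_grad()
    d_loss = bce(discriminator(real), ones) + \
             bce(discriminator(fake.detach()), zeros)
    d_loss.backward()
    d_opt.step()
    # Generator update: make the discriminator judge fakes as real.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), ones)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```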


TABLE 1. Comparison between different next-frame prediction approaches.

III. STATE-OF-THE-ART APPROACHES
Next-frame prediction can be taken as a spatiotemporal problem. Given a sequence of images in continuous time steps, predicting the next frame is performed by time sequence learning. Surrounding environmental information is learned from a sequence of images, and the regularity of pixel changes between images is retrieved. In addition, for a specific image, the relationship between pixels is a significant factor to be considered when performing next-frame prediction. The key feature can be extracted from the spatial structure of the image by the position, appearance, and shape of the object. We categorize the networks for next-frame prediction into two architectures: sequence-to-one architecture and sequence-to-sequence architecture. For the former architecture, the input of the deep learning model is a set of frames in time-step order between t and t + k, and the prediction is the next frame. For the latter architecture, the input is temporal frames that are separately fed into the neural network. Specifically, the frame at time step t is input to the deep learning model, and the prediction is the next frame at time step t + 1. This operation is conducted repeatedly until the deep learning model reaches the frame at the (t + m)th time step. Sequence-to-one architectures focus on the spatial structure of the set of input frames, while sequence-to-sequence architectures focus on the factor of the temporal sequence. A minimal sketch contrasting the two input conventions is given at the end of this overview.

Table 1 lists a collection of recent representative next-frame prediction approaches. These state-of-the-art approaches are compared in terms of their learning model structure, the number of inputs, the number of predicted frames, and their loss functions. Additionally, the corresponding code URLs are included in Table 1. A recurrent neural network is rarely used in the sequence-to-one architecture, while it is widely used in the sequence-to-sequence architecture. As a classic neural network in image processing, autoencoders are widely used in both types of architectures for next-frame prediction. Usually, the encoder extracts the spatial features from the previous frames, and the decoder regresses the pixels and reconstructs the next frames. The main advantage of autoencoder models is that they reduce the amount of input information by extracting the most representative information from the original image and feed the reduced information into the neural network for learning. In addition, the structure of the autoencoder can adapt to a few input variables. For the loss function, L1, L2, and adversarial losses are used in both architectures. Among them, the most frequently used loss function is L2. For the number of predicted frames, Villegas et al. [14] predicted the largest number of frames: up to 128. In contrast, most state-of-the-art approaches can only predict less than 20 frames. In the following subsections, the two architectures are illustrated respectively.
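The following sketch contrasts the two input conventions (the batch size, resolution, and the commented-out model calls are hypothetical placeholders for illustration, not part of any surveyed approach):

```python
# Illustrative tensor shapes for the two architectures.
import torch

k, m = 4, 4                              # input frames / frames to predict
frames = torch.rand(8, k, 3, 64, 64)     # (B, k, C, H, W), time steps t..t+k-1

# Sequence-to-one: concatenate the k frames on the channel dimension and
# map the whole set to the single next frame in one forward pass.
seq2one_input = frames.reshape(8, k * 3, 64, 64)    # (B, k*C, H, W)
# next_frame = seq2one_model(seq2one_input)          # (B, C, H, W)

# Sequence-to-sequence: feed one frame per time step into a recurrent
# model; after the k observed frames, feed predictions back in until
# the frame at time step t + m is reached.
# hidden = None
# x = frames[:, 0]
# for t in range(k + m - 1):
#     x, hidden = seq2seq_model(x, hidden)           # one frame per step
#     if t + 1 < k:
#         x = frames[:, t + 1]                       # still observing inputs
```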


FIGURE 1. A general sequence-to-one architecture with a pyramid of autoencoder models and a GAN. The generator is the autoencoder model, where the encoder consists of convolutional networks and the decoder consists of deconvolutional networks. The discriminator is a two-class classifier, which distinguishes the images generated by the generator from real data. The previous frames are resized into different sizes and fed into the overall network, while the output is the next frame at different scales.

A. SEQUENCE-TO-ONE ARCHITECTURE
As mentioned in the definition of a sequence-to-one architecture, most approaches [6], [12], [13], [15] concatenate the previous frames on the channel dimension as a set of input images for the networks. The frames are sorted in chronological order. Although the approaches from [10], [11] generated possible future frames from a single image, these two approaches are categorized as sequence-to-one architectures for the following reasons. For the approach in [11], the authors converted the input frame into pyramid sizes. Therefore, the input to the image autoencoder is a multiscale set. Moreover, the authors set a motion autoencoder to extract convolutional kernels from the image difference. The cross convolution operation of their approach combines the convolutional kernels with the feature maps from the image autoencoder. For the approach from [10], the authors considered the temporal factor as the input to the state layer in the autoencoder model. The encoder has two branches: one branch receives the input image, and the other branch receives the time difference of the desired prediction. The decoder generates reliable frames based on the latent variable output from the encoder.

In image learning, the pyramid structure has been proven to be efficient for high-level feature extraction. As shown in Figure 1, feature pyramids are independently constructed from images of various scales. Feature pyramids are a semantically multiscale feature representation. Mathieu et al. [13] used multiscale networks over images at multiple scales to maintain a high resolution and reconstructed high-quality frames. Liu et al. [12] proposed an end-to-end deep voxel flow network. They predicted the 3D voxel flow by a convolutional pyramid autoencoder. The voxel flow is fed to a volume sampling layer to synthesize the desired frame. Nevertheless, featuring each scale of an image has an obvious limitation, i.e., slow processing speed. Lin et al. [44] proposed a feature pyramid network to improve the processing speed. Instead of extracting features from images of various scales, they extracted various scales of features from a single image. The low-resolution, semantically strong features are combined with high-resolution, semantically weak features via a top-down pathway and lateral connections. They applied their approach to the task of object tracking. However, this kind of pyramid structure could be used to speed up next-frame prediction models in future research.

The combination of autoencoders and generative adversarial networks is a popular choice in the field of next-frame prediction, especially in the sequence-to-one architecture. Figure 1 is a representative example of a multiscale network with a GAN to generate the next predicted frame. In general, the input of the discriminative model is a real sample or a generated sample, and its purpose is to distinguish the output of the generative model from the real samples as well as possible. The two models confront each other and adjust their parameters continuously. The images generated by the approach in [13] are much sharper in their experiments with the help of a GAN. Since the set of generators in the approach of [13] is a multiscale structure, the corresponding set of discriminators is also a multiscale structure. The losses calculated by the multiple discriminators are accumulated to update the weights of the model (a sketch of this multiscale pairing is given below). Hintz [45] was inspired by Mathieu's method but replaced the generator with a reservoir computing network, which is a more complex RNN structure for dealing with high-dimensional sequence problems. The discriminator structure and training method remained the same.
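The sketch below illustrates the multiscale setup of Figure 1: a coarse-to-fine input pyramid and an adversarial loss accumulated over one discriminator per scale. Using four scales and sigmoid-output discriminators are our assumptions, not the exact design of [13]:

```python
# Sketch (PyTorch) of a multiscale pyramid with per-scale discriminators.
import torch
import torch.nn.functional as F

def make_pyramid(frames, num_scales=4):
    """Resize the input frames into a coarse-to-fine pyramid."""
    return [F.interpolate(frames, scale_factor=1 / 2 ** s,
                          mode='bilinear', align_corners=False)
            for s in reversed(range(num_scales))]   # coarsest first

def multiscale_d_loss(discriminators, preds, targets):
    """Accumulate the adversarial loss of one discriminator per scale."""
    total = 0.0
    for d, p, t in zip(discriminators, preds, targets):
        real, fake = d(t), d(p)   # per-scale scores in (0, 1)
        total = total + F.binary_cross_entropy(real, torch.ones_like(real)) \
                      + F.binary_cross_entropy(fake, torch.zeros_like(fake))
    return total
```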


In addition, the approaches of Liang et al. [6], Mathieu et al. [13], and Villegas et al. [14] used GANs to enhance the quality of the predicted frames. There are several kinds of GAN structures that can be used for next-frame prediction, such as WGAN [46], WGAN-GP [47] and DCGAN [48]. Liang et al. [6] proposed dual WGANs to encourage background and motion prediction separately. Both decoders benefit from the learning of the other: a flow warping layer and a flow estimator are used to obtain the image from the predicted flow and vice versa, respectively. Therefore, they used two discriminators to distinguish real/fake frames and flows separately, ensuring that the pixel-level predictor produced sequences with proper motion and that the flow predictor was coherent at the pixel level.

Inspired by human pose estimation, Villegas et al. [14] used high-order structural information, namely, the key human body joints, to assist in the next-frame prediction of human activities. The human skeleton structure is extracted from the input image with an hourglass network [49]. Villegas et al. [14] also used an LSTM to predict the locations of the key points in the next frame. A recurrent neural network (RNN) plays a minor, assistive role in the sequence-to-one architecture. The skeleton information of the next frame is fused into the encoder network in the form of a heat map. The experiments in the paper showed that this type of video generation based on high-order structural information can effectively reduce error propagation and accumulation. However, this method has certain limitations: the background information remains unchanged, and it can only model changes in human motion. For human activities, retaining static objects while predicting motion is a valuable direction.

To measure the distance between the generated sample and the real sample, researchers usually use the L1 or L2 distance. Using only the L1 or L2 distance as the loss function results in a blurry generated image. When predicting more frames, this problem is even more serious in a sequence-to-one architecture. To solve the problem of image blurriness caused by using the L1 or L2 loss function, Mathieu et al. [13] proposed an image gradient difference loss, which penalizes the gradient inconsistency between the predicted sample and the real sample by comparing the intensity differences of neighboring pixels (this loss is sketched below).

B. SEQUENCE-TO-SEQUENCE ARCHITECTURE
Another type of next-frame prediction architecture is the sequence-to-sequence architecture. It better reflects the character of temporal information. As shown in Figure 2, sequence-to-sequence architectures lead to different losses at different time steps since they predict one frame at each time step. The best results are achieved by assigning different weights to each time point, which is the main difference in the setting of the loss function between sequence-to-sequence architectures and sequence-to-one architectures. Considering both the spatial and temporal features is one main feature of sequence-to-sequence architectures. Many researchers use the structure of recurrent neural networks to model the temporal sequence data and discover the relationships in a sequence.

Most sequence-to-sequence architectures use LSTM or ConvLSTM for future frame generation. The current approaches use multiple structures to consider both spatial and temporal features and combine the autoencoder model or GAN model with the RNN model. The approaches in [16], [18] first apply the encoder to the whole input sequence and then unroll the decoder for multiple predictions. In this sense, the model has fewer computational requirements. Michalski et al. [15] used a pyramid of gated autoencoders for prediction and employed recurrent connections to predict potentially any desired length of time. The higher layers of their network model the changes in the transformations extracted by the lower layers among the frames. Finn et al. [19] tried to decompose the motion and content: two key components that generate the dynamics in videos. Their network is built upon an autoencoder convolutional neural network and ConvLSTM for pixel-level prediction. Lotter et al. [5] used ConvLSTM for their prediction architecture based on the concept of predictive coding. In their approach, the image prediction error can be transmitted through the network to achieve a better way of learning the frame representation. Villegas et al. [50] also proposed a motion-content network (MCNet) for separating the background and human motion. The network has two encoder inputs: one encoder receives the image sequence difference as the motion input and uses an LSTM to model the motion dynamics, and the other encoder receives the last frame as the static image. After combining the outputs from the LSTM and the outputs from the static image encoder, the convolutional decoder takes the combination and outputs the predicted frames. Learning the temporal changes in objects' features is a new direction for predicting frames, but it has led to a relatively small revolutionary change. Another innovative proposal is disentangled representation [51], which means that a change in a single underlying factor of variation should lead to a change in a single factor of the learned representation. A disentangled representation should separate the distinct, informative factors of variations in the data [52]. The application of disentangled representation in next-frame prediction is that applying a recurrent model to the time-varying components enables future-frame prediction. Denton et al. [17] broke each feature down into narrowly defined variables and encoded them as separate dimensions: pose, the time-varying part of the frame sequence, and content, the static appearance. The combination of features from the pose and content encoders was input to LSTM networks for next-frame prediction. In addition, motivated by DeepMind's use of Atari games for reinforcement learning (RL) problems, Oh et al. [4] proposed two spatiotemporal prediction architectures based on a deep network containing action variables. Based on their understanding, future frames are related not only to past frames but also to the current operation or behavior.
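Picking up the image gradient difference loss referenced in the previous subsection, the sketch below penalizes mismatches between the neighboring-pixel intensity differences of the prediction and those of the ground truth; the exponent alpha = 1 and the mean reduction are our choices, not necessarily those of [13]:

```python
# Minimal gradient difference loss (GDL) sketch for (B, C, H, W) tensors.
import torch

def gradient_difference_loss(pred, target, alpha=1):
    # Absolute intensity differences between neighboring pixels
    # (image gradients) along the height and width dimensions.
    pred_dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    pred_dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    target_dy = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    target_dx = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    # Penalize the mismatch between predicted and real gradients.
    return ((pred_dy - target_dy).abs() ** alpha).mean() + \
           ((pred_dx - target_dx).abs() ** alpha).mean()
```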


FIGURE 2. A general sequence-to-sequence architecture. The upper part is a representative unrolling over time for sequence prediction. The lower part is the prediction model, which is composed of a recurrent neural network and an autoencoder. The frame sequence is fed into the RNN. The total loss is composed of the differences between the predicted images and the ground truth at each time step.

ConvLSTM is a classic representation of spatiotemporal predictive learning (a minimal ConvLSTM cell is sketched at the end of this subsection). Inspired by ConvLSTM, Wang et al. [20] designed a new, stronger RNN memory (called ST-LSTM) by adding a spatiotemporal memory to the standard temporal cell, developing the spatiotemporal structural information. A shared output gate is used in their ST-LSTM to fuse both memories seamlessly. In addition, they also proposed a new model structure: PredRNN. According to the authors, the traditional connection between multilayer RNN networks ignores the effect of the top-level cell at time t on the bottom-level cell at time t + 1, and in their opinion, this effect is very significant. Therefore, they added top-level to bottom-level connections between the time steps t and t + 1 in PredRNN. The combination of PredRNN and ST-LSTM makes the memory flow spread in both horizontal and vertical directions, yielding high accuracy in long-term next-frame prediction. However, there is a disadvantage to performing long-term frame prediction in a sequence-to-sequence architecture: since the predictions are made recursively, small errors in the pixels are exponentially magnified when performing deeper future predictions. Based on PredRNN, Wang et al. improved their network and proposed PredRNN++ [21] to make long-term next-frame predictions. In their model, a new spatiotemporal storage mechanism, called causal LSTM, was designed. The new LSTM attains more powerful modeling capabilities to achieve stronger spatial correlation and short-term dynamics. In addition, they also proposed a gradient highway unit, which provides a fast path for gradients from future predictions to long-interval past inputs to avoid the vanishing gradient problem.
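A minimal ConvLSTM cell can be sketched as follows (PyTorch; producing all four gates with one convolution and using a 3 × 3 kernel are common conventions, and the details here are our assumptions rather than the exact cells of [5], [19], [20]):

```python
# Minimal ConvLSTM cell sketch.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        # One convolution over [input, hidden] emits all four gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell maps, each (B, hid_ch, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)),
                                 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # convolutional memory update
        h = o * torch.tanh(c)           # new hidden state
        return h, (h, c)

# Usage: initialize the state with zeros and step through the sequence.
cell = ConvLSTMCell(in_ch=3, hid_ch=16)
state = (torch.zeros(1, 16, 64, 64), torch.zeros(1, 16, 64, 64))
for t in range(4):
    out, state = cell(torch.rand(1, 3, 64, 64), state)
```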


IV. DATASETS AND EVALUATION METRICS
First of all, we define the symbols used in this section. The deep learning model generates the next frames $X_{n+1}, X_{n+2}, \ldots, X_{n+T}$, while the input is a set of continuous video frames $X_1, X_2, \ldots, X_n$. $T$ is the number of frames that need to be predicted. In this paper, we define $Y$ and $\hat{Y}$ as the ground truth and the generated prediction, respectively. The prediction is $\hat{Y}: \hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_T$.

Because it is a new field of research, we found that there are currently no datasets specifically designed for next-frame prediction; researchers generally use motion video datasets or car-driving video datasets. We list eight commonly used datasets with their URLs, resolutions, motions, categories, numbers of videos and frame rates in Table 2. These datasets are also used for action recognition, detection, and segmentation. The goal of next-frame prediction is to predict the changes in pixels between images. Due to the smooth, continuous changes between frames in motion videos or car-driving videos, these datasets are the most appropriate for prediction. The Sports1m dataset has the most categories, while the Human3.6M dataset has the most videos. The frames per second (FPS) of the datasets varies from 10 to 50. The image resolution of each dataset is different, from full-size images (2048 × 1024) to small-size images (640 × 360). Autonomous driving datasets, such as KITTI and CityScape, generally have a large image resolution. The categories of sports generally include walking, bowling, pushing up, jumping, diving, crawling, punching, swinging, hand waving, and hand clapping. Among the datasets, Sports1m, UCFsports and Penn Action contain sports scenes of athletes or people. The Human3.6M, UCF101, THUMOS-15, and HMDB-51 datasets include not only sports scenes but also daily life scenes.

Since the final results are images, the commonly used methods for evaluating the quality of the frames, between the ground truth $Y$ and the prediction $\hat{Y}$, are the mean square error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM) [53]. $N$ is the number of pixels. The MSE measures the average of the squares of the errors or deviations. The MSE is calculated by:

$$ MSE(Y, \hat{Y}) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \quad (1) $$

The PSNR is an engineering term for the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. The PSNR is calculated by:

$$ PSNR(Y, \hat{Y}) = 10 \log_{10} \frac{\max_{\hat{Y}}^2}{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2} \quad (2) $$

where $\max_{\hat{Y}}$ is the maximum possible value of the image intensities.

The SSIM measures the image similarity in terms of the luminance, contrast, and structure between two images. It is calculated as follows:

$$ SSIM(Y, \hat{Y}) = \frac{(2\mu_Y \mu_{\hat{Y}} + C_1)(2\sigma_{Y\hat{Y}} + C_2)}{(\mu_Y^2 + \mu_{\hat{Y}}^2 + C_1)(\sigma_Y^2 + \sigma_{\hat{Y}}^2 + C_2)} \quad (3) $$

where $\mu_Y$ and $\mu_{\hat{Y}}$ are the means of $Y$ and $\hat{Y}$, respectively; $\sigma_Y^2$ and $\sigma_{\hat{Y}}^2$ are the variances of $Y$ and $\hat{Y}$, respectively; and $\sigma_{Y\hat{Y}}$ is the covariance of $Y$ and $\hat{Y}$. $C_1$ and $C_2$ are constants.
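For reference, Equations (1)-(3) can be implemented directly. The sketch below uses NumPy, assumes images scaled to [0, 1], and computes the global (non-windowed) form of the SSIM with the commonly used default constants C1 = 0.01^2 and C2 = 0.03^2; windowed SSIM implementations in common libraries will give slightly different values:

```python
# NumPy reference implementations of MSE, PSNR, and (global) SSIM.
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def psnr(y, y_hat, max_val=1.0):
    # Ratio of the maximum possible signal power to the noise power (MSE).
    return 10 * np.log10(max_val ** 2 / mse(y, y_hat))

def ssim(y, y_hat, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_y, mu_yh = y.mean(), y_hat.mean()
    var_y, var_yh = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_yh)).mean()
    return ((2 * mu_y * mu_yh + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_yh ** 2 + c1) * (var_y + var_yh + c2))
```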


TABLE 2. The commonly used datasets for next-frame prediction.

V. EXPERIMENTS AND DISCUSSION
A. EXPERIMENTS
To conduct the experiments fairly, we set the time interval between frames to 0.2 seconds. There are two frame rates of the raw video data in our comparison experiments: 10 fps and 25 fps. We modified the selection of the input images accordingly (see the sketch at the end of this subsection). For a dataset at 25 fps, the inputs of the networks are the 1st, 6th, 11th, and 16th frames, and the output is the prediction of the 21st frame. For a dataset at 10 fps, the inputs are the 1st, 3rd, 5th, and 7th frames, and the output is the prediction of the 9th frame. We construct the networks proposed by Denton, Liu, Mathieu, Srivastava, Oliu, Finn, Lotter, and Villegas on the UCFsports, KITTI, KTH, and UCF101 datasets. We use TensorFlow 1.12 and a GTX 1080 Ti GPU with 12 GB of memory to train and test these networks and compare their predictions using the MSE, PSNR, and SSIM. The results are shown in Table 3. The images are normalized between 0 and 1. The batch size is 32.

UCF101, KTH, and UCFsports represent the most challenging tasks in action prediction and classification. The KITTI dataset is a computer vision evaluation dataset from the autonomous driving platform AnnieWAY. The methods proposed by Villegas and Denton require human skeleton data for prediction, and KITTI is a vehicle driving dataset; therefore, we did not train and test these models on the KITTI dataset. We have seen that all the methods can achieve very high SSIM and PSNR values in predicting the frames in the action dataset experiments (KTH, UCF101, UCFsports) but low scores in the vehicle driving dataset experiment (KITTI). The reason for the low scores on the KITTI dataset is that the objects in the images are complex: buildings, vehicles, pedestrians and so on. We implemented most of the sequence-to-sequence approaches and three of the sequence-to-one approaches. Among the sequence-to-one approaches, most methods use autoencoders to make predictions, but there are also special ones: Villegas considered the extra information of the human skeleton to predict the next frame, which the other networks did not consider; Mathieu considered the pyramid network structure, which can extract multiple features; and Liu used a pyramid in the autoencoder network structure. In the sequence-to-sequence architecture, most methods use an RNN structure, and the difference lies in how they extract image features. Only Oh's method is not implemented, because it predicts the next frame of a video game based on the next action to play.
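The frame-index arithmetic described above generalizes to any source frame rate; a small sketch (the function name and defaults are ours):

```python
# Pick input frames 0.2 s apart regardless of the source frame rate,
# then predict the frame one further step ahead.
def sample_indices(fps, num_inputs=4, interval_s=0.2, start=1):
    step = int(round(fps * interval_s))          # frames per 0.2 s
    inputs = [start + i * step for i in range(num_inputs)]
    target = start + num_inputs * step
    return inputs, target

print(sample_indices(25))  # ([1, 6, 11, 16], 21)
print(sample_indices(10))  # ([1, 3, 5, 7], 9)
```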


TABLE 3. The performance comparison of the state-of-the-art approaches in next-frame prediction.

B. TECHNICAL ANALYSIS
From the results in Table 3, we can see that, in general, as the value of the MSE decreases, both the SSIM and the PSNR increase. Mathieu's method achieved an outstanding result with an SSIM of 0.885 on the UCFsports dataset, while Finn's method won first place on the KITTI, KTH, and UCF101 datasets. On the action datasets, some methods can achieve an SSIM of 0.80, but none of the methods' results are ideal in the prediction on the autonomous driving dataset KITTI. The motion of each object in an autonomous driving scene is dynamic and complex. Predicting the next frame in a vehicle driving scene is still a challenging problem.

Srivastava's method does not perform well since the fully connected network is not suitable for next-frame prediction: it cannot handle the diverse changes in the background, and its many parameters require too much computation. GANs are helpful in the approaches proposed by Denton, Mathieu, Villegas, and Oliu, where the desired predictions are guided by the discriminator. The reason for Mathieu's lower values is that the method achieves a good prediction when there is little motion, whereas in the moving regions it predicts frames with increased blurriness. As the results of Denton's approach show, simply using an autoencoder architecture to extract the features from the previous frames does not improve the prediction results. Finn's original approach predicts future frames based on the constraints of actions; without the action states, it achieves a low performance. Liu's method and Finn's method achieve a similar performance on the KITTI dataset. Since the data from the KITTI dataset are autonomous driving scenarios, the background is typically dynamic. The networks proposed by Lotter, Oliu, and Srivastava have a limited ability to predict frames with dynamic backgrounds.

The temporal information of the objects also forms a part of the objects' features. Oliu and Lotter combined CNN and RNN models that can recurrently extract motion features. The network always keeps the errors from each step to model the motion; after a few steps, the results are much better. The network of Oliu shares the information between the encoder and decoder to reduce the computational cost and avoids re-encoding the predictions when generating a sequence of frames.

C. DISCUSSION
Next-frame prediction is a new field of research for deep learning. In this paper, we have introduced the commonly used datasets, the state-of-the-art approaches and the quantitative evaluation metrics, and we have conducted experiments on four datasets. Although these networks perform well, the existing methods still need to be improved. We describe some directions of improvement for next-frame prediction.

First, the combination of multiple network structures could improve next-frame prediction. Different networks address different problems. For example, CNNs are used to model spatial information, while RNNs are used to solve the time sequence problem. The combination of an RNN and a CNN can handle dynamic backgrounds and extract the necessary features from the previous frames. For next-frame prediction with multi-object motion (the KITTI experiments), the current methods all achieved SSIMs of approximately 0.60, which indicates that there are large deviations in the generated images. In the image prediction for single-object motion on the action datasets, the structure of the pyramid autoencoder, as in Mathieu's method and Liu's method, is beneficial for maintaining the original feature information of the image. The structure of a CNN with an LSTM can achieve a high SSIM and PSNR, as with the methods proposed by Lotter, Oliu, and Finn. Finn used a multilayer ConvLSTM, while Lotter used a single layer of ConvLSTM. Although Oliu proposed a folded recurrent model, the essence of the network consists of an autoencoder and an RNN structure. These results show that recurrent networks will play a positive role in next-frame prediction. None of the researchers considered the time costs; reducing the number of parameters as much as possible for fast prediction is of great value.

Second, the proper design of loss functions is another direction of improvement. The most commonly used loss functions are the mean squared error, the GAN loss function, and the image gradient difference loss function. Most networks use a per-pixel loss, such as the mean squared error, the mean absolute error or the image gradient difference loss, to measure the actual differences among pixels in the images. As shown in Table 3, using the per-pixel loss does not achieve a high-quality generated image. The reason may be that the constraint from the per-pixel loss does not reflect the high-level features of the image. Different from the per-pixel loss, perceptual loss functions [54] are based on the differences between high-level image feature representations and have made a large contribution in the field of image style transfer. The features extracted by convolutional neural networks can be used as parts of the loss function. By comparing the feature values of the generated image passed through the convolutional layers with the feature values of the target image passed through the same layers, the generated image becomes more semantically similar to the target image. This is the main concept of the perceptual loss (a minimal sketch is given below). As the purpose of next-frame prediction is to reconstruct future frames, the perceptual loss can play a significant role in the prediction. For human movement prediction, previous researchers did not consider the movement limitations of each body part. We can set up the loss function based on the movement of the key points of the human body. During movement, the movement angles and distances between the key points have maximum and minimum bounds. Based on these bounds, we can adjust the loss function to optimize the network directly.
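A minimal sketch of such a perceptual loss, following the idea of [54], compares the prediction and the target in the feature space of a fixed, pretrained CNN rather than in pixel space. The use of torchvision's pretrained VGG16 and the cut-off after its first 16 feature layers are our assumptions, and the ImageNet input normalization is omitted for brevity:

```python
# Perceptual loss sketch: L2 distance between high-level CNN features.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the feature extractor stays fixed

def perceptual_loss(pred, target):
    # Compare feature maps instead of raw pixels.
    return F.mse_loss(vgg(pred), vgg(target))
```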


Third, the current next-frame prediction approaches are pixel-level predictions: each pixel value of both the moving objects and the static background is predicted. Such a prediction scheme needs a lot of computational resources. Differentiating the moving objects from the background can speed up the prediction. Additionally, combining the predicted motion with the original frames to perform next-frame prediction is another solution to improve the prediction efficiency. A good example is Villegas's work [14], which estimates the movement information of the human skeleton and transforms the skeletons into images to predict the next frame.

Fourth, the long-term prediction of future frames can be further improved. Most of the current next-frame prediction approaches can only predict short-term frames in the next one or two seconds. Regarding the predicted frames, to the best of our knowledge, the maximum number of predicted frames is 128. In order to achieve longer-term prediction, more kinds of input information could be useful, such as depth images (compensating for the 3D geometric information) and infrared images (compensating for weak lighting conditions). By properly fusing the prediction results through optimal estimation (e.g., a Kalman filter), a longer-term prediction may be achieved.

VI. CONCLUSION
Frame predictive learning is a powerful and useful way of understanding and modeling the dynamics of natural scenes. The long-term, accurate prediction of the movement of an object, animal, or person is crucial to future interactive human-machine interfaces and can be widely applied in many areas, including simulating and predicting future road events, proactively cooperating with humans (for robots), and decision making and reasoning for understanding human intentions. As the SSIM and PSNR results of the state-of-the-art approaches are less than 0.9 and 30, respectively, according to our experiments, we believe current research on next-frame prediction is still in the early stage. There is great potential for performance improvement in next-frame prediction.

REFERENCES
[1] H.-I. Lin and Y.-C. Huang, ''Ball trajectory tracking and prediction for a ping-pong robot,'' in Proc. 9th Int. Conf. Inf. Sci. Technol. (ICIST), Aug. 2019, pp. 222–227.
[2] Y. Miao, H. Dong, J. Al-Jaam, and A. El Saddik, ''A deep learning system for recognizing facial expression in real-time,'' ACM Trans. Multimedia Comput., Commun., Appl., vol. 15, no. 2, pp. 3301–3320, 2019.
[3] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, ''Video pixel networks,'' in Proc. Int. Conf. Mach. Learn., 2017, pp. 1771–1779.
[4] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh, ''Action-conditional video prediction using deep networks in Atari games,'' in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2863–2871.
[5] W. Lotter, G. Kreiman, and D. Cox, ''Deep predictive coding networks for video prediction and unsupervised learning,'' in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–18.
[6] X. Liang, L. Lee, W. Dai, and E. P. Xing, ''Dual motion GAN for future-flow embedded video prediction,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1762–1770.
[7] B. Klein, L. Wolf, and Y. Afek, ''A dynamic convolutional layer for short range weather prediction,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4840–4848.
[8] Y. Qiu, Y. Liu, J. Arteaga-Falconi, H. Dong, and A. El Saddik, ''EVM-CNN: Real-time contactless heart rate estimation from facial video,'' IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1778–1787, Jul. 2019.
[9] Y. Jiang, H. Dong, and A. El Saddik, ''Baidu Meizu deep learning competition: Arithmetic operation recognition using end-to-end learning OCR technologies,'' IEEE Access, vol. 6, pp. 60128–60136, 2018.
[10] V. Vukotic, S. L. Pintea, C. Raymond, G. Gravier, and J. V. Gemert, ''One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network,'' in Proc. Int. Conf. Image Anal. Process., 2017, pp. 140–151.
[11] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman, ''Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 91–99.
[12] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, ''Video frame synthesis using deep voxel flow,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4473–4481.
[13] M. Mathieu, C. Couprie, and Y. LeCun, ''Deep multi-scale video prediction beyond mean square error,'' in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–14.
[14] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, ''Learning to generate long-term future via hierarchical prediction,'' in Proc. Int. Conf. Mach. Learn., vol. 70, 2017, pp. 3560–3569.
[15] V. Michalski, R. Memisevic, and K. Konda, ''Modeling deep temporal dependencies with recurrent grammar cells,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1925–1933.
[16] N. Srivastava, E. Mansimov, and R. Salakhudinov, ''Unsupervised learning of video representations using LSTMs,'' in Proc. Int. Conf. Mach. Learn., 2015, pp. 1–10.
[17] E. Denton and V. Birodkar, ''Unsupervised learning of disentangled representations from videos,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4414–4423.
[18] M. Oliu, J. Selva, and S. Escalera, ''Folded recurrent neural networks for future video prediction,'' in Proc. Eur. Conf. Comput. Vis., 2018, pp. 716–731.
[19] C. Finn, I. Goodfellow, and S. Levine, ''Unsupervised learning for physical interaction through video prediction,'' in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 64–72.
[20] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, ''PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 879–888.
[21] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, ''PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning,'' in Proc. Int. Conf. Mach. Learn., 2018, pp. 5123–5132.
[22] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, ''Pyramid dilated deeper ConvLSTM for video salient object detection,'' in Proc. Eur. Conf. Comput. Vis., 2018, pp. 744–760.
[23] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, ''DNN-based prediction model for spatio-temporal data,'' in Proc. ACM Int. Conf. Adv. Geograph. Inf. Syst., 2016, pp. 1–4.
[24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo, ''Convolutional LSTM network: A machine learning approach for precipitation nowcasting,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 1, 2015, pp. 802–810.
[25] T. Vu, C. Olsson, I. Laptev, A. Oliva, and J. Sivic, ''Predicting actions from static scenes,'' in Proc. Eur. Conf. Comput. Vis., 2014, pp. 421–436.
[26] M. S. Ryoo, ''Human activity prediction: Early recognition of ongoing activities from streaming videos,'' in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1036–1043.
[27] T. Lan, T.-C. Chen, and S. Savarese, ''A hierarchical representation for future action prediction,'' in Proc. Eur. Conf. Comput. Vis., 2014, pp. 689–704.
[28] J. Walker, A. Gupta, and M. Hebert, ''Dense optical flow prediction from a static image,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2443–2451.
[29] J. Walker, A. Gupta, and M. Hebert, ''Patch to the future: Unsupervised visual prediction,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3302–3309.


[30] N. Deo and M. M. Trivedi, ''Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs,'' in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1179–1184.
[31] S. Hochreiter and J. Schmidhuber, ''Long short-term memory,'' Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[32] J. Xie, L. Xu, and E. Chen, ''Image denoising and inpainting with deep neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 341–349.
[33] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, ''Photo-realistic single image super-resolution using a generative adversarial network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
[34] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ''Image-to-image translation with conditional adversarial networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial nets,'' in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[36] A. Ng. (2010). Sparse Autoencoder. [Online]. Available: https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf
[37] D. J. Rezende, S. Mohamed, and D. Wierstra, ''Stochastic backpropagation and approximate inference in deep generative models,'' in Proc. Int. Conf. Mach. Learn., 2014, pp. 1–14.
[38] E. Mansimov, E. Parisotto, J. Ba, and R. Salakhutdinov, ''Generating images from captions with attention,'' in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–12.
[39] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, ''DRAW: A recurrent neural network for image generation,'' in Proc. Int. Conf. Mach. Learn., 2015, pp. 1462–1471.
[40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, ''Improved techniques for training GANs,'' in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
[41] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, ''InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,'' in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2172–2180.
[42] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, ''StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,'' in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 5907–5915.
[43] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ''Progressive growing of GANs for improved quality, stability, and variation,'' in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–26.
[44] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, ''Feature pyramid networks for object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[45] J. J. Hintz, ''Generative adversarial reservoirs for natural video prediction,'' M.S. thesis, Univ. Texas Austin, Austin, TX, USA, 2016.
[46] M. Arjovsky, S. Chintala, and L. Bottou, ''Wasserstein generative adversarial networks,'' in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
[47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, ''Improved training of Wasserstein GANs,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5767–5777.
[48] A. Radford, L. Metz, and S. Chintala, ''Unsupervised representation learning with deep convolutional generative adversarial networks,'' in Proc. Int. Conf. Learn. Represent., 2016, pp. 1–16.
[49] A. Newell, K. Yang, and J. Deng, ''Stacked hourglass networks for human pose estimation,'' in Proc. Eur. Conf. Comput. Vis., 2016, pp. 483–499.
[50] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, ''Decomposing motion and content for natural video sequence prediction,'' in Proc. Int. Conf. Learn. Represent., 2017, pp. 1–22.
[51] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, ''Challenging common assumptions in the unsupervised learning of disentangled representations,'' in Proc. Int. Conf. Mach. Learn., 2019, pp. 1–37.
[52] Y. Bengio, A. Courville, and P. Vincent, ''Representation learning: A review and new perspectives,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ''Image quality assessment: From error visibility to structural similarity,'' IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[54] J. Johnson, A. Alahi, and L. Fei-Fei, ''Perceptual losses for real-time style transfer and super-resolution,'' in Proc. Eur. Conf. Comput. Vis., 2016, pp. 694–711.

YUFAN ZHOU received the B.Eng. degree in railway traffic signaling and control from Southwest Jiaotong University, China, in 2017. He is currently pursuing the M.A.Sc. degree in electrical and computer engineering with the University of Ottawa. His research interests include artificial intelligence and multimedia.

HAIWEI DONG (Senior Member, IEEE) received the Dr.Eng. degree in computer science and systems engineering from Kobe University, Kobe, Japan, in 2008, and the M.Eng. degree in control theory and control engineering from Shanghai Jiao Tong University, Shanghai, China, in 2010. He was a Research Scientist with the University of Ottawa, Ottawa, ON, Canada; a Postdoctoral Fellow with New York University, New York, NY, USA; a Research Associate with the University of Toronto, Toronto; and a Research Fellow (PD) with the Japan Society for the Promotion of Science, Tokyo, Japan. He is currently a Principal Engineer with the Noah's Ark Lab of Huawei Technologies, Toronto, ON, Canada. His research interests include artificial intelligence, robotics, and multimedia.

ABDULMOTALEB EL SADDIK (Fellow, IEEE) is a Distinguished University Professor and a University Research Chair with the School of Electrical Engineering and Computer Science, University of Ottawa. He has supervised more than 120 researchers. He has coauthored ten books and more than 550 publications and chaired more than 50 conferences and workshops. His research focus is on the establishment of digital twins to facilitate the well-being of citizens using AI, the IoT, AR/VR, and 5G to allow people to interact in real time with one another as well as with their smart digital representations. He has received research grants and contracts totaling more than $20 M. He is an ACM Distinguished Scientist and a Fellow of the Engineering Institute of Canada and the Canadian Academy of Engineers. He has received several international awards, such as the IEEE I&M Technical Achievement Award, the IEEE Canada C.C. Gotlieb (Computer) Medal, and the A.G.L. McNaughton Gold Medal for important contributions to the field of computer engineering and science.
