Database Meets Deep Learning: Challenges and


Wei Wang†, Meihui Zhang‡, Gang Chen§, H. V. Jagadish#, Beng Chin Ooi†, Kian-Lee Tan†

†National University of Singapore ‡Singapore University of Technology and Design
§Zhejiang University #University of Michigan

†{wangwei, ooibc, tankl}@comp.nus.edu.sg ‡meihui_zhang@sutd.edu.sg §cg@zju.edu.cn #

ABSTRACT

Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, including image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases and deep learning are different in terms of both techniques and applications. In this paper, we discuss research problems at the intersection of the two fields. In particular, we discuss possible improvements for deep learning systems from a database perspective, and analyze database applications that may benefit from deep learning techniques.

1. INTRODUCTION

In recent years, we have witnessed the success of numerous data-driven machine-learning-based applications. This has prompted the database community to investigate the opportunities for integrating machine learning techniques in the design of database systems and applications [29]. A branch of machine learning, called deep learning [22, 18], has attracted worldwide interest in recent years due to its excellent performance in multiple areas including speech recognition, image classification and natural language processing (NLP). The foundation of deep learning was established about twenty years ago in the form of neural networks. Its recent resurgence is mainly fueled by three factors: immense computing power, which reduces the time to train and deploy new models (e.g., Graphics Processing Units (GPUs) enable training systems to run much faster than those in the 1990s); massive (labeled) training datasets (e.g., ImageNet), which enable a more comprehensive knowledge of the domain to be acquired; and new deep learning models (e.g., AlexNet [20]), which improve the ability to capture data regularities.

Database researchers have been working on system optimization and large scale data-driven applications since the 1970s, which are closely related to the first two factors. It is natural to think about the relationships between databases and deep learning. First, are there any insights that the database community can offer to deep learning? It has been shown that larger training datasets and a deeper model structure improve the accuracy of deep learning models. However, the side effect is that the training becomes more costly. Approaches have been proposed to accelerate the training speed from both the system perspective [5, 19, 9, 28, 11] and the theory perspective [45, 12]. Since the database community has rich experience with system optimization, it would be opportune to discuss the applicability of database techniques for optimizing deep learning systems. For example, distributed computing and memory management are key database technologies. They are also central to deep learning.

Second, are there any deep learning techniques that can be adapted for database problems? Deep learning emerged from the machine learning and computer vision communities. Recently, it has been successfully applied to other domains, like NLP [13]. However, few studies have been conducted using deep learning techniques for database problems. This is partially because traditional database problems (like indexing, transaction and storage management) involve less uncertainty, whereas deep learning is good at predicting over uncertain events. Nevertheless, there are problems in databases, like knowledge fusion [10] and crowdsourcing [27], which are probabilistic problems. It is possible to apply deep learning techniques in these areas. We will discuss specific problems like querying interface, knowledge fusion, etc. in this paper.

The rest of this paper is organized as follows: Section 2 provides background information about deep learning models and training algorithms; Section 3 discusses the application of database techniques for optimizing deep learning systems. Section 4 describes research problems in databases where deep learning techniques may help to improve performance. Some final thoughts are presented in Section 5.

Figure 1: Stochastic Gradient Descent. (Steps: initialize parameters; read mini-batch data; compute gradients; update parameters.)

Figure 2: Data flow of Back-Propagation. (Nodes: input, inner-product with parameters W and b, loss; edges carry data and gradients.)

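As a rough illustration of the loop summarized in Figure 1, and of the forward (data) and backward (gradient) traversal in Figure 2, the following minimal sketch trains a single inner-product layer with parameters W and b using mini-batch SGD on a squared loss. It is not taken from any of the systems discussed in this paper: the function names, the toy dataset, the squared loss and the hyper-parameters (learning rate, batch size, number of steps) are all assumptions chosen for illustration.

```python
import random


def forward(w, b, x):
    """Inner-product layer: compute the output for one input vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sgd_train(data, dim, lr=0.05, batch_size=4, steps=2000, seed=0):
    """Mini-batch SGD on a mean squared loss (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: initialize the parameters with random values.
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    for _ in range(steps):
        # Step 2: read a mini-batch of data.
        batch = rng.sample(data, batch_size)
        # Step 3: compute gradients. The forward pass computes the data
        # of the layer; the backward pass propagates d(loss)/d(output)
        # to the parameters W and b.
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in batch:
            out = forward(w, b, x)
            d_out = 2.0 * (out - y) / batch_size  # gradient of the mean squared loss
            for i in range(dim):
                grad_w[i] += d_out * x[i]
            grad_b += d_out
        # Step 4: update the parameters along the negative gradient.
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy noise-free data generated from y = 2*x0 - 3*x1 + 1 (made-up target).
_rng = random.Random(1)
points = [[_rng.uniform(-1, 1), _rng.uniform(-1, 1)] for _ in range(64)]
data = [(x, 2 * x[0] - 3 * x[1] + 1) for x in points]
w, b = sgd_train(data, dim=2)
```

On this toy problem the learned w and b should approach the generating values (2, -3) and 1. A real deep learning model stacks many such layers and applies the same four steps, with the backward pass visiting the layers in reverse order.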
2. BACKGROUND

Deep learning refers to a set of machine learning models which try to learn high-level abstractions (or representations) of raw data through multiple feature transformation layers. Large training datasets and deep complex structures enhance the ability of deep learning models to learn effective representations for tasks of interest. There are three popular categories of deep learning models according to the types of connections between layers [22], namely feedforward models (directed connection), energy models (undirected connection) and recurrent neural networks (recurrent connection). Feedforward models, including the Convolutional Neural Network (CNN), propagate input features through each layer to extract high-level features. CNN is the state-of-the-art model for many computer vision tasks. Energy models, including the Deep Belief Network (DBN), are typically used to pre-train other models, e.g., feedforward models. The Recurrent Neural Network (RNN) is widely used for modeling sequential data. Machine translation and language modeling are popular applications of RNN.

Before deploying a deep learning model, the model parameters involved in the transformation layers need to be trained. The training turns out to be a numeric optimization procedure to find parameter values that minimize the discrepancy (loss function) between the expected output and the real output. Stochastic Gradient Descent (SGD) is the most widely used training algorithm. As shown in Figure 1, SGD initializes the parameters with random values, and then iteratively refines them based on the computed gradients with respect to the loss function. There are three commonly used algorithms for gradient computation corresponding to the three model categories above: Back Propagation (BP), Contrastive Divergence (CD) and Back Propagation Through Time (BPTT). By regarding the layers of a neural net as nodes of a graph, these algorithms can be evaluated by traversing the graph in certain sequences. For instance, the BP algorithm is illustrated in Figure 2, where a simple feedforward model is trained by traversing along the solid arrows to compute the data (feature) of each layer, and along the dashed arrows to compute the gradient of each layer and each parameter (W and b).

3. DATABASES TO DEEP LEARNING

In this section, we discuss the optimization techniques used in deep learning systems, and research opportunities from the perspective of databases.

3.1 Stand-alone Training

Currently, the most effective approach for improving the training speed of deep learning models is to use Nvidia GPUs with the cuDNN library. Researchers are also working on other hardware, e.g., FPGA [21]. Besides exploiting advancements in hardware technology, operation scheduling and memory management are two important components to consider.

3.1.1 Operation Scheduling

Training algorithms of deep learning models typically involve expensive linear algebra operations as shown in Figure 3, where the matrices W1 and W2 could be larger than 4096 × 4096. Operation scheduling first detects the data dependencies of operations and then places the operations without dependencies onto executors, e.g., CUDA streams and CPU threads. Taking the operations in Figure 3 as an example, a1 and a2 could be computed in parallel because they have no dependencies. The first step could be done statically based on the dataflow graph or dynamically [3] by analyzing the order of read and write operations. Databases face this kind of problem in optimizing transaction execution [44] and query plans. Those solutions should be considered for deep learning systems. For instance, databases use cost models to estimate the cost of query plans. For deep learning, we may also create a cost

model to find an optimal operation placing strategy for the second step of operation scheduling, given fixed computing resources including executors and memory.

Figure 3: Sample operations from a deep learning model.

3.1.2 Memory Management

Deep learning models are becoming larger and larger, and already occupy a huge amount of memory space. For example, the VGG model [32] cannot be trained on normal GPU cards due to memory size constraints. Many approaches have been proposed towards reducing memory consumption. Shorter data representation, e.g., 16-bit float [7], is now supported by CUDA. Memory sharing is an effective approach for memory saving [3]. Taking Figure 3 as an example, the input and output of the sigmoid function share the same variable and thus the same memory space. Such operations are called 'in-place' operations. Recently, two approaches were proposed to trade off computation time for memory. Swapping memory between GPU and CPU resolves the problem of small GPU memory and large model size by swapping variables out to CPU and then swapping them back manually [8]. Another approach drops some variables to free memory and recomputes them when necessary based on the static dataflow graph [4].

Memory management is a hot topic in the database community with a significant amount of research towards in-memory databases [35, 46], including locality, paging and cache optimization. To elaborate, the paging strategies could be useful for deciding when and which variable to swap. In addition, failure recovery in databases is similar to the idea of the dropping and recomputing approach, hence the logging techniques in databases could be considered. If all operations (and execution time) are logged, we can then do runtime analysis without the static dataflow graph. Other techniques, including garbage collection and memory pool, would also be useful for deep learning systems, especially for GPU memory management.

3.2 Distributed Training

Distributed training is a natural solution for accelerating the training speed of deep learning models. The parameter server architecture [9] is typically used, in which the workers compute parameter gradients and the servers update the parameter values after receiving gradients from workers. There are two basic parallelism schemes for distributed training, namely, data parallelism and model parallelism. In data parallelism, each worker is assigned a data partition and a model replica, while for model parallelism, each worker is assigned a partition of the model and the whole dataset. The database community has a long history of working on distributed environments, ranging from parallel databases [23] and peer-to-peer systems [37] to cloud computing [25]. We will discuss some research problems relevant to databases arising from distributed training in the following paragraphs.

3.2.1 Communication and Synchronization

Given that deep learning models have a large set of parameters, the communication overhead between workers and servers is likely to be the bottleneck of a training system, especially when the workers are running on GPUs, which decrease the computation time. In addition, for large clusters, the synchronization overhead between workers can be significant. Consequently, it is important to investigate efficient communication protocols for both single-node multiple GPU training and training over a large cluster. Possible research directions include: a) compressing the parameters and gradients for transmission [30]; b) organizing servers in an optimized topology to reduce the communication burden of each single node, e.g., tree structure [15] and AllReduce structure [42] (all-to-all connection); c) using more efficient networking hardware like RDMA [5].

3.2.2 Concurrency and Consistency

Concurrency and consistency are critical concepts in databases. For distributed training of deep learning models, they also matter. Currently, both declarative programming (e.g., Theano and TensorFlow) and imperative programming (e.g., Caffe and SINGA) have been adopted in existing systems for concurrency implementation. Most deep learning systems use threads and locks directly. Other concurrency implementation methods like the actor model (good at failure recovery), co-routines and communicating sequential processes have not been explored.

Sequential consistency (from synchronous training) and eventual consistency (from asynchronous training) are typically used for distributed deep learning. Both approaches have scalability issues [38]. Recently, there have been studies on training convex models (deep learning models are non-linear and non-convex) using a value bounded consistency model [41]. Researchers are starting to investigate the influence of consistency models on distributed training [15, 16, 2]. There remains much research to be done on how to provide flexible consistency models for distributed training, and how each consistency model affects the scalability of the system, including communication overhead.

3.2.3 Fault Tolerance

Database systems have good durability via logging (e.g., command log) and checkpointing. Current deep learning systems recover the training from crashes mainly based on checkpointing files [11]. However, frequent checkpointing would incur vast overhead. In contrast with database systems, which enforce strict consistency in transactions, the SGD algorithm used by deep learning training systems can tolerate a certain degree of inconsistency. Therefore, logging is not a must. How to exploit the SGD properties and system architectures to implement fault tolerance efficiently is an interesting problem. Considering that distributed training would replicate the model status, it is thus possible to recover from a replica instead of checkpointing files. Robust frameworks (or concurrency models), like the actor model, could be adopted to implement this kind of failure recovery.

3.3 Existing Systems

A summary of existing systems in terms of the above mentioned optimization aspects is listed in Table 1. Many researchers have extended Caffe [19] with ad hoc optimizations, including memory swapping and communication optimization. However, the official version is not well optimized. Similarly, Torch [6] itself provides limited support for distributed training. Mxnet [3] has optimization for both memory and operation scheduling. Theano [1] is typically used for stand-alone training. TensorFlow [11] has the potential for the aforementioned static optimization based on the dataflow graph.

Table 1: Summary of optimization techniques used in existing systems as of July 2016.

                          SINGA   Caffe   Mxnet   TensorFlow   Theano   Torch
1. operation scheduling   yes     no      yes     -            -        no
2. memory management      d+a+p   i       d+s     p            p        -
3. parallelism            d+m     d       d+m     d+m          -        d+m
4. consistency            s+a+h   s/a     s+a+h   s+a+h        -        s

1. yes: available; no: not available. 2. d: dynamic; a: swap; p: memory pool; i: in-place operation; s: static. 3. d: data parallelism; m: model parallelism. 4. s: synchronous; a: asynchronous; h: hybrid. -: unknown.

We are optimizing the Apache incubator SINGA system [28] starting from version 1.0. For stand-alone training, cost models are explored for runtime operation scheduling. Memory optimization including dropping, swapping and garbage collection with memory pool will be implemented. OpenCL is supported to run SINGA on a wide range of hardware including GPU, FPGA and ARM. For distributed training, SINGA (V0.3) has done much work on flexible parallelism and consistency, hence the focus would be on optimization of communication and fault-tolerance, which are missing in almost all systems.

4. DEEP LEARNING TO DATABASES

Deep learning applications, such as computer vision and NLP, may appear very different from database applications. However, the core idea of deep learning, known as feature (or representation) learning, is applicable to a wide range of applications. Intuitively, once we have effective representations for entities, e.g., images, words, table rows or columns, we can compute entity similarity, perform clustering, train prediction models, and retrieve data with different modalities [40, 39], etc. We shall highlight a few deep learning models that could be adapted for database applications below.

4.1 Query Interface

Natural language query interfaces have been attempted for decades [24], because of their great desirability, particularly for non-expert database users. However, it is challenging for database systems to interpret (or understand) the semantics of natural language queries. Recently, deep learning models have achieved state-of-the-art performance for NLP tasks [13]. Moreover, RNN has been shown to be able to learn structured output [34, 36]. As one solution, we can apply RNN models for parsing natural language queries to generate SQL queries, and refine them using existing database approaches. For instance, heuristic rules could be applied to correct grammar errors in the generated SQL queries. The

challenge is that a large amount of (labeled) training samples is required to train the model. One possible solution is to train a baseline model with a small dataset, and gradually refine it with users' feedback. For instance, users could help correct the generated SQL query, and this feedback essentially serves as labeled data for subsequent training.

4.2 Query Plans

Query plan optimization is a traditional database problem. Most current database systems use complex heuristics and cost models to generate the query plan. According to [17], each query plan of a parametric SQL query template has an optimality region. As long as the parameters of the SQL query are within this region, the optimal query plan does not change. In other words, query plans are insensitive to small variations of the input parameters. Therefore, we can train a query planner which learns from a set of pairs of SQL queries and optimal plans to generate (similar) plans for new (similar) queries. To elaborate, we can learn an RNN model that accepts the SQL query elements and meta-data (like buffer size and primary key) as input, and generates a tree structure [36] representing the query plan. Reinforcement learning (like AlphaGo [31]) could also be incorporated to train the model on-line using the execution time and memory footprint as the reward. Note that approaches purely based on deep learning models may not be very effective. In particular, the training dataset may not be comprehensive enough to include all query patterns, e.g., some predicates could be missing in the training datasets. To solve these problems, a better approach would be to combine database solutions and deep learning.

4.3 Crowdsourcing and Knowledge Bases

Many crowdsourcing [43] and knowledge base [10] applications involve entity extraction, disambiguation and fusion problems, where the entity could be a row of a database, a node in a graph, etc.

[...] we regard each block as a pixel of one image, then deep learning models, e.g., CNN, could be exploited to extract the spatial locality between nearby blocks. For instance, if we have the real-time location data (e.g., GPS data) of moving objects, we could learn a CNN model to capture the density relationships of nearby areas for predicting the traffic congestion for a future time point. When temporal data is modeled as features over a time matrix, deep learning models, e.g., RNN, can be designed to model time dependency and predict the occurrence in a future time point. A particular example would be disease progression modeling [26] based on historical medical records, where doctors would want to estimate the onset of certain severity of a known disease.

5. CONCLUSIONS

In this paper, we have discussed databases and deep learning. Databases have many techniques for optimizing system performance, while deep learning is good at learning effective representations for data-driven applications. We note that these two "different" areas share some common techniques for improving the system performance, such as memory optimization and parallelism. We have discussed some possible improvements for deep learning systems using database techniques, and research problems applying deep learning techniques in database applications. Let us not miss the opportunity to contribute to the exciting challenges ahead!

6. ACKNOWLEDGEMENT

We would like to thank Divesh Srivastava for his valuable comments. This work is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its Competitive Research Programme (CRP Award No. NRF-CRP8-2011-08). Meihui Zhang is supported by SUTD Start-up Research Grant under Project No. SRG ISTD 2014 084.
