
TABULAR TRANSFORMERS FOR MODELING MULTIVARIATE TIME SERIES

Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin
Jerret Ross, Ravi Nair, Erik Altman

IBM Research, T.J. Watson Research Center & MIT-IBM Watson AI Lab

ABSTRACT

Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential. Here we propose neural network models that represent tabular time series and can optionally leverage their hierarchical structure. This results in two architectures for tabular time series: one for learning representations that is analogous to BERT and can be pre-trained end-to-end and used in downstream tasks, and one that is akin to GPT and can be used for generation of realistic synthetic tabular sequences. We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations. Code and data are available at https://github.com/IBM/TabFormer.

Index Terms— Tabular time series, BERT, GPT

1. INTRODUCTION

Tabular datasets are ubiquitous across many industries, especially in vital sectors such as healthcare and finance. Such industrial datasets often contain sensitive information, raising privacy and confidentiality issues that preclude their public release and limit their analysis to methods that are compatible with an appropriate anonymization process. We can distinguish between two types of tabular data: static tabular data that corresponds to independent rows in a table, and dynamic tabular data that corresponds to tabular time series, also referred to as multivariate time series. The machine learning and deep learning communities have devoted considerable effort to learning from static tabular data, as well as to generating synthetic static tabular data that can be released as a privacy-compliant surrogate of the original data. On the other hand, less effort has been devoted to the more challenging dynamic case, where it is important to also account for the temporal component of the data. The purpose of this paper is to remedy this gap by proposing deep learning techniques to: 1) learn useful representations of tabular time series that can be used in downstream tasks such as classification or regression, and 2) generate realistic synthetic tabular time series.

Tabular time series represent a hierarchical structure that we leverage by endowing transformer-based language models with field-level transformers, which encode individual rows into embeddings that are in turn treated as embedded tokens passed to BERT [1]. This results in an alternative architecture for tabular time series encoding that can be pre-trained end-to-end for representation learning, which we call Tabular BERT (TabBERT). Another important contribution is adapting state-of-the-art (SOTA) generative language models, namely GPT [2], to produce realistic synthetic tabular data, which we call Tabular GPT (TabGPT). A key ingredient of our language metaphor in modeling tabular time series is the quantization of continuous fields, so that each field is defined on a finite vocabulary, as in language modeling.

As mentioned, static tabular data have been widely analyzed in the past, typically with feature engineering and classical learning schemes such as gradient boosting or random forests. Recently, [3] introduced TabNet, which uses attention to perform feature selection across fields and shows the advantages of deep learning over classical approaches. A more recent line of work [4, 5], concurrent to ours, deals with the joint processing of static tabular and textual data using transformer architectures, such as BERT, with the goal of querying tables with natural language. These works consider the static case, and to the best of our knowledge, our work is the first to address tabular time series using transformers.

On the synthetic generation side, a plethora of works [6, 7, 8, 9, 10] are dedicated to generating static tabular data using Generative Adversarial Networks (GANs), conditional GANs, and variational auto-encoders. [11, 12] argue for the importance of synthetic generation of financial tabular data in order to preserve user privacy and to allow for training on cloud-based solutions without compromising real users' sensitive information. Nevertheless, their generation schemes fall short of modeling the temporal dependency in the data. Our work addresses this crucial aspect in particular. In summary, the main contributions of our paper are:
• We propose Hierarchical Tabular BERT to learn representations of tabular time series that can be used in downstream tasks such as classification or regression.
• We propose TabGPT to synthesize realistic tabular time series data.
• We train our proposed models on a synthetic credit card transaction dataset, where the learned encodings are used for a downstream fraud detection task and for synthetic data generation. We also showcase our method on a public real-world pollution dataset, where the learned encodings are used to predict pollutant concentrations.
• We open-source our synthetic credit-card transactions dataset to encourage future research on this type of data. The code to reproduce all experiments in this paper is available at https://github.com/IBM/TabFormer. Our code is built within HuggingFace's framework [13].

[Figure 2: field transformers with random field masking encode each row Row_1 ... Row_T into row embeddings RE_1 ... RE_T; a sequence-encoding transformer module maps these to sequence embeddings SE_1 ... SE_T, from which the masked field tokens are predicted.]
Fig. 2: TabBERT: Field level masking and cross entropy.


2. TABBERT: UNSUPERVISED LEARNING OF MULTIVARIATE TIME SERIES REPRESENTATION

2.1. From Language Modeling to Tabular Time Series

Fig. 1: An example of sequential tabular data, where each row is a transaction. A to M are the fields of the transactions. Some of the fields are categorical, others are continuous, but through quantization we convert all fields into categorical. Each field is then processed to build its own local vocabulary. A single sample is defined as some number of contiguous transactions, for example rows 1 through 10, as shown in this figure.

In Fig. 1, we give an example of a tabular time series, that is, a sequence of card transactions for a particular user. Each row consists of fields that can be continuous or categorical. In order to unlock the potential of language modeling techniques for tabular data, we quantize continuous fields so that each field is defined on its own local finite vocabulary. We define a sample as a sequence of rows (transactions in this case). The main difference with NLP is that we have a sequence of structured rows, each consisting of fields defined on a local vocabulary. As introduced in previous sections, unsupervised learning of multivariate time series representations requires modeling both inter- and intra-transaction dependencies. In the next section, we show how to exploit this hierarchy in learning unsupervised representations for tabular time series.
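To make the quantization step concrete, the sketch below bins one continuous column into integer tokens drawn from its own local vocabulary. The quantile-based binning, the bin count, and the helper names are illustrative assumptions, not the repository's exact preprocessing.

```python
# A minimal sketch (not the repo's exact code): quantile-bin one continuous column
# into integer tokens over its own local, field-specific vocabulary.
import pandas as pd

def quantize_column(values: pd.Series, n_bins: int = 32):
    """Map a continuous column to integer tokens from a local vocabulary of size <= n_bins."""
    tokens = pd.qcut(values, q=n_bins, labels=False, duplicates="drop")
    vocab_size = int(tokens.max()) + 1          # tokens are 0 .. vocab_size - 1
    return tokens.astype(int), vocab_size

# Example: the 'Amount' field gets its own vocabulary, independent of the other fields.
amounts = pd.Series([3.50, 12.00, 250.00, 7.99, 42.00, 5.00])
amount_tokens, amount_vocab_size = quantize_column(amounts, n_bins=3)
```

Categorical fields would keep their observed values as their local vocabularies; only continuous fields need the binning step.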
2.2. Hierarchical Tabular BERT

In order to learn representations for multivariate tabular data, we use a recipe similar to the one employed for training language representations with BERT. The encoder is trained through masked language modeling (MLM), i.e., by predicting masked tokens. More formally, given a table with M rows and N columns (or fields), an input, t, to TabBERT is represented as a windowed series of T time-dependent rows,

t = [t_{i+1}, t_{i+2}, ..., t_{i+T}],    (1)

where T (≤ M) is the number of consecutive rows, selected with a window offset (or stride). TabBERT is a variant of BERT that accommodates the temporal nature of the rows in the tabular input. As shown in Fig. 2, TabBERT encodes the series of transactions in a hierarchical fashion. The field transformer processes rows individually, creating transaction/row embeddings. These transaction embeddings are then fed to a second-level transformer to create sequence embeddings. In other words, the field transformer tries to capture intra-transaction relationships (local), whereas the sequence transformer encodes inter-transaction relationships (global), capturing the temporal component of the data. Note that hierarchical transformers have already been proposed in NLP in order to hierarchically model documents [14, 15].

Many transformer models pre-trained on domain-specific data have recently been successfully applied to various downstream tasks. More specifically, BioBERT [16], VideoBERT [17], and ClinicalBERT [18] are pre-trained efficiently on various domains, such as biomedical text, YouTube videos, and clinical electronic health records, respectively. Representations from these BERT variants achieve SOTA results on tasks ranging from video captioning to hospital readmission. In order to ascertain the richness of the learned TabBERT representations, we study two downstream tasks: classification (Section 2.3) and regression (Section 2.4).
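To make the hierarchy concrete, here is a minimal PyTorch-style sketch of the two-level encoding: a field-level transformer pools each row's field embeddings into a row embedding, and a second transformer contextualizes the T row embeddings over time. The layer sizes, the mean pooling, and the single shared embedding table over the concatenated field vocabularies are simplifying assumptions, not the repository's exact implementation.

```python
# Hedged sketch of hierarchical encoding: field transformer -> row embeddings ->
# sequence transformer. Dimensions and pooling are assumptions for illustration.
import torch
import torch.nn as nn

class HierarchicalTabularEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.field_embed = nn.Embedding(vocab_size, d_model)
        field_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.field_transformer = nn.TransformerEncoder(field_layer, num_layers=1)
        seq_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.sequence_transformer = nn.TransformerEncoder(seq_layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T rows, N fields) of integer field tokens
        b, t, n = tokens.shape
        x = self.field_embed(tokens.view(b * t, n))       # (b*t, N, d)
        row_emb = self.field_transformer(x).mean(dim=1)   # local: intra-row relationships
        row_emb = row_emb.view(b, t, -1)                  # (b, T, d) row embeddings
        seq_emb = self.sequence_transformer(row_emb)      # global: inter-row relationships
        return seq_emb                                    # (b, T, d) time-wise embeddings

# Example: a batch of 2 samples, each a window of T=10 rows with 12 quantized fields.
enc = HierarchicalTabularEncoder(vocab_size=500)
out = enc(torch.randint(0, 500, (2, 10, 12)))
```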
2.3. TabBERT Features for Classification

Transaction Dataset One of the contributions of this work is the introduction of a new synthetic corpus of credit card transactions. The transactions are created using a rule-based generator where values are produced by stochastic sampling techniques, similar to the method followed by [19]. Due to privacy concerns, most existing public transaction datasets are either heavily anonymized or preserved through PCA-based transformations, which distort the real data distributions. The proposed dataset has 24 million transactions from 20,000 users. Each transaction (row) has 12 fields (columns) consisting of both continuous and discrete nominal attributes, such as merchant name, merchant address, transaction amount, etc.

Training TabBERT For training TabBERT on our transaction dataset, we create samples as sliding windows of 10 transactions, with a stride of 5. The continuous and categorical values are quantized, and a vocabulary is created, as described in Section 2.1. Note that during training we exclude the label column isFraud?, to avoid biasing the learned representation for the downstream fraud detection task. Similar to the strategies used by [1], we mask 15% of a sample's fields, replacing them with the [MASK] token, and predict the original field token with a cross entropy loss.
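As a rough illustration of this sample construction, the sketch below builds sliding windows of 10 rows with a stride of 5 and masks 15% of the field tokens for MLM. The reserved mask id, the -100 ignore label, and the helper names are assumptions for illustration, not the repository's exact code.

```python
# Hedged sketch: sliding windows (length 10, stride 5) over a user's quantized rows,
# then random masking of 15% of the field positions for MLM. MASK_ID is an assumed id.
import numpy as np

MASK_ID = 0          # assumed id reserved for the [MASK] token
WINDOW, STRIDE, MASK_RATE = 10, 5, 0.15

def make_windows(rows: np.ndarray) -> np.ndarray:
    """rows: (num_rows, num_fields) integer tokens -> (num_windows, WINDOW, num_fields)."""
    starts = range(0, len(rows) - WINDOW + 1, STRIDE)
    return np.stack([rows[s:s + WINDOW] for s in starts])

def mask_fields(window: np.ndarray, rng=np.random.default_rng(0)):
    """Return (masked_window, labels); labels are -100 where no prediction is required."""
    masked, labels = window.copy(), np.full(window.shape, -100)
    mask = rng.random(window.shape) < MASK_RATE
    labels[mask] = window[mask]      # the original tokens are the MLM targets
    masked[mask] = MASK_ID           # replace the selected fields with [MASK]
    return masked, labels

# Example: 40 rows of 12 quantized fields -> 7 windows of shape (10, 12).
windows = make_windows(np.random.randint(1, 100, size=(40, 12)))
masked, labels = mask_fields(windows[0])
```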
Fraud Detection Task For fraud detection, we create samples by combining 10 contiguous rows (with a stride of 10) in a time-dependent manner for each user. In total, there are 2.4M samples, with 29,342 labeled as fraudulent. Since in the real world fraudulent transactions are very rare events, a similar trend is observed in our synthetic data, resulting in an imbalanced, non-uniform distribution of fraudulent and non-fraudulent class labels. To account for this imbalance, during training we upsample the fraudulent class to roughly equalize the frequencies of both classes. We evaluate the performance of different methods using the binary F1 score, on a test set consisting of 480K samples. As a baseline, we use a multi-layer perceptron (MLP) trained directly on the embeddings of the raw features. In order to model temporal dependencies, we also use an LSTM network baseline on the raw embedded features. In both cases, we pool the encoder outputs at the individual row level to create Ei (see Fig. 2) before doing classification. In Tab. 1, we compare the baselines and the methods based on TabBERT features while using the same architectures for the prediction head. During MLP and LSTM training with TabBERT as the feature extractor, we freeze the TabBERT network, foregoing any update of its weights. As can be seen from the table, the inclusion of TabBERT features boosts the F1 score for the fraud detection task.

Features   Prediction Head   Fraud (F1)   PRSA (RMSE)
Raw        MLP               0.74         38.5
Raw        LSTM              0.83         43.3
TabBERT    MLP               0.76         34.2
TabBERT    LSTM              0.86         32.8

Table 1: Performance comparison on the classification task of fraud detection (Fraud) and the regression task of pollution prediction (PRSA) for two approaches: one based on TabBERT features and a baseline using raw data. We compare two architectures, MLP and LSTM, for the downstream tasks.
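The sketch below illustrates the frozen-feature setup behind the TabBERT rows of Table 1: embeddings are extracted with gradients disabled and only an LSTM prediction head is trained on the binary fraud label. It reuses the hypothetical HierarchicalTabularEncoder from the earlier sketch; the head sizes and names are assumptions.

```python
# Hedged sketch: freeze the pre-trained encoder and train only an LSTM prediction head.
# `HierarchicalTabularEncoder` is the illustrative class sketched in Section 2.2.
import torch
import torch.nn as nn

class LSTMFraudHead(nn.Module):
    def __init__(self, d_model: int = 64, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)   # fraud vs. non-fraud logits

    def forward(self, row_embeddings: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(row_embeddings)  # final hidden state: (1, batch, hidden)
        return self.classifier(h_n[-1])

encoder = HierarchicalTabularEncoder(vocab_size=500)  # pre-trained weights would be loaded here
for p in encoder.parameters():
    p.requires_grad = False                      # TabBERT-style weights stay frozen

head = LSTMFraudHead()
tokens = torch.randint(0, 500, (8, 10, 12))      # a batch of 10-row samples
with torch.no_grad():
    features = encoder(tokens)                   # (8, 10, d_model) frozen features
logits = head(features)                          # only the head receives gradients
```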
2.4. TabBERT Features for Regression Tasks

Pollution Dataset For the regression task, we use a public UCI dataset (Beijing PM2.5 Data) for predicting both PM2.5 and PM10 air concentrations at 12 monitoring sites, each containing around 35k entries (rows). Every row has 11 fields with a mix of both continuous and discrete values. For a detailed description of the data, please refer to [20]. Similar to the pre-processing steps for our transaction dataset, we quantize the continuous features, remove the targets (PM2.5 and PM10), and create samples by combining 10 time-dependent rows with a stride of 5. We use 45K samples for training and report a combined RMSE for both targets on a test set of 15K samples. As reported in Tab. 1, using TabBERT features shows a significant improvement in terms of RMSE over the case of using simple raw embedded features. This consistent performance gain when using TabBERT features for both classification and regression tasks underlines the richness of the representations learned from TabBERT.
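One plausible reading of "a combined RMSE for both targets" is a single RMSE computed over the stacked PM2.5 and PM10 errors, as sketched below; this interpretation and the function name are assumptions rather than the paper's stated definition.

```python
# Hedged sketch: one RMSE over both regression targets (PM2.5 and PM10) jointly.
# Treating the two targets as one stacked error vector is an assumed interpretation.
import numpy as np

def combined_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """y_true, y_pred: arrays of shape (num_samples, 2), one column per pollutant target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Example with dummy values.
y_true = np.array([[60.0, 80.0], [35.0, 50.0]])
y_pred = np.array([[55.0, 90.0], [40.0, 45.0]])
print(combined_rmse(y_true, y_pred))
```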
3. TABGPT: GENERATIVE MODELING OF MULTIVARIATE TIME SERIES TABULAR DATA

Another useful application of language modeling in the context of tabular time-series data is the preservation of data privacy. GPT models trained on large corpora have demonstrated human-level capabilities in the domain of text generation. In this work, we apply the generative capabilities of GPT as a proof of concept for creating synthetic tabular data that is close in distribution to the true data, with the advantage of not exposing any sensitive information. Specifically, we train a GPT model (referred to throughout as TabGPT) on user-level data from the credit card dataset in order to generate synthetic transactions that mimic a user's purchasing behavior. This synthetic data can subsequently be used in downstream tasks without the precautions that would typically be necessary when handling private information.

We begin, as with TabBERT, by quantizing the data to create a finite vocabulary for each field. To train the TabGPT model, we select specific users from the dataset. By ordering a user's transactions chronologically and segmenting them into sequences of ten transactions, the model learns to predict future behavior from past transactions, similar to how GPT language models are trained on text data to predict future tokens from past context. We apply this approach to two of the users that have a relatively high volume of transactions, each with ∼60k transactions. For each user, we train a separate TabGPT model, which is depicted in Fig. 3. Unlike with TabBERT, we do not employ the hierarchical structure of passing each field into a field-level transformer; rather, we pass sequences of transactions separated by a special [SEP] token directly to the GPT encoder network.
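The sketch below shows the kind of input such a model consumes: a window of quantized transactions flattened into one token sequence with a [SEP] token between rows. The token ids and the helper name are illustrative assumptions.

```python
# Hedged sketch: flatten a window of quantized rows into one token sequence,
# inserting an assumed [SEP] token id between consecutive transactions.
from typing import List

SEP_ID = 1   # assumed id reserved for the [SEP] token

def flatten_with_sep(rows: List[List[int]]) -> List[int]:
    """rows: ten transactions, each a list of field token ids -> one GPT input sequence."""
    sequence: List[int] = []
    for i, row in enumerate(rows):
        sequence.extend(row)
        if i < len(rows) - 1:
            sequence.append(SEP_ID)   # separate consecutive transactions
    return sequence

# Example: 3 rows of 4 field tokens each -> 3*4 field tokens + 2 separators.
window = [[11, 42, 7, 99], [12, 40, 7, 87], [13, 41, 8, 90]]
tokens = flatten_with_sep(window)   # length 14
```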
[Figure 3: the field tokens Field_11 ... Field_T3 of rows Row_1 ... Row_T are passed through the Tabular GPT encoder and causal decoder, which output the corresponding synthetic rows.]
Fig. 3: TabGPT: Synthetic Transaction GPT Generator.

After training, synthetic data is generated by again segmenting a user's transaction data into sequences of ten transactions, passing the first transaction of each group of ten to the model, and predicting the remaining nine. To evaluate a model's generative capabilities, we examine how it captures both the aggregate and time-dependent features of the data.
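A hedged sketch of that generation loop, using a HuggingFace GPT-2 model as an assumed stand-in for the trained TabGPT: the first transaction's tokens seed the model, and the remaining nine transactions' worth of tokens are sampled autoregressively. Model size, sampling settings, and the omission of [SEP] tokens are simplifications.

```python
# Hedged sketch of generation: seed with the first transaction of a window and sample
# the remaining nine. Model size and sampling settings are assumptions; [SEP] tokens
# between transactions are omitted here for brevity.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

FIELDS_PER_ROW, ROWS_PER_SAMPLE, VOCAB = 12, 10, 512
config = GPT2Config(vocab_size=VOCAB, n_positions=256, n_layer=4, n_head=4, n_embd=128)
model = GPT2LMHeadModel(config)   # in practice this would be the trained TabGPT model

seed = torch.randint(0, VOCAB, (1, FIELDS_PER_ROW))   # tokens of the first transaction
target_len = FIELDS_PER_ROW * ROWS_PER_SAMPLE         # tokens for all ten transactions
generated = model.generate(input_ids=seed,
                           max_length=target_len,
                           do_sample=True,
                           top_k=50,
                           pad_token_id=0)
# Drop the seed and reshape the sampled tokens into the nine synthetic transactions.
synthetic_rows = generated[0, FIELDS_PER_ROW:].reshape(ROWS_PER_SAMPLE - 1, FIELDS_PER_ROW)
```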
The quantization of non-categorical data, which enables the use of a finite vocabulary for each field, renders field-level evaluation of the fidelity of TabGPT to the real data more straightforward. Namely, for each field, we compute and compare histograms for both ground truth and generated data on an aggregate level over all timestamps. To measure the proximity of the true and synthetic distributions we calculate the χ2 distance between histograms, defined as

χ²(X, X′) = (1/2) Σ_{i=1}^{n} (x_i − x′_i)² / (x_i + x′_i),

where x_i, x′_i are columns from the corresponding transactions (i = 1..n) from the true (X) and generated (X′) distributions, respectively. In Fig. 4, we plot the results of this evaluation for the two selected users. Overall, we see that for both users, their respective TabGPT models are able to generate synthetic distributions that are similar to the ground truth for each feature of the data, even for columns with high entropy, such as Amount. The TabGPT model for user 1 produces distributions that are generally closer to ground truth, but for user 2, most column distributions also align closely.

Fig. 4: For each column in the tabular data, we compare the generated and ground truth distributions for the user's data rows. The entropy of each feature is represented by the bars and displayed on the left vertical axis, and the χ2 distance between real and synthetic data distributions is represented by the line and displayed on the right vertical axis.
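As a small illustration of this histogram comparison, the sketch below bins one real and one synthetic column on shared bin edges, normalizes the counts, and evaluates the χ2 distance above. The bin count and normalization are assumptions for illustration.

```python
# Hedged sketch: chi-squared distance between normalized histograms of one field,
# computed on shared bins. Bin count and normalization are illustrative choices.
import numpy as np

def chi2_distance(real: np.ndarray, synthetic: np.ndarray, n_bins: int = 20) -> float:
    bins = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=n_bins)
    h_real, _ = np.histogram(real, bins=bins)
    h_syn, _ = np.histogram(synthetic, bins=bins)
    p, q = h_real / h_real.sum(), h_syn / h_syn.sum()   # normalized histograms
    denom = p + q
    mask = denom > 0                                     # skip empty bins
    return float(0.5 * np.sum((p[mask] - q[mask]) ** 2 / denom[mask]))

# Example on dummy 'Amount' columns.
rng = np.random.default_rng(0)
real_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
synthetic_amounts = rng.lognormal(mean=3.1, sigma=1.0, size=5000)
print(chi2_distance(real_amounts, synthetic_amounts))
```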

While field distribution matching evaluates the fidelity of generated data to real data on an aggregate level, this analysis does not capture the sequential nature of the generation. Hence we use an additional metric that compares two datasets of time series, (t^a_{1,i}, ..., t^a_{T,i})_{i=1...N} and (t^b_{1,i}, ..., t^b_{T,i})_{i=1...N}. Inspired by the Fréchet Inception Distance (FID) [21] used in computer vision and the Fréchet InferSent Distance (FD) [22] in NLP, we use our TabBERT model to embed each real and generated sequence to a fixed-length vector, v^a_i = TabBERT((t^a_{1,i}, ..., t^a_{T,i})) and v^b_i = TabBERT((t^b_{1,i}, ..., t^b_{T,i})), where v^a_i is obtained by mean pooling all time-wise embeddings SE_t in TabBERT. Then we compute the mean and covariance of each dataset, (μ_a, Σ_a) and (μ_b, Σ_b), respectively. The FID score is defined as follows:

FID_{a,b} = ||μ_a − μ_b||_2^2 + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}).    (2)

FID                   Real User 1   Real User 2
Real      User 1      -             492.92
GPT-Gen   User 1      22.90         497.68
GPT-Gen   User 2      515.94        49.08

Table 2: FID between real and GPT-generated transactions.

FID scores between the transaction datasets for user 1 and user 2 are presented in Tab. 2. For the real user data, we see that the two users have different behaviors, with an FID of 492.95 between them. In contrast, the TabGPT-generated data (GPT-Gen) for user 1 matches the real user 1 more closely, as can be seen from the relatively low FID score. The same conclusion holds for GPT-Gen user 2. Interestingly, the cross distances between the generated user and the other real user are also maintained. The combination of the aggregate histogram and FID analyses indicates that TabGPT is able to learn the behavior of each user and to generate realistic synthetic transactions.
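A hedged sketch of the FID computation in Eq. (2) from two sets of pooled, fixed-length sequence embeddings, using scipy's matrix square root; the embedding dimensionality and the variable names are placeholders.

```python
# Hedged sketch: FID between two sets of fixed-length sequence embeddings (Eq. 2).
import numpy as np
from scipy.linalg import sqrtm

def fid(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """emb_a, emb_b: (num_sequences, dim) mean-pooled sequence embeddings."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    sigma_a = np.cov(emb_a, rowvar=False)
    sigma_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(sigma_a @ sigma_b)
    if np.iscomplexobj(covmean):       # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(sigma_a + sigma_b - 2.0 * covmean))

# Example with random embeddings standing in for pooled TabBERT outputs.
rng = np.random.default_rng(0)
real_emb = rng.normal(size=(1000, 64))
gen_emb = rng.normal(loc=0.1, size=(1000, 64))
print(fid(real_emb, gen_emb))
```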
4. CONCLUSION

In this paper, we introduce Hierarchical Tabular BERT and Tabular GPT for modeling multivariate time series. We also open-source a synthetic card transactions dataset and the code to reproduce our experiments. This type of modeling for sequential tabular data via transformers is made possible thanks to the quantization of the continuous fields of the tabular data. We show that the representations learned by TabBERT provide consistent performance gains in different downstream tasks. TabBERT features can be used in fraud detection in lieu of hand-engineered features, as they better capture the intra-dependencies between the fields as well as the temporal dependencies between rows. Finally, we show that TabGPT can reliably synthesize card transactions that can replace real data and alleviate the privacy issues encountered when training off premise or with cloud-based solutions [11, 12].
5. REFERENCES

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[2] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, “Language models are unsupervised multitask learners,” 2018.
[3] Sercan O. Arik and Tomas Pfister, “TabNet: Attentive interpretable tabular learning,” arXiv preprint arXiv:1908.07442, 2019.
[4] Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos, “TaPas: Weakly supervised table parsing via pre-training,” in Proceedings of the 58th ACL, July 2020, pp. 4320–4333, ACL.
[5] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel, “TaBERT: Pretraining for joint understanding of textual and tabular data,” in Annual Conference of the Association for Computational Linguistics (ACL), July 2020.
[6] Lei Xu, Synthesizing Tabular Data using Conditional GAN, Ph.D. thesis, Massachusetts Institute of Technology, 2020.
[7] Lei Xu and Kalyan Veeramachaneni, “Synthesizing tabular data using generative adversarial networks,” arXiv preprint arXiv:1811.11264, 2018.
[8] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni, “Modeling tabular data using conditional GAN,” in Advances in Neural Information Processing Systems, 2019, pp. 7335–7345.
[9] Anton Karlsson and Torbjörn Sjöberg, “Synthesis of tabular financial data using generative adversarial networks,” 2020.
[10] Ramiro Daniel Camino, Christian Hammerschmidt, et al., “Working with deep generative models and tabular data imputation,” 2020.
[11] Samuel Assefa, Danial Dervovic, Mahmoud Mahfouz, Tucker Balch, Prashant Reddy, and Manuela Veloso, “Generating synthetic data in finance: opportunities, challenges and pitfalls,” Challenges and Pitfalls (June 23, 2020), 2020.
[12] Dmitry Efimov, Di Xu, Luyang Kong, Alexey Nefedov, and Archana Anandakrishnan, “Using generative adversarial networks to synthesize artificial financial datasets,” 2020.
[13] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush, “HuggingFace’s Transformers: State-of-the-art natural language processing,” ArXiv, vol. abs/1910.03771, 2019.
[14] Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak, “Hierarchical transformers for long document classification,” in 2019 IEEE ASRU Workshop, IEEE, 2019, pp. 838–844.
[15] Xingxing Zhang, Furu Wei, and Ming Zhou, “HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization,” Florence, Italy, July 2019, pp. 5059–5069, Association for Computational Linguistics.
[16] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
[17] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid, “VideoBERT: A joint model for video and language representation learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7464–7473.
[18] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath, “ClinicalBERT: Modeling clinical notes and predicting hospital readmission,” arXiv preprint arXiv:1904.05342, 2019.
[19] Erik R. Altman, “Synthesizing credit card transactions,” arXiv preprint arXiv:1910.03033, 2019.
[20] Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen, “Assessing Beijing’s PM2.5 pollution: severity, weather impact, APEC and winter heating,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 471, no. 2182, pp. 20150257, 2015.
[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[22] Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly, “On accurate evaluation of GANs for language generation,” arXiv preprint arXiv:1806.04936, 2018.
