Image-based table recognition: data, model, and evaluation
Xu Zhong, Elaheh ShafieiBavani, Antonio Jimeno Yepes
IBM Research Australia
60 City Road, Southbank, VIC 3006, Australia
peter.zhong@au1.ibm.com, elaheh.shafieibavani@ibm.com, antonio.jimeno@au1.ibm.com
arXiv:1911.10683v5 [cs.CV] 4 Mar 2020
Abstract—Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g. Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop and release the largest publicly available table recognition dataset PubTabNet1, containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in the PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 9.7% absolute TEDS score.

I. INTRODUCTION

Information in tabular format is prevalent in all sorts of documents. Compared to natural language, tables provide a way to summarize large quantities of data in a more compact and structured format. Tables also provide a format to assist readers with finding and comparing information. An example of the relevance of tabular information in the biomedical domain is in the curation of genetic databases, in which just between 2% and 8% of the information was available in the narrative part of the article compared to the information available in tables or files in tabular format [1].

Tables in documents are typically formatted for human understanding, and humans are generally adept at parsing table structure, identifying table headers, and interpreting relations between table cells. However, it is challenging for a machine to understand tabular data in unstructured formats (e.g. PDF, images) due to the large variability in their layout and style. The key step of table understanding is to represent the unstructured tables in a machine-readable format, where the structure of the table and the content within each cell are encoded according to a pre-defined standard. This is often referred to as table recognition [2].

This paper solves the following three problems in image-based table recognition, where the structured representations of tables are reconstructed solely from image input:

• Data We provide a large-scale dataset PubTabNet, which consists of over 568k images of heterogeneous tables extracted from the scientific articles (in PDF format) contained in PMCOA. By matching the metadata of the PDFs with the associated structured representation (provided by PMCOA2 in XML format), we automatically annotate each table image with information about both the structure of the table and the text within each cell (in HTML format).

• Model We develop a novel attention-based encoder-dual-decoder (EDD) architecture (see Fig. 1) which consists of an encoder, a structure decoder, and a cell decoder. The encoder captures the visual features of input table images. The structure decoder reconstructs table structure and helps the cell decoder to recognize cell content. Our EDD model is trained on PubTabNet and demonstrates superior performance compared to existing table recognition methods. The error analysis shows potential enhancements to the current EDD model for improved performance.

• Evaluation By modeling tables as a tree structure, we propose a new tree-edit-distance-based evaluation metric for image-based table recognition. We demonstrate that our new metric is superior to the metric [3] commonly used in literature and competitions.

II. RELATED WORK

A. Data

Analyzing tabular data in unstructured documents focuses mainly on three problems: i) table detection: localizing the bounding boxes of tables in documents, ii) table structure recognition: parsing only the structural (row and column layout) information of tables, and iii) table recognition: parsing both the structural information and content of table cells. Table I compares the datasets that have been developed to

1 https://github.com/ibm-aur-nlp/PubTabNet
2 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
Fig. 1: EDD architecture. The encoder is a convolutional neural network which captures the visual features of the input table
image. As and Ac are attention networks for the structure decoder and cell decoder, respectively. Rs and Rc are recurrent units
for the structure decoder and cell decoder, respectively. The structure decoder reconstructs table structure and helps the cell
decoder to generate cell content. The output of the structure decoder and the cell decoder is merged to obtain the HTML
representation of the input table image.
address one or more of these three problems. The PubTabNet dataset and the EDD model we develop in this paper aim at the image-based table recognition problem. Compared to other existing datasets for table recognition (e.g. SciTSR3, Table2Latex [4], and TIES [5]), PubTabNet has three key advantages:

1) The tables are typeset by the publishers of over 6,000 journals in PMCOA, which offers considerably more diversity in table styles than other table datasets.
2) Cells are categorized into headers and body cells, which is important when retrieving information from tables.
3) The format of the targeted output is HTML, which can be directly integrated into web applications. In addition, tables in HTML format are represented as a tree structure. This enables the new tree-edit-distance-based evaluation metric that we propose in Section V.

TABLE I: Datasets for Table Detection (TD), Table Structure Recognition (TSR) and Table Recognition (TR).

Dataset               | TD | TSR | TR | # tables
Marmot [6]            | ✓  | ✗   | ✗  | 958
PubLayNet [7]         | ✓  | ✗   | ✗  | 113k
DeepFigures [8]       | ✓  | ✗   | ✗  | 1.4m
ICDAR2013 [2]         | ✓  | ✓   | ✓  | 156
ICDAR2019 [9]         | ✓  | ✓   | ✗  | 3.6k
UNLV [10]             | ✓  | ✓   | ✗  | 558
TableBank4            | ✓  | ✓   | ✗  | 417k (TD), 145k (TSR)
SciTSR3               | ✗  | ✓   | ✓  | 15k
Table2Latex [4]       | ✗  | ✓   | ✓  | 450k
Synthetic data in [5] | ✗  | ✓   | ✓  | Unbounded
PubTabNet             | ✗  | ✓   | ✓  | 568k

3 https://github.com/Academic-Hammer/SciTSR
4 https://github.com/doc-analysis/TableBank
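Because the targeted output is HTML, every PubTabNet annotation maps naturally onto a tree. As a minimal sketch of this view (our own illustration using only Python's standard html.parser, not code from the paper), the HTML of a table can be parsed into a nested node structure:

```python
from html.parser import HTMLParser

class TableTreeBuilder(HTMLParser):
    """Builds a nested-dict tree from table HTML (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)   # attach to current parent
        self.stack.append(node)                   # descend into the new node

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()                      # ascend to the parent

    def handle_data(self, data):
        self.stack[-1]["text"] += data            # cell content as text

def table_tree(html):
    builder = TableTreeBuilder()
    builder.feed(html)
    return builder.root["children"][0]            # the <table> node

tree = table_tree(
    '<table><thead><tr><td colspan="2">Dog</td><td>Cat</td></tr></thead>'
    '<tbody></tbody></table>')
print(tree["tag"])                                # table
print([c["tag"] for c in tree["children"]])       # ['thead', 'tbody']
```

This tree view is what makes a tree-edit-distance-based metric applicable to table recognition output.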
B. Model

Traditional table detection and recognition methods rely on pre-defined rules [11]–[16] and statistical machine learning [17]–[21]. Recently, deep learning has exhibited great performance in image-based table detection and structure recognition. Hao et al. used a set of primitive rules to propose candidate table regions and a convolutional neural network to determine whether the regions contain a table [22]. Fully-convolutional neural networks, followed by a conditional random field, have also been used for table detection [23]–[25]. In addition, deep neural networks for object detection, such as Faster-RCNN [26], Mask-RCNN [27], and YOLO [28], have been exploited for table detection and row/column segmentation [7], [29]–[31]. Furthermore, graph neural networks have been used for table detection and recognition by encoding document images as graphs [5], [32].

There are several tools (see Table II) that can convert tables in text-based PDF format into structured representations. However, there is limited work on image-based table recognition. The attention-based encoder-decoder was first proposed by Xu et al. for image captioning [33]. Deng et al. extended it by adding a recurrent layer in the encoder to capture long horizontal spatial dependencies, converting images of mathematical formulas into LaTeX representation [34]. The same model was trained on the Table2Latex [4] dataset to convert table images into LaTeX representation. As shown in [4] and in our experimental results (see Table II), the efficacy of this model on image-based table recognition is mediocre. This paper considerably improves the performance of the attention-based encoder-decoder method on image-based table recognition with a novel EDD architecture. Our model differs from other existing dual-decoder architectures [35], [36], where the two decoders are independent from each other. In our model, the cell decoder is triggered only when the structure decoder generates a new cell. Meanwhile, the hidden state of the structure decoder is sent to the cell decoder to help it place its attention on the corresponding cell in the table image.

C. Evaluation

The evaluation metric proposed in [3] is commonly used in table recognition literature and competitions. This metric first flattens the ground truth and the recognition result of a table into a list of pairwise adjacency relations between non-empty cells. Then precision, recall, and F1-score can be computed by comparing the lists. This metric is simple but has two obvious problems: 1) as it only checks immediate adjacency relations between non-empty cells, it cannot detect errors caused by empty cells and misalignment of cells beyond immediate neighbors; 2) as it checks relations by exact match5, it does not have a mechanism to measure fine-grained cell content recognition performance. In order to address these two problems, we propose a new evaluation metric: Tree-Edit-Distance-based Similarity (TEDS). TEDS solves problem 1) by examining recognition results at the global tree-structure level, allowing it to identify all types of structural errors; and problem 2) by computing the string-edit-distance when the tree-edit operation is node substitution.

5 Both cells are identical and the direction matches.

III. AUTOMATIC GENERATION OF PUBTABNET

PMCOA contains over one million scientific articles in both unstructured (PDF) and structured (XML) formats. A large table recognition dataset can be automatically generated if the corresponding location of the table nodes in the XML can be found in the PDF. Zhong et al. proposed an algorithm to match the XML and PDF representations of the articles in PMCOA, which automatically generated the PubLayNet dataset for document layout analysis [7]. We use their algorithm to extract the table regions from the PDF for the table nodes in the XML. The table regions are converted to images with a 72 pixels per inch (PPI) resolution. We use this low PPI setting to relax the requirement of our model for high-resolution input images. For each table image, the corresponding table node (in HTML format) is extracted from the XML as the ground truth annotation.

It is observed that the algorithm generates erroneous bounding boxes for some tables; hence we use a heuristic to automatically verify the bounding boxes. For each annotation, the text within the bounding box is extracted from the PDF and compared with that in the annotation. The bounding box is considered to be correct if the cosine similarity of the term frequency-inverse document frequency (Tf-idf) features of the two texts is greater than 90% and the lengths of the two texts differ by less than 10%. In addition, to improve the learnability of the data, we remove rare tables which contain any cell that spans over 10 rows or 10 columns, or any character that occurs less than 50 times in all the tables. Tables of which the annotation contains math and inline-formula nodes are also removed, as we found they do not have a consistent XML representation.

After filtering the table samples, we curate the HTML code of the tables to remove unnecessary variations. First, we remove the nodes and attributes that are not reconstructable from the table image, such as hyperlinks and definitions of acronyms. Second, table header cells are defined as th nodes in some tables, but as td nodes in others. We unify the definition of header cells as td nodes, which preserves the header identity of the cells as they are still descendants of the thead node. Third, all the attributes except ‘rowspan’ and ‘colspan’ in td nodes are stripped, since they control the appearance of the tables in web browsers, which does not match the table image. These curations lead to consistent and clean HTML code and make the data more learnable.

Finally, the samples are randomly partitioned into 60%/20%/20% training/development/test sets. The training set contains 548,592 samples. As only a small proportion of tables contain spanning (multi-column or multi-row) cells, the evaluation on the raw development and test sets would be strongly biased towards tables without spanning cells. To better evaluate how a model performs on complex table structures, we create more balanced development and test sets by randomly drawing 5,000 tables with spanning cells and 5,000 tables without spanning cells from the corresponding raw set.

IV. ENCODER-DUAL-DECODER (EDD) MODEL

Fig. 1 shows the architecture of the EDD model, which consists of an encoder, an attention-based structure decoder, and an attention-based cell decoder. The use of two decoders is inspired by two intuitive considerations: i) table structure recognition and cell content recognition are two distinctively

[Figure: tokenization of the HTML code of a table into structural tokens and cell tokens. A header cell ‘Dog<sup>a</sup>’ spanning two columns yields the structural tokens ‘<td’, ‘colspan="2"’, ‘>’ and ‘</td>’, and the cell tokens ‘D’, ‘o’, ‘g’, ‘<sup>’, ‘a’, ‘</sup>’.]
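The tokenization illustrated above can be sketched as follows. This is our own illustrative tokenizer (the paper does not publish its implementation): a regular expression splits the table HTML into tags and text runs, structural tokens describe the layout (with opening cell tags broken into ‘<td’, attributes, ‘>’), and cell tokens are the characters of the cell content plus whole inline tags:

```python
import re

# Split table HTML into structural tokens (table layout) and per-cell
# content tokens: '<td colspan="2">' becomes the structural tokens
# '<td', 'colspan="2"', '>', while the content 'Dog<sup>a</sup>'
# becomes ['D', 'o', 'g', '<sup>', 'a', '</sup>'].
def tokenize(table_html):
    structure, cells = [], []
    pieces = re.findall(r"<[^>]+>|[^<]+", table_html)  # tags or text runs
    cell, in_cell = [], False
    for p in pieces:
        if p in ("<td>", "</td>") or p.startswith("<td "):
            if p == "</td>":
                cells.append(cell)
                cell, in_cell = [], False
                structure.append("</td>")
            elif p == "<td>":
                in_cell = True
                structure.append("<td>")
            else:                       # e.g. '<td colspan="2">'
                in_cell = True
                structure.extend(["<td", p[4:-1].strip(), ">"])
        elif in_cell:
            # inline tags stay whole; text is split into characters
            cell.extend([p] if p.startswith("<") else list(p))
        else:
            structure.append(p)
    return structure, cells

s, c = tokenize('<tr><td colspan="2">Dog<sup>a</sup></td><td>Cat</td></tr>')
print(s)  # ['<tr>', '<td', 'colspan="2"', '>', '</td>', '<td>', '</td>', '</tr>']
print(c)  # [['D', 'o', 'g', '<sup>', 'a', '</sup>'], ['C', 'a', 't']]
```

Splitting the targets this way is what lets the structure decoder and the cell decoder each work on its own vocabulary.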
Fig. 3: Example of cell shift perturbation, where 90% of the cells in the first row are shifted. TEDS = 34.9%. Adjacency relation F1 score = 80.3%.

Fig. 4: Example of cell content perturbation at the 10% perturbation level. TEDS = 93.2%. Adjacency relation F1 score = 19.1%.

(e) Acrobat® on PDF (TEDS = 74.8%) (f) Acrobat® on Image (TEDS = 64.2%)

Fig. 6: Table recognition results of EDD and 7 baseline approaches on an example input table which has a complex header structure (4 multi-row (span=3) cells, 2 multi-column (span=3) cells, and 3 normal cells). Our EDD model perfectly recognizes the complex structure and cell content of the table, whereas the baselines struggle with the complex table header.
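TEDS scores like those quoted in the captions above follow the scheme of Section II-C: a tree edit distance between the two table trees, normalized by the size of the larger tree, with node substitution scored by string edit distance on cell content. The following is a minimal self-contained sketch, not the released evaluation code: it uses unit insert/delete costs, a plain memoized recursive ordered-forest edit distance (fine for the tiny trees shown here), and a simplified substitution cost relative to the paper's exact definition:

```python
from functools import lru_cache

# Trees are nested tuples (tag, text, children). Insert/delete cost 1;
# substituting one <td> for another costs the normalized character
# edit distance of their text.
def lev(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sub_cost(x, y):
    if x[0] != y[0]:
        return 1.0
    if x[0] == "td" and (x[1] or y[1]):
        return lev(x[1], y[1]) / max(len(x[1]), len(y[1]))
    return 0.0

@lru_cache(maxsize=None)
def fdist(F, G):
    # recursive ordered-forest edit distance over tuples of trees
    if not F and not G:
        return 0.0
    if not F:
        return fdist(F, G[:-1] + G[-1][2]) + 1
    if not G:
        return fdist(F[:-1] + F[-1][2], G) + 1
    f, g = F[-1], G[-1]
    return min(
        fdist(F[:-1] + f[2], G) + 1,                   # delete root of f
        fdist(F, G[:-1] + g[2]) + 1,                   # insert root of g
        fdist(f[2], g[2]) + fdist(F[:-1], G[:-1]) + sub_cost(f, g))

def size(t):
    return 1 + sum(size(c) for c in t[2])

def teds(t1, t2):
    return 1 - fdist((t1,), (t2,)) / max(size(t1), size(t2))

td = lambda s: ("td", s, ())
truth = ("table", "", (("tr", "", (td("Dog"), td("Cat"))),))
pred  = ("table", "", (("tr", "", (td("Dog"), td("Bat"))),))
print(teds(truth, truth))  # 1.0
```

A one-character OCR error in a three-character cell costs only 1/3 of a node here, which is why TEDS degrades gracefully under content perturbation while the adjacency-relation F1 collapses.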
Our EDD model is able to generate an extremely close match to the ground truth, making no error in structure recognition and a single optical character recognition (OCR) error (‘PF’ recognized as ‘PC’). The second header row is missing in the results of WYGIWYS, which also makes a few errors in the cell content. On the other hand, the off-the-shelf tools make substantially more errors in recognizing the complex structure of the table headers. This demonstrates the limited capability of these tools on recognizing complex tables.

Figs. 7 (a) - (c) illustrate the attention of the structure decoder when processing an example input table. When a new row is recognized (‘<tr>’ and ‘</tr>’), the structure decoder focuses its attention around the cells in the row. When the opening tag (‘<td>’) of a new cell is generated, the structure decoder pays more attention around the cell. For the closing tag ‘</td>’, the attention of the structure decoder spreads across the image. Since ‘</td>’ always follows the ‘<td>’ or ‘>’ token, the structure decoder relies on the language model rather than the encoded feature map to predict it. Fig. 7 (d) shows the aggregated attention of the cell decoder when generating the content of each cell. Compared to the structure decoder, the cell decoder has more focused attention, which falls on the cell content that is being generated.
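The interplay described above (the cell decoder fires only when the structure decoder opens a new cell, seeded with the structure decoder's hidden state) can be sketched as a framework-free control loop. Everything here is illustrative rather than the paper's released code: `structure_step` and `cell_decode` are hypothetical stand-ins for the trained attention-based recurrent decoders, stubbed with canned outputs so the control flow runs end to end:

```python
# Schematic EDD inference loop (our own illustration).
def edd_infer(structure_step, cell_decode):
    html, state = [], None
    while True:
        token, state = structure_step(state)
        html.append(token)
        if token in ("<td>", ">"):
            # cell decoder is triggered only when a new cell opens,
            # seeded with the structure decoder's current hidden state
            html.append(cell_decode(state))
        if token == "</table>":
            return "".join(html)

def make_stub():
    # canned predictions for a 1x2 table, in place of real networks
    script = iter(["<table>", "<tr>", "<td>", "</td>",
                   "<td>", "</td>", "</tr>", "</table>"])
    cells = iter(["Dog", "Cat"])
    structure_step = lambda state: (next(script), (state or 0) + 1)
    cell_decode = lambda state: next(cells)
    return structure_step, cell_decode

step, cell = make_stub()
print(edd_infer(step, cell))  # <table><tr><td>Dog</td><td>Cat</td></tr></table>
```

Merging the two token streams in this way yields the final HTML output, which is why structure errors and content errors can be analyzed separately.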
First, the current PubTabNet dataset does not provide coordinates of table cells, which we plan to supplement in the next version. This will enable adding an additional branch to the EDD

…petition,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1449–1453.
[3] M. Hurst, “A constraint-based approach to table structure derivation,” 2003.

Fig. 7: Attention distribution of the structure decoder (a - c) and the cell decoder (d) on an example input table. The texts at the center of the images are predictions of the EDD model. The structure decoder focuses its attention around table cells when recognizing new rows and cells, whereas the cell decoder places more focused attention on cell content.
[Fig. 8 plots: histograms (frequency in k) of the PubTabNet test set over table width (pixel), table height (pixel), number of structural tokens, and number of tokens of the longest cell, each overlaid with mean TEDS (%) curves for EDD and WYGIWYS.]

Fig. 8: Impact of table size in terms of width, height, number of structural tokens, and number of tokens in the longest cell on the performance of EDD and WYGIWYS. The bar plots (left axis) are the histogram of the PubTabNet test set w.r.t. the above properties. The line plots (right axis) are the mean TEDS of the samples in each bar. The vertical dashed lines are the thresholds in Equation 3.