
Image-based table recognition: data, model, and evaluation

Xu Zhong, Elaheh ShafieiBavani, Antonio Jimeno Yepes
IBM Research Australia, 60 City Road, Southbank, VIC 3006, Australia
peter.zhong@au1.ibm.com, elaheh.shafieibavani@ibm.com, antonio.jimeno@au1.ibm.com
arXiv:1911.10683v5 [cs.CV] 4 Mar 2020

Abstract—Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g. Portable Document Format (PDF) and images, are difficult to parse into a structured machine-readable format, due to the complexity and diversity of their structure and style. To facilitate image-based table recognition with deep learning, we develop and release the largest publicly available table recognition dataset, PubTabNet1, containing 568k table images with corresponding structured HTML representations. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in the PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state of the art by 9.7% absolute TEDS score.

I. INTRODUCTION

Information in tabular format is prevalent in all sorts of documents. Compared to natural language, tables provide a way to summarize large quantities of data in a more compact and structured format. Tables also provide a format that assists readers with finding and comparing information. An example of the relevance of tabular information in the biomedical domain is in the curation of genetic databases, in which just 2% to 8% of the information was available in the narrative part of the article compared to the information available in tables or files in tabular format [1].

Tables in documents are typically formatted for human understanding, and humans are generally adept at parsing table structure, identifying table headers, and interpreting relations between table cells. However, it is challenging for a machine to understand tabular data in unstructured formats (e.g. PDF, images) due to the large variability in their layout and style. The key step of table understanding is to represent the unstructured tables in a machine-readable format, where the structure of the table and the content within each cell are encoded according to a pre-defined standard. This is often referred to as table recognition [2].

This paper addresses the following three problems in image-based table recognition, where the structured representations of tables are reconstructed solely from image input:
• Data: We provide a large-scale dataset, PubTabNet, which consists of over 568k images of heterogeneous tables extracted from the scientific articles (in PDF format) contained in PMCOA. By matching the metadata of the PDFs with the associated structured representation (provided by PMCOA2 in XML format), we automatically annotate each table image with information about both the structure of the table and the text within each cell (in HTML format).
• Model: We develop a novel attention-based encoder-dual-decoder (EDD) architecture (see Fig. 1) which consists of an encoder, a structure decoder, and a cell decoder. The encoder captures the visual features of input table images. The structure decoder reconstructs the table structure and helps the cell decoder to recognize cell content. Our EDD model is trained on PubTabNet and demonstrates superior performance compared to existing table recognition methods. The error analysis shows potential enhancements to the current EDD model for improved performance.
• Evaluation: By modeling tables as a tree structure, we propose a new tree-edit-distance-based evaluation metric for image-based table recognition. We demonstrate that our new metric is superior to the metric [3] commonly used in the literature and in competitions.

1 https://github.com/ibm-aur-nlp/PubTabNet
2 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

II. RELATED WORK

A. Data

Analyzing tabular data in unstructured documents focuses mainly on three problems: i) table detection: localizing the bounding boxes of tables in documents, ii) table structure recognition: parsing only the structural (row and column layout) information of tables, and iii) table recognition: parsing both the structural information and the content of table cells. Table I compares the datasets that have been developed to address one or more of these three problems.
[Fig. 1 (diagram): a table image is passed to a CNN encoder; the structure decoder (attention As, recurrent units Rs) emits structural tokens such as <thead>, <tr>, <td>, </td>; each new cell triggers the cell decoder (attention Ac, recurrent units Rc), which emits cell content character by character (e.g. "Dog", "Cat"); the merged output is the HTML <thead><tr><td>Dog…</td>…<td>Cat…</td>….]
Fig. 1: EDD architecture. The encoder is a convolutional neural network which captures the visual features of the input table image. As and Ac are the attention networks for the structure decoder and cell decoder, respectively. Rs and Rc are the recurrent units for the structure decoder and cell decoder, respectively. The structure decoder reconstructs the table structure and helps the cell decoder to generate cell content. The outputs of the structure decoder and the cell decoder are merged to obtain the HTML representation of the input table image.

The PubTabNet dataset and the EDD model we develop in this paper aim at the image-based table recognition problem. Compared to other existing datasets for table recognition (e.g. SciTSR3, Table2Latex [4], and TIES [5]), PubTabNet has three key advantages:
1) The tables are typeset by the publishers of over 6,000 journals in PMCOA, which offers considerably more diversity in table styles than other table datasets.
2) Cells are categorized into headers and body cells, which is important when retrieving information from tables.
3) The format of the targeted output is HTML, which can be directly integrated into web applications. In addition, tables in HTML format are represented as a tree structure, which enables the new tree-edit-distance-based evaluation metric that we propose in Section V.

Dataset               | TD | TSR | TR | # tables
Marmot [6]            | ✓  | ✗   | ✗  | 958
PubLayNet [7]         | ✓  | ✗   | ✗  | 113k
DeepFigures [8]       | ✓  | ✗   | ✗  | 1.4m
ICDAR2013 [2]         | ✓  | ✓   | ✓  | 156
ICDAR2019 [9]         | ✓  | ✓   | ✗  | 3.6k
UNLV [10]             | ✓  | ✓   | ✗  | 558
TableBank4            | ✓  | ✓   | ✗  | 417k (TD), 145k (TSR)
SciTSR3               | ✗  | ✓   | ✓  | 15k
Table2Latex [4]       | ✗  | ✓   | ✓  | 450k
Synthetic data in [5] | ✗  | ✓   | ✓  | Unbounded
PubTabNet             | ✗  | ✓   | ✓  | 568k

TABLE I: Datasets for Table Detection (TD), Table Structure Recognition (TSR) and Table Recognition (TR).

3 https://github.com/Academic-Hammer/SciTSR
4 https://github.com/doc-analysis/TableBank
B. Model

Traditional table detection and recognition methods rely on pre-defined rules [11]–[16] and statistical machine learning [17]–[21]. Recently, deep learning has exhibited great performance in image-based table detection and structure recognition. Hao et al. used a set of primitive rules to propose candidate table regions and a convolutional neural network to determine whether the regions contain a table [22]. Fully-convolutional neural networks, followed by a conditional random field, have also been used for table detection [23]–[25]. In addition, deep neural networks for object detection, such as Faster-RCNN [26], Mask-RCNN [27], and YOLO [28], have been exploited for table detection and row/column segmentation [7], [29]–[31]. Furthermore, graph neural networks have been used for table detection and recognition by encoding document images as graphs [5], [32].

There are several tools (see Table II) that can convert tables in text-based PDF format into structured representations. However, there is limited work on image-based table recognition. The attention-based encoder-decoder was first proposed by Xu et al. for image captioning [33]. Deng et al. extended it by adding a recurrent layer in the encoder that captures long horizontal spatial dependencies, to convert images of mathematical formulas into LaTeX representation [34]. The same model was trained on the Table2Latex [4] dataset to convert table images into LaTeX representation. As shown in [4] and in our experimental results (see Table II), the efficacy of this model on image-based table recognition is mediocre.

This paper considerably improves the performance of the attention-based encoder-decoder method on image-based table recognition with a novel EDD architecture. Our model differs from other existing encoder-dual-decoder architectures [35], [36], where the dual decoders are independent from each other. In our model, the cell decoder is triggered only when the structure decoder generates a new cell. Meanwhile, the hidden state of the structure decoder is sent to the cell decoder to help it place its attention on the corresponding cell in the table image.

C. Evaluation

The evaluation metric proposed in [3] is commonly used in table recognition literature and competitions. This metric first flattens the ground truth and the recognition result of a table into a list of pairwise adjacency relations between non-empty cells. Then precision, recall, and F1-score can be computed by comparing the lists. This metric is simple but has two obvious problems: 1) as it only checks immediate adjacency relations between non-empty cells, it cannot detect errors caused by empty cells and misalignment of cells beyond immediate neighbors; 2) as it checks relations by exact match5, it does not have a mechanism to measure fine-grained cell content recognition performance. In order to address these two problems, we propose a new evaluation metric: Tree-Edit-Distance-based Similarity (TEDS). TEDS solves problem 1) by examining recognition results at the global tree-structure level, allowing it to identify all types of structural errors; and problem 2) by computing the string-edit distance when the tree-edit operation is node substitution.

5 Both cells are identical and the direction matches.

III. AUTOMATIC GENERATION OF PUBTABNET

PMCOA contains over one million scientific articles in both unstructured (PDF) and structured (XML) formats. A large table recognition dataset can be automatically generated if the corresponding locations of the table nodes in the XML can be found in the PDF. Zhong et al. proposed an algorithm to match the XML and PDF representations of the articles in PMCOA, which automatically generated the PubLayNet dataset for document layout analysis [7]. We use their algorithm to extract the table regions from the PDF for the table nodes in the XML. The table regions are converted to images at a resolution of 72 pixels per inch (PPI). We use this low PPI setting to relax the requirement of our model for high-resolution input images. For each table image, the corresponding table node (in HTML format) is extracted from the XML as the ground truth annotation.

It is observed that the algorithm generates erroneous bounding boxes for some tables, hence we use a heuristic to automatically verify the bounding boxes. For each annotation, the text within the bounding box is extracted from the PDF and compared with that in the annotation. The bounding box is considered to be correct if the cosine similarity of the term frequency-inverse document frequency (Tf-idf) features of the two texts is greater than 90% and the lengths of the two texts differ by less than 10%.
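This verification heuristic is straightforward to sketch in code. The following is a minimal illustration using scikit-learn; `pdf_text` and `annot_text` stand for the text extracted from the PDF region and from the XML annotation, and the exact normalization of the length check is our assumption, as the paper only states the two lengths differ by less than 10%.

```python
# A minimal sketch of the bounding-box verification heuristic above,
# assuming scikit-learn is available. The length-difference
# normalization is our own assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bbox_is_valid(pdf_text: str, annot_text: str) -> bool:
    # Reject if the two text lengths differ by 10% or more.
    longest = max(len(pdf_text), len(annot_text), 1)
    if abs(len(pdf_text) - len(annot_text)) / longest >= 0.10:
        return False
    # Accept if the cosine similarity of the Tf-idf features exceeds 90%.
    tfidf = TfidfVectorizer().fit_transform([pdf_text, annot_text])
    return cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0] > 0.90
```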
In addition, to improve the learnability of the data, we remove rare tables which contain any cell that spans over 10 rows or 10 columns, or any character that occurs less than 50 times in all the tables. Tables whose annotation contains math and inline-formula nodes are also removed, as we found they do not have a consistent XML representation.

After filtering the table samples, we curate the HTML code of the tables to remove unnecessary variations. First, we remove the nodes and attributes that are not reconstructable from the table image, such as hyperlinks and definitions of acronyms. Second, table header cells are defined as th nodes in some tables, but as td nodes in others. We unify the definition of header cells as td nodes, which preserves the header identity of the cells, as they are still descendants of the thead node. Third, all the attributes except 'rowspan' and 'colspan' in td nodes are stripped, since they control the appearance of the tables in web browsers, which does not match the table image. These curations lead to consistent and clean HTML code and make the data more learnable.
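As a concrete illustration, the curation rules above can be sketched with BeautifulSoup. This is our own rendering of the described steps, not the released preprocessing pipeline.

```python
# A sketch of the HTML curation rules described above
# (BeautifulSoup-based; an illustration, not the released pipeline).
from bs4 import BeautifulSoup

def curate_table(html: str):
    soup = BeautifulSoup(html, "html.parser")
    # Tables containing math or inline-formula nodes are removed.
    if soup.find(["math", "inline-formula"]):
        return None
    # Unify header cells: <th> becomes <td> (still under <thead>).
    for th in soup.find_all("th"):
        th.name = "td"
    # Strip every attribute of <td> except 'rowspan' and 'colspan'.
    for td in soup.find_all("td"):
        td.attrs = {k: v for k, v in td.attrs.items()
                    if k in ("rowspan", "colspan")}
    return str(soup)
```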
Finally, the samples are randomly partitioned into 60%/20%/20% training/development/test sets. The training set contains 548,592 samples. As only a small proportion of tables contain spanning (multi-column or multi-row) cells, the evaluation on the raw development and test sets would be strongly biased towards tables without spanning cells. To better evaluate how a model performs on complex table structures, we create more balanced development and test sets by randomly drawing 5,000 tables with spanning cells and 5,000 tables without spanning cells from the corresponding raw sets.
IV. ENCODER-DUAL-DECODER (EDD) MODEL

Fig. 1 shows the architecture of the EDD model, which consists of an encoder, an attention-based structure decoder, and an attention-based cell decoder. The use of two decoders is inspired by two intuitive considerations: i) table structure recognition and cell content recognition are two distinctively different tasks, and it is not effective to solve both tasks at the same time using a single attention-based decoder; ii) information in the structure recognition task can be helpful for locating the cells that need to be recognized. The encoder is a convolutional neural network (CNN) that captures the visual features of input table images. The structure decoder and cell decoder are recurrent neural networks (RNN) with the attention mechanism proposed in [33]. The structure decoder only generates the HTML tags that define the structure of the table. When the structure decoder recognizes a new cell, the cell decoder is triggered and uses the hidden state of the structure decoder to compute the attention for recognizing the content of the new cell. This ensures a one-to-one match between the cells generated by the structure decoder and the sequences generated by the cell decoder. The outputs of the two decoders can be easily merged to get the final HTML representation of the table.

[Fig. 2 (diagram): the HTML code of a table, e.g. <thead><tr><td colspan="2">Dog<sup>a</sup></td><td>Cat</td></tr></thead><tbody><tr><td>Woof</td><td>Arf</td><td>Meow</td></tr></tbody>, is tokenized into structural tokens (<thead>, <tr>, <td, colspan="2", >, </td>, <td>, …) and cell tokens (D, o, g, <sup>, a, </sup>, …).]

Fig. 2: Example of tokenizing an HTML table. Structural tokens define the structure of the table. HTML tags in cell content are treated as single tokens. The rest of the cell content is tokenized at the character level.
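The triggering mechanism can be made concrete with a short sketch. The following PyTorch skeleton is our own illustration of the control flow, not the authors' implementation: soft-attention is collapsed to mean pooling, and all names and shapes other than the sizes quoted in Section VI-A are assumptions.

```python
# A PyTorch sketch of the dual-decoder control flow described above:
# the structure decoder emits structural tokens and, whenever it opens
# a new cell, its current hidden state conditions the cell decoder.
import torch
import torch.nn as nn

S_HID, C_HID, FEAT = 256, 512, 512  # decoder sizes follow Section VI-A

class MiniEDD(nn.Module):
    def __init__(self, n_struct_tokens, n_cell_tokens):
        super().__init__()
        self.s_emb = nn.Embedding(n_struct_tokens, 16)
        self.c_emb = nn.Embedding(n_cell_tokens, 80)
        self.s_rnn = nn.LSTMCell(16 + FEAT, S_HID)
        self.c_rnn = nn.LSTMCell(80 + FEAT + S_HID, C_HID)
        self.s_out = nn.Linear(S_HID, n_struct_tokens)
        self.c_out = nn.Linear(C_HID, n_cell_tokens)

    def forward(self, feat_map, struct_tokens, cells, cell_open_ids):
        # feat_map: (N, FEAT) flattened CNN features of one table image.
        ctx = feat_map.mean(0, keepdim=True)      # stand-in for attention
        h = c = torch.zeros(1, S_HID)
        s_logits, c_logits, cell_idx = [], [], 0
        for tok in struct_tokens:                 # teacher forcing
            x = torch.cat([self.s_emb(tok.view(1)), ctx], dim=1)
            h, c = self.s_rnn(x, (h, c))
            s_logits.append(self.s_out(h))
            if tok.item() in cell_open_ids:       # '<td>' or '>' opens a
                hc = cc = torch.zeros(1, C_HID)   # cell: trigger the cell
                for u in cells[cell_idx]:         # decoder, conditioned on h
                    y = torch.cat([self.c_emb(u.view(1)), ctx, h], dim=1)
                    hc, cc = self.c_rnn(y, (hc, cc))
                    c_logits.append(self.c_out(hc))
                cell_idx += 1
        return torch.cat(s_logits), torch.cat(c_logits)
```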
As the structure and the content of an input table image are recognized separately by the two decoders, during training the ground truth HTML representation of the table is tokenized into structural tokens and cell tokens, as shown in Fig. 2. Structural tokens include the HTML tags that control the structure of the table. For spanning cells, the opening tag is broken down into multiple tokens: '<td', the 'rowspan' or 'colspan' attributes, and '>'. The content of cells is tokenized at the character level, where HTML tags are treated as single tokens.
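For instance, the cell-content tokenization can be sketched with a regular expression (our own illustration, not the released preprocessing code):

```python
# HTML tags inside cell content stay single tokens;
# everything else is split into characters.
import re

def tokenize_cell(content: str):
    tokens = []
    for piece in re.split(r"(</?\w+[^>]*>)", content):
        if piece.startswith("<"):
            tokens.append(piece)        # e.g. '<sup>' is one token
        else:
            tokens.extend(piece)        # character-level tokens
    return tokens

print(tokenize_cell("Dog<sup>a</sup>"))
# ['D', 'o', 'g', '<sup>', 'a', '</sup>']
```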
Two loss functions can be computed from the EDD network: i) the cross-entropy loss of generating the structural tokens (ls); and ii) the cross-entropy loss of generating the cell tokens (lc). The overall loss (l) of the EDD network is calculated as

l = λ ls + (1 − λ) lc ,   (1)

where λ ∈ [0, 1] is a hyper-parameter.
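In PyTorch form, Eq. (1) amounts to a weighted sum of two cross-entropy terms (a sketch; the flattening of logits and targets over time steps, and the names, are our assumptions):

```python
import torch.nn.functional as F

def edd_loss(s_logits, s_targets, c_logits, c_targets, lam=0.5):
    ls = F.cross_entropy(s_logits, s_targets)  # structural-token loss
    lc = F.cross_entropy(c_logits, c_targets)  # cell-token loss
    return lam * ls + (1.0 - lam) * lc         # l = λ·ls + (1 − λ)·lc
```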
V. TREE-EDIT-DISTANCE-BASED SIMILARITY (TEDS)

Tables are presented as a tree structure in the HTML format. The root has two children, thead and tbody, which group table header and table body cells, respectively. The children of the thead and tbody nodes are table rows (tr). The leaves of the tree are table cells (td). Each cell node has three attributes, i.e. 'colspan', 'rowspan', and 'content'. We measure the similarity between two tables using the tree-edit distance proposed by Pawlik and Augsten [37]. The cost of insertion and deletion operations is 1. When the edit is substituting a node no with ns, the cost is 1 if either no or ns is not td. When both no and ns are td, the substitution cost is 1 if the column span or the row span of no and ns is different. Otherwise, the substitution cost is one minus the normalized Levenshtein similarity [38] (∈ [0, 1]) between the content of no and ns. Finally, TEDS between two trees is computed as

TEDS(Ta, Tb) = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|) ,   (2)

where EditDist denotes tree-edit distance, and |T| is the number of nodes in T. The table recognition performance of a method on a set of test samples is defined as the mean TEDS score between the recognition result and the ground truth of each sample.
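TEDS can be sketched on top of the pip `apted` package, which implements the Pawlik-Augsten tree edit distance cited above. The `TableTree` node class and the concrete reading of the substitution rules (cost 1 whenever tags or spans differ, zero for matching non-td nodes) are our own assumptions; parsing HTML into such trees is omitted.

```python
from apted import APTED, Config

class TableTree:
    def __init__(self, tag, colspan=1, rowspan=1, content="", children=None):
        self.tag, self.colspan, self.rowspan = tag, colspan, rowspan
        self.content, self.children = content, children or []

def lev_similarity(a: str, b: str) -> float:
    # Normalized Levenshtein similarity in [0, 1].
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

class TEDSConfig(Config):
    def insert(self, node): return 1.0
    def delete(self, node): return 1.0
    def rename(self, n1, n2):
        # Substitution cost: 1 if tags or spans differ; for matching td
        # nodes, one minus the normalized Levenshtein similarity of
        # their content; 0 for matching non-td nodes.
        if (n1.tag, n1.colspan, n1.rowspan) != (n2.tag, n2.colspan, n2.rowspan):
            return 1.0
        if n1.tag == "td":
            return 1.0 - lev_similarity(n1.content, n2.content)
        return 0.0

def teds(ta: TableTree, tb: TableTree) -> float:
    def size(t): return 1 + sum(size(ch) for ch in t.children)
    dist = APTED(ta, tb, TEDSConfig()).compute_edit_distance()
    return 1.0 - dist / max(size(ta), size(tb))  # Eq. (2)
```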
In order to justify that TEDS solves the two problems of the adjacency relation metric [3] described previously in Section II, we add two types of perturbations to the validation set of PubTabNet and examine how TEDS and the adjacency relation metric respond to the perturbations.
1) To demonstrate the empty-cell and multi-hop misalignment issue, we shift some cells in the first row downwards6, and pad the leftover space with empty cells. The shift distance of a cell is proportional to its column index. We tested 5 perturbation levels, i.e., 10%, 30%, 50%, 70%, or 90% of the cells in the first row are shifted. Fig. 3 shows a perturbed example, where 90% of the cells in the first row are shifted.

6 If the number of rows is greater than the number of columns, we shift the cells in the first column rightwards instead.
(a) Before perturbation (b) After perturbation

Fig. 3: Example of cell shift perturbation, where 90% of the cells in the first row are shifted. TEDS = 34.9%. Adjacency
relation F1 score = 80.3%.

(a) Before perturbation (b) After perturbation

Fig. 4: Example of cell content perturbation at the 10% perturbation level. TEDS = 93.2%. Adjacency relation F1 score = 19.1%.

2) To demonstrate the fine-grained cell content recognition issue, we randomly modify some characters into a different one. We tested 5 perturbation levels, i.e., the chance that a character gets modified is set to 10%, 30%, 50%, 70%, or 90%. Fig. 4 shows an example at the 10% perturbation level.

[Fig. 5 (plots): performance score vs. perturbation level (0.1–0.9) for TEDS and the adjacency relation F1 score, under (a) cell shift perturbation and (b) cell content perturbation.]

Fig. 5: Comparison of the response of TEDS and the adjacency relation metric to cell shift perturbation and cell content perturbation. The adjacency relation metric under-reacts to cell shift perturbation and over-reacts to cell content perturbation, whereas TEDS appropriately captures both types of errors.

Fig. 5 illustrates how TEDS and the adjacency relation F1-score respond to the two types of perturbations at different levels. The adjacency relation metric under-reacts to the cell shift perturbation. At the 90% perturbation level, the table is substantially different from the original (see example in Fig. 3). However, the adjacency relation F1-score is still nearly 80%. On the other hand, the perturbation causes a 60% drop in TEDS, demonstrating that TEDS is able to capture errors that the adjacency relation metric cannot.

When it comes to cell content perturbations, the adjacency relation metric over-reacts. Even the 10% perturbation level (see example in Fig. 4) leads to an over 70% decrease in adjacency relation F1-score, which drops close to zero from the 50% perturbation level. In contrast, TEDS linearly decreases from 90% to 40% as the perturbation level increases from 10% to 90%, demonstrating its capability of capturing fine-grained cell content recognition errors.

VI. EXPERIMENTS
The test performance of the proposed EDD model is compared with five off-the-shelf tools (Tabula7, Traprange8, Camelot9, PDFPlumber10, and Adobe Acrobat® Pro11) and the WYGIWYS model12 [34]. We crop the test tables from the original PDF for Tabula, Traprange, Camelot, and PDFPlumber, as they only support text-based PDF as input. Adobe Acrobat® Pro is tested with both PDF tables and high-resolution table images (300 PPI). The outputs of the off-the-shelf tools are parsed into the same tree structure as the HTML tables to compute the TEDS score.

7 v1.0.4 (https://github.com/tabulapdf/tabula-java)
8 v1.0 (https://github.com/thoqbk/traprange)
9 v0.7.3 (https://github.com/camelot-dev/camelot)
10 v0.6.0-alpha (https://github.com/jsvine/pdfplumber)
11 v2019.012.20040
12 WYGIWYS is trained on the same samples as EDD by truncated back-propagation through time (200 steps). WYGIWYS and EDD use the same CNN in the encoder to rule out the possibility that the performance gain of EDD is due to a difference in CNN.
A. Implementation details

To avoid exceeding GPU RAM, the EDD model is trained on a subset (399k samples) of the PubTabNet training set, which satisfies

width and height ≤ 512 pixels,
structural tokens ≤ 300 tokens,
longest cell ≤ 100 tokens.   (3)

Note that samples in the validation and test sets are not constrained by these criteria. The vocabulary sizes of the structural tokens and the cell tokens of the training data are 32 and 281, respectively. Training images are rescaled to 448 × 448 pixels to facilitate batching and each channel is normalized by z-score.
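The constraints of Equation 3 amount to a simple filter predicate (a sketch; the `sample` attribute names are assumptions for illustration):

```python
def keep_for_training(sample) -> bool:
    return (sample.width <= 512 and sample.height <= 512
            and len(sample.structural_tokens) <= 300
            and max((len(cell) for cell in sample.cells), default=0) <= 100)
```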
We use the ResNet-18 [39] network as the encoder. The default ResNet-18 model downsamples the image resolution by 32. We modify the last CNN layer of ResNet-18 to study whether a higher-resolution feature map improves table recognition performance. A total of five different settings are tested in this paper:
• EDD-S2: the default ResNet-18
• EDD-S1: stride of the last CNN layer set to 1
• EDD-S2S2: two independent last CNN layers for the structure (stride=2) and cell (stride=2) decoders
• EDD-S2S1: two independent last CNN layers for the structure (stride=2) and cell (stride=1) decoders
• EDD-S1S1: two independent last CNN layers for the structure (stride=1) and cell (stride=1) decoders

We evaluate the performance of these five settings on the validation set and find that a higher-resolution feature map and independent CNN layers improve performance. As a result, the EDD-S1S1 setting provides the best validation performance, and is therefore chosen to compare with the baselines on the test set.
The structure decoder and the cell decoder are single-layer long short-term memory (LSTM) networks, of which the hidden state sizes are 256 and 512, respectively. Both decoders weight the feature map from the encoder with soft-attention, which has a hidden layer of size 256. The embedding dimensions of structural tokens and cell tokens are 16 and 80, respectively. At inference time, the outputs of both decoders are sampled with beam search (beam=3).

The EDD model is trained with the Adam [40] optimizer in two stages. First, we pre-train the encoder and the structure decoder to generate the structural tokens only (λ = 1), where the batch size is 10, and the learning rate is 0.001 in the first 10 epochs and is reduced by a factor of 10 for another 3 epochs. Then we train the whole EDD network to generate both structural and cell tokens (λ = 0.5), with a batch size of 8 and a learning rate of 0.001 for 10 epochs and 0.0001 for another 2 epochs. The total training time is about 16 days on two V100 GPUs.

B. Quantitative analysis

Table II compares the test performance of the proposed EDD model and the baselines, where the average TEDS of simple13 and complex14 test tables is also shown. By solely relying on table images, EDD substantially outperforms all the baselines on recognizing simple and complex tables, even the ones that directly use text extracted from PDF to fill table cells. Camelot is the best off-the-shelf tool in this comparison. Furthermore, the performance of Adobe Acrobat® Pro on image input is dramatically lower than that on PDF input, demonstrating the difficulty of recognizing tables solely from table images. When trained on the PubTabNet dataset, WYGIWYS also considerably outperforms the off-the-shelf tools, but is outperformed by EDD by 9.7% absolute TEDS score. The advantage of EDD over WYGIWYS is more pronounced on complex tables (9.9% absolute TEDS) than simple tables (9.5% absolute TEDS). This proves the great advantage of jointly training two separate decoders to solve the structure recognition and cell content recognition tasks.

Input | Method       | Average TEDS (%): Simple13 | Complex14 | All
PDF   | Tabula       | 78.0 | 57.8 | 67.9
PDF   | Traprange    | 60.8 | 49.9 | 55.4
PDF   | Camelot      | 80.0 | 66.0 | 73.0
PDF   | PDFPlumber   | 44.9 | 35.9 | 40.4
PDF   | Acrobat® Pro | 68.9 | 61.8 | 65.3
Image | Acrobat® Pro | 53.8 | 53.5 | 53.7
Image | WYGIWYS      | 81.7 | 75.5 | 78.6
Image | EDD          | 91.2 | 85.4 | 88.3

TABLE II: Test performance of EDD and 7 baseline approaches. Our EDD model, by solely relying on table images, substantially outperforms all the baselines.

13 Tables without multi-column or multi-row cells.
14 Tables with multi-column or multi-row cells.

C. Qualitative analysis

To illustrate the differences in the behavior of the compared methods, Fig. 6 shows the rendering of the predicted HTML given an example input table. The table has 7 columns, 3 header rows, and 4 body rows. The table header has a complex structure, which consists of 4 multi-row (span=3) cells, 2 multi-column (span=3) cells, and three normal cells.
[Fig. 6 example input table (rendered):]

Time after IVF (h) | No. of oocytes (replicates) | No. of MII oocytes (%)* | No. of fertilization (%)** | Embryo development (% of fertilized oocytes): OA (%) | PF (%) | CC (%)
12 | 103 (9) | 63 (61.2) | 28.6a | 5 (27.8)  | 13 (72.2) | 0 (0)
18 | 97 (7)  | 65 (67.0) | 50.8b | 3 (9.1)   | 30 (90.9) | 0 (0)
24 | 91 (7)  | 59 (64.9) | 49.2b | 4 (13.8)  | 25 (86.2) | 0 (0)
30 | 87 (8)  | 56 (64.4) | 48.2b | 4 (14.9)  | 9 (33.3)  | 14 (51.8)
(a) Input table (b) Ground truth

(c) EDD (TEDS = 99.8%) (d) WYGIWYS (TEDS = 89.8%)

(e) Acrobat R on PDF (TEDS = 74.8%) (f) Acrobat R on Image (TEDS = 64.2%)

(g) Tabula (TEDS = 47.5%) (h) Traprange (TEDS = 40.2%)

(i) Camelot (TEDS = 35.5%) (j) PDFPlumber (TEDS = 30.0%)

Fig. 6: Table recognition results of EDD and 7 baseline approaches on an example input table which has a complex header
structure (4 multi-row (span=3) cells, 2 multi-column (span=3) cells, and three normal cells). Our EDD model perfectly
recognizes the complex structure and cell content of the table, whereas the baselines struggle with the complex table header.

Our EDD model is able to generate an extremely close match to the ground truth, making no error in structure recognition and a single optical character recognition (OCR) error ('PF' recognized as 'PC'). The second header row is missing in the results of WYGIWYS, which also makes a few errors in the cell content. On the other hand, the off-the-shelf tools make substantially more errors in recognizing the complex structure of the table headers. This demonstrates the limited capability of these tools on complex tables.

Figs. 7 (a) - (c) illustrate the attention of the structure decoder when processing an example input table. When a new row is recognized ('<tr>' and '</tr>'), the structure decoder focuses its attention around the cells in the row. When the opening tag ('<td>') of a new cell is generated, the structure decoder pays more attention around the cell. For the closing tag '</td>', the attention of the structure decoder spreads across the image. Since '</td>' always follows the '<td>' or '>' token, the structure decoder relies on the language model rather than the encoded feature map to predict it. Fig. 7 (d) shows the aggregated attention of the cell decoder when generating the content of each cell. Compared to the structure decoder, the cell decoder has more focused attention, which falls on the cell content that is being generated.
(a) Attention of structure decoder on the first header row

(b) Attention of structure decoder on the first body row

(c) Attention of structure decoder on the last body row

(d) Aggregated attention of cell decoder on each cell

Fig. 7: Attention distribution of the structure decoder (a - c) and the cell decoder (d) on an example input table. The texts at the center of the images are predictions of the EDD model. The structure decoder focuses its attention around table cells when recognizing new rows and cells, whereas the cell decoder places more focused attention on cell content.

D. Error analysis

We categorize the test set of PubTabNet into 15 equal-interval groups along four key properties of table size: width, height, number of structural tokens, and number of tokens in the longest cell. Fig. 8 illustrates the number of tables in each group and the performance of the EDD model and the WYGIWYS model on each group. The EDD model outperforms the WYGIWYS model on all groups. The performance of both models decreases as table size increases. We train the models with tables that satisfy Equation 3, where the thresholds are indicated with vertical dashed lines in Fig. 8. Except for width, we do not observe a steep decrease in performance near the thresholds. We think the lower performance on larger tables is mainly due to rescaling images for batching, where larger tables are more strongly downsampled. The EDD model may better handle large tables by grouping table images into similar sizes as in [34] and using different rescaling sizes for each group.

E. Generalization

To demonstrate that the EDD model is not only suitable for PubTabNet, but also generalizable to other table recognition datasets, we train and test EDD on the synthetic dataset proposed in [5]. We did not choose the ICDAR2013 or ICDAR2019 table recognition competition datasets because, as shown in Table I, ICDAR2013 does not provide enough training data, and ICDAR2019 does not provide ground truth of cell content (cell position only). We synthesize 500K table images with the corresponding HTML representation15, evenly distributed among the four categories of table styles defined in [5] (see Fig. 9 for examples). The synthetic data is partitioned (stratified sampling by category) into 420K/40K/40K training/validation/test sets.

15 https://github.com/hassan-mahmood/TIES_DataGeneration
[Fig. 8 (plots): histograms of the PubTabNet test set (frequency, in thousands) and mean TEDS (%) of EDD and WYGIWYS, plotted against table width (pixels), height (pixels), number of structural tokens, and number of tokens of the longest cell.]

Fig. 8: Impact of table size in terms of width, height, number of structural tokens, and number of tokens in the longest cell on the performance of EDD and WYGIWYS. The bar plots (left axis) are the histograms of the PubTabNet test set w.r.t. the above properties. The line plots (right axis) are the mean TEDS of the samples in each bar. The vertical dashed lines are the thresholds in Equation 3.

[Fig. 9 (images): sample tables of (a) Category 1, (b) Category 2, (c) Category 3, and (d) Category 4.]

Fig. 9: Sample table images of the four categories of table styles defined in [5].

We compare the test performance of EDD to the graph neural network model TIES proposed in [5] on each table category. We compute the TEDS score only for EDD, as TIES predicts whether two tokens (recognized by an OCR engine from the table image) share the same cell, row, and column, but not an HTML representation of the table16. Instead, as in [5], the exact match percentage is calculated and compared between EDD and TIES. Note that the exact match for TIES only checks if the cell, row, and column adjacency matrices of the tokens perfectly match the ground truth, but does not check if the OCR engine makes any mistakes. For a fair comparison, we also ignore cell content recognition errors when checking the exact match for EDD, i.e., the recognized table is considered an exact match as long as the structure perfectly matches the ground truth.

16 [5] does not describe how the adjacency relations can be converted to a unique HTML representation.
Table III shows the test performance of EDD and TIES, where EDD achieves an extremely high TEDS score (99.7+%) on all the categories of the synthetic dataset. This means EDD is able to nearly perfectly reconstruct both the structure and cell content from the table images. EDD outperforms TIES in terms of exact match on all table categories. In addition, unlike TIES, EDD does not show any significant downgrade in performance on category 3 or 4, in which the samples have a more complex structure. This demonstrates that EDD is much more robust and generalizable than TIES on more difficult examples.

Model | Average TEDS (%): C1, C2, C3, C4 | Exact match (%): C1, C2, C3, C4
TIES  | –, –, –, –                       | 96.9, 94.7, 52.9, 68.5
EDD   | 99.8, 99.8, 99.8, 99.7           | 99.7, 99.9, 97.2, 98.0

TABLE III: Test performance of EDD and TIES on the dataset proposed in [5]. The TEDS score is not computed for TIES, as it does not generate the HTML representation of the input image.

VII. CONCLUSION

This paper makes a comprehensive study of the image-based table recognition problem. A large-scale dataset, PubTabNet, is developed to train and evaluate deep learning models. By separating the table structure recognition and cell content recognition tasks, we propose an attention-based EDD model.
The structure decoder not only recognizes the structure of input tables, but also helps the cell decoder to place its attention on the right cell content. We also propose a new evaluation metric, TEDS, which captures both the performance of table structure recognition and cell content recognition. Compared to the traditional adjacency relation metric, TEDS can more appropriately capture multi-hop cell misalignment and OCR errors. The proposed EDD model, when trained on PubTabNet, is effective at recognizing complex table structures and extracting cell content from images. PubTabNet has been made available and we believe that PubTabNet will accelerate future development in table recognition and provide support for pre-training table recognition models.

Our future work will focus on the following two directions. First, the current PubTabNet dataset does not provide coordinates of table cells, which we plan to supplement in the next version. This will enable adding an additional branch to the EDD network to also predict cell location. We think this additional task will assist cell content recognition. In addition, when tables are available in text-based PDF format, the cell location can be used to extract cell content directly from the PDF without using OCR, which might improve the overall recognition quality. Second, the EDD model takes table images as input, which implicitly assumes that the accurate location of tables in documents is given by users. We will investigate how the EDD model can be integrated with table detection neural networks to achieve end-to-end table detection and recognition.

REFERENCES

[1] A. Jimeno Yepes and K. Verspoor, "Literature mining of genetic variants for curation: quantifying the importance of supplementary material," Database, vol. 2014, 2014.
[2] M. Göbel, T. Hassan, E. Oro, and G. Orsi, "ICDAR 2013 table competition," in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1449–1453.
[3] M. Hurst, "A constraint-based approach to table structure derivation," 2003.
[4] Y. Deng, D. Rosenberg, and G. Mann, "Challenges in end-to-end neural scientific table recognition," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 894–901.
[5] S. R. Qasim, H. Mahmood, and F. Shafait, "Rethinking table recognition using graph neural networks," pp. 142–147, Sep. 2019.
[6] J. Fang, X. Tao, Z. Tang, R. Qiu, and Y. Liu, "Dataset, ground-truth and performance metrics for table detection evaluation," in 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012, pp. 445–449.
[7] X. Zhong, J. Tang, and A. J. Yepes, "Publaynet: largest dataset ever for document layout analysis," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 1015–1022.
[8] N. Siegel, N. Lourie, R. Power, and W. Ammar, "Extracting scientific figures with distantly supervised neural networks," in Proceedings of the 18th ACM/IEEE joint conference on digital libraries. ACM, 2018, pp. 223–232.
[9] L. Gao, Y. Huang, Y. Li, Q. Yan, Y. Fang, H. Dejean, F. Kleber, and E.-M. Lang, "ICDAR 2019 competition on table detection and recognition," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 1510–1515.
[10] A. Shahab, F. Shafait, T. Kieninger, and A. Dengel, "An open approach towards the benchmarking of table structure recognition systems," in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 113–120.
[11] Y. Hirayama, "A method for table structure analysis using dp matching," in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 2. IEEE, 1995, pp. 583–586.
[12] S. Tupaj, Z. Shi, C. H. Chang, and H. Alam, "Extracting tabular information from text files," EECS Department, Tufts University, Medford, USA, 1996.
[13] J. Hu, R. S. Kashi, D. P. Lopresti, and G. Wilfong, "Medium-independent table detection," in Document Recognition and Retrieval VII, vol. 3967. International Society for Optics and Photonics, 1999, pp. 291–302.
[14] B. Gatos, D. Danatsas, I. Pratikakis, and S. J. Perantonis, "Automatic table detection in document images," in International Conference on Pattern Recognition and Image Analysis. Springer, 2005, pp. 609–618.
[15] F. Shafait and R. Smith, "Table detection in heterogeneous documents," in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 65–72.
[16] S. S. Paliwal, D. Vishwanath, R. Rahul, M. Sharma, and L. Vig, "Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 128–133.
[17] T. Kieninger and A. Dengel, "The t-recs table recognition and analysis system," in International Workshop on Document Analysis Systems. Springer, 1998, pp. 255–270.
[18] F. Cesarini, S. Marinai, L. Sarti, and G. Soda, "Trainable table location in document images," in Object recognition supported by user interaction for service robots, vol. 3. IEEE, 2002, pp. 236–240.
[19] A. C. e Silva, "Learning rich hidden markov models in document analysis: Table location," in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 843–847.
[20] T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet, "Learning to detect tables in scanned document images using line information," in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1185–1189.
[21] M. Fan and D. S. Kim, "Table region detection on large-scale pdf files without labeled data," CoRR, abs/1506.08891, 2015.
[22] L. Hao, L. Gao, X. Yi, and Z. Tang, "A table detection method for pdf documents based on convolutional neural networks," in 2016 12th IAPR Workshop on Document Analysis Systems (DAS). IEEE, 2016, pp. 287–292.
[23] D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles, "Multi-scale multi-task fcn for semantic page segmentation and table detection," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 254–261.
[24] I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano, P. Messina, and C. Spampinato, "A saliency-based convolutional neural network for table and chart detection in digitized documents," in International Conference on Image Analysis and Processing. Springer, 2019, pp. 292–302.
[25] C. Tensmeyer, V. I. Morariu, B. Price, S. Cohen, and T. Martinez, "Deep splitting and merging for table structure decomposition," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 114–121.
[26] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, 2015, pp. 91–99.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask r-cnn," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[29] S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, "Deepdesrt: Deep learning for detection and structure recognition of tables in document images," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1162–1167.
[30] A. Gilani, S. R. Qasim, I. Malik, and F. Shafait, "Table detection using deep learning," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 771–776.
[31] P. W. Staar, M. Dolfi, C. Auer, and C. Bekas, "Corpus conversion service: A machine learning platform to ingest documents at scale," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 774–782.
[32] P. Riba, A. Dutta, L. Goldmann, A. Fornés, O. Ramos, and J. Lladós, "Table detection in invoice documents by graph neural networks," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 122–127.
[33] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International conference on machine learning, 2015, pp. 2048–2057.
[34] Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush, "Image-to-markup generation with coarse-to-fine attention," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 980–989.
[35] Y.-F. Zhou, R.-H. Jiang, X. Wu, J.-Y. He, S. Weng, and Q. Peng, "Branchgan: Unsupervised mutual image-to-image transfer with a single encoder and dual decoders," IEEE Transactions on Multimedia, 2019.
[36] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11996–12004.
[37] M. Pawlik and N. Augsten, "Tree edit distance: Robust and memory-efficient," Information Systems, vol. 56, pp. 157–173, 2016.
[38] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet physics doklady, vol. 10, no. 8, 1966, pp. 707–710.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[40] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of International Conference on Learning Representations (ICLR), 2015.
