
International Journal of Computer Vision (2021) 129:161–184

https://doi.org/10.1007/s11263-020-01369-0

Scene Text Detection and Recognition: The Deep Learning Era


Shangbang Long1 · Xin He2 · Cong Yao3

Received: 14 April 2020 / Accepted: 8 August 2020 / Published online: 27 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an
important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave
of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial
advancements in mindset, methodology and performance. This survey is aimed at summarizing and analyzing the major
changes and significant progress of scene text detection and recognition in the deep learning era. In this article, we
aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future
trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges that remain.
We expect that this review paper will serve as a reference book for researchers in this field. Related resources are also
collected in our Github repository (https://github.com/Jyouhou/SceneTextPapers).

Keywords Scene text · Optical character recognition · Detection · Recognition · Deep learning · Survey

Communicated by Vittorio Ferrari.

Corresponding author: Shangbang Long (shangbal@cs.cmu.edu)
Xin He (hexin7257@gmail.com) · Cong Yao (yaocong2010@gmail.com)

1 Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
2 ByteDance Ltd, Beijing, China
3 MEGVII Inc. (Face++), Beijing, China

1 Introduction

Undoubtedly, text is among the most brilliant and influential creations of humankind. As the written form of human languages, text makes it feasible to reliably and effectively spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.

On the one hand, as a vital tool for communication and collaboration, text has been playing a more important role than ever in modern society; on the other hand, the rich and precise high-level semantics embodied in text can be beneficial for understanding the world around us. For example, text information can be used in a wide range of real-world applications, such as image search (Tsai et al. 2011; Schroth et al. 2011), instant translation (Dvorin and Havosha 2009; Parkinson et al. 2016), robot navigation (DeSouza and Kak 2002; Liu and Samarabandu 2005a, b; Schulz et al. 2015), and industrial automation (Ham et al. 1995; He et al. 2005; Chowdhury and Deb 2013). Therefore, automatic text reading from natural environments, as illustrated in Fig. 1, a.k.a. scene text detection and recognition (Zhu et al. 2016) or PhotoOCR (Bissacco et al. 2013), has become an increasingly popular and important research topic in computer vision.

However, despite years of research, a series of grand challenges may still be encountered when detecting and recognizing text in the wild. The difficulties mainly stem from three aspects:

• Diversity and Variability of Text in Natural Scenes Distinctive from scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations, and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.
• Complexity and Interference of Backgrounds The backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may potentially lead to confusion and mistakes.
• Imperfect Imaging Conditions In uncontrolled circumstances, the quality of text images and videos cannot be guaranteed. That is, in poor imaging conditions, text instances may be of low resolution and severely distorted due to an inappropriate shooting distance or angle, blurred because of defocus or shaking, noised on account of low light levels, or corrupted by highlights or shadows.

These difficulties run through the years before deep learning showed its potential in computer vision as well as in other fields. As deep learning came to prominence after AlexNet (Krizhevsky et al. 2012) won the ILSVRC2012 contest (Russakovsky et al. 2015), researchers turned to deep neural networks for automatic feature learning and started conducting more in-depth studies. The community is now working on ever more challenging targets. The progress made in recent years can be summarized as follows:

• Incorporation of Deep Learning Nearly all recent methods are built upon deep learning models. Most importantly, deep learning frees researchers from the exhausting work of repeatedly designing and testing hand-crafted features, which gives rise to a blossom of works that push the envelope further. To be specific, the use of deep learning substantially simplifies the overall pipeline, as illustrated in Fig. 3. Besides, these algorithms provide significant improvements over previous ones on standard benchmarks. Gradient-based training routines also facilitate end-to-end trainable methods.
• Challenge-Oriented Algorithms and Datasets Researchers are now turning to more specific aspects and challenges. Against difficulties in real-world scenarios, newly published datasets are collected with unique and representative characteristics. For example, there are datasets featuring long text (Tu et al. 2012), blurred text (Karatzas et al. 2015), and curved text (Ch'ng and Chan 2017), respectively. Driven by these datasets, almost all algorithms published in recent years are designed to tackle specific challenges. For instance, some are proposed to detect oriented text, while others aim at blurred and unfocused scene images. These ideas are also combined to make more general-purpose methods.
• Advances in Auxiliary Technologies Apart from new datasets and models devoted to the main task, auxiliary technologies that do not solve the task directly also find their places in this field, such as synthetic data and bootstrapping.

Fig. 1 Schematic diagram of scene text detection and recognition. The image sample is from Total-Text (Ch'ng and Chan 2017)

In this survey, we present an overview of the recent development of deep-learning-based text detection and recognition from still scene images. We review methods from different perspectives and list the up-to-date datasets. We also analyze the status quo and future research trends.

There have already been several excellent review papers (Uchida 2014; Ye and Doermann 2015; Yin et al. 2016; Zhu et al. 2016), which also organize and analyze works related to text detection and recognition. However, these papers were published before deep learning came to prominence in this field, and therefore they mainly focus on more traditional, feature-based methods. We refer readers to these papers as well for a more comprehensive view of the history of this field. This article will mainly concentrate on text information extraction from still images, rather than videos. For scene text detection and recognition in videos, please also refer to Jung et al. (2004) and Yin et al. (2016).

The remaining parts of this paper are arranged as follows: In Sect. 2, we briefly review the methods before the deep learning era. In Sect. 3, we list and summarize algorithms based on deep learning in a hierarchical order. Note that we do not introduce these techniques in a paper-by-paper order, but instead based on a taxonomy of their methodologies. Some papers may appear in several sections if they make contributions to multiple aspects. In Sect. 4, we take a look at the datasets and evaluation protocols. Finally, in Sects. 5 and 6, we present potential applications and our own opinions on the current status and future trends.

2 Methods Before the Deep Learning Era

In this section, we take a retrospective glance at algorithms before the deep learning era. More detailed and comprehensive coverage of these works can be found in Uchida (2014), Ye and Doermann (2015), Yin et al. (2016), and Zhu et al. (2016). For text detection and recognition, the focus of attention has been the design of features.

In this period of time, most text detection methods adopt either Connected Components Analysis (CCA) (Huang et al. 2013; Neumann and Matas 2010; Epshtein et al. 2010; Tu et al. 2012; Yin et al. 2014; Yi and Tian 2011; Jain and Yu 1998) or Sliding Window (SW) based classification (Lee et al. 2011; Wang et al. 2011; Coates et al. 2011; Wang et al. 2012). CCA-based methods first extract candidate components in a variety of ways (e.g., color clustering or extremal region extraction), and then filter out non-text components using manually designed rules or classifiers automatically trained on hand-crafted features (see Fig. 2). In sliding window classification methods, windows of varying sizes slide over the input image, where each window is classified as a text segment/region or not. Those classified as positive are further grouped into text regions with morphological operations (Lee et al. 2011), Conditional Random Fields (CRF) (Wang et al. 2011) and other alternative graph-based methods (Coates et al. 2011; Wang et al. 2012).

Fig. 2 Illustration of traditional methods with hand-crafted features: (1) Maximally Stable Extremal Regions (MSER) (Neumann and Matas 2010), assuming chromatic consistency within each character; (2) Stroke Width Transform (SWT) (Epshtein et al. 2010), assuming consistent stroke width within each character

For text recognition, one branch adopted feature-based methods. Shi et al. (2013) and Yao et al. (2014) propose character-segment based recognition algorithms. Rodriguez-Serrano et al. (2013, 2015), Gordo (2015), and Almazán et al. (2014) utilize label embedding to directly perform matching between strings and images. Strokes (Busta et al. 2015) and character key-points (Quy Phan et al. 2013) are also detected as features for classification. Another branch decomposes the recognition process into a series of sub-problems. Various methods have been proposed to tackle these sub-problems, including text binarization (Zhiwei et al. 2010; Mishra et al. 2011; Wakahara and Kita 2011; Lee and Kim 2013), text line segmentation (Ye et al. 2003), character segmentation (Nomura et al. 2005; Shivakumara et al. 2011; Roy et al. 2009), single character recognition (Chen et al. 2004; Sheshadri and Divvala 2012) and word correction (Zhang and Chang 2003; Wachenfeld et al. 2006; Mishra et al. 2012; Karatzas and Antonacopoulos 2004; Weinman et al. 2007).

There have also been efforts devoted to integrated (i.e. end-to-end, as we call it today) systems (Wang et al. 2011; Neumann and Matas 2013). In Wang et al. (2011), characters are considered as a special case in object detection, detected by a nearest-neighbor classifier trained on HOG features (Dalal and Triggs 2005), and then grouped into words through a Pictorial Structure (PS) based model (Felzenszwalb and Huttenlocher 2005). Neumann and Matas (2013) proposed a decision-delay approach that keeps multiple segmentations of each character until the last stage, when the context of each character is known. They detect character segmentations using extremal regions and decode recognition results through a dynamic programming algorithm.

In summary, text detection and recognition methods before the deep learning era mainly extract low-level or mid-level handcrafted image features, which entails demanding and repetitive pre-processing and post-processing steps. Constrained by the limited representation ability of handcrafted features and the complexity of pipelines, those methods can hardly handle intricate circumstances, e.g. blurred images in the ICDAR 2015 dataset (Karatzas et al. 2015).

3 Methodology in the Deep Learning Era

As implied by the title of this section, we would like to address recent advances as changes in methodology instead of merely new methods. Our conclusion is grounded in the observations explained in the following paragraph.

Methods in recent years are characterized by the following two distinctions: (1) most methods utilize deep-learning based models; (2) most researchers approach the problem from a diversity of perspectives, trying to solve different challenges. Methods driven by deep learning enjoy the advantage that automatic feature learning saves us from designing and testing a large number of potential hand-crafted features. At the same time, researchers from different viewpoints are enriching and promoting the community with more in-depth work, aiming at different targets, e.g. faster and simpler pipelines (Zhou et al. 2017), text of varying aspect ratios (Shi et al. 2017a), and synthetic data (Gupta et al. 2016). As we can also see further in this section, the incorporation of deep learning has totally changed the way researchers approach the task and has enlarged the scope of research by far. This is the most significant change compared to the former epoch.

In this section, we classify existing methods into a hierarchical taxonomy, and introduce them in a top-down style. First, we divide them into four kinds of systems: (1) text detection that detects and localizes text in natural images; (2) recognition systems that transcribe and convert the content of the detected text regions into linguistic symbols; (3) end-to-end systems that perform both text detection and recognition in one unified pipeline; (4) auxiliary methods that aim to support the main task of text detection and recognition, e.g. synthetic data generation. Under each category, we review recent methods from different perspectives.

3.1 Detection

We acknowledge that scene text detection can be taxonomically subsumed under general object detection, which is dichotomized into one-staged methods and two-staged ones. Indeed, many scene text detection algorithms are largely inspired by and follow the designs of general object detectors. Therefore we also encourage readers to refer to recent surveys on object detection methods (Han et al. 2018; Liu et al. 2018a). However, the detection of scene text has a different set of characteristics and challenges that require unique methodologies and solutions. Thus, many methods rely on special representations of scene text to solve these non-trivial problems.

The evolution of scene text detection algorithms, therefore, undergoes three main stages: (1) In the first stage, learning-based methods are equipped with multi-step pipelines, but these methods are still slow and complicated. (2) Then, the ideas and methods of general object detection are successfully implanted into this task. (3) In the third stage, researchers design special representations based on sub-text components to solve the challenges of long text and irregular text.

Fig. 3 Illustrations of representative scene text detection and recognition system pipelines. a Jaderberg et al. (2016) and b Yao et al. (2016) are representative multi-step methods. c, d are simplified pipelines. In c, detectors and recognizers are separate. In d, the detectors pass cropped feature maps to recognizers, which allows end-to-end training

3.1.1 Early Attempts to Utilize Deep Learning

Early deep-learning-based methods (Huang et al. 2014; Tian et al. 2015; Yao et al. 2016; Zhang et al. 2016; He et al. 2017a) approach the task of text detection as a multi-step process. They use convolutional neural networks (CNNs) to predict local segments and then apply heuristic post-processing steps to merge the segments into detection lines.

In an early attempt (Huang et al. 2014), CNNs are only used to classify local image patches into text and non-text classes. They propose to mine such image patches using MSER features. Positive patches are then merged into text lines.

Later, CNNs are applied to whole images in a fully convolutional manner. TextFlow (Tian et al. 2015) uses CNNs to detect characters and views the character grouping task as a min-cost flow problem (Goldberg 1997).

In Yao et al. (2016), a convolutional neural network is used to predict, for each pixel in the input image, (1) whether it belongs to a character, (2) whether it is inside a text region, and (3) the text orientation around the pixel. Connected positive responses are considered as detected characters or text regions. For characters belonging to the same text region, Delaunay triangulation (Kang et al. 2014) is applied, after which a graph partition algorithm groups characters into text lines based on the predicted orientation attribute.

Similarly, Zhang et al. (2016) first predict a segmentation map indicating text line regions. For each text line region, MSER (Neumann and Matas 2012) is applied to extract character candidates. Character candidates reveal information on the scale and orientation of the underlying text line. Finally, minimum bounding boxes are extracted as the final text line candidates.

He et al. (2017a) propose a detection process that also consists of several steps. First, text blocks are extracted. Then the model crops and focuses only on the extracted text blocks to extract the text center line (TCL), which is defined as a shrunk version of the original text line. Each text line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel into those that belong to the same text instance as the given TCL, and those that do not.

Overall, in this stage, scene text detection algorithms still have long and slow pipelines, though they have replaced some hand-crafted features with learning-based ones. The design methodology is bottom-up and based on key components, such as single characters and text center lines.

3.1.2 Methods Inspired by Object Detection

Later, researchers drew inspiration from the rapidly developing general object detection algorithms (Liu et al. 2016a; Fu et al. 2017; Girshick et al. 2014; Girshick 2015; Ren et al. 2015; He et al. 2017b). In this stage, scene text detection algorithms are designed by modifying the region proposal and bounding box regression modules of general detectors to localize text instances directly (Dai et al. 2017; He et al. 2017c; Jiang et al. 2017; Liao et al. 2017, 2018a; Liu and Jin 2017; Shi et al. 2017a; Liu et al. 2017; Ma et al. 2017; Li et al. 2017b; Liao et al. 2018b; Zhang et al. 2018), as shown in Fig. 4. They mainly consist of stacked convolutional layers that encode the input images into feature maps. Each spatial location of the feature map corresponds to a region of the input image. The feature maps are then fed into a classifier to predict the existence and localization of text instances at each such spatial location.

Fig. 4 High-level illustration of methods inspired by general object detection: a Similar to YOLO (Redmon et al. 2016), regressing offsets based on default bounding boxes at each anchor position. b Variants of SSD (Liu et al. 2016a), predicting at feature maps of different scales. c Predicting at each anchor position and regressing the bounding box directly. d Two-staged methods with an extra stage to correct the initial regression results

These methods greatly reduce the pipeline to an end-to-end trainable neural network component, making training much easier and inference much faster. We introduce the most representative works here.

Inspired by one-staged object detectors, TextBoxes (Liao et al. 2017) adapts SSD (Liu et al. 2016a) to fit the varying orientations and aspect ratios of text by defining default boxes as quadrilaterals with different aspect-ratio specifications.

EAST (Zhou et al. 2017) further simplifies anchor-based detection by adopting the U-shaped design (Ronneberger et al. 2015) to integrate features from different levels. Input images are encoded as one multi-channel feature map, instead of multiple layers of different spatial sizes as in SSD. The feature at each spatial location is used to regress the rectangular or quadrilateral bounding box of the underlying text instance directly. Specifically, the existence of text, i.e. text/non-text, and geometries, e.g. orientation and size for rectangles, and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its ability to perform inference at real-time speed.

Other methods adapt the two-staged object detection framework of R-CNN (Girshick et al. 2014; Girshick 2015; Ren et al. 2015), where the second stage corrects the localization results based on features obtained by Region of Interest (RoI) pooling.

In Ma et al. (2017), rotation region proposal networks are adapted to generate rotated region proposals, in order to fit text of arbitrary orientations, instead of axis-aligned rectangles.

In FEN (Zhang et al. 2018), the weighted sum of RoI poolings with different sizes is used. The final prediction is made by leveraging the textness scores of poolings of 4 different sizes.

Zhang et al. (2019) propose to apply the RoI and localization branch recursively, to revise the predicted position of the text instance. This is a good way to include features at the boundaries of bounding boxes, which localizes the text better than region proposal networks (RPNs).

Wang et al. (2018) propose a parametrized Instance Transformation Network (ITN) that learns to predict an appropriate affine transformation to perform on the last feature layer extracted by the base network, in order to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

To adapt to irregularly shaped text, bounding polygons (Liu et al. 2017) with as many as 14 vertexes are proposed, followed by a Bi-LSTM (Hochreiter and Schmidhuber 1997) layer to refine the coordinates of the predicted vertexes.

In a similar way, Wang et al. (2019b) propose to use recurrent neural networks (RNNs) to read the features encoded by RPN-based two-staged object detectors and predict bounding polygons of variable length. The method requires no post-processing or complex intermediate steps and achieves a much faster speed of 10.0 FPS on Total-Text.

The main contribution in this stage is the simplification of the detection pipeline and the resulting improvement in efficiency. However, the performance of one-staged methods is still limited when faced with curved, oriented, or long text, due to the limitation of the receptive field, and the efficiency of two-staged methods remains limited.
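
To make the dense regression used by methods like EAST more concrete, the following is a minimal NumPy sketch of how a per-pixel score map and a rotated-box geometry map can be decoded into word-box candidates. It is an illustrative reconstruction rather than the published implementation: the function name, the (top, right, bottom, left, angle) channel layout, the score threshold, and the feature-map stride are all assumptions, and the locality-aware NMS used in the paper is omitted.

```python
import numpy as np

def decode_east_geometry(score_map, geo_map, score_thresh=0.8, scale=4):
    """Turn EAST-style dense predictions into rotated word-box candidates.

    score_map: (H, W) text/non-text probabilities.
    geo_map:   (H, W, 5) distances to the top/right/bottom/left box edges
               plus a rotation angle, predicted at each location.
    scale:     stride of the prediction map w.r.t. the input image.
    Returns a list of (cx, cy, w, h, angle, score) candidates; a real
    detector would merge them with (locality-aware) NMS afterwards.
    """
    boxes = []
    ys, xs = np.where(score_map > score_thresh)
    for y, x in zip(ys, xs):
        top, right, bottom, left, angle = geo_map[y, x]
        px, py = x * scale, y * scale           # predicting pixel in image coords
        w, h = left + right, top + bottom       # box size from edge distances
        # offset from the predicting pixel to the box center, rotated by the angle
        dx, dy = (right - left) / 2.0, (bottom - top) / 2.0
        cx = px + dx * np.cos(angle) - dy * np.sin(angle)
        cy = py + dx * np.sin(angle) + dy * np.cos(angle)
        boxes.append((cx, cy, w, h, angle, float(score_map[y, x])))
    return boxes
```
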

3.1.3 Methods Based on Sub-text Components

The main distinction between text detection and general object detection is that text is homogeneous as a whole and is characterized by its locality, unlike general objects. By homogeneity and locality, we refer to the property that any part of a text instance is still text. Humans do not have to see the whole text instance to know that it belongs to some text.

Such a property lays the cornerstone for a new branch of text detection methods that only predict sub-text components and then assemble them into a text instance. These methods, by their nature, can better adapt to the aforementioned challenges of curved, long, and oriented text. As illustrated in Fig. 5, they use neural networks to predict local attributes or segments, and a post-processing step to reconstruct text instances. Compared with early multi-staged methods, they rely more on neural networks and have shorter pipelines.

Fig. 5 Illustration of representative methods based on sub-text components: a SegLink (Shi et al. 2017a): with SSD as the base network, predict word segments at each anchor position, and connections between adjacent anchors. b PixelLink (Deng et al. 2018): for each pixel, predict text/non-text classification and whether it belongs to the same text as adjacent pixels or not. c Corner Localization (Lyu et al. 2018b): predict the four corners of each text instance and group those belonging to the same instance. d TextSnake (Long et al. 2018): predict text/non-text and local geometries, which are used to reconstruct the text instance

In pixel-level methods (Deng et al. 2018; Wu and Natarajan 2017), an end-to-end fully convolutional neural network learns to generate a dense prediction map indicating whether each pixel in the original image belongs to any text instance or not. Post-processing methods then group pixels together depending on which pixels belong to the same text instance. Basically, they can be seen as a special case of instance segmentation (He et al. 2017b). Since text can appear in clusters, which makes predicted pixels connected to each other, the core of pixel-level methods is to separate text instances from each other.

PixelLink (Deng et al. 2018) learns to predict whether two adjacent pixels belong to the same text instance by adding extra output channels to indicate links between adjacent pixels.

The border learning method (Wu and Natarajan 2017) casts each pixel into three categories: text, border, and background, assuming that the border can well separate text instances.

In Wang et al. (2017), pixels are clustered according to their color consistency and edge information. The fused image segments are called superpixels. These superpixels are further used to extract characters and predict text instances.

Upon the segmentation framework, Tian et al. (2019) propose to add a loss term that maximizes the Euclidean distances between pixel embedding vectors belonging to different text instances, and minimizes those belonging to the same instance, to better separate adjacent texts.

Wang et al. (2019a) propose to predict text regions at different shrinkage scales, and to enlarge the detected text regions round by round until they collide with other instances. However, the prediction at different scales is itself a variation of the aforementioned border learning (Wu and Natarajan 2017).

Component-level methods usually predict at a medium granularity. A component refers to a local region of a text instance, sometimes overlapping one or more characters.

The representative component-level method is the Connectionist Text Proposal Network (CTPN) (Tian et al. 2016). CTPN models inherit the ideas of anchoring and recurrent neural networks for sequence labeling. They stack an RNN on top of CNNs. Each position in the final feature map represents the features in the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Geometries such as segment sizes are also predicted. CTPN is the first method to predict and connect segments of scene text with deep neural networks.

SegLink (Shi et al. 2017a) extends CTPN by considering the multi-oriented linkage between segments. The detection of segments is based on SSD (Liu et al. 2016a), where each default box represents a text segment. Links between default boxes are predicted to indicate whether adjacent segments belong to the same text instance. Zhang et al. (2020) further improve SegLink by using a Graph Convolutional Network (Kipf and Welling 2016) to predict the linkage between segments.
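
Both pixel-level and component-level methods such as PixelLink and SegLink defer instance separation to a grouping step: elements (pixels or default boxes) joined by positive link predictions are merged into one text instance. A minimal union-find sketch of that grouping is shown below; the input format (element indices plus thresholded link pairs) is an assumption made for illustration.

```python
def group_by_links(num_elements, positive_links):
    """Group text pixels/segments connected by predicted positive links.

    num_elements:   number of positive pixels (PixelLink) or segments (SegLink).
    positive_links: iterable of (i, j) index pairs whose link score passed a threshold.
    Returns a list mapping each element to a text-instance id.
    """
    parent = list(range(num_elements))

    def find(i):                      # path-compressed root lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in positive_links:       # union the two endpoints of each link
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(num_elements)]
```
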

The corner localization method (Lyu et al. 2018b) proposes to detect the four corners of each text instance. Since each text instance only has 4 corners, the prediction results and their relative positions can indicate which corners should be grouped into the same text instance.

Long et al. (2018) argue that text can be represented as a series of sliding round disks along the text center line (TCL), which accords with the running direction of the text instance, as shown in Fig. 6. With this novel representation, they present a new model, TextSnake, which learns to predict local attributes, including TCL/non-TCL, text-region/non-text-region, radius, and orientation. The intersection of TCL pixels and text region pixels gives the final prediction of pixel-level TCL. Local geometries are then used to extract the TCL in the form of an ordered point list. With the TCL and radius, the text line is reconstructed. TextSnake achieves state-of-the-art performance on several curved text datasets as well as more widely used ones, e.g. ICDAR 2015 (Karatzas et al. 2015) and MSRA-TD500 (Tu et al. 2012). Notably, Long et al. propose a cross-validation test across different datasets, where models are only fine-tuned on datasets with straight text instances and tested on the curved datasets. In all existing curved datasets, TextSnake achieves improvements of up to 20% over other baselines in F1-score.

Fig. 6 a–c Representing text as horizontal rectangles, oriented rectangles, and quadrilaterals. d The sliding-disk representation proposed in TextSnake (Long et al. 2018)

Character-level representation is yet another effective way. Baek et al. (2019b) propose to learn a segmentation map of character centers and of the links between them. Both components and links are predicted in the form of Gaussian heat maps. However, this method requires iterative weak supervision, as real-world datasets are rarely equipped with character-level labels.

Overall, detection based on sub-text components enjoys better flexibility and generalization ability over the shapes and aspect ratios of text instances. The main drawback is that the module or post-processing step used to group segments into text instances may be vulnerable to noise, and the efficiency of this step is highly dependent on the actual implementation, and therefore may vary among different platforms.

3.2 Recognition

In this section, we introduce methods for scene text recognition. The input of these methods is a cropped text instance image containing only one word. In the deep learning era, scene text recognition models use CNNs to encode images into feature spaces. The main difference lies in the text content decoding module. Two major techniques are Connectionist Temporal Classification (CTC) (Graves et al. 2006) and the encoder–decoder framework (Sutskever et al. 2014). We introduce the recognition methods in the literature based on the main technique they employ. Mainstream frameworks are illustrated in Fig. 7.

Both CTC and the encoder–decoder framework were originally designed for 1-dimensional sequential input data, and therefore are applicable to the recognition of straight and horizontal text, which can be encoded into a sequence of feature frames by CNNs without losing important information. However, characters in oriented and curved text are distributed over a 2-dimensional space. It remains a challenge to effectively represent oriented and curved text in feature spaces in order to fit the CTC and encoder–decoder frameworks, whose decoders require 1-dimensional inputs. For oriented and curved text, directly compressing the features into a 1-dimensional form may lose relevant information and bring in noise from the background, thus leading to inferior recognition accuracy. We also introduce techniques to address this challenge.

3.2.1 CTC-Based Methods

The CTC decoding module is adopted from speech recognition, where data are sequential in the time domain. To apply CTC to scene text recognition, the input images are viewed as a sequence of vertical pixel frames. The network outputs a per-frame prediction, indicating the probability distribution over label types for each frame. The CTC rule is then applied to edit the per-frame predictions into a text string. During training, the loss is computed as the sum of the negative log probabilities of all possible per-frame predictions that can generate the target sequence under the CTC rules. Therefore, the CTC method makes the model end-to-end trainable with only word-level annotations, without the need for character-level annotations. The first application of CTC in the OCR domain can be traced to the handwriting recognition system of Graves et al. (2008). Now this technique is widely adopted in scene text recognition (Su and Lu 2014; He et al. 2016; Liu et al. 2016b; Gao et al. 2017; Shi et al. 2017b; Yin et al. 2017).

The first attempts can be referred to as convolutional recurrent neural networks (CRNN). These models are composed by stacking RNNs on top of CNNs and use CTC for training and inference. DTRN (He et al. 2016) is the first CRNN model. It slides a CNN model across the input image to generate convolutional feature slices, which are then fed into RNNs. Shi et al. (2017b) further improve DTRN by adopting a fully convolutional approach to encode the input image as a whole and generate feature slices, utilizing the property that CNNs are not restricted by the spatial sizes of their inputs.
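
The CRNN-style training recipe described above can be summarized in a short PyTorch sketch: a small CNN produces a sequence of per-frame features, a bidirectional LSTM tags each frame, and nn.CTCLoss aligns the frame-wise predictions with word-level labels. The architecture and alphabet size here are toy placeholders, not the configuration of any particular cited model.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN: CNN -> per-frame features -> BiLSTM -> per-frame logits."""
    def __init__(self, num_classes):          # num_classes includes the CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, images):                 # images: (N, 1, 32, W)
        f = self.cnn(images)                   # (N, 128, 8, W/2)
        n, c, h, w = f.shape
        f = f.permute(3, 0, 1, 2).reshape(w, n, c * h)   # (T, N, C*H) frame sequence
        logits, _ = self.rnn(f)
        return self.fc(logits)                 # (T, N, num_classes)

model = TinyCRNN(num_classes=37)               # e.g. 36 characters + blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)
targets = torch.randint(1, 37, (4, 6))         # word-level labels only
logits = model(images)
log_probs = logits.log_softmax(dim=2)
input_lengths = torch.full((4,), logits.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 6, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```
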

Fig. 7 Frameworks of text recognition models. a Sequence tagging model, which uses CTC for alignment in training and inference. b Sequence-to-sequence model, which can be trained directly with cross-entropy. c Segmentation-based methods

Instead of RNNs, Gao et al. (2017) adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation.

Yin et al. (2017) simultaneously detect and recognize characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled only with text transcripts.

3.2.2 Encoder–Decoder Methods

The encoder–decoder framework for sequence-to-sequence learning was originally proposed in Sutskever et al. (2014) for machine translation. The encoder RNN reads an input sequence and passes its final latent state to a decoder RNN, which generates the output in an auto-regressive way. The main advantage of the encoder–decoder framework is that it gives outputs of variable lengths, which satisfies the task setting of scene text recognition. The encoder–decoder framework is usually combined with the attention mechanism (Bahdanau et al. 2014), which jointly learns to align the input sequence and the output sequence.

Lee and Osindero (2016) present recursive recurrent neural networks with attention modeling for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features and then decodes them into output characters with recurrent neural networks that implicitly learn character-level language statistics. The attention mechanism performs soft feature selection for better image feature usage.

Cheng et al. (2017a) observe the attention drift problem in existing attention-based methods and propose to impose localization supervision on the attention scores to attenuate it.

Bai et al. (2018) propose an edit probability (EP) metric to handle the misalignment between the ground truth string and the attention module's output sequence of probability distributions. Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximal likelihood loss, EP tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing or superfluous characters.

Liu et al. (2018d) propose an efficient attention-based encoder–decoder model, in which the encoder part is trained under binary constraints to reduce computation cost.

Both CTC and the encoder–decoder framework simplify the recognition pipeline and make it possible to train scene text recognizers with only word-level annotations instead of character-level annotations. Compared to CTC, the decoder module of the encoder–decoder framework is an implicit language model, and therefore it can incorporate more linguistic priors. For the same reason, the encoder–decoder framework requires a larger training dataset with a larger vocabulary; otherwise, the model may degenerate when reading words that are unseen during training. On the contrary, CTC is less dependent on language models and has better character-to-pixel alignment, and is therefore potentially better for languages such as Chinese and Japanese that have a large character set. The main drawback of these two methods is that they assume the text to be straight, and therefore cannot adapt to irregular text.
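
For reference, one step of a Bahdanau-style attention decoder, as used by the encoder–decoder recognizers discussed in this subsection, can be sketched as follows. The module names and dimensions are illustrative; during training the previous ground-truth character is usually fed in (teacher forcing) and a cross-entropy loss is applied at every step.

```python
import torch
import torch.nn as nn

class AttentionDecoderCell(nn.Module):
    """One step of an attention-based character decoder (illustrative only)."""
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)        # additive attention score
        self.rnn = nn.GRUCell(feat_dim + num_classes, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features, prev_hidden, prev_char_onehot):
        # features: (N, T, feat_dim) encoder frames; prev_hidden: (N, hidden_dim)
        n, t, _ = features.shape
        h = prev_hidden.unsqueeze(1).expand(-1, t, -1)
        alpha = torch.softmax(self.score(torch.cat([features, h], dim=2)), dim=1)
        context = (alpha * features).sum(dim=1)                  # (N, feat_dim) glimpse
        hidden = self.rnn(torch.cat([context, prev_char_onehot], dim=1), prev_hidden)
        return self.classifier(hidden), hidden                   # logits for the next character
```
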

3.2.3 Adaptations for Irregular Text Recognition

Rectification modules are a popular solution to irregular text recognition. Shi et al. (2016, 2018) propose a text recognition system which combines a Spatial Transformer Network (STN) (Jaderberg et al. 2015) and an attention-based sequence recognition network. The STN module predicts text bounding polygons with fully connected layers, which parameterize Thin-Plate-Spline transformations that rectify the input irregular text image into a more canonical form, i.e. straight text. Rectification has proved to be a successful strategy and forms the basis of the winning solution (Long et al. 2019) in the ICDAR 2019 ArT irregular text recognition competition (https://rrc.cvc.uab.es/?ch=14).

There have also been several improved versions of rectification-based recognition. Zhan and Lu (2019) propose to perform rectification multiple times to gradually rectify the text. They also replace the text bounding polygons with a polynomial function to represent the shape. Yang et al. (2019) propose to predict local attributes, such as radius and orientation values for pixels inside the text center region, in a similar way to TextSnake (Long et al. 2018). The orientation is defined as the orientation of the underlying character boxes, instead of text bounding polygons. Based on these attributes, bounding polygons are reconstructed in a way that rectifies the perspective distortion of the characters, whereas the methods by Shi et al. and Zhan et al. may only rectify at the text level and leave the characters distorted.
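
The Thin-Plate-Spline rectification used by Shi et al. involves predicting fiducial control points; as a simplified stand-in, the sketch below shows the same spatial-transformer idea with a plain affine transform, using PyTorch's affine_grid/grid_sample. A faithful implementation would replace the affine matrix with TPS parameters and place the rectifier in front of the attention recognizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineRectifier(nn.Module):
    """Predicts an affine transform and resamples the input image
    (an affine stand-in for Thin-Plate-Spline rectification)."""
    def __init__(self):
        super().__init__()
        self.loc_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
        )
        # initialize to the identity transform so training starts from "no warp"
        self.loc_net[-1].weight.data.zero_()
        self.loc_net[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                        # x: (N, 3, H, W) cropped word image
        theta = self.loc_net(x).view(-1, 2, 3)   # per-image 2x3 affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # rectified image
```
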
Yang et al. (2017) introduce an auxiliary dense character detection task to encourage the learning of visual representations that are favorable to the text patterns. They adopt an alignment loss to regularize the estimated attention at each time step. Further, they use a coordinate map as a second input to enforce spatial awareness.

Cheng et al. (2017b) argue that encoding a text image as a 1-D sequence of features, as implemented in most methods, is not sufficient. They encode an input image into four feature sequences along four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is applied to combine the four feature sequences.

Liu et al. (2018b) present a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, resulting in improved efficiency, and can handle different types of distortion that are hard to model with a single global transformation.

Liao et al. (2019b) cast the task of recognition as semantic segmentation, and treat each character type as one class. The method is insensitive to shapes and is thus effective on irregular text, but the lack of end-to-end training and sequence learning makes it prone to single-character errors, especially when the image quality is low. They are also the first to evaluate the robustness of their recognition method by padding and transforming test images.

Another solution to irregular scene text recognition is 2-dimensional attention (Xu et al. 2015), which has been verified in Li et al. (2019). Different from the sequential encoder–decoder framework, the 2D attentional model maintains 2-dimensionally encoded features, and attention scores are computed over all spatial locations. Similar to spatial attention, Long et al. (2020) propose to first detect characters. Afterward, features are interpolated and gathered along the character center lines to form sequential feature frames.

In addition to the aforementioned techniques, Qin et al. (2019) show that simply flattening the feature maps from 2 dimensions to 1 dimension and feeding the resulting sequential features to an RNN-based attentional encoder–decoder model is sufficient to produce state-of-the-art recognition results on irregular text, which is a simple yet efficient solution.

Apart from tailored model designs, Long et al. (2019) synthesize a curved text dataset, which significantly boosts the recognition performance on real-world curved text datasets with no sacrifice on straight text datasets.

Although many elegant and neat solutions have been proposed, they are only evaluated and compared on a relatively small dataset, CUTE80, which consists of only 288 word samples. Besides, the training datasets used in these works only contain a negligible proportion of irregular text samples. Evaluations on larger datasets and more suitable training datasets may help us understand these methods better.

3.2.4 Other Methods

Jaderberg et al. (2014a, b) perform word recognition by classifying the image into a pre-defined vocabulary, under the framework of image classification. The model is trained on synthetic images, and achieves state-of-the-art performance on some benchmarks containing English words only. However, the applicability of this method is quite limited, as it cannot recognize unseen sequences such as phone numbers and email addresses.

To improve performance on difficult cases such as occlusion, which brings ambiguity to single character recognition, Yu et al. (2020) propose a transformer-based semantic reasoning module that translates the coarse, error-prone text outputs of the decoder into fine and linguistically calibrated outputs, which bears some resemblance to the deliberation networks for machine translation (Xia et al. 2017) that first translate and then re-write the sentences.

Despite the progress we have seen so far, the evaluation of recognition methods falls behind the times. As most detection methods can detect oriented and irregular text and some even rectify them, the recognition of such text may seem redundant. On the other hand, the robustness of recognition when text is cropped with a slightly different bounding box is seldom verified. Such robustness may be more important in real-world scenarios.

3.3 End-to-End System

In the past, text detection and recognition were usually cast as two independent sub-problems that are combined to perform text reading from images. Recently, many end-to-end text detection and recognition systems (also known as text spotting systems) have been proposed, profiting greatly from the idea of designing differentiable computation graphs, as shown in Fig. 8. Efforts to build such systems have gained considerable momentum as a new trend.

Fig. 8 Illustration of mainstream end-to-end frameworks. a In SEE (Bartz et al. 2017), the detection results are represented as grid matrices. Image regions are cropped and transformed before being fed into the recognition branch. b Some methods crop from the feature maps and feed them to the recognition branch. c While a, b utilize CTC-based and attention-based recognition branches, it is also possible to retrieve each character as a generic object and compose the text

Two-Step Pipelines While earlier works (Wang et al. 2011, 2012) first detect single characters in the input image, recent systems usually detect and recognize text at the word level or line level. Some of these systems first generate text proposals using a text detection model and then recognize them with another text recognition model (Jaderberg et al. 2016; Liao et al. 2017; Gupta et al. 2016). Jaderberg et al. (2016) use a combination of Edge Box proposals (Zitnick and Dollár 2014) and a trained aggregate channel features detector (Dollár et al. 2014) to generate candidate word bounding boxes. Proposal boxes are filtered and rectified before being sent into their recognition model proposed in Jaderberg et al. (2014b). Liao et al. (2017) combine an SSD (Liu et al. 2016a) based text detector and CRNN (Shi et al. 2017b) to spot text in images.

In these methods, detected words are cropped from the image, and therefore the detection and recognition are two separate steps. One major drawback of two-step methods is that the propagation of errors between the detection and recognition models leads to less satisfactory performance.

Two-Stage Pipelines Recently, end-to-end trainable networks have been proposed to tackle this problem (Bartz et al. 2017; Busta et al. 2017; Li et al. 2017a; He et al. 2018; Liu et al. 2018c), where feature maps instead of images are cropped and fed to the recognition modules.

Bartz et al. (2017) present a solution that utilizes an STN (Jaderberg et al. 2015) to circularly attend to each word in the input image, and then recognize them separately. The unified network is trained in a weakly supervised manner in which no word bounding box labels are used. Li et al. (2017a) substitute the object classification module in Faster R-CNN (Ren et al. 2015) with an encoder–decoder based text recognition model to make up their text spotting system. Liu et al. (2018c), Busta et al. (2017) and He et al. (2018) develop unified text detection and recognition systems with very similar overall architectures, which consist of a detection branch and a recognition branch. Liu et al. (2018c) and Busta et al. (2017) adopt EAST (Zhou et al. 2017) and YOLOv2 (Redmon and Farhadi 2017) as their detection branches respectively, and have a similar text recognition branch in which text proposals are pooled into fixed-height tensors by bilinear sampling and then transcribed into strings by a CTC-based recognition module. He et al. (2018) also adopt EAST (Zhou et al. 2017) to generate text proposals, and they introduce character spatial information as explicit supervision in the attention-based recognition branch. Lyu et al. (2018a) propose a modification of Mask R-CNN. For each region of interest, character segmentation maps are produced, indicating the existence and location of single characters. A post-processing step that orders these characters from left to right gives the final results. In contrast to the aforementioned works that perform RoI pooling based on oriented bounding boxes, Qin et al. (2019) propose to use axis-aligned bounding boxes and mask the cropped features with a 0/1 textness segmentation mask (He et al. 2017b).
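
The key difference between two-step and two-stage spotters is where the crop happens. The snippet below sketches how a two-stage system might crop recognition features from the shared feature map with RoIAlign so that gradients from the recognition loss reach the backbone; the tensor sizes, the stride, and the use of axis-aligned boxes (as in Qin et al. 2019) are illustrative assumptions, and the rotated-RoI or bilinear-sampling variants used in the other cited works differ.

```python
import torch
from torchvision.ops import roi_align

# Shared backbone features for a batch of 2 images, stride 4 w.r.t. the input.
features = torch.randn(2, 256, 160, 160)

# Detected word boxes in input-image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0, 40.0, 100.0, 300.0, 140.0],
                      [1, 10.0,  20.0, 200.0,  60.0]])

# Pool every proposal to a fixed size and hand the crops to the recognition
# branch; gradients from the recognition loss flow back into the shared features.
crops = roi_align(features, boxes, output_size=(8, 32),
                  spatial_scale=0.25, sampling_ratio=2)
print(crops.shape)   # torch.Size([2, 256, 8, 32])
```
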
One-Stage Pipeline In addition to two-staged methods, Xing et al. (2019) predict character and text bounding boxes as well as character-type segmentation maps in parallel. The text bounding boxes are then used to group character boxes into the final word transcription results. This is the first one-staged method.

3.4 Auxiliary Techniques

Recent advances are not limited to detection and recognition models that aim to solve the tasks directly. We should also give credit to auxiliary techniques that have played an important role.

3.4.1 Synthetic Data

Most deep learning models are data-hungry. Their performance is guaranteed only when enough data are available. In the field of text detection and recognition, this problem is more urgent, since most human-labeled datasets are small, usually containing merely around 1K–2K data instances. Fortunately, there has been work (Jaderberg et al. 2014b; Gupta et al. 2016; Zhan et al. 2018; Liao et al. 2019a) that generates data of relatively high quality, which has been widely used for pre-training models for better performance.

Jaderberg et al. (2014b) propose to generate synthetic data for text recognition. Their method blends text with randomly cropped natural images from human-labeled datasets after rendering of font, border/shadow, color, and distortion. The results show that training merely on these synthetic data can achieve state-of-the-art performance, and that synthetic data can act as an augmentative data source for all datasets.

SynthText (Gupta et al. 2016) first proposes to embed text in natural scene images for the training of text detection, whereas most previous work only printed text on a cropped region and such synthetic data were only used for text recognition. Printing text on whole natural images poses new challenges, as it needs to maintain semantic coherence. To produce more realistic data, SynthText makes use of depth prediction (Liu et al. 2015) and semantic segmentation (Arbelaez et al. 2011). Semantic segmentation groups pixels together into semantic clusters, and each text instance is printed on one semantic surface, not overlapping multiple ones. A dense depth map is further used to determine the orientation and distortion of the text instance. The model trained only on SynthText achieves state-of-the-art results on many text detection datasets. It is later used in other works (Zhou et al. 2017; Shi et al. 2017a) as well for initial pre-training.
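
The geometric core of SynthText-style rendering is warping a rendered text patch onto a planar region found in the scene. The toy OpenCV sketch below shows only that warping-and-pasting step under assumed inputs (a pre-rendered patch and a target quadrilateral); the depth- and segmentation-based region proposal and the more careful blending of the actual pipeline are omitted.

```python
import cv2
import numpy as np

def paste_text_patch(scene, text_patch, target_quad):
    """Warp a rendered text patch onto a quadrilateral region of a scene image.

    scene:       (H, W, 3) background image.
    text_patch:  (h, w, 3) image of rendered text on a black background.
    target_quad: 4x2 float array, corners of the target surface region
                 (clockwise from top-left), e.g. derived from segmentation/depth.
    """
    h, w = text_patch.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src, np.float32(target_quad))
    warped = cv2.warpPerspective(text_patch, M, (scene.shape[1], scene.shape[0]))
    mask = warped.sum(axis=2) > 0          # binary mask of the warped text pixels
    out = scene.copy()
    out[mask] = warped[mask]               # naive paste; the real pipeline blends more carefully
    return out
```
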
Further, Zhan et al. (2018) equip text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or a wall instead of someone's face. Text rendering in their work is adapted to the image so that the text fits into the artistic style and does not stand out awkwardly.

SynthText3D (Liao et al. 2019a) uses the famous open-source game engine Unreal Engine 4 (UE4) and UnrealCV (Qiu et al. 2017) to synthesize scene text images. Text is rendered together with the scene and can thus achieve different lighting conditions, weather, and natural occlusions. However, SynthText3D simply follows the pipeline of SynthText and only makes use of the ground-truth depth and segmentation maps provided by the game engine. As a result, SynthText3D relies on manual selection of camera views, which limits its scalability. Besides, the proposed text regions are generated by clipping maximal rectangular bounding boxes extracted from segmentation maps, and therefore are limited to the middle parts of large and well-defined regions, which is an unfavorable location bias.

UnrealText (Long and Yao 2020) is another work using game engines to synthesize scene text images. It features deep interaction with the 3D worlds during synthesis. A ray-casting based algorithm is proposed to navigate the 3D worlds efficiently and is able to generate diverse camera views automatically. The text region proposal module is based on collision detection and can put text onto whole surfaces, thus getting rid of the location bias. UnrealText achieves a significant speedup and better detector performance.

Text Editing It is also worthwhile to mention the text editing task proposed recently (Wu et al. 2019; Yang et al. 2020). Both works try to replace the text content while retaining the text style in natural images, such as the spatial arrangement of characters, text fonts, and colors. Text editing per se is useful in applications such as instant translation using cellphone cameras. It also has great potential for augmenting existing scene text images, though we have not seen any relevant experimental results yet.

3.4.2 Weakly and Semi-supervision

Bootstrapping for Character Boxes Character-level annotations are more accurate and better. However, most existing datasets do not provide character-level annotations. Since characters are smaller and close to each other, character-level annotation is more costly and inconvenient. There has been some work on semi-supervised character detection. The basic idea is to initialize a character detector and apply rules or thresholds to pick the most reliable predicted candidates. These reliable candidates are then used as additional supervision sources to refine the character detector. The methods discussed below all aim to augment existing datasets with character-level annotations; their differences are illustrated in Fig. 9.

WordSup (Hu et al. 2017) first initializes the character detector by training for 5K warm-up iterations on synthetic datasets. For each image, WordSup generates character candidates, which are then filtered with word boxes. For the characters in each word box, the following score is computed to select the most probable character list:

s = w · area(B_chars) / area(B_word) + (1 − w) · (1 − λ2 / λ1)    (1)

where B_chars is the union of the selected character boxes; B_word is the enclosing word bounding box; λ1 and λ2 are the first- and second-largest eigenvalues of a covariance matrix C, computed from the coordinates of the centers of the selected character boxes; and w is a weight scalar. Intuitively, the first term measures how completely the selected characters cover the word box, while the second term measures whether the selected characters are located on a straight line, which is the main characteristic of word instances in most datasets.
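
For clarity, Eq. (1) can be transcribed directly into code. This sketch assumes (x1, y1, x2, y2) boxes and approximates area(B_chars) with the sum of the individual character-box areas rather than their exact union.

```python
import numpy as np

def wordsup_score(char_boxes, word_box, w=0.5):
    """Score a selected character list as in Eq. (1).

    char_boxes: (k, 4) array of selected character boxes, (x1, y1, x2, y2), k >= 2.
    word_box:   (4,) enclosing word box.
    The first term rewards covering the word box; the second rewards character
    centers that lie close to a straight line (ratio of covariance eigenvalues).
    """
    # area term (sum of box areas as a stand-in for the area of their union)
    area_chars = np.sum((char_boxes[:, 2] - char_boxes[:, 0]) *
                        (char_boxes[:, 3] - char_boxes[:, 1]))
    area_word = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])

    # straightness term from the eigenvalues of the character-center covariance
    centers = np.stack([(char_boxes[:, 0] + char_boxes[:, 2]) / 2.0,
                        (char_boxes[:, 1] + char_boxes[:, 3]) / 2.0], axis=0)
    lam = np.sort(np.linalg.eigvalsh(np.cov(centers)))[::-1]   # lambda_1 >= lambda_2
    straightness = 1.0 - lam[1] / max(lam[0], 1e-6)

    return w * area_chars / area_word + (1.0 - w) * straightness
```
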

Fig. 9 Overview of semi-supervised and weakly-supervised methods. Existing methods differ in how filtering is done: a WeText (Tian et al. 2017), mainly by thresholding the confidence level and filtering with word-level annotations; b scoring-based methods, including WordSup (Hu et al. 2017), which assumes that text instances are straight lines and uses an eigenvalue-based metric to measure straightness; c grouping characters into words using ground-truth word bounding boxes and comparing the number of characters (Baek et al. 2019b; Xing et al. 2019)

WeText (Tian et al. 2017) starts with a small dataset annotated at the character level. It follows two paradigms of bootstrapping: semi-supervised learning and weakly-supervised learning. In the semi-supervised setting, detected character candidates are filtered with a high confidence threshold. In the weakly-supervised setting, ground-truth word boxes are used to mask out false positives outside them. New instances detected in either way are added to the initial small dataset and the model is re-trained.

In Baek et al. (2019b) and Xing et al. (2019), the character candidates are filtered with the help of word-level annotations. For each word instance, if the number of detected character bounding boxes inside the word bounding box equals the length of the ground-truth word, the character bounding boxes are regarded as correct.

Partial Annotations In order to improve the recognition performance of end-to-end word spotting models on curved text, Qin et al. (2019) propose to use off-the-shelf straight scene text spotting models to annotate a large number of unlabeled images. These images are called partially labeled images, since the off-the-shelf models may omit some words. Such partially annotated straight text proves to greatly boost the performance on irregular text.

Another similar effort is the large dataset proposed by Sun et al. (2019), where each image is only annotated with one dominant text instance. They also design an algorithm to utilize these partially labeled data, which they claim are cheaper to annotate.

4 Benchmark Datasets and Evaluation Protocols

As cutting-edge algorithms achieved better performance on existing datasets, researchers were able to tackle more challenging aspects of the problems. New datasets aimed at different real-world challenges have been and are being crafted, benefiting the further development of detection and recognition methods.

In this section, we list and briefly introduce the existing datasets and the corresponding evaluation protocols. We also identify current state-of-the-art approaches on the widely used datasets when applicable.

4.1 Benchmark Datasets

We collect existing datasets and summarize their statistics in Table 1. We select some representative image samples from some of the datasets, which are shown in Fig. 10. Links to these datasets are also collected in our Github repository mentioned in the abstract, for readers' convenience. In this section, we select some representative datasets and discuss their characteristics.
The ICDAR 2015 incidental text dataset focuses on small and oriented text. The images were taken by Google Glasses without taking care of the image quality. A large proportion of text in the images is very small, blurred, occluded, and multi-oriented, making it very challenging.

The ICDAR MLT 2017 and 2019 datasets contain scripts of 9 and 10 languages respectively. They are the only multilingual datasets so far.

Total-Text has a large proportion of curved text, while previous datasets contain only few. These images are mainly taken from street billboards, and annotated as polygons with a variable number of vertices.

Table 1 Public datasets for scene text detection and recognition
Dataset (year) Image Num (train/val/test) Orientation Language Features Det. Recog.
SVT (2010) 100/0/250 Horizontal EN – ✓ ✓
ICDAR 2003 258/0/251 Horizontal EN – ✓ ✓
ICDAR 2013 229/0/233 Horizontal EN Stroke labels ✓ ✓
CUTE (2014) 0/0/80 Curved EN – ✓ ✓
ICDAR 2015 1000/0/500 Multi-oriented EN Blur, small ✓ ✓
ICDAR RCTW 2017 8034/0/4229 Multi-oriented CN – ✓ ✓
Total-Text (2017) 1255/0/300 Curved EN, CN Polygon label ✓ ✓
CTW (2017) 25000/0/6000 Multi-oriented CN Detailed attributes ✓ ✓
COCO-Text (2017) 43686/10000/10000 Multi-oriented EN – ✓ ✓
ICDAR MLT 2017 7200/1800/9000 Multi-oriented 9 languages – ✓ ✓
ICDAR MLT 2019 10000/0/10000 Multi-oriented 10 languages – ✓ ✓
ICDAR ArT (2019) 5603/0/4563 Curved EN, CN – ✓ ✓
LSVT (2019) 20157/4968/4841 Multi-oriented CN 400K partially labeled images ✓ ✓
MSRA-TD 500 (2012) 300/0/200 Multi-oriented EN, CN Long text ✓ –
HUST-TR 400 (2014) 400/0/0 Multi-oriented EN, CN Long text ✓ –
CTW1500 (2017) 1000/0/500 Curved EN – ✓ –
SVHN (2010) 73257/0/26032 Horizontal Digits Household numbers – ✓
IIIT5K-Word (2012) 2000/0/3000 Horizontal EN – – ✓
SVTP (2013) 0/0/639 Multi-oriented EN Perspective text – ✓
EN stands for English and CN stands for Chinese. Note that HUST-TR 400 is a supplementary training dataset for MSRA-TD 500. ICDAR 2013 refers to the ICDAR 2013 Focused Scene Text Competition. ICDAR 2015 refers to the ICDAR 2015 Incidental Text Competition. The last two columns indicate whether the datasets provide annotations for the detection and recognition tasks


Fig. 10 Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD 500, ICDAR 2013, ICDAR 2015, ICDAR 2017 MLT, ICDAR 2017 RCTW, and Total-Text

The Chinese Text in the Wild (CTW) dataset (Yuan et al. 2018) contains 32,285 high-resolution street view images, annotated at the character level, including the underlying character type, bounding box, and detailed attributes such as whether word-art is used. The dataset is the largest one to date and the only one that contains detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g. English.

LSVT (Sun et al. 2019) is composed of two datasets. One is fully labeled with word bounding boxes and word content. The other, while much larger, is only annotated with the word content of the dominant text instance. The authors propose to work on such partially labeled data, which are much cheaper to obtain.

IIIT 5K-Word (Mishra et al. 2012) is the largest scene text recognition dataset, containing both digital and natural scene images. Its variance in font, color, size, and other noise makes it the most challenging one to date.

4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.

As metrics for performance comparison of different algorithms, we usually refer to their precision, recall and F1-score. To compute these performance indicators, the list of predicted text instances should be matched to the ground-truth labels in the first place. Precision, denoted as P, is calculated as the proportion of predicted text instances that can be matched to ground-truth labels. Recall, denoted as R, is the proportion of ground-truth labels that have correspondents in the predicted list. The F1-score is then computed by F1 = 2 · P · R / (P + R), taking both precision and recall into account. Note that the matching between the predicted instances and the ground-truth ones comes first.
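As a minimal illustration (ours, assuming the matching step has already produced a set of one-to-one pairs), the dataset-level scores can be computed as follows; the function name is an assumption for this sketch, not part of any official toolkit:

def precision_recall_f1(num_matched, num_pred, num_gt):
    # Precision: matched predictions / all predictions.
    p = num_matched / num_pred if num_pred else 0.0
    # Recall: matched ground-truth instances / all ground-truth instances.
    r = num_matched / num_gt if num_gt else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1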
4.2.1 Text Detection

There are mainly two different protocols for text detection, the IoU-based PASCAL Eval and the overlap-based DetEval. They differ in the criterion for matching predicted text instances and ground-truth ones. In the following part, we use these notations: S_GT is the area of the ground-truth bounding box, S_P is the area of the predicted bounding box, S_I is the area of the intersection of the predicted and ground-truth bounding boxes, and S_U is the area of their union.

• DetEval DetEval imposes constraints on both precision, i.e. S_I / S_P, and recall, i.e. S_I / S_GT. Only when both are larger than their respective thresholds are the two boxes matched together.
• PASCAL (Everingham et al. 2015) The basic idea is that, if the intersection-over-union value, i.e. S_I / S_U, is larger than a designated threshold, the predicted and ground-truth boxes are matched together.
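A small sketch of the two criteria for axis-aligned boxes (x1, y1, x2, y2) follows; the thresholds shown are common choices (0.5 for IoU, 0.4/0.8 for DetEval) but should be read as illustrative assumptions rather than the official configuration:

def box_areas(gt, pred):
    # Intersection, ground-truth, prediction and union areas.
    iw = max(0.0, min(gt[2], pred[2]) - max(gt[0], pred[0]))
    ih = max(0.0, min(gt[3], pred[3]) - max(gt[1], pred[1]))
    s_i = iw * ih
    s_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    s_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    return s_i, s_gt, s_p, s_gt + s_p - s_i

def pascal_match(gt, pred, iou_thresh=0.5):
    s_i, _, _, s_u = box_areas(gt, pred)
    return s_i / s_u >= iou_thresh          # IoU criterion

def deteval_match(gt, pred, prec_thresh=0.4, rec_thresh=0.8):
    s_i, s_gt, s_p, _ = box_areas(gt, pred)
    return s_i / s_p >= prec_thresh and s_i / s_gt >= rec_thresh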
Most works follow either one of the two evaluation protocols, but with small modifications. We only discuss those that are different from the two protocols mentioned above.

• ICDAR-2003/2005 The match score m is calculated in a way similar to IoU. It is defined as the ratio of the area of intersection over the area of the minimum bounding rectangle containing both boxes.
• ICDAR-2011/2013 One major drawback of the evaluation protocol of ICDAR 2003/2005 is that it only considers one-to-one matches. It does not consider one-to-many, many-to-many, and many-to-one matching, which underestimates the actual performance. Therefore, ICDAR-2011/2013 follows the method proposed by Wolf and Jolion (2006), where one-to-one matching is assigned a score of 1 and the other two types are punished with a constant score less than 1, usually set as 0.8.
• MSRA-TD 500 Tu et al. (2012) propose a new evaluation protocol for rotated bounding boxes, where both the predicted and ground-truth bounding boxes are revolved horizontally around their centers. They are matched only when the standard IoU score is higher than the threshold and the rotation difference between the original bounding boxes is less than a pre-defined value (in practice π/4).
• TIoU (Liu et al. 2019) Tightness-IoU takes into account the fact that scene text recognition is sensitive to missing parts and superfluous parts in detection results. Not-retrieved areas will result in missing characters in recognition results, and redundant areas will result in unexpected characters. The proposed metrics penalize IoU by scaling it down by the proportion of missing areas and the proportion of superfluous areas that overlap with other text.
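The rotated-box rule above can be sketched as follows, assuming each box is given as (cx, cy, w, h, angle); treating the "revolved" boxes as axis-aligned rectangles around their centers is our simplification of the protocol, not its official implementation:

import math

def rotated_match(gt, pred, iou_thresh=0.5, angle_thresh=math.pi / 4):
    def to_aabb(b):
        # Revolve the box to horizontal around its center.
        cx, cy, w, h, _ = b
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    a, b = to_aabb(gt), to_aabb(pred)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union else 0.0
    # Matched only if the IoU of the revolved boxes is high enough and the
    # rotation difference stays below the pre-defined value.
    return iou >= iou_thresh and abs(gt[4] - pred[4]) < angle_thresh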

The main drawback of existing evaluation protocols is that they only consider the best F1 scores under arbitrarily selected confidence thresholds chosen on the test sets. Qin et al. (2019) also evaluate their method with the average precision (AP) metric that is widely adopted in general object detection. While F1 scores are only single points on the precision-recall curves, AP values consider the whole precision-recall curve. Therefore, AP is a more comprehensive metric and we urge that researchers in this field use AP instead of F1 alone.

4.2.2 Text Recognition and End-to-End System

In scene text recognition, the predicted text string is compared to the ground truth directly. The performance evaluation is done either at the character level (i.e. how many characters are recognized correctly) or at the word level (whether the predicted word is exactly the same as the ground truth). ICDAR also introduces an edit-distance based performance evaluation.
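The two word-level measures can be sketched as below; the edit-distance based score shown here is one common normalization (1 minus the Levenshtein distance divided by the longer string length), which may differ in detail from the ICDAR implementation:

def word_accuracy(preds, gts):
    # Fraction of words predicted exactly right.
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def edit_score(pred, gt):
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt), 1)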
In end-to-end evaluation, matching is first performed in a similar way to that of text detection, and then the text content is compared.

The most widely used datasets for end-to-end systems are ICDAR 2013 (Karatzas et al. 2013) and ICDAR 2015 (Karatzas et al. 2015). The evaluation over these two datasets is carried out under two different settings (see http://rrc.cvc.uab.es/files/Robust_Reading_2015_v02.pdf), the Word Spotting setting and the End-to-End setting. Under Word Spotting, the performance evaluation only focuses on the text instances from the scene image that appear in a predesignated vocabulary, while other text instances are ignored. On the contrary, all text instances that appear in the scene image are included under End-to-End. Three different vocabulary lists are provided for candidate transcriptions: Strongly Contextualised, Weakly Contextualised, and Generic. The three kinds of lists are summarized in Table 8. Note that under End-to-End, these vocabularies can still serve as references.
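As a rough sketch of the difference between the two settings (our simplification, not the official evaluation script): under Word Spotting only the ground-truth words that appear in the given vocabulary are kept before matching, whereas End-to-End keeps all instances.

def filter_for_word_spotting(gt_words, vocabulary):
    # Keep only the ground-truth transcriptions that occur in the vocabulary;
    # the remaining instances are ignored during matching.
    vocab = {w.lower() for w in vocabulary}
    return [w for w in gt_words if w.lower() in vocab]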
Evaluation results of recent methods on several widely adopted benchmark datasets are summarized in the following tables: Table 2 for detection on ICDAR 2013, Table 4 for detection on ICDAR 2015 Incidental Text, Table 3 for detection on ICDAR 2017 MLT, Table 5 for detection and end-to-end word spotting on Total-Text, Table 6 for detection on CTW1500, Table 7 for detection on MSRA-TD 500, Table 9 for recognition on several datasets, and Table 10 for end-to-end text spotting on ICDAR 2013 and ICDAR 2015. Note that we do not report performance under multi-scale conditions if single-scale performances are reported; we use ∗ to indicate methods for which only multi-scale performances are reported. Since different backbone feature extractors are used in some works, we only report performances based on ResNet-50 unless not provided. For a better illustration, we plot the recent progress of detection performance in Fig. 11, and recognition performance in Fig. 12.

Note that current evaluation practices for scene text recognition may be problematic. According to Baek et al. (2019a), most researchers are actually using different subsets when they refer to the same dataset, causing discrepancies in performance. Besides, Long and Yao (2020) further point out that half of the widely adopted benchmark datasets have imperfect annotations, e.g. ignoring case-sensitivity and punctuation, and provide new annotations for those datasets. Though most papers claim to train their models to recognize in a case-sensitive way and to include punctuation, they may be limiting their output to only digits and case-insensitive characters during evaluation.
Table 2 Detection on ICDAR 2013
Method P R F1
Zhang et al. (2016) 88 78 83
Gupta et al. (2016) 92.0 75.5 83.0
Yao et al. (2016) 88.88 80.22 84.33
Deng et al. (2018) 86.4 83.6 84.5
He et al. (2017a)(∗) 93 79 85
Shi et al. (2017a) 87.7 83.0 85.3
Lyu et al. (2018b) 93.3 79.4 85.8
He et al. (2017d) 92 80 86
Liao et al. (2017) 89 83 86
Zhou et al. (2017) 92.64 82.67 87.37
Liu et al. (2018e) 88.2 87.2 87.7
Tian et al. (2016) 93 83 88
He et al. (2017c) 89 86 88
He et al. (2018) 88 87 88
Xue et al. (2018) 91.5 87.1 89.2
Hu et al. (2017)(∗) 93.34 87.53 90.34
Lyu et al. (2018a)(∗) 94.1 88.1 91.0
Zhang et al. (2018) 93.7 90.0 92.3
Baek et al. (2019b) 97.4 93.1 95.2

Table 3 Detection on ICDAR MLT 2017
Method P R F1
Liu et al. (2018c) 81.0 57.5 67.3
Zhang et al. (2019) 60.6 78.8 68.5
Wang et al. (2019a) 73.4 69.2 72.1
Xing et al. (2019) 70.10 77.07 73.42
Baek et al. (2019b) 68.2 80.6 73.9
Long and Yao (2020) 82.2 67.4 74.1

Table 4 Detection on ICDAR 2015
Method P R F1 FPS
Zhang et al. (2016) 71 43.0 54 0.5
Tian et al. (2016) 74 52 61 –
He et al. (2017a)(∗) 76 54 63 –
Yao et al. (2016) 72.26 58.69 64.77 1.6
Shi et al. (2017a) 73.1 76.8 75.0 –
Liu et al. (2018e) 72 80 76 –
He et al. (2017c) 80 73 77 7.7
Hu et al. (2017)(∗) 79.33 77.03 78.16 2.0
Zhou et al. (2017) 83.57 73.47 78.20 13.2
Wang et al. (2018) 85.7 74.1 79.5 –
Lyu et al. (2018b) 94.1 70.7 80.7 3.6
He et al. (2017d) 82 80 81 –
Jiang et al. (2017) 85.62 79.68 82.54 –
Long et al. (2018) 84.9 80.4 82.6 10.2
He et al. (2018) 84 83 83 1.1
Lyu et al. (2018a) 85.8 81.2 83.4 4.8
Deng et al. (2018) 85.5 82.0 83.7 3.0
Zhang et al. (2020) 88.53 84.69 86.56 –
Wang et al. (2019a) 86.92 84.50 85.69 1.6
Tian et al. (2019) 88.3 85.0 86.6 3
Baek et al. (2019b) 89.8 84.3 86.9 8.6
Zhang et al. (2019) 83.5 91.3 87.2 –
Qin et al. (2019) 89.36 85.75 87.52 4.76
Wang et al. (2019b) 89.2 86.0 87.6 10.0
Xing et al. (2019) 88.30 91.15 89.70 –

Table 5 Detection and end-to-end on Total-Text
Method Detection-P Detection-R Detection-F E2E
Lyu et al. (2018a) 69.0 55.0 61.3 52.9
Long et al. (2018) 82.7 74.5 78.4 –
Wang et al. (2019b) 80.9 76.2 78.5 –
Wang et al. (2019a) 84.02 77.96 80.87 –
Zhang et al. (2019) 75.7 88.6 81.6 –
Baek et al. (2019b) 87.6 79.9 83.6 –
Qin et al. (2019) 83.3 83.4 83.3 67.8
Xing et al. (2019) 81.0 88.6 84.6 63.6
Zhang et al. (2020) 86.54 84.93 85.73 –

Table 6 Detection on CTW1500
Method P R F1
Liu et al. (2017) 77.4 69.8 73.4
Long et al. (2018) 67.9 85.3 75.6
Zhang et al. (2019) 69.6 89.2 78.4
Wang et al. (2019b) 80.1 80.2 80.1
Tian et al. (2019) 82.7 77.8 80.1
Wang et al. (2019a) 84.84 79.73 82.2
Baek et al. (2019b) 86.0 81.1 83.5
Zhang et al. (2020) 85.93 83.02 84.45

Table 7 Detection on MSRA-TD 500
Method P R F1
Kang et al. (2014) 71 62 66
Zhang et al. (2016) 83 67 74
He et al. (2017d) 77 70 74
Yao et al. (2016) 76.51 75.31 75.91
Zhou et al. (2017) 87.28 67.43 76.08
Wu and Natarajan (2017) 77 78 77
Shi et al. (2017a) 86 70 77
Deng et al. (2018) 83.0 73.2 77.8
Long et al. (2018) 83.2 73.9 78.3
Xue et al. (2018) 83.0 77.4 80.1
Wang et al. (2018) 90.3 72.3 80.3
Lyu et al. (2018b) 87.6 76.2 81.5
Baek et al. (2019b) 88.2 78.2 82.9
Tian et al. (2019) 84.2 81.7 82.9
Liu et al. (2018e) 88 79 83
Wang et al. (2019b) 85.2 82.1 83.6
Zhang et al. (2020) 88.05 82.30 85.08
Table 8 Characteristics of the three vocabulary lists used in ICDAR 2013/2015
Vocab list Description
S Per-image list of 100 words including all words in the image
W All words in the entire test set
G A 90k-word generic vocabulary
S stands for Strongly Contextualised, W for Weakly Contextualised, and G for Generic

5 Application

The detection and recognition of text—the visual and physical carrier of human civilization—allow the connection between vision and the understanding of its content further. Apart from the applications we have mentioned at the beginning of this paper, there have been numerous specific application scenarios across various industries and in our daily lives. In this part, we list and analyze the most outstanding ones that have, or are to have, significant impact, improving our productivity and life quality.

Automatic Data Entry Apart from an electronic archive of existing documents, OCR can also improve our productivity in the form of automatic data entry. Some industries involve time-consuming data type-in, e.g. express orders written by customers in the delivery industry, and hand-written information sheets in the financial and insurance industries. Applying OCR techniques can accelerate the data entry process as well as protect customer privacy. Some companies have already been using these technologies, e.g. SF-Express.3 Another potential application is note taking, such as NEBO,4 a note-taking software on tablets like iPad that performs instant transcription as users write down notes.

Identity Authentication Automatic identity authentication is yet another field where OCR can be put to full use. In fields such as Internet finance and Customs, users/passengers are required to provide identification (ID) information, such as an identity card or passport. Automatic recognition and analysis of the provided documents would require OCR that reads and extracts the textual content, and can automate and greatly accelerate such processes. There are companies that have already started working on identification based on face and ID card, e.g. MEGVII (Face++).5

Augmented Computer Vision As text is an essential element for the understanding of a scene, OCR can assist computer vision in many ways. In the scenario of autonomous vehicles, text-embedded panels carry important information, e.g. geo-location, current traffic conditions, navigation, etc. There have been several works on text detection and recognition for autonomous vehicles (Mammeri et al. 2014, 2016). The largest dataset so far, CTW (Yuan et al. 2018), also places extra emphasis on traffic signs. Another example is instant translation, where OCR is combined with a translation model. This is extremely helpful and time-saving as people travel or read documents written in foreign languages. Google's Translate application6 can perform such instant translation. A similar application is instant text-to-speech software equipped with OCR, which can help those with visual disability and those who are illiterate.7

Intelligent Content Analysis OCR also allows the industry to perform more intelligent analysis, mainly for platforms like video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as real-time commentary subtitles (a kind of floating comments added by users, e.g. those in Bilibili8 and Niconico9). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems. It can also be used to perform user sentiment analysis, e.g. which part of the video attracts the users most. On the other hand, website administrators can impose supervision and filtration of inappropriate and illegal content, such as terrorist advocacy.

6 Conclusion and Discussion

6.1 Status Quo

Algorithms The past several years have witnessed the significant development of algorithms for text detection and recognition, mainly due to the deep learning boom. Deep learning models have replaced the manual search for and design of patterns and features. With the improved capability of models, research attention has been drawn to challenges such as oriented and curved text detection, and considerable progress has been achieved.

Applications Apart from efforts towards a general solution to all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcard, ID card, and driver's license. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc., and MEGVII Inc. Recent development of fast and efficient methods (Ren et al. 2015; Zhou et al. 2017) has also allowed the deployment of large-scale systems (Borisyuk et al. 2018). Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.

3 Official website: http://www.sf-express.com/cn/sc/.
4 Official website: https://www.myscript.com/nebo/.
5 https://www.faceplusplus.com/face-based-identification/.
6 https://translate.google.com/.
7 https://en.wikipedia.org/wiki/Screen_reader#cite_note-Braille_display-2.
8 https://www.bilibili.com.
9 www.nicovideo.jp/.
Table 9 State-of-the-art recognition performance across a number of datasets
Methods ConvNet, Data | IIIT5k: 50, 1k, 0 | SVT: 50, 0 | IC03: 50, Full, 0 | IC13: 0 | IC15: 0 | SVTP: 0 | CUTE: 0 | Total-text: 0
Yao et al. (2014) – 80.2 69.3 – 75.9 – 88.5 80.3 – – – – – –


Jaderberg et al. (2014c) – – – – 86.1 – 96.2 91.5 – – – – – –
Su and Lu (2014) – – – – 83.0 – 92.0 82.0 – – – – – –
Gordo (2015) – 93.3 86.6 – 91.8 – – – – – – – – –
Jaderberg et al. (2016) VGG, 90k 97.1 92.7 – 95.4 80.7 98.7 98.6 93.1 90.8 – – – –
Shi et al. (2017b) VGG, 90k 97.8 95.0 81.2 97.5 82.7 98.7 98.0 91.9 89.6 – – – –
Shi et al. (2016) VGG, 90k 96.2 93.8 81.9 95.5 81.9 98.3 96.2 90.1 88.6 – 71.8 59.2 –
Lee and Osindero (2016) VGG, 90k 96.8 94.4 78.4 96.3 80.7 97.9 97.0 88.7 90.0 – – – –
Yang et al. (2017) VGG, Private 97.8 96.1 – 95.2 – 97.7 – – – – 75.8 69.3 –
Cheng et al. (2017a) ResNet, 90k +ST+ 99.3 97.5 87.4 97.1 85.9 99.2 97.3 94.2 93.3 70.6 – – –
Shi et al. (2018) ResNet, 90k + ST 99.6 98.8 93.4 99.2 93.6 98.8 98.0 94.5 91.8 76.1 78.5 79.5 –
Liao et al. (2019b) ResNet, ST+ + Private 99.8 98.8 91.9 98.8 86.4 – – – 91.5 – – 79.9 –
Li et al. (2019) ResNet, 90k + ST + Private – – 91.5 – 84.5 – – – 91.0 69.2 76.4 83.3 –
Zhan and Lu (2019) ResNet, 90k + ST 99.6 98.8 93.3 97.4 90.2 – – – 91.3 76.9 79.6 83.3 –
Yang et al. (2019) ResNet, 90k + ST 99.5 98.8 94.4 97.2 88.9 99.0 98.3 95.0 93.9 78.7 80.8 87.5 –
Long et al. (2019) ResNet, 90k + Curved ST – – 94.8 – 89.6 – – 95.8 92.8 78.2 81.6 89.6 76.3
Yu et al. (2020) ResNet, 90k + ST – – 94.8 – 91.5 – – – 95.5 82.7 85.1 87.8 –

“50”, “1k”, “Full” are lexicons. “0” means no lexicon. “90k” and “ST” are the Synth90k and the SynthText datasets, respectively. “ST+ ” means including character-level annotations. “Private”
means private training data
Table 10 Performance of end-to-end and word spotting on ICDAR 2015 and ICDAR 2013
Method Word spotting (S W G) End-to-end (S W G)
ICDAR 2015
Liu et al. (2018c) 84.68 79.32 63.29 81.09 75.90 60.80
Xing et al. (2019) – – – 80.14 74.45 62.18
Lyu et al. (2018a) 79.3 74.5 64.2 79.3 73.0 62.4
He et al. (2018) 85 80 65 82 77 63
Qin et al. (2019) – – – 83.38 79.94 67.98
ICDAR 2013
Busta et al. (2017) 92 89 81 89 86 77
Liu et al. (2018c) 92.73 90.72 83.51 88.81 87.11 80.81
Li et al. (2017a) 94.2 92.4 88.2 91.1 89.8 84.6
He et al. (2018) 93 92 87 91 89 86
Lyu et al. (2018a) 92.5 92.0 88.2 92.2 91.1 86.5

6.2 Challenges and Future Trends

We look at the present through a rear-view mirror. We march backward into the future (Liu 1975). We list and discuss challenges, and analyze what would be the next valuable research directions in the field of scene text detection and recognition.

Languages There are more than 1000 languages in the world.10 However, most current algorithms and datasets have primarily focused on text in English. While English has a rather small alphabet, other languages such as Chinese and Japanese have a much larger one, with tens of thousands of symbols. RNN-based recognizers may suffer from such enlarged symbol sets. Moreover, some languages have much more complex appearances, and they are therefore more sensitive to conditions such as image quality. Researchers should first verify how well current algorithms can generalize to text of other languages and further to mixed text. Unified detection and recognition systems for multiple types of languages are of important academic value and application prospect. A feasible solution might be to explore compositional representations that can capture the common patterns of text instances of different languages, and train the detection and recognition models with text examples of different languages, which are generated by text synthesizing engines.

Robustness of Models Although current text recognizers have proven able to generalize well to different scene text datasets even when trained only on synthetic data, recent work (Liao et al. 2019b) shows that robustness against flawed detection is not a negligible problem. Actually, such instability in prediction has also been observed for text detection models. The reason behind this kind of phenomenon is still unclear. One conjecture is that the robustness of models is related to the internal operating mechanism of deep neural networks.

Fig. 11 Progress of scene text detection over the past few years (evaluated as F1 scores)
Fig. 12 Progress of scene text recognition over the past few years (evaluated as word-level accuracy)

10 https://www.ethnologue.com/guides/how-many-languages.
Generalization Few detection algorithms except for TextSnake (Long et al. 2018) have considered the problem of generalization ability across datasets, i.e. training on one dataset and testing on another. Generalization ability is important as some application scenarios would require adaptability to varying environments. For example, instant translation and OCR in autonomous vehicles should be able to perform stably under different situations: zoomed-in images with large text instances, far and small words, blurred words, different languages, and shapes. It remains unverified whether simply pooling all existing datasets together is enough, especially when the target domain is totally unknown.

Evaluation Existing evaluation metrics for detection stem from those for general object detection. Matching based on IoU score or pixel-level precision and recall ignores the fact that missing parts and superfluous backgrounds may hurt the performance of the subsequent recognition procedure. For each text instance, pixel-level precision and recall are good metrics. However, their scores are assigned to 1.0 once they are matched to ground truth, and thus are not reflected in the final dataset-level score. An off-the-shelf alternative method is to simply sum up the instance-level scores under DetEval instead of first assigning them to 1.0.

Synthetic Data While training recognizers on synthetic datasets has become a routine and results are excellent, detectors still rely heavily on real datasets. It remains a challenge to synthesize diverse and realistic images to train detectors. Potential benefits of synthetic data are not yet fully explored, such as generalization ability. Synthesis using 3D engines and models can simulate different conditions such as lighting and occlusion, and thus is worth further development.

Efficiency Another shortcoming of deep-learning-based methods lies in their efficiency. Most of the current systems cannot run in real time when deployed on computers without GPUs or on mobile devices. Apart from model compression and lightweight models that have proven effective in other tasks, it is also valuable to study how to design custom speedup mechanisms for text-related tasks.

Bigger and Better Datasets The sizes of most widely adopted datasets are small (∼ 1k images). It will be worthwhile to study whether the improvements gained from current algorithms can scale up or whether they are just accidental results of better regularization. Besides, most datasets are only labeled with bounding boxes and texts. Detailed annotation of different attributes (Yuan et al. 2018) such as word-art and occlusion may guide researchers with pertinence. Finally, datasets characterized by real-world challenges are also important in advancing research progress, such as densely located text on products. Another related problem is that most of the existing datasets do not have validation sets. It is highly possible that the current reported evaluation results are actually upward biased due to overfitting on the test sets. We suggest that researchers should focus on large datasets, such as ICDAR MLT 2017, ICDAR MLT 2019, ICDAR ArT 2019, and COCO-Text.

References

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.
Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., et al. (2019a). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE international conference on computer vision (pp. 4715–4723).
Baek, Y., Lee, B., Han, D., Yun, S., & Lee, H. (2019b). Character region awareness for text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 9365–9374).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Bai, F., Cheng, Z., Niu, Y., Pu, S., & Zhou, S. (2018). Edit probability for scene text recognition. In CVPR 2018.
Bartz, C., Yang, H., & Meinel, C. (2017). See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404.
Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE international conference on computer vision (pp. 785–792).
Borisyuk, F., Gordo, A., & Sivakumar, V. (2018). Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 71–79). ACM.
Busta, M., Neumann, L., & Matas, J. (2015). Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 1206–1214).
Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of ICCV.
Chen, X., Yang, J., Zhang, J., & Waibel, A. (2004). Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1), 87–99.
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., & Zhou, S. (2017a). Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE international conference on computer vision (ICCV) (pp. 5086–5094). IEEE.
Cheng, Z., Liu, X., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2017b). Arbitrarily-oriented text recognition. In CVPR2018.
Ch'ng, C. K., & Chan, C. S. (2017). Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 935–942). IEEE.
Chowdhury, M. A., & Deb, K. (2013). Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 18–22.
Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., et al. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 440–445). IEEE.
Dai, Y., Huang, Z., Gao, Y., & Chen, K. (2017). Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272.


Dalal, N., & Triggs, B., (2005). Histograms of oriented gradients for wild. In 2017 IEEE conference on computer vision and pattern
human detection. In IEEE computer society conference on com- recognition (CVPR) (pp. 474–483). IEEE.
puter vision and pattern recognition (CVPR) (Vol. 1, pp. 886–893). He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017b). Mask R-CNN.
IEEE. In 2017 IEEE international conference on computer vision (ICCV)
Deng, D., Liu, H., Li, X., & Cai, D. (2018). Pixellink: Detecting scene (pp. 2980–2988). IEEE.
text via instance segmentation. Proceedings of AAA, I, 2018. He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., & Li, X. (2017c). Single shot
DeSouza, G. N., & Kak, A. C. (2002). Vision for mobile robot nav- text detector with regional attention. In The IEEE international
igation: A survey. IEEE Transactions on Pattern Analysis and conference on computer vision (ICCV).
Machine Intelligence, 24(2), 237–267. He, P., Huang, W., Qiao, Y., Loy, C. C., & Tang, X. (2016). Read-
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyra- ing scene text in deep convolutional sequences. In Thirtieth AAAI
mids for object detection. IEEE Transactions on Pattern Analysis conference on artificial intelligence.
and Machine Intelligence, 36(8), 1532–1545. He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018).
Dvorin, Y., & Havosha, U. E. (2009). Method and device for instant An end-to-end textspotter with explicit alignment and attention.
translation, June 4. US Patent App. 11/998,931. In Proceedings of the IEEE conference on computer vision and
Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting text in natural pattern recognition (CVPR) (pp. 5020–5029).
scenes with stroke width transform. In 2010 IEEE conference on He, W., Zhang, X.-Y., Yin, F., & Liu, C.-L. (2017d). Deep direct
computer vision and pattern recognition (CVPR) (pp. 2963–2970). regression for multi-oriented scene text detection. In The IEEE
IEEE. international conference on computer vision (ICCV).
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, He, Z., Liu, J., Ma, H., & Li, P. (2005). A new automatic extraction
J., & Zisserman, A. (2015). The pascal visual object classes chal- method of container identity codes. IEEE Transactions on Intelli-
lenge: A retrospective. International Journal of Computer Vision, gent Transportation Systems, 6(1), 72–78.
111(1), 98–136. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures Neural Computation, 9(8), 1735–1780.
for object recognition. International Journal of Computer Vision, Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., & Ding, E. (2017).
61(1), 55–79. Wordsup: Exploiting word annotations for character based text
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). detection. In Proceedings of the IEEE international conference on
DSSD: Deconvolutional single shot detector. arXiv preprint computer vision. 2017.
arXiv:1701.06659. Huang, W., Lin, Z., Yang, J., & Wang, J. (2013). Text localization in
Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text natural images using stroke feature transform and text covariance
with attention convolutional sequence modeling. arXiv preprint descriptors. In Proceedings of the IEEE international conference
arXiv:1709.04303. on computer vision (pp. 1241–1248).
Girshick, R. (2015). Fast R-CNN. In The IEEE international conference Huang, W., Qiao, Y., & Tang, X. (2014). Robust scene text detection with
on computer vision (ICCV). convolution neural network induced MSER trees. In European
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature conference on computer vision (pp. 497–511). Springer.
hierarchies for accurate object detection and semantic segmenta- Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014a).
tion. In Proceedings of the IEEE conference on computer vision Deep structured output learning for unconstrained text recognition.
and pattern recognition (CVPR) (pp. 580–587). In ICLR2015.
Goldberg, A. V. (1997). An efficient implementation of a scaling Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014b).
minimum-cost flow algorithm. Journal of Algorithms, 22(1), 1– Synthetic data and artificial neural networks for natural scene text
29. recognition. arXiv preprint arXiv:1406.2227.
Gordo, A. (2015). Supervised mid-level features for word image rep- Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016).
resentation. In Proceedings of the IEEE conference on computer Reading text in the wild with convolutional neural networks. Inter-
vision and pattern recognition (CVPR) (pp. 2956–2964). national Journal of Computer Vision, 116(1), 1–20.
Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Jaderberg, M., Simonyan, K., Zisserman, A. et al. (2015). Spatial trans-
Connectionist temporal classification: Labelling unsegmented former networks. In Advances in neural information processing
sequence data with recurrent neural networks. In Proceedings of systems (pp. 2017–2025).
the 23rd international conference on machine learning (pp. 369– Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014c). Deep features
376). ACM. for text spotting. In In Proceedings of European conference on
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., & Fernández, computer vision (ECCV) (pp. 512–528). Springer.
S. (2008). Unconstrained on-line handwriting recognition with Jain, A. K., & Yu, B. (1998). Automatic text location in images and
recurrent neural networks. In Advances in neural information pro- video frames. Pattern Recognition, 31(12), 2055–2076.
cessing systems (pp. 577–584). Jiang, Y., Zhu, X., Wang, X., Yang, S., Li, W., Wang, H., Fu, P., & Luo,
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for Z. (2017). R2CNN: rotational region CNN for orientation robust
text localisation in natural images. In Proceedings of the IEEE scene text detection. arXiv preprint arXiv:1706.09579.
conference on computer vision and pattern recognition (CVPR) Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in
(pp. 2315–2324). images and video: A survey. Pattern Recognition, 37(5), 977–997.
Ham, Y. K., Kang, M. S., Chung, H. K., Park, R.-H., & Park, G. T. Kang, L., Li, Y., & Doermann, D. (2014). Orientation robust text line
(1995). Recognition of raised characters for automatic classifica- detection in natural images. In Proceedings of the IEEE conference
tion of rubber tires. Optical Engineering, 34(1), 102–110. on computer vision and pattern recognition (CVPR) (pp. 4034–
Han, J., Zhang, D., Cheng, G., Liu, N., & Xu, D. (2018). Advanced 4041).
deep-learning techniques for salient and category-specific object Karatzas, D., & Antonacopoulos, A. (2004). Text extraction from web
detection: A survey. IEEE Signal Processing Magazine, 35(1), 84– images based on a split-and-merge segmentation method using
100. colour perception. In Proceedings of the 17th international con-
He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A. G., Kifer, D., & ference on pattern recognition, 2004. ICPR 2004 (Vol. 2, pp.
Giles, C. L. (2017a). Multi-scale FCN with cascaded instance 634–637). IEEE.
aware segmentation for arbitrary oriented word spotting in the


Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, Liu, X. (1975). Old book of tang. Beijing: Zhonghua Book Company.
A., Iwamura, M., et al. (2015). ICDAR 2015 competition on robust Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018c). FOTS:
reading. In 2015 13th international conference on document anal- Fast oriented text spotting with a unified network. In CVPR2018.
ysis and recognition (ICDAR) (pp. 1156–1160). IEEE. Liu, X., & Samarabandu, J. (2005a). An edge-based text region extrac-
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., tion algorithm for indoor mobile robot navigation. In 2005 IEEE
Mestre, S. R., et al. (2013). ICDAR 2013 robust reading competi- international conference mechatronics and automation (Vol. 2, pp.
tion. In 2013 12th international conference on document analysis 701–706). IEEE.
and recognition (ICDAR) (pp. 1484–1493). IEEE. Liu, X., & Samarabandu, J. K. (2005b). A simple and fast text local-
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with ization algorithm for indoor mobile robot navigation. In Image
graph convolutional networks. arXiv preprint arXiv:1609.02907. processing: Algorithms and systems IV (Vol. 5672, pp. 139–151).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classi- International Society for Optics and Photonics.
fication with deep convolutional neural networks. In Advances in Liu, Y., & Jin, L. (2017). Deep matching prior network: Toward tighter
neural information processing systems (pp. 1097–1105). multi-oriented text detection.
Lee, C.-Y., & Osindero, S. (2016). Recursive recurrent nets with atten- Liu, Y., Jin, L., Xie, Z., Luo, C., Zhang, S., & Xie, L. (2019). Tightness-
tion modeling for OCR in the wild. In Proceedings of the IEEE aware evaluation protocol for scene text detection. In Proceedings
conference on computer vision and pattern recognition (CVPR) of the IEEE conference on computer vision and pattern recognition
(pp. 2231–2239). (pp. 9612–9620).
Lee, J.-J, Lee, P.-H., Lee, S.-W., Yuille, A., & Koch, C. (2011). Adaboost Liu, Y., Jin, L., Zhang, S., & Zhang, S. (2017). Detecting curve
for text detection in natural scene. In 2011 international conference text in the wild: New dataset and new solution. arXiv preprint
on document analysis and recognition (ICDAR) (pp. 429–434). arXiv:1712.02170.
IEEE. Liu, Z., Li, Y., Ren, F., Yu, H., & Goh, W. (2018d). Squeezedtext: A
Lee, S., & Kim, J. H. (2013). Integrating multiple character proposals for real-time scene text recognition by binary convolutional encoder–
robust scene text extraction. Image and Vision Computing, 31(11), decoder network. In AAAI.
823–840. Liu, Z., Lin, G., Yang, S., Feng, J., Lin, W., & Goh, W. L. (2018e).
Li, H., Wang, P., & Shen, C. (2017a). Towards end-to-end text spot- Learning Markov clustering networks for scene text detection. In
ting with convolutional recurrent neural networks. In The IEEE Proceedings of the IEEE conference on computer vision and pat-
international conference on computer vision (ICCV). tern recognition (CVPR) (pp. 6936–6944).
Li, H., Wang, P., Shen, C., & Zhang, G. (2019). Show, attend and read: A Long, S., Guan, Y., Bian, K., & Yao, C. (2020). A new perspective for
simple and strong baseline for irregular text recognition. In AAAI. flexible feature gathering in scene text recognition via character
Li, R., En, M., Li, J., & Zhang, H. (2017b). Weakly supervised text anchor pooling. In ICASSP 2020—2020 IEEE international con-
attention network for generating text proposals in scene images. ference on acoustics, speech and signal processing (ICASSP) (pp.
In 2017 14th IAPR international conference on document analysis 2458–2462. IEEE.
and recognition (ICDAR) (Vol. 1, pp. 324–330). IEEE. Long, S., Guan, Y., Wang, B., Bian, K., & Yao, C. (2019). Alchemy:
Liao, M., Shi, B., & Bai, X. (2018a). Textboxes++: A single-shot ori- Techniques for rectification based irregular scene text recognition.
ented scene text detector. IEEE Transactions on Image Processing, arXiv preprint arXiv:1908.11834.
27(8), 3676–3690. Long, S., Ruan, J., Zhang, W., He, X., Wu, W., & Yao, C. (2018).
Liao, M., Shi, B., Bai, X., Wang, X., & Liu, W. (2017). Textboxes: A Textsnake: A flexible representation for detecting text of arbitrary
fast text detector with a single deep neural network. In AAAI (pp. shapes. In Proceedings of European conference on computer vision
4161–4167). (ECCV).
Liao, M., Song, B., He, M., Long, S., Yao, C., & Bai, X. (2019a). Syn- Long, S., & Yao, C. (2020). Unrealtext: Synthesizing realistic scene text
thtext3d: Synthesizing scene text images from 3d virtual worlds. images from the unreal world. arXiv preprint arXiv:2003.10608.
arXiv preprint arXiv:1907.06007. Lyu, P., Liao, M., Yao, C., Wu, W., & Bai, X. (2018a). Mask textspotter:
Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., Yao, C., & Bai, X. An end-to-end trainable neural network for spotting text with arbi-
(2019b). Scene text recognition from two-dimensional perspective. trary shapes. In Proceedings of European conference on computer
In AAAI. vision (ECCV).
Liao, M., Zhu, Z., Shi, B., Xia, G.-S., & Bai, X. (2018b). Rotation- Lyu, P., Yao, C., Wu, W., Yan, S., & Bai, X. (2018b). Multi-oriented
sensitive regression for oriented scene text detection. In Pro- scene text detection via corner localization and region segmen-
ceedings of the IEEE conference on computer vision and pattern tation. In 2018 IEEE conference on computer vision and pattern
recognition (CVPR) (pp. 5909–5918). recognition (CVPR).
Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., et al.
depth estimation from a single image. In Proceedings of the IEEE (2018). Arbitrary-oriented scene text detection via rotation pro-
conference on computer vision and pattern recognition (CVPR) posals. IEEE Transactions on Multimedia, 20, 3111–3122.
(pp. 5162–5170). Mammeri, A., & Boukerche, A. et al. (2016). MSER-based text detec-
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäi- tion and communication algorithm for autonomous vehicles. In
nen, M. (2018a). Deep learning for generic object detection: A 2016 IEEE symposium on computers and communication (ISCC)
survey. arXiv preprint arXiv:1809.02165. (pp. 1218–1223). IEEE.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Mammeri, A., Khiari, E.-H., & Boukerche, A. (2014). Road-sign text
Berg, A. C. (2016a). SSD: Single shot multibox detector. In In recognition architecture for intelligent transportation systems. In
Proceedings of European conference on computer vision (ECCV) 2014 IEEE 80th vehicular technology conference (VTC Fall) (pp.
(pp. 21–37). Springer. 1–5). IEEE.
Liu, W., Chen, C., & Wong, K. (2018b). Char-net: A character-aware Mishra, A., Alahari, K., & Jawahar, C. (2011). An MRF model for bina-
neural network for distorted scene text recognition. In AAAI con- rization of natural scene text. In ICDAR-international conference
ference on artificial intelligence, New Orleans, Louisiana, USA. on document analysis and recognition. IEEE.
Liu, W., Chen, C., Wong, K.-Y. K., Su, Z., & Han, J. (2016b). Star-net: Mishra, A., Alahari, K., & Jawahar, C. (2012). Scene text recogni-
A spatial attention residue network for scene text recognition. In tion using higher order language priors. In BMVC-British machine
BMVC (Vol. 2, p. 7). vision conference. BMVA.


Neumann, L., & Matas, J. (2010). A method for text localization and to scene text recognition. IEEE Transactions on Pattern Analysis
recognition in real-world images. In Asian conference on computer and Machine Intelligence, 39(11), 2298–2304.
vision (pp. 770–783). Springer. Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene
Neumann, L., & Matas, J. (2012). Real-time scene text localization text recognition with automatic rectification. In Proceedings of
and recognition. In 2012 IEEE conference on computer vision and the IEEE conference on computer vision and pattern recognition
pattern recognition (CVPR) (pp. 3538–3545). IEEE. (CVPR) (pp. 4168–4176).
Neumann, L., & Matas, J. (2013). On combining multiple segmentations Shi, B., Yang, M., Wang, X., Lyu, P., Bai, X., & Yao, C. (2018). Aster:
in scene text recognition. In 2013 12th international conference on An attentional scene text recognizer with flexible rectification.
document analysis and recognition (ICDAR) (pp. 523–527). IEEE. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Nomura, S., Yamanaka, K., Katai, O., Kawakami, H., & Shiose, T. 31(11), 855–868.
(2005). A novel adaptive morphological approach for degraded Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., & Zhang, Z. (2013).
character image segmentation. Pattern Recognition, 38(11), 1961– Scene text recognition using part-based tree-structured character
1975. detection. In 2013 IEEE conference on computer vision and pattern
Parkinson, C., Jacobsen, J. J., Ferguson, D. B., & Pombo, S. A. (2016). recognition (CVPR) (pp. 2961–2968). IEEE.
Instant translation system, Nov. 29. US Patent 9,507,772. Shivakumara, P., Bhowmick, S., Su, B., Tan, C. L., & Pal, U. (2011).
Qin, S., Bissacco, A., Raptis, M., Fujii, Y., & Xiao, Y. (2019). Towards A new gradient based character segmentation method for video
unconstrained end-to-end text spotting. In Proceedings of the IEEE text recognition. In 2011 international conference on document
international conference on computer vision (pp. 4704–4714). analysis and recognition (ICDAR). IEEE.
Qiu, W., Zhong, F., Zhang, Y., Qiao, S., Xiao, Z., Kim, T. S., & Wang, Su, B., & Lu, S. (2014). Accurate scene text recognition based on recur-
Y. (2017). Unrealcv: Virtual worlds for computer vision. In Pro- rent neural network. In Asian conference on computer vision (pp.
ceedings of the 25th ACM international conference on multimedia 35–48). Springer.
(pp. 1221–1224). ACM. Sun, Y., Liu, J., Liu, W., Han, J., Ding, E., & Liu, J. (2019). Chinese
Phan, T. Q., Shivakumara, P., Tian, S., & Tan, C. L. (2013). Recognizing street view text: Large-scale Chinese text reading with partially
text with perspective distortion in natural scenes. In Proceedings supervised learning. In Proceedings of the IEEE international con-
of the IEEE international conference on computer vision (ICCV) ference on computer vision (pp. 9086–9095).
(pp. 569–576). Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence
Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. learning with neural networks. In Advances in neural information
arXiv preprint. processing systems (pp. 3104–3112).
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only Tian, S., Pan, Y., Huang, C., Lu, S., Yu, K., & Tan, C. L. (2015). Text
look once: Unified, real-time object detection. In Proceedings of flow: A unified text detection system in natural scene images. In
the IEEE conference on computer vision and pattern recognition Proceedings of the IEEE international conference on computer
(CVPR) (pp. 779–788). vision (pp. 4651–4659).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Tian, S., Lu, S., & Li, C. (2017). Wetext: Scene text detection under
Towards real-time object detection with region proposal networks. weak supervision. In Proceedings of ICCV.
In Advances in neural information processing systems (pp. 91–99). Tian, Z. Huang, W., He, T., He, P., & Qiao, Y. (2016). Detecting text
Rodriguez-Serrano, J. A., Gordo, A., & Perronnin, F. (2015). Label in natural image with connectionist text proposal network. In In
embedding: A frugal baseline for text recognition. International Proceedings of European conference on computer vision (ECCV)
Journal of Computer Vision, 113(3), 193–207. (pp. 56–72). Springer.
Rodriguez-Serrano, J. A., Perronnin, F., & Meylan, F. (2013). Label Tian, Z., Shu, M., Lyu, P., Li, R., Zhou, C., Shen, X., & Jia, J. (2019).
embedding for text recognition. In Proceedings of the British Learning shape-aware embedding for scene text detection. In Pro-
machine vision conference. Citeseer. ceedings of the IEEE conference on computer vision and pattern
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional recognition (pp. 4234–4243).
networks for biomedical image segmentation. Berlin: Springer. Tsai, S. S., Chen, H., Chen, D., Schroth, G., Grzeszczuk, R., & Girod,
Roy, P. P., Pal, U., Llados, J., & Delalandre, M. (2009). Multi-oriented B. (2011). Mobile visual search on printed documents using text
and multi-sized touching character segmentation using dynamic and low bit-rate features. In 18th IEEE international conference
programming. In 10th international conference on document anal- on image processing (ICIP) (pp. 2601–2604). IEEE.
ysis and recognition, 2009. IEEE. Tu, Z., Ma, Y., Liu, W., Bai, X., & Yao, C. (2012). Detecting texts
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., of arbitrary orientations in natural images. In 2012 IEEE confer-
et al. (2015). Imagenet large scale visual recognition challenge. ence on computer vision and pattern recognition (pp. 1083–1090).
International Journal of Computer Vision, 115(3), 211–252. IEEE.
Schroth, G., Hilsenbeck, S., Huitl, R., Schweiger, F., & Steinbach, E. Uchida, S. (2014). Text localization and recognition in images and
(2011). Exploiting text-related features for content-based image video. In Handbook of document image processing and recog-
retrieval. In 2011 IEEE international symposium on multimedia nition (pp. 843–883). Springer.
(pp. 77–84). IEEE. Wachenfeld, S., Klein, H.-U., & Jiang, X. (2006). Recognition of
Schulz, R., Talbot, B., Lam, O., Dayoub, F., Corke, P., Upcroft, B., & screen-rendered text. In 18th international conference on pattern
Wyeth, G. (2015). Robot navigation using human cues: A robot recognition, 2006. ICPR 2006 (Vol. 2, pp. 1086–1089). IEEE.
navigation system for symbolic goal-directed exploration. In Pro- Wakahara, T., & Kita, K. (2011). Binarization of color character strings
ceedings of the 2015 IEEE international conference on robotics in scene images using k-means clustering and support vector
and automation (ICRA 2015) (pp. 1100–1105). IEEE. machines. In 2011 international conference on document analysis
Sheshadri, K., & Divvala, S. K. (2012). Exemplar driven character and recognition (ICDAR) (pp. 274–278). IEEE.
recognition in the wild. In BMVC (pp. 1–10). Wang, C., Yin, F., & Liu, C.-L. (2017). Scene text detection with novel
Shi, B., Bai, X., & Belongie, S. (2017a). Detecting oriented text in superpixel based character candidate extraction. In 2017 14th IAPR
natural images by linking segments. In The IEEE conference on international conference on document analysis and recognition
computer vision and pattern recognition (CVPR). (ICDAR) (Vol. 1, pp. 929–934). IEEE.
Shi, B., Bai, X., & Yao, C. (2017b). An end-to-end trainable neural Wang, F., Zhao, L., Li, X., Wang, X., & Tao, D. (2018). Geometry-
network for image-based sequence recognition and its application aware scene text detection with instance transformation network.

