Scene Text Detection and Recognition: The Deep Learning Era

Shangbang Long · Xin He · Cong Yao
https://doi.org/10.1007/s11263-020-01369-0
Received: 14 April 2020 / Accepted: 8 August 2020 / Published online: 27 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inevitably influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology, and performance. This survey is aimed at summarizing and analyzing the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; and (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the remaining grand challenges. We expect that this review paper will serve as a reference book for researchers in this field. Related resources are also collected in our Github repository (https://github.com/Jyouhou/SceneTextPapers).
Keywords Scene text · Optical character recognition · Detection · Recognition · Deep learning · Survey
Communicated by Vittorio Ferrari.

Shangbang Long (corresponding author)
shangbal@cs.cmu.edu
Xin He
hexin7257@gmail.com
Cong Yao
yaocong2010@gmail.com

1 Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
2 ByteDance Ltd, Beijing, China
3 MEGVII Inc. (Face++), Beijing, China

• Diversity and Variability of Text in Natural Scenes Distinctive from scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations, and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.
• Complexity and Interference of Backgrounds The backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g.,
[Figure: evolution of scene text detection pipelines, plotted as pipeline complexity over time; panels a–d show representative pipelines on input images, ranging from multi-step designs (word box proposal extraction, text-region extraction, word box regression) to direct word box regression.]

The evolution of scene text detection algorithms, therefore, undergoes three main stages: (1) In the first stage, learning-based methods are equipped with multi-step pipelines, but these methods are still slow and complicated. (2) Then, the ideas and methods of general object detection are successfully implanted into this task. (3) In the third stage, researchers design special representations based on sub-text components to solve the challenges of long text and irregular text.
Fig. 7 Frameworks of text recognition models, each built on a convolutional feature extractor: a CTC-based decoding, a sequence tagging model that uses CTC for alignment in training and inference; b sequence-to-sequence learning, which can be trained directly with cross-entropy; c character segmentation based methods.
Instead of RNNs, Gao et al. (2017) adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation.
Yin et al. (2017) simultaneously detect and recognize characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled with text transcripts.
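As a concrete illustration of the CTC-based pipeline sketched in Fig. 7a, the following is a minimal training-step sketch built on PyTorch's nn.CTCLoss. The toy convolutional feature extractor, the alphabet size, and the tensor shapes are illustrative assumptions, not the configuration of any particular method discussed above.

```python
import torch
import torch.nn as nn

# Hypothetical alphabet: index 0 is reserved for the CTC "blank" symbol.
NUM_CLASSES = 37  # blank + 26 letters + 10 digits (an assumption, not a fixed standard)

class TinyCTCRecognizer(nn.Module):
    """Toy CTC recognizer: conv feature extractor -> per-frame class scores."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width as the time axis
        )
        self.classifier = nn.Linear(128, NUM_CLASSES)

    def forward(self, images):              # images: (B, 1, 32, W)
        f = self.features(images)           # (B, 128, 1, W')
        f = f.squeeze(2).permute(2, 0, 1)   # (T, B, 128): one frame per horizontal position
        return self.classifier(f).log_softmax(-1)  # (T, B, NUM_CLASSES)

model = TinyCTCRecognizer()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

images = torch.randn(4, 1, 32, 128)                  # dummy batch of cropped word images
targets = torch.randint(1, NUM_CLASSES, (4, 6))      # dummy character ids (no blanks)
log_probs = model(images)                            # (T, B, C)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 6, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

At inference time, a greedy decode simply takes the arg-max class per frame, collapses repeated symbols, and removes blanks.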
3.2.2 Encoder–Decoder Methods

The encoder–decoder framework for sequence-to-sequence learning was originally proposed in Sutskever et al. (2014) for machine translation. The encoder RNN reads an input sequence and passes its final latent state to a decoder RNN, which generates output in an auto-regressive way. The main advantage of the encoder–decoder framework is that it gives outputs of variable lengths, which satisfies the task setting of scene text recognition. The encoder–decoder framework is usually combined with the attention mechanism (Bahdanau et al. 2014), which jointly learns to align the input sequence and the output sequence.
Lee and Osindero (2016) present recursive recurrent neural networks with attention modeling for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features and then decodes them to output characters by recurrent neural networks with implicitly learned character-level language statistics. The attention-based mechanism performs soft feature selection for better image feature usage.
Cheng et al. (2017a) observe the attention drift problem in existing attention-based methods and propose to impose localization supervision on the attention scores to attenuate it.
Bai et al. (2018) propose an edit probability (EP) metric to handle the misalignment between the ground truth string and the attention's output sequence of probability distributions. Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximal likelihood loss, EP tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing or superfluous characters.
Liu et al. (2018d) propose an efficient attention-based encoder–decoder model, in which the encoder part is trained under binary constraints to reduce computation cost.
Both CTC and the encoder–decoder framework simplify the recognition pipeline and make it possible to train scene text recognizers with only word-level annotations instead of character-level annotations. Compared to CTC, the decoder module of the encoder–decoder framework is an implicit language model, and therefore it can incorporate more linguistic priors. For the same reason, the encoder–decoder framework requires a larger training dataset with a larger vocabulary; otherwise, the model may degenerate when reading words that are unseen during training. On the contrary, CTC is less dependent on language models and has better character-to-pixel alignment. Therefore, it is potentially better suited to languages such as Chinese and Japanese that have a large character set. The main drawback of these two methods is that they assume the text to be straight, and therefore they cannot adapt to irregular text.
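The sketch below illustrates the attention-based decoding loop in the spirit of Bahdanau et al. (2014), applied to a sequence of visual feature frames. The feature dimensionality, vocabulary size, start-symbol convention, and single-layer GRU decoder are illustrative assumptions rather than the configuration of any specific recognizer cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoder(nn.Module):
    """Toy additive-attention decoder over a sequence of encoder feature frames."""
    def __init__(self, feat_dim=256, hidden=256, vocab=40, bos_id=0):
        super().__init__()
        self.bos_id = bos_id
        self.embed = nn.Embedding(vocab, hidden)
        self.enc_proj = nn.Linear(feat_dim, hidden, bias=False)
        self.dec_proj = nn.Linear(hidden, hidden, bias=False)
        self.score = nn.Linear(hidden, 1, bias=False)
        self.rnn = nn.GRUCell(hidden + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc_feats, targets):
        # enc_feats: (B, T, feat_dim) feature frames; targets: (B, L) character ids
        B, L = targets.shape
        h = enc_feats.new_zeros(B, self.rnn.hidden_size)
        proj = self.enc_proj(enc_feats)                    # (B, T, H)
        prev = targets.new_full((B,), self.bos_id)         # assumed start symbol
        logits = []
        for t in range(L):
            # alignment weights over all frames for the current decoding step
            e = self.score(torch.tanh(proj + self.dec_proj(h).unsqueeze(1)))  # (B, T, 1)
            alpha = e.softmax(dim=1)
            context = (alpha * enc_feats).sum(dim=1)       # (B, feat_dim) glimpse
            h = self.rnn(torch.cat([self.embed(prev), context], dim=-1), h)
            logits.append(self.out(h))
            prev = targets[:, t]                           # teacher forcing during training
        return torch.stack(logits, dim=1)                  # (B, L, vocab)

decoder = AttnDecoder()
enc = torch.randn(2, 32, 256)                  # dummy encoder output: 32 frames
tgt = torch.randint(1, 40, (2, 7))             # dummy ground-truth character ids
loss = F.cross_entropy(decoder(enc, tgt).flatten(0, 1), tgt.flatten())
loss.backward()
```

Training with plain cross-entropy, as in the snippet, is what makes the decoder an implicit language model: each step conditions on the previously emitted characters.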
3.2.3 Adaptions for Irregular Text Recognition
Rectification modules are a popular solution to irregular text recognition. Shi et al. (2016, 2018) propose a text recognition system which combines a Spatial Transformer Network (STN) (Jaderberg et al. 2015) and an attention-based Sequence Recognition Network. The STN module predicts text bounding polygons with fully connected layers as input for Thin-Plate-Spline transformations, which rectify the input irregular text image into a more canonical form, i.e. straight text. The rectification proves to be a successful strategy and forms the basis of the winning solution (Long et al. 2019) in the ICDAR 2019 ArT irregular text recognition competition (https://rrc.cvc.uab.es/?ch=14).
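As a simplified illustration of the spatial-transformer idea behind these rectification modules, the sketch below predicts an affine transform from the input image and resamples it with a differentiable grid. The systems cited above predict fiducial points for a Thin-Plate-Spline warp rather than an affine matrix, so the localization network, the affine parameterization, and the image sizes here are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineRectifier(nn.Module):
    """Toy spatial transformer: predict a 2x3 affine matrix and resample the image."""
    def __init__(self):
        super().__init__()
        self.loc_net = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # Initialize the regression layer to output the identity transform.
        self.loc_net[-1].weight.data.zero_()
        self.loc_net[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                       # x: (B, 1, 32, 100) distorted word image
        theta = self.loc_net(x).view(-1, 2, 3)  # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # rectified image

rectifier = AffineRectifier()
rectified = rectifier(torch.randn(4, 1, 32, 100))  # feed this to the recognizer
```

The rectifier and the recognizer are trained jointly; since grid sampling is differentiable, the recognition loss alone supervises the predicted transform.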
There have also been several improved versions of rectification based recognition. Zhan and Lu (2019) propose to perform rectification multiple times to gradually rectify the text. They also replace the text bounding polygons with a polynomial function to represent the shape. Yang et al. (2019) propose to predict local attributes, such as radius and orientation values for pixels inside the text center region, in a similar way to TextSnake (Long et al. 2018). The orientation is defined as the orientation of the underlying character boxes, instead of the text bounding polygons. Based on these attributes, bounding polygons are reconstructed in a way that the perspective distortion of characters is rectified, while the methods by Shi et al. and Zhan et al. may only rectify at the text level and leave the characters distorted.
Yang et al. (2017) introduce an auxiliary dense character detection task to encourage the learning of visual representations that are favorable to the text patterns. They also adopt an alignment loss to regularize the estimated attention at each time step. Further, they use a coordinate map as a second input to enforce spatial awareness.
Cheng et al. (2017b) argue that encoding a text image as a 1-D sequence of features, as implemented in most methods, is not sufficient. They encode an input image into four feature sequences of four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is applied to combine the four feature sequences.
Liu et al. (2018b) present a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, resulting in improved efficiency, and can handle different types of distortion that are hard to model with a single global transformation.
Liao et al. (2019b) cast the task of recognition into semantic segmentation and treat each character type as one class. The method is insensitive to shapes and is thus effective on irregular text, but the lack of end-to-end training and sequence learning makes it prone to single-character errors, especially when the image quality is low. They are also the first to evaluate the robustness of their recognition method by padding and transforming test images.
Another solution to irregular scene text recognition is 2-dimensional attention (Xu et al. 2015), which has been verified in Li et al. (2019). Different from the sequential encoder–decoder framework, the 2D attentional model maintains 2-dimensional encoded features, and attention scores are computed for all spatial locations. Similar to spatial attention, Long et al. (2020) propose to first detect characters. Afterward, features are interpolated and gathered along the character center lines to form sequential feature frames.
In addition to the aforementioned techniques, Qin et al. (2019) show that simply flattening the feature maps from 2-dimensional to 1-dimensional and feeding the resulting sequential features to an RNN-based attentional encoder–decoder model is sufficient to produce state-of-the-art recognition results on irregular text, which is a simple yet efficient solution.
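A minimal sketch of this flattening trick: the 2D feature map is reshaped into a sequence of spatial positions so that an ordinary 1D attentional decoder (such as the one sketched in Sect. 3.2.2) can attend over all locations. The tensor shapes below are assumptions for illustration.

```python
import torch

feats = torch.randn(2, 256, 8, 25)           # (B, C, H, W) conv features of a curved word
B, C, H, W = feats.shape
seq = feats.flatten(2).permute(0, 2, 1)      # (B, H*W, C): one "frame" per spatial location
# `seq` can now be consumed by any 1D attentional encoder-decoder recognizer,
# optionally after adding a 2D positional encoding so location information is preserved.
print(seq.shape)                             # torch.Size([2, 200, 256])
```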
Apart from tailored model designs, Long et al. (2019) synthesize a curved text dataset, which significantly boosts the recognition performance on real-world curved text datasets without sacrificing performance on straight text datasets.
Although many elegant and neat solutions have been proposed, they are only evaluated and compared on a relatively small dataset, CUTE80, which only consists of 288 word samples. Besides, the training datasets used in these works only contain a negligible proportion of irregular text samples. Evaluations on larger datasets and more suitable training datasets may help us understand these methods better.

3.2.4 Other Methods

Jaderberg et al. (2014a, b) perform word recognition by classifying the image into a pre-defined set of vocabulary, under the framework of image classification. The model is trained with synthetic images and achieves state-of-the-art performance on some benchmarks containing English words only. However, the application of this method is quite limited, as it cannot be applied to recognize unseen sequences such as phone numbers and email addresses.
To improve performance on difficult cases such as occlusion, which brings ambiguity to single-character recognition, Yu et al. (2020) propose a transformer-based semantic reasoning module that performs translations from coarse, prone-to-error text outputs of the decoder to fine and linguistically calibrated outputs, which bears some resemblance to the deliberation networks for machine translation (Xia et al. 2017) that first translate and then re-write the sentences.
Despite the progress we have seen so far, the evaluation of recognition methods falls behind the times. As most detection methods can detect oriented and irregular text and some even rectify them, the recognition of such text may seem redundant. On the other hand, the robustness of recognition when text is cropped with a slightly different bounding box is seldom verified. Such robustness may be more important in real-world scenarios.
3.4.1 Synthetic Data

Most deep learning models are data-thirsty. Their performance is guaranteed only when enough data are available. In the field of text detection and recognition, this problem is more urgent since most human-labeled datasets are small, usually containing merely around 1K–2K data instances. Fortunately, there has been work (Jaderberg et al. 2014b; Gupta et al. 2016; Zhan et al. 2018; Liao et al. 2019a) that generates data of relatively high quality, and such data have been widely used for pre-training models for better performance.
Jaderberg et al. (2014b) propose to generate synthetic data for text recognition. Their method blends text with randomly cropped natural images from human-labeled datasets after rendering of font, border/shadow, color, and distortion. The results show that training merely on these synthetic data can achieve state-of-the-art performance and that synthetic data can act as augmentative data sources for all datasets.
SynthText (Gupta et al. 2016) first proposes to embed text in natural scene images for the training of text detection, while most previous work only prints text on a cropped region and such synthetic data are only for text recognition. Printing text on whole natural images poses new challenges, as it needs to maintain semantic coherence. To produce more realistic data, SynthText makes use of depth prediction (Liu et al. 2015) and semantic segmentation (Arbelaez et al. 2011). Semantic segmentation groups pixels together into semantic clusters, and each text instance is printed on one semantic surface, not overlapping multiple ones. A dense depth map is further used to determine the orientation and distortion of the text instance. The model trained only on SynthText achieves state-of-the-art results on many text detection datasets. It is later used in other works (Zhou et al. 2017; Shi et al. 2017a) as well for initial pre-training.
Further, Zhan et al. (2018) equip text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or wall instead of someone's face. Text rendering in their work is adapted to the image so that the words fit into the artistic styles and do not stand out awkwardly.
SynthText3D (Liao et al. 2019a) uses the famous open-source game engine, Unreal Engine 4 (UE4), and UnrealCV (Qiu et al. 2017) to synthesize scene text images. Text is rendered together with the scene and thus can achieve different lighting conditions, weather, and natural occlusions. However, SynthText3D simply follows the pipeline of SynthText and only makes use of the ground-truth depth and segmentation maps provided by the game engine. As a result, SynthText3D relies on manual selection of camera views, which limits its scalability. Besides, the proposed text regions are generated by clipping maximal rectangular bounding boxes extracted from segmentation maps, and therefore are limited to the middle parts of large and well-defined regions, which is an unfavorable location bias.
UnrealText (Long and Yao 2020) is another work using game engines to synthesize scene text images. It features deep interactions with the 3D worlds during synthesis. A ray-casting based algorithm is proposed to navigate the 3D worlds efficiently and is able to generate diverse camera views automatically. The text region proposal module is based on collision detection and can put text onto whole surfaces, thus getting rid of the location bias. UnrealText achieves significant speedup and better detector performance.
Text Editing It is also worthwhile to mention the text editing task that has been proposed recently (Wu et al. 2019; Yang et al. 2020). Both works try to replace the text content while retaining text styles in natural images, such as the spatial arrangement of characters, text fonts, and colors. Text editing per se is useful in applications such as instant translation using cellphone cameras. It also has great potential in augmenting existing scene text images, though we have not seen any relevant experiment results yet.

3.4.2 Weakly and Semi-supervision

Bootstrapping for Character-Box Character-level annotations are more accurate and better. However, most existing datasets do not provide character-level annotations. Since characters are smaller and close to each other, character-level annotation is more costly and inconvenient. There has been some work on semi-supervised character detection. The basic idea is to initialize a character detector and apply rules or thresholds to pick the most reliable predicted candidates. These reliable candidates are then used as additional supervision sources to refine the character detector. Both of them aim to augment existing datasets with character-level annotations. Their difference is illustrated in Fig. 9.
WordSup (Hu et al. 2017) first initializes the character detector by training 5K warm-up iterations on synthetic datasets. For each image, WordSup generates character candidates, which are then filtered with word-boxes. For characters in each word box, the following score is computed to select the most possible character list:

s = w · area(B_chars) / area(B_word) + (1 − w) · (1 − λ_2 / λ_1)    (1)

where B_chars is the union of the selected character boxes; B_word is the enclosing word bounding box; λ_1 and λ_2 are the first- and second-largest eigenvalues of a covariance matrix C, computed from the coordinates of the centers of the selected character boxes; and w is a weight scalar. Intuitively, the first term […]
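A small worked example of Eq. (1): given hypothetical character boxes and their word box, the score combines area coverage with a straightness term derived from the eigenvalues of the character-center covariance matrix. The box format, the union-area approximation, and the weight value below are illustrative assumptions, not details taken from WordSup.

```python
import numpy as np

def wordsup_score(char_boxes, word_box, w=0.5):
    """Score a candidate character list following Eq. (1).

    char_boxes: list of (x1, y1, x2, y2) axis-aligned character boxes (assumed non-overlapping,
                so the area of their union is approximated by the sum of individual areas).
    word_box:   (x1, y1, x2, y2) enclosing word box.
    """
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in char_boxes]
    word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    coverage = sum(areas) / word_area                     # first term: how much of the word is covered

    centers = np.array([[(x1 + x2) / 2, (y1 + y2) / 2] for x1, y1, x2, y2 in char_boxes])
    cov = np.cov(centers, rowvar=False)                   # 2x2 covariance of character centers
    lam1, lam2 = sorted(np.linalg.eigvalsh(cov), reverse=True)
    straightness = 1.0 - lam2 / lam1                      # close to 1 when centers lie on a line

    return w * coverage + (1 - w) * straightness

# Three characters roughly on a line inside a word box.
chars = [(0, 0, 10, 12), (12, 1, 22, 13), (24, 0, 34, 12)]
print(round(wordsup_score(chars, word_box=(0, 0, 34, 13)), 3))
```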
[Fig. 9: frameworks of the two bootstrapping methods. A character detector warmed up on synthetic data produces predictions for models G-0, G-1, …, G-i; the predictions are filtered, (a) by retrieving characters whose score against the ground-truth word-level annotations is above a threshold, and (b) by eigenvalue analysis of the 2×N coordinate matrix of character centers, before being fed back as supervision.]

[…] equals the length of the ground truth word, the character bounding boxes are regarded as correct.
Partial Annotations In order to improve the recognition performance of end-to-end word spotting models on curved text, Qin et al. (2019) propose to use off-the-shelf straight scene text spotting models to annotate a large number of unlabeled images. These images are called partially labeled images, since the off-the-shelf models may omit some words. These partially annotated straight text images prove to boost the performance on irregular text greatly.
Another similar effort is the large dataset proposed by Sun et al. (2019), where each image is only annotated with one dominant text instance. They also design an algorithm to utilize these partially labeled data, which they claim are cheaper to annotate.
Table 1 Public datasets for scene text detection and recognition. Columns: Dataset (year), Image Num (train/val/test), Orientation, Language, Features, Det., Recog.
Fig. 10 Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD 500, ICDAR 2013, ICDAR 2015, ICDAR 2017 MLT, ICDAR 2017 RCTW, and Total-Text
The Chinese Text in the Wild (CTW) dataset (Yuan et al. 2018) contains 32,285 high-resolution street view images, annotated at the character level, including the underlying character type, bounding box, and detailed attributes such as whether word-art is used. The dataset is the largest one to date and the only one that contains detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g. English.
LSVT (Sun et al. 2019) is composed of two datasets. One is fully labeled with word bounding boxes and word content. The other, while much larger, is only annotated with the word content of the dominant text instance. The authors propose to work on such partially labeled data, which are much cheaper.
IIIT 5K-Word (Mishra et al. 2012) is the largest scene text recognition dataset, containing both digital and natural scene images. Its variance in font, color, size, and other noise factors makes it the most challenging one to date.

4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.
As metrics for performance comparison of different algorithms, we usually refer to precision, recall, and F1-score. To compute these performance indicators, the list of predicted text instances is first matched to the ground truth labels. Precision, denoted as P, is calculated as the proportion of predicted text instances that can be matched to ground truth labels. Recall, denoted as R, is the proportion of ground truth labels that have correspondents in the predicted list. The F1-score is then computed as F1 = 2 · P · R / (P + R), taking both precision and recall into account. Note that the matching between the predicted instances and ground truth ones comes first.

4.2.1 Text Detection

There are mainly two different protocols for text detection, the IoU-based PASCAL Eval and the overlap-based DetEval. They differ in the criterion for matching predicted text instances and ground truth ones. In the following part, we use these notations: S_GT is the area of the ground truth bounding box, S_P is the area of the predicted bounding box, S_I is the area of the intersection of the predicted and ground truth bounding boxes, and S_U is the area of their union.

• DetEval DetEval imposes constraints on both precision, i.e. S_I / S_P, and recall, i.e. S_I / S_GT. Only when both are larger than their respective thresholds are the boxes matched together.
• PASCAL (Everingham et al. 2015) The basic idea is that, if the intersection-over-union value, i.e. S_I / S_U, is larger than a designated threshold, the predicted and ground truth boxes are matched together (see the sketch below).
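The following is a minimal sketch of the PASCAL-style protocol described above: greedy one-to-one IoU matching followed by precision, recall, and F1. The greedy matching order, the threshold value, and the axis-aligned box format are simplifying assumptions; official evaluation scripts differ in details.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)          # S_I
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter                      # S_U
    return inter / union if union > 0 else 0.0

def detection_prf(preds, gts, thresh=0.5):
    """Greedy one-to-one matching at an IoU threshold, then precision/recall/F1."""
    matched_gt, tp = set(), 0
    for p in preds:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gts):
            if j not in matched_gt and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

preds = [(0, 0, 10, 10), (20, 20, 30, 32)]
gts = [(1, 1, 10, 11), (50, 50, 60, 60)]
print(detection_prf(preds, gts))   # one true positive -> (0.5, 0.5, 0.5)
```

DetEval differs only in the matching test: instead of one IoU threshold, it thresholds S_I / S_P and S_I / S_GT separately.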
Most works follow either one of the two evaluation protocols, but with small modifications. We only discuss those that are different from the two protocols mentioned above.

• ICDAR-2003/2005 The match score m is calculated in a way similar to IoU. It is defined as the ratio of the area of intersection over that of the minimum rectangular bounding box containing both.
• ICDAR-2011/2013 One major drawback of the evaluation protocol of ICDAR 2003/2005 is that it only considers one-to-one matches. It does not consider
Table 3 Detection on ICDAR MLT 2017. Columns: Method, P, R, F1.
Table 5 Detection and end-to-end results on Total-Text. Columns: Method, Detection, E2E.
Table 8 Characteristics of the three vocabulary lists used in ICDAR 2013/2015

Vocab list   Description
S            Per-image list of 100 words including all words in the image
W            All words in the entire test set
G            A 90k-word generic vocabulary

S stands for Strongly Contextualised, W for Weakly Contextualised, and G for Generic

5 Application

The detection and recognition of text—the visual and physical carrier of human civilization—allow the connection between vision and the understanding of its content further. Apart from the applications we have mentioned at the beginning of this paper, there have been numerous specific application scenarios across various industries and in our daily lives. In this part, we list and analyze the most outstanding ones that have, or are to have, significant impact on improving our productivity and life quality.
Automatic Data Entry Apart from providing an electronic archive of existing documents, OCR can also improve our productivity in the form of automatic data entry. Some industries involve time-consuming data type-in, e.g. express orders written by customers in the delivery industry, and hand-written information sheets in the financial and insurance industries. Applying OCR techniques can accelerate the data entry process as well as protect customer privacy. Some companies have already been using these technologies, e.g. SF-Express (http://www.sf-express.com/cn/sc/). Another potential application is note taking, such as NEBO (https://www.myscript.com/nebo/), a note-taking application on tablets like the iPad that performs instant transcription as users write down notes.
Identity Authentication Automatic identity authentication is yet another field where OCR can be put to full use. In fields such as Internet finance and Customs, users/passengers are required to provide identification (ID) information, such as an identity card or passport. Automatic recognition and analysis of the provided documents would require OCR that reads and extracts the textual content, and can automate and greatly accelerate such processes. There are companies that have already started working on identification based on face and ID card, e.g. MEGVII (Face++) (https://www.faceplusplus.com/face-based-identification/).
Augmented Computer Vision As text is an essential element for the understanding of a scene, OCR can assist computer vision in many ways. In the scenario of autonomous vehicles, text-embedded panels carry important information, e.g. geo-location, current traffic conditions, and navigation. There have been several works on text detection and recognition for autonomous vehicles (Mammeri et al. 2014, 2016). The largest dataset so far, CTW (Yuan et al. 2018), also places extra emphasis on traffic signs. Another example is instant translation, where OCR is combined with a translation model. This is extremely helpful and time-saving as people travel or read documents written in foreign languages. Google's Translate application (https://translate.google.com/) can perform such instant translation. A similar application is instant text-to-speech software equipped with OCR, which can help those with visual disability and those who are illiterate (https://en.wikipedia.org/wiki/Screen_reader#cite_note-Braille_display-2).
Intelligent Content Analysis OCR also allows the industry to perform more intelligent analysis, mainly for platforms like video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as real-time commentary subtitles (a kind of floating comments added by users, e.g. those on Bilibili (https://www.bilibili.com) and Niconico (www.nicovideo.jp/)). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems. It can also be used to perform user sentiment analysis, e.g. which part of the video attracts the users most. On the other hand, website administrators can impose supervision and filtration on inappropriate and illegal content, such as terrorist advocacy.

6 Conclusion and Discussion

6.1 Status Quo

Algorithms The past several years have witnessed the significant development of algorithms for text detection and recognition, mainly due to the deep learning boom. Deep learning models have replaced the manual search for and design of patterns and features. With the improved capability of models, research attention has been drawn to challenges such as oriented and curved text detection, and considerable progress has been achieved.
Applications Apart from efforts towards a general solution to all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcards, ID cards, and driver's licenses. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc., and MEGVII Inc. Recent development of fast and efficient methods (Ren et al. 2015; Zhou et al. 2017) has also allowed the deployment of large-scale systems (Borisyuk et al. 2018). Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.
Table 9 State-of-the-art recognition performance across a number of datasets. Columns: Methods; ConvNet, Data; IIIT5k (50, 1k, 0); SVT (50, 0); IC03 (50, Full, 0); IC13 (0); IC15 (0); SVTP (0); CUTE (0); Total-Text (0). "50", "1k" and "Full" are lexicons; "0" means no lexicon. "90k" and "ST" are the Synth90k and the SynthText datasets, respectively. "ST+" means including character-level annotations. "Private" means private training data.
Table 10 Performance of end-to-end and word spotting on ICDAR 2015 and ICDAR 2013

                         Word spotting              End-to-end
Method                   S      W      G            S      W      G
ICDAR 2015
Liu et al. (2018c)       84.68  79.32  63.29        81.09  75.90  60.80
Xing et al. (2019)       –      –      –            80.14  74.45  62.18
Lyu et al. (2018a)       79.3   74.5   64.2         79.3   73.0   62.4
He et al. (2018)         85     80     65           82     77     63
Qin et al. (2019)        –      –      –            83.38  79.94  67.98
ICDAR 2013
Busta et al. (2017)      92     89     81           89     86     77
Liu et al. (2018c)       92.73  90.72  83.51        88.81  87.11  80.81
Li et al. (2017a)        94.2   92.4   88.2         91.1   89.8   84.6
He et al. (2018)         93     92     87           91     89     86
Lyu et al. (2018a)       92.5   92.0   88.2         92.2   91.1   86.5
Generalization Few detection algorithms except for TextSnake (Long et al. 2018) have considered the problem of generalization ability across datasets, i.e. training on one dataset and testing on another. Generalization ability is important, as some application scenarios require adaptability to varying environments. For example, instant translation and OCR in autonomous vehicles should be able to perform stably under different situations: zoomed-in images with large text instances, far and small words, blurred words, different languages, and shapes. It remains unverified whether simply pooling all existing datasets together is enough, especially when the target domain is totally unknown.
Evaluation Existing evaluation metrics for detection stem from those for general object detection. Matching based on IoU score or pixel-level precision and recall ignores the fact that missing parts and superfluous backgrounds may hurt the performance of the subsequent recognition procedure. For each text instance, pixel-level precision and recall are good metrics. However, their scores are set to 1.0 once the instance is matched to ground truth, and thus are not reflected in the final dataset-level score. An off-the-shelf alternative is to simply sum up the instance-level scores under DetEval instead of first assigning them to 1.0.
Synthetic Data While training recognizers on synthetic datasets has become a routine and results are excellent, detectors still rely heavily on real datasets. It remains a challenge to synthesize diverse and realistic images to train detectors. The potential benefits of synthetic data, such as generalization ability, are not yet fully explored. Synthesis using 3D engines and models can simulate different conditions such as lighting and occlusion, and is thus worth further development.
Efficiency Another shortcoming of deep-learning-based methods lies in their efficiency. Most of the current systems cannot run in real-time when deployed on computers without GPUs or on mobile devices. Apart from model compression and lightweight models that have proven effective in other tasks, it is also valuable to study how to design custom speedup mechanisms for text-related tasks.
Bigger and Better Datasets The sizes of most widely adopted datasets are small (∼ 1k images). It will be worthwhile to study whether the improvements gained from current algorithms can scale up, or whether they are just accidental results of better regularization. Besides, most datasets are only labeled with bounding boxes and texts. Detailed annotation of different attributes (Yuan et al. 2018), such as word-art and occlusion, may guide researchers with pertinence. Finally, datasets characterized by real-world challenges are also important in advancing research progress, such as densely located text on products. Another related problem is that most of the existing datasets do not have validation sets. It is highly possible that the currently reported evaluation results are actually upward biased due to overfitting on the test sets. We suggest that researchers focus on large datasets, such as ICDAR MLT 2017, ICDAR MLT 2019, ICDAR ArT 2019, and COCO-Text.

References

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.
Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., et al. (2019a). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE international conference on computer vision (pp. 4715–4723).
Baek, Y., Lee, B., Han, D., Yun, S., & Lee, H. (2019b). Character region awareness for text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 9365–9374).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Bai, F., Cheng, Z., Niu, Y., Pu, S., & Zhou, S. (2018). Edit probability for scene text recognition. In CVPR 2018.
Bartz, C., Yang, H., & Meinel, C. (2017). See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404.
Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE international conference on computer vision (pp. 785–792).
Borisyuk, F., Gordo, A., & Sivakumar, V. (2018). Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 71–79). ACM.
Busta, M., Neumann, L., & Matas, J. (2015). Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 1206–1214).
Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of ICCV.
Chen, X., Yang, J., Zhang, J., & Waibel, A. (2004). Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1), 87–99.
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., & Zhou, S. (2017a). Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE international conference on computer vision (ICCV) (pp. 5086–5094). IEEE.
Cheng, Z., Liu, X., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2017b). Arbitrarily-oriented text recognition. In CVPR 2018.
Ch'ng, C. K., & Chan, C. S. (2017). Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 935–942). IEEE.
Chowdhury, M. A., & Deb, K. (2013). Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 18–22.
Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., et al. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 440–445). IEEE.
Dai, Y., Huang, Z., Gao, Y., & Chen, K. (2017). Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. 886–893). IEEE.
Deng, D., Liu, H., Li, X., & Cai, D. (2018). Pixellink: Detecting scene text via instance segmentation. In Proceedings of AAAI 2018.
DeSouza, G. N., & Kak, A. C. (2002). Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 237–267.
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545.
Dvorin, Y., & Havosha, U. E. (2009). Method and device for instant translation, June 4. US Patent App. 11/998,931.
Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2963–2970). IEEE.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.
Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Goldberg, A. V. (1997). An efficient implementation of a scaling minimum-cost flow algorithm. Journal of Algorithms, 22(1), 1–29.
Gordo, A. (2015). Supervised mid-level features for word image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2956–2964).
Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). ACM.
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., & Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in neural information processing systems (pp. 577–584).
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2315–2324).
Ham, Y. K., Kang, M. S., Chung, H. K., Park, R.-H., & Park, G. T. (1995). Recognition of raised characters for automatic classification of rubber tires. Optical Engineering, 34(1), 102–110.
Han, J., Zhang, D., Cheng, G., Liu, N., & Xu, D. (2018). Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1), 84–100.
He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A. G., Kifer, D., & Giles, C. L. (2017a). Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 474–483). IEEE.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017b). Mask R-CNN. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2980–2988). IEEE.
He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., & Li, X. (2017c). Single shot text detector with regional attention. In The IEEE international conference on computer vision (ICCV).
He, P., Huang, W., Qiao, Y., Loy, C. C., & Tang, X. (2016). Reading scene text in deep convolutional sequences. In Thirtieth AAAI conference on artificial intelligence.
He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018). An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5020–5029).
He, W., Zhang, X.-Y., Yin, F., & Liu, C.-L. (2017d). Deep direct regression for multi-oriented scene text detection. In The IEEE international conference on computer vision (ICCV).
He, Z., Liu, J., Ma, H., & Li, P. (2005). A new automatic extraction method of container identity codes. IEEE Transactions on Intelligent Transportation Systems, 6(1), 72–78.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., & Ding, E. (2017). Wordsup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE international conference on computer vision, 2017.
Huang, W., Lin, Z., Yang, J., & Wang, J. (2013). Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE international conference on computer vision (pp. 1241–1248).
Huang, W., Qiao, Y., & Tang, X. (2014). Robust scene text detection with convolution neural network induced MSER trees. In European conference on computer vision (pp. 497–511). Springer.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014a). Deep structured output learning for unconstrained text recognition. In ICLR 2015.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014b). Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 1–20.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems (pp. 2017–2025).
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014c). Deep features for text spotting. In Proceedings of European conference on computer vision (ECCV) (pp. 512–528). Springer.
Jain, A. K., & Yu, B. (1998). Automatic text location in images and video frames. Pattern Recognition, 31(12), 2055–2076.
Jiang, Y., Zhu, X., Wang, X., Yang, S., Li, W., Wang, H., Fu, P., & Luo, Z. (2017). R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579.
Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: A survey. Pattern Recognition, 37(5), 977–997.
Kang, L., Li, Y., & Doermann, D. (2014). Orientation robust text line detection in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4034–4041).
Karatzas, D., & Antonacopoulos, A. (2004). Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004 (Vol. 2, pp. 634–637). IEEE.
Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., et al. (2015). ICDAR 2015 competition on robust reading. In 2015 13th international conference on document analysis and recognition (ICDAR) (pp. 1156–1160). IEEE.
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., et al. (2013). ICDAR 2013 robust reading competition. In 2013 12th international conference on document analysis and recognition (ICDAR) (pp. 1484–1493). IEEE.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
Lee, C.-Y., & Osindero, S. (2016). Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2231–2239).
Lee, J.-J., Lee, P.-H., Lee, S.-W., Yuille, A., & Koch, C. (2011). Adaboost for text detection in natural scene. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 429–434). IEEE.
Lee, S., & Kim, J. H. (2013). Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11), 823–840.
Li, H., Wang, P., & Shen, C. (2017a). Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE international conference on computer vision (ICCV).
Li, H., Wang, P., Shen, C., & Zhang, G. (2019). Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI.
Li, R., En, M., Li, J., & Zhang, H. (2017b). Weakly supervised text attention network for generating text proposals in scene images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 324–330). IEEE.
Liao, M., Shi, B., & Bai, X. (2018a). Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8), 3676–3690.
Liao, M., Shi, B., Bai, X., Wang, X., & Liu, W. (2017). Textboxes: A fast text detector with a single deep neural network. In AAAI (pp. 4161–4167).
Liao, M., Song, B., He, M., Long, S., Yao, C., & Bai, X. (2019a). Synthtext3d: Synthesizing scene text images from 3d virtual worlds. arXiv preprint arXiv:1907.06007.
Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., Yao, C., & Bai, X. (2019b). Scene text recognition from two-dimensional perspective. In AAAI.
Liao, M., Zhu, Z., Shi, B., Xia, G.-S., & Bai, X. (2018b). Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5909–5918).
Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5162–5170).
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2018a). Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016a). SSD: Single shot multibox detector. In Proceedings of European conference on computer vision (ECCV) (pp. 21–37). Springer.
Liu, W., Chen, C., & Wong, K. (2018b). Char-net: A character-aware neural network for distorted scene text recognition. In AAAI conference on artificial intelligence, New Orleans, Louisiana, USA.
Liu, W., Chen, C., Wong, K.-Y. K., Su, Z., & Han, J. (2016b). Star-net: A spatial attention residue network for scene text recognition. In BMVC (Vol. 2, p. 7).
Liu, X. (1975). Old book of tang. Beijing: Zhonghua Book Company.
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018c). FOTS: Fast oriented text spotting with a unified network. In CVPR 2018.
Liu, X., & Samarabandu, J. (2005a). An edge-based text region extraction algorithm for indoor mobile robot navigation. In 2005 IEEE international conference on mechatronics and automation (Vol. 2, pp. 701–706). IEEE.
Liu, X., & Samarabandu, J. K. (2005b). A simple and fast text localization algorithm for indoor mobile robot navigation. In Image processing: Algorithms and systems IV (Vol. 5672, pp. 139–151). International Society for Optics and Photonics.
Liu, Y., & Jin, L. (2017). Deep matching prior network: Toward tighter multi-oriented text detection.
Liu, Y., Jin, L., Xie, Z., Luo, C., Zhang, S., & Xie, L. (2019). Tightness-aware evaluation protocol for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9612–9620).
Liu, Y., Jin, L., Zhang, S., & Zhang, S. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.
Liu, Z., Li, Y., Ren, F., Yu, H., & Goh, W. (2018d). Squeezedtext: A real-time scene text recognition by binary convolutional encoder–decoder network. In AAAI.
Liu, Z., Lin, G., Yang, S., Feng, J., Lin, W., & Goh, W. L. (2018e). Learning Markov clustering networks for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6936–6944).
Long, S., Guan, Y., Bian, K., & Yao, C. (2020). A new perspective for flexible feature gathering in scene text recognition via character anchor pooling. In ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2458–2462). IEEE.
Long, S., Guan, Y., Wang, B., Bian, K., & Yao, C. (2019). Alchemy: Techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834.
Long, S., Ruan, J., Zhang, W., He, X., Wu, W., & Yao, C. (2018). Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).
Long, S., & Yao, C. (2020). Unrealtext: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608.
Lyu, P., Liao, M., Yao, C., Wu, W., & Bai, X. (2018a). Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).
Lyu, P., Yao, C., Wu, W., Yan, S., & Bai, X. (2018b). Multi-oriented scene text detection via corner localization and region segmentation. In 2018 IEEE conference on computer vision and pattern recognition (CVPR).
Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., et al. (2018). Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20, 3111–3122.
Mammeri, A., Boukerche, A., et al. (2016). MSER-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE symposium on computers and communication (ISCC) (pp. 1218–1223). IEEE.
Mammeri, A., Khiari, E.-H., & Boukerche, A. (2014). Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th vehicular technology conference (VTC Fall) (pp. 1–5). IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2011). An MRF model for binarization of natural scene text. In ICDAR - international conference on document analysis and recognition. IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2012). Scene text recognition using higher order language priors. In BMVC - British machine vision conference. BMVA.
International Journal of Computer Vision (2021) 129:161–184 183
Neumann, L., & Matas, J. (2010). A method for text localization and to scene text recognition. IEEE Transactions on Pattern Analysis
recognition in real-world images. In Asian conference on computer and Machine Intelligence, 39(11), 2298–2304.
vision (pp. 770–783). Springer. Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene
Neumann, L., & Matas, J. (2012). Real-time scene text localization text recognition with automatic rectification. In Proceedings of
and recognition. In 2012 IEEE conference on computer vision and the IEEE conference on computer vision and pattern recognition
pattern recognition (CVPR) (pp. 3538–3545). IEEE. (CVPR) (pp. 4168–4176).
Neumann, L., & Matas, J. (2013). On combining multiple segmentations Shi, B., Yang, M., Wang, X., Lyu, P., Bai, X., & Yao, C. (2018). Aster:
in scene text recognition. In 2013 12th international conference on An attentional scene text recognizer with flexible rectification.
document analysis and recognition (ICDAR) (pp. 523–527). IEEE. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Nomura, S., Yamanaka, K., Katai, O., Kawakami, H., & Shiose, T. 31(11), 855–868.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.