(2022-MM) SPTS Single-Point Text Spotting
Dezhi Peng2∗ , Xinyu Wang3∗ , Yuliang Liu1∗ , Jiaxin Zhang4∗ , Mingxin Huang2 , Songxuan Lai5 , Shenggao Zhu5 , Jing Li5 ,
Dahua Lin1 , Chunhua Shen6 , Lianwen Jin2
1 Chinese University of Hong Kong   2 South China University of Technology   3 University of Adelaide
4 ByteDance Inc.   5 Huawei Technologies   6 Zhejiang University
* The first four authors contributed equally to this work. Part of this work was done when Y. Liu and C. Shen were with the University of Adelaide.

Abstract

Almost all scene text spotting (detection and recognition) methods rely on costly box annotation (e.g., text-line boxes, word-level boxes, and character-level boxes). For the first time, we demonstrate that scene text spotting models can be trained with an extremely low-cost annotation of a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task, like language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that it is much easier to annotate, or even to generate automatically, than a bounding box that requires precise positions. We believe that such a pioneering attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than was previously possible.

1. Introduction

In the last decades, modern Optical Character Recognition (OCR) algorithms have become able to read textual content from pictures of complex scenes, an incredible development that has attracted enormous interest from both academia and industry. The limitations of existing methods, and particularly their poorer performance on arbitrarily shaped scene text, have been repeatedly identified [6, 18, 22]. This can be seen in the trend of worse predictions for instances with curved shapes, varied fonts, distortions, etc.

The focus of research in the OCR community has moved on from horizontal [19, 35] and multi-oriented text [40, 42] to arbitrary-shaped text [22, 28] in recent years, accompanied by a shift in annotation format from horizontal rectangles, to quadrilaterals, and to polygons. The fact that regular bounding boxes are prone to include noise has been well studied in previous works (see Fig. 1), which have shown that character-level and polygonal annotations can effectively lift model performance [18, 20, 28]. Furthermore, many efforts have been made to develop more sophisticated representations that fit arbitrarily shaped text instances [8, 22, 26, 36, 43] (see Fig. 2). For example, ABCNet [22] converts polygon annotations to Bezier curves for representing curved text instances, while TextDragon [8] utilizes character-level bounding boxes to generate centerlines for predicting local geometry attributes. However, these novel representations are carefully designed by experts based on prior knowledge, rely heavily on highly customized network architectures (e.g., specified RoI modules), and consume more expensive annotations (e.g., character-level annotations), limiting their generalization ability for practical applications.

To reduce the cost of data labeling, some researchers [1, 2, 11, 34] have explored training OCR models with coarse annotations in a weakly-supervised manner. These methods can mainly be separated into two categories, i.e., (1) bootstrapping labels to finer granularity and (2) training with partial annotations. The former usually derives character-level labels from word- or line-level annotations; thus, the models enjoy the well-understood advantage of character-level supervision without introducing additional annotation cost. The latter aims to achieve competitive performance with fewer training samples. However, both categories still rely on costly bounding-box annotations.

One of the underlying problems that prevents replacing the bounding box with a simpler annotation format, such as a single point, is that most text spotters rely on RoI-like sampling strategies to extract the shared backbone features. For example, Mask TextSpotter requires mask prediction inside an RoI [18]; ABCNet [22] proposes BezierAlign, while TextDragon [8] introduces RoISlide to unify the detection …
(a) Rectangle (55s). (b) Quadrilateral (96s). (c) Character (581s). (d) Polygon (172s). (e) Single-Point (11s).
Figure 1. Different annotation styles and their time cost (for all the text instances in the sample image), measured with the OpenMMLab LabelBee tool. Green areas are positive samples, while red dashed boxes mark noise that may be included. Single-point annotation is more than 50 times faster than character-level annotation.
Figure 3. Overall framework of the proposed SPTS. The visual and contextual features are first extracted by a series of CNN and transformer
encoders. Then, the features are auto-regressively decoded into a sequence that contains both localization and recognition information,
which is subsequently translated into point coordinates and text transcriptions. Only a point-level annotation is required for training.
… the arbitrary-shaped text spotting task, i.e., segmentation-based [18, 31, 32, 39] and regression-based methods [8, 22, 37]. The former first predicts masks to segment text instances, then the features inside text regions are sampled and grouped for further recognition. For example, Mask …

… RoISlide [8], and RoIMasking [18] are required to bridge the detection and recognition blocks, where backbone features are cropped and shared between detection and recognition heads. Under such types of design, the recognition and detection modules are highly coupled. For example, …
Figure 4. Pipeline of the sequence construction for text spotting.

The fact that a sequence can carry information with multiple attributes naturally enables the text spotting task, where text instances are simultaneously localized and recognized. To express the target text instances by a sequence, the continuous descriptions (e.g., bounding boxes) must be converted into a discretized space. To this end, as shown in Fig. 4, we follow Pix2Seq [4] to build the target sequence.

What distinguishes our method is that we further simplify the bounding box to a single point. Specifically, the continuous coordinates of the top-left corner of the first character of the text instance (top-left point for short) are uniformly discretized into an integer in [1, n_bins], where n_bins controls the degree of discretization. For example, an image with a long side of 800 pixels requires only n_bins = 800 to achieve zero quantization error. A text instance can thereby be represented by a sequence of three tokens, i.e., [x, y, t], where t is the transcription text. Notably, the transcriptions are inherently discrete, i.e., each character represents a category, and can thus be easily appended to the sequence. However, different from generic object detection, which has a relatively fixed vocabulary (each t represents an object category, such as pedestrian), t can be any natural language text of any length in our task, resulting in a variable-length target sequence, which may further cause misalignment issues and consume more computational resources.

To eliminate such problems, we pad or truncate the texts to a fixed length l_tr, where the <PAD> token is used to fill the vacancy for shorter text instances. In addition, like other language modeling methods, <SOS> and <EOS> tokens are inserted at the head and tail of the sequence, indicating the start and the end of a sequence, respectively. Therefore, given an image that contains n_ti text instances, the constructed sequence will include (2 + l_tr) × n_ti discrete tokens, where the text instances are randomly ordered, following previous works [4]. Supposing there are n_cls categories of characters (e.g., 97 for English characters and symbols), the vocabulary size of the dictionary used to tokenize the sequence is n_bins + n_cls + 3, where the extra three classes are for the <PAD>, <SOS>, and <EOS> tokens. Empirically, we set l_tr and n_bins to 25 and 1,000, respectively, in our experiments.
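To make this construction concrete, the following is a minimal sketch (not the authors' released code): it discretizes the top-left point, pads the transcription to l_tr = 25, and lays out a vocabulary of n_bins + n_cls + 3 ids. The charset and helper names are illustrative assumptions.

```python
import random
import string

# Illustrative assumptions: the exact charset (the paper uses 97 classes) and the
# id layout below are ours, chosen only to demonstrate the construction.
CHARSET = string.digits + string.ascii_letters + string.punctuation + " "
N_BINS = 1000                                   # coordinate bins (n_bins)
L_TR = 25                                       # fixed transcription length (l_tr)
PAD = N_BINS + len(CHARSET)
SOS = PAD + 1
EOS = PAD + 2
VOCAB_SIZE = N_BINS + len(CHARSET) + 3          # n_bins + n_cls + 3

def discretize(coord, side):
    """Map a continuous coordinate to an integer bin (0-indexed here)."""
    return min(int(coord / side * N_BINS), N_BINS - 1)

def instance_tokens(x, y, text, img_w, img_h):
    """One instance -> [x, y, t_1 ... t_L_TR] with <PAD> filling short texts."""
    chars = [N_BINS + CHARSET.index(c) for c in text[:L_TR]]
    chars += [PAD] * (L_TR - len(chars))
    return [discretize(x, img_w), discretize(y, img_h)] + chars

def build_target_sequence(instances, img_w, img_h):
    """<SOS> + randomly ordered instances + <EOS>."""
    instances = list(instances)
    random.shuffle(instances)
    body = [t for (x, y, text) in instances
            for t in instance_tokens(x, y, text, img_w, img_h)]
    return [SOS] + body + [EOS]

# e.g. build_target_sequence([(120.0, 40.0, "HAIGHT"), (300.0, 200.0, "ASHBURY")], 800, 600)
```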
2.2. Model Training

Sequence Augmentation. As shown in Fig. 3, the SPTS decoder predicts the sequence in an auto-regressive manner, where the <EOS> token is employed to decide the termination of generation. In practice, however, the <EOS> token tends to terminate the sequence prediction prematurely, leading to a low recall rate, as identified in previous works [4]. The reason for this issue may be annotation noise and difficult samples. Therefore, a sequence augmentation strategy is adopted to postpone the emergence of the <EOS> token. Specifically, as shown in Fig. 5, noise instances, which consist of randomly generated positions and transcriptions, are inserted into the input sequence. In the output sequence, each noise text instance corresponds to several 'N/A' tokens, one noise token, and one <EOS> token. During training, the noise token helps the model distinguish the noise text instances, and the sampling of <EOS> is thereby delayed. Notably, the loss of the 'N/A' tokens is excluded from optimization.

Training Objective. Since SPTS is trained to predict tokens, it only requires maximizing the likelihood loss at training time, which can be written as:

maximize \sum_{i=1}^{L} w_i \log P(\tilde{s}_i \mid I, s_{1:i-1}),    (1)

where I is the input image, \tilde{s} is the target sequence, s is the input sequence, L is the length of the sequence, and w_i is the weight of the likelihood of the i-th token, which is empirically set to 1.
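A minimal sketch of this objective, assuming PyTorch tensors and an ignore-index convention for the excluded 'N/A' positions (with w_i = 1, Eq. (1) reduces to an ordinary token-level cross-entropy):

```python
import torch
import torch.nn.functional as F

IGNORE_ID = -100  # target positions to exclude from the loss (e.g., the 'N/A' slots)

def sequence_loss(logits, targets):
    """Token-level negative log-likelihood corresponding to Eq. (1) with w_i = 1.

    logits:  (batch, seq_len, vocab_size) raw decoder outputs
    targets: (batch, seq_len) target token ids, IGNORE_ID where no loss is applied
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * seq_len, vocab_size)
        targets.reshape(-1),                  # (batch * seq_len,)
        ignore_index=IGNORE_ID,
    )
```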
2.3. Inference

At the inference stage, SPTS auto-regressively predicts tokens until the end-of-sequence token <EOS> occurs. The predicted sequence is subsequently divided into multiple segments, each of which contains 2 + l_tr tokens. The tokens can then be easily translated into point coordinates and transcriptions, yielding the text spotting results. In addition, the likelihood of all tokens in the corresponding segment is averaged and assigned as a confidence score to filter the raw outputs, which efficiently removes redundant and false-positive predictions.
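As a rough illustration of this post-processing (reusing the constants and charset assumed in the construction sketch above; not taken from the official implementation), the predicted ids can be cut into segments of 2 + l_tr tokens and scored by their mean token likelihood:

```python
def decode_predictions(token_ids, token_probs, img_w, img_h, conf_thresh=0.9):
    """Translate predicted ids (between <SOS> and <EOS>) into spotting results.

    token_probs holds the per-token likelihood of each chosen id; a segment's
    confidence is their average, and low-confidence segments are dropped.
    """
    results = []
    seg_len = 2 + L_TR
    for start in range(0, len(token_ids) - seg_len + 1, seg_len):
        seg = token_ids[start:start + seg_len]
        probs = token_probs[start:start + seg_len]
        x = (seg[0] + 0.5) / N_BINS * img_w     # undo the coordinate discretization
        y = (seg[1] + 0.5) / N_BINS * img_h
        text = "".join(CHARSET[t - N_BINS]
                       for t in seg[2:]
                       if N_BINS <= t < N_BINS + len(CHARSET))
        confidence = sum(probs) / len(probs)
        if confidence > conf_thresh:            # the 0.9 threshold used in Sec. 3.3
            results.append({"point": (x, y), "text": text, "score": confidence})
    return results
```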
Figure 5. Illustration of the sequence augmentation. Noise text instances that consist of randomly generated coordinates and transcriptions are inserted to postpone the <EOS> token.
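The augmentation in Fig. 5 could be sketched roughly as follows, again reusing the helpers assumed earlier; the exact layout of the 'N/A' and noise targets in the released implementation may differ, so this is only meant to convey the idea:

```python
import random

NOISE_ID = VOCAB_SIZE       # hypothetical extra class id marking a noise instance
NA_ID = IGNORE_ID           # 'N/A' positions, excluded from the loss

def append_noise_instances(input_seq, target_seq, img_w, img_h, num_noise=2):
    """Append random fake instances so that <EOS> is not sampled too early (sketch)."""
    for _ in range(num_noise):
        x, y = random.uniform(0, img_w), random.uniform(0, img_h)
        junk = "".join(random.choice(CHARSET) for _ in range(random.randint(1, L_TR)))
        tokens = instance_tokens(x, y, junk, img_w, img_h)
        input_seq = input_seq + tokens
        # Most positions are ignored; the last one teaches the model to emit the
        # noise class rather than <EOS> when it encounters such an instance.
        target_seq = target_seq + [NA_ID] * (len(tokens) - 1) + [NOISE_ID]
    return input_seq, target_seq
```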
3. Experiments

We report experimental results on four scene text datasets: the horizontal dataset ICDAR 2013 [15], the multi-oriented dataset ICDAR 2015 [13], and the arbitrarily-shaped datasets Total-Text [5] and SCUT-CTW1500 [24].

Figure 6. Illustration of the point-based evaluation metric. Colored diamonds are predicted points; circles represent ground truth.

3.1. Datasets

Curved Synthetic Dataset 150k. It is well established that the performance of text spotters can be improved by pre-training on synthesized samples. Following previous work [22], we use the 150k synthetic images generated by the SynthText [9] toolbox, which contain around one-third curved text and two-thirds horizontal instances.
ICDAR 2013 [15] contains 229 training and 233 testing samples. The images are primarily captured in a controlled environment, where the text contents of interest are horizontal and explicitly focused.

ICDAR 2015 [13] consists of 1,000 training and 500 testing images that were incidentally captured, containing multi-oriented text instances presented against complicated backgrounds with strong variations in blur, distortion, etc.

Total-Text [5] includes 1,255 training and 300 testing images, where at least one curved sample is presented in each image and annotated with a polygonal bounding box at the word level.

SCUT-CTW1500 [24] is another widely used benchmark designed for spotting arbitrarily shaped scene text, involving 1,000 and 500 images for training and testing, respectively. The text instances are labeled by polygons at the text-line level.

3.2. Evaluation Protocol

The existing evaluation protocol for text spotting tasks consists of two steps. First, the intersection over union (IoU) scores between ground-truth (GT) and detected boxes are calculated; only if the IoU score is larger than a designated threshold (usually set to 0.5) are the boxes matched. Then, the recognized content inside each matched bounding box is compared with the GT transcription; only if the predicted text is the same as the GT does it contribute to the end-to-end accuracy. However, in the proposed method, each text instance is represented by a single point; thus, the IoU-based evaluation metric cannot be used to measure the performance. Meanwhile, comparing the localization performance between bounding-box-based methods and the proposed point-based SPTS might be unfair; e.g., directly treating points inside a bounding box as true positives may overestimate the detection performance. To this end, we propose a new evaluation metric to ensure a relatively fair comparison to existing approaches, which mainly considers the end-to-end accuracy, as it reflects both detection and recognition performance (failed detections usually lead to incorrect recognition results). Specifically, as shown in Fig. 6, we modify the text instance matching rule by replacing the IoU metric with a distance metric, i.e., the predicted point that has the nearest distance to the top-left corner of the GT box is selected, and the recognition results are measured by the same full-matching rules used in existing benchmarks. Only the one predicted point with the highest confidence is matched to the ground truth; others are then marked as false positives.
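A minimal sketch of this matching rule, written as a paraphrase of the protocol rather than the official evaluation script (tie-breaking, text normalization, and the order of matching are assumptions):

```python
import math

def point_based_e2e(predictions, ground_truths):
    """predictions: [((x, y), text, score)], ground_truths: [((x, y), text)].

    Counts a true positive when the highest-confidence prediction nearest to a GT
    top-left point also reproduces its transcription exactly.
    """
    matched = set()
    tp = 0
    for (px, py), text, _score in sorted(predictions, key=lambda p: -p[2]):
        nearest, best_dist = None, float("inf")
        for idx, ((gx, gy), _gt_text) in enumerate(ground_truths):
            if idx in matched:
                continue
            dist = math.hypot(px - gx, py - gy)
            if dist < best_dist:
                nearest, best_dist = idx, dist
        if nearest is not None and ground_truths[nearest][1] == text:
            matched.add(nearest)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truths) - tp
    return tp, fp, fn
```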
To explore whether the proposed evaluation protocol can genuinely represent the model accuracy, Table 1 compares the end-to-end recognition accuracy of ABCNetv1 [22] and ABCNetv2 [25] on Total-Text [5] and SCUT-CTW1500 [24] under the two metrics, i.e., the commonly used bounding-box metric based on IoU, and the proposed point-based metric. The results demonstrate that the point-based evaluation protocol can well reflect the performance, with the difference between the values evaluated by the box-based and point-based metrics being less than 1%. For example, the ABCNetv1 model achieves 53.5 and 53.2 under the two metrics, respectively. Therefore, we use the point-based metric to evaluate the proposed SPTS in the following experiments.

Table 1. Comparison of the end-to-end recognition performance evaluated by the proposed point-based metric and the box-based metric on Total-Text [5] and SCUT-CTW1500 [24]. All results are reproduced based on the official codes.

Method           Total-Text (%)        CTW1500 (%)
                 Box      Point        Box      Point
ABCNetv1 [22]    67.2     66.1         53.5     53.2
ABCNetv2 [25]    71.7     70.7         57.6     57.3

Figure 7. Indicated points (red color) using different positions: (a) Top-left, (b) Central, (c) Random.

Table 2. Ablation study of the position of the point on Total-Text [5] and SCUT-CTW1500 [24]. Top-left, central, and random points are illustrated in Fig. 7. E2E: end-to-end evaluation metric.

Position     E2E Total-Text        E2E CTW1500
             None     Full         None     Full
Top-left     67.9     74.1         56.3     67.2
Central      67.6     74.6         56.2     66.8
Random       66.6     73.3         51.7     63.9

Table 3. Comparison with different shapes of bounding box. Np is the number of parameters required to describe the text instances by different representations.

Variants       Total-Text           CTW1500              Np
               None     Full        None     Full
SPTS-Bezier    58.5     68.3        42.2     60.8         16
SPTS-Rect      67.7     69.9        48.7     61.2         4
SPTS-Point     67.9     74.1        56.3     67.2         2

3.3. Implementation Details
The model is first pretrained on a combined dataset that includes the Curved Synthetic Dataset 150k [22], MLT-2017 [30], ICDAR 2013 [15], ICDAR 2015 [13], and Total-Text [5] for 150 epochs, optimized by AdamW [27] with an initial learning rate of 5 × 10−4 that is linearly decayed to 1 × 10−5. After pretraining, the model is fine-tuned on the training split of each target dataset for another 200 epochs with a fixed learning rate of 1 × 10−5. The entire model is trained in a distributed manner on 32 NVIDIA V100 GPUs with a batch size of 64. We utilize ResNet-50 as the backbone network, while both the transformer encoder and decoder consist of 6 layers with eight heads. During training, the short side of the input image is randomly resized to a range from 640 to 896 (in intervals of 32). Random cropping and rotation are employed for data augmentation. At the inference stage, we resize the short edge to 1,000 while keeping the longer side shorter than 1,824 pixels, following previous works [22, 25]. Moreover, as noise tokens are introduced for the sequence augmentation during training, we only preserve predicted text instances with a confidence score larger than 0.9.
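As a hedged sketch of these optimization settings (the model factory below is hypothetical and the exact scheduler form is an assumption; the paper only states AdamW with a learning rate decayed linearly from 5e-4 to 1e-5):

```python
import torch

# Hypothetical factory; the released SPTS model definition is not reproduced here.
# model = build_spts(backbone="resnet50", enc_layers=6, dec_layers=6, num_heads=8)

def make_optimizer(model, epochs=150, base_lr=5e-4, final_lr=1e-5):
    """AdamW with a per-epoch learning rate decayed linearly from base_lr to final_lr."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def linear_decay(epoch):
        frac = epoch / max(epochs - 1, 1)
        return (base_lr + (final_lr - base_lr) * frac) / base_lr  # multiplicative factor

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
    return optimizer, scheduler
```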
3.4. Ablation Study

Ablation study of the position of the indicated point. In this paper, we propose to simplify the bounding box to a single point. Intuitively, all points in the region enclosed by the bounding box should be able to represent the target text instance. To explore the differences, we conduct ablation studies that use three different strategies to obtain the indicated point (see Fig. 7), i.e., the top-left corner, the central point (obtained by averaging the upper and lower midpoints), and a random point inside the box. It should be noted that we use the corresponding ground truth here to evaluate the performance, i.e., the top-left ground truth is used for evaluating top-left, the central ground truth for central, and the closest distance to the polygon for random.

The results are shown in Table 2, where the results of top-left and central are very close on both datasets. This suggests that the performance is not very sensitive to the position of the point annotation. The results of random are inferior, especially on the long text-line-based dataset SCUT-CTW1500. We conjecture that this is because the random position in very long text instances exacerbates the difficulty of model convergence.

Comparison between various representations. The proposed SPTS can be easily extended to produce bounding boxes by modifying the point coordinates to bounding box locations during sequence construction. Here, we conduct ablations to explore the influence of different representations of the text instances. Specifically, three variants are explored, including the Bezier curve bounding box (SPTS-Bezier), the rectangular bounding box (SPTS-Rect), and the indicated point (SPTS-Point). Since we only focus on …
Table 4. End-to-end recognition results on ICDAR 2013. "S", "W", and "G" represent recognition with "Strong", "Weak", and "Generic" lexicon, respectively.

Method                          IC13 End-to-End
                                S        W        G
Bounding Box-based methods
Jaderberg et al. [12]           86.4     -        -
Textboxes [19]                  91.6     89.7     83.9
Deep Text Spotter [3]           89.0     86.0     77.0
Li et al. [16]                  91.1     89.8     84.6
MaskTextSpotter [28]            92.2     91.1     86.5
Point-based method
SPTS (Ours)                     87.6     85.6     82.9

Method                          Total-Text End-to-End
                                None     Full
Bounding Box-based methods
CharNet [20]                    66.6     -
ABCNet [22]                     64.2     75.7
PGNet [38]                      63.1     -
Mask TextSpotter [28]           65.3     77.4
Qin et al. [32]                 67.8     -
Mask TextSpotter v3 [18]        71.2     78.4
MANGO [31]                      72.9     83.6
PAN++ [39]                      68.6     78.6
ABCNet v2 [25]                  70.4     78.1
Point-based method
SPTS (Ours)                     67.9     74.1
Figure 8. Qualitative results on the scene text benchmarks. Images are selected from SCUT-CTW1500 [24] (first col.), Total-Text [5] (second col.), ICDAR 2013 [15] (third col.), and ICDAR 2015 [13] (fourth col.). Best viewed on screen.
… includes all the prediction results. A more efficient framework would be valuable to develop in the future.

5. Conclusion

We propose SPTS which, to the best of our knowledge, is a pioneering method that tackles scene text spotting using only extremely low-cost single-point annotations. SPTS is a concise auto-regressive transformer-based framework that generates the results as sequential tokens. Extensive experiments, as well as ablation studies on various benchmarks, have shown promising results of our method. It is also worth mentioning that our method may be generalized to generic object detection tasks. In addition, regarding the general value of the experimental results, we believe SPTS opens up great possibilities towards multi-task learning, which is definitely worth exploring in the future.

References

[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 9365–9374, 2019.
[2] Christian Bartz, Haojin Yang, and Christoph Meinel. SEE: Towards semi-supervised end-to-end scene text recognition. In Proc. AAAI Conf. Artificial Intell., 2018.
[3] Michal Busta, Lukas Neumann, and Jiri Matas. Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework. In Proc. IEEE Int. Conf. Comp. Vis., pages 2204–2212, 2017.
[4] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2Seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
[5] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proc. Int. Conf. Doc. Anal. and Recognit., volume 1, pages 935–942. IEEE, 2017.
[6] Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1571–1576, 2019.
[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
[8] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. TextDragon: An end-to-end framework for arbitrary shaped text spotting. In Proc. IEEE Int. Conf. Comp. Vis., pages 9076–9085, 2019.
[9] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2315–2324, 2016.
[10] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5020–5029, 2018.
[11] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. WordSup: Exploiting word annotations for character based text detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 4940–4949, 2017.
[12] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis., 116(1):1–20, 2016.
[13] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1156–1160. IEEE, 2015.
[14] Dimosthenis Karatzas, S Robles Mestre, Joan Mas, Farshad Nourbakhsh, and P Pratim Roy. ICDAR 2011 robust reading competition-challenge 1: reading text in born-digital images (web and email). In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1485–1490, 2011.
[15] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. ICDAR 2013 robust reading competition. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1484–1493. IEEE, 2013.
[16] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 5238–5246, 2017.
[17] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[18] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. In Proc. Eur. Conf. Comp. Vis., 2020.
[19] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artificial Intell., 2017.
[20] Xing Linjie, Tian Zhi, Huang Weilin, and R. Scott Matthew. Convolutional character networks. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
[21] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5676–5685, 2018.
[22] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. ABCNet: Real-time scene text spotting with adaptive Bezier-curve network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[23] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1962–1969, 2017.
[24] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recogn., 90:337–345, 2019.
[25] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting. IEEE Trans. Pattern Anal. Mach. Intell., pages 1–1, 2021.
[26] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proc. Eur. Conf. Comp. Vis., pages 20–36, 2018.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. Int. Conf. Learn. Representations, 2018.
[28] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proc. Eur. Conf. Comp. Vis., pages 67–83, 2018.
[29] Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, et al. ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition–RRC-MLT-2019. arXiv preprint arXiv:1907.00945, 2019.
[30] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proc. Int. Conf. Doc. Anal. and Recognit., volume 1, pages 1454–1459. IEEE, 2017.
[31] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. MANGO: A mask attention guided one-stage scene text spotter. In Proc. AAAI Conf. Artificial Intell., pages 2467–2476, 2021.
[32] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In Proc. IEEE Int. Conf. Comp. Vis., pages 4704–4714, 2019.
[33] Yipeng Sun, Zihan Ni, Chee-Kheng Chng, Yuliang Liu, Canjie Luo, Chun Chet Ng, Junyu Han, Errui Ding, Jingtuo Liu, Dimosthenis Karatzas, et al. ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1557–1562, 2019.
[34] Shangxuan Tian, Shijian Lu, and Chongshou Li. WeText: Scene text detection under weak supervision. In Proc. IEEE Int. Conf. Comp. Vis., pages 1492–1500, 2017.
[35] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In Proc. Eur. Conf. Comp. Vis., pages 56–72. Springer, 2016.
[36] Fangfang Wang, Yifeng Chen, Fei Wu, and Xi Li. TextRay: Contour-based geometric modeling for arbitrary-shaped scene text detection. In Proc. ACM Int. Conf. Multimedia, pages 111–119, 2020.
[37] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and Wenyu Liu. All you need is boundary: Toward arbitrary-shaped text spotting. In Proc. AAAI Conf. Artificial Intell., pages 12160–12167, 2020.
[38] Pengfei Wang, Chengquan Zhang, Fei Qi, Shanshan Liu, Xiaoqiang Zhang, Pengyuan Lyu, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. PGNet: Real-time arbitrarily-shaped text spotting with point gathering network. arXiv preprint arXiv:2104.05458, 2021.
[39] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Yang Zhibo, Tong Lu, and Chunhua Shen. PAN++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[40] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting texts of arbitrary orientations in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1083–1090. IEEE, 2012.
[41] Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1577–1581, 2019.
[42] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5551–5560, 2017.
[43] Yiqin Zhu, Jianyong Chen, Lingyu Liang, Zhanghui Kuang, Lianwen Jin, and Wayne Zhang. Fourier contour embedding for arbitrary-shaped text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3123–3131, 2021.
Appendix

A. SPTS without Point Supervision

As discussed in Sec. 4, we further explore whether SPTS can learn to read text even without location information. Specifically, the point coordinates are abandoned during sequence construction in this experiment, forcing the model to predict the transcriptions directly.

Fig. 10 presents qualitative results of the SPTS-noPoint model on several scene text benchmarks, where the text can be recognized without positional supervision, suggesting that the model could learn to localize the textual contents implicitly with transcription labels only. Such results may encourage further simplification of current text spotters, albeit some downstream applications requiring the precise location of each text instance may be limited.

C. Single-Point Object Detection

The proposed SPTS can be extended to the single-point object detection task. In this experiment, we use PASCAL VOC 2007 trainval (5,011 images) and PASCAL VOC 2012 trainval (11,540 images) to train our network. The central point of the object is selected as the indicated point. It takes about 150 epochs for our network to reach convergence. The qualitative results on the PASCAL VOC 2007 test set are shown in Fig. 12.
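To illustrate how the single-point formulation carries over to detection (our sketch of the idea described above, not the released code), a VOC-style box annotation can be reduced to its central point plus a class label before the same sequence construction is applied:

```python
# The 20 PASCAL VOC categories.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def box_to_point_instance(xmin, ymin, xmax, ymax, class_name):
    """Reduce a VOC box to its central indicated point plus a class index (sketch)."""
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    return cx, cy, VOC_CLASSES.index(class_name)

# Such (cx, cy, class) triples can then be discretized and serialized in the same
# way as the (x, y, transcription) triples used for text spotting.
```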