
SPTS: Single-Point Text Spotting*

Dezhi Peng2*, Xinyu Wang3*, Yuliang Liu1*, Jiaxin Zhang4*, Mingxin Huang2, Songxuan Lai5, Shenggao Zhu5, Jing Li5, Dahua Lin1, Chunhua Shen6, Lianwen Jin2

1 Chinese University of Hong Kong   2 South China University of Technology   3 University of Adelaide   4 ByteDance Inc.   5 Huawei Technologies   6 Zhejiang University

arXiv:2112.07917v1 [cs.CV] 15 Dec 2021

Abstract

Almost all scene text spotting (detection and recognition) methods rely on costly box annotation (e.g., text-line boxes, word-level boxes, and character-level boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task, like language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that it can be annotated and automatically generated much more easily than a bounding box, which requires precise positions. We believe such a pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than previously possible.

1. Introduction

In the last decades, it has been witnessed that modern Optical Character Recognition (OCR) algorithms are able to read textual content from pictures of complex scenes, which is an incredible development, leading to enormous interest from both academia and industry. The limitations of existing methods, and particularly their poorer performance on arbitrarily shaped scene text, have been repeatedly identified [6, 18, 22]. This can be seen in the trend of worse predictions for instances with curved shapes, varied fonts, distortions, etc.

The focus of research in the OCR community has moved on from horizontal [19, 35] and multi-oriented text [40, 42] to arbitrary-shaped text [22, 28] in recent years, accompanied by a shift in annotation format from horizontal rectangles, to quadrilaterals, and to polygons. The fact that regular bounding boxes are prone to include noise has been well studied in previous works (see Fig. 1), which have shown that character-level and polygonal annotations can effectively lift model performance [18, 20, 28]. Furthermore, many efforts have been made to develop more sophisticated representations to fit arbitrarily shaped text instances [8, 22, 26, 36, 43] (see Fig. 2). For example, ABCNet [22] converts polygon annotations to Bezier curves for representing curved text instances, while TextDragon [8] utilizes character-level bounding boxes to generate centerlines for enabling the prediction of local geometry attributes. However, these novel representations are carefully designed by experts based on prior knowledge, rely heavily on highly customized network architectures (e.g., specified RoI modules), and consume more expensive annotations (e.g., character-level annotations), limiting their generalization ability for practical applications.

To reduce the cost of data labeling, some researchers [1, 2, 11, 34] have explored training OCR models with coarse annotations in a weakly supervised manner. These methods can mainly be separated into two categories, i.e., (1) bootstrapping labels to finer granularity and (2) training with partial annotations. The former usually derives character-level labels from word- or line-level annotations; thus, the models can enjoy the well-understood advantage of character-level supervision without introducing extra labeling cost. The latter is committed to achieving competitive performance with fewer training samples. However, both kinds of methods still rely on costly bounding-box annotations.

One of the underlying problems that prevents replacing the bounding box with a simpler annotation format, such as a single point, is that most text spotters rely on RoI-like sampling strategies to extract the shared backbone features. For example, Mask TextSpotter requires mask prediction inside an RoI [18]; ABCNet [22] proposes BezierAlign, while TextDragon [8] introduces RoISlide to unify the detection and recognition heads.

* The first four authors contributed equally to this work. Part of this work was done when Y. Liu and C. Shen were with the University of Adelaide.
Figure 1. Different annotation styles and their time cost (for all the text instances in the sample image) measured by the OpenMMLab LabelBee tool (https://github.com/open-mmlab/labelbee-client): (a) Rectangle (55 s); (b) Quadrilateral (96 s); (c) Character (581 s); (d) Polygon (172 s); (e) Single-Point (11 s). Green areas are positive samples, while red dashed boxes are noise that may be included. Single-point annotation is more than 50 times faster than character-level annotation.

Figure 2. Some recent representations of text instances. (a) TextDragon [8] employs character-level bounding boxes to generate text centerlines. (b) ABCNet [22] converts polygon annotations to Bezier curves. (c) TextSnake [26] describes text instances by a series of ordered disks centered at symmetric axes.

In this paper, inspired by the recent success of the sequence-based object detector Pix2Seq [4], we show that a text spotter can be trained with a single point, also termed the indicated point (see Fig. 1e). Thanks to such a concise form of annotation, labeling time can be significantly reduced; e.g., it takes less than one-fiftieth of the time to label single points for the sample image shown in Fig. 1 compared with annotating character-level bounding boxes, which is extremely tedious, especially for small and vague text instances. Another motivating factor for selecting point annotation is that a clean and efficient OCR pipeline can be developed, discarding the complex post-processing modules and sampling strategies, and the ambiguity introduced by RoIs (see red dashed regions in Fig. 1) can thus be alleviated. To the best of our knowledge, this is the first attempt to simplify the bounding box to a single-point supervision signal in the OCR community. The main contributions of this work are summarized as follows:

• For the first time, we show that text spotters can be supervised by a simple yet effective single-point representation. Such a straightforward annotation format can considerably reduce labeling costs, making it possible to access large-scale OCR data in the future.

• We propose an auto-regressive transformer-based scene text spotter, which formulates text spotting as a language modeling task. Given an input image, our method predicts a sequence of discrete tokens that includes both detection and recognition results. Benefiting from such a concise pipeline, the complex post-processing and sampling strategies designed based on prior knowledge can be discarded, showing great potential in generalization ability.

• To evaluate the effectiveness of the proposed method, extensive experiments and ablations are conducted on four widely used OCR datasets, i.e., ICDAR 2013 [15], ICDAR 2015 [13], Total-Text [5], and SCUT-CTW1500 [24], involving both horizontal and arbitrary-shaped text, as well as a qualitative experiment on the generic object detection dataset Pascal VOC [7]. The results show that the proposed SPTS achieves competitive performance compared to state-of-the-art approaches.

1.1. Related Work

In the past decades, a variety of scene text datasets using different annotation styles have been proposed, focusing on various scenarios, including horizontal text [14, 15] described by rectangles (Fig. 1a), multi-oriented text [13, 29] represented by quadrilaterals (Fig. 1b), and arbitrary-shaped text [5, 6, 24] labeled by polygons (Fig. 1d). These forms of annotation have facilitated the development of corresponding OCR algorithms. For example, earlier works [16] usually adapt generic object detectors to scene text spotting, where feature maps are shared between detection and recognition heads via RoI modules. These approaches follow the sampling mechanism designed for generic object detection and use rectangles to express text instances, thus performing worse on non-horizontal targets. Later, some methods [3, 10, 21] replace the rectangular bounding boxes with quadrilaterals by modifying the regular Region Proposal Network (RPN) to generate oriented proposals, enabling better performance for multi-oriented text. Recently, with the introduction of curved scene text datasets [5, 6, 24], the research interest of the OCR community has shifted to more challenging arbitrarily shaped text.
Figure 3. Overall framework of the proposed SPTS. The visual and contextual features are first extracted by a series of CNN and transformer encoders. Then, the features are auto-regressively decoded into a sequence that contains both localization and recognition information, which is subsequently translated into point coordinates and text transcriptions. Only point-level annotation is required for training.

Generally, there are two widely adopted solutions to the arbitrary-shaped text spotting task, i.e., segmentation-based [18, 31, 32, 39] and regression-based methods [8, 22, 37]. The former first predicts masks to segment text instances, and the features inside the text regions are then sampled and grouped for further recognition. For example, Mask TextSpotter v3 [18] proposes a Segmentation Proposal Network (SPN) instead of the regular RPN to decouple neighboring text instances accurately, thus significantly improving the performance. Regression-based methods usually parameterize the text instances as a sequence of coordinates and subsequently learn to predict them. For instance, ABCNet [22] converts polygons into Bezier curves, significantly improving performance on curved scene text. Wang et al. [37] first localize the boundary points of text instances and then feed the features rectified by Thin-Plate-Spline into the recognition branch, demonstrating promising accuracy on arbitrary-shaped instances. Moreover, Xing et al. [20] boost text spotting performance by utilizing character-level annotations, where character bounding boxes, as well as type segmentation maps, are predicted simultaneously, achieving impressive performance. Even though different representations are adopted in the above methods to describe the text instances, they are all derived from one of the rectangular, quadrilateral, or polygonal bounding boxes. Such annotations must be carefully labeled by human annotators and are thus quite expensive, limiting the scale of training datasets.

In this paper, we propose Single-Point Text Spotting (SPTS), which is, to the best of our knowledge, the first scene text spotter that does not rely on bounding box annotations at all. Specifically, each text instance is represented by a single point (see Fig. 1e) at a meager cost. The fact that this point does not need to be accurately placed (e.g., at the center of the text) further demonstrates the possibility of learning in a weakly supervised manner, considerably lowering the labeling cost.

2. Methodology

Most existing text spotting algorithms treat the problem as two sub-tasks, i.e., text detection and recognition, albeit the entire network might be end-to-end optimized. Customized modules such as BezierAlign [22], RoISlide [8], and RoIMasking [18] are required to bridge the detection and recognition blocks, where backbone features are cropped and shared between the detection and recognition heads. Under such designs, the recognition and detection modules are highly coupled. For example, the features fed to the recognition head are usually cropped from the ground-truth bounding box at the training stage, since detection results are not good enough in the first iterations; thus, the recognition result is susceptible to interference from the detected bounding box during the test phase.

Recently, Pix2Seq [4] pioneered casting the generic object detection problem as a language modeling task, based on the intuitive assumption that if a deep model knows what and where the target is, it can be taught to tell the results via a desired sequence. Thanks to this concise pipeline, labels with different attributes such as location coordinates and object categories can be integrated into a single sequence, enabling an end-to-end trainable framework without task-specific modules (e.g., Region Proposal Networks and RoI pooling layers), which can thus be naturally adapted to the text spotting task. Inspired by this, we propose Single-Point Text Spotting (SPTS). SPTS tackles text detection and recognition as a sequence prediction task, solvable by a much more straightforward pipeline: the input images are translated into a sequence containing localization and recognition results, genuinely enabling text detection and recognition simultaneously.

Specifically, as shown in Fig. 3, each input image is first encoded by CNN and transformer encoders to extract visual and contextual features. Then, the captured features are decoded by a transformer decoder, where tokens are predicted in an auto-regressive manner. Unlike previous algorithms, we further simplify the bounding box to a single point located either at the upper-left corner of the first character or at the center of the text instance, as described in Fig. 7. Benefiting from such a simple yet effective representation, the modules carefully designed based on prior knowledge, such as the grouping strategies used in segmentation-based methods and the feature sampling blocks equipped in box-based text spotters, can be eschewed. Therefore, the recognition accuracy will not be limited by poor detection results, significantly improving model robustness.
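To make the pipeline concrete, below is a minimal sketch of such a CNN + transformer encoder-decoder in PyTorch. The class name SPTSLikeModel, the layer sizes, and the use of torchvision's ResNet-50 backbone are illustrative assumptions, not the authors' released architecture; only the overall structure (CNN features flattened into visual tokens, auto-regressive transformer decoding over a token vocabulary) follows the description above.

```python
import torch
import torch.nn as nn
import torchvision

class SPTSLikeModel(nn.Module):
    """Minimal CNN + transformer encoder-decoder sketch for sequence prediction."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=6, max_len=1024):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool/fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(
            d_model, nhead, num_layers, num_layers, batch_first=True
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # flatten the CNN feature map into a sequence of visual tokens
        feats = self.proj(self.cnn(images)).flatten(2).transpose(1, 2)   # (B, HW, C)
        positions = torch.arange(tgt_tokens.size(1), device=tgt_tokens.device)
        tgt = self.token_embed(tgt_tokens) + self.pos_embed(positions)
        # causal mask so each output position only attends to previous tokens
        mask = self.transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)
        ).to(images.device)
        out = self.transformer(feats, tgt, tgt_mask=mask)
        return self.head(out)                                            # (B, L, vocab)
```

At inference time the same decoder would be run step by step, feeding back each predicted token until the end-of-sequence token is produced, as described in Sec. 2.3.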
Figure 4. Pipeline of the sequence construction for text spotting.

2.1. Sequence Construction

The fact that a sequence can carry information with multiple attributes naturally enables the text spotting task, where text instances are simultaneously localized and recognized. To express the target text instances by a sequence, it is required to convert the continuous descriptions (e.g., bounding boxes) into a discretized space. To this end, as shown in Fig. 4, we follow Pix2Seq [4] to build the target sequence.

What distinguishes our method is that we further simplify the bounding box to a single point. Specifically, the continuous coordinates of the top-left corner of the first character of the text instance (top-left point for short) are uniformly discretized into an integer between [1, nbins], where nbins controls the degree of discretization. For example, an image with a long side of 800 pixels requires only nbins = 800 to achieve zero quantization error. A text instance can thereby be represented by a sequence of three tokens, i.e., [x, y, t], where t is the transcription text. Notably, the transcriptions are inherently discrete, i.e., each character represents a category, and can therefore be easily appended to the sequence. However, different from generic object detection, which has a relatively fixed vocabulary (each t represents an object category, such as pedestrian), t can be natural language text of any length in our task, resulting in a variable-length target sequence, which may further cause misalignment issues and consume more computational resources.

To eliminate such problems, we pad or truncate the texts to a fixed length ltr, where the <PAD> token is used to fill the vacancy for shorter text instances. In addition, like other language modeling methods, <SOS> and <EOS> tokens are inserted at the head and tail of the sequence, indicating the start and end of a sequence, respectively. Therefore, given an image that contains nti text instances, the constructed sequence will include (2 + ltr) × nti discrete tokens, where the text instances are randomly ordered, following previous works [4]. Supposing there are ncls categories of characters (e.g., 97 for English characters and symbols), the vocabulary size of the dictionary used to tokenize the sequence can be calculated as nbins + ncls + 3, where the extra three classes are for the <PAD>, <SOS>, and <EOS> tokens. Empirically, we set ltr and nbins to 25 and 1,000, respectively, in our experiments.
2.2. Model Training

Sequence Augmentation. As shown in Fig. 3, the SPTS decoder predicts the sequence in an auto-regressive manner, where the <EOS> token is employed to decide the termination of generation. However, in practice, the <EOS> token can easily cause the sequence prediction to terminate prematurely, leading to a low recall rate, as identified in previous works [4]. The reason for this issue may be annotation noise and difficult samples. Therefore, a sequence augmentation strategy is adopted to postpone the emergence of the <EOS> token. Specifically, as shown in Fig. 5, noise instances, which consist of randomly generated positions and transcriptions, are inserted into the input sequence. For the output sequence, each noise text instance corresponds to several 'N/A' tokens, one noise token, and one <EOS> token. During training, the noise token helps the model to distinguish the noise text instances, and the sampling of <EOS> is thereby delayed. Notably, the loss of the 'N/A' tokens is excluded from optimization.

Training Objective. Since SPTS is trained to predict tokens, it only requires maximizing a likelihood loss at training time, which can be written as:

$\text{maximize} \; \sum_{i=1}^{L} w_i \log P(\tilde{s}_i \mid I, s_{1:i-1}),$   (1)

where $I$ is the input image, $\tilde{s}$ is the target sequence, $s$ is the input sequence, $L$ is the length of the sequence, and $w_i$ is the weight of the likelihood of the i-th token, which is empirically set to 1.
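Eq. (1) amounts to a weighted token-level cross-entropy. The sketch below illustrates it in PyTorch under the assumption that the decoder outputs logits over the whole vocabulary and that 'N/A' tokens are given weight 0 so they are excluded from optimization; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def spts_loss(logits, target, weights):
    """
    logits:  (B, L, vocab_size) decoder outputs for each position.
    target:  (B, L) ground-truth token ids (the shifted target sequence).
    weights: (B, L) per-token weights; 1 for ordinary tokens, 0 for 'N/A' tokens.
    """
    vocab_size = logits.size(-1)
    # negative log-likelihood per token; no reduction so each token can be weighted
    nll = F.cross_entropy(
        logits.reshape(-1, vocab_size), target.reshape(-1), reduction="none"
    ).reshape(target.shape)
    # maximizing the weighted likelihood == minimizing the weighted NLL
    return (weights * nll).sum() / weights.sum().clamp(min=1)
```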
2.3. Inference

At the inference stage, SPTS auto-regressively predicts tokens until the end-of-sequence token <EOS> occurs. The predicted sequence is subsequently divided into multiple segments, each of which contains 2 + ltr tokens. The tokens can then be easily translated into point coordinates and transcriptions, yielding the text spotting results. In addition, the likelihood of all tokens in the corresponding segment is averaged and assigned as a confidence score to filter the raw outputs, which efficiently removes redundant and false-positive predictions.
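A sketch of this decoding step, reusing the NBINS, LTR, and CHARSET constants from the sequence-construction sketch above; the helper name decode_sequence and the 0.9 threshold (taken from Sec. 3.3) are illustrative.

```python
def decode_sequence(tokens, probs, img_w, img_h, score_thresh=0.9):
    """
    tokens: predicted token ids, without <SOS>, truncated at <EOS>.
    probs:  probabilities of the chosen ids (same length as tokens).
    Returns a list of ((x, y), transcription, score).
    """
    results = []
    step = 2 + LTR                                    # tokens per text instance
    for i in range(0, len(tokens) - step + 1, step):
        seg, seg_p = tokens[i:i + step], probs[i:i + step]
        x = seg[0] / (NBINS - 1) * img_w              # un-discretize the point
        y = seg[1] / (NBINS - 1) * img_h
        text = "".join(
            CHARSET[t - NBINS] for t in seg[2:]
            if NBINS <= t < NBINS + len(CHARSET)      # skip <PAD> and other specials
        )
        score = sum(seg_p) / len(seg_p)               # averaged token likelihood
        if score > score_thresh:                      # filter low-confidence instances
            results.append(((x, y), text, score))
    return results
```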

Figure 5. Illustration of the sequence augmentation. Noise text instances that consist of randomly generated coordinates and transcriptions are inserted to postpone the <EOS> token.

3. Experiments

We report experimental results on four scene text datasets, including the horizontal dataset ICDAR 2013 [15], the multi-oriented dataset ICDAR 2015 [13], and the arbitrarily shaped datasets Total-Text [5] and SCUT-CTW1500 [24].

Figure 6. Illustration of the point-based evaluation metric. Colored diamonds are predicted points; circles represent ground truth.

3.1. Datasets

Curved Synthetic Dataset 150k. It is well established that the performance of text spotters can be improved by pre-training on synthesized samples. Following previous work [22], we use the 150k synthetic images generated by the SynthText [9] toolbox, which contain around one-third curved text and two-thirds horizontal instances.

ICDAR 2013 [15] contains 229 training and 233 testing samples. The images are primarily captured in a controlled environment, where the text contents of interest are explicitly focused and horizontal.

ICDAR 2015 [13] consists of 1,000 training and 500 testing images that were incidentally captured, containing multi-oriented text instances presented against complicated backgrounds with strong variations in blur, distortion, etc.

Total-Text [5] includes 1,255 training and 300 testing images, where at least one curved sample is present in each image, annotated with polygonal bounding boxes at the word level.

SCUT-CTW1500 [24] is another widely used benchmark designed for spotting arbitrarily shaped scene text, involving 1,000 and 500 images for training and testing, respectively. The text instances are labeled by polygons at the text-line level.

3.2. Evaluation Protocol

The existing evaluation protocol for text spotting tasks consists of two steps. First, the intersection over union (IoU) scores between ground-truth (GT) and detected boxes are calculated; only if the IoU score is larger than a designated threshold (usually set to 0.5) are the boxes matched. Then, the recognized content inside each matched bounding box is compared with the GT transcription; only if the predicted text is the same as the GT does it contribute to the end-to-end accuracy. However, in the proposed method, each text instance is represented by a single point; thus, the IoU-based evaluation metric is not applicable for measuring the performance. Meanwhile, comparing the localization performance between bounding-box-based methods and the proposed point-based SPTS might be unfair, e.g., directly treating points inside a bounding box as true positives may overestimate the detection performance. To this end, we propose a new evaluation metric to ensure a relatively fair comparison with existing approaches, which mainly considers the end-to-end accuracy, as it reflects both detection and recognition performance (failed detections usually lead to incorrect recognition results). Specifically, as shown in Fig. 6, we modify the text instance matching rule by replacing the IoU metric with a distance metric, i.e., the predicted point that has the nearest distance to the top-left corner of the GT box is selected, and the recognition results are measured by the same full-matching rules used in existing benchmarks. Only the predicted point with the highest confidence is matched to the ground truth; others are then marked as false positives.
Method          Total-Text Box   Total-Text Point   CTW1500 Box   CTW1500 Point
ABCNetv1 [22]        67.2              66.1              53.5           53.2
ABCNetv2 [25]        71.7              70.7              57.6           57.3

Table 1. Comparison of the end-to-end recognition performance (%) evaluated by the proposed point-based metric and the box-based metric on Total-Text [5] and SCUT-CTW1500 [24]. All results are reproduced based on the official codes.

Figure 7. Indicated points (red color) at different positions: (a) top-left, (b) central, (c) random.

Position    Total-Text None   Total-Text Full   CTW1500 None   CTW1500 Full
Top-left         67.9              74.1              56.3            67.2
Central          67.6              74.6              56.2            66.8
Random           66.6              73.3              51.7            63.9

Table 2. Ablation study of the position of the point on Total-Text [5] and SCUT-CTW1500 [24]. Top-left, central, and random points are illustrated in Fig. 7. E2E: end-to-end evaluation metric.

Variant        Total-Text None   Total-Text Full   CTW1500 None   CTW1500 Full   Np
SPTS-Bezier         58.5              68.3              42.2            60.8       16
SPTS-Rect           67.7              69.9              48.7            61.2        4
SPTS-Point          67.9              74.1              56.3            67.2        2

Table 3. Comparison of different shapes of bounding box. Np is the number of parameters required to describe a text instance under each representation.

To explore whether the proposed evaluation protocol can genuinely represent the model accuracy, Table 1 compares the end-to-end recognition accuracy of ABCNetv1 [22] and ABCNetv2 [25] on Total-Text [5] and SCUT-CTW1500 [24] under two metrics, i.e., the commonly used bounding-box metric based on IoU and the proposed point-based metric. The results demonstrate that the point-based evaluation protocol can well reflect the performance, where the difference between the values evaluated by the box-based and point-based metrics is less than 1%. For example, the ABCNetv1 model achieves 53.5 and 53.2 under the two metrics, respectively. Therefore, we use the point-based metric to evaluate the proposed SPTS in the following experiments.
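A condensed sketch of this matching rule, assuming predictions are ((x, y), text, score) triples and each ground truth provides the top-left corner of its box plus the transcription; match_point_predictions is an illustrative helper and a greedy reading of the rule, not an official evaluation script.

```python
import math

def match_point_predictions(preds, gts):
    """
    preds: list of ((x, y), text, score).
    gts:   list of ((gx, gy), gt_text), where (gx, gy) is the top-left corner
           of the ground-truth box.
    Returns the number of end-to-end true positives; unmatched or
    lower-confidence predictions count as false positives.
    """
    preds = sorted(preds, key=lambda p: -p[2])        # highest confidence first
    matched_gt = set()
    true_positives = 0
    for (x, y), text, _ in preds:
        # nearest still-unmatched ground truth, measured to its top-left corner
        candidates = [
            (math.hypot(x - gx, y - gy), j)
            for j, ((gx, gy), _) in enumerate(gts)
            if j not in matched_gt
        ]
        if not candidates:
            break
        _, j = min(candidates)
        matched_gt.add(j)                             # one prediction per ground truth
        if text == gts[j][1]:                         # full transcription match
            true_positives += 1
    return true_positives
```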
3.3. Implementation Details

The model is first pre-trained on a combined dataset including the Curved Synthetic Dataset 150k [22], MLT-2017 [30], ICDAR 2013 [15], ICDAR 2015 [13], and Total-Text [5] for 150 epochs, optimized by AdamW [27] with an initial learning rate of 5 × 10−4 that is linearly decayed to 1 × 10−5. After pre-training, the model is fine-tuned on the training split of each target dataset for another 200 epochs with a fixed learning rate of 1 × 10−5. The entire model is trained distributedly on 32 NVIDIA V100 GPUs with a batch size of 64. In addition, we use ResNet-50 as the backbone network, while both the transformer encoder and decoder consist of 6 layers with eight heads. During training, the short side of the input image is randomly resized to a range from 640 to 896 (in intervals of 32). Random cropping and rotation are employed for data augmentation. At the inference stage, we resize the short edge to 1,000 while keeping the longer side shorter than 1,824 pixels, following previous works [22, 25]. Moreover, as noise tokens are introduced for sequence augmentation during training, we only preserve predicted text instances with a confidence score larger than 0.9.
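A minimal sketch of the stated optimization schedule in PyTorch, assuming a LinearLR-style decay over the 150 pre-training epochs; this illustrates the reported hyper-parameters and is not the authors' training script.

```python
import torch

def build_optimizer(model, epochs=150):
    # AdamW with the stated initial learning rate of 5e-4
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
    # linear decay from 5e-4 down to 1e-5 over the pre-training epochs
    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=1.0, end_factor=1e-5 / 5e-4, total_iters=epochs
    )
    return optimizer, scheduler
```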
3.4. Ablation Study

Ablation study of the position of the indicated point. In this paper, we propose to simplify the bounding box to a single point. Intuitively, any point in the region enclosed by the bounding box should be able to represent the target text instance. To explore the differences, we conduct ablation studies that use three different strategies to obtain the indicated points (see Fig. 7), i.e., the top-left corner, the central point (obtained by averaging the upper and lower midpoints), and a random point inside the box. It should be noted that we use the corresponding ground truth here to evaluate the performance, i.e., the top-left ground truth is used for evaluating top-left, the central ground truth for central, and the closest distance to the polygon for random.

The results are shown in Table 2, where the results of top-left and central are very close on both datasets. This suggests that the performance is not very sensitive to the position of the point annotation. The results of random are inferior, especially for the long text-line-based dataset SCUT-CTW1500. We conjecture this is because a random position in very long text instances exacerbates the difficulty of model convergence.
Comparison between various representations. The proposed SPTS can be easily extended to produce bounding boxes by modifying the point coordinates to bounding box locations during sequence construction. Here, we conduct ablations to explore the influence of using different representations of the text instances. Specifically, three variants are explored, including the Bezier-curve bounding box (SPTS-Bezier), the rectangular bounding box (SPTS-Rect), and the indicated point (SPTS-Point). Since we only focus on end-to-end performance here, to minimize the impact of detection results, each method uses its corresponding representation to match the GT box in the evaluation. That is to say, the single-point model (the original SPTS) uses the evaluation metric introduced in Sec. 3.2, i.e., distance between points; the predictions of SPTS-Rect are matched to the circumscribed rectangle of the polygonal annotations; and SPTS-Bezier adopts the original metric that matches polygon boxes. As shown in Table 3, SPTS-Point achieves the best performance on both the Total-Text and CTW1500 datasets, outperforming the other two representations by a large margin. These experimental results suggest that a low-cost annotation, i.e., the indicated point, is capable of providing supervision for the text spotting task. The possible reason for the lower performance of SPTS-Bezier and SPTS-Rect is that longer sequences (SPTS-Bezier Np = 16 vs. SPTS-Point Np = 2) require more training iterations to converge; thus, SPTS-Bezier cannot achieve comparable accuracy under the same training schedule.

Method                      S      W      G
Bounding box-based methods
Jaderberg et al. [12]      86.4    -      -
Textboxes [19]             91.6   89.7   83.9
Deep Text Spotter [3]      89.0   86.0   77.0
Li et al. [16]             91.1   89.8   84.6
Mask TextSpotter [28]      92.2   91.1   86.5
Point-based method
SPTS (Ours)                87.6   85.6   82.9

Table 4. End-to-end recognition results on ICDAR 2013. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicon, respectively.

Method                      S      W      G
Bounding box-based methods
FOTS [21]                  81.1   75.9   60.8
Mask TextSpotter [17]      83.0   77.7   73.5
CharNet [20]               83.1   79.2   69.1
TextDragon [8]             82.5   78.3   65.2
Mask TextSpotter v3 [18]   83.3   78.1   74.2
MANGO [31]                 81.8   78.9   67.3
ABCNet v2 [25]             82.7   78.5   73.0
PAN++ [39]                 82.7   78.2   69.2
Point-based method
SPTS (Ours)                64.6   58.8   54.9

Table 5. End-to-end recognition results on ICDAR 2015. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicon, respectively.

Method                     None   Full
Bounding box-based methods
CharNet [20]               66.6    -
ABCNet [22]                64.2   75.7
PGNet [38]                 63.1    -
Mask TextSpotter [28]      65.3   77.4
Qin et al. [32]            67.8    -
Mask TextSpotter v3 [18]   71.2   78.4
MANGO [31]                 72.9   83.6
PAN++ [39]                 68.6   78.6
ABCNet v2 [25]             70.4   78.1
Point-based method
SPTS (Ours)                67.9   74.1

Table 6. End-to-end recognition results on Total-Text. "None" represents lexicon-free. "Full" represents using all the words that appear in the test set.

Method                     None   Full
Bounding box-based methods
TextDragon [8]             39.7   72.4
ABCNet [22]                45.2   74.1
ABCNet* [22]               53.2   76.0
MANGO [31]                 58.9   78.7
ABCNet v2 [25]             57.5   77.2
Point-based method
SPTS (Ours)                56.3   67.2

Table 7. End-to-end recognition results on SCUT-CTW1500. "None" represents lexicon-free. "Full" represents using all the words that appear in the test set. ABCNet* means using the GitHub checkpoint.

3.5. Experiment Results on Scene Text Benchmarks

Horizontal-text dataset. Table 4 compares the proposed SPTS with state-of-the-art methods on the widely used ICDAR 2013 [15] benchmark. Our method achieves competitive results compared to previous methods with all three lexicons. It should be noted that the proposed SPTS only utilizes a single point for training, while the other approaches are fully trained with more costly bounding boxes.

Multi-oriented dataset. The quantitative results on the ICDAR 2015 [13] dataset are shown in Table 5. A performance gap between the proposed SPTS and state-of-the-art methods can still be found, which shows some limitations of our method for tiny texts and of our evaluation metric.

Arbitrarily shaped dataset. We further compare our method with state-of-the-art approaches on the benchmarks containing arbitrarily shaped texts, including Total-Text [5] and SCUT-CTW1500 [24]. As shown in Table 6, SPTS achieves competitive performance compared to state-of-the-art methods by only using an extremely low-cost point annotation.
Figure 8. Qualitative results on the scene text benchmarks. Images are selected from SCUT-CTW1500 [24] (first col.), Total-Text [5] (second col.), ICDAR 2013 [15] (third col.), and ICDAR 2015 [13] (fourth col.). Best viewed on screen.

Additionally, Table 7 shows that our method outperforms TextDragon [8] and ABCNet [22] by a large margin on the challenging SCUT-CTW1500 dataset, which further demonstrates the potential of our method.

In summary, the proposed SPTS can achieve comparable performance to state-of-the-art text spotters on several widely used benchmarks. Especially on the two curved datasets, i.e., Total-Text [5] and SCUT-CTW1500 [23], the proposed SPTS even outperforms some recently proposed methods by a large margin. The reason why our method can achieve better accuracy on arbitrary-shaped text might be explained as follows: (1) The proposed SPTS discards the task-specific modules designed based on prior knowledge, so the recognition accuracy is decoupled from the detection results, i.e., SPTS can achieve acceptable recognition results even if the detected position is shifted. In contrast, other methods suffer from poor E2E accuracy mainly because their recognition heads heavily rely on the detection results; once a text instance cannot be perfectly localized, the recognition head fails to work. (2) Although previous models are trained end-to-end, the interactions between their detection and recognition branches are limited. Specifically, the features fed to the recognition module are sampled based on the ground-truth position during training but from detection results at the inference stage, leading to feature misalignment, which is far more severe on curved instances. By tackling the spotting task in a sequence modeling manner, the proposed SPTS eliminates such issues, thus showing more robustness on the arbitrary-shaped datasets.

3.6. Extension: Single-Point Object Detection

Figure 9. Qualitative results on the Pascal VOC 2012 validation set under single-point supervision for the generic object detection task.

To demonstrate the generality of SPTS, we conduct experiments on the Pascal VOC [7] object detection task, where the model was trained with central points and the corresponding categories. All other settings are identical to those for text spotting. Some qualitative results on the validation set are shown in Fig. 9. The results suggest that a single point might be viable for providing extremely low-cost annotation for any object. More results can be found in the Appendix. Quantitative results on other datasets such as Microsoft COCO will be reported in an accompanying article.

4. Discussion

The experiments suggest that detection and recognition may have been decoupled. Based on the results, we further show that SPTS can converge even without the supervision of the single point. This indicates that SPTS may have learned the ability to implicitly find the locations of the text based only on the transcriptions, which may be explainable through the attention maps.

One limitation of the proposed framework is that the training procedure requires a large amount of computing resources. For example, 150 epochs of pre-training on 160k scene text images plus fine-tuning require approximately four days of distributed training on 32 NVIDIA V100 GPU cards. Additionally, the inference speed (FPS) of our method is only 0.2 when testing on the ICDAR 2013 dataset with a maximum size of 1,600 for the input images. This is mainly due to the serial output of the very long sequence that includes all the prediction results. A more efficient framework would be valuable to develop in the future.
5. Conclusion

We propose SPTS, which is, to the best of our knowledge, a pioneering method that tackles scene text spotting using only extremely low-cost single-point annotation. SPTS is a concise auto-regressive transformer-based framework, which generates the results as sequential tokens. Extensive experiments, as well as ablation studies on various benchmarks, have shown the promising results of our method. It is also worth mentioning that our method may be generalized to generic object detection tasks. In addition, given the general value of the experimental results, we believe SPTS opens up great possibilities for multi-task learning, which is definitely worth exploring in the future.

References

[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 9365-9374, 2019.
[2] Christian Bartz, Haojin Yang, and Christoph Meinel. SEE: Towards semi-supervised end-to-end scene text recognition. In Proc. AAAI Conf. Artificial Intell., 2018.
[3] Michal Busta, Lukas Neumann, and Jiri Matas. Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework. In Proc. IEEE Int. Conf. Comp. Vis., pages 2204-2212, 2017.
[4] Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2Seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
[5] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proc. Int. Conf. Doc. Anal. and Recognit., volume 1, pages 935-942. IEEE, 2017.
[6] Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. ICDAR 2019 robust reading challenge on arbitrary-shaped text (RRC-ArT). In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1571-1576, 2019.
[7] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis., 88(2):303-338, 2010.
[8] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. TextDragon: An end-to-end framework for arbitrary shaped text spotting. In Proc. IEEE Int. Conf. Comp. Vis., pages 9076-9085, 2019.
[9] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2315-2324, 2016.
[10] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5020-5029, 2018.
[11] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. WordSup: Exploiting word annotations for character based text detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 4940-4949, 2017.
[12] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis., 116(1):1-20, 2016.
[13] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1156-1160. IEEE, 2015.
[14] Dimosthenis Karatzas, S. Robles Mestre, Joan Mas, Farshad Nourbakhsh, and P. Pratim Roy. ICDAR 2011 robust reading competition - challenge 1: Reading text in born-digital images (web and email). In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1485-1490, 2011.
[15] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. ICDAR 2013 robust reading competition. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1484-1493. IEEE, 2013.
[16] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 5238-5246, 2017.
[17] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[18] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting. In Proc. Eur. Conf. Comp. Vis., 2020.
[19] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artificial Intell., 2017.
[20] Xing Linjie, Tian Zhi, Huang Weilin, and R. Scott Matthew. Convolutional character networks. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
[21] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5676-5685, 2018.
[22] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. ABCNet: Real-time scene text spotting with adaptive Bezier-curve network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[23] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1962-1969, 2017.
[24] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recogn., 90:337-345, 2019.
[25] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting. IEEE Trans. Pattern Anal. Mach. Intell., pages 1-1, 2021.
[26] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proc. Eur. Conf. Comp. Vis., pages 20-36, 2018.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proc. Int. Conf. Learn. Representations, 2018.
[28] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proc. Eur. Conf. Comp. Vis., pages 67-83, 2018.
[29] Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, et al. ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition (RRC-MLT-2019). arXiv preprint arXiv:1907.00945, 2019.
[30] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT). In Proc. Int. Conf. Doc. Anal. and Recognit., volume 1, pages 1454-1459. IEEE, 2017.
[31] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. MANGO: A mask attention guided one-stage scene text spotter. In Proc. AAAI Conf. Artificial Intell., pages 2467-2476, 2021.
[32] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In Proc. IEEE Int. Conf. Comp. Vis., pages 4704-4714, 2019.
[33] Yipeng Sun, Zihan Ni, Chee-Kheng Chng, Yuliang Liu, Canjie Luo, Chun Chet Ng, Junyu Han, Errui Ding, Jingtuo Liu, Dimosthenis Karatzas, et al. ICDAR 2019 competition on large-scale street view text with partial labeling (RRC-LSVT). In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1557-1562, 2019.
[34] Shangxuan Tian, Shijian Lu, and Chongshou Li. WeText: Scene text detection under weak supervision. In Proc. IEEE Int. Conf. Comp. Vis., pages 1492-1500, 2017.
[35] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In Proc. Eur. Conf. Comp. Vis., pages 56-72. Springer, 2016.
[36] Fangfang Wang, Yifeng Chen, Fei Wu, and Xi Li. TextRay: Contour-based geometric modeling for arbitrary-shaped scene text detection. In Proc. ACM Int. Conf. Multimedia, pages 111-119, 2020.
[37] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and Wenyu Liu. All you need is boundary: Toward arbitrary-shaped text spotting. In Proc. AAAI Conf. Artificial Intell., pages 12160-12167, 2020.
[38] Pengfei Wang, Chengquan Zhang, Fei Qi, Shanshan Liu, Xiaoqiang Zhang, Pengyuan Lyu, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. PGNet: Real-time arbitrarily-shaped text spotting with point gathering network. arXiv preprint arXiv:2104.05458, 2021.
[39] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Yang Zhibo, Tong Lu, and Chunhua Shen. PAN++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[40] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting texts of arbitrary orientations in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1083-1090. IEEE, 2012.
[41] Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. In Proc. Int. Conf. Doc. Anal. and Recognit., pages 1577-1581, 2019.
[42] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5551-5560, 2017.
[43] Yiqin Zhu, Jianyong Chen, Lingyu Liang, Zhanghui Kuang, Lianwen Jin, and Wayne Zhang. Fourier contour embedding for arbitrary-shaped text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3123-3131, 2021.
Appendix

A. SPTS without Point Supervision

As discussed in Sec. 4, we further explore whether SPTS can learn to read text even without location information. Specifically, the point coordinates are abandoned during sequence construction in this experiment, forcing the model to predict the transcriptions directly.

Fig. 10 presents qualitative results of the SPTS-noPoint model on several scene text benchmarks, where the text can be recognized without positional supervision, suggesting that the model can learn to localize the textual contents implicitly with transcription labels only. Such results may encourage further simplification of current text spotters, albeit some downstream applications requiring the precise location of each text instance may be limited.

Figure 10. Qualitative results of the SPTS-noPoint model on SCUT-CTW1500 (1st row), Total-Text (2nd row), ICDAR 2013 (3rd row), and ICDAR 2015 (4th row). (Zoom in for a better view.)

B. SPTS for Chinese Text in the Wild

Chinese scene text spotting remains another big challenge due to the much larger vocabulary of Chinese characters. In this section, we provide the experimental results of SPTS for spotting Chinese text in the wild.

B.1. Datasets

• Chinese Bezier Curve Synthetic Dataset [25] contains 130k synthetic images with Chinese text.
• ReCTS [41] contains 25k signboard images, which are divided into 20k images for training and 5k images for testing.
• ArT [6] is a large-scale arbitrarily-shaped scene text dataset. The training set contains 5,603 images, and the testing set contains 4,563 images.
• LSVT [33] provides 450k images in total from street view. There are 50k fully annotated images, among which 30k are for training and the remaining 20k are for testing.

B.2. Implementation Details

We first pre-train SPTS for 100 epochs using the Chinese Bezier Curve Synthetic Dataset and the training sets of ReCTS, ArT, and LSVT. Then the model is fine-tuned on the training split of ReCTS for 300 epochs. The number of character categories is set to 5,462.

B.3. Qualitative Results

Because the ReCTS dataset does not release the annotations of its testing set, we cannot evaluate the performance of our method using the proposed point-based metric. Therefore, we illustrate qualitative results in Fig. 11. It can be seen that the Chinese text is accurately located and recognized, even for some vertical texts.

Figure 11. Qualitative results on the testing set of ReCTS. (Zoom in for a better view.)

C. Single-Point Object Detection

The proposed SPTS can be extended to the single-point object detection task. In this experiment, we use the PASCAL VOC 2007 trainval set (5,011 images) and the PASCAL VOC 2012 trainval set (11,540 images) to train our network. The central point of the object is selected as the indicated point. It takes about 150 epochs for our network to reach convergence. The qualitative results on the PASCAL VOC 2007 test set are shown in Fig. 12.

Figure 12. Qualitative results on the test set of PASCAL VOC 2007.