
An Investigation into the Effect of Control Tokens on Text Simplification

Zihao Li, Matthew Shardlow, Saeed-Ul Hassan

Manchester Metropolitan University
21443696@stu.mmu.ac.uk, {m.shardlow,s.ul-hassan}@mmu.ac.uk

Abstract

Recent work on text simplification has focused on the use of control tokens to further the state of the art. However, it is not easy to improve further without an in-depth comprehension of the mechanisms underlying control tokens. One previously unexplored factor is the tokenization strategy, which we also examine. In this paper, we (1) reimplemented ACCESS, (2) explored the effects of varying control tokens, (3) tested the influence of different tokenization strategies, and (4) demonstrated how separate control tokens affect performance. We show the variation of performance across the four control tokens separately. We also uncover how the design of control tokens can influence performance and propose some suggestions for designing control tokens, which also extend to other controllable text generation tasks.

1 Introduction

Text simplification (TS) refers to reducing linguistic complexity at both the syntactic and lexical levels without losing the main content (Alva-Manchego et al., 2020b). It is commonly used to increase the readability of documents intended for children (De Belder and Moens, 2010), non-native speakers (Petersen and Ostendorf, 2007) and people with dyslexia. The requirements for simplified outcomes may vary among audiences (Xu et al., 2015), for instance, depending on the characteristics of the dataset. The task can be roughly divided into sentence-level simplification (Nishihara et al., 2019; Martin et al., 2020a) and paragraph-level simplification (Sun et al., 2020; Devaraj et al., 2021). The two types of task have different focuses, and this paper only involves sentence-level simplification.

In order to fit the requirements of different user groups, some projects introduced explicit discrete prompts as control tokens to assist the model in learning from datasets and adjusting the simplifications (Martin et al., 2020a; Agrawal et al., 2021).

[Figure 1 appears here: an example of a control-token-prefixed input and the corresponding simplified output.]

Figure 1: Example of input and output

By adjusting the value in different control tokens, researchers can manually adjust the characteristics of the output, such as its length and its syntactic and lexical difficulty. The control tokens are added to the beginning of the complex sentences and represent a relationship between that sentence and the desired output (such as the desired compression ratio). In addition, the numerical value changes with the demands on the outcome. The format of a control token is <Token_value>, where Token is a novel extra-vocabulary token with a human-interpretable meaning, and value is a numerical value indicating some relationship between the given input and output, as shown in Figure 1 and Appendix A. The design of the control tokens is based on the need for adjustment. Multiple control tokens can be applied simultaneously, and four control tokens are used in this project.
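As a minimal illustration of this format (the helper below is ours, not part of the MUSS codebase; the token names follow the convention used in Table 1, and the values are only illustrative):

    # Hypothetical sketch: prepend one <NAME_value> control token per
    # controlled attribute to a complex input sentence.
    def prepend_control_tokens(sentence: str, ratios: dict) -> str:
        prefix = " ".join(f"<{name}_{value}>" for name, value in ratios.items())
        return f"{prefix} {sentence}"

    print(prepend_control_tokens(
        "Reflection nebulae are usually blue because the scattering is more "
        "efficient for blue light than red.",
        {"LENGTHRATIO": 0.6, "WORDRANKRATIO": 0.75},  # illustrative values
    ))
    # -> "<LENGTHRATIO_0.6> <WORDRANKRATIO_0.75> Reflection nebulae are ..."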
Although the control tokens are manually crafted, how they change the outcome remains unstudied. To explore the mechanisms of control tokens in simplification, this paper proposes the following: (1) verify the importance of control tokens in Section 4.2; (2) reimplement ACCESS (Martin et al., 2020a), which is used in the current state-of-the-art (SOTA), in Section 3.3; (3) explore the influence of varying the control-token values in Section 4.1; and finally (4) investigate the effects of the tokenization method in Section 4.2.
2 Literature Review

Natural language generation (NLG) is a sub-task of natural language processing. In the last century, there were attempts to build NLG systems based on hand-crafted rules and to define the problem and its features based on knowledge (Hovy, 1990; Reiter and Dale, 1997). With the development of computation power and the introduction of neural networks, more neural-network-based statistical methods were applied (Wen et al., 2015; Dušek and Jurčíček, 2016; Lebret et al., 2016; Mei et al., 2016). One important change came with the publication of the transformer architecture (Vaswani et al., 2017), which inspired the "pre-train and fine-tune" paradigm. Later, because the new architecture outperformed existing ones in both performance and computational cost, the transformer architecture and its derivatives came to occupy a dominant position in the NLG domain (Yang et al., 2019; Floridi and Chiriatti, 2020; Lewis et al., 2020). As a sub-task of NLG, text simplification can also be regarded as monolingual machine translation (Wubben et al., 2012). With the development of sequence-to-sequence machine translation, text simplification also drew more attention (Guo et al., 2018; Surya et al., 2019; Omelianchuk et al., 2021).

In recent years, researchers have tried to introduce explicit parameters to control the simplified output (Nishihara et al., 2019; Martin et al., 2020a; Agrawal et al., 2021). Martin et al. (2020a) introduced four hyper-parameters in AudienCe-CEntric Sentence Simplification (ACCESS): the number of characters, Levenshtein similarity (Levenshtein et al., 1966), word rank and dependency tree depth, which are used to control the length, similarity, lexical complexity and syntactic complexity respectively. With the help of these parameters, users can modify the generated simplification based on their needs. However, the parameters may be less straightforward for lay users, and Agrawal et al. (2021) replaced the detailed parameters with simplification grades. In addition, a minor change in these parameters may significantly affect the readability and fluency of the output. Although the value set that maximises the benchmark scores can be given, it may be of little help to end-users with specific requirements. Further exploration of the effects and of suitable parameter preferences is needed to guide and help lay users adjust these parameters based on their needs.

Another novel line of research on the training datasets is multilingual unsupervised sentence simplification (MUSS) (Martin et al., 2020b). The authors fine-tuned BART (Lewis et al., 2020) on their mined paraphrase datasets instead of complex-simple parallel corpora and found that, with the help of ACCESS, the unsupervised model outperformed the other unsupervised text simplification models and became the latest SOTA. As an extension of ACCESS, the authors improved the design of the control tokens and changed the tokenization strategy. They showed that the performance differences between the two types of datasets may be acceptable provided the mined paraphrase dataset is good enough. Training on paraphrase datasets provides more options than training solely on supervised datasets, and there is a nearly unlimited amount of unlabelled data. They also found that the combination of unsupervised and supervised training performs best, which closely resembles the pre-train and fine-tune paradigm. Although multilingual tests were conducted in MUSS, they were delivered separately and interfere little with the aim of this project; there is thus little need to focus on their research in French and Spanish.

The metrics also play a vital role in evaluating the performance of models. Although current metrics can hardly compete with human evaluation, they can still partially reflect performance on certain indexes. Among the popular metrics, there are reference-based metrics like Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004), and non-reference-based metrics like the Flesch-Kincaid Grade Level (FKGL) (Flesch, 1948). Currently, the most popular metric for text simplification is the system output against references and against the input sentence (SARI) (Xu et al., 2016). SARI is designed especially for text simplification tasks and evaluates the outputs in terms of adding, keeping and deleting. Although it is found to deviate somewhat from human judgement, SARI is still a valuable metric for evaluating simplicity (Alva-Manchego et al., 2021). As for the non-reference-based metrics, BERTScore is a BERT-based metric that evaluates the similarity between input and output by calculating the correlation in the embedding space (Zhang et al., 2019). It is found to have a high correlation with human judgement (Scialom et al., 2021). By combining the metrics, performance can be evaluated more comprehensively.
Strategy    Tokenization of the raw input '<DEPENDENCYTREEDEPTHRATIO_0.6>'

Default     IDs:    [0, 41552, 41372, 9309, 23451, ..., 2571, 6454, 1215, 288, 4, ...]
            Tokens: ['<s>', '<', 'DEP', 'END', 'ENCY', ..., '_', '0', '.', '6', '>', ...]
Joint       IDs:    [0, 50265, ...]
            Tokens: ['<s>', '<DEPENDENCYTREEDEPTHRATIO_0.6>', ...]
Separate    IDs:    [0, 50265, 50266, 15698, ...]
            Tokens: ['<s>', '<DEPENDENCYTREEDEPTHRATIO_', '0.6', '>', ...]

Table 1: Tokenization under differing strategies for the input starting with '<DEPENDENCYTREEDEPTHRATIO_0.6>'.

3 Experiments

3.1 Quantisation differences

As mentioned in the literature review, there are 4 types of control tokens: <DEPENDENCYTREEDEPTH_x> (DTD), <WORDRANK_x> (WR), <REPLACEONLYLEVENSHTEIN_x> (LV) and <LENGTHRATIO_x> (LR). In the preprocessing step, they are calculated and added to the beginning of the sentences in the complex dataset. As an augmentation to the control tokens, the calculated values are rounded to the nearest 0.05. However, in the original optimisation process, the values produced by the algorithm provided by the Nevergrad (Rapin and Teytaud, 2018) API have high precision and verbose digits, like the first line in Table 2. During the reimplementation, we found that only the first one or two digits are recognised as input values; the remaining digits do not provide any meaningful instruction. On the contrary, they bring unnecessary information into the system and even lower the performance of the model. Thus we replaced the continuous values with discrete ones like 0.2, 0.25, 0.3, ..., 1.0 and changed to the corresponding discrete algorithm in Nevergrad (Rapin and Teytaud, 2018).
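A minimal sketch of this quantisation (the helper name is ours): snap a raw ratio to the 0.05 grid and clamp it to the allowed range.

    # Snap a high-precision ratio to the 0.05 grid used in this project,
    # clamped here to the 0.2-1.5 range discussed in Section 3.4.
    def quantise_ratio(raw: float, step: float = 0.05,
                       lo: float = 0.2, hi: float = 1.5) -> float:
        snapped = round(raw / step) * step
        return round(min(max(snapped, lo), hi), 2)

    print(quantise_ratio(0.6382170851286834))  # 0.65
    print(quantise_ratio(1.7312))              # 1.5 (clamped)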
3.2 Tokenization Strategies

One of the aims of this project is to explore the effects of tokenization strategies. As shown in Table 1, the default tokenization method in the MUSS project treats the control tokens as plain text. In comparison, we added 2 more tokenization strategies: one is to treat the whole control token as a single token in the tokenizer; the other is to break the control token into a combination of type and value and add the two parts separately to the tokenizer. These 2 strategies are achieved by manually adding all possible control tokens to the dictionary of the tokenizer. This affects not only the evaluation and optimisation process but also the training process, so each tokenization strategy requires an independently fine-tuned model.
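The sketch below shows how the two added strategies can be set up with a Huggingface tokenizer; it is a sketch assuming the BART-base checkpoint, showing only one control-token type and the 0.05 value grid.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    values = [round(0.2 + 0.05 * i, 2) for i in range(27)]  # 0.2, 0.25, ..., 1.5

    # Joint strategy: each whole control token is one new vocabulary entry.
    joint = [f"<DEPENDENCYTREEDEPTHRATIO_{v}>" for v in values]

    # Separate strategy: the token type and the value become separate entries.
    separate = ["<DEPENDENCYTREEDEPTHRATIO_"] + [str(v) for v in values]

    tokenizer.add_tokens(joint)  # pick one strategy per fine-tuned model
    print(tokenizer.tokenize("<DEPENDENCYTREEDEPTHRATIO_0.6> The complex input."))
    # The control token is now kept intact instead of being split into pieces.
    # Remember to call model.resize_token_embeddings(len(tokenizer)) afterwards.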
3.3 Reimplementation of ACCESS

One of the goals of this project is to reimplement and verify the effect of control tokens in the current SOTA. However, since the main focus of this project is on the control tokens, instead of training on both supervised and unsupervised datasets, it is more accurate to claim a reimplementation of ACCESS rather than of MUSS. In order to build a unified baseline, this project also applied the BART model (Lewis et al., 2020), which is adopted in the MUSS project. The original project can be divided into the following sections: data mining, preprocessing, training, evaluation and optimisation.

Since the goal is verification, there is no need to rewrite the code for all sections. Thus only the code related to training and some other peripheral functions has been altered to achieve similar results. The other functions, such as preprocessing and optimisation, still keep most of the original code. The original core API used for training is fairseq; this project replaced it with another open-source API, Huggingface. Huggingface provides a collection of the most popular pre-trained models and datasets, including BART (Lewis et al., 2020), and a unified, advanced and user-friendly API for the most common applications, which makes future upgrading and modification easier. The hyper-parameters of the models in the reimplementation, including the learning rate and weight decay, are set to be identical to the original project so that the influence of irrelevant factors is lowered. The last difference between the reimplementation and the original project is the tokeniser: the reimplementation uses the BART-base byte-pair encoding (BPE) tokeniser instead of the GPT-2 BPE tokeniser (Radford et al., 2019). Both tokenisers serve the same purpose and perform very similarly to each other; the new one consumes fewer computing resources, which presumably has only a small effect on the results.

Due to the variation of control tokens, the optimisation algorithm has also changed. The original algorithm is the OnePlusOne optimiser provided by Nevergrad (Rapin and Teytaud, 2018); the current one is PortfolioDiscreteOnePlusOne, which fits the discrete values better. As for the metrics, the SARI score is kept as the primary evaluation method (Xu et al., 2016), and the BERT score is introduced as a co-reference.

However, due to the limitation of computation resources and the mass fine-tuning demands of models with different tokenization strategies, this project also downgraded the training scale and limited the epochs in both the baseline and the reimplementation. The changes applied to both are as follows:

• All results are from models trained on BART-base instead of BART-large.

• All training processes are set to 10 epochs only.

• All models are trained on Wikilarge (Zhang and Lapata, 2017) only.

As explained earlier, each tokenization strategy corresponds to one model, and a total of 16 models need to be fine-tuned. This is why only BART-base is applied and the training epochs are limited. As for the reason for choosing 10 as the target epoch number: the training loss for models with combined control tokens reached 0.85 and decreased very slowly between epochs, while the validation loss started increasing; continued training would risk over-fitting. The results of the baseline shown in the next section also partially confirm that the training is probably long enough.
3.4 Training process

General NLP tasks can be divided into three steps: data preprocessing, training and evaluation. In this project, there is one more step: optimisation. The preprocessing step follows the MUSS project (Martin et al., 2020b). The authors defined four types of prompts used as control tokens to manipulate the features of the outputs, each designed to represent one characteristic of the sentence: <DEPENDENCYTREEDEPTH_x> represents the syntactic complexity; <WORDRANK_x> represents the lexical complexity; <REPLACEONLYLEVENSHTEIN_x> represents the inverse similarity of input and output at the letter level; and <LENGTHRATIO_x> represents the length ratio of input and output. The value of each control token is calculated from the reference complex-simple pairs in the training dataset, which is Wikilarge in this project (Zhang and Lapata, 2017). After the calculation, these control tokens are added to the beginning of the complex sentences, and the model is trained on this preprocessed dataset. In addition to the combined control tokens, this project also explored the effects of a single control token; in that case, only the corresponding control token is kept in the dataset.
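A sketch of this preprocessing for one pair (the example pair is adapted from Table 6; only the length ratio is computed, since the other three features require a dependency parser and a word-frequency table, and the helper names are ours):

    # Compute a reference ratio for one complex-simple pair, snap it to the
    # 0.05 grid (Section 3.1), and prepend the resulting control token.
    def quantise(x: float, step: float = 0.05) -> float:
        return round(round(x / step) * step, 2)

    def preprocess_pair(complex_s: str, simple_s: str) -> str:
        ratio = quantise(len(simple_s) / len(complex_s))
        return f"<LENGTHRATIO_{ratio}> {complex_s}"

    src = ("Moderate to severe damage extended up the Atlantic coastline "
           "and as far inland as West Virginia.")
    tgt = "Moderate to severe damage happened along the Atlantic coast."
    print(preprocess_pair(src, tgt))  # the prefix now encodes the reference ratio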
The next step is training. It follows the majority of fine-tuning processes for pretrained language models: by feeding the preprocessed complex-simple sentence pairs to the model, the model is expected to learn how to simplify texts and the meaning of each control token. As explained in the tokenization strategy, each tokenization method demands a separate model. To compare the performance of the different tokenization methods, 15 models besides the baseline are fine-tuned in the experiment: 3 models with full control tokens and 12 models with only one control token. The models with one control token are used to verify the importance of combined control tokens and to provide supporting evidence for the assumption.
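A condensed sketch of that fine-tuning loop with Huggingface (toy data; in practice the padded label positions should be masked out of the loss, and the real hyper-parameters mirror the original project):

    import torch
    from transformers import (BartForConditionalGeneration, BartTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    model.resize_token_embeddings(len(tokenizer))  # after adding control tokens

    class PairDataset(torch.utils.data.Dataset):
        """Wraps preprocessed complex-simple pairs for the Trainer."""
        def __init__(self, sources, targets):
            self.enc = tokenizer(sources, truncation=True,
                                 padding="max_length", max_length=128)
            self.labels = tokenizer(targets, truncation=True,
                                    padding="max_length",
                                    max_length=128)["input_ids"]
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train_set = PairDataset(
        ["<LENGTHRATIO_0.65> Moderate to severe damage extended up the "
         "Atlantic coastline and as far inland as West Virginia."],
        ["Moderate to severe damage happened along the Atlantic coast."])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=10,
                               per_device_train_batch_size=8),
        train_dataset=train_set)
    trainer.train()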
types of prompts used as control tokens to ma- be evaluated on the ASSET (Alva-Manchego et al.,
nipulate the features of the outputs. Each control 2020a) test dataset, which contains 359 complex-
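A sketch of this evaluation step, assuming EASSE's documented corpus_sari entry point and the bert-score package (the sentences are a toy example, and the coefficient array is reduced to its two entries):

    from easse.sari import corpus_sari
    from bert_score import score as bertscore

    orig = ["About 95 species are currently accepted."]
    sys_out = ["About 95 species are accepted."]
    refs = [["About 95 species are currently known."],   # reference set 1
            ["About 95 species are accepted."]]          # reference set 2

    sari = corpus_sari(orig_sents=orig, sys_sents=sys_out, refs_sents=refs)

    # BERTScore against the references, as used in this project.
    _, _, f1 = bertscore(sys_out, [[r[0] for r in refs]], lang="en")

    weights = {"sari": 1.0, "bert": 0.0}   # only SARI counts, as in MUSS
    combined = weights["sari"] * sari + weights["bert"] * float(f1.mean())
    print(sari, combined)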
Prompts           SARI          BERT    DTD        WR         LV         LR
Baseline          43.83         —       0.249...   0.814...   0.758...   0.858...
Default           44.00±0.05    0.754   0.25       0.8        0.75       0.85
Joint tokens      44.02±0.05    0.769   0.25       0.8        0.75       0.85
Separate tokens   44.04±0.05    0.754   0.25       0.8        0.75       0.85
Default           44.36±0.05    0.733   0.6        0.7        0.65       0.85
Joint tokens      44.58±0.05    0.794   0.35       0.85       0.8        0.85
Separate tokens   44.53±0.05    0.784   0.35       0.75       0.8        0.85
Default           43.34±0.06    0.827   0.6        0.85       0.85       0.85
Joint tokens      43.83±0.06    0.829   0.6        0.85       0.85       0.85
Separate tokens   43.99±0.06    0.828   0.6        0.85       0.85       0.85

Table 2: Results on SARI and BERT score under differing tokenization strategies, with comparison to the baseline (top 4 rows), optimised parameter values (middle 3 rows) and values reported with unified parameters (bottom 3 rows).

The last step is optimisation. As mentioned in previous sections, the value of each control token is limited to a small range: all options fall between 0.2 and 1.5, except for the Levenshtein ratio, whose upper boundary is 1 because its calculation divides the minimum number of replacement steps needed to change the original sentence into the target sentence by the maximum possible number of replacements. Only these options are provided during optimisation, so the optimisation problem is reduced to finding the best value combination of control tokens within this range. Even though only finite combinations can be applied to the model, the optimisation is still run through the Nevergrad (Rapin and Teytaud, 2018) API, to remain comparable with the current SOTA. With a budget of 64 repetitions of the optimisation process, the algorithm can find a reasonably well-optimised result. In order to ensure the reliability of the score under the optimised combination, a bootstrap on the ASSET (Alva-Manchego et al., 2020a) test dataset is executed by resampling the dataset 200 times, which yields a 95% confidence interval.
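The two procedures can be sketched as follows, assuming the Nevergrad API (TransitionChoice, PortfolioDiscreteOnePlusOne); evaluate_sari here is a toy surrogate standing in for the real preprocess-generate-score loop:

    import nevergrad as ng

    grid = [round(0.2 + 0.05 * i, 2) for i in range(27)]  # 0.2, 0.25, ..., 1.5
    lv_grid = [v for v in grid if v <= 1.0]               # Levenshtein capped at 1

    space = ng.p.Dict(
        dtd=ng.p.TransitionChoice(grid),
        wr=ng.p.TransitionChoice(grid),
        lv=ng.p.TransitionChoice(lv_grid),
        lr=ng.p.TransitionChoice(grid))

    def evaluate_sari(values: dict) -> float:
        # Toy surrogate so the sketch runs; the real loop would preprocess the
        # validation set with these values, generate, and return corpus SARI.
        return -sum((v - 0.7) ** 2 for v in values.values())

    optimizer = ng.optimizers.PortfolioDiscreteOnePlusOne(
        parametrization=space, budget=64)  # 64 evaluations, as described above
    best = optimizer.minimize(lambda values: -evaluate_sari(values))
    print(best.value)  # recommended value combination per control token

And the bootstrap on the test set (our helper; corpus_score stands in for one full evaluation of a resampled ASSET):

    import random

    def bootstrap_ci(pairs, corpus_score, n_resamples=200, alpha=0.05):
        scores = sorted(corpus_score(random.choices(pairs, k=len(pairs)))
                        for _ in range(n_resamples))
        return (scores[int(n_resamples * alpha / 2)],
                scores[int(n_resamples * (1 - alpha / 2)) - 1])

    # Toy usage: per-pair scores, with the corpus score as their mean.
    print(bootstrap_ci([40.1, 44.2, 43.5, 41.9, 45.0],
                       lambda s: sum(s) / len(s)))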
4 Results

4.1 Overall performance

Following the settings of the reimplementation, the baseline from the original code of the current SOTA scores 43.83 on the ASSET (Alva-Manchego et al., 2020a) test dataset, which is consistent with the score of 43.63±0.71 reported for MUSS in the corresponding scenario. There is no confidence interval or BERT score for the baseline because the baseline is generated by rerunning the MUSS code with only specific settings altered, and its actual output lacks these 2 features. As shown in the top 4 rows of Table 2, the SARI score with 95% confidence in the reimplementation is slightly higher than the baseline. The middle 3 rows show the best SARI score with optimised control-token values; among the 3 methods, the joint tokens achieved the highest SARI score. Interestingly, the BERT score is not always proportional to the SARI score, but the BERT score at the optimal value is still quite high. The optimised values of the control tokens are quite close in all situations except for the DTD. The bottom 3 rows show the performance difference under a unified value of the control tokens, where the unified value is the average of all possible values for each control token. Under the unified condition, the separate tokenization outperformed the other two, and the default tokenization method still performs worst. As for the BERT score, the joint tokenization method again outperforms the other two.

4.2 Effects of single control tokens

In order to verify the effects of each single control token, a more detailed investigation of the SARI score was done on each control token separately; the results are shown in Figure 2. Except for Figure 2(b), all 3 tokenization methods show high consistency in the curves and have a common minimum at the value of 1. As shown in Table 4, this is mainly caused by the low scores of both the deletion and adding operations.
[Figure 2 appears here: four panels plotting SARI score (y-axis) against control-token value (x-axis), with one curve per tokenization strategy (Default, Separate, Joint). Panels: (a) DependencyTreeDepth Ratio, (b) WordRank Ratio, (c) ReplaceOnlyLevenshtein Ratio, (d) Length Ratio.]

Figure 2: The effect of varying control tokens with different tokenization strategies on SARI score.

Prompts    SARI         BERT    DTD   |  Prompts    SARI         BERT    WR
Default    40.82±0.05   0.805   0.55  |  Default    40.61±0.06   0.720   0.75
Default    40.54±0.05   0.799   0.6   |  Default    40.80±0.06   0.776   0.8
Separate   40.68±0.06   0.804   0.55  |  Separate   41.08±0.06   0.738   0.75
Separate   40.87±0.05   0.801   0.6   |  Separate   40.32±0.05   0.797   0.8
Joint      40.71±0.06   0.812   0.55  |  Joint      41.42±0.06   0.733   0.75
Joint      40.43±0.06   0.800   0.6   |  Joint      40.43±0.06   0.782   0.8

Prompts    SARI         BERT    LV    |  Prompts    SARI         BERT    LR
Default    42.52±0.06   0.750   0.65  |  Default    40.15±0.06   0.758   0.6
Default    42.26±0.08   0.785   0.7   |  Default    39.91±0.05   0.782   0.65
Separate   42.55±0.06   0.747   0.65  |  Separate   40.25±0.06   0.760   0.6
Separate   42.86±0.06   0.782   0.7   |  Separate   40.27±0.05   0.781   0.55
Joint      42.63±0.06   0.761   0.65  |  Joint      40.46±0.05   0.758   0.6
Joint      42.31±0.07   0.787   0.7   |  Joint      40.64±0.05   0.785   0.65

Table 3: Results on SARI and BERT scores at peak points for the different control tokens.

Control Token   Value   SARI_add   SARI_keep   SARI_del   SARI

DTD_joint       0.2     2.71       27.03       69.32      33.02
DTD_joint       0.6     5.24       58.50       57.51      40.41
DTD_joint       1.0     3.30       62.64       26.68      30.87
DTD_joint       1.5     4.41       62.66       27.82      31.63
WR_joint        0.5     5.10       37.47       68.54      37.04
WR_joint        0.75    6.65       54.91       62.57      41.37
WR_joint        1.0     3.38       62.04       29.90      31.77
WR_joint        1.25    4.19       54.88       58.35      39.14
LV_joint        0.2     7.15       50.83       63.83      40.60
LV_joint        0.7     9.14       60.15       57.60      42.30
LV_joint        1.0     2.25       61.62       32.17      32.01
LR_joint        0.2     1.80       19.27       69.46      30.18
LR_joint        0.65    5.54       56.84       59.36      40.56
LR_joint        1.0     2.43       62.42       15.26      26.70
LR_joint        1.2     5.80       61.46       26.03      31.10

Table 4: SARI score by operation at the turning points in Figure 2.


[Figure 3 appears here: four panels plotting BERT score (y-axis) against control-token value (x-axis), with one curve per tokenization strategy (Default, Separate, Joint). Panels: (a) DependencyTreeDepth Ratio, (b) WordRank Ratio, (c) ReplaceOnlyLevenshtein Ratio, (d) Length Ratio.]

Figure 3: The effect of varying control tokens with different tokenization strategies on BERT score.

In addition to the curves, the differences in tokenization methods have marginal effects on the scores, while the value of the control tokens can change the performance significantly. In Figure 2(a) and 2(c), the separate tokenization method shows the highest peak point, while in Figure 2(b) and Figure 2(d), the joint tokenization method performs best. The corresponding Table 3 also shows the scores in pairs under a unified value. Although the advantage is not as clear as with the combined control tokens, the optimised SARI score of either the separate or the joint tokenization method is still slightly higher than that of the default tokenization method.

Table 4 is designed to help readers better understand the reason for the variations in Figure 2. It shows some local minimum or maximum points within the domain and the corresponding SARI scores by operation. The addition score is much lower than the keeping and deletion scores. This is because there are only a limited number of adding operations in the references and many more expression options that carry a similar meaning, which leads to a low hit rate for the addition operation. At the same time, the keep and deletion operations choose from the existing input and thus have a much higher hit rate and score.

As for the BERT score, as shown in Figure 3, nearly all 3 tokenization strategies show high similarity to each other except in Figure 3(b). The figures show that nearly all models have the highest BERT score around the value of 1. Since the BERT score calculates the correlation between the output and the references, when the control token is set to 1 the model changes almost nothing, and the output is very similar to the input. In this situation, as shown in Table 4, SARI_keep reaches its top value. However, the peak of the BERT score in Figure 3(c) deviates slightly to the left, which shows that the references and the input are not identical.

5 Discussion and Future Directions

One phenomenon found during the optimisation stage of the original project is that the score of the recommended optimised combination is even lower than that of the default control-token values of 0.8. A hypothesis emerged that continuous optimisation is not an ideal option for maximising the score. As shown in the first four rows of Table 2, the score in the reimplementation is higher even at similar values. There are several possible reasons: the algorithm is not working as expected, or the optimisation budget is not large enough to find better optima. The default tokenization method in the MUSS project, which breaks the control tokens into pieces, brings more noise and probably lowers the performance. Apart from the verbosity of the optimal values, the long tokenization of the control token is another source of noisy input. Although the results above show signs of such a problem, it may become more serious as the number of control tokens increases, especially for short sentences. It would be wise to limit unnecessary noise in the input to a lower level.

Figure 2 and Table 4 expose the reason for the variation with the control token and provide a good illustration of the nature of each control token. For single control tokens, the peak points mainly fall between 0.6 and 0.7, and the score decreases as the value deviates from the peak point. However, there are still some differences among the control tokens. For the DependencyTreeDepth Ratio and Length Ratio, the reduction is more dramatic than for the other 2. In both graphs, SARI_add decreases as the value deviates from the peak point and increases slowly once the value exceeds 1. SARI_keep and SARI_del fluctuate like 2 half-phase-shifted sine functions, and the maximum sum is found between the peaks. The graph of the WordRank Ratio shows some diversity among the tokenization methods in both Figure 2(b) and 3(b). Although there is no explanation for these deviations, they show the potential of combining different tokenization methods. When focusing on the main section from 0.5 to 1, the graph shows characteristics similar to the graphs of the previous 2 control tokens. As for the ReplaceOnlyLevenshtein Ratio, the slope is milder on the left side and it seems to have less effect on the SARI score. Unlike the other 3 control tokens, this control token can only indicate the intensity of change, not its direction. Although the combined effects are still under research, a more effective control token could be a better solution.

As for the optimal value, the most significant variation between single and combined control tokens is in the DependencyTreeDepth Ratio. The optimal value for the combined control tokens under the joint and separate tokenization methods is 0.35 instead of 0.6. Although no direct comparison is listed in Table 2, comparing the middle and bottom three rows makes it quite clear that 0.35 gives a better SARI score. The correlation among the control tokens presumably causes this variation. There are also deviations in the other three control tokens. If the four control tokens could be designed to work independently, the graph for a single control token could be used directly to find the optimal value; for now, however, the graph of combined control tokens is bound to contain some distortions. Based on the detailed graphs, it is also clear that the value of the control tokens can significantly affect the performance of models trained in this way and should be treated carefully.

Another interesting finding concerning SARI and BERT in this paper is that most BERT scores at the optimal value are around 0.78 to 0.8. However, as shown in Figures 3(b) and 3(d), more than 1 point attains such a value, so the BERT score alone cannot be used to evaluate text simplification results. It may be a necessary but not sufficient condition for a good simplification. Since the SARI score is not perfect and relies on references, it is important to build non-reference-based metrics to evaluate the model on different genres of corpora. The BERT score may play a role in such new metrics. Thus, this conjecture is worth further verification in future work.

In addition to the values, as shown in Table 3, the tokenization methods can also affect the peak score. In the curves, different methods are optimal at certain points. Although the performance differences may be caused by the models being fine-tuned at a lower training scale, they may still imply performance variations between tokenization methods. Considering the various requirements of lay users, a mixed tokenization method based on the performance curve may maximise the model's performance at different points better than a fixed one. Although it remains unclear whether the same effects appear for combined control tokens, the mixed tokenization method can still be promising as more different control tokens appear. However, a more lightweight and efficient training method should be introduced to solve the problem of balancing cost and effect.

5.1 Future Work

In the future, one of the main tasks is to reimplement control tokens in different models or learning strategies so that training can be more lightweight and less time-consuming. Another goal is to build new non-reference-based metrics to replace SARI, which would contribute significantly to the field. However, it is not easy to understand the relationship between the performance and the control tokens, so a further investigation of the complex relationship between SARI and combined control tokens is also worth doing. Although the five-dimensional graph may be harder to visualise, it can still provide some guidance on how to apply the control tokens. Designing and introducing new control tokens is another novel direction; the control tokens may be further simplified or optimised with a deeper inspection of the control tokens and the SARI score. In addition, the current optimisation procedure works only at the dataset level and needs more precise prediction at the sentence level; a sentence-level model that predicts the optimal control-token values may be worth considering. Lastly, whether a similar phenomenon of control tokens appears in other controllable text generation tasks is also an important question.

5.2 Concluding Remarks

In this investigation, we have shown the results and the importance of control tokens with different values and tokenization methods, which can be used to balance user intention and performance. We proposed some improvements in quantisation, compared the influences of different tokenization strategies for control tokens, and proposed possible means of further improvement. Although the proposed suggestions may improve text simplification tasks only marginally, they may also generalise to prompt design in other controllable NLP tasks.
References

Sweta Agrawal, Weijia Xu, and Marine Carpuat. 2021. A non-autoregressive edit-based approach to controllable text simplification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3757–3769.

Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020a. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679, Online. Association for Computational Linguistics.

Fernando Alva-Manchego, Louis Martin, Carolina Scarton, and Lucia Specia. 2019. EASSE: Easier automatic sentence simplification evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 49–54, Hong Kong, China. Association for Computational Linguistics.

Fernando Alva-Manchego, Carolina Scarton, and Lucia Specia. 2020b. Data-driven sentence simplification: Survey and benchmark. Computational Linguistics, 46(1):135–187.

Fernando Alva-Manchego, Carolina Scarton, and Lucia Specia. 2021. The (un)suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4):861–889.

Jan De Belder and Marie-Francine Moens. 2010. Text simplification for children. In Proceedings of the SIGIR Workshop on Accessible Search Systems, pages 19–26. ACM, New York.

Ashwin Devaraj, Byron C Wallace, Iain J Marshall, and Junyi Jessy Li. 2021. Paragraph-level simplification of medical texts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, volume 2021, page 4972. NIH Public Access.

Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. arXiv preprint arXiv:1606.05491.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.

Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4):681–694.

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Dynamic multi-level multi-task learning for sentence simplification. CoRR, abs/1806.07304.

Eduard H Hovy. 1990. Pragmatics and natural language generation. Artificial Intelligence, 43(2):153–197.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.

Vladimir I Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710. Soviet Union.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Louis Martin, Éric de la Clergerie, Benoît Sagot, and Antoine Bordes. 2020a. Controllable sentence simplification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4689–4698, Marseille, France. European Language Resources Association.

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, and Benoît Sagot. 2020b. Multilingual unsupervised sentence simplification. arXiv preprint arXiv:2005.00352.

Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Thirtieth AAAI Conference on Artificial Intelligence.

Daiki Nishihara, Tomoyuki Kajiwara, and Yuki Arase. 2019. Controllable text simplification with lexical constraint loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 260–266, Florence, Italy. Association for Computational Linguistics.

Kostiantyn Omelianchuk, Vipul Raheja, and Oleksandr Skurzhanskyi. 2021. Text simplification by tagging. arXiv preprint arXiv:2103.05070.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Sarah E Petersen and Mari Ostendorf. 2007. Text simplification for language learners: a corpus analysis. In Workshop on Speech and Language Technology in Education. Citeseer.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

J. Rapin and O. Teytaud. 2018. Nevergrad - A gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad.

Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.

Thomas Scialom, Louis Martin, Jacopo Staiano, Éric Villemonte de la Clergerie, and Benoît Sagot. 2021. Rethinking automatic evaluation in sentence simplification. arXiv preprint arXiv:2104.07560.

Renliang Sun, Zhe Lin, and Xiaojun Wan. 2020. On the helpfulness of document context to sentence simplification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1411–1423, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, and Karthik Sankaranarayanan. 2019. Unsupervised neural text simplification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2058–2068, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.

Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1015–1024, Jeju Island, Korea. Association for Computational Linguistics.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. arXiv preprint arXiv:1703.10931.
Appendix A

Source Reflection nebulae are usually blue because the scattering is more efficient for
blue light than red (this is the same scattering process that gives us blue skies
and red sunsets).
LR_1.2 Reflection nebulae are usually blue because the scattering is more efficient for
blue light than red (this is the same scattering process that gives us blue skies
and red sunsets) and because the light reflects off of them.
LR_1.0 Reflection nebulae are usually blue because the scattering is more efficient for
blue light than red (this is the same scattering process that gives us blue skies
and red sunsets).
LR_0.8 Reflection nebulae are usually blue because the scattering is more efficient for
blue light than red (this is the same scattering process that gives us blue skies).
LR_0.6 Reflection nebulae are usually blue because the scattering is more efficient
for blue light than red.
LR_0.4 Reflection nebulae are usually blue because the scattering is more efficient.
LR_0.2 Reflection nebulae are usually blue in color.

Table 5: Effect of varying the Length ratio while the others remain at 1.0.

Source Moderate to severe damage extended up the Atlantic coastline and as far inland
as West Virginia.
LV_0.8 Moderate to severe damage happened along the Atlantic coast and as far inland
as West Virginia.
LV_0.6 Moderate to severe damage happened along the Atlantic coast and as far inland
as West Virginia.
LV_0.4 In West Virginia, the storm caused moderate to severe damage along the Atlantic
coast and inland.
LV_0.2 The National Hurricane Center (NHC) said that the storm was a "major hurri-
cane" and not a tropical storm.

Table 6: Effect of varying the ReplaceOnlyLevenshtein ratio while the others remain at 1.0.

Source                  He will abjure his allegiance to the king.

WR_0.8 LV_1.0           He will abjure his allegiance to the king.
WR_0.6 LV_1.0           He will abjure his allegiance to the king.
WR_0.8 LV_0.8           He will not give up his allegiance to the king.
WR_0.6 LV_0.8           He will not give up his power to the king.
WR_0.4 LV_0.8           He will not follow the orders of the king.
WR_0.2 LV_0.8           He will abjure his loyalty to the king.
WR_0.6 LV_0.8 LR_0.75   He will not follow the king anymore.

Table 7: Effect of varying the WordRank ratio together with some other ratios, while the rest remain at 1.0.
Source The four canonical texts are the Gospel of Matthew, Gospel of Mark, Gospel of
Luke and Gospel of John, probably written between AD 65 and 100 (see also
the Gospel according to the Hebrews).
DTD_1.2 The four canonical texts are the Gospel of Matthew, Gospel of Mark and Gospel
of Luke, probably written between AD 65 and AD 100 (see also the Gospel
according to the Hebrews).
DTD_0.8 The four canonical texts are the Gospel of Matthew, Gospel of Mark and Gospel
of Luke. They are probably written between AD 65 and 100 (see also the Gospel
according to the Hebrews).
DTD_0.6 The four canonical texts are the Gospel of Matthew, Gospel of Mark and Gospel
of Luke. The Gospel of John was probably written between AD 65 and 100 (see
also the Gospel according to the Hebrews).
DTD_0.4 The four canonical texts are the Gospel of Matthew, Gospel of Mark and Gospel
of Luke. The Gospel of John was probably written between AD 65 and 100 (see
also the Gospel according to the Hebrews).

Table 8: Effect of varying the DependencyTreeDepth ratio while the others remain at 1.0.
