AILC Abstract2
All content following this page was uploaded by Martina Galletti on 30 May 2023.
Francesca Padovani [1][2], Martina Galletti* [1][3][5] & Daniele Nardi [3][4][5]
[1] Sony Computer Science Laboratories-Paris (Sony CSL - Paris), France
[2] University of Trento, Italy
[3] Sapienza University of Rome, Italy
[4] CINI-AIIS, Italy
[5] Centro di Studi e Ricerche Enrico Fermi, Italy
*martina.galletti@sony.com
Automatic Text Simplification (ATS) is the process of modifying a text to reduce its overall linguistic complexity.
To automate this simplification process, a number of non-trivial operations must be carried out, including the
assessment of the complexity of the source text, the identification of the fundamental words and parts of the text
itself, and the appropriate modification of these elements in the subsequent simplification stages, at the level of
vocabulary, syntax or discourse. The simplification problem has been investigated by several studies proposing
different methodologies to tackle the task on the English language, but other languages, such as Italian, are less
explored. This is due not only to the limited amount of data available but also to the poor quality of the accessible
data itself. For the Italian language there are only two small manually curated datasets¹ and only one large corpus²,
PaCCSS-IT, created with a data-driven approach. Most ATS systems produce the same output for every target
group, whereas different categories of people, such as those with cognitive and linguistic disabilities, may benefit
from a text simplified according to their specific needs. The contribution of this work is threefold. First, we built a
new enriched corpus of parallel complex/simple sentences for Italian, robust in quality and large in size,
by merging PaCCSS-IT with the existing manually curated resources³ and a small dataset harvested from
the Italian Wikipedia in a semi-automatic way⁴, and by translating sentences from an English dataset. Second,
we fine-tuned a transformer-based encoder-decoder model inspired by the state of the art available for English⁵.
Finally, we attempted to parameterise grammatical text features to control simplifications with the goal of making
them adaptive for a specific target population. In our evaluation, the baseline sentence simplification model
achieved a SARI value of 51.51 on the test set of the corpus we built. This
result improves the state of the art for the Italian language by 1.51 points. We also made a first attempt at an adaptive
model, which reached a SARI value of 60.12. This score is the highest obtained for a controllable simplification
system for Italian text.
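
The idea of parameterising text features to control simplification can be illustrated with a minimal Python sketch in the spirit of control-token approaches for English. It is not the authors' implementation: the two features (character ratio and word ratio), the token names `<CHARS_..>` and `<WORDS_..>`, and the helper names `compute_ratios` and `add_control_tokens` are all assumptions made for illustration. At training time the ratios would be computed from each complex/simple pair; at inference time they would be chosen to suit the target population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlRatios:
    """Simple/complex ratios used as control values (hypothetical feature set)."""
    chars: float
    words: float


def compute_ratios(complex_sentence: str, simple_sentence: str) -> ControlRatios:
    """Compute length ratios between a simple sentence and its complex source."""
    # Guard against empty sources to avoid division by zero.
    src_chars = max(len(complex_sentence), 1)
    src_words = max(len(complex_sentence.split()), 1)
    return ControlRatios(
        chars=round(len(simple_sentence) / src_chars, 2),
        words=round(len(simple_sentence.split()) / src_words, 2),
    )


def add_control_tokens(complex_sentence: str, ratios: ControlRatios) -> str:
    """Prepend the control values to the model input as plain-text tokens."""
    # Token names are illustrative only; they are not taken from the abstract.
    return f"<CHARS_{ratios.chars:.2f}> <WORDS_{ratios.words:.2f}> {complex_sentence}"


if __name__ == "__main__":
    source = "Il gatto dorme profondamente sul divano del salotto."
    target = "Il gatto dorme sul divano."
    ratios = compute_ratios(source, target)
    # The prefixed string is what an encoder-decoder model would receive as input.
    print(add_control_tokens(source, ratios))
```

Under these assumptions, lowering the requested ratios at inference time would ask the model for shorter output, which is one simple way a simplification could be adapted to a specific group of readers.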
¹ Brunato, D., Dell’Orletta, F., Venturi, G., & Montemagni, S. (2015, June). Design and annotation of the first Italian corpus for text simplification. In Proceedings of The 9th Linguistic Annotation Workshop (pp. 31-41).
² Brunato, D., Cimino, A., Dell’Orletta, F., & Venturi, G. (2016, November). PaCCSS-IT: A parallel corpus of complex-simple sentences for automatic text simplification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 351-361).
³ Brunato, D., Dell’Orletta, F., Venturi, G., & Montemagni, S. (2015, June). Design and annotation of the first Italian corpus for text simplification. In Proceedings of The 9th Linguistic Annotation Workshop (pp. 31-41).
⁴ Tonelli, S., Aprosio, A. P., & Saltori, F. (2016). SIMPITIKI: a Simplification corpus for Italian. In CLiC-it/EVALITA (pp. 4333-4338).
⁵ Sheang, K. C., & Saggion, H. (2021, August). Controllable Sentence Simplification with a Unified Text-to-Text Transfer Transformer. In Proceedings of the 14th International Conference on Natural Language Generation (pp. 341-352).