Proceedings chapter
Open access
English

Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method

Presented at Neuchâtel, 12-14 June 2023
First online date: 2023-06-23
Abstract

Automatic Text Simplification (ATS) aims to simplify texts by reducing their linguistic complexity while retaining their meaning. Although the task is of interest from both a societal and a computational perspective, the lack of monolingual parallel data prevents the agile implementation of ATS models, especially for languages less resource-rich than English. For these reasons, this paper investigates how to create a general-language parallel simplification dataset for French by extracting complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. Using a two-step automatic filtering process, we address in sequence the two primary conditions that a simplified sentence must satisfy to be considered valid: (1) preservation of the original meaning, and (2) simplicity gain with respect to the source text. With this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
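
As a rough illustration of the two-step filtering idea described in the abstract, the sketch below keeps a candidate pair only if it passes a meaning-preservation check and then a simplicity-gain check. It is not the authors' implementation: the function names, the token-overlap similarity, the character-count complexity score, and the `min_similarity` threshold are all assumptions chosen to keep the example self-contained. A real system would presumably use a proper sentence semantic similarity model and a more informed complexity measure.

```python
from typing import Iterable, List, Tuple


def tokens(sentence: str) -> List[str]:
    """Lowercase whitespace tokenization, stripping surrounding punctuation."""
    stripped = (tok.strip(".,;:!?()\"'«»") for tok in sentence.lower().split())
    return [tok for tok in stripped if tok]


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for semantic similarity: Jaccard token overlap."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def complexity(sentence: str) -> int:
    """Hypothetical stand-in for complexity: total characters across tokens."""
    return sum(len(tok) for tok in tokens(sentence))


def filter_pairs(
    candidates: Iterable[Tuple[str, str]],
    min_similarity: float = 0.3,  # assumed threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep (complex, simple) pairs that pass both filtering steps in order."""
    kept = []
    for complex_sent, simple_sent in candidates:
        # Step 1: the simplified sentence must preserve the original meaning.
        if similarity(complex_sent, simple_sent) < min_similarity:
            continue
        # Step 2: the simplified sentence must actually be simpler.
        if complexity(simple_sent) >= complexity(complex_sent):
            continue
        kept.append((complex_sent, simple_sent))
    return kept


if __name__ == "__main__":
    examples = [
        ("Le chat domestique est un mammifère carnivore de la famille des félidés.",
         "Le chat est un mammifère carnivore."),
        ("Le chat domestique est un mammifère carnivore.",
         "Paris est la capitale de la France."),
        ("Le chat est un animal.",
         "Le chat domestique est un animal de compagnie très apprécié."),
    ]
    for pair in filter_pairs(examples):
        print(pair)
```

Running the two checks in sequence means the second, simplicity-oriented test is only applied to pairs already judged to share the same meaning, which mirrors the ordering of the two conditions stated above.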

Keywords
  • Automatic text simplification
  • Comparable corpora
  • Automatic sentence alignment
  • Wiki-based resources
  • Sentence semantic similarity
  • Low-resourced tasks
Citation (ISO format)
ORMAECHEA GRIJALBA, Lucía, TSOURAKIS, Nikolaos. Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method. In: Proceedings of the 8th Swiss Text Analytics Conference 2023 (SwissText). Neuchâtel. [s.l.] : [s.n.], 2023. p. 10.
Identifiers
  • PID : unige:169798

All rights reserved by Archive ouverte UNIGE and the University of Geneva