Proceedings chapter
English

Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method

Presented at: Neuchâtel, 12-14 June 2023
First online date: 2023-06-23
Abstract

Automatic Text Simplification (ATS) aims to simplify texts by reducing their linguistic complexity while retaining their meaning. Although ATS is an interesting task from both a societal and a computational perspective, the lack of monolingual parallel data prevents an agile implementation of ATS models, especially in languages less resource-rich than English. For these reasons, this paper investigates how to create a general-language parallel simplification dataset for French, using a method that extracts complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. Using a two-step automatic filtering process, we sequentially address the two primary conditions that must be satisfied for a simplified sentence to be considered valid: (1) preservation of the original meaning, and (2) simplicity gain with respect to the source text. With this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French general-language sequence-to-sequence ATS models.
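
The two-step filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state which similarity or simplicity measures the paper uses, so the token-overlap score, the character-count complexity proxy, the threshold value, and all function names below are assumptions chosen only to keep the example self-contained.

```python
import re


def tokens(sentence):
    """Lowercase word tokens; \\w is Unicode-aware, so accented letters are kept."""
    return re.findall(r"\w+", sentence.lower())


def meaning_score(complex_sent, simple_sent):
    """Assumed stand-in for semantic similarity: Jaccard overlap of token sets."""
    a, b = set(tokens(complex_sent)), set(tokens(simple_sent))
    return len(a & b) / len(a | b) if a | b else 0.0


def complexity(sentence):
    """Assumed crude complexity proxy: total number of characters in word tokens."""
    return sum(len(t) for t in tokens(sentence))


def two_step_filter(pairs, min_meaning=0.3):
    """Keep (complex, simple) pairs that pass both conditions, checked in order."""
    kept = []
    for complex_sent, simple_sent in pairs:
        # Step 1: meaning preservation (hypothetical threshold).
        if meaning_score(complex_sent, simple_sent) < min_meaning:
            continue
        # Step 2: simplicity gain; the candidate must score lower than the source.
        if complexity(simple_sent) >= complexity(complex_sent):
            continue
        kept.append((complex_sent, simple_sent))
    return kept


# Invented example pairs, for illustration only (not taken from WiViCo).
EXAMPLE_PAIRS = [
    ("Le chat domestique est un mammifère carnivore de la famille des félidés.",
     "Le chat est un mammifère carnivore."),
    ("Le chat domestique est un mammifère carnivore.",
     "Paris est la capitale de la France."),
    ("Le chat est un animal.",
     "Le chat domestique est un animal carnivore de petite taille."),
]

if __name__ == "__main__":
    for complex_sent, simple_sent in two_step_filter(EXAMPLE_PAIRS):
        print(complex_sent, "->", simple_sent)
```

In this sketch the second example pair is rejected at step 1 (the sentences share almost no vocabulary) and the third at step 2 (the candidate is longer than its source), so only the first pair is kept. A realistic pipeline would presumably replace both proxies with stronger measures, such as embedding-based similarity and a readability model.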

Keywords
  • Automatic text simplification
  • Comparable corpora
  • Automatic sentence alignment
  • Wiki-based resources
  • Sentence semantic similarity
  • Low-resourced tasks
Citation (ISO format)
ORMAECHEA GRIJALBA, Lucía, TSOURAKIS, Nikolaos. Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method. In: Proceedings of the 8th Swiss Text Analytics Conference 2023 (SwissText). Neuchâtel. [s.l.] : [s.n.], 2023. p. 10.
Main files (1)
Proceedings chapter (Published version)
Identifiers
  • PID : unige:169798

Technical information

Creation: 23/06/2023 13:24:54
First validation: 29/06/2023 11:06:04
Update time: 29/06/2023 11:06:04
Status update: 29/06/2023 11:06:04
Last indexation: 01/11/2024 06:26:38
All rights reserved by Archive ouverte UNIGE and the University of Geneva