Doctoral thesis
English

Towards Simpler Transcripts: Investigating Automatic Simplification of French Spontaneous Speech

Imprimatur date: 2026-01-28
Defense date: 2025-12-18
Abstract

In this thesis, we address the automation of speech simplification with a focus on spontaneous French as the input language, which remains a largely unexplored task in current research. While automatic text simplification (ATS) has advanced significantly for written English texts, little attention has been paid to the spoken modality in the same task. Furthermore, the shortage of parallel corpora in languages that are less resource-rich than English has posed an obstacle to its automation in the language under consideration. Therefore, the main objective of this research work is to bridge these gaps by proposing an artifact that automatically simplifies spontaneous French speech.

The thesis is structured around three main goals. First, we propose a characterization of simplification strategies specific to spontaneous French, as the task of spontaneous speech simplification has not been formally defined. To this end, we collect expert- and machine-based simplifications from utterances derived from the CEFC dataset, and then analyze the linguistic operations performed. The findings show a preference for deletions and the tendency to produce register-standardized sentences that solely retain the propositional content of the input.

Secondly, we tackle the challenge of limited task-specific parallel data through two data creation methods. The first method relies on the exploitation of register-differentiated comparable corpora (i.e., Wikipedia and Vikidia) to extract aligned complex-simpler sentence pairs, resulting in the WiViCo set, which contains 46k unspontaneous yet human-based pairs. The second method is based on synthetic data generation using LLMs through an iterative exo-refinement workflow. In this approach, separate LLMs are used for generation and evaluation, enabling external feedback loops and role specialization. Applied to CEFC transcripts, this process produces the CEFC-Synth dataset, which, despite being artificially generated, reflects more closely the spontaneous speech modality.
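
As an illustration only, the iterative exo-refinement idea described above (one model generates, a separate model evaluates and returns feedback) can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the score threshold, the number of rounds, and the toy generator and evaluator standing in for the two LLMs are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: in practice each role would be played by a separate LLM.
Generator = Callable[[str, Optional[str]], str]        # (source, feedback) -> candidate
Evaluator = Callable[[str, str], Tuple[float, str]]    # (source, candidate) -> (score, feedback)


def exo_refine(source: str, generate: Generator, evaluate: Evaluator,
               threshold: float = 0.8, max_rounds: int = 3) -> Tuple[str, float]:
    """Refine a simplification using feedback from an external evaluator.

    The threshold and round limit are illustrative assumptions.
    """
    feedback: Optional[str] = None
    best, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidate = generate(source, feedback)          # generator role
        score, feedback = evaluate(source, candidate)   # evaluator role (external feedback)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break                                       # good enough: stop refining
    return best, best_score


# Toy stand-ins for the two models, so the loop can be run without any LLM.
FILLERS = {"euh", "ben", "hein"}


def toy_generate(source: str, feedback: Optional[str]) -> str:
    words = source.split()
    if feedback is None:
        return " ".join(words)                          # first attempt: unchanged
    return " ".join(w for w in words if w not in FILLERS)


def toy_evaluate(source: str, candidate: str) -> Tuple[float, str]:
    remaining = [w for w in candidate.split() if w in FILLERS]
    if remaining:
        return 0.0, "remove fillers: " + ", ".join(remaining)
    return 1.0, "ok"


result = exo_refine("euh je pense que ben c'est bon", toy_generate, toy_evaluate)
```

Keeping the generator and evaluator behind separate callables mirrors the role specialization mentioned in the abstract: either side can be swapped for a different model without touching the loop.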

Finally, building on these resources, we introduce an artifact for French spontaneous speech simplification, trained on a combination of human- (i.e., WiViCo) and LLM-generated data (i.e., CEFC-Synth). We experiment with both cascade and end-to-end architectures, and evaluate their performance against a CEFC-based test set of expert-crafted simplification references, on the basis of automatic metrics. Results demonstrate that our proposed models notably outperform a state-of-the-art text simplification system, i.e., MUSS. Moreover, the inclusion of synthetic speech domain data in the training set proves beneficial, as evidenced by the results obtained in the transcript-to-simplification experiments.

By providing new evaluation and training resources, different methods for task-specific data creation, and an operational artifact, this thesis advances the generation of simpler transcripts from spontaneous French. This, in turn, has implications not only from an accessibility perspective (by enhancing the clarity of the message for diverse target audiences), but also from a computational standpoint, as providing intermediate simplified representations may improve performance in other downstream NLP tasks.

Keywords
  • Automatic speech simplification
  • Large language models
  • Synthetic data generation
  • Exploitation of comparable corpora
  • Spontaneous speech
  • Low-resourced tasks
Citation (ISO format)
ORMAECHEA GRIJALBA, Lucía. Towards Simpler Transcripts: Investigating Automatic Simplification of French Spontaneous Speech. Thèse, 2026. doi: 10.13097/archive-ouverte/unige:191341
Main files (1)
Thesis
Access level: Public
Secondary files (1)
Imprimatur
Access level: Public

Technical information

Creation: 02/07/2026 3:27:08 PM
First validation: 02/10/2026 2:16:46 PM
Update time: 03/03/2026 1:06:23 PM
Status update: 03/03/2026 1:06:23 PM
Last indexation: 03/03/2026 1:06:31 PM
All rights reserved by Archive ouverte UNIGE and the University of Geneva