Scientific article
English

Linguistic statistical universals: comparing computer- and human-generated texts

Contributors: Civico, Marco
First online date: 2025-02-25
Abstract

This paper tests whether artificial text samples generated by transformers can replicate the writing style of various authors across different languages. We fine-tune GPT-2-based models on corpora from Jane Austen (English), Jules Verne (French) and Giovanni Verga (Italian). We then analyse the samples in terms of (i) lexical distribution; (ii) long-term correlations; and (iii) entropy. As a benchmark, we use text samples generated by Markov chains of different orders trained on the corpora of the same authors. Our results show that transformers represent a great improvement in capturing long-range correlations and in entropy reduction, although the same cannot be said of lexical distribution.
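To make the benchmark described in the abstract concrete, here is a minimal Python sketch of a word-level Markov chain of a chosen order, together with two simple measurements: a rank-frequency table for lexical distribution and a unigram Shannon entropy. It is an illustration under stated assumptions, not the article's implementation. The function names (`train_markov`, `generate`, `unigram_entropy`, `rank_frequency`), the whitespace tokenisation and the toy corpus are all hypothetical, and the entropy estimator used in the article may differ. Long-range correlations are not covered here.

```python
import math
import random
from collections import Counter, defaultdict


def train_markov(tokens, order):
    """Map each context of `order` tokens to the tokens observed right after it."""
    if order < 1:
        raise ValueError("order must be at least 1")
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        table[context].append(tokens[i + order])
    if not table:
        raise ValueError("corpus is too short for this order")
    return table


def generate(table, order, length, seed=0):
    """Sample `length` tokens; restart from a random context at dead ends."""
    rng = random.Random(seed)
    contexts = sorted(table)  # sorted so that a fixed seed is reproducible
    out = list(rng.choice(contexts))
    while len(out) < length:
        followers = table.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            out.extend(rng.choice(contexts))
    return out[:length]


def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the empirical word distribution."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_frequency(tokens):
    """Return (rank, count) pairs, most frequent word first (Zipf-style table)."""
    ranked = Counter(tokens).most_common()
    return [(rank, count) for rank, (_, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Toy corpus standing in for an author's text (assumption for illustration).
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    model = train_markov(corpus, order=2)
    sample = generate(model, order=2, length=20, seed=1)
    print(" ".join(sample))
    print(unigram_entropy(corpus), unigram_entropy(sample))
    print(rank_frequency(sample)[:5])
```

Raising `order` makes the chain copy longer stretches of the training text, which is why comparing several orders gives a graded baseline against which a fine-tuned transformer can be judged.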

Keywords
  • Digital Humanities
  • Artificial Text Generation
  • Authorial Style Modeling
  • Lexical Distribution Analysis
  • Entropy in Text Analysis
  • Long-range Textual Correlations
  • Markov Chains for Text Modeling
  • Statistical Stylometry
  • Computational linguistics
  • Transformers
Citation (ISO format)
CIVICO, Marco. Linguistic statistical universals: comparing computer- and human-generated texts. In: International journal of digital humanities, 2025, p. 37. doi: 10.1007/s42803-025-00096-7
Main files (1)
Article (Published version)
Identifiers
Additional URL for this publication: https://link.springer.com/10.1007/s42803-025-00096-7
Journal ISSN: 2524-7832

Technical information

Creation: 26/02/2025 10:16:24
First validation: 27/02/2025 06:41:04
Update time: 27/02/2025 06:41:04
Status update: 27/02/2025 06:41:04
Last indexation: 27/02/2025 06:41:05
All rights reserved by Archive ouverte UNIGE and the University of Geneva