Linguistic statistical universals: comparing computer- and human-generated texts

Civico, Marco

doi:10.1007/s42803-025-00096-7

Scientific article

English

Contributors : Civico, Marco

Published in : International journal of digital humanities, p. 37

First online date : 2025-02-25

Abstract

This paper tests whether artificial text samples generated by transformers can replicate the writing style of various authors across different languages. We fine-tune GPT-2-based models on corpora from Jane Austen (English), Jules Verne (French) and Giovanni Verga (Italian). We then analyse the samples in terms of (i) lexical distribution; (ii) long-term correlations; and (iii) entropy. As a benchmark, we use text samples generated by Markov chains of different orders trained on the corpora of the same authors. Our results show that transformers greatly improve on this benchmark in capturing long-range correlations and in entropy reduction, although the same cannot be said of lexical distribution.
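As a rough illustration of the benchmark and two of the measures named in the abstract, the following is a minimal sketch and not the article's implementation. It assumes word-level tokens, a Markov order of at least 1, and a corpus longer than that order; the function names `train_markov`, `generate`, `unigram_entropy` and `rank_frequency` are hypothetical. Long-term correlations are not covered here.

```python
import math
import random
from collections import Counter, defaultdict


def train_markov(tokens, order):
    """Map each run of `order` consecutive tokens to the tokens seen right after it."""
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        table[tuple(tokens[i:i + order])].append(tokens[i + order])
    return table


def generate(table, order, length, seed=0):
    """Sample `length` tokens, restarting from a random context at dead ends."""
    rng = random.Random(seed)
    contexts = sorted(table)  # sorted so a fixed seed gives a reproducible sample
    out = list(rng.choice(contexts))
    while len(out) < length:
        followers = table.get(tuple(out[-order:]))
        if not followers:
            # Context never seen in training: jump to a random known context.
            out.extend(rng.choice(contexts))
            continue
        out.append(rng.choice(followers))
    return out[:length]


def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the empirical word distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def rank_frequency(tokens):
    """Word counts in descending order: the rank-frequency (Zipf) profile."""
    return [c for _, c in Counter(tokens).most_common()]
```

Under these assumptions, one could tokenise an author's corpus, call `train_markov` with several values of `order`, and compare `unigram_entropy` and `rank_frequency` of the generated samples against the same quantities for the original text.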

Keywords

Digital Humanities
Artificial Text Generation
Authorial Style Modeling
Lexical Distribution Analysis
Entropy in Text Analysis
Long-range Textual Correlations
Markov Chains for Text Modeling
Statistical Stylometry
Computational linguistics
Transformers

Affiliation entities

Faculté de traduction et d'interprétation / Département de traduction

Research groups

Observatoire économie langues formation (élf)

Citation (ISO format)

CIVICO, Marco. Linguistic statistical universals: comparing computer- and human-generated texts. In: International journal of digital humanities, 2025, p. 37. doi: 10.1007/s42803-025-00096-7

Article (Published version)

CC BY-4.0

Identifiers

PID : unige:183498
DOI : 10.1007/s42803-025-00096-7

Additional URL for this publication : https://link.springer.com/10.1007/s42803-025-00096-7

Journal ISSN : 2524-7832

Creation : 26/02/2025 10:16:24

First validation : 27/02/2025 06:41:04

Update time : 27/02/2025 06:41:04

Status update : 27/02/2025 06:41:04

Last indexation : 27/02/2025 06:41:05

Archive ouverte UNIGE

Technical information