Conference presentation

A Glimpse into Terminology Research with R: Two Experiments Exploring Diastratic Variation in a Large Specialized Corpus

Presented at: ONELA 2021, Instrumentation and new explorations in applied linguistics, Toulouse, 19-21 October 2021
Publication date: 2021-10-19

The increasing possibilities for the study of specialized discourse have seen terminologists dealing with large volumes of more heterogeneous corpus data (Picton, Drouin & Humbert-Droz, 2018). In this context, researchers are faced with the prospect that ready-made tools might fall short, which emphasizes the need for programming languages. R has become one of the most popular choices among linguists as a tool for both data extraction and data evaluation tasks (e.g. Desagulier, 2017; Gries, 2009a; Gries, 2009b). For common operations such as concordances, the benefits of developing customized scripts may not compensate for a steep learning curve (Anthony, 2013). However, when it comes to advanced techniques, writing code opens doors that otherwise remain closed. Given that R is free and open-source, it has attracted a huge community that allows it to keep up with new methods in statistical analysis and machine learning (Wickham, 2019). As an illustration, we propose an experiment in each of these two areas. Both aim to examine diastratic variation, understood as the coexistence of different language uses within groups of experts in the same field (Picton & Dury, 2017). The corpus that we have chosen for these tests, created for the Humanitarian Encyclopedia project, contains over 70 million occurrences and can be subdivided based on various criteria. In this case, we center on the eleven types of humanitarian organizations and their subcorpora, all of very disparate sizes. In each experiment, we focus on a different phenomenon of diastratic variation, providing a step-by-step description of the approach adopted to investigate it. First, we compare the terminologies of the organization types and establish whether certain communities of humanitarian actors favor specific concepts.
To remedy the imbalanced sizes of the subcorpora, we turn to correspondence analysis, an exploratory technique to reveal patterns of association in categorical data and display them in two-dimensional plots (Benzécri, 1992; Glynn, 2014). Second, we represent humanitarian terms as more or less distant points in space by capturing their meanings as vectors. Building on the assumption that similarity in meaning correlates with similarity in distribution (Harris, 1954), the word2vec algorithms rely on deep learning technology to infer the meaning of a lexical unit from its contexts in a corpus (Mikolov et al., 2013). Together, the two experiments lead us to discuss key perspectives and limitations for R in terminology studies.
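
As a rough illustration of the first experiment, the sketch below runs a simple correspondence analysis on a small term-by-organization-type frequency table using only base R. It is a minimal sketch under stated assumptions, not the authors' script: the counts, the four terms, and the three organization-type labels are all invented for the example, and the abstract does not say which R package or settings were actually used. The point it shows is that the analysis works on relative profiles (proportions scaled by the row and column margins), which is why subcorpora of very different sizes can be compared.

```r
# Toy contingency table: rows are terms, columns are organization types.
# ASSUMPTION: every count and label here is invented for illustration;
# none of it comes from the Humanitarian Encyclopedia corpus.
counts <- matrix(
  c(120, 30, 10,
     40, 80, 15,
     10, 25, 60,
    300, 90, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    term = c("protection", "resilience", "advocacy", "aid"),
    org_type = c("NGO", "UN_agency", "think_tank")
  )
)

# Correspondence matrix and its margins (row and column masses).
P <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized residuals: departure from independence, scaled by the margins,
# so that large and small subcorpora are compared on relative profiles.
expected <- outer(row_mass, col_mass)
S <- (P - expected) / sqrt(expected)

# Singular value decomposition of the residual matrix.
dec <- svd(S)
n_dim <- min(dim(counts)) - 1            # number of non-trivial dimensions
keep <- seq_len(n_dim)
sv <- dec$d[keep]

# Principal coordinates for rows (terms) and columns (organization types).
row_coord <- sweep(dec$u[, keep, drop = FALSE], 1, sqrt(row_mass), "/") %*% diag(sv, n_dim)
col_coord <- sweep(dec$v[, keep, drop = FALSE], 1, sqrt(col_mass), "/") %*% diag(sv, n_dim)
rownames(row_coord) <- rownames(counts)
rownames(col_coord) <- colnames(counts)

# Total inertia (chi-squared statistic divided by the grand total)
# and the share of it carried by each dimension.
total_inertia <- sum(sv^2)
explained <- sv^2 / total_inertia
```

The first two columns of `row_coord` and `col_coord` are what a two-dimensional correspondence plot would display. In practice, dedicated packages such as `ca` or `FactoMineR` wrap these steps and add plotting helpers; the base R version is used here only to keep the sketch self-contained.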

  • Specialized corpora
  • Large corpora
  • R
  • Diastratic variation
Citation (ISO format)
GONZALEZ GRANADO, Nicolas, PICTON, Aurélie, DROUIN, Patrick. A Glimpse into Terminology Research with R: Two Experiments Exploring Diastratic Variation in a Large Specialized Corpus. In: ONELA 2021. Toulouse. 2021.
Main files (1)
  • PID : unige:160243

Technical information

Creation: 04/12/2022 12:56:00 PM
First validation: 04/12/2022 12:56:00 PM
Update time: 03/16/2023 6:23:16 AM
Status update: 03/16/2023 6:23:15 AM
Last indexation: 08/31/2023 8:15:46 AM
All rights reserved by Archive ouverte UNIGE and the University of Geneva