UNIGE document Conference Presentation

Part-of-speech tagging for regional languages and dialects : A generic approach based on unsupervised learning

Presented at 8èmes Journées Suisses de la Linguistique, Zurich, 19-21 June 2014
Abstract Natural language processing for regional languages and dialects faces several challenges. First, the amount of electronically available written text is small. Second, these data are most often not annotated, and spelling may not be standardized. These challenges can be overcome, at least partially, by using an etymologically closely related language with more resources (hereafter the resourced language). However, in most such configurations, parallel corpora are not available: since two closely related languages are generally mutually intelligible, the demand for translation (which constitutes the source of parallel corpora) is low. In this communication, we describe various experiments based on our part-of-speech tagging approach (see also Das & Petrov 2011). Concretely, we transfer existing part-of-speech annotations from the resourced language to the non-resourced (regional) language, without using bilingual (i.e., parallel) data during this process. This approach relies on two hypotheses. First, we assume that both languages share many cognate pairs, i.e., word pairs that are formally similar and that are translations of each other. Second, we assume that the word order is similar in both languages, and that the set of POS tags is identical. Under these hypotheses, the POS tag of one word can be transferred to its translational equivalent in the other language. The proposed approach consists of two steps. In the first step, we induce a translation lexicon from monolingual corpora using techniques of unsupervised learning.
For this, we propose a pipeline of several sub-steps that rely on different similarity measures:

- Language-independent formal similarity of words according to the BI-SIM measure (Kondrak & Dorr 2004)
- Language-dependent formal similarity of words, measured with a character-level machine translation system (Tiedemann 2009)
- Frequency-based similarity (Koehn & Knight 2002)
- Contextual similarity (Rapp 1999)

In the second step, we assign to each word of the non-resourced language the POS tag(s) observed with the resourced-language word to which it was linked in the first step. The resulting POS tag assignments can then be used to train a regular tagger.

In the presentation, we will describe the different steps of our approach in some detail. We will also present the experiments that we have conducted with Romance, Germanic and Slavic language pairs, for which we have obtained accuracy figures ranging between 59% and 91%. Finally, we will discuss the reasons for the rather large variance observed in the results.
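The two steps described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. The similarity function is a simplified bigram measure in the spirit of BI-SIM (its exact formulation here is an assumption), and the character-level translation, frequency and context sub-steps are left out entirely. The function names, the threshold of 0.5, and the toy word forms and tags are all hypothetical.

```python
def bigram_similarity(a: str, b: str) -> float:
    """Simplified bigram similarity in [0, 1] (assumed formulation, BI-SIM-like).

    Each word is padded with a copy of its first letter, the bigram sequences
    are aligned by dynamic programming, and the best alignment score is
    normalised by the length of the longer word.
    """
    if not a or not b:
        return 0.0
    pa, pb = a[0] + a, b[0] + b
    grams_a = [pa[i:i + 2] for i in range(len(a))]
    grams_b = [pb[j:j + 2] for j in range(len(b))]
    # table[i][j]: best score aligning the first i bigrams of a with the first j of b
    table = [[0.0] * (len(grams_b) + 1) for _ in range(len(grams_a) + 1)]
    for i, x in enumerate(grams_a, 1):
        for j, y in enumerate(grams_b, 1):
            # A bigram pair scores 0, 0.5 or 1 depending on matching positions.
            match = ((x[0] == y[0]) + (x[1] == y[1])) / 2
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + match)
    return table[-1][-1] / max(len(a), len(b))


def induce_lexicon(regional_words, resourced_words, threshold=0.5):
    """Step 1 (sketch): link each regional word to its most similar resourced word."""
    lexicon = {}
    for word in regional_words:
        best = max(resourced_words, key=lambda r: bigram_similarity(word, r))
        if bigram_similarity(word, best) >= threshold:
            lexicon[word] = best
    return lexicon


def transfer_tags(lexicon, resourced_tags):
    """Step 2 (sketch): copy the POS tags of the linked resourced word."""
    return {word: set(resourced_tags[linked])
            for word, linked in lexicon.items() if linked in resourced_tags}


# Toy data (invented for illustration): resourced word forms with their tags,
# and untagged regional word forms.
resourced_tags = {"haus": {"NOUN"}, "und": {"CONJ"}, "gut": {"ADJ", "ADV"}}
regional_words = ["huus", "und", "guet", "xyz"]

lexicon = induce_lexicon(regional_words, list(resourced_tags))
regional_tags = transfer_tags(lexicon, resourced_tags)
```

In this sketch a regional word with no sufficiently similar counterpart simply receives no tag. The resulting dictionary plays the role of the tag assignments that, according to the abstract, are then used to train a regular tagger.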
Full text
Presentation (Author postprint) (219 Kb) - public document Free access
Research group Laboratoire d'Analyse et de Traitement du Langage (LATL)
(ISO format)
SCHERRER, Yves. Part-of-speech tagging for regional languages and dialects : A generic approach based on unsupervised learning. In: 8èmes Journées Suisses de la Linguistique. Zurich. 2014. https://archive-ouverte.unige.ch/unige:38801

Deposited on : 2014-07-21
