Training Statistical Language Models from Grammar-Generated Data: a Comparative Case-Study

Hockey, Beth Ann; Rayner, Emmanuel; Christian, Gwen

Proceedings chapter

Open access

English

Contributors: Hockey, Beth Ann; Rayner, Emmanuel; Christian, Gwen

Published in: Proceedings of the GoTAL Conference

Presented at: Gothenburg (Sweden)

Publication date: 2008

Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, while grammar-based models (GLMs) have the advantage that they can be built even when little corpus data is available. A known way of attempting to combine the two methodologies is first to create a GLM, and then to use that GLM to generate training data for an SLM. It has, however, been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
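The difference between PCFG and CFG generation of training data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar, the rule weights, and the helper names (`generate`, `make_corpus`, `bigram_counts`) are hypothetical and are not taken from the paper or from Regulus. "CFG generation" is approximated here as choosing among rules uniformly, and the "SLM" is reduced to raw bigram counts.

```python
import random
from collections import Counter

# Hypothetical toy grammar (not from the paper). Each nonterminal maps to a
# list of (right-hand side, weight) pairs. In a PCFG the weights would be
# estimated from corpus data; here they are made up.
GRAMMAR = {
    "S": [
        (["switch", "on", "DEVICE"], 6.0),
        (["switch", "off", "DEVICE"], 3.0),
        (["dim", "DEVICE"], 1.0),
    ],
    "DEVICE": [
        (["the", "light"], 8.0),
        (["the", "fan"], 2.0),
    ],
}


def generate(symbol, rng, use_weights):
    """Expand `symbol` top-down into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word
    rules = GRAMMAR[symbol]
    options = [rhs for rhs, _ in rules]
    # PCFG generation uses the rule weights; the CFG variant (an assumption
    # of this sketch) ignores them and picks every rule with equal chance.
    weights = [w for _, w in rules] if use_weights else [1.0] * len(rules)
    rhs = rng.choices(options, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, rng, use_weights))
    return words


def make_corpus(n, use_weights, seed=0):
    """Generate n sentences from the start symbol with a fixed seed."""
    rng = random.Random(seed)
    return [generate("S", rng, use_weights) for _ in range(n)]


def bigram_counts(sentences):
    """Count word bigrams, with sentence boundary markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


if __name__ == "__main__":
    for label, use_weights in [("PCFG", True), ("CFG", False)]:
        counts = bigram_counts(make_corpus(2000, use_weights))
        print(label, counts.most_common(3))
```

In this sketch the weighted corpus reproduces the skew encoded in the rule weights, while the unweighted corpus gives rare constructions such as "dim ..." as much mass as common ones, so bigram statistics estimated from it drift away from the usage the weights describe.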

Affiliation

Faculté de traduction et d'interprétation / Département de traitement informatique multilingue

Citation (ISO format)

HOCKEY, Beth Ann, RAYNER, Emmanuel, CHRISTIAN, Gwen. Training Statistical Language Models from Grammar-Generated Data: a Comparative Case-Study. In: Proceedings of the GoTAL Conference. Gothenburg (Sweden). [s.l.] : [s.n.], 2008.

Identifiers

PID: unige:3475

Created: 02.10.2009 09:29:05

First validation: 02.10.2009 09:29:05

Last updated: 14.03.2023 15:14:57

Status change: 14.03.2023 15:14:57

Last indexed: 12.02.2024 18:13:41

Archive ouverte UNIGE
