Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases

Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

doi:10.1093/database/bat041

Scientific article

English

Published in: Database, vol. 2013, bat041

Publication date: 2013

Abstract

The available curated data lag behind current biological knowledge contained in the literature. Text mining can help biologists and curators locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and the GO terms themselves. Their effectiveness remains limited, however, owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing number of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performance have evolved. Systems are evaluated on their ability to propose, for a given abstract, the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, rising from 0.38 in 2006 to 0.56 in 2012 for R20. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly efficient at providing assistance to biologists.
Database URL: http://eagl.unige.ch/GOCat/
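
The two ideas named in the abstract, k-Nearest Neighbours assignment from a knowledge base of curated abstracts and evaluation by Recall at 20, can be illustrated with a minimal sketch. It is not the authors' system: the bag-of-words cosine similarity, the similarity-weighted voting, the function names and the toy knowledge base with placeholder labels (`T_APOPTOSIS` and so on) are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased alphanumeric tokens; a real system would use richer indexing.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_go_terms(abstract, knowledge_base, k=3):
    # knowledge_base: list of (curated abstract text, list of GO labels).
    query = Counter(tokenize(abstract))
    scored = sorted(
        ((cosine(query, Counter(tokenize(text))), terms)
         for text, terms in knowledge_base),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Each of the k nearest abstracts votes for its labels, weighted by similarity.
    votes = defaultdict(float)
    for similarity, terms in scored[:k]:
        for term in terms:
            votes[term] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    return [term for term, _ in sorted(votes.items(), key=lambda p: (-p[1], p[0]))]


def recall_at_k(ranked, gold, k=20):
    # Fraction of the curated (gold) terms found among the top-k proposals.
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


# Toy knowledge base (hypothetical texts and placeholder labels, not GOA data).
TOY_KB = [
    ("caspase activation triggers apoptosis in neurons", ["T_APOPTOSIS"]),
    ("dna repair after double strand breaks", ["T_DNA_REPAIR"]),
    ("caspase dependent apoptosis and dna damage response",
     ["T_APOPTOSIS", "T_DNA_DAMAGE"]),
]

if __name__ == "__main__":
    ranked = rank_go_terms("caspase mediated apoptosis", TOY_KB, k=2)
    print(ranked)
    print(recall_at_k(ranked, ["T_APOPTOSIS"], k=20))
```

In this sketch the ranking can only improve as more curated abstracts are added to the knowledge base, which mirrors the abstract's point that the machine learning system benefits from the growth of GOA.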

Keywords

Algorithms
Data Mining/methods
Databases, Genetic
Knowledge Bases
Molecular Sequence Annotation

Affiliation entities

Faculté de médecine / Section de médecine clinique / Département de radiologie et informatique médicale

Research groups

Interfaces Homme-machine en milieu clinique (610)

Citation (ISO format)

GOBEILL, Julien et al. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. In: Database, 2013, vol. 2013, p. bat041. doi: 10.1093/database/bat041

Article (Published version)

Identifiers

PID : unige:33308
DOI : 10.1093/database/bat041
PMID : 23842461

Journal ISSN : 1758-0463

Creation : 01/11/2013 16:24:00

First validation : 01/11/2013 16:24:00

Update : 14/03/2023 20:51:19

Status update : 14/03/2023 20:51:18

Last indexation : 30/10/2024 15:52:48
