Term based comparison metrics for controlled and uncontrolled indexing languages

Introduction. We define a collection of metrics for describing and comparing sets of terms in controlled and uncontrolled indexing languages, and then show how these metrics can be used to characterize a set of languages spanning folksonomies, ontologies and thesauri. Method. Metrics for term-set characterization and comparison were identified, and programs for their computation were implemented. These programs were then used to identify descriptive features of term sets from twenty-two different indexing languages and to measure the direct overlap between their terms. Analysis. The computed data were analysed using manual and automated techniques, including visualization, clustering and factor analysis. Distinct subsets of the metrics were sought that could be used to distinguish between the uncontrolled languages produced by social tagging systems (folksonomies) and the controlled languages produced using professional labour. Results. The metrics proved sufficient to differentiate between instances of different languages and to enable the identification of term-set patterns associated with indexing languages produced by different kinds of information system. In particular, distinct groups of term-set features appear to distinguish folksonomies from the other languages. Conclusions. The metrics organized here and embodied in freely available programs provide an empirical lens useful in beginning to understand the relationships that hold between different controlled and uncontrolled indexing languages.
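The abstract mentions measuring the direct overlap between term sets but does not specify the formulas. As a minimal illustrative sketch only, the Python below computes a few common set-overlap measures. The function names (`normalize`, `overlap_metrics`), the case-folding normalization, the choice of measures, and the sample terms are all assumptions made for illustration; they are not taken from the authors' programs.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(term.lower().split())


def overlap_metrics(terms_a, terms_b):
    """Return simple direct-overlap measures between two term collections."""
    a = {normalize(t) for t in terms_a}
    b = {normalize(t) for t in terms_b}
    shared = a & b  # terms present in both languages after normalization
    union = a | b
    return {
        "shared": len(shared),
        # Fraction of each language's terms that also occur in the other.
        "coverage_of_a": len(shared) / len(a) if a else 0.0,
        "coverage_of_b": len(shared) / len(b) if b else 0.0,
        # Jaccard index: shared terms over all distinct terms.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical sample data: a few folksonomy tags and thesaurus terms.
folksonomy_tags = ["Cell", "apoptosis", "semantic web", "toread"]
thesaurus_terms = ["cell", "Apoptosis", "Cell Nucleus"]
metrics = overlap_metrics(folksonomy_tags, thesaurus_terms)
print(metrics)
```

The two coverage values are reported separately because overlap is asymmetric when the term sets differ greatly in size, which is plausible when a large folksonomy is compared with a small controlled vocabulary.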
