Indexing a Web site with a terminology-oriented ontology

This article presents a new approach to indexing a Web site. It combines ontologies and natural language techniques for information retrieval on the Internet. The main goal is to build a structured index of the Web site. The structure is provided by a terminology-oriented ontology of a domain that is chosen a priori according to the content of the Web site. First, the indexing process uses improved natural language techniques to extract well-formed terms, taking HTML markers into account. Second, a thesaurus is used to associate candidate concepts with each term, which makes it possible to reason at a conceptual level. Next, each candidate concept is evaluated for its capacity to represent the page by determining its level of representativeness of that page. Then the structured index itself is built: each concept of the ontology is linked to the pages of the Web site in which it is found. Finally, a number of indicators make it possible to evaluate how well the suggested ontology indexes the Web site.
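The steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag weights, the toy thesaurus, the threshold, and the names `TermExtractor`, `concept_scores`, and `build_index` are all assumptions made for illustration. Naive substring matching stands in for the linguistic term extraction, and a normalized weighted count stands in for the representativeness measure.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights for HTML markers (assumption: titles and headings matter more).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}

# Toy thesaurus mapping terms to candidate concepts (invented for illustration).
THESAURUS = {
    "neural network": ["machine_learning"],
    "ontology": ["knowledge_representation"],
    "thesaurus": ["knowledge_representation", "lexicography"],
}


class TermExtractor(HTMLParser):
    """Accumulates weighted occurrences of thesaurus terms in a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags that carry a weight
        self.term_weights = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Remove the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        text = " ".join(data.lower().split())
        # Text outside any weighted tag gets the default weight 1.0.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1.0)
        for term in THESAURUS:
            # Naive substring count; a real system would extract well-formed terms.
            self.term_weights[term] += text.count(term) * weight


def concept_scores(html):
    """Returns each candidate concept's share of the page's weighted term mass."""
    parser = TermExtractor()
    parser.feed(html)
    parser.close()
    scores = defaultdict(float)
    for term, weight in parser.term_weights.items():
        for concept in THESAURUS[term]:
            scores[concept] += weight
    total = sum(scores.values())
    if total == 0:
        return {}
    return {c: s / total for c, s in scores.items() if s > 0}


def build_index(pages, threshold=0.3):
    """Attaches to each concept the pages whose score reaches the threshold."""
    index = defaultdict(list)
    for url, html in pages.items():
        for concept, score in concept_scores(html).items():
            if score >= threshold:
                index[concept].append(url)
    return dict(index)
```

Calling `build_index` on a dictionary that maps page URLs to HTML strings returns a mapping from concept names to the pages they represent. Concepts whose share on a page falls below the threshold are not attached to that page.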
