Discovering Semantic Sibling Groups from Web Documents with XTREEM-SG

The acquisition of explicit semantics remains a research challenge. Approaches to extracting semantics focus mostly on learning hierarchical hypernym-hyponym relations. The extraction of co-hyponym and co-meronym sibling semantics receives far less attention, although these relations are no less important in ontology engineering. In this paper we describe and evaluate the XTREEM-SG (Xhtml TREE Mining – for Sibling Groups) approach for finding sibling semantics in semi-structured Web documents. XTREEM takes advantage of the added value of the mark-up available in Web content to group text siblings, and we show that this grouping is semantically meaningful. The XTREEM-SG approach is domain- and language-independent: it does not rely on background knowledge, NLP software, or training. We apply XTREEM-SG and evaluate its output against the reference semantics of two gold standard ontologies. We investigate how variations in input, parameters, and reference influence the results obtained when structuring a closed vocabulary by sibling relations. Earlier methods that evaluate sibling relations against a gold standard report an F-measure of 14.18%; our method raises this to 21.47%.
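
The idea of using mark-up to group text siblings can be illustrated with a minimal sketch. It is not the XTREEM-SG procedure itself, which the abstract does not specify. It assumes that texts held by same-tag children of a common parent element form a candidate sibling group. The function name `sibling_groups` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def sibling_groups(xhtml: str) -> list[list[str]]:
    """Collect candidate sibling groups from a well-formed XHTML string.

    Assumption (not taken from the paper): texts that sit in same-tag
    children of one parent element are treated as siblings.
    """
    root = ET.fromstring(xhtml)
    groups = []
    for parent in root.iter():
        by_tag = defaultdict(list)
        for child in parent:
            text = (child.text or "").strip()
            if text:
                by_tag[child.tag].append(text)
        # A single text has no siblings, so keep groups of two or more.
        groups.extend(g for g in by_tag.values() if len(g) >= 2)
    return groups


# Hypothetical input document, for illustration only.
doc = (
    "<html><body>"
    "<ul><li>Hotel</li><li>Hostel</li><li>Campsite</li></ul>"
    "<p>Welcome</p>"
    "</body></html>"
)
print(sibling_groups(doc))
```

In this sketch the three list items end up in one group, while the lone paragraph text is discarded because it has no same-tag sibling. No dictionary, parser for natural language, or training data is involved, which is the property the abstract emphasizes.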
