Multilingual document mining and navigation using self-organizing maps

One major approach to finding information on the WWW is to navigate Web directories and browse them until the goal pages are found. However, such directories are generally constructed manually and may suffer from narrow coverage and inconsistency. Moreover, most existing directories provide only monolingual hierarchies that organize Web pages in terms a user may not be familiar with. In this work, we propose an approach that automatically arranges multilingual Web pages into a multilingual Web directory to break the language barrier in Web navigation. In this approach, a self-organizing map is trained on each set of monolingual Web pages to obtain two feature maps, which reveal the relationships among Web pages and among thematic keywords, respectively, for that language. We then apply a hierarchy generation process to these maps to obtain a monolingual hierarchy for the Web pages. A hierarchy alignment method is then applied to the monolingual hierarchies to discover associations between nodes in different hierarchies. Finally, a multilingual Web directory is constructed according to these associations. We applied the proposed approach to a set of Web pages and obtained results that demonstrate the feasibility of our method for multilingual Web navigation.
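The first step described above, training a self-organizing map on one language's pages and reading off a page map and a keyword map, can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the toy vocabulary, the binary document vectors, the linear decay schedules, and the rule that assigns each keyword to the neuron with the largest weight for it are all assumptions made for the example, as are the function names.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed shrinking radius
        for x in vectors:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def page_map(weights, vectors):
    """Label each neuron with the indices of the pages it wins."""
    labels = {node: [] for node in weights}
    for idx, x in enumerate(vectors):
        labels[best_matching_unit(weights, x)].append(idx)
    return labels


def keyword_map(weights, vocab):
    """Label each neuron with the keywords for which its weight is largest
    (an assumed labeling rule, used here only for illustration)."""
    labels = {node: [] for node in weights}
    for i, word in enumerate(vocab):
        winner = max(weights, key=lambda node: weights[node][i])
        labels[winner].append(word)
    return labels


# Hypothetical toy corpus: one binary keyword vector per Web page.
vocab = ["soccer", "goal", "stock", "market"]
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]

som = train_som(docs, rows=2, cols=2)
doc_labels = page_map(som, docs)
kw_labels = keyword_map(som, vocab)
print(doc_labels)
print(kw_labels)
```

Running one such map per language yields the per-language page and keyword clusters from which a monolingual hierarchy could then be generated; the hierarchy generation and alignment steps are not sketched here because the abstract does not specify them.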
