Identifying Latent Semantics in High-Dimensional Web Data

Search engines have become an indispensable tool for finding relevant information on the Web. A search engine often returns a large number of results, including many irrelevant items that make the result set hard to interpret. Search engines therefore need to be enhanced to discover the latent semantics in high-dimensional web data. This paper presents a novel framework for that purpose, the Latent Semantic Manifold (LSM), together with its implementation and evaluation. LSM is a mixture model based on concepts from topology and probability. It finds the latent semantics in web data and represents them as homogeneous groups. In experimental evaluation, LSM outperformed the other frameworks it was compared against. In addition, we used the framework to build a tool, which was deployed for two years at two sites in Taiwan, a library and a biomedical engineering laboratory, where it helped researchers run semantic searches of the PubMed database. The evaluation and deployment of LSM suggest that the framework could enhance currently available search engines by discovering latent semantics in high-dimensional web data.
