Short Text Hashing Improved by Integrating Topic Features and Tags

Hashing is an efficient approach that has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts, because such texts are both sparse and short. Recently, some researchers have tried to construct semantic relationships using topics of a certain granularity. However, topics of a single granularity are insufficient to preserve the optimal semantic similarity across different types of datasets. In addition, tag information should be fully exploited to enhance the similarity of related texts. We therefore propose a novel unified hashing approach in which the optimal topic features are selected automatically and integrated with the original features to preserve similarity, and in which tags are fully utilized to improve hash code learning. We carried out extensive experiments on one short-text dataset and one normal-length text dataset. The results demonstrate that our approach is effective and significantly outperforms baseline methods on several evaluation metrics.
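As a rough illustration of the general idea of integrating topic features with keyword features before hashing, the following minimal sketch concatenates the two feature sets and binarizes them. It is not the method proposed here: the abstract does not specify the learning procedure, so random-hyperplane (sign random projection) hashing is swapped in as a stand-in, tags are not used, and the helper names `combine_features`, `random_hyperplane_codes`, and `hamming_distance`, the `topic_weight` parameter, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def combine_features(keyword_feats, topic_feats, topic_weight=1.0):
    """Concatenate row-normalized keyword and topic features.

    Hypothetical helper: topic_weight is an assumed knob that controls
    how strongly the topic features influence the combined vector.
    """
    def l2_normalize(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.maximum(norms, 1e-12)  # guard against all-zero rows

    return np.hstack([l2_normalize(keyword_feats),
                      topic_weight * l2_normalize(topic_feats)])


def random_hyperplane_codes(features, n_bits=16, seed=0):
    """Binarize features with random hyperplanes (a stand-in for learned hashing)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((features.shape[1], n_bits))
    return (features @ planes > 0).astype(np.uint8)


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))


# Toy data (made up): three short texts over a six-word vocabulary.
# Texts 0 and 1 share no keywords but have similar topic distributions.
keyword_feats = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
])
topic_feats = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
])

combined = combine_features(keyword_feats, topic_feats, topic_weight=2.0)
codes = random_hyperplane_codes(combined, n_bits=16, seed=0)
print(hamming_distance(codes[0], codes[1]), hamming_distance(codes[0], codes[2]))
```

With keyword features alone, texts 0 and 1 have zero cosine similarity; adding the topic block gives them a shared component, which is the kind of signal a similarity-preserving hash can then pick up.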
