Automatic identification of topic tags from texts based on expansion-extraction approach

Identifying the topics of a textual document is useful for many purposes. Documents in digital libraries can be organized by topic, and then browsed and searched by topic. Examining the topics of a document also gives a quick sense of what the document is about. Manual topic tagging is labor-intensive, so computer-based solutions have been developed to augment it. This dissertation describes the design and development of a topic identification approach, applied here to disaster events. In a sense, this study represents the marriage of research analysis with an engineering effort, in that it combines inspiration from Cognitive Informatics with a practical model from Information Retrieval. One design constraint is that the Web was used as a universal knowledge source, which was essential for accessing the information required to infer topics from texts. Specific information of interest was retrieved from this vast source by querying a search engine’s application programming interface. The gathered information was then processed mainly by means of the Vector Space Model from the Information Retrieval field. As a proof of concept, we developed and evaluated a prototype tool, Xpantrac, which can run in batch mode to process text documents automatically. A user interface for Xpantrac was also constructed to support an interactive, semi-automatic topic tagging application, which was subsequently assessed in a usability study. Throughout the design, development, and evaluation of these study components, we detail how the hypotheses of this dissertation have been supported and its research questions answered. We also show that our overarching goal, the identification of topics in a human-comparable way without depending on a large training set or corpus, has been achieved.
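
As a rough illustration of the Vector Space Model mentioned above, the following minimal sketch represents short texts as TF-IDF vectors and compares them by cosine similarity. It is not Xpantrac's implementation: the function names, the whitespace tokenizer, the particular TF-IDF weighting, and the sample snippets (standing in for text gathered from a search engine) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (an assumption for brevity);
    # a real system would also strip punctuation and remove stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = raw term frequency * log(N / df); one common variant.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical snippets standing in for text returned by a search engine.
snippets = [
    "earthquake damage reported near the coast",
    "earthquake aftershocks shake the coast",
    "stock markets close higher today",
]
vecs = tfidf_vectors(snippets)
```

With these sample snippets, the two earthquake texts share weighted terms and so score a higher cosine similarity with each other than either does with the unrelated finance text.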
