LiCord: Language Independent Content Word Finder

Content Words (CWs) are important segments of a text. In text mining, they are used for purposes such as topic identification, document summarization, and question answering. Identifying CWs usually requires various language-dependent tools. However, such tools are not available for many languages, and developing them for every language is costly. At the same time, because text content in many languages has grown rapidly in recent years, language-independent text mining has great potential. Mining text automatically therefore requires a way of finding CWs that does not depend on language-specific tools. In this research, we devise a framework that identifies which text segments are CWs in a language-independent way. We identify structural features that relate text segments to CWs. We compute these features over a large text corpus and apply a machine-learning classifier that labels each segment as a CW or not. The proposed framework uses only a large text corpus and some training examples; beyond these, it requires no language-specific tool. We evaluate the framework on three languages (English, Vietnamese, and Indonesian) and find that it achieves more than 83% accuracy.
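The pipeline described above (corpus-derived structural features followed by a trained classifier) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the abstract does not name its features or its classifier, so the three features used here (log relative frequency, word length, neighbor diversity) are hypothetical, and a nearest-centroid rule stands in for whatever learner the framework actually uses. The function names and the toy corpus are likewise invented.

```python
import math
from collections import Counter, defaultdict


def corpus_features(sentences):
    """Compute hypothetical structural features for every word in a tokenized corpus."""
    freq = Counter()
    left = defaultdict(set)   # distinct words seen immediately before each word
    right = defaultdict(set)  # distinct words seen immediately after each word
    total = 0
    for sent in sentences:
        for i, word in enumerate(sent):
            freq[word] += 1
            total += 1
            if i > 0:
                left[word].add(sent[i - 1])
            if i + 1 < len(sent):
                right[word].add(sent[i + 1])
    feats = {}
    for word, count in freq.items():
        feats[word] = (
            math.log(count / total),                           # log relative frequency
            len(word),                                         # length in characters
            (len(left[word]) + len(right[word])) / (2 * count),  # neighbor diversity
        )
    return feats


def train_centroids(feats, labels):
    """Average the feature vectors of the labeled words, one centroid per class."""
    sums = {}
    counts = Counter()
    for word, label in labels.items():
        vec = feats[word]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, value in enumerate(vec):
            acc[j] += value
        counts[label] += 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


# Toy data: a tiny whitespace-tokenized corpus and a few labeled training words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["a", "bird", "sang", "on", "the", "roof"],
]
labels = {"the": False, "on": False, "cat": True, "bone": True}  # True = content word

feats = corpus_features(sentences)
centroids = train_centroids(feats, labels)
for word in ("the", "roof"):
    print(word, classify(feats[word], centroids))
```

Nothing in the sketch consults a tokenizer model, tagger, or lexicon for a particular language; it needs only tokenized text and a handful of labeled words, which is the property the abstract emphasizes. A realistic setup would scale the features and use a far larger corpus.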
