Paper Information - Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing

Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing

Building a collection of electronic documents, i.e. a corpus, is a cornerstone of research in information retrieval, text mining and knowledge management. In the literature, very few papers have discussed the concerns involved in building a corpus or explained the building process systematically. In this paper, we explain our work on building an enterprise corpus, called Manufacturing Corpus Version 1 (MCV1), for corporate knowledge management purposes. Relevant issues, e.g. input texts, category labels and policies, as well as its parallel coding process and quality measurements, are discussed. Real-world automated text classification experiments based on MCV1 show the soundness of its coding process. Finally, suggestions are made on how the proposed approach can be implemented in a more economical manner.

Han Tong Loh | Ying Liu
