Comparing methods to extract technical content for technological intelligence

We are developing indicators of emergence for science and technology (S&T) topics. We target a range of S&T information resources, including metadata (i.e., bibliographic information) and full text. We explore two alternative text analysis approaches, principal components analysis (PCA) and topic modeling, to extract technical topic information, and we analyze that topical content to identify potential applications and innovation pathways. In this presentation we compare alternative ways of consolidating messy sets of key terms, e.g., terms obtained by applying Natural Language Processing (NLP) to abstracts and titles, together with various keyword sets. Our process uses combinations of stopword removal, fuzzy term matching, association rules, and tf-idf weighting. We then compare the PCA results with the topic modeling results. Our key test set consists of 4104 Web of Science records on Dye-Sensitized Solar Cells (DSSCs). Results suggest good potential to enhance the technical intelligence payoff from database searches on topics of interest.
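
As a rough illustration of three of the consolidation steps named above (stopword removal, fuzzy term matching, and tf-idf weighting), the following minimal Python sketch runs them on three invented record titles. It is not the authors' pipeline: the stopword list, the 0.85 similarity threshold, the toy records, and the function names tokenize, consolidate, and tfidf are all assumptions made for illustration. Association rules, PCA, and topic modeling are left out.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    words = text.lower().replace("-", " ").split()
    return [w for w in words if w.isalnum() and w not in STOPWORDS]


def consolidate(terms, threshold=0.85):
    """Map each term to the first-seen term it closely resembles."""
    representatives = []
    canon = {}
    for term in terms:
        match = next(
            (r for r in representatives
             if SequenceMatcher(None, term, r).ratio() >= threshold),
            None,
        )
        if match is None:
            representatives.append(term)
            match = term
        canon[term] = match
    return canon


def tfidf(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    return [
        {t: (count / len(doc)) * math.log(n_docs / doc_freq[t])
         for t, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy records (invented for this sketch, not taken from the 4104-record set).
records = [
    "Dye sensitized solar cells with TiO2 electrodes",
    "Electrode materials for dye sensitised solar cell",
    "Topic modeling of solar cell abstracts",
]

tokenized = [tokenize(r) for r in records]
vocabulary = list(dict.fromkeys(t for doc in tokenized for t in doc))
canon = consolidate(vocabulary)
merged = [[canon[t] for t in doc] for doc in tokenized]
weights = tfidf(merged)

for record, w in zip(records, weights):
    top_terms = sorted(w, key=w.get, reverse=True)[:3]
    print(record, "->", top_terms)
```

In this sketch the spelling variants "sensitised" and "cell" fold into the first-seen forms "sensitized" and "cells", and terms that occur in every record (such as "solar") receive a tf-idf weight of zero, which is why they drop out of the top-term lists.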
