Mining Higher-Order Association Rules from Distributed Named Entity Databases

The growing volume of textual data in distributed sources, combined with the obstacles to creating and maintaining central repositories, motivates the need for effective distributed information extraction and mining techniques. As the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (D-ARM) algorithms have been developed. These algorithms, however, assume that the databases are either horizontally or vertically distributed. When databases are populated with information extracted from textual data, the distribution is often neither purely horizontal nor purely vertical but a hybrid of the two, and existing D-ARM algorithms cannot discover rules based on higher-order associations between items in such distributed textual documents. In this article we present D-HOTM, a framework for Distributed Higher Order Text Mining. Unlike existing algorithms, D-HOTM requires neither full knowledge of the global schema nor that the distribution of data be horizontal or vertical. D-HOTM discovers rules based on higher-order associations between distributed database records containing the extracted entities. Two approaches to the definition and discovery of higher-order itemsets are presented. The implementation of D-HOTM is based on the TMI [20] and was tested on a cluster at the National Center for Supercomputing Applications (NCSA). Results on a real-world dataset from the Richmond, VA police department demonstrate the performance and relevance of D-HOTM in law enforcement and homeland defense.
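To make the idea of a higher-order association concrete, the following is a minimal Python sketch, not the D-HOTM algorithm itself. It assumes that two records are linked when they share at least one extracted entity, and that a higher-order itemset is the union of the entities along a chain of linked records. The abstract does not spell out either of the two definitions the article presents, so this linking rule, the sample records, and the names `link_graph`, `chains`, and `higher_order_itemsets` are all hypothetical.

```python
from itertools import combinations

# Hypothetical records: (site, record id, set of extracted entities).
# The sites hold different attributes for different records, so the data
# is neither purely horizontally nor purely vertically distributed.
records = [
    ("site_A", "r1", {"suspect:J.Smith", "weapon:knife"}),
    ("site_B", "r2", {"suspect:J.Smith", "vehicle:red sedan"}),
    ("site_B", "r3", {"vehicle:red sedan", "location:Main St"}),
    ("site_C", "r4", {"weapon:bat"}),
]


def link_graph(records):
    """Adjacency map: two records are linked if they share an entity (assumption)."""
    adj = {rid: set() for _, rid, _ in records}
    for (_, a, ents_a), (_, b, ents_b) in combinations(records, 2):
        if ents_a & ents_b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def chains(adj, max_links):
    """Yield every simple path of linked records that uses 1..max_links links."""
    def extend(path):
        if len(path) > 1:
            yield tuple(path)
        if len(path) - 1 == max_links:
            return
        for nxt in sorted(adj[path[-1]]):
            if nxt not in path:
                yield from extend(path + [nxt])

    for start in sorted(adj):
        yield from extend([start])


def higher_order_itemsets(records, max_links=2):
    """Union the entities along each chain; duplicates collapse in the set."""
    items = {rid: ents for _, rid, ents in records}
    adj = link_graph(records)
    return {
        frozenset().union(*(items[r] for r in path))
        for path in chains(adj, max_links)
    }


first_order = higher_order_itemsets(records, max_links=1)
second_order = higher_order_itemsets(records, max_links=2)
```

In this toy data, `weapon:knife` and `location:Main St` never occur in the same record or in directly linked records, so they appear together only in `second_order`, through the chain r1, r2, r3. Frequent itemsets and rules would then be mined over such itemsets; that step is omitted here.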

[1]  A. Akhmetova.  Discovery of Frequent Episodes in Event Sequences , 2006 .

[2]  Luc De Raedt,et al.  Mining Association Rules in Multiple Relations , 1997, ILP.

[3]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[4]  Srinivasan Parthasarathy,et al.  Parallel and distributed methods for incremental frequent itemset mining , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[5]  Carla E. Brodley,et al.  KDD-Cup 2000 organizers' report: peeling the onion , 2000, SKDD.

[6]  Walid G. Aref.  Mining Association Rules in Large Databases , 2004 .

[7]  Gu Si-yang,et al.  Privacy preserving association rule mining in vertically partitioned data , 2006 .

[8]  Takeaki Uno,et al.  An Output Linear Time Algorithm for Enumerating Chordless Cycles , 2003 .

[9]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[10]  Michael R. Genesereth,et al.  Infomaster: an information integration system , 1997, SIGMOD '97.

[11]  Christos Faloutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[12]  Murat Kantarcioglu,et al.  Mining Cyclically Repeated Patterns , 2001, DaWaK.

[13]  Jiawei Han,et al.  A fast distributed algorithm for mining association rules , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[14]  William M. Pottenger,et al.  Distributed higher order association rule mining using information extracted from textual data , 2005, SKDD.

[15]  Heikki Mannila,et al.  Levelwise Search and Borders of Theories in Knowledge Discovery , 1997, Data Mining and Knowledge Discovery.

[16]  Walid G. Aref,et al.  Incremental, online, and merge mining of partial periodic patterns in time-series databases , 2004, IEEE Transactions on Knowledge and Data Engineering.

[17]  Yücel Saygin,et al.  Privacy preserving association rule mining , 2002, Proceedings Twelfth International Workshop on Research Issues in Data Engineering: Engineering E-Commerce/E-Business Systems RIDE-2EC 2002.

[18]  L. Stein,et al.  OWL Web Ontology Language - Reference , 2004 .

[19]  Philip S. Yu,et al.  Mining Surprising Periodic Patterns , 2004, Data Mining and Knowledge Discovery.

[20]  Yannis Manolopoulos,et al.  Similarity Search in Time Series Databases , 2005, Encyclopedia of Database Technologies and Applications.

[21]  William M. Pottenger,et al.  A Software Infrastructure for Research in Textual Data Mining , 2004, Int. J. Artif. Intell. Tools.

[22]  Philip S. Yu,et al.  Mining Asynchronous Periodic Patterns in Time Series Data , 2003, IEEE Trans. Knowl. Data Eng.

[23]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[24]  Jaideep Vaidya.  Vertically Partitioned Data , 2009, Encyclopedia of Database Systems.

[25]  William M. Pottenger,et al.  A software infrastructure for research in textual data mining , 2003, Proceedings. 15th IEEE International Conference on Tools with Artificial Intelligence.

[26]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[27]  Chris Clifton,et al.  Privacy-preserving data integration and sharing , 2004, DMKD '04.

[28]  Ada Wai-Chee Fu,et al.  Efficient time series matching by wavelets , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[29]  Jiawei Han,et al.  Efficient mining of partial periodic patterns in time series database , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[30]  Mafruz Zaman Ashrafi.  Distributed Association Rule Mining , 2009, Encyclopedia of Data Warehousing and Mining.

[31]  David Taniar,et al.  ODAM: An optimized distributed association rule mining algorithm , 2004, IEEE Distributed Systems Online.

[32]  Christian Borgelt,et al.  Induction of Association Rules: Apriori Implementation , 2002, COMPSTAT.

[33]  Eamonn J. Keogh,et al.  Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases , 2001, Knowledge and Information Systems.

[34]  William M. Pottenger,et al.  A framework for understanding Latent Semantic Indexing (LSI) performance , 2006, Inf. Process. Manag.

[35]  Jaideep Srivastava,et al.  Indirect Association: Mining Higher Order Dependencies in Data , 2000, PKDD.

[36]  Ann Q. Gates,et al.  TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING , 2005 .

[37]  Philip S. Yu,et al.  Mining asynchronous periodic patterns in time series data , 2000, KDD '00.

[38]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[39]  William M. Pottenger,et al.  A semi-supervised active learning algorithm for information extraction from textual data , 2005, J. Assoc. Inf. Sci. Technol.

[40]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng.

[41]  Jaideep Vaidya,et al.  Privacy preserving association rule mining in vertically partitioned data , 2002, KDD.

[42]  Ran Wolff,et al.  Communication-efficient distributed mining of association rules , 2001, SIGMOD '01.

[43]  X.S. Wang,et al.  Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences , 1998, IEEE Trans. Knowl. Data Eng.