Category-Based Infidelity Bounded Queries over Unstructured Data Streams

We present Caicos, a system that supports continuous infidelity-bounded queries over a data stream in which each data item belongs to multiple categories. Caicos is built on four primitives: keywords, queries, data items, and categories. A category is a virtual entity consisting of all the data items that belong to it; membership is decided by evaluating a Boolean predicate, associated with the category, over the data item. Each data item and each query is in turn associated with multiple keywords. Given a keyword query, conventional techniques for querying unstructured data return the top-K documents, whereas Caicos returns the top-K categories with infidelity below a user-specified bound. Caicos is designed to continuously track the evolving information in a highly dynamic data stream. It therefore computes the relevance of a category to a continuous keyword query from the data items that occurred in the stream in the recent past (i.e., within the current "window"). To provide up-to-date answers to continuous queries efficiently, Caicos must maintain the required metadata accurately. This requires addressing two subproblems: 1) identifying the "right" metadata that needs to be updated to provide accurate results, and 2) updating that metadata efficiently. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality-constrained minimization problem and propose an iterative algorithm to solve it. For the second subpart, we build an efficient dynamic-programming-based algorithm that finds the right metadata to update. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input, so we propose an approximate multiprocessor scheduling algorithm. 
Experimental evaluation on real-world dynamic data shows that Caicos provides fidelity close to 100 percent while using 45 percent fewer resources than techniques proposed in the literature. This ability to work accurately and efficiently even at high data arrival rates makes Caicos suitable for data-intensive application domains.
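
The data model described above (categories defined by Boolean predicates, relevance computed over the current window, and a top-K answer over categories rather than documents) can be illustrated with a minimal Python sketch. It is not the Caicos implementation: the names `DataItem`, `Category`, and `WindowedCategoryIndex` are hypothetical, the relevance score is a placeholder (the count of query keywords matched by in-window items of a category), and every answer is recomputed exactly from scratch, so the sketch contains no infidelity bound, no selective metadata refresh, and no scheduling.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class DataItem:
    timestamp: float            # arrival time in seconds (assumed unit)
    keywords: FrozenSet[str]    # keywords associated with the item


@dataclass(frozen=True)
class Category:
    name: str
    predicate: Callable[[DataItem], bool]  # Boolean membership test


class WindowedCategoryIndex:
    """Keeps the items of the current time window and ranks categories."""

    def __init__(self, categories: Iterable[Category], window_seconds: float):
        self.categories: List[Category] = list(categories)
        self.window_seconds = window_seconds
        self.window: Deque[DataItem] = deque()

    def add(self, item: DataItem) -> None:
        # Assumes items arrive in non-decreasing timestamp order.
        self.window.append(item)
        cutoff = item.timestamp - self.window_seconds
        while self.window and self.window[0].timestamp < cutoff:
            self.window.popleft()  # drop items that left the window

    def top_k(self, query_keywords: Iterable[str], k: int) -> List[Tuple[str, int]]:
        query = frozenset(query_keywords)
        scores: Dict[str, int] = {}
        for item in self.window:
            overlap = len(query & item.keywords)  # placeholder relevance
            if overlap == 0:
                continue
            for category in self.categories:
                if category.predicate(item):  # an item may match several
                    scores[category.name] = scores.get(category.name, 0) + overlap
        # Highest score first; ties broken by category name.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:k]


categories = [
    Category("sports", lambda d: "match" in d.keywords or "goal" in d.keywords),
    Category("finance", lambda d: "stock" in d.keywords),
    Category("europe", lambda d: "london" in d.keywords),
]
index = WindowedCategoryIndex(categories, window_seconds=60)
index.add(DataItem(0.0, frozenset({"stock", "london"})))   # expires at t=100
index.add(DataItem(50.0, frozenset({"goal", "london"})))
index.add(DataItem(100.0, frozenset({"match", "goal"})))
result = index.top_k({"london", "goal"}, k=2)
print(result)  # [('sports', 3), ('europe', 2)]
```

Recomputing every category score on each query is the naive baseline; the abstract's point is that a system under high arrival rates cannot afford this and must instead choose which metadata to refresh so that the answer stays within the user's infidelity bound.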
