ASTERIX: scalable warehouse-style web data integration

A growing wealth of digital information is generated daily in social networks, blogs, online communities, and similar sources. Organizations and researchers in a wide variety of domains recognize that there is tremendous value and insight to be gained by warehousing this emerging data and making it available for querying, analysis, and other purposes. This new breed of "Big Data" applications places demanding requirements on data management platforms in terms of scalability, flexibility, manageability, and analysis capabilities. At UC Irvine, we are building a next-generation database system, called ASTERIX, in response to these trends. We present ongoing work that addresses the following questions: How does data get into the system? What primitives should we provide to better cope with dirty or noisy data? How can we support efficient analysis of spatial data? Using real examples, we show the capabilities of ASTERIX for ingesting data via feeds, supporting set-similarity predicates for fuzzy matching, and answering spatial aggregation queries.
