Web Data Management

The Internet and World Wide Web have revolutionized access to information. Users now store information across multiple platforms, from personal computers to smartphones to websites such as YouTube and Picasa. As a consequence, data management concepts, methods, and techniques are increasingly focused on distribution concerns. Now that information largely resides in the network, so do the tools that process it. This book explains the foundations of XML, the Web standard for data management, with a focus on data distribution. It covers the many facets of distributed data management on the Web, such as description logics, that are already emerging in today's data integration applications and herald tomorrow's semantic Web. It also introduces the machinery used to manipulate the unprecedented amount of data collected on the Web. Several 'Putting into Practice' chapters describe detailed practical applications of the technologies and techniques. Striking a balance between the conceptual and the practical, the book will serve as an introduction to the new global information systems for Web professionals as well as for master's-level courses.
