A user-driven annotation framework for scientific data

Annotations play an increasingly crucial role in scientific exploration and discovery as the amount of data and the level of collaboration among scientists increase. Many systems today focus on annotation management, querying, and propagation. Although all such systems are implemented to take user input (i.e., the annotations themselves), very few are user-driven in the sense of taking into account user preferences on how annotations should be propagated and applied over data. In this thesis, we propose to treat annotations as first-class citizens for scientific data by introducing a user-driven, view-based annotation framework. Under this framework, we address two critical questions. First, how do we support annotations that scale from both a system point of view and a user point of view? Second, how do we support annotation queries, from both an annotator point of view and a user point of view, efficiently and accurately? To address these challenges, we propose the VIew-based annotation Propagation (ViP) framework, which empowers users to express their preferences over the time semantics and the network semantics of annotations, and which defines three query types for annotations. To support this novel functionality efficiently, ViP utilizes database views and introduces new annotation caching techniques. The use of views also yields a more compact representation of annotations, making our system easier to scale. Through an extensive experimental study on a real system (with both synthetic and real data), we show that the ViP framework can seamlessly introduce user-driven annotation propagation semantics while significantly improving performance, measured as query execution time, over the current state of the art.
