Summarizing Massive Information for Querying Web Sources and Data Streams

Author(s): Mousavi, Hamid | Advisor(s): Zaniolo, Carlo | Abstract: Largely as a result of advances brought by the Web and related technologies, we are now experiencing tremendous growth in the volume of data streaming between, and stored at, many nodes of the Internet. This "Big Data" revolution underscores the importance of summarization in general, and in particular in two new application areas that are rich in practical significance and interesting research challenges. Indeed, while summarization techniques, including sampling, histograms, and quantiles, remain critical in analyzing large data sets and optimizing queries in traditional databases, new techniques are needed to address the following two problems. The first is that, in addition to summarization techniques for stored data, we now need online (continuous) summaries for streaming data, e.g., real-time online histograms. When dealing with massive data streams and fast-changing distributions, summaries must be updated quickly with newly arrived data in order to reflect the most recent portion (window) of the data stream. The second problem is that the Web stores large corpora of structured, semi-structured, and unstructured (free-text) documents, and these documents are subject to the ambiguities of natural language and the challenges they pose to machine processing. This situation has so far severely limited the ability of smart applications to use the information contained in Web pages, as needed to realize the Semantic Web vision. It is, however, clear that many of these limitations can be overcome, and advanced search and analysis applications can be supported, if the knowledge in each Web page can be summarized into a standard machine-friendly structure. In this dissertation, we attack these two difficult problems by proposing fast summarization techniques for (i) scalar information in data streams and (ii) textual information in Web pages.
For scalar data, we present lightweight and fast synopses, namely histograms, combined with various sampling approaches in order to implement more practical summarization techniques over massive data sets and data streams. To our knowledge, this technique provides the most accurate online histograms for data streams with sliding windows. For textual documents, we introduce several techniques and systems that extract structured summaries from unstructured text and use these summaries to complete existing structured summaries and to improve their consistency.
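To make the notion of an online histogram over a sliding window concrete, the following is a minimal Python sketch of an exact equi-depth histogram maintained over the most recent `window_size` values. It is a naive baseline for illustration only, not the technique proposed in the dissertation: it keeps the entire window in memory, whereas the approach described above combines histograms with sampling to avoid that cost. The class and method names (`SlidingWindowEquiDepth`, `add`, `boundaries`) are hypothetical.

```python
import bisect
from collections import deque


class SlidingWindowEquiDepth:
    """Exact equi-depth histogram over the last `window_size` values.

    Illustrative baseline only (an assumption of this sketch): it stores
    the whole window, so memory grows linearly with the window size.
    """

    def __init__(self, window_size, num_buckets):
        if window_size <= 0 or num_buckets <= 0:
            raise ValueError("window_size and num_buckets must be positive")
        self.window_size = window_size
        self.num_buckets = num_buckets
        self._arrivals = deque()  # values in arrival order, oldest first
        self._sorted = []         # the same values, kept in sorted order

    def add(self, value):
        """Insert a new value and expire the oldest one if the window is full."""
        self._arrivals.append(value)
        bisect.insort(self._sorted, value)
        if len(self._arrivals) > self.window_size:
            expired = self._arrivals.popleft()
            # bisect_left finds one occurrence of the expired value.
            del self._sorted[bisect.bisect_left(self._sorted, expired)]

    def boundaries(self):
        """Return the num_buckets - 1 internal bucket boundaries.

        Boundary i is the largest value of bucket i, so each bucket holds
        roughly len(window) / num_buckets values.
        """
        n = len(self._sorted)
        if n == 0:
            return []
        return [
            self._sorted[max(0, (i * n) // self.num_buckets - 1)]
            for i in range(1, self.num_buckets)
        ]


hist = SlidingWindowEquiDepth(window_size=8, num_buckets=4)
for v in range(1, 13):      # stream 1..12; the window ends up holding 5..12
    hist.add(v)
print(hist.boundaries())    # [6, 8, 10]
```

Each insertion here costs time linear in the window size because of the sorted-list update, which is the kind of overhead that sampling-based or approximate synopses are meant to reduce on fast streams.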
