Document analysis by means of data mining techniques

The huge amount of textual data produced every day by scientists, journalists, and Web users makes it possible to investigate many different aspects of the information stored in published documents. Data mining and information retrieval techniques are exploited to manage and extract information from large amounts of unstructured textual data. Text mining, also known as text data mining, is the process of extracting high-quality information (in terms of relevance, novelty, and interestingness) from text by identifying patterns. It typically involves structuring the input text, by means of parsing and the addition of linguistic features or sometimes by removing extraneous data, and then finding patterns in the structured data. Finally, the patterns are evaluated and the output is interpreted to accomplish the desired task. Recently, text mining has attracted attention in several fields, such as security (e.g., the analysis of Internet news), commerce (for search and indexing purposes), and academia (e.g., query answering). Beyond retrieving the documents that contain the words given in a user query, text mining may provide a direct answer to the user by exploiting the Semantic Web to take the meaning and context of the content into account. It can also support intelligence analysis and is used in some email spam filters to filter out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling, and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document collection. Furthermore, the knowledge contained in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents.
Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as hidden Markov models, neural networks, and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, extracting useful information according to domain knowledge related to the user interests is a challenging task. The main goals of this work have been the study and design of novel data representations and data mining algorithms for managing and extracting knowledge from unstructured documents. This thesis investigates the application to textual documents of data mining approaches that are firmly established in the analysis of transactional data (e.g., frequent itemset mining). Frequent itemset mining is a widely used exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, the use of frequent itemsets in textual document summarization has not been investigated so far. This work exploits frequent itemsets for multi-document summarization and presents a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer), which relies on an itemset-based model, i.e., a model composed of frequent itemsets extracted from the document collection. Highly representative and non-redundant sentences are selected for the summary by considering both a sentence relevance score, based on tf-idf statistics, and the sentence coverage with respect to a concise and highly informative itemset-based model. To evaluate the performance of ItemSum, a suite of experiments on a collection of news articles has been performed. The results show that ItemSum significantly outperforms widely used previous summarizers in terms of precision, recall, and F-measure.
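As a rough illustration of the itemset-based idea described above, the following minimal Python sketch treats each sentence as a transaction of terms, enumerates frequent itemsets by brute force, and greedily picks sentences that cover the itemset model, breaking ties with a tf-idf style relevance score. It is not the ItemSum implementation: the tokenizer, the brute-force enumeration, the greedy loop, the binary term frequency, and all names and parameters (tokenize, frequent_itemsets, tfidf_scores, select_sentences, min_support, max_size) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def tokenize(sentence):
    """Return the set of lowercase alphabetic terms in a sentence (assumed minimal preprocessing)."""
    words = (t.strip(".,;:!?\"'()").lower() for t in sentence.split())
    return {w for w in words if w.isalpha()}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for term sets that occur in at least min_support sentences."""
    counts = Counter()
    for terms in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[frozenset(itemset)] += 1
    return {s for s, c in counts.items() if c >= min_support}


def tfidf_scores(transactions):
    """Score each sentence as the sum of idf weights of its terms (binary tf, sentence = document)."""
    n = len(transactions)
    df = Counter(t for terms in transactions for t in terms)
    return [sum(math.log(n / df[t]) for t in terms) for terms in transactions]


def select_sentences(sentences, min_support=2, max_size=2):
    """Greedily pick sentences until every frequent itemset is covered by some chosen sentence."""
    transactions = [tokenize(s) for s in sentences]
    uncovered = frequent_itemsets(transactions, min_support, max_size)
    relevance = tfidf_scores(transactions)
    chosen = []
    while uncovered:
        best, best_key = None, None
        for i, terms in enumerate(transactions):
            if i in chosen:
                continue
            covered = sum(1 for s in uncovered if s <= terms)
            if covered == 0:
                continue
            # Prefer larger coverage first, then higher relevance.
            key = (covered, relevance[i])
            if best_key is None or key > best_key:
                best, best_key = i, key
        if best is None:
            break
        chosen.append(best)
        uncovered = {s for s in uncovered if not s <= transactions[best]}
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling select_sentences on a list of sentence strings returns the chosen subset in document order. A practical system would replace the brute-force enumeration with an Apriori or FP-growth style miner and would add stemming and stopword removal.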
We also validated our approach against a large number of approaches on the DUC'04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance is investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not effectively cover all the semantically relevant data facets. A step forward towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine general-purpose summarization strategies with ad hoc linguistic analysis. The key idea is to also consider the semantics behind the document content, in order to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries may not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process, in order to take the semantic meaning of the document content into account during sentence evaluation and selection. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named entity recognition based on the Yago ontology is used for the text summarization task. The Named Entity Recognition (NER) task is concerned with marking the mentions of specific objects in the text.
These mentions are then classified into a set of predefined categories. Standard categories include "person", "location", "geo-political organization", "facility", "organization", and "time". The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC'04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named SociONewSum. This effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary that consists of the most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content (UGC) coming from Twitter has been exploited to discover current social trends and improve the appeal of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis.
Furthermore, the readability of the generated summaries has also been analyzed.
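The precision, recall, and F-measure comparisons mentioned above are based on n-gram overlap between a generated summary and a reference summary. The following minimal Python sketch shows how a ROUGE-N style score can be computed under simplifying assumptions: a single reference, whitespace tokenization, and no stemming or stopword removal. It is not the ROUGE toolkit itself, and the function names ngrams and rouge_n are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) that occur in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F-measure) of clipped n-gram overlap.

    Simplified sketch: one reference, lowercase whitespace tokens only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, rouge_n("the storm hit the coast", "the storm hit the city hard") shares four of the five candidate unigrams and four of the six reference unigrams, which gives a precision of 0.8 and a recall of about 0.67.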
