Database and Expert Systems Applications

Many machine-generated emails carry important information that the recipient must act on at a scheduled time. A natural goal is therefore to automatically extract such actionable information from these emails and communicate it to users. These emails are generated in many different domains by providers offering different types of services. However, because such emails carry personal information, it is difficult to obtain a large corpus of labeled data for supervised information extraction methods. In this paper, we propose a novel method that automatically identifies the part of an email containing actionable information, called the core region of the email, with the aid of a domain dictionary. The domain dictionary is generated from public information about the domain. The core regions are stored as template trees; a template tree is a sub-tree embedded in the email’s HTML DOM tree. Our experiments on real data show that the structure of the core region, which contains all the information of interest, is very simple and that the core region is 85%–98% smaller than the original email. Our experiments also show that the template trees are highly repetitive across a diverse set of emails from a given service provider.
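
To make the idea of a core region concrete, the following is a minimal sketch, not the method of the paper. It assumes that the core region can be approximated as the smallest DOM sub-tree whose text covers every domain-dictionary term found in the email, and that terms are matched by naive substring search. The names `Node`, `TreeBuilder`, `core_region`, the sample email, and the dictionary are all hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list; an assumption).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of a simplified DOM tree."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += " " + data


def collect_terms(node, dictionary, cache):
    """Record, for every node, which dictionary terms occur in its sub-tree."""
    text = node.text.lower()
    found = {term for term in dictionary if term in text}
    for child in node.children:
        found |= collect_terms(child, dictionary, cache)
    cache[node] = found
    return found


def size(node):
    """Number of nodes in the sub-tree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def core_region(html, dictionary):
    """Return (core, root): the deepest sub-tree covering all matched terms."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    cache = {}
    target = collect_terms(builder.root, dictionary, cache)
    node = builder.root
    # Descend while exactly one child still covers every matched term.
    while target:
        covering = [c for c in node.children if cache[c] == target]
        if len(covering) != 1:
            break
        node = covering[0]
    return node, builder.root


# Hypothetical flight-notification email and domain dictionary.
EMAIL = """
<html><body>
<table><tr><td><img src="logo.png"><a href="#">Deals</a><a href="#">Help</a></td></tr></table>
<div id="main">
  <p>Hello Alex,</p>
  <table id="itinerary">
    <tr><td>Flight</td><td>XY123</td></tr>
    <tr><td>Departure</td><td>10:40</td></tr>
    <tr><td>Gate</td><td>B7</td></tr>
  </table>
</div>
<div><p>Unsubscribe</p><p>Privacy policy</p><p>Contact us</p></div>
</body></html>
"""
DICTIONARY = {"flight", "departure", "gate"}

core, root = core_region(EMAIL, DICTIONARY)
reduction = 1 - size(core) / size(root)
print(core.tag, core.attrs, f"{reduction:.0%} fewer nodes than the full email")
```

In this toy email the selected sub-tree is the itinerary table, and the navigation and footer markup fall outside it. The node-count reduction printed here only illustrates the kind of measurement the abstract reports; it is not comparable to the 85%–98% figure, which comes from the authors' experiments on real data.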
