Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails

Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., a purchase or an event) into templates. Extracting structured data from B2C emails allows users to track important information across devices. However, it also poses several challenges: responses must be fast despite massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means that no one except the receiver can review the messages or any information derived from them. In this paper, we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on annotating product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structured learning methods, such as Conditional Random Fields (CRFs), solve this problem well. To accomplish this task, we propose a hybrid approach that trains a CRF model on labels predicted by binary classifiers (weak learners). Because the weak learners can perform poorly, we apply the Expectation-Maximization (EM) algorithm to the CRF to remove noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over both the weak learners and plain CRFs.
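
The hybrid idea can be illustrated with a deliberately simplified sketch. It is not the paper's EM-CRF: it assumes a binary chain (0 = other, 1 = product) whose per-token potentials are the weak learners' probabilities, and it uses EM only to re-estimate the transition potentials. The function names `forward_backward` and `em_denoise`, the smoothing constant, and the toy inputs are all hypothetical choices made for illustration.

```python
def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def forward_backward(unary, trans):
    """Posterior marginals for a binary linear chain.

    unary[t][k]: positive potential of label k at token t.
    trans[a][b]: positive potential of label a followed by label b.
    Returns (node, pair): per-token and per-adjacent-pair posteriors.
    """
    n = len(unary)
    alpha = [None] * n
    beta = [[1.0, 1.0] for _ in range(n)]
    alpha[0] = _normalize(unary[0])
    for t in range(1, n):  # scaled forward pass
        alpha[t] = _normalize([
            unary[t][b] * sum(alpha[t - 1][a] * trans[a][b] for a in (0, 1))
            for b in (0, 1)])
    for t in range(n - 2, -1, -1):  # scaled backward pass
        beta[t] = _normalize([
            sum(trans[a][b] * unary[t + 1][b] * beta[t + 1][b] for b in (0, 1))
            for a in (0, 1)])
    node = [_normalize([alpha[t][k] * beta[t][k] for k in (0, 1)])
            for t in range(n)]
    pair = []
    for t in range(n - 1):
        raw = [[alpha[t][a] * trans[a][b] * unary[t + 1][b] * beta[t + 1][b]
                for b in (0, 1)] for a in (0, 1)]
        total = sum(map(sum, raw))
        pair.append([[x / total for x in row] for row in raw])
    return node, pair

def em_denoise(weak_probs, n_iter=10, eps=1e-6):
    """EM over transition potentials; weak-learner outputs stay fixed.

    weak_probs: list of sequences, each a list of P(token is product)
    as predicted by a weak learner (assumed input format).
    """
    def unaries(seq):
        clipped = [min(max(p, eps), 1.0 - eps) for p in seq]
        return [[1.0 - p, p] for p in clipped]

    trans = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(n_iter):
        counts = [[1e-3, 1e-3], [1e-3, 1e-3]]  # small smoothing prior
        for seq in weak_probs:  # E-step: expected transition counts
            _, pair = forward_backward(unaries(seq), trans)
            for pm in pair:
                for a in (0, 1):
                    for b in (0, 1):
                        counts[a][b] += pm[a][b]
        trans = [_normalize(row) for row in counts]  # M-step
    posteriors = [[m[1] for m in forward_backward(unaries(seq), trans)[0]]
                  for seq in weak_probs]
    return posteriors, trans

# Toy weak-learner outputs for two token sequences (made-up numbers).
weak = [[0.1, 0.9, 0.4, 0.9, 0.1], [0.05, 0.95, 0.9, 0.1]]
denoised, learned_trans = em_denoise(weak)
```

With uniform transitions the posteriors equal the weak probabilities. With a fixed "sticky" transition such as `[[0.9, 0.1], [0.1, 0.9]]`, a token scored 0.4 between two tokens scored 0.9 is pulled to roughly 0.93, which is the kind of sequence-level correction a chain model can apply to noisy per-token labels. A full CRF would additionally learn feature weights for the per-token potentials rather than taking them as fixed.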
