Paper information - Hidden in Plain Sight: Classifying Emails Using Embedded Image Contents

Hidden in Plain Sight: Classifying Emails Using Embedded Image Contents

A vast majority of the emails received by people today are machine-generated by businesses communicating with consumers. While some emails originate as a result of a transaction (e.g., hotel or restaurant reservation confirmations, online purchase receipts, shipping notifications, etc.), a large fraction are commercial emails promoting an offer (a special sale, free shipping, available for a limited time, etc.). The sheer number of these promotional emails makes it difficult for users to read all these emails and decide which ones are actually interesting and actionable. In this paper, we tackle the problem of extracting information from commercial emails promoting an offer to the user. This information enables an email platform to build several new experiences that can unlock the value in these emails without the user having to navigate and read all of them. For instance, we can highlight offers that are expiring soon, or display a notification when there's an unexpired offer from a merchant if your phone recognizes that you are at that merchant's store. A key challenge in extracting information from such commercial emails is that they are often image-rich and contain very little text. Training a machine learning (ML) model on a rendered image-rich email and applying it to each incoming email can be prohibitively expensive. In this paper, we describe a cost-effective approach for extracting signals from both the text and image content of commercial emails in the context of Gmail, an email platform that serves over a billion users around the world. The key insight is to leverage the template structure of emails, and use off-the-shelf OCR techniques to obtain the text from images to augment the existing text features offline. Compared to a text-only approach, we show that we are able to identify 9.12% more email templates corresponding to ~5% more emails being identified as offers. Interestingly, our analysis shows that this 5% improvement in coverage is across the board, irrespective of whether the emails were sent by large merchants or small local merchants, allowing us to deliver an improved experience for everyone.
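The key insight in the abstract (run OCR offline, once per template, and fold the recovered text into the existing text features) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Email` type, the `template_id` field, the `build_template_features` function, and the `ocr` callable are all hypothetical names introduced here, and the sketch assumes emails have already been grouped by template and that some off-the-shelf OCR engine is available behind `ocr`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Email:
    # Hypothetical record; assumes template clustering has already happened.
    template_id: str
    text: str
    image_urls: List[str] = field(default_factory=list)


def build_template_features(
    emails: Iterable[Email],
    ocr: Callable[[str], str],
) -> Dict[str, str]:
    """Offline step: return one augmented text blob per template.

    `ocr` stands in for any off-the-shelf OCR engine (an assumption);
    it maps an image reference to the text found in that image.
    """
    # Keep the first email seen for each template as its representative,
    # so OCR cost scales with the number of templates, not emails.
    representative: Dict[str, Email] = {}
    for email in emails:
        representative.setdefault(email.template_id, email)

    features: Dict[str, str] = {}
    for template_id, email in representative.items():
        ocr_text = " ".join(ocr(url) for url in email.image_urls)
        # Augment the existing text features with the OCR output.
        features[template_id] = (email.text + " " + ocr_text).strip()
    return features
```

At serving time, an incoming email would only need a lookup of its template's precomputed features, which is one plausible reading of why the abstract describes the approach as cost-effective compared with running a model on every rendered email.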

Qi Zhao | Sandeep Tata | Marc Najork | James Bradley Wendt | Navneet Potti
