What you use, not what you do: Automatic classification and similarity detection of recipes

Abstract

Social media data is notoriously noisy and unclean. User-built recipe collections and their manual categorization are no exception. However, a consistent and transparent categorization is vital to users who search for a specific entry. Curators face a similar challenge when given a large collection of existing recipes: they first need to understand the data before they can build a clean system of categories. This paper presents an empirical study using machine learning classifiers (logistic regression and decision trees) for the automatic classification of recipes on the German cooking website Chefkoch.de. The central question we aim to answer is: Which information is necessary to perform well at this task? In particular, we compare features extracted from the free-text instructions of the recipe to those taken from the list of ingredients. On a sample of 5000 recipes with 87 classes, our feature analysis shows that a combination of nouns from the textual description of the recipe with ingredient features performs best in the logistic regression model (48% F1). Nouns alone achieve 45% F1 and ingredients alone 46% F1. Other word classes, however, do not complement the information from nouns. Decision trees consistently underperform logistic regression but yield an interpretable model. On a larger training set of 50,000 instances, the best configuration improves to 57%, highlighting the importance of a sizeable data set. In addition, we report on the use of these feature vectors for similarity search and ranking of recipes and evaluate them on the task of (near-)duplicate detection. We show that our method can reduce the manual curation effort, with precision@3 = 0.52.
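To make the similarity search and the precision@3 figure more concrete, the following is a minimal sketch, not the authors' implementation. It assumes recipes are represented as simple bag-of-ingredient count vectors, ranks candidates by cosine similarity, and uses one common definition of precision@k (the fraction of the top k results that are true duplicates). The recipe names, the toy data, and the helper functions `cosine`, `rank_similar`, and `precision_at_k` are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query_id: str, vectors: dict, k: int = 3) -> list:
    """Return the ids of the k recipes most similar to the query recipe."""
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), rid)
        for rid, vec in vectors.items()
        if rid != query_id
    ]
    # Sort by descending similarity; break ties by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rid for _, rid in scored[:k]]


def precision_at_k(ranked: list, relevant: set, k: int = 3) -> float:
    """Fraction of the top-k ranked ids that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for rid in ranked[:k] if rid in relevant)
    return hits / k


# Hypothetical toy data: ingredient lists standing in for feature vectors.
vectors = {
    "pancakes_a": Counter(["flour", "egg", "milk", "sugar"]),
    "pancakes_b": Counter(["flour", "egg", "milk", "butter"]),
    "goulash": Counter(["beef", "onion", "paprika"]),
    "omelette": Counter(["egg", "milk", "cheese"]),
    "salad": Counter(["tomato", "onion", "oil"]),
}

top3 = rank_similar("pancakes_a", vectors, k=3)
# Assume a curator marked pancakes_b as the only near duplicate.
score = precision_at_k(top3, {"pancakes_b"}, k=3)
```

In this toy setting the near duplicate is ranked first, so a curator would only need to inspect a short candidate list rather than the full collection. The feature weighting and evaluation protocol actually used in the paper may differ from this sketch.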
