Developing a Legal Form Classification and Extraction Approach for Company Entity Matching: Benchmark of Rule-Based and Machine Learning Approaches

This paper explores record linkage, a step in the data integration process, focusing on company entities. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. This legal form is not represented concisely and consistently across data sources, which leads to considerable data quality problems in the subsequent steps of record linkage. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid that combines a rule set with a supervised machine learning model. With our hybrid approach, any company data set from research or business can be processed, which improves data quality for subsequent processing steps such as record linkage. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
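To make the task concrete, the following is a minimal sketch of what rule-based legal form extraction with a model fallback could look like. The patterns in `LEGAL_FORM_RULES`, the function name `extract_legal_form`, and the `fallback` parameter are all illustrative assumptions; they are not the rule set or model from the paper, and the abstract does not specify how the rule set and the supervised model are combined.

```python
import re

# Hypothetical rule set: (canonical legal form, regex for its surface variants).
# More specific forms come first so "GmbH & Co. KG" is not matched as "GmbH".
# These patterns are illustrative assumptions, not the paper's rules.
LEGAL_FORM_RULES = [
    ("GmbH & Co. KG", r"gmbh\s*(?:&|und|u\.)\s*co\.?\s*kg"),
    ("GmbH", r"gmbh|gesellschaft\s+mit\s+beschr(?:ä|ae)nkter\s+haftung"),
    ("AG", r"ag|aktiengesellschaft"),
    ("Ltd.", r"ltd\.?|limited"),
    ("Inc.", r"inc\.?|incorporated"),
]


def extract_legal_form(name, fallback=None):
    """Return (name without legal form, canonical legal form or None).

    `fallback` is an optional callable (for example a trained classifier's
    predict function, assumed here) that is consulted only when no rule fires.
    """
    for canonical, pattern in LEGAL_FORM_RULES:
        # Require non-word characters around the match so "ag" does not
        # fire inside words such as "Magna".
        wrapped = r"(?<!\w)(?:" + pattern + r")(?!\w)"
        match = re.search(wrapped, name, flags=re.IGNORECASE)
        if match:
            remainder = name[:match.start()] + " " + name[match.end():]
            remainder = re.sub(r"\s+", " ", remainder).strip(" ,.-")
            return remainder, canonical
    # No rule matched: defer to the fallback model if one was supplied.
    label = fallback(name) if fallback is not None else None
    return name.strip(), label


if __name__ == "__main__":
    for raw in ["Muster GmbH & Co. KG", "Beispiel Aktiengesellschaft", "Magna Holdings"]:
        print(raw, "->", extract_legal_form(raw))
```

Separating the legal form from the remaining name in this way lets a later record linkage step compare normalized name cores and canonical legal forms as two distinct attributes.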
