FLIE: Form Labeling for Information Extraction

Information extraction (IE) from forms remains an unsolved problem, with a few exceptions such as bills. Forms are complex, and their templates are often unstable because of injected advertising, extra conditions, or document merging. Our scenario deals with insurance forms used by brokers in Switzerland. Here, each combination of insurer, insurance type, and language results in a new document layout, leading to a few hundred document types. To help brokers extract data from policies, we developed a new labeling method called FLIE (form labeling for information extraction). FLIE first assigns a document to a cluster, grouping by language, insurer, and insurance type. It then labels the layout. To produce training data, the user annotates a sample document by hand with attribute names, thereby providing a mapping. FLIE then applies machine learning to propagate this mapping to other documents and extracts the information. Our results are based on 24 Swiss policies in German, covering UVG (mandatory accident insurance), KTG (sick pay insurance), and UVGZ (optional accident insurance). Our solution achieves an accuracy of around 84–89%. It is currently being extended to other policy types and languages.
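
The two-stage idea described above (pick a cluster by language, insurer, and insurance type, then apply a hand-made mapping to the labeled layout) can be illustrated with a minimal sketch. This is not FLIE's implementation: the names `ClusterKey`, `MAPPINGS`, and `extract`, the insurer name, and the German labels are hypothetical, and a plain dictionary lookup stands in for the machine-learning propagation step, whose details the abstract does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterKey:
    """Hypothetical cluster identifier: one layout per combination."""
    language: str
    insurer: str
    insurance_type: str


# Hypothetical hand-annotated mapping for one cluster:
# form label as printed on the policy -> attribute name.
MAPPINGS = {
    ClusterKey("de", "ExampleInsurer", "UVG"): {
        "Policennummer": "policy_number",
        "Versicherungsnehmer": "policy_holder",
        "Prämie": "premium",
    }
}


def extract(key, fields, mappings=MAPPINGS):
    """Apply a cluster's mapping to (label, value) pairs read from a form.

    Assumption: exact label lookup replaces the learned propagation,
    so unknown labels (e.g. advertising text) are simply skipped.
    """
    mapping = mappings.get(key)
    if mapping is None:
        return {}  # no annotated sample exists for this cluster yet
    result = {}
    for label, value in fields:
        # Normalize the label: drop surrounding spaces and a trailing colon.
        normalized = label.strip().rstrip(":").strip()
        attribute = mapping.get(normalized)
        if attribute is not None:
            result[attribute] = value.strip()
    return result
```

Calling `extract(ClusterKey("de", "ExampleInsurer", "UVG"), [("Policennummer:", " 12345 ")])` returns a dictionary keyed by attribute name. A document whose cluster has no annotated sample yields an empty result, which mirrors the need for one hand-annotated document per layout.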
