Table extraction using conditional random fields

The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content, however, presents difficulties for traditional language modeling techniques. This paper presents the use of conditional random fields (CRFs) for table extraction and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout and language features, and as a result they perform significantly better. We show experimental results on plain-text government statistical reports in which tables are located with 92% F1 and their constituent lines are classified into 12 table-related categories with 94% accuracy. We also discuss future work on undirected graphical models for segmenting columns, finding cells, and classifying them as data cells or label cells.
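To make "rich and overlapping layout and language features" concrete, the following minimal Python sketch computes a few per-line features of the kind a line-labeling CRF could consume. The feature names, the helper `gap_columns`, and the two-space threshold are illustrative assumptions, not the feature set used in the paper.

```python
import re


def gap_columns(line):
    """Column indices covered by runs of 2+ spaces between non-space text."""
    cols = set()
    for m in re.finditer(r"(?<=\S) {2,}(?=\S)", line):
        cols.update(range(m.start(), m.end()))
    return cols


def line_features(line, prev_line=""):
    """Return overlapping layout and content features for one text line.

    All feature names are hypothetical; they only illustrate that several
    correlated cues can describe the same line at once.
    """
    stripped = line.strip()
    n = max(len(stripped), 1)  # avoid division by zero on blank lines
    return {
        # content cue: share of digit characters in the line
        "digit_fraction": sum(c.isdigit() for c in stripped) / n,
        # layout cue: wide gaps between tokens suggest column boundaries
        "has_space_gap": bool(gap_columns(line)),
        # layout cue: line made only of rule characters such as ----- or =====
        "separator_line": stripped != "" and set(stripped) <= set("-=_ "),
        # context cue: gap columns aligned with the previous line
        "gap_cols_shared_with_prev": len(gap_columns(line) & gap_columns(prev_line)),
        "blank": stripped == "",
    }
```

Because these cues are correlated (an aligned data row usually has both a wide gap and a high digit fraction), they are the kind of overlapping evidence that a conditionally trained model can use directly, whereas a generative HMM would have to model their joint distribution.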
