Information extraction from scanned documents by stochastic page layout analysis
We propose a stochastic context-free grammar for extracting information from scanned document images. The grammar is designed to resolve ambiguities in layout analysis and to use both layout and text features. We applied it to the problem of extracting bibliographic information from scanned academic papers and found that it extracts this information accurately.
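To make the idea of a stochastic context-free grammar over page lines concrete, the following is a minimal sketch and not the grammar used in the paper. It assumes a toy grammar in Chomsky normal form with three hypothetical nonterminals (`Header`, `Title`, `Authors`), a hypothetical `emission` function that multiplies one layout cue (font size) by one text cue (presence of a comma), and hand-picked probabilities. The most probable parse is found with a Viterbi variant of the CYK algorithm.

```python
import math

# Hypothetical toy grammar in Chomsky normal form (not taken from the paper).
# Binary rules: (parent, left, right) -> probability.
BINARY = {
    ("Header", "Title", "Authors"): 1.0,
    ("Title", "Title", "Title"): 0.3,        # a title may span several lines
    ("Authors", "Authors", "Authors"): 0.3,  # so may an author block
}
# Lexical rules: probability that a nonterminal rewrites to a single line.
LEXICAL = {"Title": 0.7, "Authors": 0.7}


def emission(label, line):
    """Assumed score combining a layout feature and a text feature."""
    big = line["font_size"] >= 14      # layout cue
    has_comma = "," in line["text"]    # text cue
    if label == "Title":
        return (0.8 if big else 0.2) * (0.2 if has_comma else 0.8)
    if label == "Authors":
        return (0.3 if big else 0.7) * (0.7 if has_comma else 0.3)
    return 0.0


def viterbi_cyk(lines):
    """Fill a chart mapping (start, end, label) to (log prob, backpointer)."""
    n = len(lines)
    best = {}
    for i, line in enumerate(lines):
        for label, p_lex in LEXICAL.items():
            p = p_lex * emission(label, line)
            if p > 0:
                best[(i, i + 1, label)] = (math.log(p), None)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for (parent, left, right), p_rule in BINARY.items():
                for k in range(i + 1, j):
                    l = best.get((i, k, left))
                    r = best.get((k, j, right))
                    if l is None or r is None:
                        continue
                    score = math.log(p_rule) + l[0] + r[0]
                    key = (i, j, parent)
                    if key not in best or score > best[key][0]:
                        best[key] = (score, (k, left, right))
    return best


def leaf_labels(best, i, j, label):
    """Follow backpointers to recover one label per line."""
    _, back = best[(i, j, label)]
    if back is None:
        return [label]
    k, left, right = back
    return leaf_labels(best, i, k, left) + leaf_labels(best, k, j, right)


def label_lines(lines, start="Header"):
    """Return the labels of the most probable parse, or None if none exists."""
    best = viterbi_cyk(lines)
    if (0, len(lines), start) not in best:
        return None
    return leaf_labels(best, 0, len(lines), start)


if __name__ == "__main__":
    # Made-up input lines; the author line is a placeholder.
    page = [
        {"text": "Information extraction from scanned documents", "font_size": 18},
        {"text": "by stochastic page layout analysis", "font_size": 18},
        {"text": "A. Author, B. Author", "font_size": 11},
    ]
    print(label_lines(page))
```

In this sketch the grammar decides where the title ends and the author block begins by comparing the probabilities of competing parses, which is one way ambiguity in layout analysis can be resolved. A real system would learn the rule and emission probabilities from labeled pages and would use a richer feature set.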