Many applications of Formal Concept Analysis (FCA) start with a set of structured data such as objects and their properties. In practice, most readily available data is in the form of unstructured or semistructured text. A typical application of FCA therefore assumes that objects and their properties have already been extracted by some other method or technique. For example, in the 2003 Los Alamos National Lab (LANL) project on Advanced Knowledge Integration In Assessing Terrorist Threats, a data extraction tool was used to mine the text for structured data. In this paper, we provide a detailed description of our approach to the extraction of personal names for possible subsequent use in FCA. Our basic approach is to integrate statistics on names and other words into an adaptation of a Hidden Markov Model (HMM). We use lists of names and their relative frequencies compiled from U.S. Census data. We also use a list of non-name words along with their frequencies in a training set drawn from our collection of documents. These lists are compiled into one master list that is used as part of the design.
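As a rough illustration of how a master list of name and non-name frequencies might feed an HMM, the sketch below merges two frequency tables and runs standard Viterbi decoding over a two-state model. It is a minimal sketch under stated assumptions, not the adaptation described in the paper: the state names `NAME` and `OTHER`, the functions `build_master_list` and `viterbi`, the smoothing floor, and all probabilities are hypothetical values chosen for illustration only.

```python
import math

STATES = ("NAME", "OTHER")
FLOOR = 1e-6  # assumed smoothing floor for tokens unseen in a list


def build_master_list(name_freq, word_freq):
    """Merge two relative-frequency tables into token -> (p_name, p_other)."""
    return {
        token: (name_freq.get(token, 0.0), word_freq.get(token, 0.0))
        for token in set(name_freq) | set(word_freq)
    }


def viterbi(tokens, master, start, trans):
    """Return the most likely NAME/OTHER label sequence for the tokens."""
    if not tokens:
        return []

    def emit(state, token):
        # Emission score comes from the master list, floored to avoid log(0).
        p_name, p_other = master.get(token.lower(), (0.0, 0.0))
        p = p_name if state == "NAME" else p_other
        return math.log(max(p, FLOOR))

    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: math.log(start[s]) + emit(s, tokens[0]) for s in STATES}
    back = []
    for token in tokens[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + math.log(trans[r][s]))
            new_scores[s] = scores[prev] + math.log(trans[prev][s]) + emit(s, token)
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Illustrative (made-up) frequencies; real values would come from the
# census name lists and the training-set word counts.
name_freq = {"john": 0.03, "smith": 0.01, "bill": 0.005}
word_freq = {"the": 0.06, "met": 0.001, "with": 0.01, "bill": 0.0005}
master = build_master_list(name_freq, word_freq)

start = {"NAME": 0.2, "OTHER": 0.8}  # assumed initial probabilities
trans = {  # assumed transition probabilities
    "NAME": {"NAME": 0.5, "OTHER": 0.5},
    "OTHER": {"NAME": 0.2, "OTHER": 0.8},
}

labels = viterbi(["John", "Smith", "met", "with", "Bill"], master, start, trans)
print(labels)
```

A token such as "bill" appears in both lists, so its label depends on the relative frequencies and on the neighbouring states rather than on list membership alone; this is the motivation for keeping both statistics in one master list.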