Browsing semi-structured texts on the web using formal concept analysis

Browsing unstructured Web texts using Formal Concept Analysis (FCA) confronts two problems. First, online Web data is sometimes unstructured, so any FCA system must include additional mechanisms to discover the structure of its input sources. Second, many online collections are large and dynamic, so a Web robot must be used to extract data automatically when it is required. This chapter addresses both issues through a case study: the construction of a Web-based FCA system for browsing classified advertisements for real estate properties. Real estate advertisements were chosen for three reasons: they are a typical semi-structured information source accessible on the Web, the data is relevant only for a short period of time, and the analysis of real estate data is a classic example used in introductory courses on FCA. Unlike the classic FCA real estate example, whose input is a structured relational database, we mine Web-based texts for their implicit structure. The chapter examines the issues encountered when mining these texts and when presenting the results to the FCA system. Our method uses a handcrafted parser to extract structured information from the real estate advertisements, which are then browsed via a Web-based front end employing rudimentary FCA system features. The user can quickly determine the trade-offs between different attributes of real estate properties and alter the constraints of the search in order to locate good candidate properties. Interaction with the system is characterized as a mixed-initiative process in which the user guides the computer in the satisfaction of constraints. These constraints are not specified a priori, but are instead drawn from the data exploration process. Finally, the chapter shows how the Conceptual Email Manager, a prototype FCA text information retrieval tool, can be adapted to the problem.
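The pipeline described above (parse each advertisement into attributes, then browse the resulting formal context) can be illustrated with a minimal Python sketch. It is not the chapter's parser or its FCA system: the advertisement texts, the attribute names, the regular expressions and the function names are all assumptions invented for illustration, and the concepts are enumerated by brute force, which is only practical for very small contexts.

```python
import re
from itertools import combinations

# Hypothetical advertisements; the wording is invented for this sketch.
ADS = {
    "ad1": "3 bedroom house, double garage, pool. $250,000",
    "ad2": "2 bedroom unit close to beach. $180,000",
    "ad3": "3 bedroom house near beach, pool. $320,000",
}


def parse_ad(text):
    """Extract a set of binary attributes from one advertisement.

    Numeric values are scaled to yes/no attributes using assumed thresholds.
    """
    attrs = set()
    beds = re.search(r"(\d+)\s*bedroom", text, re.IGNORECASE)
    if beds and int(beds.group(1)) >= 3:
        attrs.add("3+ bedrooms")
    price = re.search(r"\$([\d,]+)", text)
    if price and int(price.group(1).replace(",", "")) < 300_000:
        attrs.add("under $300k")
    for keyword in ("pool", "beach", "garage"):
        if re.search(rf"\b{keyword}\b", text, re.IGNORECASE):
            attrs.add(keyword)
    return attrs


def common_attributes(objects, context):
    """Intent: the attributes shared by every object in `objects`."""
    if not objects:
        # The empty object set shares every attribute of the context.
        return set().union(*context.values())
    return set.intersection(*(context[o] for o in objects))


def matching_objects(attrs, context):
    """Extent: the objects that have every attribute in `attrs`."""
    return {o for o, owned in context.items() if attrs <= owned}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(set(subset), context)
            extent = matching_objects(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# The formal context: advertisement -> set of attributes.
context = {name: parse_ad(text) for name, text in ADS.items()}

# Browsing step: constrain the search to properties with a pool, then
# inspect which other attributes those properties have in common.
with_pool = matching_objects({"pool"}, context)
shared = common_attributes(with_pool, context)
```

In this toy context, constraining the search to `pool` selects `ad1` and `ad3`, and `common_attributes` shows that both also have three or more bedrooms. This is the kind of attribute trade-off the user explores by tightening or relaxing constraints while browsing.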
