A text mining approach for automatic construction of hypertexts

Research on automatic hypertext construction has grown rapidly in the last decade because there is an urgent need to translate the gigantic amount of legacy documents into web pages. Unlike traditional 'flat' texts, a hypertext contains a number of navigational hyperlinks that point to related hypertexts or to other locations in the same hypertext. Traditionally, these hyperlinks were constructed by the creators of the web pages, with or without the help of authoring tools. However, the gigantic amount of documents produced each day makes such manual construction impractical. An automatic hypertext construction method is therefore necessary for content providers to efficiently produce adequate information that can be used by web surfers. Although most web pages contain non-textual data such as images, sounds, and video clips, text data still contributes the major part of the information about the pages. It is therefore not surprising that most automatic hypertext construction methods derive from traditional information retrieval research. In this work, we propose a new automatic hypertext construction method based on a text mining approach. Our method applies the self-organizing map algorithm to cluster flat text documents in a training corpus and generate two maps. We then use these maps to identify the sources and destinations of important hyperlinks within these training documents. The constructed hyperlinks are then inserted into the training documents to translate them into hypertext form. These translated documents form the new corpus. Incoming documents can also be translated into hypertext form and added to the corpus through the same approach. Our method has been tested on a set of flat text documents collected from a newswire site. Although we only use Chinese text documents, our approach can be applied to any documents that can be transformed into a set of index terms.
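As a rough illustration of the clustering step only, the following is a minimal sketch of a self-organizing map trained on binary index-term vectors. It is not the authors' implementation: the vocabulary, the toy documents, the map size, and the function names `train_som` and `best_matching_unit` are all assumptions made for this example, and the sketch does not reproduce the two maps or the hyperlink identification described above.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the neuron whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rows x cols self-organizing map on equal-length vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per neuron, initialised with random values in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks over time
        for x in rng.sample(vectors, len(vectors)):  # shuffled pass over the corpus
            br, bc = divmod(best_matching_unit(weights, x), cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))  # Gaussian neighbourhood
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Hypothetical index terms and documents (1 = term occurs in the document).
vocabulary = ["election", "vote", "market", "stock"]
documents = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

weights = train_som(documents, rows=2, cols=2)

# Group documents by the neuron they are mapped to.
clusters = {}
for doc_id, vec in enumerate(documents):
    clusters.setdefault(best_matching_unit(weights, vec), []).append(doc_id)
print(clusters)
```

Documents that land on the same or neighbouring neurons share index terms, which is the kind of relatedness a link-generation step could build on. Because the input is just a term vector, the same sketch applies to any language once documents have been reduced to index terms.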
