Semantic indexing for a complete subject discipline

As part of the Illinois Digital Library Initiative (DLI) project, we developed “scalable semantics” technologies. These statistical techniques enabled us to index large collections for search that goes deeper than word matching. Under the auspices of the DARPA Information Management program, we are developing an integrated analysis environment, the Interspace Prototype, that uses “semantic indexing” as the foundation for supporting concept navigation. These semantic indexes record the contextual correlation of noun phrases and are computed generically, independent of subject domain. Using this technology, we were able to compute semantic indexes for an entire subject discipline. In particular, in the summer of 1998, we computed concept spaces for 9.3M MEDLINE bibliographic records from the National Library of Medicine (NLM), which extensively cover the biomedical literature for the period from 1966 to 1997. In this experiment, we first partitioned the collection by subject into smaller collections (repositories), then extracted noun phrases from the titles and abstracts, and finally performed semantic indexing on these subcollections by creating a concept space for each repository. The computation required 2 days on a 128-node SGI/CRAY Origin 2000 at the National Center for Supercomputing Applications (NCSA). This experiment demonstrated the feasibility of scalable semantics techniques for large collections. With the rapid increase in computing power, we believe this indexing technology will shortly be feasible on personal computers.
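To make the idea of a concept space concrete, the following minimal Python sketch builds one for a single toy repository. It is an illustration under stated assumptions, not the procedure used in the experiment: noun phrases are assumed to be already extracted, the function name `build_concept_space` and the `top_k` parameter are hypothetical, and the weight used here (the share of documents containing one phrase that also contain the other) is a simplified stand-in for whatever co-occurrence statistic the actual system applied.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(docs, top_k=3):
    """Build a toy concept space from documents given as lists of noun phrases.

    Assumption: the weight from phrase a to phrase b is
    (documents containing both a and b) / (documents containing a).
    This is a simplified illustrative measure, not the original formula.
    """
    doc_freq = Counter()   # number of documents containing each phrase
    pair_freq = Counter()  # number of documents containing each phrase pair
    for phrases in docs:
        unique = sorted(set(phrases))  # count each phrase once per document
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    space = defaultdict(list)
    for (a, b), n_ab in pair_freq.items():
        # The weight is asymmetric: a -> b and b -> a are normalized differently.
        space[a].append((b, n_ab / doc_freq[a]))
        space[b].append((a, n_ab / doc_freq[b]))

    # Keep the top_k related phrases per phrase, strongest first, ties by name.
    return {
        term: sorted(related, key=lambda item: (-item[1], item[0]))[:top_k]
        for term, related in space.items()
    }


# A hypothetical three-document repository with pre-extracted noun phrases.
repository = [
    ["gene expression", "tumor suppressor", "cell cycle"],
    ["gene expression", "cell cycle"],
    ["tumor suppressor", "breast cancer"],
]
concept_space = build_concept_space(repository)
for term, related in sorted(concept_space.items()):
    print(term, "->", related)
```

Because each repository is processed on its own in this sketch, separate repositories could in principle be handed to separate workers, which is consistent with the partition-then-index approach described above.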
