Special Issue on Searching and Mining Literature Digital Libraries

Information extraction and text mining applications are just beginning to tap the immense amounts of valuable textual information available online. To extract information from millions, and in some cases billions, of documents, different solutions to scalability have emerged. We review key approaches for scaling up information extraction, including the use of general-purpose search engines as well as indexing techniques specialized for information extraction applications. Scalable information extraction is an active area of research, and we highlight some of the opportunities and challenges in this area that are relevant to the database community.
