Clustering of scientific fields by integrating text mining and bibliometrics

The increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, have created tremendous opportunities to improve the classification and bibliometric cartography of science and technology. This metascience benefits from the continuous rise of computing power and the development of new algorithms. Major challenges remain, however. This dissertation verifies the hypothesis that the accuracy of clustering and classification of scientific fields is enhanced by incorporating algorithms and techniques from both text mining and bibliometrics. Textual and bibliometric approaches each have their advantages and intricacies, and they provide different views on the same interlinked corpus of scientific publications or patents. In addition to the textual information in such documents, the citations between them constitute huge networks that yield additional information. We incorporate both points of view and show how to improve on existing text-based and bibliometric methods for the mapping of science.

The dissertation is organized into three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge embedded in text. We introduce and demonstrate our text mining framework and the use of agglomerative hierarchical clustering. We also investigate the relationship between the number of Latent Semantic Indexing factors, the number of clusters, and clustering performance. Furthermore, we describe a combined semi-automatic strategy to determine the optimal number of clusters in a document set. Second, we focus on the analysis of the large networks that emerge from many individual acts of authors citing other scientific works or collaborating in the same research endeavor. These networks of science and technology can be analyzed with techniques from bibliometrics and graph theory in order to rank important and relevant entities, to cluster or partition them, and to extract communities. Third, we substantiate the complementarity of text mining and bibliometric methods and propose schemes for the sound integration of both worlds. The performance of unsupervised clustering and classification significantly improves by deeply merging textual content of scientific publications
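
As an illustration of what integrating the textual and bibliometric views could look like in practice, the following minimal sketch combines a text-based similarity with a citation-based one and then applies agglomerative hierarchical clustering. Everything in it is an assumption made for illustration: the toy documents and reference identifiers, the function names, the use of raw term counts and bibliographic coupling, and in particular the weighted linear combination controlled by the hypothetical `weight` parameter. It is not the integration scheme proposed in the dissertation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_similarity(texts, references, weight=0.5):
    """Mix text similarity and bibliographic coupling for every document pair.

    `weight` is a hypothetical mixing parameter: 1.0 uses text only,
    0.0 uses shared references only.
    """
    term_vecs = [Counter(t.lower().split()) for t in texts]   # raw term counts
    ref_vecs = [dict.fromkeys(r, 1.0) for r in references]    # binary reference vectors
    n = len(texts)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            text_sim = cosine(term_vecs[i], term_vecs[j])
            cite_sim = cosine(ref_vecs[i], ref_vecs[j])
            sim[i][j] = sim[j][i] = weight * text_sim + (1 - weight) * cite_sim
    return sim


def agglomerate(sim, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find and merge the pair of clusters with the highest average similarity.
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy corpus (invented): two text mining papers and two network analysis papers.
texts = [
    "latent semantic indexing of text documents",
    "text mining with latent semantic indexing",
    "citation network community structure",
    "community detection in citation network graphs",
]
# Invented reference lists; r2 and r8 are each shared by two papers.
references = [{"r1", "r2"}, {"r2", "r3"}, {"r7", "r8"}, {"r8", "r9"}]

clusters = agglomerate(hybrid_similarity(texts, references), k=2)
print(clusters)
```

On this toy input the two text mining papers end up in one cluster and the two network papers in the other. A realistic pipeline would replace the raw term counts with weighted and dimension-reduced representations and would need a principled way to choose both `weight` and the number of clusters `k`.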

[1]  C. Loan Generalizing the Singular Value Decomposition , 1976 .

[2]  Magnus Sahlgren,et al.  From Words to Understanding , 2001 .

[3]  W. A. Turner,et al.  Evaluating input/output relationships in a regional research network using co-word analysis , 2005, Scientometrics.

[4]  Dag W. Aksnes,et al.  The effect of highly cited papers on national citation indicators , 2004, Scientometrics.

[5]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[6]  P. Groenen,et al.  Modern multidimensional scaling , 1996 .

[7]  Eugene Garfield,et al.  Citation indexing - its theory and application in science, technology, and humanities , 1979 .

[8]  Bart De Moor,et al.  Combining full text and bibliometric information in mapping scientific disciplines , 2005, Inf. Process. Manag..

[9]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[10]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[11]  Michael W. Berry,et al.  Understanding search engines: mathematical modeling and text retrieval (software , 1999 .

[12]  Shenghuo Zhu,et al.  Efficient multi-way text categorization via generalized discriminant analysis , 2003, CIKM '03.

[13]  Michalis Vazirgiannis,et al.  c ○ 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques , 2022 .

[14]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[15]  William E. Snizek,et al.  Textual and nontextual characteristics of scientific papers: Neglected science indicators , 2005, Scientometrics.

[16]  H. Zuckerman Nobel laureates in science: patterns of productivity, collaboration, and authorship. , 1967, American sociological review.

[17]  Vladimir Batagelj,et al.  Pajek - Analysis and Visualization of Large Networks , 2001, Graph Drawing Software.

[18]  Nicholas C. Mullins,et al.  THE STRUCTURAL ANALYSIS OF A SCIENTIFIC PAPER , 1988 .

[19]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[20]  E. S. Pearson,et al.  On questions raised by the combination of tests based on discontinuous distributions. , 1950, Biometrika.

[21]  Alan L. Porter,et al.  Patent Profiling for Competitive Advantage , 2004 .

[22]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[23]  Koenraad Debackere,et al.  Do science-technology interactions pay off when developing technology? , 2004, Scientometrics.

[24]  Nivio Ziviani,et al.  Link-based similarity measures for the classification of Web documents , 2006 .

[25]  Beth Dawson,et al.  Basic & Clinical Biostatistics , 1990 .

[26]  Peter D. Karp,et al.  EcoCyc: Encyclopedia of Escherichia coli genes and metabolism , 1998, Nucleic Acids Res..

[27]  Amanda Spink,et al.  A comparison of foreign authorship distribution in JASIST and the Journal of Documentation , 2002, J. Assoc. Inf. Sci. Technol..

[28]  Jon M. Kleinberg,et al.  The small-world phenomenon: an algorithmic perspective , 2000, STOC '00.

[29]  Wolfgang Glänzel,et al.  On the possibility and reliability of predictions based on stochastic citation processes , 2006, Scientometrics.

[30]  Wolfgang Glänzel,et al.  Science in Scandinavia: A Bibliometric Approach , 2004, Scientometrics.

[31]  Hongyuan Zha,et al.  Web document clustering using hyperlink structures , 2001 .

[32]  Haesun Park,et al.  Structure Preserving Dimension Reduction for Clustered Text Data Based on the Generalized Singular Value Decomposition , 2003, SIAM J. Matrix Anal. Appl..

[33]  G. Stolovitzky Gene selection in microarray data: the elephant, the blind men and our algorithms. , 2003, Current opinion in structural biology.

[34]  J. R. Cole,et al.  Scientific output and recognition: a study in the operation of the reward system in science. , 1967, American sociological review.

[35]  Tamara G. Kolda,et al.  Higher-order Web link analysis using multilinear algebra , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[36]  Alfred J. Lotka,et al.  The frequency distribution of scientific productivity , 1926 .

[37]  Seung-won Hwang,et al.  Clustering high dimensional massive scientific datasets , 2001, Proceedings Thirteenth International Conference on Scientific and Statistical Database Management. SSDBM 2001.

[38]  G. Zipf,et al.  Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. , 1949 .

[39]  Mark E. J. Newman,et al.  The Structure and Function of Complex Networks , 2003, SIAM Rev..

[40]  András Schubert,et al.  The Web of Scientometrics , 2004, Scientometrics.

[41]  Bart Selman,et al.  Tracking evolving communities in large linked networks , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[42]  Loet Leydesdorff,et al.  Has Price's dream come true: Is scientometrics a hard science? , 1994, Scientometrics.

[43]  Tamara G. Kolda,et al.  Temporal Analysis of Social Networks using Three-way DEDICOM , 2006 .

[44]  Radosvet Todorov,et al.  Mapping Australian geophysics: A co-heading analysis , 1990, Scientometrics.

[45]  Z. Neda,et al.  Measuring preferential attachment in evolving networks , 2001, cond-mat/0104131.

[46]  W. Powell,et al.  Interorganizational Collaboration and the Locus of Innovation: Networks of Learning in Biotechnology. , 1996 .

[47]  Olaf Wolkenhauer,et al.  Analysis of DNA microarray data. , 2004, Current topics in medicinal chemistry.

[48]  Albert,et al.  Emergence of scaling in random networks , 1999, Science.

[49]  V. Rodriguez,et al.  Material transfer agreements: open science vs. proprietary claims , 2005, Nature Biotechnology.

[50]  Jacques Michel,et al.  Patent citation analysis.A closer look at the basic input data from patent search reports , 2001, Scientometrics.

[51]  Isabel Gómez,et al.  Advantages and limitations in the use of impact factor measures for the assessment of research performance , 2002, Scientometrics.

[52]  Howard D. White,et al.  Pathfinder networks and author cocitation analysis: A remapping of paradigmatic information scientists , 2003, J. Assoc. Inf. Sci. Technol..

[53]  Sharon L. Milgram,et al.  The Small World Problem , 1967 .

[54]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[55]  L. Jin,et al.  Limitations of the evolutionary parsimony method of phylogenetic analysis. , 1990, Molecular biology and evolution.

[56]  Miles Efron,et al.  Eigenvalue-based model selection during latent semantic indexing , 2005, J. Assoc. Inf. Sci. Technol..

[57]  B. Dousset Innovation and network structural dynamics: Study of the alliance network of a major sector of the biotechnology industry , 2005 .

[58]  Wolfgang Glänzel,et al.  Towards a Bibliometrics-Aided Data Retrieval for Scientometric Purposes , 2006 .

[59]  Henry G. Small,et al.  Co-citation in the scientific literature: A new measure of the relationship between two documents , 1973, J. Am. Soc. Inf. Sci..

[60]  A. Lesk,et al.  The relation between the divergence of sequence and structure in proteins. , 1986, The EMBO journal.

[61]  P. Higgs RNA secondary structure: physical and computational aspects , 2000, Quarterly Reviews of Biophysics.

[62]  Thed N. van Leeuwen,et al.  Towards appropriate indicators of journal impact , 1999, Scientometrics.

[63]  Edward A. Fox,et al.  Intelligent fusion of structural and citation-based evidence for text classification , 2005, SIGIR '05.

[64]  A. Barabasi,et al.  Evolution of the social network of scientific collaborations , 2001, cond-mat/0104162.

[65]  Marie-Francine Moens,et al.  Automatic Indexing and Abstracting of Document Texts , 2000, Computational Linguistics.

[66]  E. Garfield The history and meaning of the journal impact factor. , 2006, JAMA.

[67]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[68]  Bart De Moor,et al.  Generalizations of the Singular Value and QR-Decompositions , 1992, SIAM J. Matrix Anal. Appl..

[69]  Wolfgang Glänzel,et al.  On the h-index - A mathematical approach to a new measure of publication activity and citation impact , 2006, Scientometrics.

[70]  L. Hubert,et al.  Comparing partitions , 1985 .

[71]  Bart De Moor,et al.  Towards mapping library and information science , 2006, Inf. Process. Manag..

[72]  K. C. Garg,et al.  Scientometrics of the international journal Scientometrics , 2004, Scientometrics.

[73]  Dragomir R. Radev,et al.  LexRank: Graph-based Centrality as Salience in Text Summarization , 2004 .

[74]  Song Wang,et al.  A generalized likelihood ratio test to identify differentially expressed genes from microarray data , 2004, Bioinform..

[75]  Otis Gospodnetic,et al.  Lucene in Action , 2004 .

[76]  Masaru Kitsuregawa,et al.  Creating a Web community chart for navigating related communities , 2001, Hypertext.

[77]  Wolfgang Glänzel,et al.  On the Opportunities and Limitations of the H-index , 2006 .

[78]  Miguel A. Andrade-Navarro,et al.  Automated genome sequence analysis and annotation , 1999, Bioinform..

[79]  Chris H. Q. Ding,et al.  Link Analysis: Hubs and Authorities on the World Wide Web , 2004, SIAM Rev..

[80]  S. Dongen A cluster algorithm for graphs , 2000 .

[81]  L. Hedges,et al.  Statistical Methods for Meta-Analysis , 1987 .

[82]  R. Albert,et al.  The large-scale organization of metabolic networks , 2000, Nature.

[83]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[84]  Michel Zitt,et al.  Patents and Publications , 2004 .

[85]  Yen-Jen Oyang,et al.  Relevant term suggestion in interactive web search based on contextual information in query session logs , 2003, J. Assoc. Inf. Sci. Technol..

[86]  Jon M. Kleinberg,et al.  Inferring Web communities from link topology , 1998, HYPERTEXT '98.

[87]  S. Strogatz Exploring complex networks , 2001, Nature.

[88]  D. T. Jones,et al.  A new approach to protein fold recognition , 1992, Nature.

[89]  G. W. Milligan,et al.  A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. , 1986, Multivariate behavioral research.

[90]  Shlomo Moran,et al.  SALSA: the stochastic approach for link-structure analysis , 2001, TOIS.

[91]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[92]  Eszter Hargittai,et al.  Beyond logs and surveys: In-depth measures of people's web use skills , 2002, J. Assoc. Inf. Sci. Technol..

[93]  Greg Schohn,et al.  Less is More: Active Learning with Support Vector Machines , 2000, ICML.

[94]  Wolfgang Glänzel,et al.  Combining full-text analysis and bibliometric indicators , 2004 .

[95]  Michael S. Waterman,et al.  General methods of sequence comparison , 1984 .

[96]  Gary Marchionini Co-evolution of user and organizational interfaces: A longitudinal case study of WWW dissemination of national statistics , 2002, J. Assoc. Inf. Sci. Technol..

[97]  Robert L. Goldstone,et al.  The simultaneous evolution of author and paper networks , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[98]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[99]  Michel Zitt,et al.  Development of a method for detection and trend analysis of research fronts built by lexical or cocitation analysis , 1994, Scientometrics.

[100]  H. Zha,et al.  A tree of generalizations of the ordinary singular value decomposition , 1991 .

[101]  Olle Persson All author citations versus first author citations , 2004, Scientometrics.

[102]  W. Scott Spangler,et al.  Clustering hypertext with applications to web searching , 2000, HYPERTEXT '00.

[103]  Ronald N. Kostoff,et al.  The use and misuse of citation analysis in research evaluation , 1998, Scientometrics.

[104]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[105]  G. B. A. Barab'asi Competition and multiscaling in evolving networks , 2000, cond-mat/0011029.

[106]  Berthier A. Ribeiro-Neto,et al.  Combining link-based and content-based methods for web document classification , 2003, CIKM '03.

[107]  Heikki Mannila,et al.  Principles of Data Mining , 2001, Undergraduate Topics in Computer Science.

[108]  Jon M. Kleinberg,et al.  Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text , 1998, Comput. Networks.

[109]  D. Lipman,et al.  Rapid and sensitive protein similarity searches. , 1985, Science.

[110]  Michael Zuker,et al.  Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information , 1981, Nucleic Acids Res..

[111]  W. Allen Wallis,et al.  Compounding Probabilities from Independent Significance Tests , 1942 .

[112]  Gobinda G. Chowdhury,et al.  Bibliometric cartography of information retrieval research by using co-word analysis , 2001, Inf. Process. Manag..

[113]  Koenraad Debackere,et al.  The Role of Academic Technology Transfer Organizations in Improving Industry Science Links , 2005 .

[114]  J. D. Thompson,et al.  Multiple alignment of complete sequences (MACS) in the post-genomic era. , 2001, Gene.

[115]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[116]  Peter J. Park,et al.  Comparing expression profiles of genes with similar promoter regions , 2002, Bioinform..

[117]  M. M. Kessler Bibliographic coupling between scientific papers , 1963 .

[118]  Mark Gerstein,et al.  Analyzing cellular biochemistry in terms of molecular networks. , 2003, Annual review of biochemistry.

[119]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[120]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[121]  Hao Xiong,et al.  Network-based regulatory pathways analysis , 2004, Bioinform..

[122]  D. Lipman,et al.  Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[123]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[124]  A. Raan THE NEURAL NET OF NEURAL NETWORK RESEARCH AN EXERCISE IN BIBLIOMETRIC MAPPING , 2022 .

[125]  Grigorii Pivovarov,et al.  EqRank: a self-consistent equivalence relation on graph vertexes , 2003, SKDD.

[126]  Charles Van Loan,et al.  A General Matrix Eigenvalue Algorithm , 1975 .

[127]  Bart De Moor,et al.  Integration of textual content and link information for accurate clustering of science fields , 2006 .

[128]  Barbara Stefaniak Periodical literature of information science as reflected in Referativnyj Zhurnal, section 59, informatika , 2005, Scientometrics.

[129]  Bluma C. Peritz On the Objectives of Citation Analysis: Problems of Theory and Method , 1992 .

[130]  A. Barabasi,et al.  Quantifying social group evolution , 2007, Nature.

[131]  D. Aksnes CHARACTERISTICS OF HIGHLY CITED PAPERS , 2003 .

[132]  Gary G. Yen,et al.  Time line visualization of research fronts , 2003, J. Assoc. Inf. Sci. Technol..

[133]  Kari Torkkola,et al.  Discriminative features for text document classification , 2003, Formal Pattern Analysis & Applications.

[134]  Koenraad Debackere,et al.  ‘Triad’ or ‘tetrad’? On global changes in a dynamic world , 2008, Scientometrics.

[135]  M. Zuker On finding all suboptimal foldings of an RNA molecule. , 1989, Science.

[136]  J. Devereux,et al.  A comprehensive set of sequence analysis programs for the VAX , 1984, Nucleic Acids Res..

[137]  Huan Liu,et al.  CubeSVD: a novel approach to personalized Web search , 2005, WWW '05.

[138]  Leszek Rychlewski,et al.  Improving the quality of twilight‐zone alignments , 2000, Protein science : a publication of the Protein Society.

[139]  Magnus Sahlgren,et al.  An Introduction to Random Indexing , 2005 .

[140]  Wolfgang Glänzel,et al.  A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics , 2009, Scientometrics.

[141]  Anton J. Enright,et al.  BioLayout-an automatic graph layout algorithm for similarity visualization , 2001, Bioinform..

[142]  J. E. Hirsch,et al.  An index to quantify an individual's scientific research output , 2005, Proc. Natl. Acad. Sci. USA.

[143]  Gary G Yen,et al.  Crossmaps: Visualization of overlapping relationships in collections of journal papers , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[144]  Nello Cristianini,et al.  Composite Kernels for Hypertext Categorisation , 2001, ICML.

[145]  J V Maizel,et al.  SequenceEditingAligner: a multiple sequence editor and aligner. , 1990, Genetic analysis, techniques and applications.

[146]  Chaomei Chen,et al.  Patents, citations & innovations: A window on the knowledge economy , 2003, J. Assoc. Inf. Sci. Technol..

[147]  I. Tinoco,et al.  How RNA folds. , 1999, Journal of molecular biology.

[148]  Bart De Moor,et al.  A hybrid mapping of information science , 2008, Scientometrics.

[149]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[150]  M. Bulyk Computational prediction of transcription-factor binding site locations , 2003, Genome Biology.

[151]  András Schubert,et al.  Cognitive Changes in Scientometrics during the 1980s, as Reflected by the Reference Patterns of its Core Journal , 1993 .

[152]  Amos Bairoch,et al.  The PROSITE database, its status in 1999 , 1999, Nucleic Acids Res..

[153]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[154]  Filippo Menczer,et al.  Evolution of document networks , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[155]  David C. Jones,et al.  CATH--a hierarchic classification of protein domain structures. , 1997, Structure.

[156]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..

[157]  Ronald N. Kostoff,et al.  The Hidden Structure of Neuropsychology: Text Mining of the Journal Cortex: 1991-2001 , 2005, Cortex.

[158]  Paola Sebastiani,et al.  Statistical Challenges in Functional Genomics , 2003 .

[159]  David Posada,et al.  MODELTEST: testing the model of DNA substitution , 1998, Bioinform..

[160]  J. ANTHONYF. Reference-based publication networks with episodic memories , 2005 .

[161]  Ronald N. Kostoff,et al.  Factor matrix text filtering and clustering: Research Articles , 2005 .

[162]  C. Eckart,et al.  The approximation of one matrix by another of lower rank , 1936 .

[163]  Wolfgang Glänzel,et al.  National characteristics in international scientific co-authorship relations , 2004, Scientometrics.

[164]  Rickard Cöster,et al.  Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization , 2004, COLING.

[165]  Richard M. Everson,et al.  When Are Links Useful? Experiments in Text Classification , 2003, ECIR.

[166]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[167]  Jian Zhang,et al.  The Protein Information Resource: an integrated public resource of functional annotation of proteins , 2002, Nucleic Acids Res..

[168]  Wolfgang Glänzel,et al.  Inflationary bibliometric values: The role of scientific collaboration and the need for relative indicators in evaluative studies , 2004, Scientometrics.

[169]  Jean Pierre Courtial,et al.  A coword analysis of scientometrics , 1994, Scientometrics.

[170]  G. Golub,et al.  The restricted singular value decomposition: properties and applications , 1991 .

[171]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[172]  R. Ozawa,et al.  A comprehensive two-hybrid analysis to explore the yeast protein interactome , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[173]  Arnold W. M. Smeulders,et al.  Active learning using pre-clustering , 2004, ICML.

[174]  Loet Leydesdorff,et al.  Network Structure, Self-Organization and the Growth of International Collaboration in Science.Research Policy, 34(10), 2005, 1608-1618. , 2005, 0911.4299.

[175]  Wolfgang Glänzel,et al.  Chemistry research in Eastern Central Europe (1992-1997): Facts and figures on publication output and citation impact , 2000 .

[176]  Jean Pierre Courtial,et al.  Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemsitry , 1991, Scientometrics.

[177]  April Kontostathis,et al.  Essential Dimensions of Latent Semantic Indexing (LSI) , 2007, 2007 40th Annual Hawaii International Conference on System Sciences (HICSS'07).

[178]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[179]  M. Newman,et al.  The structure of scientific collaboration networks. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[180]  Gobinda G. Chowdhury,et al.  Mapping the intellectual structure of information retrieval studies: an author co-citation analysis, 1987-1997 , 1999, J. Inf. Sci..

[181]  John P. Huelsenbeck,et al.  MRBAYES: Bayesian inference of phylogenetic trees , 2001, Bioinform..

[182]  Everard Christiaan Marie Noyons,et al.  Bibliometric mapping as a science policy and research management tool , 1999 .

[183]  D. Turner,et al.  Improved free-energy parameters for predictions of RNA duplex stability. , 1986, Proceedings of the National Academy of Sciences of the United States of America.

[184]  Loet Leydesdorff,et al.  The university-industry knowledge relationship: Analyzing patents and the science base of technologies , 2004, J. Assoc. Inf. Sci. Technol..

[185]  D. Watts,et al.  Small Worlds: The Dynamics of Networks between Order and Randomness , 2001 .

[186]  Anthony F. J. van Raan,et al.  Mapping co-word structures: A comparison of multidimensional scaling and leximappe , 1989, Scientometrics.

[187]  Michael I. Jordan,et al.  Stable algorithms for link analysis , 2001, SIGIR '01.

[188]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[189]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[190]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[191]  R. Merton The Matthew Effect in Science , 1968, Science.

[192]  Ka Yee Yeung,et al.  Principal component analysis for clustering gene expression data , 2001, Bioinform..

[193]  Wolfgang Glänzel,et al.  Domain Study 'Nanotechnology: Analysis of an Emerging Domain of Scientific and Technological Endeavour' , 2003 .

[194]  J. Chang,et al.  Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition , 1970 .

[195]  R. Staden Searching for patterns in protein and nucleic acid sequences. , 1990, Methods in enzymology.

[196]  Anders Holst,et al.  Random indexing of text samples for latent semantic analysis , 2000 .

[197]  Philip Ball,et al.  Index aims for fair ranking of scientists , 2005, Nature.

[198]  B. C. Griffith,et al.  The Structure of Scientific Literatures I: Identifying and Graphing Specialties , 1974 .

[199]  Jerome H. Friedman,et al.  On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality , 2004, Data Mining and Knowledge Discovery.

[200]  J. Sabina,et al.  Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. , 1999, Journal of molecular biology.

[201]  D. Price Little Science, Big Science , 1965 .

[202]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[203]  P. Rouzé,et al.  Current methods of gene prediction, their strengths and weaknesses. , 2002, Nucleic acids research.

[204]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[205]  Magnus Sahlgren Towards a Flexible Model of Word Meaning , 2002, AAAI 2002.

[206]  E. Garfield Citation Indexing for Studying Science , 1970, Nature.

[207]  Albert-László Barabási,et al.  Statistical mechanics of complex networks , 2001, ArXiv.

[208]  Bart De Moor,et al.  Co-clustering approaches to integrate lexical and bibliographical information , 2005 .

[209]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[210]  Marie-Francine Moens,et al.  Abstracting of legal cases: the potential of clustering based on the selection of representative objects , 1999 .

[211]  Chaomei Chen,et al.  Searching for intellectual turning points: Progressive knowledge domain visualization , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[212]  R Apweiler,et al.  Clustering and analysis of protein families. , 2001, Current opinion in structural biology.

[213]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[214]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[215]  Masaru Kitsuregawa,et al.  Evaluating contents-link coupled web page clustering for web search results , 2002, CIKM '02.

[216]  Gene H. Golub,et al.  Matrix computations , 1983 .

[217]  M. Callon,et al.  From translations to problematic networks: An introduction to co-word analysis , 1983 .

[218]  Quentin L. Burrell,et al.  Hirsch's h-index: A stochastic model , 2007, J. Informetrics.

[219]  Richard A. Harshman,et al.  Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-model factor analysis , 1970 .

[220]  Wolfgang Glänzel,et al.  A new methodological approach to bibliographic coupling and its application to the national, regional and institutional level , 2005, Scientometrics.

[221]  Ellen Bonnevie-Nebelong,et al.  A multifaceted portrait of a library and information science journal: the case of the Journal of Information Science , 2003, J. Inf. Sci..

[222]  Irina Marshakova-Shaikevich,et al.  Bibliometric maps of field of science , 2005, Inf. Process. Manag..

[223]  P. Brown,et al.  A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. , 1996, Genome research.

[224]  Bart De Moor,et al.  Advanced personalization and document retrieval techniques in support of efficient knowledge management , 2004 .

[225]  Michel Zitt,et al.  Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences , 2006, Inf. Process. Manag..

[226]  Steven Henikoff,et al.  PATMAT: a searching and extraction program for sequence, pattern and block queries and databases , 1992, Comput. Appl. Biosci..

[227]  E GARFIELD,et al.  Citation indexes for science; a new dimension in documentation through association of ideas. , 2006, Science.

[228]  Michael W. Berry,et al.  Survey of Text Mining , 2003, Springer New York.

[229]  S. N. Dorogovtsev,et al.  Evolution of networks , 2001, cond-mat/0106144.

[230]  Santosh S. Vempala,et al.  Latent semantic indexing: a probabilistic analysis , 1998, PODS '98.

[231]  M. Newman Analysis of weighted networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[232]  Martin Meyer,et al.  Towards hybrid Triple Helix indicators: A study of university-related patents and a survey of academic inventors , 2003, Scientometrics.

[233]  Michael W. Berry,et al.  A Case Study of Latent Semantic Indexing , 1995 .

[234]  Marie-Angèle De Looze,et al.  Corpus relevance through co-word analysis: An application to plant proteints , 1997, Scientometrics.

[235]  Bart De Moor,et al.  Application of HITS algorithms to detect terms and sentences with high saliency scores , 2003 .

[236]  P Bucher,et al.  Compilation and analysis of eukaryotic POL II promoter sequences. , 1986, Nucleic acids research.

[237]  Anthony F. J. van Raan,et al.  The neural net of neural network research , 2005, Scientometrics.

[238]  Chaomei Chen,et al.  Review of Patents, citations & innovations: a window on the knowledge economy by Adam B. Jaffe & Manuel Trajtenberg. Cambridge, MA: The MIT Press, 2002 , 2003 .

[239]  Norman Kaplan,et al.  The Sociology of Science: Theoretical and Empirical Investigations , 1974 .

[240]  E. Uberbacher,et al.  Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[241]  Michel Zitt,et al.  A simple method for dynamic scientometrics using lexical analysis , 1991, Scientometrics.

[242]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[243]  Derek de Solla Price,et al.  A general theory of bibliometric and other cumulative advantage processes , 1976, J. Am. Soc. Inf. Sci..

[244]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[245]  Michael May,et al.  Data Mining and Text Mining for Science & Technology Research , 2004 .

[246]  R. Griffey,et al.  Computational methods for RNA structure determination. , 2001, Current opinion in structural biology.

[247]  William M. Pottenger,et al.  A framework for understanding Latent Semantic Indexing (LSI) performance , 2006, Inf. Process. Manag..

[248]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[249]  Arie Rip,et al.  Co-word maps of biotechnology: An example of cognitive scientometrics , 1984, Scientometrics.

[250]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[251]  Jan O. Korbel,et al.  Combining frequency and positional information to predict transcription factor binding sites , 2001, Bioinform..

[252]  Sujit Bhattacharya,et al.  Mapping a research area at the micro level using co-word analysis , 1998, Scientometrics.

[253]  Samuel Kaski,et al.  Dimensionality reduction by random mapping: fast similarity computation for clustering , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence.

[254]  Berthier A. Ribeiro-Neto,et al.  Local versus global link information in the Web , 2003, TOIS.

[255]  Lingchong You,et al.  Toward computational systems biology , 2007, Cell Biochemistry and Biophysics.

[256]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[257]  Charlene L. Al-Qallaf,et al.  Citation patterns in the Kuwaiti journal Medical Principles and Practice: The first 12 years, 1989-2000 , 2003, Scientometrics.

[258]  Michael I. Jordan,et al.  Link Analysis, Eigenvectors and Stability , 2001, IJCAI.

[259]  W. Doolittle,et al.  Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. , 2003, Molecular biology and evolution.

[260]  Rajeev Motwani,et al.  The PageRank Citation Ranking: Bringing Order to the Web , 1999, WWW 1999.

[261]  Kathleen Marchal,et al.  M@cbeth: a Microarray Classification Benchmarking Tool , 2005 .

[262]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[263]  Ronald N. Kostoff,et al.  Text mining using database tomography and bibliometrics: A review , 2001 .

[264]  Amy Nicole Langville,et al.  A Survey of Eigenvector Methods for Web Information Retrieval , 2005, SIAM Rev..

[265]  M. Newman,et al.  Finding community structure in networks using the eigenvectors of matrices. , 2006, Physical review. E, Statistical, nonlinear, and soft matter physics.

[266]  J. Lafferty,et al.  Mixed-membership models of scientific publications , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[267]  Chris H. Q. Ding,et al.  Adaptive dimension reduction for clustering high dimensional data , 2002, 2002 IEEE International Conference on Data Mining.

[268]  Hildrun Kretschmer,et al.  Author productivity and geodesic distance in bibliographic co-authorship networks, and visibility on the Web , 2004, Scientometrics.

[269]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[270]  W. Glänzel,et al.  Domain Study "Biotechnology": An Analysis Based on Publications and Patents , .

[271]  J. Butler,et al.  AutoDimer: a screening tool for primer-dimer and hairpin structures. , 2004, BioTechniques.

[272]  Bart Selman,et al.  Natural communities in large linked networks , 2003, KDD '03.

[273]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[274]  Alfonso Valencia,et al.  Early bioinformatics: the birth of a discipline - a personal view , 2003, Bioinform..

[275]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[276]  Tamara G. Kolda,et al.  Multilinear Algebra for Analyzing Data with Multiple Linkages , 2006, Graph Algorithms in the Language of Linear Algebra.

[277]  Miguel A. Andrade-Navarro,et al.  Evolving research trends in bioinformatics , 2006, Briefings Bioinform..

[278]  Jonathan Furner,et al.  Scholarly communication and bibliometrics , 2005, Annu. Rev. Inf. Sci. Technol..

[279]  D. Botstein,et al.  Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[280]  Hongyuan Zha,et al.  Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering , 2002, SIGIR '02.

[281]  Massih-Reza Amini,et al.  Active, Semi-Supervised Learning for Textual Information Access , .

[282]  M. Kanehisa,et al.  Reconstruction of amino acid biosynthesis pathways from the complete genome sequence. , 1998, Genome research.

[283]  M. Zuker Calculating nucleic acid secondary structure. , 2000, Current opinion in structural biology.

[284]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[285]  G. Böhm,et al.  New approaches in molecular structure prediction. , 1996, Biophysical chemistry.

[286]  T. Smith,et al.  Optimal sequence alignments. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[287]  B. Huberman,et al.  The Deep Web: Surfacing Hidden Value , 2000 .

[288]  Jean Pierre Courtial,et al.  Qualitative models, quantitative tools and network analysis , 1989, Scientometrics.

[289]  Jean Garnier,et al.  FORESST: fold recognition from secondary structure predictions of proteins , 1999, Bioinform..

[290]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[291]  Alexander Schliep,et al.  Selecting signature oligonucleotides to identify organisms using DNA arrays , 2002, Bioinform..

[292]  Sujit Bhattacharya,et al.  Mapping inventive activity and technological change through patent analysis: A case study of India and China , 2004, Scientometrics.

[293]  E. Rivas,et al.  A dynamic programming algorithm for RNA structure prediction including pseudoknots. , 1998, Journal of molecular biology.

[294]  Mark Abrahamson,et al.  The Scientific Community. , 1966 .

[295]  Ed C. M. Noyons,et al.  Bibliometric mapping of science in a policy context , 2004, Scientometrics.

[296]  H. O. Lancaster  The combination of probabilities arising from data in discrete distributions. , 1949, Biometrika.

[297]  M. Newman Coauthorship networks and patterns of scientific collaboration , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[298]  Cheng-Yan Kao,et al.  An evolutionary approach for gene expression patterns , 2004, IEEE Transactions on Information Technology in Biomedicine.

[299]  Anthony F. J. van Raan,et al.  Bibliometric cartography of scientific and technological developments of an R & D field , 1994, Scientometrics.

[300]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[301]  Michael W. Berry,et al.  Algorithms and applications for approximate nonnegative matrix factorization , 2007, Comput. Stat. Data Anal..

[302]  Jean Garnier,et al.  The protein structure code: what is its present status? , 1991, Comput. Appl. Biosci..

[303]  Wolfgang Glänzel,et al.  A relational charting approach to the world of basic research in twelve science fields at the end of the second millennium , 2004, Scientometrics.

[304]  Ted Dunning,et al.  Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[305]  Bart De Moor,et al.  On the Structure of Generalized Singular Value and QR Decompositions , 1994 .

[306]  Mineichi Kudo,et al.  Non-parametric classifier-independent feature selection , 2006, Pattern Recognit..

[307]  Henk F. Moed,et al.  Mapping of science by combined co-citation and word analysis, I. Structural aspects , 1991, J. Am. Soc. Inf. Sci..

[308]  O. Gotoh,et al.  Multiple sequence alignment: algorithms and applications. , 1999, Advances in biophysics.

[309]  Krishna Bharat,et al.  Improved algorithms for topic distillation in a hyperlinked environment , 1998, SIGIR '98.

[310]  Mike Thelwall,et al.  The connection between the research of a university and counts of links to its web pages: An investigation based upon a classification of the relationships of pages to the research of the host university , 2003, J. Assoc. Inf. Sci. Technol..

[311]  Michael W. Berry,et al.  Document clustering using nonnegative matrix factorization , 2006, Inf. Process. Manag..

[312]  Henry G. Small,et al.  On the shoulders of Robert Merton: Towards a normative theory of citation , 2004, Scientometrics.

[313]  Bülent Yener,et al.  Modeling and Multiway Analysis of Chatroom Tensors , 2005, ISI.

[314]  L. Tucker,et al.  Some mathematical notes on three-mode factor analysis , 1966, Psychometrika.

[315]  Chaomei Chen,et al.  Visualizing evolving networks: minimum spanning trees versus pathfinder networks , 2003, IEEE Symposium on Information Visualization 2003.

[316]  Paul Van Dooren,et al.  A measure of similarity between graph vertices, with applications to synonym extraction and web searching , 2002 .

[317]  Dennis N. Ocholla,et al.  An informetric investigation of the relatedness of opportunistic infections to HIV/AIDS , 2005, Inf. Process. Manag..

[318]  Steven A. Morris,et al.  Manifestation of emerging specialties in journal literature: A growth model of papers, references, exemplars, bibliographic coupling, cocitation, and clustering coefficient distribution , 2005, J. Assoc. Inf. Sci. Technol..

[319]  S. Redner How popular is your paper? An empirical study of the citation distribution , 1998, cond-mat/9804163.

[320]  Bart De Moor,et al.  Do material transfer agreements affect the choice of research agendas? The case of biotechnology in Belgium , 2007, Scientometrics.

[321]  Frédéric Delsuc,et al.  Molecular systematics of armadillos (Xenarthra, Dasypodidae): contribution of maximum likelihood and Bayesian analyses of mitochondrial and nuclear genes. , 2003, Molecular phylogenetics and evolution.

[322]  Walter Daelemans,et al.  Memory-Based Language Processing (Studies in Natural Language Processing) , 2005 .

[323]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[324]  Torsten Schlieder,et al.  Querying and ranking XML documents , 2002, J. Assoc. Inf. Sci. Technol..

[325]  Leo Egghe,et al.  An informetric model for the Hirsch-index , 2006, Scientometrics.

[326]  Loet Leydesdorff Why words and co‐words cannot map the development of the sciences , 1997 .

[327]  Wolfgang Glänzel,et al.  Two decades of "Scientometrics". An interdisciplinary field represented by its leading journal , 2004, Scientometrics.

[328]  Michael W. Berry,et al.  Computational information retrieval , 2001 .

[329]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[330]  Chenn-Jung Huang,et al.  Application of Probabilistic Neural Networks to the Class Prediction of Leukemia and Embryonal Tumor of Central Nervous System , 2004, Neural Processing Letters.

[331]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[332]  Rainer Fuchs,et al.  Topology of gene expression networks as revealed by data mining and modeling , 2003, Bioinform..

[333]  Isabelle Guyon,et al.  A Stability Based Method for Discovering Structure in Clustered Data , 2001, Pacific Symposium on Biocomputing.

[334]  Ulrik Brandes,et al.  Network Analysis: Methodological Foundations (Lecture Notes in Computer Science) , 2005 .