Heuristic method for searches on large data-sets organised using network models

Searching large data-sets has become an important problem in recent years. One approach that has achieved good results relies on data mining techniques, such as cluster-based retrieval. This paper proposes a heuristic search based on an organisational model that reflects similarity relationships among data elements. The search is guided by quality estimators for the model nodes, which are obtained by progressively evaluating the given target function on the elements associated with each node. Experimental results confirm the effectiveness of the proposed algorithm: high-quality solutions are obtained by evaluating a relatively small percentage of the elements in the data-sets.
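To make the idea of a search guided by node quality estimators concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes that each model node holds a list of elements, that a node's quality estimate is the running mean of the target values evaluated so far in that node, and that the search stops after a fixed evaluation budget. The names `guided_search`, `clusters`, `budget` and `explore` are hypothetical and do not come from the source.

```python
import random


def guided_search(clusters, target, budget, explore=0.1, seed=0):
    """Return (best_element, best_score, evaluations_used).

    clusters: dict mapping a node id to the elements assigned to it (assumed layout).
    target:   function to maximise, evaluated on single elements.
    budget:   maximum number of target evaluations (assumed stopping rule).
    explore:  probability of picking a random node instead of the best one.
    """
    rng = random.Random(seed)
    # Elements of each node that have not been evaluated yet.
    pending = {node: list(items) for node, items in clusters.items() if items}
    for items in pending.values():
        rng.shuffle(items)
    totals = {node: 0.0 for node in pending}  # sum of scores seen per node
    counts = {node: 0 for node in pending}    # evaluations done per node
    best, best_score, used = None, float("-inf"), 0

    def evaluate(node):
        """Evaluate one pending element of `node` and update its estimator."""
        nonlocal best, best_score, used
        element = pending[node].pop()
        score = target(element)
        totals[node] += score
        counts[node] += 1
        used += 1
        if score > best_score:
            best, best_score = element, score
        if not pending[node]:
            del pending[node]  # node exhausted, nothing left to evaluate

    # Give every node an initial estimate from a single evaluation.
    for node in list(pending):
        if used >= budget:
            break
        evaluate(node)

    # Progressive phase: spend the remaining budget on promising nodes.
    while pending and used < budget:
        if rng.random() < explore:
            node = rng.choice(list(pending))
        else:
            node = max(pending, key=lambda n: totals[n] / counts[n])
        evaluate(node)

    return best, best_score, used
```

Under these assumptions, a call such as `guided_search(clusters, target, budget=50)` evaluates at most 50 elements and concentrates them on the nodes whose sampled elements have scored best. The running mean and the random exploration step are illustrative choices; any estimator that is updated as evaluations accumulate could take their place.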
