Studies on Computational Learning via Discretization

This thesis presents cutting-edge studies on computational learning. The key issue throughout the thesis is the amalgamation of two processes: discretization of continuous objects and learning from such objects as they are provided by data. Machine learning, or data mining and knowledge discovery, has developed rapidly in recent years and is now a major topic not only in research communities but also in business and industry. Discretization is essential for learning from continuous objects such as real-valued data, since every datum obtained by observation in the real world must be discretized, that is, converted from analog (continuous) to digital (discrete) form, before it can be stored in databases and manipulated on computers. However, most machine learning methods pay no attention to this process: they use digital data in actual applications while assuming analog data (usually real vectors) in their theory. To bridge this gap, we examine computational aspects of learning, from theory to practice, across the three parts of this thesis.

Part I addresses theoretical analysis, which forms the disciplined foundation of the thesis. In particular, we analyze learning of figures, that is, nonempty compact sets in Euclidean space, based on the Gold-style learning model, aiming at a computational basis for binary classification of continuous data. We use fractals as a representation system and reveal a learnability hierarchy under various learning criteria, following the traditional analysis of learnability in the Gold-style learning model. We show a mathematical connection between machine learning and fractal geometry by measuring the complexity of learning with the Hausdorff dimension and the VC dimension. Moreover, we analyze computability aspects of learning figures using the framework of Type-2 Theory of Effectivity (TTE).

Part II leads from theory to practice. We start by designing a new measure, the coding divergence, defined in a computational manner to quantify the difference between two sets of data, and go further by solving the typical machine learning tasks of classification and clustering. Specifically, we give two novel clustering algorithms, COOL (COding Oriented cLustering) and BOOL (Binary cOding Oriented cLustering). Experiments show that BOOL is faster than the K-means algorithm and about two to three orders of magnitude faster than two state-of-the-art algorithms that can detect non-convex clusters of arbitrary shapes.

Part III treats more complex problems, semi-supervised learning and preference learning, by benefiting from Formal Concept Analysis (FCA). First, we construct the algorithm SELF (SEmi-supervised Learning via FCA), which performs classification and label ranking of mixed-type data containing both discrete and continuous variables. Finally, we investigate a biological application: we tackle the problem of finding ligand candidates of receptors in databases by formalizing it as multi-label classification, and develop the algorithm LIFT (Ligand FInding via Formal ConcepT Analysis) for the task. We show experimentally that both algorithms achieve competitive performance.
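
To make the analog-to-digital step described above concrete, the following is a minimal illustrative sketch in Python; it is an assumption made for exposition, not code from the thesis. Real-valued data in the unit interval are discretized into finite binary code words by truncating their binary expansions. The names binary_code and discretize and the fixed truncation level are hypothetical; the thesis's own algorithms, such as BOOL and the coding divergence, build on this kind of encoding but are defined in the thesis itself.

# Illustrative sketch (assumed, not taken from the thesis): discretizing
# real-valued data in [0, 1) into finite binary code words.

def binary_code(x, level):
    """Return the first `level` digits of the binary expansion of x in [0, 1)."""
    digits = []
    for _ in range(level):
        x *= 2
        bit = int(x)      # next binary digit: 0 or 1
        digits.append(bit)
        x -= bit
    return digits

def discretize(point, level):
    """Turn a real vector (values in [0, 1)) into one discrete code word."""
    return tuple(bit for coordinate in point for bit in binary_code(coordinate, level))

if __name__ == "__main__":
    # Nearby points receive identical code words at a coarse level and only
    # become distinguishable at a finer level, which is the intuition behind
    # measuring how hard two data sets are to separate from their codes.
    print(discretize((0.3, 0.72), level=3))   # (0, 1, 0, 1, 0, 1)
    print(discretize((0.31, 0.70), level=3))  # (0, 1, 0, 1, 0, 1) -- still identical
    print(discretize((0.31, 0.70), level=8))  # now differs from the code of (0.3, 0.72)

Running the script prints identical code words for the two nearby points at level 3 and distinct ones at level 8, so the level controls how finely the continuous data are resolved.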
