Pretopology and Topic Modeling for Complex Systems Analysis: Application on Document Classification and Complex Network Analysis. (Prétopologie et modélisation de sujets pour l'analyse de systèmes complexes: application à la classification de documents et à l'analyse de réseaux complexes)

This thesis develops algorithms for document classification on the one hand and complex network analysis on the other, both based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering that combines Topic Modeling and Pretopology. Our contribution uses the topic distributions extracted by topic modeling as input for classification methods. In this approach, we investigate two aspects: determining an appropriate distance between documents by studying the relevance of Probabilistic-Based and Vector-Based Measurements, and forming groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling Complex Networks by reformulating stochastic pretopology, and proposes the Pretopology Cascade Model as a general model for information diffusion. In addition, we propose an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using the author-topic model, and introduce Textual-Homo-IC, an independent cascade model based on resemblance, in which homophily is measured from textual content obtained through Topic Modeling.
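
The contrast between probabilistic-based and vector-based measurements on topic distributions can be illustrated with a minimal sketch. The abstract does not name the specific measures studied, so Jensen-Shannon divergence and Euclidean distance are used here only as representative examples of each family, and the three document-topic distributions are hypothetical values chosen for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute 0 by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded probabilistic measure."""
    # The midpoint m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def euclidean(p, q):
    """Euclidean distance: treats the distributions as plain vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical document-topic distributions over three topics
# (illustrative values, not taken from the thesis).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.2, 0.7]

if __name__ == "__main__":
    for name, other in (("doc_b", doc_b), ("doc_c", doc_c)):
        print(
            f"doc_a vs {name}: "
            f"JS = {jensen_shannon(doc_a, other):.4f}, "
            f"Euclidean = {euclidean(doc_a, other):.4f}"
        )
```

Either function could supply the pairwise distances consumed by a clustering method. Both rank doc_b as closer to doc_a than doc_c on this toy input, although the two families can disagree on other data.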
