Intelligent System for Semantically Similar Sentences Identification and Generation Based on Machine Learning Methods

The task of generating semantically similar sentences can be reduced to two subtasks: generating text and verifying that the generated text is semantically similar to a sample sentence. This article describes the main technical aspects of solving this problem and proposes solutions for the algorithmic, functional, and software components of an application that identifies and generates semantically similar sentences. Existing algorithms and their basic principles of operation were analyzed, in particular methods for the semantic comparison of sentences, and their advantages and disadvantages were determined. Many methods address the problem, but they have limitations, such as unreliability when the text is slightly changed or paraphrased. The article describes the software implementation of the task and analyzes different approaches to semantic comparison and text generation. The system was also tested on new data, that is, data that was not used to train the model.
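The reduction described above (generate candidates, then keep only those that are semantically close to the sample) can be illustrated with a minimal sketch. It is not the system described in the article: the bag-of-words vectors, the function names, and the 0.5 threshold are all assumptions chosen for illustration, and the candidate sentences are supplied by hand rather than by a generator. A word-overlap measure like this one is also exactly the kind of method that becomes unreliable under paraphrase, since synonyms share no tokens.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Lowercased bag-of-words counts (a stand-in for a learned sentence embedding)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(sample, candidates, threshold=0.5):
    """Verification step: keep candidates whose similarity to the sample
    reaches the (hypothetical) threshold."""
    sample_vec = bow_vector(sample)
    return [
        candidate
        for candidate in candidates
        if cosine_similarity(sample_vec, bow_vector(candidate)) >= threshold
    ]


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    # Hand-written stand-ins for the output of a text generator.
    candidates = ["the cat sat on a mat", "stock prices fell sharply"]
    print(filter_similar(sample, candidates))
```

In a real pipeline the bag-of-words vectors would be replaced by a trained sentence representation, while the generate-then-verify structure stays the same.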
