Paper Information - Information Processing and Retrieval from CSV File by Natural Language

Comma Separated Value (CSV) files are widely used as a fundamental data format. Because of their simple structure and ease of creation, many of the data files that are published as open data and used by organizations are stored as CSV. However, searching for or retrieving specific data from CSV files is limited by the traditional keyword-matching technique, which cannot express search conditions or apply any processing to the data found. This paper presents a new model that allows users to retrieve information from CSV files using natural language, the language they are familiar with and use in everyday life. Users can specify conditions for data retrieval and processing to produce the information they need. This helps non-technical users retrieve information without having to learn any additional computer languages or programs. The research data consists of natural language messages collected from various sources, both online and offline, to cover both formal and semi-formal language levels. By using natural language processing together with techniques such as semantic patterns, ontology, and an interactive conversation system, the model can analyze the completeness and meaning of natural language statements, allows users to edit incomplete or faulty statements, and can be improved by adding new words, sentence syntaxes, and semantic patterns for more accurate results. The model was evaluated by 98 testers, who entered 1,137 natural language statements. The results showed that the model was effective in retrieving and processing data accurately, with precision, recall, and F-score all higher than 0.9. Only 18 statements, or 3.2% of all statements, produced errors in the outputs. These errors were caused by input problems of three kinds: missing letters that change a word's meaning, the use of ambiguous words, and words placed in the wrong position in the natural language statement.
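
To make the idea of retrieving CSV data under a stated condition concrete, the following is a minimal sketch of pattern-based retrieval. It is not the paper's model: the single sentence pattern, the function name `query_csv`, the comparison phrases, and the sample data are all assumptions made for illustration, and the ontology and conversation components described in the abstract are not represented.

```python
import csv
import io
import operator
import re

# Hypothetical semantic pattern (an assumption, not taken from the paper):
#   "show <column> where <column> <comparison> <number>"
PATTERN = re.compile(
    r"show (?P<target>\w+) where (?P<field>\w+) "
    r"(?P<op>greater than|less than|equal to) (?P<value>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)

# Map each comparison phrase to the function that implements it.
OPERATORS = {
    "greater than": operator.gt,
    "less than": operator.lt,
    "equal to": operator.eq,
}


def query_csv(statement, csv_text):
    """Return the matching values, or None if the statement is not understood."""
    match = PATTERN.fullmatch(statement.strip())
    if match is None:
        return None  # statement does not fit the known pattern

    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if match["target"] not in columns or match["field"] not in columns:
        return None  # statement refers to a column the file does not have

    compare = OPERATORS[match["op"].lower()]
    threshold = float(match["value"])
    return [
        row[match["target"]]
        for row in reader
        if compare(float(row[match["field"]]), threshold)
    ]


# Illustrative data only.
SAMPLE = "name,age\nAnn,34\nBob,27\nCho,41\n"
print(query_csv("show name where age greater than 30", SAMPLE))
print(query_csv("names of old people", SAMPLE))
```

In this sketch a statement that fits no pattern yields `None`. That is the point at which an interactive system of the kind the abstract describes could ask the user to edit the statement or register a new pattern.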

Chalermpol Tapsai
