Optimization and Security in Information Retrieval, Extraction, Processing, and Presentation on a Cloud Platform

This paper presents the processing steps needed to build a fully functional vertical search engine. Four actions are identified (retrieval, extraction, presentation, and delivery); together they crawl websites, extract product information from the retrieved webpages, process that data, and let the end user search for products. The whole application flow is designed for low resource usage, especially the delivery action, which consists of a web application that uses cloud resources and is optimized for cost efficiency. Novel methods are identified and explained for representing the crawl and extraction template, for optimizing the product index, and for deploying and storing data in the cloud database. In addition, key aspects of ethics and security in the proposed solution are discussed. A practical use-case scenario is also presented, in which products are extracted from seven online board and card game retailers. Finally, the potential of the proposed solution is discussed in terms of researching new methods for improving its cost efficiency and scalability.
