Unlocking Social Media and User Generated Content as a Data Source for Knowledge Management

The pervasiveness of social media and user-generated content (UGC) has triggered an exponential increase in global data volumes. However, due to collection and extraction challenges, the data in many feeds, embedded comments, reviews and testimonials are inaccessible as a generic data source. This paper incorporates a Knowledge Management framework as a paradigm for knowledge management and data value extraction. The framework embodies solutions to unlock the potential of UGC as a rich, real-time data source for analytical applications. The contributions described in this paper are threefold. Firstly, a method for automatically navigating pagination systems to expose UGC for collection is presented. This method is evaluated using browser emulation integrated with dynamic data collection. Secondly, a new method for collecting social data without any a priori knowledge of the sites is introduced. Finally, a new testbed is developed to reflect the current state of internet sites and is shared publicly to encourage future research. The discussion benchmarks the new algorithm against existing data extraction techniques and provides evidence of the increased amount of UGC data made accessible by the new algorithm.
