Unlocking Analytical Value from Social Media and User Generated Content

The pervasiveness of social media and user-generated content (UGC) has triggered an exponential increase in global data volumes. However, due to collection and extraction challenges, the data in many feeds, embedded comments, reviews and testimonials is inaccessible as a generic data source. This paper adopts a Knowledge Management framework as the paradigm for extracting value from this data. The framework embodies solutions to unlock the potential of UGC as a rich, real-time data source for analytical applications. The contributions described in this paper are threefold. First, a method for automatically navigating pagination systems to expose UGC for collection is presented and evaluated using browser emulation integrated with dynamic data collection. Second, a new method for collecting social data without any a priori knowledge of the sites is introduced. Finally, a new testbed is developed to reflect the current state of internet sites and is shared publicly to encourage future research. The discussion benchmarks the new algorithm against existing data extraction techniques and provides evidence of the increased amount of UGC made accessible by the new algorithm.
