Web and Big Data: 4th International Joint Conference, APWeb-WAIM 2020, Tianjin, China, September 18-20, 2020, Proceedings, Part I

Many applications need to perform classification on large, sparse datasets. Classifying cold-start users, who have provided very little feedback, remains a challenging task. Previous work has applied active learning to classification with partially observed data. However, for large and sparse data, the number of candidate feedback queries is huge, and many of them are invalid because the user cannot answer them. In this paper, we develop an active classification framework that addresses these challenges by leveraging online Matrix Factorization models. We first identify a step-wise data acquisition heuristic that is useful for active classification. We then use the estimates of online Probabilistic Matrix Factorization to compute this heuristic function. To reduce the number of invalid queries, we further estimate, with online Poisson Factorization, the probability that a query can be answered by the cold-start user. During active learning, a query is selected based on the current knowledge learned in these two online factorization models. We demonstrate on real-world movie rating datasets that our framework is highly effective: it not only achieves a larger improvement in classification, but also reduces the number of invalid queries.
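
The query-selection step described above can be illustrated with a minimal sketch. The abstract does not state how the acquisition heuristic and the answer probability are combined, so the multiplication below is an assumption, and the function name `select_query`, the dictionaries `utility` and `answer_prob`, and the movie identifiers are all hypothetical. In the framework itself these values would come from the online Probabilistic Matrix Factorization and online Poisson Factorization models; here they are fixed numbers.

```python
from typing import Dict, Optional, Set


def select_query(
    utility: Dict[str, float],
    answer_prob: Dict[str, float],
    already_asked: Set[str],
) -> Optional[str]:
    """Return the unasked item with the highest utility weighted by answer probability.

    utility:      hypothetical acquisition-heuristic value per item
    answer_prob:  hypothetical probability that the user can answer the query
    """
    best_item: Optional[str] = None
    best_score = float("-inf")
    for item, value in utility.items():
        if item in already_asked:
            continue
        # Assumption: the two model outputs are combined by multiplication,
        # so a useful query the user is unlikely to answer is penalized.
        score = value * answer_prob.get(item, 0.0)
        if score > best_score:
            best_item, best_score = item, score
    return best_item


# Illustrative values only; not taken from the paper.
utility = {"movie_a": 0.9, "movie_b": 0.6, "movie_c": 0.4}
answer_prob = {"movie_a": 0.1, "movie_b": 0.8, "movie_c": 0.9}

first = select_query(utility, answer_prob, set())
second = select_query(utility, answer_prob, {first})
```

With these numbers, `movie_a` has the highest raw heuristic value but is unlikely to be answerable, so it is not selected first. In an online setting, both score tables would be re-estimated after each answered or unanswered query before the next selection.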
