On improving open dataset categorization

Under the influence of open government and data transparency initiatives, a wide variety of institutions have published a significant number of datasets. In most cases, open data portals serve as the publishing tool. These portals can also make data more accessible by categorizing datasets according to criteria such as publisher, institution, format, and description. To do so, portals rely on the metadata that accompanies each dataset. However, part of that metadata is often missing, which makes it harder for users to find appropriate datasets and obtain the information they need. As the number of available datasets grows, this problem becomes increasingly apparent. This paper takes a first step towards mitigating it by implementing components that suggest the best-matching category for an uncategorized dataset. Our approach relies on the dataset descriptions that users provide in dataset tags and uses formal concept analysis to reveal the shared conceptualization that emerges from tag usage. Since tags are free-text metadata entered by users, we present a method for optimizing their usage by means of semantic similarity measures based on natural language processing. Finally, we demonstrate the advantage of our proposal by comparing the concept lattices generated by formal concept analysis before and after the optimization process, and we use the resulting structure as a knowledge base for categorizing uncategorized open datasets.
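To make the idea of a tag-based concept lattice more concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes a toy formal context in which datasets are objects and tags are attributes, enumerates all formal concepts by brute force, and then suggests a category for an uncategorized dataset by a simple majority vote inside the most specific matching concept. The dataset names, tags, categories, and the `suggest_category` heuristic are all hypothetical, and the semantic-similarity step for tag optimization is not covered here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical formal context: dataset (object) -> set of tags (attributes).
CONTEXT = {
    "air_quality_2019": {"environment", "air", "pollution"},
    "river_levels": {"environment", "water"},
    "bus_timetable": {"transport", "bus"},
    "tram_stops": {"transport", "tram"},
}

# Hypothetical categories of the already categorized datasets.
CATEGORY = {
    "air_quality_2019": "Environment",
    "river_levels": "Environment",
    "bus_timetable": "Transport",
    "tram_stops": "Transport",
}


def common_tags(datasets, context):
    """Derivation operator: tags shared by every dataset in `datasets`."""
    tag_sets = [context[d] for d in datasets]
    if not tag_sets:
        # By convention, the empty object set has every attribute.
        return set().union(*context.values())
    return set.intersection(*tag_sets)


def datasets_with(tags, context):
    """Derivation operator: datasets that carry every tag in `tags`."""
    return {d for d, t in context.items() if tags <= t}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of datasets, so suitable for toy contexts only.
    """
    concepts = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_tags(subset, context)
            extent = datasets_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def suggest_category(tags, context, category):
    """Assumed heuristic: take the concept with the largest intent that is
    contained in the new dataset's tags, then vote among its extent."""
    candidates = [
        (extent, intent)
        for extent, intent in formal_concepts(context)
        if extent and intent and intent <= tags
    ]
    if not candidates:
        return None  # no tag in common with any known concept
    extent, _ = max(candidates, key=lambda c: len(c[1]))
    votes = Counter(category[d] for d in extent)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(suggest_category({"environment", "noise"}, CONTEXT, CATEGORY))
```

In this sketch an uncategorized dataset tagged `environment` and `noise` falls under the concept whose intent is `{environment}`, so it inherits the category shared by that concept's datasets. Merging near-synonymous tags before building the context would reduce the number of concepts, which is the kind of effect the before-and-after lattice comparison is meant to show.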
