A Multi-topic Meta-classification Scheme for Analyzing Lobbying Disclosure Data

The Lobbying Disclosure Act (LDA) provides, for the first time, data for empirically studying the behavior of interest groups and their influence on congressional policymaking, a question central to the functioning of American democracy. A main research challenge is to automatically identify the topic(s) of each filing, and thereby reveal the underlying purpose(s) of the lobbying activity, by classifying short, sparse text in a large corpus of unorganized, semi-structured, and poorly connected lobbying filings. A common technique for alleviating data sparseness is to enrich the context of the data with external information. This paper instead proposes an interdisciplinary yet practical solution: a Multi-Topic Meta-Classification (MTMC) scheme built upon a set of semantic attributes (i.e., General Issue, Specific Issue, and Bill Info.) and integrated with a domain-specific Policy Agenda (PA) coding/labeling procedure. First, multi-label base classifiers, with the multi-label problem transformed into a multi-class one, are learned from each of the three semantic attributes. Second, to assess the reliability of each base classification, one meta-classifier per attribute is trained on a dataset of meta-instances labeled in a cross-validation fashion. Third, the final prediction is made by fusing the reliable outputs of these ensembles of classifiers. Experiments demonstrate satisfactory classification performance under various evaluation measures on a real-world textual dataset that poses many challenges, including noisy data and semantic ambiguity.
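The three steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the token-voting base classifier, the label-powerset encoding, the fold assignment, and the majority-vote fusion rule are all assumptions chosen to keep the example self-contained, and every name in it (TokenVoteClassifier, to_powerset_label, meta_labels, fuse) is hypothetical.

```python
from collections import Counter, defaultdict


def to_powerset_label(topics):
    """Assumed transformation: encode a set of topics as one multi-class label."""
    return "|".join(sorted(topics))


class TokenVoteClassifier:
    """Toy stand-in for a base classifier: tokens vote for classes seen in training."""

    def fit(self, texts, labels):
        self.votes = defaultdict(Counter)
        self.default = Counter(labels).most_common(1)[0][0]
        for text, label in zip(texts, labels):
            for tok in text.lower().split():
                self.votes[tok][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for tok in text.lower().split():
            tally.update(self.votes.get(tok, {}))
        # Fall back to the most frequent training label for unseen vocabulary.
        return tally.most_common(1)[0][0] if tally else self.default


def meta_labels(texts, labels, k=2):
    """Label meta-instances by cross-validation.

    Each instance is predicted by a base classifier trained on the other
    folds; the meta-label records whether that held-out prediction was correct.
    """
    n = len(texts)
    correct = [False] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        clf = TokenVoteClassifier().fit(
            [texts[i] for i in train], [labels[i] for i in train]
        )
        for i in held_out:
            correct[i] = clf.predict(texts[i]) == labels[i]
    return correct


def fuse(predictions, reliable):
    """Assumed fusion rule: majority vote over attributes judged reliable.

    predictions: attribute name -> predicted label
    reliable:    attribute name -> bool (the meta-classifier's verdict)
    Falls back to all attributes when none is judged reliable.
    """
    trusted = [lab for attr, lab in predictions.items() if reliable[attr]]
    pool = trusted or list(predictions.values())
    return Counter(pool).most_common(1)[0][0]


if __name__ == "__main__":
    texts = ["medicare drug", "medicare hospital", "solar credit", "solar grid"]
    labels = ["Health", "Health", "Energy", "Energy"]
    print(meta_labels(texts, labels))
    print(fuse(
        {"general": "Health", "specific": "Energy", "bill": "Energy"},
        {"general": True, "specific": False, "bill": False},
    ))
```

In a fuller version, the boolean meta-labels would serve as training targets for one meta-classifier per attribute, and the verdicts of those meta-classifiers would supply the `reliable` argument to `fuse`.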
