Classification of Short Legal Lithuanian Texts

Statistical analysis of parliamentary roll call votes is an important topic in political science because it reveals the ideological positions of members of parliament (MPs) and factions. However, the positions it reveals depend on the issues debated and voted upon, so analysis of carefully selected sets of roll call votes provides deeper knowledge about MPs. Because roll call votes number in the thousands, classifying them by topic requires automatic text classifiers. The task can be formulated as classification of short legal texts in Lithuanian, since classification is performed using only the headings of roll call votes. We present results of ongoing research on thematic classification of roll call votes of the Lithuanian Parliament. The problem differs significantly from the classification of long texts: because the texts are short and formulaic, feature spaces are small and sparse. In this paper we investigate the performance of three feature representation techniques (bag-of-words, n-gram and tf-idf) in combination with Support Vector Machines (with different kernels) and Multinomial Logistic Regression. The best results were achieved using tf-idf with SVMs with linear and polynomial kernels.
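To make the three feature representations concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the headings are invented English stand-ins for Lithuanian roll call vote headings, the helper names (`word_ngrams`, `bag_of_words`, `tf_idf`) are hypothetical, and the tf-idf weighting shown (relative term frequency times the logarithm of inverse document frequency) is one common variant assumed here, since the abstract does not state which weighting was used. The classifier step (SVM or Multinomial Logistic Regression) is omitted.

```python
import math
from collections import Counter

# Hypothetical English stand-ins for roll call vote headings (assumption:
# the real data are Lithuanian headings, which are not reproduced here).
headings = [
    "law on state budget amendment",
    "law on health insurance amendment",
    "resolution on state budget approval",
]

def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(texts, n=1):
    """Count n-gram occurrences per text (n=1 gives plain bag-of-words)."""
    return [Counter(word_ngrams(text, n)) for text in texts]

def tf_idf(texts):
    """Weight each term by relative frequency times log(N / document frequency).

    This is one common tf-idf variant, chosen for illustration only.
    """
    counts = bag_of_words(texts)
    n_docs = len(texts)
    # Document frequency: number of texts in which each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append({
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return weighted

unigrams = bag_of_words(headings)
bigrams = bag_of_words(headings, n=2)
weights = tf_idf(headings)

# Sparsity of the unigram document-term matrix: share of non-zero cells.
vocabulary = sorted({term for c in unigrams for term in c})
density = sum(len(c) for c in unigrams) / (len(headings) * len(vocabulary))
```

In this toy example a term that occurs in every heading (here "on") receives a tf-idf weight of zero, while terms confined to a single heading receive the largest weights. Formulaic wording shared by most headings is therefore down-weighted under this variant.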
