Paper information - Experimental study of time series-based dataset selection for effective text classification

Conventional automatic document classification methods currently face challenges in learning time and computing power, owing to the ever-increasing amount of data on the web. In this paper, we propose an efficient classification method that uses time series-based dataset selection. In the proposed method, the dataset is split based on time series data and the best set of testing documents is selected. The results of classification performance tests conducted using a Naïve Bayes classifier indicate that learning from a small amount of data divided in terms of time series is more efficient than learning from the entire dataset.
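
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one plausible reading: documents are grouped by time period, a Naïve Bayes classifier is trained on each period's slice, and the slice that scores best on a held-out set is kept. The function names (`split_by_period`, `select_best_slice`), the use of a year as the period key, the accuracy criterion, and the toy documents are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (n_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (tokens, label) pairs the model classifies correctly."""
    return sum(predict_nb(model, t) == y for t, y in docs) / len(docs)


def split_by_period(docs):
    """Group (period, tokens, label) triples into per-period training sets."""
    slices = defaultdict(list)
    for period, tokens, label in docs:
        slices[period].append((tokens, label))
    return dict(slices)


def select_best_slice(train_docs, held_out):
    """Hypothetical selection rule: keep the period with the best accuracy."""
    scores = {
        period: accuracy(train_nb(subset), held_out)
        for period, subset in split_by_period(train_docs).items()
    }
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Toy data (invented): vocabulary drifts between the two periods.
train = [
    (2010, ["floppy", "disk", "drive"], "tech"),
    (2010, ["match", "goal", "team"], "sport"),
    (2014, ["cloud", "mobile", "app"], "tech"),
    (2014, ["team", "league", "goal"], "sport"),
]
held_out = [(["mobile", "cloud"], "tech"), (["league", "team"], "sport")]

best, scores = select_best_slice(train, held_out)
whole = accuracy(train_nb([(t, y) for _, t, y in train]), held_out)
print(best, scores, whole)
```

On real data the comparison of interest would be training time and accuracy of the selected slice against the full dataset; the toy example is too small to say anything about either.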

Do-Heon Jeong | Yeonghun Chae | Taehong Kim
