Marine literature categorization based on minimizing the labelled data

In marine literature categorization, supervised machine learning requires a great deal of time for labelling training samples by hand, so we use the Co-training method to reduce the number of labelled samples needed to train the classifier. In this paper we select features only from the text details and add attribute labels to them, which greatly improves the efficiency of text processing. To build two views, we split the features into two parts, each of which forms an independent view: one view consists of the feature set of the abstract, and the other consists of the feature sets of the title, keywords, creator and department. In experiments, the F1 value and error rate of the categorization system reach about 0.863 and 14.26%. These results are close to the performance of a supervised classifier (0.902 and 9.13%) trained on more than 1500 labelled samples, whereas the Co-training method trains its initial classifier from only one positive and one negative labelled sample. In addition, we consider incorporating the idea of active learning into the Co-training method.
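The two-view Co-training loop described above can be sketched as follows. This is a minimal illustration on hypothetical toy vectors, not the paper's marine corpus or its actual feature extraction: the base learner here is a simple nearest-centroid classifier, the view data and round counts are assumptions, and the loop starts from one labelled positive and one labelled negative sample as in the paper.

```python
# Minimal two-view Co-training sketch on hypothetical toy data.
# view A stands in for abstract features; view B for title/keyword/creator/
# department features. The base learner is a nearest-centroid classifier.
import math

def centroid_classifier(train):
    """Train a nearest-centroid classifier on (vector, label) pairs."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x):
        # Rank classes by distance; confidence is the margin between the
        # nearest and second-nearest centroid.
        scored = sorted((math.dist(x, c), y) for y, c in centroids.items())
        label = scored[0][1]
        margin = scored[1][0] - scored[0][0] if len(scored) > 1 else 0.0
        return label, margin

    return predict

def co_train(view_a, view_b, labelled_idx, init_labels, rounds=5, per_round=2):
    """Each round, each view's classifier labels its most confident
    unlabelled examples and adds them to the shared labelled pool."""
    labelled = dict(zip(labelled_idx, init_labels))
    unlabelled = set(range(len(view_a))) - set(labelled_idx)
    for _ in range(rounds):
        if not unlabelled:
            break
        for view in (view_a, view_b):
            clf = centroid_classifier([(view[i], y) for i, y in labelled.items()])
            preds = [(clf(view[i]), i) for i in unlabelled]
            preds.sort(key=lambda p: -p[0][1])  # most confident first
            for (label, _), i in preds[:per_round]:
                labelled[i] = label
                unlabelled.discard(i)
    return labelled

# Toy example: positives cluster near (1, 1), negatives near (0, 0) in both
# views. We label only index 0 (positive) and index 1 (negative) by hand.
view_a = [[1.0, 0.9], [0.1, 0.0], [0.9, 1.1], [0.0, 0.2], [1.1, 0.8], [0.2, 0.1]]
view_b = [[0.9, 1.0], [0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.8, 1.2], [0.1, 0.2]]
inferred = co_train(view_a, view_b, labelled_idx=[0, 1], init_labels=[1, 0])
```

The key design point is the same one the paper relies on: the two views must each be strong enough to classify on their own, so that confident predictions from one view act as new training labels for the other.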
