A subject identification method based on term frequency technique

Analyzing and extracting important information from a text document is crucial and has generated interest in the areas of text mining and information retrieval. This process is used to draw attention to particular content in the text. Readers tend to read almost everything in a text document in order to find some specific piece of information. However, reading a text document takes time, and extracting information from it takes additional time. Thus, classifying text by subject can guide a person to relevant information. This paper proposes a subject identification method based on term frequency that categorizes groups of text into a particular subject. Since term frequency tends to ignore the semantics of a document, a term extraction algorithm is introduced to improve the relevance of the terms extracted from the text. The evaluation of the extracted terms shows that the proposed method outperforms other extraction techniques.
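To make the idea of term-frequency-based subject identification concrete, the following is a minimal sketch and not the method proposed in the paper. It counts how often each non-stop-word term occurs in a text and assigns the subject whose vocabulary covers the most occurrences. The stop-word list, the subject vocabularies, and the function names `term_frequencies` and `identify_subject` are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list and subject vocabularies (illustrative only,
# not taken from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on", "for", "with"}
SUBJECT_TERMS = {
    "sports": {"match", "team", "goal", "player", "league"},
    "finance": {"bank", "loan", "interest", "market", "stock"},
}


def term_frequencies(text):
    """Count lowercase alphabetic tokens, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def identify_subject(text):
    """Return the subject whose vocabulary has the highest summed term
    frequency in the text, or None if no subject term occurs at all."""
    tf = term_frequencies(text)
    scores = {
        subject: sum(tf[term] for term in terms)
        for subject, terms in SUBJECT_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = "The team scored a late goal and the player won the match for the team."
print(term_frequencies(sample).most_common(3))
print(identify_subject(sample))
```

A raw count like this treats every occurrence equally and knows nothing about meaning, which is the limitation the abstract attributes to term frequency and the reason it introduces a separate term extraction step.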
