News Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites

The amount of information that is created, used, or stored is growing exponentially, and the types of data sources are diverse. Most of it is available as unstructured text, and a considerable part is available online, usually accessible as Internet resources. It is too expensive, or even impossible, for humans to analyze all these resources for the required information. Classical information technology techniques are not sufficient to process such amounts of information and render it in a form convenient for further analysis. Information Retrieval (IR) and Natural Language Processing (NLP) provide a number of instruments for information analysis and retrieval. In this paper we present a combined application of NLP and IR for Lithuanian media analysis. We demonstrate that a combination of IR and NLP tools, with appropriate changes, can be successfully applied to Lithuanian media texts.
