Extracting Crime Information from Online Newspaper Articles

Information extraction is the task of extracting relevant information from unstructured data. This paper aims to 'mine' (or extract) crime information from online newspaper articles and make it available to the public. Barring a few exceptions, countries that possess this information do not make it available to their citizens. This paper therefore focuses on automatically extracting public yet 'hidden' information from newspaper articles and making it available to the general public. To demonstrate the feasibility of such an approach, the paper focuses on one type of crime, theft. It demonstrates how theft-related information can be extracted from newspaper articles from three different countries. The system employs Named Entity Recognition (NER) algorithms to identify locations in sentences. However, not all locations reported in an article are crime locations, so the system employs a Conditional Random Field (CRF), a machine learning approach, to classify whether or not a sentence in an article is a crime location sentence. This work compares the performance of four different NER systems in identifying locations and their subsequent impact on classifying a sentence as a 'crime location' sentence. It investigates whether a CRF-based classifier model trained to identify crime locations in one set of articles can be used to identify crime locations in articles from another newspaper in the same country (New Zealand). It also compares the accuracy of identifying crime location sentences using the developed model in newspapers from two other countries (Australia and India).
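The two-stage idea described above (find locations, then decide per sentence whether the location is a crime location) can be illustrated with a minimal sketch. It is not the paper's implementation: the gazetteer `TOY_LOCATIONS`, the keyword list `THEFT_WORDS`, the feature names, and the sample article are all hypothetical stand-ins. A real system would replace the gazetteer lookup with an NER tagger and feed the per-sentence feature sequence to a trained CRF, which this sketch does not include.

```python
import re

# Hypothetical stand-ins (assumptions, not from the paper): a real system
# would use an NER tagger for locations and a trained CRF for the labels.
TOY_LOCATIONS = {"Dunedin", "Auckland", "George Street"}
THEFT_WORDS = {"stolen", "stole", "theft", "burglary", "robbed", "robbery"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_locations(sentence):
    """Gazetteer lookup standing in for the NER step."""
    return sorted(loc for loc in TOY_LOCATIONS if loc in sentence)


def sentence_features(sentence, position):
    """Build one feature dict per sentence, as a CRF sequence labeller expects."""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", sentence)}
    locations = find_locations(sentence)
    return {
        "position": position,
        "has_location": bool(locations),
        "has_theft_word": bool(tokens & THEFT_WORDS),
        "locations": locations,  # kept for inspection, not a CRF feature
    }


def article_to_sequence(text):
    """Turn an article into the ordered feature sequence for its sentences."""
    return [sentence_features(s, i) for i, s in enumerate(split_sentences(text))]


# Invented sample article, for illustration only.
article = (
    "A laptop was stolen from a car on George Street on Monday. "
    "Police said the owner lives in Auckland. "
    "Anyone with information should contact police."
)
features = article_to_sequence(article)
for f in features:
    print(f)
```

In this toy article both of the first two sentences mention a location, but only the first also describes the theft. That is the distinction the sentence classifier has to learn: a location mention alone does not make a sentence a crime location sentence.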
