Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII

In today’s world we are confronted every day with increasing amounts of information from a wide variety of sources. People and corporations produce data on a large scale, and since the rise of the internet, e-mail, and social media the amount of data produced has grown exponentially. From a law enforcement perspective, these huge amounts of data must be dealt with whenever a criminal investigation is launched against an individual or company. Relevant questions need to be answered: who committed the crime, who was involved, what happened and at what time, and who was communicating, and about what? Not only has the amount of data available for investigation increased enormously, but so has its complexity. When communication patterns need to be combined with, for instance, a seized financial administration or corporate document shares, a complex investigation problem arises. Criminal investigators now face a huge challenge when evidence of a crime must be found in a Big Data environment, where they have to deal with large and complex datasets, especially in financial and fraud investigations. To tackle this problem, a financial and fraud investigation unit of a European country has developed a new tool named LES that uses Natural Language Processing (NLP) techniques to help criminal investigators handle large amounts of textual information more efficiently and quickly. In this paper, we present this tool and focus on evaluating its performance against the requirements of forensic investigation: making the work faster, smarter, and easier for investigators. We evaluate LES using several performance metrics and report experimental results on large and complex datasets from real-world applications.
