Improving the quality of semantic relationships extracted from massive user behavioral data

As the capacity to store and process massive amounts of user behavioral data grows, new approaches continue to emerge for leveraging the wisdom of the crowd to gain insights that were previously very difficult to discover through text mining alone. For example, collaborative filtering can reveal previously hidden relationships between items based on users' interactions with them, and ontology mining can learn which keywords are semantically related to other keywords based on how similar users use them together, as recorded in search engine query logs. The biggest challenge for these collaborative filtering approaches is the variety of noise and outliers present in the underlying user behavioral data. In this paper, we propose a novel approach to improving the quality of semantic relationships extracted from user behavioral data. Our approach uses millions of documents indexed in an inverted index to detect and remove noise and outliers.
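
The abstract does not spell out how the inverted index is used, so the following is only a minimal sketch of one plausible reading: keyword pairs mined from query logs are kept only if both keywords co-occur in at least some indexed documents. The function names (`build_inverted_index`, `filter_pairs`), the `min_shared_docs` threshold, and the toy data are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def filter_pairs(pairs, index, min_shared_docs=1):
    """Keep keyword pairs whose terms co-occur in enough documents.

    Assumption: a pair with no supporting documents is treated as noise.
    """
    kept = []
    for a, b in pairs:
        # .get avoids creating empty entries in the defaultdict
        shared = index.get(a, set()) & index.get(b, set())
        if len(shared) >= min_shared_docs:
            kept.append((a, b))
    return kept


# Toy corpus standing in for the millions of indexed documents.
docs = [
    "java developer with hadoop experience",
    "hadoop and spark engineer",
    "registered nurse in pediatric care",
]
# Hypothetical keyword pairs, as might be mined from query logs.
candidate_pairs = [("java", "hadoop"), ("hadoop", "spark"), ("java", "nurse")]

index = build_inverted_index(docs)
clean_pairs = filter_pairs(candidate_pairs, index)
print(clean_pairs)
```

In this toy run the pair ("java", "nurse") is dropped because no document contains both terms. A real system would likely use a relative co-occurrence score rather than a raw document count, but the abstract gives no detail on that choice.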
