A Mining Technique Using N-Grams and Motion Transcripts for Body Sensor Network Data Repository

Recent years have witnessed a large influx of applications in the field of cyber-physical systems. An important class of these systems is body sensor networks (BSNs), in which lightweight embedded processors and communication systems are tightly coupled with the human body. BSNs can give researchers, care providers, and clinicians access to valuable information extracted from data collected in users' natural environments. With this information, one can monitor the progression of a disease, identify its early onset, or simply assess a user's wellness. One major obstacle is managing the repositories that store the large amounts of sensor data. To address this issue, we propose a data mining approach inspired by techniques from text and natural language processing. We represent sensor readings as sequences of characters, called motion transcripts. Transcripts significantly reduce the complexity of the data while preserving the morphological and structural properties of the physiological signals. To take further advantage of the signals' structure, our data mining technique focuses on their characteristic transitions, which are captured efficiently using n-grams. To keep the mining approach lightweight and fast, we reduce the overwhelmingly large number of n-grams via information gain (IG) feature selection. We report the effectiveness of the proposed approach in terms of mining speed, while maintaining acceptable accuracy as measured by the F-score, which combines precision and recall.
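
The pipeline described in the abstract (readings to characters, characters to n-grams, n-grams ranked by information gain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how transcripts are generated, so the uniform binning in `to_transcript` is an assumed stand-in, and all function names, the alphabet, and the toy labels are hypothetical. Each n-gram is treated as a binary feature (present or absent in a transcript), which is also an assumption.

```python
import math
from collections import Counter


def to_transcript(samples, lo, hi, alphabet="abcd"):
    """Map each reading to a character by uniform binning over [lo, hi].

    Assumption: the abstract does not specify the mapping; binning is
    used here only to obtain a character sequence.
    """
    width = (hi - lo) / len(alphabet)
    chars = []
    for x in samples:
        idx = int((x - lo) / width)
        idx = min(max(idx, 0), len(alphabet) - 1)  # clamp out-of-range readings
        chars.append(alphabet[idx])
    return "".join(chars)


def ngrams(transcript, n):
    """Return the set of distinct length-n substrings of a transcript."""
    return {transcript[i:i + n] for i in range(len(transcript) - n + 1)}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(transcripts, labels, gram):
    """IG of the binary feature 'gram occurs in the transcript'."""
    present = [y for t, y in zip(transcripts, labels) if gram in t]
    absent = [y for t, y in zip(transcripts, labels) if gram not in t]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_ngrams(transcripts, labels, n, top_k):
    """Keep the top_k n-grams by IG; ties are broken alphabetically."""
    candidates = set()
    for t in transcripts:
        candidates |= ngrams(t, n)
    ranked = sorted(
        candidates,
        key=lambda g: (-information_gain(transcripts, labels, g), g),
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy transcripts with made-up activity labels.
    transcripts = ["aabb", "abab", "ccdd", "cdcd"]
    labels = ["walk", "walk", "sit", "sit"]
    print(select_ngrams(transcripts, labels, n=2, top_k=2))
```

In this toy data the bigrams that occur in every transcript of one class and in none of the other receive the highest gain, so only those would be kept as features. Discarding low-gain n-grams in this way is what keeps the feature set small.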
