Crowd++: unsupervised speaker count with smartphones

Smartphones are excellent mobile sensing platforms, and the microphone in particular is used in several audio inference applications. We take smartphone audio inference a step further and demonstrate for the first time that it is possible to accurately estimate the number of people talking in a given place, with an average error distance of 1.5 speakers, through unsupervised machine learning analysis of audio segments captured by smartphones. Inference occurs transparently to the user, and no human intervention is needed to derive the classification model. Our results are based on the design, implementation, and evaluation of a system called Crowd++, involving 120 participants in 10 very different environments. We show that neither dedicated external hardware nor cumbersome supervised learning is needed, only off-the-shelf smartphones used in a transparent manner. We believe our findings have profound implications for many research fields, including social sensing and personal wellbeing assessment.
