Region of interest determination using human computation

The ability to identify and track visually interesting regions has many practical applications, for example in image and video compression, visual marketing, and foveal machine vision. Because the peculiarities of human physiological and psychological responses are difficult to model, automatic detection of fixation points remains an open problem. Indeed, no objective methods are currently capable of fully modeling the human perception of regions of interest (ROIs). Research therefore often relies on user studies with eye-tracking systems. In this paper we propose a cost-effective and convenient alternative: having Internet workers annotate videos with ROI coordinates. The workers use an interactive video player with a simulated mouse-driven fovea that models the fall-off in resolution of the human visual system. Since this approach is unsupervised, we implement methods for identifying inaccurate or malicious results. With this approach, ROI data can be collected in an automated fashion and at a much lower cost than in laboratory studies.
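The abstract does not give implementation details, but the two mechanisms it names (a mouse-driven simulated fovea and screening of unreliable workers) can be illustrated with short sketches. The following is a minimal, hypothetical approximation of a fovea with resolution fall-off, assuming OpenCV-style color frames; the function and parameter names (foveate, half_res_eccentricity) are illustrative, not the authors' implementation.

```python
# Sketch: approximate a mouse-driven fovea by blending pre-blurred copies of a
# frame, with blur increasing with eccentricity from the pointer position.
import numpy as np
import cv2

def foveate(frame, fovea_xy, num_levels=4, half_res_eccentricity=0.2):
    """Return a foveated copy of a color frame (H, W, 3), centered at
    fovea_xy = (x, y) in pixels. Pixels far from the fovea are drawn from
    progressively blurrier copies of the frame."""
    h, w = frame.shape[:2]

    # Stack of increasingly blurred frames; level 0 is the sharp original.
    levels = [frame.astype(np.float32)]
    for i in range(1, num_levels):
        sigma = 2.0 ** i
        levels.append(cv2.GaussianBlur(frame, (0, 0), sigma).astype(np.float32))
    stack = np.stack(levels)                       # (num_levels, h, w, 3)

    # Eccentricity of every pixel, normalized by the image diagonal.
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - fovea_xy[0], ys - fovea_xy[1]) / np.hypot(w, h)

    # Map eccentricity to a fractional blur level and blend adjacent levels.
    level = np.clip(ecc / half_res_eccentricity, 0, num_levels - 1)
    lo = np.floor(level).astype(int)
    hi = np.minimum(lo + 1, num_levels - 1)
    frac = (level - lo)[..., None]
    out = (1 - frac) * stack[lo, ys, xs] + frac * stack[hi, ys, xs]
    return out.astype(frame.dtype)
```

In the annotation tool the fovea would simply follow the mouse cursor, so fovea_xy would be the cursor position reported by the video player for the current frame. For the quality-control step, one simple (again hypothetical, not necessarily the paper's) screen is to flag workers whose ROI trajectories disagree strongly with the per-frame median across all workers:

```python
# Sketch: flag workers whose trajectories deviate from the per-frame median.
import numpy as np

def flag_outlier_workers(trajectories, max_mean_dist=0.25):
    """trajectories: (num_workers, num_frames, 2) ROI coordinates normalized
    to [0, 1]. Returns indices of workers whose mean distance to the median
    trajectory exceeds max_mean_dist."""
    median_traj = np.median(trajectories, axis=0)              # (num_frames, 2)
    dists = np.linalg.norm(trajectories - median_traj, axis=2) # (workers, frames)
    return np.where(dists.mean(axis=1) > max_mean_dist)[0]
```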
