Real-time object recognition and tracking using 2D/3D images

Object recognition and tracking are among the core tasks in computer vision applications such as safety, surveillance, human-robot interaction, driver assistance systems, traffic monitoring, remote surgery, medical applications and many more. In all these applications the aim is to bring the visual perception capabilities of the human being to machines and computers. In this context, much significant research has recently been conducted to open new horizons in computer vision by using both the 2D and the 3D visual aspects of a scene. While the 2D aspect provides data about the color or intensity of the objects in the scene, the 3D aspect conveys information about the position of the object surfaces. These aspects are, in fact, two different modalities of vision which must be fused in many computer vision applications in order to comprehend our three-dimensional, colorful world efficiently. Nowadays, 3D vision systems based on Time of Flight (TOF), which fuse range measurement with the imaging aspect at the hardware level, have become very attractive for the aforementioned applications. However, the main limitation of current TOF sensors is their low lateral resolution, which makes these sensors inefficient for accurate image processing tasks in real-world problems. Moreover, they do not provide any color information, which is a significant property of visual data. Therefore, some efforts have recently been made to combine TOF cameras with standard cameras in a binocular setup. Although this solves the problem to some extent, it introduces issues of its own, such as complex camera synchronization and complicated, time-consuming 2D/3D image calibration and registration, which make the final solution practically complex or even infeasible for some applications.
In contrast, the novel 2D/3D vision system, the so-called MultiCam, which has recently been developed at the Center for Sensor Systems (ZESS), combines a TOF-PMD sensor with a CMOS chip in a monocular setup to provide high-resolution intensity or color data together with range information. This dissertation investigates different aspects of employing the MultiCam for real-time object recognition and tracking in order to identify the advantages and limitations of this new camera system. The core contribution of this work is threefold. In the first part, the MultiCam is presented and some important issues such as synchronization, calibration and registration are discussed. Likewise, the TOF range data obtained from the PMD sensor are analyzed to find the main sources of noise, and some techniques are presented to enhance the quality of the range data. In this part, it is shown that, due to the monocular setup of the MultiCam, the calibration and registration of the 2D/3D images obtained from the two sensors is easily attainable [12]. Also, thanks to a common FPGA processing unit used in the MultiCam, sensor synchronization, which is a crucial point in multi-sensor systems, is possible. These are, in fact, the vital points which make the MultiCam suitable for vision-based object recognition and tracking. In the second part, the key point of this work is presented: by having both 2D and 3D image modalities obtained from the MultiCam, one can fuse the information from one modality with the other easily and quickly, and can therefore exploit the advantages of both to build a fast, reliable and robust object classification and tracking system. For example, we observe that in real-world problems, where the lighting conditions might be inadequate or the background cluttered, 3D range data are more reliable than 2D color images.
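Because both sensors in a monocular setup share a single optical path, registering a low-resolution range image onto the high-resolution color grid reduces, to a first approximation, to a fixed per-pixel index mapping rather than a full stereo rectification. The sketch below illustrates this idea with a hypothetical nearest-neighbor scaling; the resolutions and the pure-scaling assumption are illustrative and do not represent the MultiCam's actual calibration:

```python
import numpy as np

def register_range_to_color(range_img: np.ndarray,
                            color_shape: tuple) -> np.ndarray:
    """Upsample a low-resolution range image onto the color sensor grid.

    Assumes the two sensors are aligned on a common optical axis, so the
    mapping is a pure scaling (no parallax). Nearest-neighbor lookup keeps
    depth edges sharp instead of blending across object boundaries.
    """
    h_lo, w_lo = range_img.shape
    h_hi, w_hi = color_shape
    rows = np.arange(h_hi) * h_lo // h_hi  # source row for each target row
    cols = np.arange(w_hi) * w_lo // w_hi  # source column for each target column
    return range_img[np.ix_(rows, cols)]

# Example: a 64x48 range image mapped onto a 640x480 color grid
depth = np.random.rand(48, 64).astype(np.float32)
depth_hi = register_range_to_color(depth, (480, 640))
print(depth_hi.shape)  # (480, 640)
```

With a binocular setup, by contrast, this mapping would depend on depth itself (parallax), which is one reason the monocular arrangement simplifies registration so much.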
On the other hand, in cases where many small color features are required to detect an object, as in gesture recognition, the high-resolution color data can be used to extract good features. Thus, we have found that a fast fusion of the 2D/3D data obtained from the MultiCam, at the pixel, feature and decision levels, provides promising results for real-time object recognition and tracking. This is validated in different parts of this work, ranging from object segmentation to object tracking. In the last part, the results of our work are applied in two practical applications. In the first application, the MultiCam is used to monitor defined zones in order to guarantee the safety of personnel working in close cooperation with a robot. In the second application, an intuitive and natural interaction system between a human and a robot is implemented, using a 2D/3D hand gesture tracker and classifier as an interface for commanding the robot. These results validate the suitability of the MultiCam for real-time object recognition and tracking under indoor conditions.
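The pixel-level fusion described above can be sketched as a simple per-pixel combination of a depth gate and a color mask. Everything here (the thresholds, the crude skin-tone test, the array shapes) is illustrative and hypothetical, not the classifier actually used in this work:

```python
import numpy as np

def fuse_depth_color(depth: np.ndarray, rgb: np.ndarray,
                     near: float = 0.5, far: float = 1.5) -> np.ndarray:
    """Pixel-level 2D/3D fusion: keep pixels that pass BOTH cues.

    depth : (H, W) range image in meters, registered to the color grid
    rgb   : (H, W, 3) color image, values in [0, 255]
    Returns a boolean foreground mask.
    """
    # 3D cue: the object must lie inside the expected working volume
    depth_gate = (depth > near) & (depth < far)
    # 2D cue: a crude skin-tone test in RGB (hypothetical thresholds)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    color_gate = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)
    return depth_gate & color_gate

depth = np.full((4, 4), 1.0)
depth[0, 0] = 3.0                          # one pixel lies outside the range gate
rgb = np.full((4, 4, 3), (180, 120, 90))   # skin-like color everywhere
mask = fuse_depth_color(depth, rgb)
print(int(mask.sum()))  # 15: all pixels except the out-of-range one
```

The same AND/OR logic generalizes to feature-level fusion (concatenating descriptors from both modalities) and decision-level fusion (combining the outputs of separate 2D and 3D classifiers).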

[1] Sara Nasser, et al. A Modified Fuzzy K-means Clustering using Expectation Maximization, 2006, 2006 IEEE International Conference on Fuzzy Systems.

[2] Esther Koller-Meier, et al. Tracking multiple objects using the Condensation algorithm, 2001, Robotics Auton. Syst.

[3] Hanqing Lu, et al. A real-time hand gesture recognition method, 2007, 2011 International Conference on Electronics, Communications and Control (ICECC).

[4] Antonis A. Argyros, et al. Fusion of range and visual data for the extraction of scene structure information, 2002, Object recognition supported by user interaction for service robots.

[5] W. Weihs, et al. Detection and Classification of Moving Objects-Stereo or Time-of-Flight Images, 2006, 2006 International Conference on Computational Intelligence and Security.

[6] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[7] Fabio Remondino, et al. Range imaging technology: new developments and applications for people identification and tracking, 2007, Electronic Imaging.

[8] Xin Li, et al. Contour-based object tracking with occlusion handling in video acquired using mobile cameras, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Raúl Rojas, et al. High resolution segmentation with a time-of-flight 3D-camera using the example of a lecture scene, 2006.

[10] Bernd Radig, et al. Real-time range imaging for dynamic scenes using colour-edge based structured light, 2002, Object recognition supported by user interaction for service robots.

[11] Tom Fawcett, et al. ROC Graphs: Notes and Practical Considerations for Researchers, 2007.

[12] P. KaewTrakulPong, et al. An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection, 2002.

[13] Nello Cristianini, et al. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000.

[14] Danica Kulić, et al. Safety for human-robot interaction, 2006.

[15] Peter S. Maybeck, et al. Stochastic Models, Estimation and Control, 2012.

[16] Jake K. Aggarwal, et al. Temporal spatio-velocity transform and its application to tracking and interaction, 2004, Comput. Vis. Image Underst.

[17] M. Turk, et al. Eigenfaces for Recognition, 1991, Journal of Cognitive Neuroscience.

[18] Oliver Wirjadi, et al. Survey of 3D image segmentation methods, 2007.

[19] Martin Buss, et al. Fusing laser and vision data with a genetic ICP algorithm, 2008, 2008 10th International Conference on Control, Automation, Robotics and Vision.

[20] W. Eric L. Grimson, et al. Adaptive background mixture models for real-time tracking, 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[21] W. Eric L. Grimson, et al. Background Subtraction for Temporally Irregular Dynamic Textures, 2008, 2008 IEEE Workshop on Applications of Computer Vision.

[22] Ruigang Yang, et al. Fusion of time-of-flight depth and stereo for high accuracy depth maps, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23] S. E. Ghobadi, et al. Hand Segmentation Using 2D/3D Images, 2007.

[24] Michael Isard, et al. Visual Motion Analysis by Probabilistic Propagation of Conditional Density, 1998.

[25] Rasmus Larsen, et al. Fusion of stereo vision and Time-Of-Flight imaging for improved 3D estimation, 2008, Int. J. Intell. Syst. Technol. Appl.

[26] Helmut Cantzler, et al. Improving architectural 3D reconstruction by constrained modelling, 2003.

[27] Jun Ohta, et al. Smart CMOS Image Sensors and Applications, 2007.

[28] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[29] Sharath Pankanti, et al. Appearance models for occlusion handling, 2006, Image Vis. Comput.

[30] B. Huhle, et al. Integrating 3D Time-of-Flight Camera Data and High Resolution Images for 3DTV Applications, 2007, 2007 3DTV Conference.

[31] Dao-Qing Dai, et al. Improved discriminate analysis for high-dimensional data and its application to face recognition, 2007, Pattern Recognit.

[32] Andrew W. Fitzgibbon, et al. An Experimental Comparison of Range Image Segmentation Algorithms, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[33] G. McLachlan, et al. The EM algorithm and extensions, 1996.

[34] Timo Kahlmann, et al. Calibration for Increased Accuracy of the Range Imaging Camera SwissRanger, 2006.

[35] Daniel Cremers, et al. Nonlinear Shape Statistics in Mumford-Shah Based Segmentation, 2002, ECCV.

[36] Jeff A. Bilmes, et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1998.

[37] Klaus-Dieter Kuhnert, et al. Fusion of Stereo-Camera and PMD-Camera Data for Real-Time Suited Precise 3D Environment Reconstruction, 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[38] Abbas Rafii, et al. An Occupant Classification System Eigen Shapes or Knowledge-Based Features, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops.

[39] Bernd Radig, et al. The HISCORE camera a real time three dimensional and color camera, 2001, Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205).

[40] David Beymer, et al. Real-Time Tracking of Multiple People Using Continuous Detection, 1999.

[41] F. Estrada. Advances in computational image segmentation and perceptual grouping, 2005.

[42] Dorin Comaniciu, et al. Kernel-Based Object Tracking, 2003, IEEE Trans. Pattern Anal. Mach. Intell.

[43] Gary R. Bradski, et al. Learning OpenCV - computer vision with the OpenCV library: software that sees, 2008.

[44] Detlef Justen. Untersuchung eines neuartigen 2D-gestützten 3D-PMD-Bildverarbeitungssystems, 2006.

[45] Marc Alexa, et al. Combining Time-Of-Flight depth and stereo images without accurate extrinsic calibration, 2008, Int. J. Intell. Syst. Technol. Appl.

[46] R. Dillmann, et al. Using gesture and speech control for commanding a robot assistant, 2002, Proceedings. 11th IEEE International Workshop on Robot and Human Interactive Communication.

[47] Ilya Pollak, et al. Nonlinear scale-space analysis in image processing, 1999.

[48] S. Burak Gokturk, et al. A Time-Of-Flight Depth Sensor - System Description, Issues and Solutions, 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[49] Avinash C. Kak, et al. PCA versus LDA, 2001, IEEE Trans. Pattern Anal. Mach. Intell.

[50] Michal Haindl. Multimodal Range Image Segmentation, 2007.

[51] Joachim Hertzberg, et al. A 3D laser range finder for autonomous mobile robots, 2001.

[52] Junbin Gao, et al. A Discriminant Analysis for Undersampled Data, 2007, AIDM.

[53] Karin Sobottka. Analysis of low-resolution range image sequences, 2000, DISKI.

[54] G. Giralt, et al. Safe and dependable physical human-robot interaction in anthropic domains: State of the art and challenges, 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[55] Otmar Loffeld, et al. Real Time Hand Based Robot Control Using 2D/3D Images, 2008, ISVC.

[56] Jafar Amiri Parian, et al. Integrated Laser Scanner and Intensity Image Calibration and Accuracy Assessment, 2005.

[57] David J. Spiegelhalter, et al. Machine Learning, Neural and Statistical Classification, 2009.

[58] B. Cyganek. An Introduction to 3D Computer Vision Techniques and Algorithms, 2009.

[59] Chieh-Chih Wang, et al. Hand posture recognition using AdaBoost with SIFT for human robot interaction, 2007.

[60] Alireza Bab-Hadiashar, et al. Range image segmentation using surface selection criterion, 2006, IEEE Transactions on Image Processing.

[61] Otmar Loffeld, et al. 2D/3D Image Data Analysis for Object Tracking and Classification, 2010.

[62] Ralf Reulke, et al. Combination of distance data with high resolution images, 2006.

[63] Leonardo Romero, et al. Fusing a Laser Range Finder and a Stereo Vision System to Detect Obstacles in 3D, 2004, IBERAMIA.

[64] Stefan Gheorghe Pentiuc, et al. Hand posture recognition for human-robot interaction, 2007, WMISI '07.

[65] Jieping Ye, et al. A two-stage linear discriminant analysis via QR-decomposition, 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66] Hélène Laurent, et al. Review and evaluation of commonly-implemented background subtraction algorithms, 2008, 2008 19th International Conference on Pattern Recognition.

[67] N. D. Georganas, et al. 3D Hand Tracking and Motion Analysis with a Combination Approach of Statistical and Syntactic Analysis, 2007, 2007 IEEE International Workshop on Haptic, Audio and Visual Environments and Games.

[68] Andreas Koschan, et al. Colour Image Segmentation: A Survey, 1994.

[69] Peter Eisert, et al. Adaptive color classification for structured light systems, 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[70] Michael Isard, et al. CONDENSATION—Conditional Density Propagation for Visual Tracking, 1998, International Journal of Computer Vision.

[71] Otmar Loffeld, et al. Improved object segmentation based on 2D/3D images, 2008.

[72] D. Aubert, et al. Long Range Obstacle Detection Using Laser Scanner and Stereovision, 2006, 2006 IEEE Intelligent Vehicles Symposium.