Two Streams Multiple-Model Object Tracker for Thermal Infrared Video

Thermal infrared (TIR) visual object tracking is used in applications such as pedestrian detection, wildlife observation, and surveillance. The tracker's function is to follow a particular object of interest and generate its trajectory, which is then integrated into a decision-making process. Many trackers based on fully convolutional neural networks (CNNs) perform well on RGB input, but not on TIR input. TIR imagery lacks texture information, and two nearby objects often produce similar heat maps, which makes the tracking task very challenging. A tracker that relies on a fully CNN alone can learn the appearance model, but it does not work well if the object's heat map looks too similar to the background. Hence, a Siamese CNN can be used to complement the fully CNN, as it allows a set of recent object templates to be used for matching. Yet the Siamese network alone is not accurate, especially under occlusion, as the stored templates rarely produce robust matching. Thus, we propose a two-stream CNN tracker that combines the fully CNN and the Siamese CNN such that each network keeps a set of matching models to cater to diverse appearance changes. Furthermore, the CNN layers are shared between both streams to reduce the computational burden. A single dense score map is produced by overlaying the normalized scores of the two streams. Experiments on the VOT-TIR 2016 database show that our tracker works well on sequences with high motion blur, occlusion, and appearance deformation. In addition, the Siamese CNN response map can be used as an indicator to decide the size of the search region.
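The fusion step, in which the normalized scores of the two streams are overlaid into a single dense score map, can be illustrated with a minimal sketch. The abstract does not specify the normalization or the overlay rule, so min-max normalization and an equal-weight sum are assumed here purely for illustration; the function names and the `weight` parameter are hypothetical and not taken from the tracker itself.

```python
def min_max_normalize(score_map):
    """Rescale a 2-D score map (a list of rows) to the range [0, 1].

    Min-max scaling is an assumption; the abstract only says the scores
    are normalized.
    """
    flat = [v for row in score_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no location information; return zeros.
        return [[0.0 for _ in row] for row in score_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in score_map]


def fuse_score_maps(fcn_scores, siamese_scores, weight=0.5):
    """Overlay two normalized maps as a weighted sum.

    `weight` is a hypothetical parameter; 0.5 treats both streams equally.
    Both maps are assumed to have the same spatial size.
    """
    a = min_max_normalize(fcn_scores)
    b = min_max_normalize(siamese_scores)
    return [
        [weight * x + (1.0 - weight) * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def peak_location(score_map):
    """Return the (row, col) index of the highest score in the map."""
    best_value, best_pos = None, None
    for r, row in enumerate(score_map):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos


# Toy 2x2 maps on different scales: the fully CNN stream slightly prefers
# the top-right cell, while the Siamese stream strongly prefers bottom-left.
fcn_scores = [[0.1, 0.9], [0.8, 0.2]]
siamese_scores = [[2.0, 3.0], [9.0, 1.0]]

fused = fuse_score_maps(fcn_scores, siamese_scores)
target = peak_location(fused)
```

In this toy case the fully CNN map alone peaks at (0, 1), whereas the fused map peaks at (1, 0), showing how a confident template match can override an ambiguous appearance score once both maps are on a common scale.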