Video object detection is a fundamental research task for scene understanding. Compared with object detection in images, object detection in videos has been less researched due to a shortage of labelled video datasets. Because frames in a video clip are highly correlated, a larger quantity of video labels is needed to achieve good data variation, and such labels are not always available because they are much more expensive to obtain. As a result, it is easy to train an image object detector, but not always possible to train a video object detector when video labels are insufficient for certain classes. To address this problem and improve the performance of an image object detector on classes without video labels, we propose to augment a well-trained image object detector with an efficient and effective class-agnostic convolutional regression tracker for the video object detection task. The tracker learns to track objects by reusing the features from the image object detector, making it a lightweight addition to the detector that incurs only a slight speed drop for the video object detection task. The performance of our model is evaluated on the large-scale ImageNet VID dataset. Our strategy improves the mean average precision (mAP) score of the image object detector by around 5%, and that of the image object detector plus Seq-NMS post-processing by around 3%.
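The abstract does not state how tracked boxes are combined with per-frame detections, so the following is only a rough sketch of one plausible fusion step, not the method evaluated in the paper. It assumes that a tracker has already propagated the previous frame's boxes into the current frame; the convolutional regression tracker itself is not reproduced here. The names `merge_tracked`, `iou_thresh`, and `decay` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[Box, float, str]        # (box, confidence score, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_tracked(
    detections: List[Detection],
    tracked: List[Detection],
    iou_thresh: float = 0.5,   # assumed matching threshold
    decay: float = 0.9,        # assumed score decay for propagated boxes
) -> List[Detection]:
    """Fuse current-frame detections with boxes propagated from the previous frame.

    A tracked box that overlaps a same-class detection can raise that
    detection's score; an unmatched tracked box is kept as an extra
    detection, which recovers objects the detector missed in this frame.
    """
    merged = list(detections)
    n_det = len(detections)
    for t_box, t_score, t_label in tracked:
        carried = decay * t_score
        for i in range(n_det):  # match only against real detections
            d_box, d_score, d_label = merged[i]
            if d_label == t_label and iou(d_box, t_box) >= iou_thresh:
                merged[i] = (d_box, max(d_score, carried), d_label)
                break
        else:
            merged.append((t_box, carried, t_label))
    return merged
```

In this sketch a weak detection inherits confidence from a strongly scored track, and a track with no matching detection is carried forward with a decayed score. Both behaviours are design assumptions made here for illustration.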
|Number of pages||12|
|Journal||ISPRS Journal of Photogrammetry and Remote Sensing|
|Early online date||30 Apr 2021|
|Publication status||E-pub ahead of print/First online - 30 Apr 2021|