MMD-Net: A Vision-Language Perception Model for Maritime Object Detection and Application
Abstract
Deep neural network (DNN)-based detectors assist humans in object detection, particularly in maritime transportation, where they contribute to the autonomy of marine vehicles and systems. In this paper, we propose a multimodal detector network (MMD-Net) that incorporates natural language as an additional supervisory signal to improve the performance and generalization of maritime detectors. To address the limitations of traditional neural networks in maritime object detection, in particular the significant class imbalance present in maritime datasets, we introduce distribution focal scaling for object classification and propose an algorithm that dynamically adjusts the learning rate to accelerate convergence. A layer-wise training strategy further improves the region proposal network so that it better captures the diverse sizes and appearances of maritime objects. Unlike traditional maritime object detectors, which are confined to predicting a fixed set of predefined object categories, our detector can infer novel categories using the rich prior knowledge gained from region-text pre-training. Validation experiments show that the proposed method achieves a mean average precision (mAP) of 75.8% on the Singapore Maritime Dataset, surpassing other state-of-the-art DNNs. Even without additional training on novel maritime classes and scenes, our model reaches a detection accuracy of 21.1%.