Title | FOF: Fusing Object Features into Deep Learning Model to Generate Image Caption |
---|---|
Volume:Issue | 30:4 |
Authors | Hang Zhou, Xue-Qiang Lv, Xin-Dong You, Zhi-An Dong, Kai Zhang |
Pages | 206-216 |
Keywords | convolutional neural network, image caption, object detection, recurrent neural network, EI, MEDLINE, Scopus |
Publication date | August 2019 |
DOI | 10.3966/199115992019083004020 |
To address object category errors and object count errors in the sentences generated by existing image captioning models, we propose an image captioning model fused with object features. In particular, we integrate an object statistical feature and an object regional feature, both extracted from the image, into the Convolutional Neural Network (CNN) plus Recurrent Neural Network (RNN) image captioning framework. An object detection network extracts the two object features. The object statistical feature and the image convolutional feature serve as the input of a Long Short-Term Memory (LSTM) network, and an Attention Mechanism (AM) is used to concatenate the object regional feature with the output of the LSTM to generate sentences. The model thereby obtains additional information about object categories, object counts, and object regions, which helps improve the quality of the generated descriptions. Experiments are conducted on the MSCOCO dataset. Compared with the Hard-attention model, BLEU3 and BLEU4 increase by 4.5% and 4.9%, respectively; compared with the g-LSTM model, BLEU3 and BLEU4 increase by 4.4% and 3.5%, respectively. The proposed model is of considerable value for reducing object category errors and object count errors in image description.
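
The fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the object statistical feature or the attention scoring function, so the sketch assumes a per-category count vector and dot-product soft attention. The category list, the feature dimensions, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical category vocabulary (the paper uses MSCOCO; this list is illustrative).
CATEGORIES = ["person", "dog", "bicycle"]


def object_statistical_feature(detections, categories=CATEGORIES):
    """Assumed definition: number of detected objects per category."""
    counts = Counter(label for label, _ in detections)
    return [float(counts.get(c, 0)) for c in categories]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, region_features):
    """Dot-product soft attention over region features (assumed scoring)."""
    scores = [sum(h * r for h, r in zip(hidden, region))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[i] for w, region in zip(weights, region_features))
               for i in range(dim)]
    return weights, context


def fuse(hidden, region_features):
    """Concatenate the LSTM output with the attended regional feature."""
    _, context = attend(hidden, region_features)
    return hidden + context  # list concatenation


# Each detection is (category label, regional feature vector); values are made up.
detections = [("dog", [1.0, 0.0]), ("dog", [0.0, 1.0]), ("person", [1.0, 1.0])]
stat_feature = object_statistical_feature(detections)

# Assumption: the LSTM input is the convolutional feature joined with the counts.
conv_feature = [0.2, 0.7]           # placeholder for a CNN image feature
lstm_input = conv_feature + stat_feature

hidden = [0.5, -0.5]                # placeholder for an LSTM output
regions = [feat for _, feat in detections]
fused = fuse(hidden, regions)       # vector from which the next word would be predicted
```

In this sketch the count vector carries the category and number information, while the attended regional feature carries location-specific information, matching the two roles the abstract assigns to the object features.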