Article Details

International Journal of Science and Engineering

Title: 利用多尺度感興趣區域之細微關係提供圖片字幕
Volume/Issue: 13:2
Parallel Title: Image Captioning Based on Fine-grained Relationships with Multiscale Regions of Interest
Authors: 林亮宇, 林朝興
Pages: 019-038
Keywords: Image Captioning; Region Proposal Networks; Multi-scale ROIs; Long Short-term Memory cells
Publication Date: October 2023
DOI: 10.53106/222344892023101302003

Chinese Abstract

With the vigorous development of machine learning, Image Captioning technology has become increasingly advanced. Recent Image Captioning work introduces Region Proposal Networks (RPN) and the Attention Mechanism. By using RPN to extract specific object regions from an image, Image Captioning reduces the probability of noise being treated as visual features, while the attention mechanism lets the model concentrate on the conversion from objects to words. However, existing results still have a shortcoming: both RPN and the attention mechanism focus on single object regions and lack the finer visual features between objects, which leads the caption generator to produce vague relationship descriptions. To improve the fine-grainedness of relationship descriptions in Image Captioning, this study proposes an Image Captioning model based on the relationship features of multi-scale regions of interest (ROIs) between different objects. The architecture consists of an RPN, Fully Convolutional Neural Networks (FCNN), and Long Short-Term Memory (LSTM) cells. Compared with existing work, on the visual-feature side we extract not only object regions but also multi-scale ROIs between different objects. Because some multi-scale ROIs are noise, they are filtered using Intersection-over-Union (IoU). Each ROI first passes through the FCNN to extract visual features, sorted fusion features are then obtained through a fusion mechanism and a sorting network, and finally an LSTM learns the conversion from these features to a complete sentence. During training, auxiliary supervision from hierarchical attributes enables the caption generator to learn how to generate fine-grained attributes. The proposed architecture can describe object actions in dynamic images with more precise verbs, and it obtains higher scores on n-gram-based methods.
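The abstract does not spell out how the multi-scale ROIs between an object pair are generated or exactly how the IoU screening works. As a minimal illustrative sketch, assuming ROIs are union boxes of an object pair enlarged at a few scales, and that boxes with too little overlap with any detected object count as noise (box format, scale factors, and the threshold are all assumptions, not details from the paper):

```python
# Hypothetical sketch of multi-scale ROI generation and IoU screening.
# Boxes are (x1, y1, x2, y2); scales and the threshold are illustrative.

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def multiscale_rois(box_a, box_b, scales=(1.0, 1.2, 1.5)):
    """Enlarge the union box of an object pair at several scales."""
    x1, y1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x2, y2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    return [(cx - s * w / 2, cy - s * h / 2,
             cx + s * w / 2, cy + s * h / 2) for s in scales]

def screen_rois(rois, object_boxes, min_iou=0.1):
    """Discard candidate ROIs that barely overlap any detected object,
    treating them as background noise."""
    return [r for r in rois
            if any(iou(r, obj) >= min_iou for obj in object_boxes)]
```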

English Abstract

With the rapid development of machine learning, the techniques of Image Captioning are becoming more and more advanced. Recent Image Captioning research introduces Region Proposal Networks (RPN) and the Attention Mechanism. Through RPN, features of specific object regions in the image can be extracted, reducing the probability of noise being treated as visual features; the attention mechanism makes the model focus more on the mapping between objects and captions. However, current results still have deficiencies: both RPN and the Attention Mechanism focus only on single object regions rather than on fine-grained visual features, causing the caption generator to produce uncertain relationship descriptions. In this paper, to improve the exquisiteness of relationship descriptions in Image Captioning, we propose an Image Captioning model that generates sentences using multi-scale regions of interest (ROIs) between two different objects. Our proposed architecture includes Region Proposal Networks, Fully Convolutional Neural Networks (FCNN), and Long Short-Term Memory cells. Compared with existing work, we extract not only object regions but also multi-scale ROIs between two different objects as visual features. Some multi-scale ROIs are noise and are screened out using Intersection-over-Union (IoU). Each ROI is passed through the FCNN to extract visual features, sorted fusion features are then obtained through a fusion mechanism and a sorting network, and finally an LSTM learns the transformation from these features to a whole sentence. With hierarchical attribute supervision at the training stage, the caption generator can focus on learning how to generate fine-grained attributes. The architecture proposed in this study can use more precise verbs to describe object actions in dynamic pictures, and it achieves higher scores on n-gram-based metrics.
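The fusion mechanism and sorting network are likewise only named in the abstract, not specified. The following PyTorch sketch illustrates just the overall decoding flow described above, per-ROI features fused into one visual vector that conditions an LSTM word-by-word generator; mean-pooling as a stand-in for the fusion step, all layer dimensions, and the vocabulary size are assumptions:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Illustrative decoder: fuse per-ROI features, then let an LSTM map
    the fused visual feature to a word sequence (teacher forcing)."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512,
                 vocab_size=10000):
        super().__init__()
        self.fuse = nn.Linear(feat_dim, embed_dim)  # stand-in fusion step
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, roi_feats, captions):
        # roi_feats: (batch, num_rois, feat_dim) from the CNN backbone
        # captions:  (batch, seq_len) ground-truth word indices
        fused = self.fuse(roi_feats.mean(dim=1)).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                           # (B, T, E)
        # Feed the fused visual vector as the first input "token".
        inputs = torch.cat([fused, words[:, :-1]], dim=1)      # (B, T, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # (B, T, vocab_size) word logits

# Toy usage with random tensors:
decoder = CaptionDecoder()
feats = torch.randn(2, 8, 2048)           # 8 ROI features per image
caps = torch.randint(0, 10000, (2, 12))   # dummy captions of length 12
logits = decoder(feats, caps)             # shape (2, 12, 10000)
```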
