Article Details

International Journal of Computational Linguistics and Chinese Language Processing (THCI)

Title: 結合鑑別式訓練聲學模型之類神經網路架構及優化方法的改進
Volume/Issue: 23:2
English Title: Leveraging Discriminative Training and Improved Neural Network Architecture and Optimization Method
Authors: 趙偉成, 張修瑞, 羅天宏, 陳柏琳
Pages: 35–46
Keywords: 中文大詞彙連續語音辨識 (Mandarin Large Vocabulary Continuous Speech Recognition), 聲學模型 (Acoustic Model), 鑑別式訓練 (Discriminative Training), 矩陣分解 (Matrix Factorization), 來回針法 (Backstitch)
Publication Date: December 2018

Chinese Abstract

This paper investigates the impact of improved acoustic modeling on Mandarin large vocabulary continuous speech recognition. For training the baseline acoustic model, instead of the cross-entropy objective conventionally used for deep neural networks in speech recognition, we adopt lattice-free maximum mutual information (LF-MMI) as the objective for sequence discriminative training. LF-MMI allows the forward-backward computation to be carried out efficiently on a graphics processing unit (GPU) and obtains the posterior probabilities over all competing paths, eliminating the word-lattice generation step that conventional discriminative training requires. Under this training scheme, a time-delay neural network (TDNN) is commonly used as the acoustic model and achieves good recognition performance. Building on the TDNN, this paper deepens the network with additional layers and stabilizes the training of the deeper network through semi-orthogonal low-rank matrix factorization. In addition, to increase the generalization ability of the model, we employ the backstitch optimization method. Results on a Mandarin broadcast news transcription task show that combining these two improvements yields a significant reduction in character error rate (CER) for the TDNN-LF-MMI model.
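The semi-orthogonal low-rank factorization mentioned above replaces a large TDNN weight matrix with the product of a small semi-orthogonal factor and a bottleneck matrix. A minimal NumPy sketch of the iterative projection that keeps a "wide" factor M semi-orthogonal (M Mᵀ ≈ I) is given below; the function name, step size, and toy dimensions are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def semi_orthogonal_step(M, nu=1.0 / 8.0):
    """One iteration of a gradient step that nudges a wide matrix M
    (rows <= cols) toward semi-orthogonality, i.e. M @ M.T ~= I:
        M <- M - 4 * nu * (M M^T - I) M
    With nu = 1/8 each singular value sigma maps to sigma*(3 - sigma^2)/2,
    which converges quickly to 1 for sigma in (0, sqrt(3))."""
    P = M @ M.T
    I = np.eye(M.shape[0])
    return M - 4.0 * nu * (P - I) @ M

# Toy usage: treat a 2x6 matrix as the bottleneck factor of a factorized
# layer and repeatedly project it toward M M^T = I.
rng = np.random.default_rng(0)
M = rng.normal(size=(2, 6)) * 0.3
for _ in range(50):
    M = semi_orthogonal_step(M)
err = np.linalg.norm(M @ M.T - np.eye(2))  # deviation from semi-orthogonality
```

After the loop, `err` is essentially zero: the constraint lets the factorized layer keep a low-rank bottleneck without the factor collapsing or blowing up during training.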

English Abstract

This paper sets out to investigate the effect of acoustic modeling on Mandarin large vocabulary continuous speech recognition (LVCSR). In order to obtain more discriminative baseline acoustic models, we adopt the recently proposed lattice-free maximum mutual information (LF-MMI) criterion as the objective for sequential training of the component neural networks, in place of the conventional cross-entropy criterion. LF-MMI brings the benefit of efficient forward-backward statistics accumulation on the graphics processing unit (GPU) over all hypothesized word sequences, without the need for an explicit word-lattice generation process. Paired with LF-MMI, acoustic models implemented with the so-called time-delay neural network (TDNN) often deliver impressive performance. In view of the above, we explore an integration of two novel extensions of acoustic modeling. One is to conduct semi-orthogonal low-rank matrix factorization on TDNN-based acoustic models with deeper network layers to increase their robustness. The other is to integrate the backstitch mechanism into the update process of the acoustic models to promote generalization. Extensive experiments carried out on a Mandarin broadcast news transcription task reveal that the integration of these two extensions can yield considerable improvements over the baseline LF-MMI model in terms of character error rate (CER) reduction.
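The backstitch mechanism referred to in both abstracts performs, on every minibatch, a small step in the ascent direction followed by a larger descent step from the displaced point. The NumPy sketch below illustrates the idea on a toy objective; `backstitch_step`, the learning rate, and the backstitch scale `alpha` are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def backstitch_step(theta, grad_fn, lr=0.1, alpha=0.3):
    """One backstitch update on parameters theta.
    Step 1: move *against* the usual descent direction with scale alpha*lr.
    Step 2: from that point, take a larger descent step of (1+alpha)*lr,
    using the gradient recomputed at the displaced parameters."""
    theta1 = theta + alpha * lr * grad_fn(theta)            # small ascent step
    theta2 = theta1 - (1.0 + alpha) * lr * grad_fn(theta1)  # larger descent step
    return theta2

# Toy usage: minimize f(x) = ||x||^2 / 2, whose gradient is simply x.
grad = lambda x: x
x = np.array([4.0, -2.0])
for _ in range(100):
    x = backstitch_step(x, grad)
```

The net displacement per minibatch approximates a regularized gradient step, which is the mechanism's claimed source of better generalization; on this toy quadratic the iterates still contract toward the minimum at the origin.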
