Title | NSYSU-MITLab團隊於福爾摩沙語音辨識競賽2020之語音辨識系統 |
---|---|
Volume/Issue | 26:1 |
Parallel Title | NSYSU-MITLab Speech Recognition System for Formosa Speech Recognition Challenge 2020 |
Authors | 林洪邦、陳嘉平 |
Pages | 017-032 |
Keywords | Automatic Speech Recognition、Transformer、Conformer、Connectionist Temporal Classification、Acoustic Model |
Publication Date | 202106 |
In this paper, we describe the system implemented by team NSYSU-MITLab for the Formosa Speech Recognition Challenge 2020 (FSR-2020). We built an end-to-end speech recognition system with the Transformer architecture, which is composed of Multi-head Attention, and combined it with Connectionist Temporal Classification (CTC) for joint end-to-end training and decoding. We also experimented with replacing the encoder with the Conformer architecture, which combines Convolutional Neural Networks (CNN) with Multi-head Attention. In addition, we built a Deep Neural Network-Hidden Markov Model (DNN-HMM) system, in which the deep neural network is constructed with Time-Restricted Self-Attention (TRSA) and Factorized Time Delay Neural Networks (TDNN-F). Our best results are a Character Error Rate (CER) of 43.4% on the Taiwan Southern Min Recommended Characters (台文漢字) task and a Syllable Error Rate (SER) of 25.4% on the Taiwan Minnanyu Luomazi Pinyin (台羅拼音) task.
In this paper, we describe the system that team NSYSU-MITLab implemented for the Formosa Speech Recognition Challenge 2020. We use the Transformer architecture, composed of Multi-head Attention, to construct an end-to-end speech recognition system, and combine it with Connectionist Temporal Classification (CTC) for end-to-end training and decoding. We have also built a deep neural network combined with a hidden Markov model (DNN-HMM), using Time-Restricted Self-Attention and Factorized Time Delay Neural Networks (TDNN-F) for the deep neural network in the DNN-HMM system. The best performance we achieved with the proposed methods is a character error rate of 45.5% on the Taiwan Southern Min Recommended Characters (台文漢字) task and a syllable error rate of 25.4% on the Taiwan Minnanyu Luomazi Pinyin (台羅拼音) task.
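The abstracts describe joint CTC/attention training, in which an attention decoder's token-level loss is interpolated with a CTC loss over the encoder output. As a rough illustration of that objective (a minimal NumPy sketch, not the authors' implementation; the interpolation weight `ctc_weight=0.3` and the helper names are assumptions for illustration):

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log P(target | per-frame log_probs (T, V))."""
    T, V = log_probs.shape
    ext = [blank]                      # target interleaved with blanks
    for u in target:
        ext += [u, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)   # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            paths = [alpha[t - 1, s]]              # stay on same state
            if s > 0:
                paths.append(alpha[t - 1, s - 1])  # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[t - 1, s - 2])  # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(paths) + log_probs[t, ext[s]]
    # Sum the two valid final states: last label or trailing blank.
    return -np.logaddexp(alpha[-1, -1], alpha[-1, -2])

def attention_cross_entropy(dec_log_probs, target):
    """Token-level NLL from the attention decoder's output distribution."""
    return -np.mean([dec_log_probs[u, y] for u, y in enumerate(target)])

def joint_loss(enc_log_probs, dec_log_probs, target, ctc_weight=0.3):
    """Interpolate CTC and attention losses, as in hybrid end-to-end training."""
    return (ctc_weight * ctc_neg_log_likelihood(enc_log_probs, target)
            + (1 - ctc_weight) * attention_cross_entropy(dec_log_probs, target))
```

At decoding time the same interpolation can be applied to rescore beam-search hypotheses, combining CTC prefix scores with the attention decoder's scores.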