Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

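Title 基於深度學習之中文文字轉台語語音合成系統初步探討
Volume/Issue 25:2
Parallel title A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System
Authors 許文漢, 曾證融, 廖元甫, 王文俊, 潘振銘
Pages 69-84
Keywords 機器翻譯 (machine translation); 臺灣閩南語羅馬字拼音 (Taiwan Minnanyu Luomazi Pinyin); 台語語音合成 (Taiwanese speech synthesis); Machine Translation; Taiwanese Speech Synthesis; Tacotron2; WaveGlow; THCI Core
Publication date December 2020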

Chinese Abstract

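Taiwanese has a long history in Taiwan and a large community of speakers, which gives it considerable value. While speech synthesis pursues voices and intonation comparable to a human's, linguistic diversity is also an area that deserves in-depth study. This paper examines Taiwanese speech synthesis, for which few systems currently exist. A translation model, Chinese to Taiwanese (C2T), converts the input Chinese text into Tâi-lô romanization with numeric tones (TLPA); the romanized text is fed to a Tacotron2 model (text to spectrogram), which outputs a spectrogram; and a WaveGlow model (spectrogram to waveform) then produces the speech. A web page has also been set up so that users can test the system. The C2T machine translation experiments use three modes: (1) the input is Chinese words, which are first word-segmented, and the Tâi-lô romanization of each Chinese word is output; (2) the input is a Chinese character string, and a Tâi-lô romanization string is output directly; (3) the input is a Chinese character string, and the output is the Tâi-lô romanization string together with the segmentation of the Taiwanese words. With tones disregarded, the syllable error rate (SER) of method (1) is 15.66%, while method (2) lowers the SER to 6.53%. This shows that the sequence-to-sequence model used can indeed convert an input Chinese character string directly into a Tâi-lô romanization string. For the synthesis quality experiment, 20 listeners each listened to 15 synthesized utterances with different content and rated them on a mean opinion score (MOS) scale, where 1 means the audio does not sound like a person speaking at all and 5 means it sounds exactly like a real person speaking. In total 300 ratings were collected, giving the system an MOS of 4.30. This shows that the Tacotron2 and WaveGlow models can indeed convert Tâi-lô romanization strings into Taiwanese speech. In addition, the system synthesizes about 3.5 seconds of audio per second of computation, which meets the requirement for real-time speech synthesis.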
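The abstract reports syllable error rates with tones disregarded but does not spell out the scoring procedure. The sketch below shows one common way such a figure could be computed, assuming SER is the syllable-level edit distance divided by the number of reference syllables and that tones appear as trailing digits. The function names and the example syllables are hypothetical illustrations, not the authors' code or data.

```python
import re


def strip_tone(syllable: str) -> str:
    """Drop trailing tone digits, e.g. 'gua2' -> 'gua' (assumes numeric tones)."""
    return re.sub(r"\d+$", "", syllable)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of syllables."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syllable_error_rate(ref, hyp, ignore_tone=True):
    """Assumed definition: edit distance / number of reference syllables."""
    if ignore_tone:
        ref = [strip_tone(s) for s in ref]
        hyp = [strip_tone(s) for s in hyp]
    if not ref:
        raise ValueError("reference must contain at least one syllable")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative syllable sequences only; not taken from the paper's corpus.
reference = ["gua2", "si7", "tai5", "uan5", "lang5"]
hypothesis = ["gua2", "si3", "tai5", "uan5"]

ser_without_tone = syllable_error_rate(reference, hypothesis)
ser_with_tone = syllable_error_rate(reference, hypothesis, ignore_tone=False)
```

In this toy case the tone mismatch on the second syllable counts as an error only when tones are kept, which is why a tone-insensitive SER is never higher than the tone-sensitive one.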

English Abstract

This paper focuses on the development and implementation of a Chinese text-to-Taiwanese speech synthesis system. The proposed system combines three deep neural network-based modules: (1) a sequence-to-sequence machine translation module from Chinese characters to Taiwan Minnanyu Luomazi Pinyin (abbreviated as Tâi-lô), referred to below as C2T; (2) a Tacotron2-based Tâi-lô pinyin-to-spectrogram module; and (3) a WaveGlow-based spectrogram-to-waveform synthesis module. The C2T module was trained on a Chinese-Taiwanese parallel corpus (iCorpus) released by Academia Sinica and on 9 dictionaries collected from the internet. The Tacotron2 and WaveGlow models were tuned on a Taiwanese speech synthesis corpus (one female speaker, about 10 hours of speech) recorded by Chunghwa Telecom Laboratories. A demonstration web page for Chinese text-to-Taiwanese speech synthesis has also been implemented. The experimental results show that (1) the C2T module achieved a best syllable error rate (SER) of 6.53%, and (2) the whole speech synthesis system received an average MOS of 4.30 from 20 listeners. These results confirm the effectiveness of integrating the C2T, Tacotron2, and WaveGlow models. In addition, the whole system achieved a real-time factor of 1/3.5.
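As a quick check on the last figure, the real-time factor is conventionally the synthesis time divided by the duration of the audio produced, so a value below 1 means faster than real time. The short sketch below works through the reported ratio under that conventional definition; the function name is a hypothetical helper, not part of the described system.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by duration of generated audio (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Reported throughput: about 3.5 seconds of audio per second of computation.
rtf = real_time_factor(synthesis_seconds=1.0, audio_seconds=3.5)
print(f"RTF = {rtf:.3f}")  # prints "RTF = 0.286"
```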

Related Literature