Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

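Title: Taiwanese Voice Conversion System Combining Speech Recognition and Synthesis Modules (結合語音辨認及合成模組之台語語音轉換系統)
Volume/Issue: 27:2
Parallel title: Taiwanese Voice Conversion based on Cascade ASR and TTS Framework
Authors: 許文漢, 廖元甫, 王文俊, 潘振銘
Pages: 89-138
Keywords: Taiwanese speech corpus (台文語音語料庫); Taiwanese speech synthesis (台語語音合成); Taiwanese voice conversion (台語語音轉換); Taiwanese Across Taiwan; Taiwanese Speech Synthesis; Taiwanese Voice Conversion; THCI Core
Publication date: 2022-12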

Chinese Abstract

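Taiwanese has been listed by the United Nations as an endangered language and urgently needs to be passed on. This paper therefore studies how to build a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice. To this end, we first (1) built the large-scale Taiwanese Across Taiwan (TAT) speech corpus, which covers 204 speakers and about 140 hours of speech, including two male and two female speakers who each recorded about 10 hours of speech dedicated to speech synthesis. We then (2) built a Chinese-text-to-Taiwanese speech synthesis system on the Tacotron2 architecture, adding a front-end module that converts Chinese characters to Tâi-lô romanization and a back-end WaveGlow real-time speech generator. Finally, (3) we built a Taiwanese voice conversion system by cascading Taiwanese speech recognition and speech synthesis, and implemented two conversion functions: intra-lingual (Taiwanese to Taiwanese) and cross-lingual (Mandarin to Taiwanese). To evaluate the voice conversion system, we openly recruited 29 participants online for two rating tasks, intra-lingual and cross-lingual conversion, and ran subjective MOS evaluations of "naturalness" and "similarity" for each. In the intra-lingual task, with 10 minutes, 3 minutes, and 30 seconds of target-speaker speech, the average naturalness MOS was 3.45, 3.02, and 2.23, and the average similarity MOS was 3.38, 2.99, and 2.10, respectively. In the cross-lingual task, with 6 minutes and 3 minutes of target-speaker speech, the average naturalness MOS was 2.90 and 2.70, and the average similarity MOS was 2.84 and 2.54, respectively. These results show that we have indeed achieved, at a preliminary level, a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice.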
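The cascade design named in the abstract (speech recognition followed by speech synthesis) can be pictured as a simple composition of two functions. The following is only a minimal sketch of that idea, not the authors' implementation: the names `CascadeVoiceConverter`, `toy_asr`, and `toy_tts` are hypothetical, and the toy stand-ins replace what would be trained ASR and Tacotron2/WaveGlow models in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration; none of these names come from the paper.
Waveform = List[float]
Recognizer = Callable[[Waveform], str]        # speech -> text (e.g. romanized Taiwanese)
Synthesizer = Callable[[str, str], Waveform]  # (text, target speaker id) -> speech


@dataclass
class CascadeVoiceConverter:
    """Chains a recognizer and a synthesizer: speech -> text -> speech."""

    recognize: Recognizer
    synthesize: Synthesizer

    def convert(self, source_audio: Waveform, target_speaker: str) -> Waveform:
        # Step 1: keep only the linguistic content of the source utterance.
        text = self.recognize(source_audio)
        # Step 2: re-voice that content as the target speaker.
        return self.synthesize(text, target_speaker)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_asr(audio: Waveform) -> str:
    # Returns a fixed placeholder transcript for any non-empty input.
    return "li2 ho2" if audio else ""


def toy_tts(text: str, speaker: str) -> Waveform:
    # Emits one silent sample per character; ignores the speaker identity.
    return [0.0] * len(text)


converter = CascadeVoiceConverter(recognize=toy_asr, synthesize=toy_tts)
output = converter.convert([0.1, -0.2, 0.3], target_speaker="speaker_A")
```

Because the two stages only communicate through text, either one can be swapped independently, which is the usual appeal of a cascade over an end-to-end converter.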

English Abstract

Taiwanese has been listed as an endangered language by the United Nations and urgently needs to be passed on. This study therefore investigates how to build a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice. To achieve this goal, we first (1) built a large-scale Taiwanese Across Taiwan (TAT) corpus with a total of 204 speakers and about 140 hours of speech, in which two men and two women each recorded about 10 hours of speech specifically for speech synthesis. We then (2) established a Chinese-text-to-Taiwanese speech synthesis system based on the Tacotron2 architecture, with a front-end sequence-to-sequence machine translation module that converts Chinese characters to Taiwan Minnanyu Luomazi Pinyin (abbreviated as Tâi-lô) and a back-end WaveGlow real-time speech generator. Finally, (3) we constructed a Taiwanese voice conversion system on a cascaded speech recognition and speech synthesis framework, implementing two voice conversion functions: (a) intra-lingual, Taiwanese to Taiwanese, and (b) cross-lingual, Mandarin Chinese to Taiwanese. To evaluate the voice conversion system, we publicly recruited 29 subjects over the Internet for two scoring tasks, intra-lingual and cross-lingual voice conversion, and carried out subjective mean opinion score (MOS) evaluations of "naturalness" and "similarity" for each. In the intra-lingual task, the average naturalness MOS was 3.45, 3.02, and 2.23, and the average similarity MOS was 3.38, 2.99, and 2.10, when using 10 minutes, 3 minutes, and 30 seconds of target speech, respectively. In the cross-lingual task, the average naturalness MOS was 2.90 and 2.70, and the average similarity MOS was 2.84 and 2.54, when using 6 minutes and 3 minutes of target speech, respectively. These results show that the proposed system can indeed, at a preliminary level, synthesize any Taiwanese sentence in anyone's voice.
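For readers unfamiliar with MOS, each reported figure is an average of listener ratings for one test condition. The sketch below shows that arithmetic under stated assumptions: a 1 to 5 rating scale (the abstract does not state the scale), a hypothetical helper `mean_opinion_score`, and made-up ratings that are not the study's data.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings, assumed to lie on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return round(mean(ratings), 2)


# Made-up ratings for illustration only; these are NOT the paper's data.
ratings_by_condition = {
    "10 min": [4, 3, 4, 3],
    "30 s": [2, 3, 2, 2],
}

# One MOS per condition, computed separately for each amount of target speech.
mos = {
    condition: mean_opinion_score(ratings)
    for condition, ratings in ratings_by_condition.items()
}
```

In the study, naturalness and similarity were rated separately, so this averaging would be applied once per criterion and per amount of target speech.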

Related Literature