Article Details

International Journal of Computational Linguistics and Chinese Language Processing (THCI)

Title: 即時中文語音合成系統
Volume/Issue: 24:2
Parallel Title: Real-Time Mandarin Speech Synthesis System
Authors: 鄭安傑, 陳嘉平
Pages: 53-62
Keywords: text-to-speech, Tacotron2, WaveGlow, TTS
Indexing: THCI Core
Publication Date: December 2019

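Chinese Abstract (translated)

This paper studies and implements a real-time Mandarin speech synthesis system. The system uses a model that converts a text sequence into a Mel-spectrogram sequence, followed by a vocoder that turns the Mel spectrogram into synthesized speech. We implement the sequence-to-sequence model with Tacotron2 and pair it with several vocoders: Griffin-Lim, World-Vocoder, and WaveGlow. The WaveGlow neural vocoder, which implements invertible encoding and decoding functions, stands out in both synthesis speed and speech quality. We build the system on the 12-hour single-speaker Biaobei corpus. In terms of speech quality, the system with the WaveGlow vocoder achieves a mean opinion score (MOS) of 4.08, slightly below the 4.41 of real speech and far above the other two vocoders (2.93 on average). In terms of processing speed, on a GeForce RTX 2080 Ti GPU, the system with the WaveGlow vocoder generates 10 seconds of 48 kHz speech in only 1.4 seconds, so it runs in real time.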

English Abstract

This paper studies and implements a real-time Mandarin speech synthesis system. The system uses a model that converts a text sequence into a Mel-spectrogram sequence, followed by a vocoder that converts the Mel spectrogram into synthesized speech. We use Tacotron2 to implement the sequence-to-sequence model and pair it with several vocoders: Griffin-Lim, World-Vocoder, and WaveGlow. The WaveGlow neural network vocoder, which implements invertible encoding and decoding functions, is the most prominent and is impressive in both synthesis speed and speech quality. We build the system on a 12-hour single-speaker corpus. In terms of speech quality, the system using the WaveGlow vocoder achieves a mean opinion score (MOS) of 4.08, slightly lower than the 4.41 of real speech and far better than the other two vocoders (2.93 on average). In terms of processing speed, on a GeForce RTX 2080 Ti GPU, the system using the WaveGlow vocoder produces 10 seconds of 48 kHz speech in 1.4 seconds, so it is a real-time system.
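
The real-time claim can be checked with simple arithmetic. The sketch below computes a real-time factor (processing time divided by audio duration), a common way to express synthesis speed; the abstract itself does not name this metric, so the function name `real_time_factor` and the "RTF below 1 means faster than real time" reading are assumptions made for illustration. Only the three figures (10 seconds, 48 kHz, 1.4 seconds) come from the abstract.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (hypothetical helper).

    A value below 1.0 means audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures reported in the abstract for the Tacotron2 + WaveGlow system.
AUDIO_SECONDS = 10.0      # duration of the generated speech
SYNTHESIS_SECONDS = 1.4   # time needed to generate it
SAMPLE_RATE_HZ = 48_000   # output sampling rate

rtf = real_time_factor(SYNTHESIS_SECONDS, AUDIO_SECONDS)

# Equivalent view: how many audio samples are generated per second of compute.
samples_per_second = AUDIO_SECONDS * SAMPLE_RATE_HZ / SYNTHESIS_SECONDS

print(f"RTF = {rtf:.2f}")                                    # RTF = 0.14
print(f"throughput = {samples_per_second:,.0f} samples/s")   # throughput = 342,857 samples/s
```

Under these assumptions the system needs about 0.14 seconds of computation per second of speech, roughly seven times faster than playback.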

Related Literature