HyRead Journal Taiwan Full-Text Database

Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

Natural Sciences / Information / Technology

Title	基於音段式LMR對映之語音轉換方法的改進
Volume:Issue	18:4
Parallel title	Improving of Segmental LMR-Mapping Based Voice Conversion Method
Authors	古鴻炎、張家維
Pages	97-114
Keywords	Voice Conversion, Linear Multivariate Regression, Histogram Equalization, Target Frame Selection, Discrete Cepstral Coefficients, THCI Core
Publication date	2013-12

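Chinese Abstract (English translation)

In voice conversion methods based on linear multivariate regression (LMR) spectrum mapping, the converted spectral envelope still exhibits over-smoothing. This paper therefore studies adding histogram equalization (HEQ) processing before segmental LMR spectrum mapping and adding target frame selection after LMR spectrum mapping, in the hope of improving the quality of the converted speech. Here, HEQ consists of two steps: first, the discrete cepstral coefficients (DCC) are transformed into principal component analysis (PCA) coefficients; then the PCA coefficients are transformed into cumulative density function (CDF) coefficients. Target frame selection takes a frame's segment-class number and the DCC vector mapped by LMR, searches the group of frames collected for the same segment class of the target speaker for a target-speaker DCC vector at a small distance, and replaces the originally mapped DCC vector with it, thereby avoiding over-smoothing of the spectral envelope. For HEQ and target frame selection, we measured the average DCC error of voice conversion on outside parallel data (not used in model-parameter training). Adding HEQ makes the error slightly larger, and adding target frame selection makes it larger still. However, the variance ratio (VR) measurements and the subjective listening tests point in the opposite direction: HEQ improves speech quality slightly, while target frame selection improves it more noticeably. We looked for the cause of this inconsistency between the error distance and the perceived speech quality, and one reason we found is explained in the main text.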
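
As a rough illustration of the second HEQ step, the following Python sketch maps one PCA coefficient to a CDF coefficient using an empirical CDF built from training values of the same dimension. The abstract does not say how the CDF is estimated, so the empirical-CDF approach and the names `build_cdf_table` and `to_cdf_coefficient` are assumptions made for illustration, not the authors' implementation. The PCA step is assumed to have been applied already.

```python
from bisect import bisect_right


def build_cdf_table(training_values):
    """Sort the training values of one PCA dimension (hypothetical helper).

    The sorted list acts as an empirical CDF lookup table.
    """
    if not training_values:
        raise ValueError("at least one training value is required")
    return sorted(training_values)


def to_cdf_coefficient(value, cdf_table):
    """Return the fraction of training values that are <= value.

    The result lies in [0, 1] and plays the role of a CDF coefficient.
    """
    return bisect_right(cdf_table, value) / len(cdf_table)


# Toy data: four training values for a single PCA dimension.
table = build_cdf_table([0.3, -1.2, 0.8, 0.1])
print(to_cdf_coefficient(0.1, table))
```

In this sketch each PCA dimension would get its own table, so a full frame would be converted one dimension at a time.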

English Abstract

Spectral over-smoothing is still observable in the converted spectral envelope when linear multivariate regression (LMR) based spectrum mapping is used for voice conversion. In this paper, we therefore study placing a histogram equalization (HEQ) module immediately before the LMR-based mapping and a target frame selection (TFS) module immediately after it. These two modules are intended to improve the quality of the converted voice. HEQ processing consists of two steps: (a) transform the discrete cepstral coefficients (DCC) into principal component analysis (PCA) coefficients; (b) transform the PCA coefficients into cumulative distribution function (CDF) coefficients. For TFS, an input frame is first processed to obtain its converted DCC and its segment-class number. The group of target-speaker frames with the same segment-class number is then searched for a target frame whose DCC are sufficiently close to the converted DCC, and the converted DCC are replaced by the DCC of the target frame found. In the experimental evaluation, outside parallel sentences (not used in model-parameter training) are used to measure the average cepstral distance (ACD) between the converted DCC and the target DCC. Adding the HEQ module increases the ACD slightly, and adding the TFS module increases it noticeably more. Nevertheless, according to the measured variance ratio (VR) values and the scores of subjective listening tests, the quality of the converted voice improves when HEQ is added and improves considerably more when TFS is added. We looked for the reason why the measured ACD values and the perceived quality of the converted voice are inconsistent, and we found one possible cause that can explain the inconsistency.
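
The TFS step described above can be sketched in Python as a nearest-neighbour search restricted to one segment class. This is a minimal sketch under stated assumptions: the abstract does not give the distance measure or the threshold for "sufficiently close", so the code uses Euclidean distance and simply takes the closest candidate. The function name `select_target_frame` and the dictionary layout are hypothetical.

```python
import math


def select_target_frame(converted_dcc, segment_class, target_frames_by_class):
    """Pick the target-speaker DCC vector closest to the converted DCC.

    Assumptions (not from the paper): Euclidean distance, and the single
    nearest frame is chosen. `target_frames_by_class` maps a segment-class
    number to the list of target-speaker DCC vectors collected for it.
    """
    candidates = target_frames_by_class[segment_class]
    return min(candidates, key=lambda frame: math.dist(frame, converted_dcc))


# Toy data: segment class 3 holds three two-dimensional target frames.
frames = {3: [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]}
replacement = select_target_frame([0.9, 1.2], 3, frames)
print(replacement)
```

The returned vector would replace the LMR-mapped DCC for that frame, so the output is always a vector actually observed from the target speaker.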

