Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

Title 使用低通時序列語音特徵訓練理想比率遮罩法之語音強化
Volume and issue 26:2
Parallel title Employing Low-Pass Filtered Temporal Speech Features for the Training of Ideal Ratio Mask in Speech Enhancement
Authors 陳彥同; 洪志偉
Pages 035-048
Keywords Speech Enhancement; Temporal Feature Sequence; Lowpass Filtering; Ideal Ratio Mask; Wavelet Transform; THCI Core
Publication date December 2021

Chinese Abstract

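Among the many deep-learning-based speech enhancement methods, masking-based methods estimate a mask that is multiplied with the spectrogram of the noisy speech, so that the resulting spectrogram contains less noise and a relatively clean speech signal can be reconstructed from it. As input features for the deep model that learns the mask, many features long used in speech recognition, such as the Mel cepstrum, the amplitude modulation spectrogram, and perceptual linear prediction coefficients, are suitable choices that let the trained mask enhance speech effectively. In addition, lowpass filtering the temporal sequence of speech features has traditionally been used to suppress noise-induced distortion. In this study we therefore lowpass filter various temporal speech feature sequences by means of the discrete wavelet transform and use the filtered sequences to train the deep mask model, examining whether the learned mask then enhances the spectrogram of the original noisy speech more effectively. In our preliminary experiments under babble noise, the deep model trained on the lowpass-filtered feature sequences improves the quality and intelligibility of the test speech more effectively than the model trained on the original feature sequences.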
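The masking step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the record does not give the mask formula, so the code uses a commonly cited ideal ratio mask definition (clean power over clean-plus-noise power, raised to an exponent `beta`), and the function names, `beta`, and the small constant `eps` are hypothetical choices rather than the authors' settings.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """Per-bin ratio mask from clean and noise magnitude spectrograms.

    Assumed definition: (S^2 / (S^2 + N^2)) ** beta. The value of beta
    and the stabilizer eps are illustrative, not taken from the paper.
    """
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return (clean_pow / (clean_pow + noise_pow + eps)) ** beta

def apply_mask(noisy_mag, mask):
    """Multiply the noisy magnitude spectrogram by the mask, bin by bin."""
    return mask * noisy_mag

# Toy magnitude spectrograms with shape (frequency_bins, frames).
rng = np.random.default_rng(0)
clean = rng.random((4, 6))
noise = rng.random((4, 6))
mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(clean + noise, mask)
```

In training, a mask of this kind serves as the target that the deep model learns to predict from the input features; at test time the predicted mask is applied to the noisy spectrogram.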

English Abstract

The masking-based speech enhancement method seeks a multiplicative mask that is applied to the spectrogram of the input noise-corrupted utterance, and a deep neural network (DNN) is often used to learn the mask. In particular, the features commonly used for automatic speech recognition can serve as the input of the DNN to learn a well-behaved mask that significantly reduces the noise distortion of the processed utterances. This study proposes to preprocess the input speech features for the ideal ratio mask (IRM)-based DNN by lowpass filtering in order to attenuate the noise components. Specifically, we employ the discrete wavelet transform (DWT) to decompose the temporal speech feature sequence and scale down the detail coefficients, which correspond to the high-pass portion of the sequence. Preliminary experiments conducted on a subset of the TIMIT corpus reveal that the proposed method makes the resulting IRM achieve higher speech quality and intelligibility for babble noise-corrupted signals than the original IRM, indicating that the lowpass-filtered temporal feature sequence can be used to train a superior IRM network for speech enhancement.
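The DWT-based preprocessing can be illustrated with a minimal sketch. The abstract does not state the wavelet, the number of decomposition levels, or the scaling factor, so the code assumes a single-level Haar transform written directly in NumPy, and the names `haar_dwt`, `haar_idwt`, `lowpass_features`, and `alpha` are hypothetical. It also assumes an even number of frames.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT along the last (time) axis; length must be even."""
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass (approximation) part
    detail = (even - odd) / np.sqrt(2.0)   # high-pass (detail) part
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even/odd samples."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out

def lowpass_features(features, alpha=0.5):
    """Scale the detail coefficients of each feature trajectory by alpha.

    features has shape (num_features, num_frames). alpha = 1 leaves the
    sequence unchanged; alpha = 0 removes the detail part entirely.
    """
    approx, detail = haar_dwt(features)
    return haar_idwt(approx, alpha * detail)

# Toy feature matrix: 2 feature dimensions over 6 frames.
features = np.arange(12, dtype=float).reshape(2, 6)
smoothed = lowpass_features(features, alpha=0.0)
```

With `alpha = 0.0` each pair of adjacent frames is replaced by its average, which is the strongest smoothing this one-level scheme allows; intermediate values attenuate rapid frame-to-frame fluctuation without removing it.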

Related Literature