Article Details

International Journal of Computational Linguistics and Chinese Language Processing (THCI)

Title: The Use of Clustering Techniques for Language Modeling – Application to Asian Language
Volume/Issue: 6:1
Authors: Gao, Jianfeng; Goodman, Joshua T.; Miao, Jiangbo
Pages: 27-60
Keywords: THCI Core
Publication date: 2001-02

English Abstract

Cluster-based n-gram modeling is a variant of standard word-based n-gram modeling that attempts to exploit the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used both to improve the performance of language models (i.e., to reduce perplexity) and to compress them. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a Chinese heterogeneous corpus. While the majority of previous research on word clustering has focused on how to obtain the best clusters, we have concentrated on the best way to use the clusters. Experimental results show that some of the novel techniques we present work much better than previous methods, achieving more than 40% size reduction at the same level of perplexity.
