Article details

International Journal of Computational Linguistics and Chinese Language Processing (THCI)

Title Extension of Zipf's Law to Word and Character N-grams for English and Chinese
Volume/Issue 8:1
Authors Ha, Le-quan; Sicilia-Garcia, E.-I.; Ming, Ji; Smith, F.-J.
Pages 77-101
Keywords Zipf's law; n-grams; Chinese compound word; Chinese character; phrases; THCI Core
Publication date 2003-02

English abstract

It is shown that for a large corpus, Zipf's law does not hold for all ranks, either for words in English or for characters in Chinese. For English words, the frequency falls below that predicted by Zipf's law at ranks greater than about 1,000. However, when single words or characters are combined with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately, with a slope close to -1 on a log-log plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.
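
The combined-list procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`combined_ngram_counts`, `rank_frequency`, `loglog_slope`), the whitespace tokenization, and the toy sentence are all assumptions made for illustration, and a simple least-squares fit stands in for whatever fitting method the paper uses. The sketch counts every n-gram of length 1 to `max_n` in one table, ranks the entries by frequency, and estimates the slope of log frequency against log rank.

```python
import math
from collections import Counter


def combined_ngram_counts(tokens, max_n):
    """Count all n-grams of length 1..max_n in one shared Counter.

    Every contiguous n-gram is kept, including semantically
    incomplete ones, as the abstract requires.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 the most frequent."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes at least two pairs with distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Toy input (assumption); a real test needs a large corpus.
    tokens = "the cat sat on the mat and the cat ate the rat".split()
    pairs = rank_frequency(combined_ngram_counts(tokens, max_n=3))
    print(pairs[:5])
    print(loglog_slope(pairs))
```

For Chinese characters, the same functions could be applied to `list(text)` instead of a whitespace split. A toy input like the one above is far too small to show the reported slope near -1; the sketch only shows the bookkeeping.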

Related literature