Title | The Properties and Further Applications of Chinese Frequent Strings |
---|---|
Volume/Issue | 9:1 |
Authors | Lin, Yih-jeng; Yu, Ming-shing |
Pages | 113-128 |
Keywords | Chinese frequent strings; Chinese spelling error correction; Chinese toneless phoneme-to-character; unknown words; language model; THCI Core |
Publication date | 2004-02 |

This paper presents some important properties of Chinese frequent strings (CFSs) and their applications in Chinese natural language processing (NLP). We previously proposed a method for extracting CFSs, including those that contain unknown words, from a Chinese corpus [Lin and Yu 2001]. We found that CFSs contain many 4-character strings, 3-word strings, and longer n-grams. With a traditional language model (LM), such information can only be derived from an extremely large corpus. In contrast, by using CFSs we can achieve high precision and efficiency in Chinese toneless phoneme-to-character conversion and Chinese spelling error correction with only a small training corpus. An accuracy rate of 92.86% was achieved for Chinese toneless phoneme-to-character conversion, and an accuracy rate of 87.32% was achieved for Chinese spelling error correction. We also attempted to assign syntactic categories to CFSs. When the highest-level syntactic categories were used, the accuracy rate of this assignment was 88.53% in outside testing.