Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

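Title 《現代漢語新詞語資訊電子詞典》的研究與實現 (Research and Implementation of the Modern Chinese New Words Information Electronic Dictionary)
Volume/Issue 7:2
Parallel title Development and Study of the Modern Chinese New Words Information Electronic Dictionary
Author 亢世勇
Pages 089-099
Keywords Chinese information processing; Electronic dictionary; New words; THCI Core
Publication date 2002-08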

Chinese Abstract

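This paper introduces the Modern Chinese New Words Information Electronic Dictionary, which we are currently developing, from four aspects: (1) the definition of new words in modern Chinese, (2) the design philosophy behind the new-word dictionary, (3) the collection of new words and the description of their attribute information, and (4) the practice of classifying nearly 40,000 new words. By new words we mean general-language words that have arisen through various channels since 1978 and that have a new form, a new meaning, or a new usage not found in the basic vocabulary. Besides being "new" in form, meaning, or usage, a word must also be in common and widespread everyday use; personal names, place names, and specialized technical terms are not "new words" in our sense. We adhere to the principle of openness and collect new words as comprehensively as possible. Following a research approach that serves both human and machine use, and taking the Contemporary Chinese Grammatical Knowledge Base (《現代漢語語法資訊詞典》) of the Institute of Computational Linguistics at Peking University as a model, we are building an electronic dictionary of modern Chinese new words with comprehensive coverage, rich information, and highly shareable resources, intended as a valuable resource for research on new words and on Chinese information processing. Nearly 40,000 new words have been collected so far. We first classified these words according to the "dominant grammatical" function of modern Chinese word classes, and then used a mature relational database (implemented in ACCESS) to describe the attribute information of each word in detail. There is one master database and three grammatical information databases (a noun database, a verb database, and an adjective database), along with a word-formation database, an old-word database, a loanword database, and an abbreviation database. The master database and the other databases are linked through three fields, "word, pinyin, sense", forming an organic system with a hierarchical relationship that makes information easy to extract. Together these databases define more than 200 attribute fields, covering each word's phonetic, semantic, source, word-formation, and syntactic information, as well as some pragmatic information. This dictionary is currently the new-word dictionary with the largest number of entries and the most descriptive information in China.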

English Abstract

We introduce the development of the Electronic Lexicon of Contemporary Newborn Chinese Words: (1) the definition of a newborn word, (2) the main principle behind constructing the lexicon, (3) the collection of newborn words and the description of their features, and (4) the classification of 40,000 newborn words. In our opinion, a newborn word is a character string that appeared after 1978 with a new form, a new meaning, or a new usage. In addition, it must be frequently used and widely accepted; personal names, place names, and specialized technical terms are not newborn words according to our definition. Our approach to collecting newborn words is deliberately unrestricted: the more, the better. Based on the Contemporary Chinese Grammatical Knowledge Base of the Institute of Computational Linguistics at Peking University, we have finished compiling a lexicon of almost 40,000 newborn words semi-automatically. The lexicon, we believe, is a worthy resource for research on Chinese word-building rules and Natural Language Processing. First, each word is classified according to its preponderant grammatical characteristics, and then its detailed features are described in an ACCESS database. The lexicon contains a total base and three grammatical bases (i.e., a noun base, a verb base, and an adjective base); in addition, it has a word-building base, an old word base, a loanword base, and an acronym base. The total base is related to the sub-bases through the word, phonetic notation, and sense fields, which form a hypernymy hierarchy that is convenient for searching. In total, there are more than 200 fields in the bases, giving information on phonetic notation, semantics, sources, word building, syntax, and pragmatics. To our knowledge, this lexicon is one of the largest domestic lexicons available, with the most detailed descriptions of newborn Chinese words.
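
The link between the total base and a sub-base through the word, phonetic notation, and sense fields can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's built-in `sqlite3` module in place of ACCESS, and the table names (`master`, `noun`), the `classifier` attribute field, and the sample entry are all hypothetical choices made for illustration.

```python
import sqlite3


def build_lexicon():
    """Create an in-memory database with a master table and one sub-table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE master (
            word   TEXT    NOT NULL,
            pinyin TEXT    NOT NULL,
            sense  INTEGER NOT NULL,
            pos    TEXT    NOT NULL,   -- dominant word class, e.g. 'n'
            PRIMARY KEY (word, pinyin, sense)
        );
        CREATE TABLE noun (
            word       TEXT    NOT NULL,
            pinyin     TEXT    NOT NULL,
            sense      INTEGER NOT NULL,
            classifier TEXT,           -- hypothetical attribute field
            PRIMARY KEY (word, pinyin, sense),
            FOREIGN KEY (word, pinyin, sense)
                REFERENCES master (word, pinyin, sense)
        );
    """)
    return conn


def add_noun(conn, word, pinyin, sense, classifier):
    """Insert one noun sense into the master table and the noun sub-table."""
    conn.execute("INSERT INTO master VALUES (?, ?, ?, 'n')",
                 (word, pinyin, sense))
    conn.execute("INSERT INTO noun VALUES (?, ?, ?, ?)",
                 (word, pinyin, sense, classifier))


def lookup(conn, word):
    """Join master and noun rows on the three shared key fields."""
    cur = conn.execute("""
        SELECT m.word, m.pinyin, m.sense, m.pos, n.classifier
        FROM master AS m
        JOIN noun AS n
          ON n.word = m.word AND n.pinyin = m.pinyin AND n.sense = m.sense
        WHERE m.word = ?
        ORDER BY m.sense
    """, (word,))
    return cur.fetchall()


conn = build_lexicon()
# Invented sample entry, not taken from the dictionary itself.
add_noun(conn, "網民", "wǎngmín", 1, "個")
rows = lookup(conn, "網民")
print(rows)
```

Using the three fields together as a composite key lets homographs with different readings or senses occupy separate rows, which is one plausible reason for linking the bases on all three fields rather than on the word form alone.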
