Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

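Title 基於《知網》的辭彙語義相似度計算 (Word Semantic Similarity Computation Based on How-net)
Volume/Issue 7:2
Parallel title Word Similarity Computing Based on How-net
Authors 劉群 (Qun Liu), 李素建 (Sujian Li)
Pages 059-076
Keywords How-net (《知網》); word semantic similarity computation (辭彙語義相似度計算); natural language processing (自然語言處理); Word Similarity Computing; THCI Core
Publication date 2002-08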

Chinese Abstract

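Word-sense similarity computation is widely used in many fields, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. The two basic approaches are methods based on world knowledge (an ontology) or some classification scheme (a taxonomy), and methods based on statistical context vector space models. Each approach has its own strengths and weaknesses. How-net (《知網》) is a fairly detailed dictionary of semantic knowledge that has attracted widespread attention. However, How-net represents the meaning of a word in a multi-dimensional form of knowledge representation, which complicates the computation of word similarity. In this respect it differs from WordNet and Tongyici Cilin (《同義詞詞林》). In WordNet and Tongyici Cilin, all semantic items of the same class (synsets in WordNet, word groups in Tongyici Cilin) form a tree structure, so computing the distance between two semantic items only requires computing the distance between the corresponding nodes in the tree. In How-net, by contrast, computing word semantic similarity faces the following problems:
1. The semantic description of each word consists of multiple sememes.
2. The sememes in a word's semantic description are not on an equal footing; they stand in complex relations to one another, which are expressed in a dedicated knowledge description language.
Our work mainly includes:
1. Studying the syntax of the knowledge description language in How-net to understand how it uses multiple sememes to describe a word sense, and rewriting the How-net definitions (DEF) of words in a more structured form that uses two abstract data structures, the "set" and the "feature structure".
2. Studying methods for computing the similarity between sememes, between sets, and between feature structures, and on that basis proposing an algorithm for computing word similarity with How-net.
3. Verifying the effectiveness of the algorithm through experiments and comparing it with other algorithms.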
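
The tree-distance idea described above for taxonomies such as WordNet and Tongyici Cilin can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy taxonomy `PARENT`, the function names, and the distance-to-similarity conversion `alpha / (d + alpha)` with `alpha = 1.6` are all assumptions chosen for the example.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); None marks the root.
# These node names are illustrative and are not taken from How-net.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "inanimate": "thing",
    "human": "animate",
    "animal": "animate",
    "tool": "inanimate",
}


def path_to_root(node: str) -> List[str]:
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges on the path between `a` and `b` in the tree."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:  # first shared ancestor is the lowest one
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def distance_to_similarity(d: int, alpha: float = 1.6) -> float:
    """Map a distance to (0, 1]; `alpha` is an assumed tuning parameter."""
    return alpha / (d + alpha)
```

With this toy tree, `tree_distance("human", "animal")` is 2 and `tree_distance("human", "tool")` is 4, so the first pair receives the higher similarity.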

English Abstract

Word similarity is widely used in many applications, such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation. There are two different methods for computing similarity: one is based on an ontology or a semantic taxonomy; the other is based on collocations of words in a corpus. As a lexical knowledge base with rich semantic information, How-net has been employed in a variety of research. In other thesauri, such as WordNet and Tongyici Cilin, word similarity is defined by the distance between words in a semantic taxonomy tree. How-net, by contrast, defines a word in a complicated multi-dimensional knowledge description language. As a result, a series of problems arise when computing word similarity with How-net. The difficulties are outlined below:
1. The description of each word consists of a group of sememes. For example, the Chinese word “暗箱 (camera obscura)” is described as “part|部件, #TakePicture|拍攝, %tool|用具, body|身”, and the Chinese word “寫信 (write a letter)” is described as “write|寫, ContentProduct=letter|信件”.
2. The meaning of a word is not a simple combination of these sememes. Sememes are organized using a specific knowledge description language.
To meet these challenges, our work includes:
1. A study of the How-net knowledge description language. We rewrite the How-net definition of a word in a more structured format, using the abstract data structures of set and feature structure.
2. A study of the algorithm used to compute word similarity based on How-net. We define the similarity between sememes, between sets, and between feature structures. To compute the similarity between two sememes, we use the distance between the sememes in the semantic taxonomy, as is done in WordNet and Tongyici Cilin. To compute the similarity between two sets or two feature structures, we first establish a one-to-one mapping between the elements of the sets or the feature structures. The similarity between the sets or feature structures is then defined as the weighted average of the similarities between their mapped elements. For feature structures, the one-to-one mapping is established according to the attributes. For sets, the one-to-one mapping is established according to the similarity between their elements.
3. Finally, we give experimental results to show the validity of the algorithm and compare them with results obtained using other algorithms. Our results for word similarity agree with people’s intuition to a large extent, and they are better than the results of the two comparison experiments.
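
The one-to-one mapping and averaging steps described in the abstract can be illustrated with a short sketch. It is written under several assumptions that are not stated in the abstract: sets are paired greedily by highest similarity, all elements carry equal weight (the abstract specifies a weighted average but not the weights), elements or attributes left without a partner contribute a fixed `unmatched` score, and `toy_sim` is a hypothetical stand-in for a real sememe similarity function.

```python
from typing import Callable, Dict, List

Sim = Callable[[str, str], float]


def toy_sim(a: str, b: str) -> float:
    """Hypothetical element similarity: identical items score 1.0."""
    return 1.0 if a == b else 0.25


def set_similarity(xs: List[str], ys: List[str], sim: Sim,
                   unmatched: float = 0.0) -> float:
    """Pair elements one-to-one by similarity, then average the scores."""
    if not xs and not ys:
        return 1.0
    xs, ys = list(xs), list(ys)  # work on copies
    scores: List[float] = []
    while xs and ys:
        # Greedy step (an assumption): take the most similar remaining pair.
        s, i, j = max((sim(x, y), i, j)
                      for i, x in enumerate(xs)
                      for j, y in enumerate(ys))
        scores.append(s)
        xs.pop(i)
        ys.pop(j)
    # Leftover elements have no partner and receive the `unmatched` score.
    scores.extend([unmatched] * (len(xs) + len(ys)))
    return sum(scores) / len(scores)


def feature_structure_similarity(f: Dict[str, str], g: Dict[str, str],
                                 sim: Sim, unmatched: float = 0.0) -> float:
    """Pair values that share an attribute name, then average the scores."""
    attrs = set(f) | set(g)
    if not attrs:
        return 1.0
    total = 0.0
    for a in attrs:
        if a in f and a in g:
            total += sim(f[a], g[a])
        else:
            total += unmatched  # attribute present on one side only
    return total / len(attrs)
```

For example, `set_similarity(["part", "tool"], ["tool", "part"], toy_sim)` pairs each element with its identical counterpart regardless of order, while `feature_structure_similarity` compares only values filed under the same attribute, such as `ContentProduct`. A real system would replace the equal weights with weights that reflect how much each part of the definition matters.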

Related Literature