Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

Title: 基於文本概念和kNN的跨語種文本過濾 (Cross-Language Text Filtering Based on Text Concepts and kNN)
Volume/Issue: 7:1
Parallel Title: Cross-Language Text Filtering Based on Text Concepts and kNN
Authors: 蘇偉峰, 李紹滋, 李堂秋, 尤文建
Pages: 079-090
Keywords: 可分義原 (Classifiable Sememe), 知網 (HowNet), 文本表示 (Text Representation), kNN, 向量空間 (Vector Space)
Publication Date: 2002-02

Chinese Abstract

This paper presents a model that can filter, from large volumes of Chinese or English information, the documents a user is interested in. The texts the user is interested in are represented as a cluster of vectors in a classifiable-sememe vector space; a text to be processed is likewise represented as a vector in that space and compared against its k most similar vectors in order to decide whether the text should be presented to the user. Experiments show that this is a fairly effective filtering method.
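In symbols, and as a sketch only: the fuller English abstract below states that relevance is the cosine of the angle between vectors, while combining the k nearest-neighbour relevances by averaging against a threshold θ is an assumption about how the decision is made, not a formula given by the authors:

$$
\mathrm{rel}(d_1, d_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{\lVert\vec{d}_1\rVert\,\lVert\vec{d}_2\rVert},
\qquad
\text{accept } d \iff \frac{1}{k} \sum_{d_i \in \mathrm{kNN}(d)} \mathrm{rel}(d, d_i) \ge \theta .
$$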

English Abstract

The WWW is increasingly being used as a source of information, and users access this growing volume of information with direct manipulation tools. Clearly, we would like a tool that keeps the texts we want and removes the texts we do not want from such a large flow of information. This paper describes a module that sifts through large numbers of texts retrieved by the user. The module is based on HowNet, a knowledge dictionary developed by Mr. Zhendong Dong, in which the concept of a word is decomposed into sememes. In the philosophy of HowNet, all concepts in the world can be expressed by combinations of more than 1,500 sememes. The sememe is very useful for dealing with synonymy, which is the most difficult problem in text filtering. We divided the set of sememes into two subsets: classifiable sememes and unclassifiable sememes. Classifiable sememes are those that are more useful for distinguishing a document's class from that of other documents; unclassifiable sememes are those that appear similarly in all documents. There are about 800 classifiable sememes, and we used them to build the Classifiable Sememe Vector Space (CSVS).

A text is represented as a vector in the CSVS after the following steps: (1) text preprocessing, which judges the language of the text and applies processing appropriate to that language; (2) keyword extraction; (3) keyword sense disambiguation based on context, by calculating the relevance between a keyword's classifiable sememes and the classifiable sememes of its surrounding words: we increase the weight of a semantic item if its classifiable sememes also occur in the semantic items of the surrounding words (this is not a strict disambiguation algorithm; we only adjust the weights of the semantic items); (4) the keywords are reduced to sememes, and the weights of the classifiable sememes of all the keywords' semantic items are accumulated to form the feature weights of the vector.

A user provides some texts to express the kind of text he is interested in. They are all expressed as vectors in the CSVS, and these vectors represent the user's preference. The relevance of two texts is measured by the cosine of the angle between their vectors. When a new text arrives, it is also expressed as a vector in the CSVS, and we find its k nearest neighbours among the user-provided texts. We then calculate the relevance of the new text to its k nearest neighbours: if it is greater than a certain threshold, the text is of interest to the user; if it is smaller, it is not. The value of k is determined by computing the neighbours of every training vector.

Information filtering based on classifiable sememes has several advantages: (1) a low-dimensional input space, since we use 800 sememes instead of some 10,000 words; (2) few irrelevant features after keyword extraction and the removal of unclassifiable sememes; (3) large feature weights in the document vectors. We used documents from eight different users in our experiments, all of whom provided texts in both Chinese and English. Taking the users' feedback into account, we obtained recall and precision of about 88 percent, which demonstrates that this is a successful method.
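As a concrete illustration of the pipeline described above, the following Python sketch shows one way the CSVS representation, the cosine relevance measure, and the kNN threshold decision could fit together. It is not the authors' implementation: the HowNet lookup WORD_TO_SEMEMES, the set CLASSIFIABLE_SEMEMES, and the default values of k and threshold are all illustrative assumptions.

```python
import math
from collections import Counter

# Illustrative placeholders: in the paper these come from HowNet
# (the ~800 classifiable sememes and the word -> sememes dictionary).
CLASSIFIABLE_SEMEMES: set = set()
WORD_TO_SEMEMES: dict = {}

def text_to_csvs_vector(keywords):
    """Accumulate classifiable-sememe weights for a text's extracted keywords."""
    weights = Counter()
    for word in keywords:
        for sememe in WORD_TO_SEMEMES.get(word, []):
            if sememe in CLASSIFIABLE_SEMEMES:
                weights[sememe] += 1.0
    return weights

def cosine(u, v):
    """Cosine of the angle between two sparse CSVS vectors."""
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def is_of_interest(new_vec, user_vecs, k=5, threshold=0.5):
    """Accept the new text if its average relevance to its k nearest
    user-provided texts exceeds the threshold."""
    sims = sorted((cosine(new_vec, v) for v in user_vecs), reverse=True)
    nearest = sims[:k]
    return bool(nearest) and sum(nearest) / len(nearest) >= threshold
```

The abstract notes that k itself is chosen by examining the neighbours of every training vector; that tuning step is omitted from this sketch.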
