大學圖書館 (University Library)

General / Library and Information Science

Title	分類不一致對文件自動分類效果的影響
Volume:Issue	9:1
Parallel title	The Effect of Inconsistency in Training Data on Automatic Text Categorization
Author	曾元顯
Pages	2-19
Keywords	Document classification, Consistency, Test collection for categorization, Subject analysis, Duplicate detection
Publication date	March 2005

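Chinese Abstract (translated)

This article examines how inconsistent classification affects the effectiveness of automatic text classification. Using automatic detection of near-duplicate documents, together with comparative experiments that apply two classification methods to two test collections, it finds that inconsistency in the classification of training data, even at a level as high as 34%, has almost no effect on classifier effectiveness. An important implication is that the conclusions of past studies remain valid even where their experiments used test collections of low consistency. That said, training on data with high classification inconsistency yields lower classification effectiveness regardless of how good the classifier is. Beyond these findings, the article introduces a Chinese test collection for text classification that is freely available for research use. The author also proposes a reliable method for detecting duplicate or similar documents; compared with earlier methods, it can detect similar documents that those methods miss.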

English Abstract

This article discusses the effect of inconsistency in training data on the performance of text classifiers. Our experiments show that inconsistency, even at a level as high as 34%, hardly affects the effectiveness of the classifiers. Better classifiers perform better regardless of duplicates and label inconsistency. The implication is that past experiments (especially those on the Reuters-21578 collection) remain valid. In the course of the experiments, the author proposes a duplicate detection technique that is far more effective than previous ones. A new Chinese test collection for text categorization is also introduced and made available for free download.
