Title | 混合型資料集的k-means 分群演算法 (A k-means clustering algorithm for mixed-type data sets) |
---|---|
Volume:Issue | 19:1 |
Parallel title | A k-means Based Clustering Algorithm for Mixed-Attribute Data Sets |
Authors | 黃宇翔 、 王品鈞 、 方志強 |
Pages | 001-028 |
Keywords | Clustering analysis 、 k-means 、 ordinal attribute 、 distance measure 、 TSSCI |
Publication date | 2017-06 |
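Cluster analysis is one of the clustering techniques used in data mining. With the rapid development of network environments, the types and number of data attributes have grown considerably, which sharply degrades the performance of traditional clustering techniques, and the traditional k-means method can hardly cope. Subsequent research has therefore focused on handling numerical, categorical, and ordinal attribute data. Building on the k-means distance measure defined by Ahmad and Dey (2007), this study performs cluster analysis on data sets in which all three attribute types are present, applies a separate distance measure to each attribute type, and proposes a genetic algorithm to find the combination of cluster centers with the best value of the evaluation index. We hope the method can be applied widely to solve the practical clustering difficulties caused by mixing the three attribute types.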
Clustering is one of the most important analysis methods in data mining. With the rapid development of network technology, the growing variety of attribute types and the large number of data items make clustering substantially less efficient. Among clustering approaches, partitioning methods are relatively easy to implement and fast to run. Mixed attribute types, however, complicate clustering. Most of the literature addresses either numerical and categorical attributes together or ordinal attributes alone, and the results are less satisfactory in terms of accuracy and execution time. The proposed approach, based on the k-means method of Ahmad and Dey (2007), handles numerical, categorical, and ordinal attributes simultaneously: Euclidean distance defines numerical similarity, the frequency of each value's rank indicates categorical similarity, and a normalized distance measures ordinal similarity. Its effectiveness is evaluated with an essential clustering criterion, namely minimizing the ratio of within-cluster error to between-cluster error. A genetic algorithm is also developed to reduce the execution time of clustering the three attribute types at the same time. We hope the proposed method provides a useful clustering technique for practical applications.
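To make the abstract's ingredients concrete, the following is a minimal Python sketch of a distance over records that mix numerical, categorical, and ordinal fields, together with a within-to-between error ratio. It is an illustration under stated assumptions, not the authors' method: the function names `mixed_distance` and `within_between_ratio`, the column-index arguments, and the use of records as cluster centers are all hypothetical, and the categorical term uses simple 0/1 mismatch as a stand-in for the frequency-based measure the paper describes.

```python
import math

def mixed_distance(a, b, numeric_idx, categorical_idx, ordinal_levels):
    """Distance between two records with numeric, categorical and ordinal fields.

    ordinal_levels maps a column index to the ordered list of that column's levels.
    All names and the way the three parts are combined are illustrative assumptions.
    """
    # Numerical part: squared Euclidean difference over the numeric columns.
    num = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
    # Categorical part: simple 0/1 mismatch (a stand-in, NOT the paper's measure).
    cat = sum(1.0 for i in categorical_idx if a[i] != b[i])
    # Ordinal part: rank gap normalized to [0, 1] by the number of levels, squared.
    ordi = 0.0
    for i, levels in ordinal_levels.items():
        gap = abs(levels.index(a[i]) - levels.index(b[i]))
        ordi += (gap / (len(levels) - 1)) ** 2
    return math.sqrt(num + cat + ordi)

def within_between_ratio(records, labels, centers, dist):
    """Ratio of within-cluster error to between-cluster error (smaller is better)."""
    # Within: squared distance from each record to its assigned center.
    within = sum(dist(r, centers[l]) ** 2 for r, l in zip(records, labels))
    # Between: squared distance summed over every pair of distinct centers.
    between = sum(
        dist(centers[i], centers[j]) ** 2
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return within / between

# Usage with one numeric, one categorical and one ordinal column.
levels = {2: ["low", "medium", "high"]}
d = lambda x, y: mixed_distance(x, y, [0], [1], levels)
a, b, c = (1.0, "red", "low"), (4.0, "blue", "high"), (2.0, "red", "medium")
ratio = within_between_ratio([a, c, b], [0, 0, 1], [a, b], d)
```

In this sketch the numeric column is left unscaled, so a wide-ranging numeric attribute would dominate the other two terms. Standardizing numeric columns first is one common remedy. A search procedure such as a genetic algorithm could then score candidate sets of centers by this ratio.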