
Journal of Computers (EI, MEDLINE, Scopus)

Title: A Method of Detecting Approximate Repetitive News Documents
Volume/Issue: 29:2
Authors: Xueping Liang, Xiaojun Wen
Pages: 104-109
Keywords: approximate repetition of documents, document clusters, multi-feature fingerprint clusters
Publication date: 2018-04
DOI: 10.3966/199115992018042902011



To address the large number of duplicated webpages on the Internet, this paper proposes an algorithm and system for detecting approximately duplicate webpages that combines multi-feature fingerprint cluster detection with document similarity detection. In this scheme, multi-feature fingerprint cluster detection is applied first to ensure the precision and efficiency of the algorithm; for the small portion of documents that it fails to recall, the approximately duplicate webpage detection algorithm is then applied to guarantee the recall rate. The scheme improves both precision and recall rate while keeping a good balance with performance.
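
The two-stage flow described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch only, not the authors' implementation: the abstract does not state which features make up the fingerprint or which similarity measure is used, so the choices here (hashes of the longest sentences as the fingerprint, Jaccard similarity over word 3-grams as the fallback, and the `min_shared` and `threshold` parameters) are all assumptions, and the fallback stage is read as the document similarity detection mentioned in the first sentence.

```python
import hashlib
import re


def _sentences(text):
    # Split on sentence-ending punctuation; normalize case and whitespace.
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def fingerprints(text, k=3):
    """Assumed fingerprint: hashes of the k longest sentences."""
    longest = sorted(_sentences(text), key=len, reverse=True)[:k]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in longest}


def shingles(text, n=3):
    """Set of word n-grams used by the assumed fallback similarity check."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_near_duplicate(doc, corpus, min_shared=2, threshold=0.6):
    """Return (is_duplicate, stage) for doc against a list of known documents."""
    # Stage 1: cheap fingerprint comparison, intended for precision and speed.
    fp = fingerprints(doc)
    for other in corpus:
        if len(fp & fingerprints(other)) >= min_shared:
            return True, "fingerprint"
    # Stage 2: costlier similarity check for documents stage 1 did not recall.
    sh = shingles(doc)
    for other in corpus:
        if jaccard(sh, shingles(other)) >= threshold:
            return True, "similarity"
    return False, None
```

In this sketch, a copy of a story with an extra short sentence appended is caught in the fingerprint stage, while a copy whose sentence boundaries were changed (so the sentence hashes no longer match) falls through to the similarity stage. A production system would precompute and index the fingerprints rather than recomputing them for every comparison.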

