Title | Linking Databases using Matched Arabic Names |
---|---|
Volume/Issue | 19:1 |
Author | Tarek El-Shishtawy |
Pages | 033-053 |
Keywords | Name Matching, Record Linkage, Data Integration, Arabic NLP, Information Retrieval, THCI Core |
Publication date | 201403 |
In this paper, a new hybrid algorithm that combines token-based and character-based approaches is presented. The basic Levenshtein approach has also been extended to a token-based distance metric. The distance metric is enhanced so that the granularity of the algorithm's behavior can be set appropriately: it smoothly combines a threshold on misspelling differences at the character level with the importance of token-level errors in terms of token position and frequency.
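To make the hybrid idea concrete, the following is a minimal sketch rather than the algorithm from the paper. It assumes that two tokens count as equal when their normalized character-level Levenshtein similarity reaches a threshold, and it then runs a plain Levenshtein distance over tokens. The function names, the 0.75 threshold, and the unit edit costs are illustrative assumptions; the position and frequency weighting described in the abstract is omitted. Transliterated names are used for readability.

```python
def char_levenshtein(a: str, b: str) -> int:
    """Classic character-level Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tokens_match(t1: str, t2: str, threshold: float = 0.75) -> bool:
    """Treat two tokens as the same name component if they are similar enough.

    The threshold value is an assumption made for this sketch.
    """
    longest = max(len(t1), len(t2))
    if longest == 0:
        return True
    similarity = 1 - char_levenshtein(t1, t2) / longest
    return similarity >= threshold


def token_levenshtein(name1: str, name2: str, threshold: float = 0.75) -> int:
    """Levenshtein distance over whitespace-separated tokens.

    A substitution is free when the two tokens match at the character level.
    Every other edit costs 1 here; the paper weights token errors by position
    and frequency, which this sketch does not attempt to reproduce.
    """
    a, b = name1.split(), name2.split()
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            cost = 0 if tokens_match(ta, tb, threshold) else 1
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

Under these assumptions, a small spelling variation inside a token (for example `mohamed` versus `mohammed`) is absorbed at the character level, while a dropped middle name shows up as a single token-level edit.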
Using a large Arabic dataset, the experimental results show that the proposed algorithm successfully overcomes many types of errors, such as typographical errors, omission or insertion of middle name components, omission of non-significant popular names, and character variations arising from different writing styles. When compared with other classical algorithms on the same dataset, the proposed algorithm was found to raise the minimum success level from 69%, the lower limit of the best classical algorithm tested (Soft TFIDF), to about 80%, while achieving an upper accuracy level of 99.67%.