Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

Title: TQDL: Integrated Models for Cross-Language Document Retrieval
Volume/Issue: 17:4
Authors: Long-Yue WANG, Derek F. WONG, Lidia S. CHAO
Pages: 015-032
Keywords: Cross-Language Document Retrieval, Statistical Machine Translation, TF-IDF, Document Translation-Based, Length-Based Filter, THCI Core
Publication date: 2012-12

Chinese Abstract

English Abstract

This paper proposes an integrated approach to Cross-Language Information
Retrieval (CLIR) that combines four statistical models: a translation model, a
query generation model, a document retrieval model and a length filter model.
Given a document in the source language, the statistical machine translation
model first translates it into the target language. The query generation model
then selects the most relevant words in the translated document as a query.
Instead of retrieving all the target documents with the query, the length-based
model filters out a large number of irrelevant candidates according to their
length. Finally, the remaining documents in the target language are scored by
the document retrieval model, which mainly computes the similarity between the
query and each document.
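The query generation and scoring steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the translated document is already tokenized, uses a smoothed TF-IDF weighting and cosine similarity as stand-ins for the paper's models, and the function names (`tfidf_weights`, `generate_query`, `retrieve`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one tokenized document.

    `corpus` is a list of tokenized target-language documents and is used
    only to estimate document frequencies (smoothed IDF, an assumption).
    """
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def generate_query(translated_tokens, corpus, query_size):
    """Pick the `query_size` highest-weighted terms of the translated document."""
    weights = tfidf_weights(translated_tokens, corpus)
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:query_size]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_terms, corpus):
    """Score every target document against the query, best match first."""
    q = tfidf_weights(query_terms, corpus)
    scored = [(cosine(q, tfidf_weights(doc, corpus)), i)
              for i, doc in enumerate(corpus)]
    return sorted(scored, reverse=True)


# Toy target-language collection (made-up tokens, for illustration only).
corpus = [
    ["parliament", "votes", "budget"],
    ["fishing", "quota", "debate"],
    ["budget", "deficit", "budget", "report"],
]
translated = ["budget", "deficit", "report", "budget"]  # output of the MT step
query = generate_query(translated, corpus, query_size=2)
ranking = retrieve(query, corpus)
```

In this sketch the query size is the main knob: a larger query carries more of the translated document's vocabulary into the similarity computation.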
Unlike the traditional parallel corpora-based model, which relies on the IBM
algorithm, our CLIR model is divided into four independent parts that work
together to handle term disambiguation, query generation and document
retrieval. The TQDL method can also efficiently address translation ambiguity
and query expansion for disambiguation, which are major issues in
Cross-Language Information Retrieval. Another contribution is the length
filter, which is trained from a parallel corpus according to the length ratio
between the two languages. It not only improves recall by dynamically filtering
out many useless documents, but also increases efficiency by working in a
smaller search space. Therefore, precision can be improved without sacrificing
recall.
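One way to picture a length filter trained on length ratios is shown below. The abstract does not give the exact formulation, so this sketch assumes the target/source length ratio is summarized by its mean and standard deviation and that candidates outside a `k`-sigma window are discarded; `train_length_ratio`, `length_filter` and the parameter `k` are hypothetical names.

```python
import statistics


def train_length_ratio(parallel_lengths):
    """Estimate mean and std of target/source length ratios.

    `parallel_lengths` is a list of (source_len, target_len) pairs taken
    from aligned documents in a parallel corpus.
    """
    ratios = [tgt / src for src, tgt in parallel_lengths if src > 0]
    return statistics.mean(ratios), statistics.pstdev(ratios)


def length_filter(source_len, candidate_lengths, mean, std, k=2.0):
    """Return indices of candidates whose length is plausible for the source.

    A candidate is kept when its length lies within
    source_len * (mean +/- k * std); `k` is an assumed tolerance.
    """
    low = source_len * (mean - k * std)
    high = source_len * (mean + k * std)
    return [i for i, n in enumerate(candidate_lengths) if low <= n <= high]


# Toy lengths in tokens (made-up numbers, for illustration only).
pairs = [(100, 110), (200, 230), (50, 52), (80, 92)]
mean, std = train_length_ratio(pairs)
kept = length_filter(100, [105, 300, 118, 40], mean, std)
```

Only the candidates that survive this check need to be scored by the retrieval model, which is where the smaller search space comes from.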
To evaluate the retrieval performance of the proposed model on cross-language
document retrieval, a number of experiments were conducted under different
settings. First, the Europarl corpus, a collection of parallel texts in 11
languages from the proceedings of the European Parliament, was used for
evaluation. We also tested the models extensively on the case in which the
texts are of uneven length and some of them have similar content under the same
topic, because such documents are hard to distinguish and make it hard to make
full use of the resources.
After comparing different strategies, the experimental results show that the
method performs strongly. Precision is normally above 90% when a larger query
size is used. The length-based filter plays a very important role in improving
the F-measure and optimizing efficiency.
These results illustrate the discriminative power of the proposed method. It is
of great significance both to cross-language searching on the Internet and to
the production of parallel corpora for statistical machine translation systems.
In future work, the TQDL system will be evaluated on Chinese, which is a greater
challenge and more meaningful to CLIR.
