HyRead Journal Taiwan Full-Text Database

Article Details

International Journal of Computational Linguistics And Chinese Language Processing THCI

Natural Sciences / Information / Technology

Title	TQDL: Integrated Models for Cross-Language Document Retrieval
Volume:Issue	17:4
Authors	Long-Yue WANG, Derek F. WONG, Lidia S. CHAO
Pages	015-032
Keywords	Cross-Language Document Retrieval, Statistical Machine Translation, TF-IDF, Document Translation-Based, Length-Based Filter, THCI Core
Publication date	2012-12

This paper proposes an integrated approach to Cross-Language Information
Retrieval (CLIR) that combines four statistical models: a translation model, a
query generation model, a document retrieval model and a length filter model.
Given a document in the source language, the statistical machine translation
model first translates it into the target language. The query generation
model then selects the most relevant words in the translated version of the
document as a query. Instead of retrieving all the target documents with the
query, the length-based model filters out a large number of irrelevant
candidates according to their length. Finally, the remaining documents in the
target language are scored by the document retrieval model, which mainly
computes the similarity between the query and each document.
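
The query generation and scoring steps described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes TF-IDF weighting (listed in the keywords), a smoothed IDF formula, and cosine similarity as the scoring function. All function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, doc_freq, n_docs):
    """Return a TF-IDF weight for each term of one tokenized document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {}
    for term, count in counts.items():
        # Smoothed IDF (an assumption) so unseen terms never divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        weights[term] = (count / total) * idf
    return weights

def generate_query(translated_tokens, doc_freq, n_docs, query_size):
    """Keep the query_size highest-weighted terms of the translated document."""
    weights = tf_idf_weights(translated_tokens, doc_freq, n_docs)
    # Sort by descending weight, breaking ties alphabetically.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:query_size]
    return dict(top)

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, candidates, doc_freq, n_docs):
    """Score every candidate against the query, best match first."""
    scored = [
        (cosine(query, tf_idf_weights(tokens, doc_freq, n_docs)), doc_id)
        for doc_id, tokens in candidates.items()
    ]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

# Toy target-language collection (hypothetical data).
collection = {
    "d1": "the parliament adopted the budget resolution".split(),
    "d2": "the committee discussed fishing quotas".split(),
    "d3": "the budget debate continued in parliament".split(),
}
doc_freq = Counter(t for tokens in collection.values() for t in set(tokens))

# Stand-in for the machine-translated source document.
translated = "parliament adopted the budget resolution".split()
query = generate_query(translated, doc_freq, len(collection), query_size=3)
ranking = rank(query, collection, doc_freq, len(collection))
```

In this sketch `query_size` plays the role of the query size mentioned in the abstract, and the translation step is replaced by an already-translated token list.
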
Unlike the traditional parallel-corpora-based model, which relies on the IBM
algorithm, our CLIR model is divided into four independent parts that work
together to handle term disambiguation, query generation and document
retrieval. In addition, the TQDL method can efficiently address translation
ambiguity and query expansion for disambiguation, which are major issues in
Cross-Language Information Retrieval. Another contribution is the length
filter, which is trained on a parallel corpus according to the length ratio
between the two languages. It not only improves the recall value by
dynamically filtering out many useless documents, but also increases
efficiency by reducing the search space. Therefore, precision can be improved
without sacrificing recall.
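
A length filter of this kind might look like the sketch below. The abstract states only that the filter is trained from the length ratio between the two languages in a parallel corpus. The use of a mean and standard deviation, the tolerance parameter `k`, and the token-count lengths are assumptions made for illustration.

```python
from statistics import mean, stdev

def train_length_ratio(parallel_lengths):
    """Estimate the target/source length ratio from aligned document pairs.

    parallel_lengths holds (source_length, target_length) pairs; at least
    two pairs with a non-zero source length are needed for stdev.
    """
    ratios = [tgt / src for src, tgt in parallel_lengths if src > 0]
    return mean(ratios), stdev(ratios)

def length_filter(source_len, candidates, ratio_mean, ratio_std, k=2.0):
    """Keep candidates whose length lies within k deviations of the expected one.

    The band of k standard deviations is an assumption of this sketch.
    """
    low = source_len * (ratio_mean - k * ratio_std)
    high = source_len * (ratio_mean + k * ratio_std)
    return [doc_id for doc_id, n in candidates.items() if low <= n <= high]

# Hypothetical training pairs: (source length, target length) in tokens.
parallel_lengths = [(10, 11), (20, 24), (30, 33)]
ratio_mean, ratio_std = train_length_ratio(parallel_lengths)

# Hypothetical candidate lengths for a 100-token source document.
candidates = {"short": 40, "match": 115, "long": 300}
kept = length_filter(100, candidates, ratio_mean, ratio_std)
```

Only the surviving candidates would then be passed to the document retrieval model, which is how the filter reduces the search space.
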
To evaluate the retrieval performance of the proposed model on cross-language
document retrieval, a number of experiments were conducted under different
settings. First, the Europarl corpus, a collection of parallel texts in 11
languages from the proceedings of the European Parliament, was used for
evaluation. We also tested the models extensively on the case where text
lengths are uneven and some texts have similar content under the same topic,
because such texts are hard to distinguish and make it difficult to use the
resources fully.
After comparing different strategies, the experimental results show strong
performance for the method. Precision is normally above 90% when a larger
query size is used. The length-based filter plays a very important role in
improving the F-measure and optimizing efficiency.
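
For reference, precision, recall and the F-measure can be computed as in the sketch below, using the standard definitions. The document identifiers are made up for illustration and are not results from the paper. The `beta` parameter is an assumption, with `beta=1.0` giving the balanced F1 score.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for one retrieval run."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Hypothetical run: four documents retrieved, three truly relevant.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```
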
This fully illustrates the discriminative power of the proposed method. It is
of great significance both for cross-language searching on the Internet and
for producing parallel corpora for statistical machine translation systems. In
future work, the TQDL system will be evaluated on Chinese, which is a bigger
challenge and more meaningful for CLIR.
