Hossur'Tech Participation in Interactive INFILE

Tasks performed: interactive INFILE filtering, French to French and French to English.

Main objectives of experiments: As Hossur'Tech started from scratch in mid-January to build an information extraction system based on deep linguistic analysis, the INFILE runs came too early for us to use our linguistic tools. Our objective in performing the runs was to experiment with comparison methods on real data, to help us design our future system.

Approach used: Topics were processed with a limited version of XFST using our own resources, yielding part-of-speech tagging and lemmatization. The same linguistic processing could not be applied to the documents because of the volume limitation of our version of XFST; a simple dictionary look-up without disambiguation was used instead. We were only able to process French and English in time; Arabic would have required a little more time. For each topic, the title, description, and narrative contents were used. The example document served only as a first positive feedback and was not strictly included in the topic. For documents, only the title and text were used. From all document words we inferred monolingual equivalents (for the French-to-French comparison) or translations (for the French-to-English comparison). A word intersection was computed, from which a concept intersection was established: all words inferred from the same source word were considered to represent the same concept. Each concept in the topic-document intersection receives a weight based both on statistics computed on a similar corpus (the CLEF corpus) and on whether the concept appears in the topic keyword list or title. Proper nouns also receive an increased weight. A tentative threshold between relevant and irrelevant documents was set between the weight of the example document and the maximum weight of documents relevant to other topics.

Adaptation: The threshold was adjusted according to the simulated feedback.
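The concept-intersection scoring described above can be sketched as follows. This is a minimal illustration, not our actual implementation: the data shapes, function names, and boost values (title, keyword, and proper-noun multipliers) are all assumptions made for the example.

```python
def score_document(topic, doc_words, corpus_weight, title_boost=2.0,
                   keyword_boost=2.0, proper_noun_boost=1.5):
    """Score a document against a topic by concept intersection.

    Hypothetical sketch: every word inferred from the same source word
    counts as one concept; concepts in the topic-document intersection
    are weighted by a corpus statistic and boosted when they occur in
    the topic title or keyword list, or are proper nouns.
    """
    # topic: dict with 'concepts' mapping concept id -> set of word forms
    # (equivalents or translations), plus sets 'title', 'keywords', and
    # 'proper_nouns' of concept ids. All boosts are illustrative values.
    score = 0.0
    for concept, forms in topic["concepts"].items():
        if forms & doc_words:                 # word-level intersection
            w = corpus_weight.get(concept, 1.0)
            if concept in topic["title"]:
                w *= title_boost
            if concept in topic["keywords"]:
                w *= keyword_boost
            if concept in topic["proper_nouns"]:
                w *= proper_noun_boost
            score += w
    return score
```

A document is then accepted when its score exceeds the current topic threshold.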
Each word included in at least two relevant documents is added to the topic word set. We requested four feedbacks for each topic, which is too few compared with real use of such systems.

Resources employed: own dictionaries.

Results obtained: a large number of non-relevant documents, because the feedback did not allow the threshold to be adjusted. Our not considering that a document could belong to several topics also produced a large number of irrelevant documents. The low level of feedback per topic (four) was not enough to add words from relevant documents to the topics.

ACM categories and subject descriptors: H.3.3 Information Search and Retrieval, Information filtering

Free keywords: adaptive filtering, cross-lingual filtering, natural language processing
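The adaptation step described above (threshold adjustment from simulated feedback, plus adding words that occur in at least two relevant documents to the topic) can be sketched as below. The additive threshold update and the `step` value are assumptions for illustration; the source does not specify the update rule.

```python
from collections import Counter

def adapt(threshold, topic_words, feedback, step=0.1):
    """Adjust the threshold and topic word set from feedback.

    Hypothetical sketch: lower the threshold after a missed relevant
    document, raise it after a false alarm, and add to the topic every
    word that occurred in at least two relevant feedback documents.
    """
    # feedback: list of (doc_words, is_relevant, score) tuples
    counts = Counter()
    for doc_words, relevant, score in feedback:
        if relevant:
            counts.update(set(doc_words))
            if score < threshold:          # missed a relevant document
                threshold -= step
        elif score >= threshold:           # an irrelevant document passed
            threshold += step
    new_words = {w for w, c in counts.items() if c >= 2}
    return threshold, topic_words | new_words
```

With only four feedbacks per topic, as in our runs, few words reach the two-document count and the threshold barely moves, which matches the behaviour reported above.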