Automatic Document Categorisation by User Profile in Medline

We investigate improvements to term extraction for document representation and indexing in large document collections such as Medline, the premier bibliographic database of the U.S. National Library of Medicine (NLM). With term extraction methods such as AMTEX and MMTX, document representations become semantically compact and more efficient: each document is reduced to a limited number of meaningful multi-word terms (phrases) rather than a large vector of single words, some of which may carry no distinctive content semantics. We show how this information can be used for the automatic categorisation of medical documents by user profile (i.e., novice users and experts). This is achieved by mapping document terms to external lexical resources such as WordNet and MeSH (the medical thesaurus of NLM). Evaluation results for all methods are presented and discussed.
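
As a rough illustration of the categorisation idea, the following Python sketch labels a document from its extracted multi-word terms. It is a minimal sketch under stated assumptions, not the method evaluated here: the two small vocabularies are hypothetical stand-ins for WordNet and MeSH lookups, and the function name `categorise`, the majority-style decision rule, and the `expert_threshold` parameter are all assumed for illustration.

```python
from collections import Counter

# Hypothetical toy vocabularies standing in for real lexical resources.
# A real system would query WordNet and MeSH instead of these sets.
GENERAL_VOCAB = {"heart attack", "high blood pressure", "chest pain"}
MEDICAL_VOCAB = {"myocardial infarction", "hypertension", "chest pain", "angioplasty"}


def categorise(terms, expert_threshold=0.5):
    """Label a document 'expert' or 'novice' from its extracted terms.

    Assumed rule: a term found only in the medical vocabulary counts as
    specialised; a term found in the general vocabulary counts as general.
    Terms found in neither vocabulary are ignored.
    """
    counts = Counter()
    for term in terms:
        key = term.lower()
        if key in GENERAL_VOCAB:
            counts["general"] += 1
        elif key in MEDICAL_VOCAB:
            counts["specialised"] += 1

    total = counts["general"] + counts["specialised"]
    if total == 0:
        return "unknown"  # no term could be mapped to either resource

    share = counts["specialised"] / total
    return "expert" if share > expert_threshold else "novice"


if __name__ == "__main__":
    print(categorise(["Myocardial infarction", "angioplasty", "chest pain"]))
    print(categorise(["heart attack", "chest pain", "hypertension"]))
```

Under these assumptions, a document dominated by terms that appear only in the medical vocabulary is labelled for experts, while one whose terms are mostly found in the general vocabulary is labelled for novice users.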