Did You Know? A Rule-Based Approach to Finding Similar Questions on Online Health Forums

This paper describes our system submitted to the ICHI 2015 Healthcare Data Analytics Challenge. Given a relatively large corpus of questions posted by users on online health forums, the task is to find, for a newly posted question (the query question), the three most similar questions in the corpus. At its core, our system uses Elasticsearch, a search server built on Lucene. The corpus of existing questions is indexed with n-grams. To retrieve the most similar questions, a set of rules rewrites the query question into a keyword-based query that draws on multiple text components: the title, key phrases, and noun phrases extracted from the question content.
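As an illustration only, the following Python sketch shows one way such a rewrite could be expressed in the Elasticsearch query DSL. It is not the submitted system: the field names (`title`, `content`), the boost values, the choice of word-level shingles for the n-gram index, and the names `INDEX_SETTINGS` and `build_query` are all assumptions made for this example, and key-phrase and noun-phrase extraction is assumed to have happened upstream. The mapping layout follows recent Elasticsearch versions and may differ from the version available in 2015.

```python
import json

# Hypothetical index configuration: a shingle filter emits word n-grams
# (unigrams plus 2- and 3-word shingles). Whether the original system used
# word or character n-grams is not stated; word shingles are an assumption.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "word_ngrams": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "ngram_analyzer"},
            "content": {"type": "text", "analyzer": "ngram_analyzer"},
        }
    },
}


def build_query(title, key_phrases, noun_phrases, size=3):
    """Rewrite a question into a keyword-based bool query.

    Each text component becomes one or more optional ("should") clauses.
    The boosts (3.0, 2.0, 1.0) are illustrative, not tuned values.
    """
    should = []
    if title.strip():
        # Rule 1 (assumed): title words are matched against indexed titles.
        should.append({"match": {"title": {"query": title, "boost": 3.0}}})
    for phrase in key_phrases:
        # Rule 2 (assumed): key phrases must match as phrases in the content.
        should.append(
            {"match_phrase": {"content": {"query": phrase, "boost": 2.0}}}
        )
    for phrase in noun_phrases:
        # Rule 3 (assumed): noun phrases contribute with a lower weight.
        should.append(
            {"match_phrase": {"content": {"query": phrase, "boost": 1.0}}}
        )
    return {
        "size": size,  # the task asks for the three most similar questions
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


if __name__ == "__main__":
    query = build_query(
        title="Headache after running",
        key_phrases=["tension headache"],
        noun_phrases=["cold water", "long run"],
    )
    print(json.dumps(query, indent=2))
```

The resulting dictionary can be sent as the body of a search request against an index created with `INDEX_SETTINGS`. Keeping each component in its own clause makes it straightforward to adjust or drop a single rule without touching the others.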