Course-specific search engines: semi-automated methods for identifying high quality topic-specific corpora

Web search is an important research tool for many high school courses. However, generic search engines do not understand the context of a search (the high school course), so they often return results that are off-topic or inappropriate as reference material. In this paper, we introduce the concept of a course-specific search engine and build one for the Advanced Placement US History (APUSH) course; subject matter experts (high school teachers) prefer its results over those of existing search engines. This reference search engine for APUSH relies on a hand-curated set of sites picked specifically for this educational context. To automate this expensive curation process, we describe two algorithms for identifying high-quality topical sites using an authoritative source such as a textbook: one based on textual similarity and another using structured data from knowledge bases. Initial experimental results indicate that these algorithms can successfully classify high-quality documents, leading to the automatic creation of topic-specific corpora for any course.
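
The abstract does not say which similarity measure the text-based algorithm uses. As an illustration only, the sketch below assumes TF-IDF term weighting with cosine similarity, a common choice for comparing a candidate page against an authoritative text such as a textbook chapter. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `score_candidates`) and the sample strings are hypothetical and are not taken from the paper; the knowledge-base algorithm is not sketched here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stemming here)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that terms found in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_candidates(reference_text, candidates):
    """Score each candidate page by its similarity to the reference text."""
    vecs = tfidf_vectors([reference_text] + candidates)
    ref = vecs[0]
    return [cosine(ref, v) for v in vecs[1:]]


# Hypothetical sample data: a textbook-like sentence and two candidate pages.
reference = ("The American Revolution led the thirteen colonies "
             "to declare independence from Britain in 1776.")
candidates = [
    "The thirteen colonies declared independence during the American Revolution.",
    "Compare cheap flights and hotel deals for a summer vacation.",
]
scores = score_candidates(reference, candidates)
for page, score in zip(candidates, scores):
    print(f"{score:.3f}  {page}")
```

In a setup like this, pages scoring above some tuned threshold would be kept as candidates for the topic-specific corpus. Both the threshold and the choice of a whole textbook versus individual chapters as the reference text are design decisions the abstract leaves open.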
