A combined topical/non-topical approach to identifying web sites for children

Children interact with information services more and more frequently. Online in particular, a great deal of content is unsuitable for their age group, and given the growing importance and ubiquity of the Internet, denying children all unsupervised Web access is often not possible. This work presents an automatic method for distinguishing web pages for children from those for adults, with the aim of improving the performance of child-appropriate web search engines. We discuss and evaluate 80 features based on findings from cognitive science and child psychology. We also report on a large-scale user study on the suitability of web sites and describe the insights gained in detail. Finally, a comparison with traditional web classification methods and with human annotator performance shows that our automatic classifier approaches the level of human agreement.
