Detection of News Feeds Items Appropriate for Children

Identifying child-appropriate web content is an important yet difficult classification task. The task is novel in that it seeks to determine age (or child) appropriateness, which is not necessarily topic-based, and it must do so despite imbalanced class sizes and a lack of quality training data with human judgements of appropriateness. Classifying feeds, a subset of web content, presents further challenges because of their temporal nature and short document format. In this paper, we discuss these challenges and present baseline results for the task through an empirical study that classifies incoming news stories as appropriate (or not) for children. We show that while the naive Bayes approach produces a higher AUC, it is vulnerable to the imbalanced data problem, and that a support vector machine provides a more robust overall solution. Our research shows that classifying children's content is a non-trivial task with greater complexities than standard text-based classification. Although the F-score values are consistent with other research on age-appropriate text classification, we introduce a new problem with a new dataset.
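The contrast between a high AUC and sensitivity to class imbalance can be illustrated with a small worked example. The sketch below is not the study's data or its classifiers: the labels, the scores, the 0.5 decision threshold, and the helper names `auc` and `f1` are all assumptions chosen for illustration. It shows how a scorer can rank every minority-class item above every majority-class item (AUC of 1.0) and still label none of them positive, because its scores are pulled toward the majority class.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(labels, scores, threshold=0.5):
    """F-score of the positive class after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy set: 2 positive items among 10 (imbalanced classes).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Scorer A: perfect ranking, but every score sits below the 0.5 threshold.
scores_a = [0.40, 0.35, 0.30, 0.20, 0.10, 0.10, 0.05, 0.05, 0.02, 0.01]

# Scorer B: one ranking mistake, but its scores straddle the threshold.
scores_b = [0.90, 0.45, 0.60, 0.20, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, "AUC:", auc(labels, scores), "F1:", f1(labels, scores))
```

On this toy set, scorer A has the higher AUC but an F-score of zero, while scorer B trades a little ranking quality for usable decisions. This is one reason to report both measures when classes are imbalanced.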
