Lifting User Generated Comments to SIOC

On web pages, HTML boilerplate code acts as a set of presentation directives that tell a browser how to display data to a human end user. For machines, our community has made tremendous efforts since the advent of the Linked Data paradigm to provide querying endpoints using consensual schemas, protocols, and principles. These data lifting efforts have been the primary material for bootstrapping the Web of data. Data lifting usually starts from an original data structure, for which the semantic architect has to produce a mapper to RDF vocabularies. Less effort has gone into lifting data produced by a Web mining process, owing to the difficulty of providing an efficient and scalable solution. Nonetheless, the Web of documents is mainly composed of natural language entangled in HTML boilerplate code, and few data schemas can be mapped into RDF. In this paper, we present CommentsLifter, a system that is able to lift SIOC data from user-generated comments in the Web 2.0.
