Disentangling the Wikipedia Category Graph for Corpus Extraction

Several areas of research, such as knowledge management and natural language processing, require domain-specific corpora for tasks such as terminology extraction and ontology learning. The investigations presented here are based on the assumption that Wikipedia can be used for corpus extraction. Wikipedia has the advantage of possessing a semantic layer, which should ease the extraction of domain-specific corpora. Yet, because the Wikipedia category graph is scale-free, it cannot be used as is for this purpose. In this paper, we propose a novel approach to graph clustering called BorderFlow, which we apply to and evaluate on the Wikipedia category graph. Possible further applications of these results in the area of information retrieval are also presented.
