Knowledge capture from multiple online sources with the extensible web retrieval toolkit (eWRT)

Knowledge capture approaches in the age of massive Web data require robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data, both unstructured and structured. This paper addresses this requirement by introducing the Extensible Web Retrieval Toolkit (eWRT), a modular Python API for retrieving social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia. eWRT has been released as an open-source library under the GNU GPLv3. It includes classes for caching and data management, and provides low-level text processing capabilities, including language detection, phonetic string similarity measures and string normalization.
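To make the low-level text processing capabilities more concrete, the following minimal sketch illustrates string normalization, a phonetic similarity check and result caching in plain Python. It does not use or reproduce the eWRT API: the function names normalize, soundex and phonetically_similar are hypothetical, American Soundex stands in as one well-known phonetic measure (the abstract does not say which measures eWRT implements), and functools.lru_cache stands in for a caching layer.

```python
import unicodedata
from functools import lru_cache

# American Soundex digit groups (illustrative choice of phonetic measure).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def normalize(text):
    """Strip diacritics, case-fold, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


@lru_cache(maxsize=1024)  # in-memory cache; hypothetical stand-in for a cache class
def soundex(word):
    """Return the four-character American Soundex code, or '' if no letters."""
    letters = [c for c in normalize(word).upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def phonetically_similar(a, b):
    """Treat two strings as similar when their Soundex codes match."""
    code_a = soundex(a)
    return code_a != "" and code_a == soundex(b)


if __name__ == "__main__":
    print(normalize("  Zürich   HB "))
    print(soundex("Robert"), soundex("Rupert"))
    print(phonetically_similar("Robert", "Rupert"))
```

A binary match on Soundex codes is a coarse measure. A toolkit would more plausibly expose graded similarity scores, but the sketch shows how normalization feeds the phonetic comparison and how caching avoids recomputing codes for repeated terms.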
