New Challenges For NLP Frameworks Programme

We present a practical problem: analysing a large dataset of heterogeneous documents obtained by crawling the web for information related to web services. The analysis involves extracting information from natural-language (HTML and PDF) and machine-readable (WSDL) documents using NLP and other techniques, classifying both documents and services (each service being defined by a set of documents), and exporting the results as RDF for use in the back-end of a portal that uses Web 2.0 and Semantic Web technology. Triples representing manual annotations made on the portal are exported back to our application, both to evaluate parts of our analysis and to serve as training data for machine learning (ML). The application was implemented in the GATE framework, was successfully incorporated into an integrated project, and included a number of components shared with our group’s other projects.
