Handling the growing volume of textual information available on the Web, and exploiting the knowledge latent in this mass of unstructured data, requires a wide variety of linguistic knowledge and resources (Language Identification, Morphological Analysis, Entity Extraction, etc.). In the last decade, LRaaS (Language Resource as a Service) has emerged as a novel paradigm for publishing and sharing these heterogeneous software resources over the Web. In this paper we present an overview of Linguagrid, a recent initiative that implements an open network of linguistic and semantic Web Services for the Italian language, as well as a new approach for enabling customizable corpus-based linguistic services on the Linguagrid LRaaS infrastructure. A corpus ingestion service allows users to upload corpora of documents and to generate classification or clustering models tailored to their needs by applying standard machine learning techniques to the textual contents and metadata of the corpora. The generated models can then be accessed through dedicated Web Services and used to process and classify new textual content.
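The corpus-to-model workflow described above (ingest a labelled corpus, train a model, classify new text) can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes classifier with add-one smoothing as one example of a "standard machine learning technique"; the abstract does not say which algorithms Linguagrid uses. The function names `tokenize`, `train`, and `classify`, and the toy Italian corpus, are hypothetical and are not part of any Linguagrid API.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real service would do far more)."""
    return text.lower().split()


def train(corpus):
    """Build a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # token frequencies per label
    vocabulary = set()
    for text, label in corpus:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def classify(model, text):
    """Return the label with the highest log-posterior for a new text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # log prior
        label_total = sum(model["words"][label].values())
        for token in tokenize(text):
            if token not in model["vocab"]:
                continue  # ignore tokens never seen during training
            count = model["words"][label][token]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (label_total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical user-uploaded corpus: (document text, category) pairs.
corpus = [
    ("la squadra ha vinto la partita", "sport"),
    ("il portiere ha parato il rigore", "sport"),
    ("il governo ha approvato la legge", "politica"),
    ("il parlamento vota la riforma", "politica"),
]
model = train(corpus)
print(classify(model, "la squadra ha vinto"))
```

In a service-oriented setting, `train` would correspond to the ingestion step that runs once per uploaded corpus, while `classify` would sit behind a Web Service endpoint that serves the stored model.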