A Semantic-based Document Processing Framework: A Security Perspective

The coexistence of different formats and physical supports for storing data is one of the main open issues in document management systems. In particular, the presence of unstructured data is a major limitation for the processing and analysis of many documents and processes. To address this, we exploit different techniques to analyze texts and automatically extract relevant information, concepts, or complex relations. In this paper we propose a general framework for data transformation and implement this model through an architecture based on semantic analysis. The analysis that can be performed on the data has many different applications; here we illustrate one perspective, namely how to enforce fine-grained access control on sensitive data encapsulated in unstructured, monolithic files. We also present a case study on the formalization and protection of e-health medical records.
