Characteristics of document similarity measures for compliance analysis

Due to increased competition in the IT services business, improving quality, reducing costs and shortening schedules have become extremely important. A key strategy for achieving these goals is an asset-based approach to service delivery, in which standard reusable components developed by domain experts are minimally modified for each customer instead of building custom solutions from scratch. One example of this approach is the use of contract templates, one for each type of service offered. A compliance checking system that measures how well actual contracts adhere to the standard templates is critical to the success of such an approach. This paper describes the use of two document similarity measures, cosine similarity and Latent Semantic Indexing, to identify the top candidate templates for a given contract, on which a more detailed (and more expensive) compliance analysis can then be performed. A comparison of the results obtained with the different methods is presented.
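The candidate-selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the smoothed TF-IDF weighting, the tokenizer, the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_templates`) and the sample template and contract texts are all assumptions made for illustration. The sketch covers only the cosine similarity measure; Latent Semantic Indexing would additionally apply a truncated singular value decomposition to the term-document matrix before comparing documents, which is not shown here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents.

    The smoothed IDF used here is one common variant, chosen for illustration only.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({t: f * idf[t] for t, f in term_freq.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_templates(contract, templates, top_k=2):
    """Return the top_k (template name, similarity) pairs for a contract."""
    names = list(templates)
    vectors = tfidf_vectors([templates[name] for name in names] + [contract])
    contract_vec = vectors[-1]
    scored = [(name, cosine(contract_vec, vec)) for name, vec in zip(names, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical templates and contract, invented purely for this example.
TEMPLATES = {
    "hosting": "The provider shall host the customer servers in its data center "
               "and guarantee server uptime and backup of customer data.",
    "helpdesk": "The provider shall operate a help desk and resolve support tickets "
                "raised by customer employees within agreed response times.",
    "consulting": "The provider shall supply consultants who advise the customer "
                  "on strategy and deliver a written report.",
}
CONTRACT = ("Provider will host all customer servers in the provider data center, "
            "with nightly backup of data and a guaranteed uptime.")

for name, score in rank_templates(CONTRACT, TEMPLATES):
    print(f"{name}: {score:.3f}")
```

Only the templates returned by `rank_templates` would be passed on to the detailed compliance analysis, which is how a cheap similarity measure can limit the cost of the expensive step.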
