Semi-Supervised Soft K-Means Clustering of Life Insurance Questionnaire Responses

The life insurance questionnaire is a long document whose responses mix structured and unstructured data. The unstructured, free-text responses create problems on both sides: the user must spend extra effort on input, and the insurance company must interpret and analyze the text. In this work, we address these problems with a semi-supervised framework that clusters responses into categories by embedding them in a vector space and applying soft k-means clustering. Our experiments show that our method achieves adequate results. The resulting category clusters can be used for analysis and to replace free-text questions in the questionnaire with structured ones.
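The general idea of semi-supervised soft k-means can be illustrated with a minimal sketch. This is not the implementation evaluated in this work; it assumes that responses have already been embedded as numeric vectors, that a few responses per category carry labels (seeds), that every category has at least one seed, and that squared Euclidean distance is used. The function name `seeded_soft_kmeans` and the stiffness parameter `beta` are hypothetical choices made for illustration.

```python
import numpy as np


def seeded_soft_kmeans(X, seed_idx, seed_labels, n_clusters, beta=5.0, n_iter=50):
    """Illustrative seeded soft k-means (assumed setup, not the paper's code).

    X           : (n_samples, n_features) array of response embeddings
    seed_idx    : indices of the labeled responses
    seed_labels : category index (0..n_clusters-1) of each labeled response
    beta        : stiffness; larger values give harder assignments
    """
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)

    # Initialize each centroid as the mean of that category's seeds.
    # Assumption: every category has at least one seed.
    centroids = np.stack(
        [X[seed_idx[seed_labels == k]].mean(axis=0) for k in range(n_clusters)]
    )

    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

        # Soft assignment: a softmax over negative scaled distances,
        # shifted by the row maximum for numerical stability.
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # Keep labeled responses fixed to their known category.
        resp[seed_idx] = 0.0
        resp[seed_idx, seed_labels] = 1.0

        # Update each centroid as the responsibility-weighted mean.
        centroids = (resp.T @ X) / resp.sum(axis=0)[:, None]

    return resp, centroids
```

Each row of the returned `resp` matrix holds one response's degrees of membership across the categories, so a response that sits between two categories can be flagged for review instead of being forced into one. Clamping the seeds is one possible way to use the labels; using them only for initialization is another.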
