Feature Dimensionality Reduction for Mammographic Report Classification

The amount and variety of medical data coming from multiple, heterogeneous sources can inhibit analysis, manual interpretation, and the use of simple data management applications. This paper presents an in-depth overview of the principal algorithms for dimensionality reduction and applies the most effective techniques to a dataset of 4461 mammographic reports. The most useful medical terms are converted and represented using a TF-IDF matrix, in order to enable data mining and retrieval tasks. A series of queries was performed on the raw matrix and on the same matrix after dimensionality reduction with the most useful techniques, such as LSI, PCA, and SVD. The query results obtained on the reduced matrix are comparable to those achieved on the raw, unprocessed matrix, even though the reduced matrix contains less than 13% of the raw TF-IDF data with the PCA-LSI techniques and less than 6% of the raw TF-IDF data with the SVD technique.
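The pipeline summarized above (TF-IDF representation, dimensionality reduction, then querying both the raw and the reduced matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the four report snippets, the smoothed IDF formula, the choice of k = 2, and the helper names `tfidf_matrix` and `cosine_rank` are all assumptions made for illustration, and only truncated SVD (the core of LSI) is shown.

```python
import numpy as np

# Hypothetical report snippets, NOT taken from the paper's dataset.
reports = [
    "dense breast tissue no mass",
    "spiculated mass left breast",
    "benign calcification right breast",
    "spiculated mass with calcification",
]

def tfidf_matrix(docs):
    """Build a documents-by-terms TF-IDF matrix (assumed weighting scheme)."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.split():
            tf[i, index[w]] += 1.0
    df = np.count_nonzero(tf, axis=0)        # document frequency per term
    idf = np.log(len(docs) / df) + 1.0       # assumed smoothed IDF variant
    return tf * idf, vocab, idf

def cosine_rank(matrix, query_vec):
    """Rank rows of `matrix` by cosine similarity to `query_vec`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    scores = matrix @ query_vec / np.where(norms == 0, 1.0, norms)
    return np.argsort(-scores), scores

X, vocab, idf = tfidf_matrix(reports)

# Encode the query in the same TF-IDF term space.
q = np.zeros(len(vocab))
for w in "spiculated mass".split():
    q[vocab.index(w)] = idf[vocab.index(w)]

# Truncated SVD: keep only the k largest singular triplets.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]     # documents in the k-dimensional space
q_reduced = q @ Vt[:k].T         # query projected onto the same space

raw_order, raw_scores = cosine_rank(X, q)
red_order, red_scores = cosine_rank(X_reduced, q_reduced)

# One possible size measure: entries stored after vs. before reduction.
kept_fraction = X_reduced.size / X.size

print("raw ranking:    ", raw_order.tolist())
print("reduced ranking:", red_order.tolist())
print(f"stored entries kept: {kept_fraction:.1%}")
```

Comparing `raw_order` with `red_order` mirrors, on a toy scale, the comparison between raw and reduced query results described in the abstract. The `kept_fraction` value is only one way to quantify the reduction and may differ from the measure behind the percentages reported above.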
