Clustering, Assessment and Validation: an application to gene expression data

In this work a multi-step approach for clustering assessment, visualization and data validation is introduced. Three main approaches for data clustering are used and compared: K-means, Self Organizing Maps and Probabilistic Principal Surfaces. A model explorer approach with different similarity measures is used to obtain the best parameters of the methods. The approach is used to identify genes periodically expressed in tumors related to the human cell cycle. Finally, clusters are validated by using GO Term information.

I. INTRODUCTION

In recent years Knowledge Discovery in Databases (KDD) has become of great importance for several fields of research. In fact, the explosive growth in the quantity, quality and accessibility of data currently experienced in all fields of science and human endeavour has triggered the search for a new generation of computational models and tools capable of assisting humans in the extraction of useful information (knowledge) from huge amounts of distributed and heterogeneous data. At the core of the process there is the application of specific data mining methods for pattern discovery and extraction: in genetics, for example, several data mining approaches have been proposed to analyze catalogues obtained from genome sequencing projects (14), (15), (20). However, due to the sheer size of the data sets involved and to the complexity of the problems to be tackled, novel approaches to data mining and understanding, relying on artificial intelligence tools, are necessary.

These tools can be divided into two main families: tools for supervised learning, which make use of prior knowledge to group samples into different classes, and unsupervised tools, which rely only on the statistical properties of the data themselves. Both approaches have been used for a variety of applications and both have advantages and disadvantages: the choice of a specific tool depends on the purpose of the investigation and the structure of the data. Among the various applications, we recall:

• Diagnostic: i.e., to find gene expression patterns specific to given classes (mainly dealt with by supervised methods