Comparative study of classification techniques on biomedical data from hypertext documents

In this paper, our goal is to mine biomedical data from hypertext documents, that is, from web content, using data mining algorithms with the help of a biomedical ontology. We collect a number of documents using Google, preprocess the hypertext documents, and extract the text data. The next task is the identification of biomedical data. To decide whether a word is a biomedical entity, we use a biomedical database, the UMLS Metathesaurus; entities are mapped from the Metathesaurus by keyword query. The more often biomedical entities occur in a page, the more relevant we consider the page, so we can re-rank the documents to find the most important ones. We then test and analyse the performance of seven of the most popular classification algorithms by training them separately on the documents as ranked by Google and as ranked by our algorithm.
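The re-ranking idea can be illustrated with a minimal sketch. It is not the authors' implementation: the small term set `TOY_BIOMEDICAL_TERMS` is a hypothetical stand-in for a keyword query against the UMLS Metathesaurus, and the function names `tokenize`, `biomedical_count`, and `rerank` are assumptions introduced for illustration only. The sketch scores each page by how many biomedical terms occur in its text and sorts pages by that score.

```python
import re
from collections import Counter

# Hypothetical stand-in for a UMLS Metathesaurus keyword lookup.
# A real system would query the Metathesaurus instead of a fixed set.
TOY_BIOMEDICAL_TERMS = {"insulin", "diabetes", "glucose", "hypertension"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def biomedical_count(text, vocabulary=TOY_BIOMEDICAL_TERMS):
    """Count how many tokens in the text are biomedical terms."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in vocabulary)


def rerank(documents, vocabulary=TOY_BIOMEDICAL_TERMS):
    """Order documents by descending biomedical term count.

    Python's sort is stable, so documents with equal counts keep
    their original (for example, search-engine) order.
    """
    return sorted(
        documents,
        key=lambda doc: biomedical_count(doc, vocabulary),
        reverse=True,
    )


if __name__ == "__main__":
    pages = [
        "Cheap flights and hotels",
        "Insulin regulates glucose; insulin therapy treats diabetes.",
        "Glucose levels",
    ]
    for page in rerank(pages):
        print(biomedical_count(page), page)
```

Counting single-word matches is the simplest possible scoring; multi-word entities and synonym mapping, which a Metathesaurus lookup would provide, are outside the scope of this sketch.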
