Automatic recognition of handwritten medical forms for search engines

We present a new paradigm that models the relationship between handwriting and topic categories in the context of medical forms. The ultimate goals are (1) a robust method for categorizing medical forms into specified categories, and (2) the use of that information in practical applications, such as improved recognition of medical handwriting or retrieval of medical forms in a search engine. Medical forms have diverse, complex and large lexicons drawn from English, medical and pharmacology corpora. Our technique shows that a few recognized characters, returned by a handwriting recognizer, can be used to construct a linguistic model capable of representing a medical topic category. This allows (1) a reduced lexicon to be constructed, thereby improving handwriting recognition performance, and (2) Pre-Hospital Care Report (PCR) forms to be tagged with a topic category and subsequently searched by information retrieval systems. We report an improvement of over 7% in raw recognition rate and a mean average precision of 0.28 over a set of 1,175 queries on a data set of unconstrained handwritten medical forms filled out in emergency environments.
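For readers less familiar with the retrieval figure quoted above, the following is a minimal sketch of how mean average precision is conventionally computed over a set of queries. It illustrates the standard definition only and is not the evaluation code used in this work; the function names and form identifiers are hypothetical, and each ranked list is assumed to contain no duplicate IDs.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    ranked   -- form IDs in the order the retrieval system returned them
                (assumed to contain no duplicates)
    relevant -- form IDs judged relevant for this query
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, form_id in enumerate(ranked, start=1):
        if form_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant form.
            precision_sum += hits / rank
    # Relevant forms that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision over all queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example with two queries over made-up form IDs.
example_runs = [
    (["f1", "f2", "f3", "f4"], {"f1", "f3"}),
    (["f7", "f8"], {"f8"}),
]
example_map = mean_average_precision(example_runs)
```

Under this definition, a score of 0.28 averaged over 1,175 queries means that relevant forms tend to appear in the returned ranking but are usually not concentrated at the very top.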
