Image analysis and metadata extraction for document search

This thesis addresses two problems in document search. The first is the analysis and use of images contained in documents for document retrieval. The second is metadata generation for scanned scientific documents in web-based archives.

Images carry important non-textual information in scientific documents, yet current digital libraries do not give users tools to retrieve documents based on the information available within images. This thesis proposes an integrated document retrieval schema that uses both text and image information. As the initial step toward integrated document search, images are categorized into a set of predefined types, which are defined by the functions images serve in scholarly articles. A machine-learning-based approach is proposed that categorizes images using both global features and part features extracted from image content. After categorization, algorithms analyze two common types of images in documents: 2-D plots and diagrams. An algorithm based on thin line analysis extracts numerical data from 2-D plot images, and an integrated algorithm performs symbol recognition in diagrams. The proposed approach has been evaluated on a test-bed document set collected from the CiteSeer scientific literature digital library and other sources. Experimental evaluation indicates that the algorithms produce results acceptable for real-world use.

Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The growing volume of digitized resources, including scanned books and scientific publications, requires tools and methods that efficiently analyze and manage large collections. This thesis tackles the problem of extracting metadata from scanned volumes of journals. The goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access to digital library users. Methods have been designed for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from scanned volumes. The automatic metadata generation software has been developed and integrated into an operational digital library, the Internet Archive, for real-world use.
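
The idea of recovering numerical data from a 2-D plot image, mentioned above, can be illustrated with a minimal sketch. It is not the thesis's thin line analysis algorithm, whose details are not given here. It assumes a binarized image containing a single curve, reduces each pixel column of the stroke to its mean row (a crude stand-in for thinning), and maps pixel coordinates to data coordinates by a linear axis calibration. The names `thin_columns` and `pixel_to_data` and the calibration tuples are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Hypothetical axis calibration: (pixel_a, value_a, pixel_b, value_b)
Axis = Tuple[float, float, float, float]


def thin_columns(mask: List[List[int]]) -> List[Point]:
    """Reduce a thick stroke to one (x, y) pixel point per column.

    Simplifying assumption: the image holds a single curve, so the mean
    row of the foreground pixels in a column approximates its center line.
    """
    height = len(mask)
    width = len(mask[0]) if mask else 0
    points = []
    for x in range(width):
        rows = [y for y in range(height) if mask[y][x]]
        if rows:  # skip columns with no foreground pixels
            points.append((float(x), sum(rows) / len(rows)))
    return points


def pixel_to_data(points: List[Point], x_axis: Axis, y_axis: Axis) -> List[Point]:
    """Map pixel coordinates to data coordinates by linear interpolation."""
    def scale(p: float, axis: Axis) -> float:
        p0, v0, p1, v1 = axis
        return v0 + (p - p0) * (v1 - v0) / (p1 - p0)

    return [(scale(x, x_axis), scale(y, y_axis)) for x, y in points]


# A 5x4 binary image of a two-pixel-thick rising line (row 0 is the top).
mask = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
pixels = thin_columns(mask)
# Assumed calibration: pixel column 0 -> x=0, column 3 -> x=3;
# pixel row 4 (bottom) -> y=0, row 0 (top) -> y=4.
data = pixel_to_data(pixels, (0, 0.0, 3, 3.0), (4, 0.0, 0, 4.0))
print(data)
```

A real plot would first need its axes, tick labels, and legend located and separated from the data curves, and overlapping curves would need proper skeletonization rather than a per-column mean.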


[40]  Emanuele Trucco,et al.  Introductory techniques for 3-D computer vision , 1998 .

[41]  Ashok Samal,et al.  A system for recognizing a large class of engineering drawings , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[42]  Anil K. Jain,et al.  On image classification: city images vs. landscapes , 1998, Pattern Recognit..

[43]  William I. Grosky,et al.  Narrowing the semantic gap - improved text-based web document retrieval using visual features , 2002, IEEE Trans. Multim..

[44]  Rachid Deriche,et al.  Using Canny's criteria to derive a recursively implemented optimal edge detector , 1987, International Journal of Computer Vision.

[45]  C. Lee Giles,et al.  Mining, indexing, and searching for textual chemical molecule information on the web , 2008, WWW.

[46]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[47]  James Ze Wang,et al.  Content-based image retrieval: approaches and trends of the new age , 2005, MIR '05.

[48]  Lawrence O'Gorman,et al.  K × K Thinning , 1990, Comput. Vis. Graph. Image Process..

[49]  John F. Canny,et al.  A Computational Approach to Edge Detection , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Patrick Haffner,et al.  Support vector machines for histogram-based image classification , 1999, IEEE Trans. Neural Networks.

[51]  Song Mao,et al.  A dynamic feature generation system for automated metadata extraction in preservation of digital materials , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[52]  Jihoon Yang,et al.  Knowledge-based metadata extraction from PostScript files , 2000, DL '00.

[53]  B. S. Manjunath,et al.  Texture Features for Browsing and Retrieval of Image Data , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[54]  Andreas Stolcke,et al.  Structural metadata research in the EARS program , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[55]  Catherine C. Marshall,et al.  Going digital: a look at assumptions underlying digital libraries , 1995, CACM.

[56]  Hideyuki Tamura,et al.  Textural Features Corresponding to Visual Perception , 1978, IEEE Transactions on Systems, Man, and Cybernetics.

[57]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[58]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[59]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[60]  Kun Bai,et al.  TableSeer: automatic table metadata extraction and searching in digital libraries , 2007, JCDL '07.

[61]  Azriel Rosenfeld,et al.  Document structure analysis algorithms: a literature survey , 2003, IS&T/SPIE Electronic Imaging.

[62]  Francesca Cesarini,et al.  Page Classification for Meta-data Extraction from Digital Collections , 2001, DEXA.

[63]  Edward A. Fox,et al.  Automatic document metadata extraction using support vector machines , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[64]  Edward Lank,et al.  Treatment of Diagrams in Document Image Analysis , 2000, Diagrams.

[65]  George Nagy,et al.  Twenty Years of Document Image Analysis in PAMI , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[66]  Qinghua Zheng,et al.  Automatic extraction of titles from general documents using machine learning , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[67]  Ching Y. Suen,et al.  Thinning Methodologies - A Comprehensive Survey , 1992, IEEE Trans. Pattern Anal. Mach. Intell..

[68]  Michael Unser,et al.  Texture classification and segmentation using wavelet frames , 1995, IEEE Trans. Image Process..

[69]  Anil K. Jain,et al.  Goal-Directed Evaluation of Binarization Methods , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[70]  Matti Pietikäinen,et al.  Adaptive document image binarization , 2000, Pattern Recognit..

[71]  Edward Y. Chang,et al.  Support vector machine active learning for image retrieval , 2001, MULTIMEDIA '01.

[72]  Prasenjit Mitra,et al.  Automatic Extraction of Data from 2-D Plots in Documents , 2007 .

[73]  Mohan S. Kankanhalli,et al.  Color matching for image retrieval , 1995, Pattern Recognit. Lett..

[74]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[75]  Jian Fan,et al.  Texture Classification by Wavelet Packet Signatures , 1993, MVA.

[76]  Kazuhiro Mori,et al.  An Automatic Circuit Diagram Reader with Loop-Structure-Based Symbol Recognition , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[77]  James Ze Wang,et al.  SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[78]  Robert P. Futrelle,et al.  Recognition and Classification of Figures in PDF Documents , 2005, GREC.

[79]  Lawrence O'Gorman PRIMITIVES CHAIN CODE , 1988 .

[80]  James Ze Wang,et al.  Automatic categorization of figures in scientific documents , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[81]  David S. Doermann,et al.  A parallel-line detection algorithm based on HMM decoding , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[82]  Robert M. Haralick,et al.  Textural Features for Image Classification , 1973, IEEE Trans. Syst. Man Cybern..