Unstructured Document Recognition on Business Invoice
CS 229: Machine Learning

This project describes a bag-of-words approach to business invoice recognition. For each field of interest, a bag of candidate features is generated to capture layout and textual properties, and the features are weighted to reveal the key factors that identify the field. We evaluate feature selection, threshold tuning, and model comparison. Overall, we achieved a training error of 8.81% and a test error of 13.99%.
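
To make the idea concrete, the following is a minimal sketch of how a bag of textual and layout features might be built for one candidate token, weighted, and thresholded. It is not the project's implementation. The feature names, the toy examples, the candidate thresholds, and the helper names (`bag_of_features`, `train`, `pick_threshold`) are all hypothetical, and a small logistic regression is used as a stand-in because the abstract does not name the model.

```python
import math
from collections import Counter

def bag_of_features(token, neighbors, x_frac, y_frac):
    """Hypothetical feature bag: nearby words plus coarse page-position buckets."""
    feats = Counter()
    for word in neighbors:
        feats["near=" + word.lower()] += 1           # textual context
    digits = token.replace(".", "", 1).replace(",", "")
    feats["is_numeric"] = int(digits.isdigit())      # textual property of the token
    feats["col=%d" % min(int(x_frac * 3), 2)] = 1    # layout: left/middle/right third
    feats["row=%d" % min(int(y_frac * 3), 2)] = 1    # layout: top/middle/bottom third
    return feats

def score(weights, feats):
    """Logistic score in (0, 1) from the weighted sum of active features."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Learn one weight per feature with plain stochastic gradient ascent."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - score(weights, feats)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + lr * err * value
    return weights

def pick_threshold(weights, examples, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return the candidate threshold with the lowest error on the given examples."""
    def error(t):
        wrong = sum((score(weights, f) >= t) != bool(y) for f, y in examples)
        return wrong / len(examples)
    return min(candidates, key=error)

# Invented toy data: label 1 means "this token is the invoice total".
raw = [
    ("1,250.00", ["Total", "Due"], 0.8, 0.90, 1),
    ("2014-03-01", ["Date"], 0.8, 0.10, 0),
    ("980.50", ["Amount", "Total"], 0.7, 0.85, 1),
    ("Acme", ["Bill", "To"], 0.1, 0.20, 0),
    ("15", ["Qty"], 0.3, 0.50, 0),
]
examples = [(bag_of_features(t, n, x, y), label) for t, n, x, y, label in raw]
weights = train(examples)
threshold = pick_threshold(weights, examples)

# The largest positive weights show which cues the toy model relies on.
top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(threshold, top)
```

In this sketch the learned weights play the role of the "key factors" for a field, and the threshold would normally be chosen on held-out data rather than on the training examples used here.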
