Clustering online handwritten mathematical expressions

Abstract To help human markers mark large numbers of answers written as online handwritten mathematical expressions (OHMEs), this paper proposes a bag-of-features approach for clustering OHMEs. It consists of six levels of features, from low-level pattern features to high-level symbolic and structural features obtained from a state-of-the-art OHME recognizer. It then introduces a distance-based representation (DbR) to reduce the dimensionality of the proposed feature spaces, and presents a method for combining the proposed features to improve performance. Experiments using the k-means++ algorithm are conducted on a set of 3,150 OHMEs (Dset_50) and on an answer dataset (Dset_Mix) containing, for each of 10 questions, 200 OHMEs that mix real and synthesized patterns. When the number of clusters is set to the true number of categories, bag-of-symbols with DbR produces the best purity, around 0.99, on Dset_50, which is better than state-of-the-art methods for clustering offline patterns converted from their OHMEs. On Dset_Mix, the combination of low-level and high-level features with DbR achieves a purity of around 0.777; by adjusting the number of clusters, purity increases to more than 0.90 and the marking cost is reduced by more than 0.35 point compared with manually marking the OHME answers.
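The results above are reported as purity. As a minimal sketch, assuming the standard definition of clustering purity (each cluster is credited with its most frequent true category, and the credited counts are divided by the total number of samples), the metric could be computed as follows. The function name `purity` and the toy data are illustrative and are not taken from the paper.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority category of their cluster.

    Assumes the standard purity definition; the paper may apply it with
    details (e.g. averaging over questions) that are not shown here.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one sample is required")

    # Count how often each true category occurs inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its largest category.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


# Toy example (hypothetical data): 8 answers grouped into 3 clusters.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "c", "c", "c"]
score = purity(clusters, labels)
print(score)
```

Under this definition, purity can only stay the same or rise as clusters are split further, since singleton clusters are always pure. This is one reason a purity figure is best read together with the number of clusters used.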
