Re-flowable Document Structure Understanding by Comprehensive Use of Features and Rules

Previous approaches to re-flowable document structure understanding do not fully utilize component features during component identification. To address this shortcoming, this paper proposes a new method that understands documents by combining features with grammatical rules. The method uses two vectors: a format vector representing format features such as fonts, and a content vector representing text features such as keywords. The components to be identified are then compared with the candidates by measuring the distance between these vectors using different weights. The experimental results show that this method can effectively improve the accuracy of component identification and, in turn, the accuracy of whole-document understanding.
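
The comparison step can be illustrated with a minimal sketch. This is not the paper's implementation: the Euclidean metric, the weight values, the two-dimensional feature encodings, and the names `euclidean`, `weighted_distance`, and `identify` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_distance(component, candidate, w_format=0.6, w_content=0.4):
    """Combine format-vector and content-vector distances.

    The weights are hypothetical placeholders, not values from the paper.
    """
    d_format = euclidean(component["format"], candidate["format"])
    d_content = euclidean(component["content"], candidate["content"])
    return w_format * d_format + w_content * d_content


def identify(component, candidates, w_format=0.6, w_content=0.4):
    """Return the label of the candidate closest to the component."""
    return min(
        candidates,
        key=lambda label: weighted_distance(
            component, candidates[label], w_format, w_content
        ),
    )


# Hypothetical encodings: format = [font size rank, boldness],
# content = [title-keyword score, body-keyword score].
candidates = {
    "title": {"format": [1.0, 1.0], "content": [1.0, 0.0]},
    "body": {"format": [0.0, 0.0], "content": [0.0, 1.0]},
}
unknown = {"format": [0.9, 1.0], "content": [0.8, 0.1]}

label = identify(unknown, candidates)
print(label)
```

In this sketch the grammatical rules mentioned in the abstract are not modeled; they would presumably constrain or re-rank the labels that the distance comparison produces.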