Automated histologic grading from free-text pathology reports using graph-of-words features and machine learning

Traditional n-gram feature representations of free-text documents often fail to capture word ordering and semantics, which compromises text comprehension. Graph-of-words, a newer text representation approach based on graph analytics, addresses these limitations by modeling word co-occurrence. In this study, we present a novel application of the graph-of-words text representation to the automated extraction of histologic grade from unstructured pathology reports. In 10-fold cross-validation tests, undirected graph-of-words features yielded substantially higher macro- and micro-F1 scores than traditional bi-gram text features. Our feasibility study demonstrates that graph-of-words is a highly effective text representation for information extraction from free-text clinical documents.
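
To make the contrast between the two feature types concrete, the following is a minimal sketch rather than the pipeline used in this study. It assumes whitespace tokenization, a sliding window of three tokens, and a made-up example sentence; the function names `bigram_features` and `graph_of_words_features` are hypothetical. Bi-grams count only adjacent, ordered word pairs, whereas an undirected graph-of-words counts an edge for any two distinct words that co-occur within the window, regardless of order.

```python
from collections import Counter


def bigram_features(tokens):
    """Count adjacent, order-sensitive word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def graph_of_words_features(tokens, window=3):
    """Count undirected co-occurrence edges within a sliding window.

    Each token is linked to the (window - 1) tokens that follow it.
    Edges are stored as frozensets, so (a, b) and (b, a) are the same edge.
    The window size is an illustrative assumption, not a value from the study.
    """
    edges = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:  # skip self-loops
                edges[frozenset((word, other))] += 1
    return edges


# Hypothetical report fragment, tokenized on whitespace.
report = "tumor is poorly differentiated grade 3".split()

bigrams = bigram_features(report)
edges = graph_of_words_features(report, window=3)

# "poorly" and "grade" are two positions apart: the graph links them,
# while the bi-gram features do not.
print(frozenset(("poorly", "grade")) in edges)
print(("poorly", "grade") in bigrams)
```

In this sketch the edge counts could be used directly as a sparse feature vector for a classifier; whether the study weights or filters edges differently is not stated in the abstract.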