Textual Analysis of Good Practice Requirements of EuroRec Repository Statements
Objective and design. The European Institute for Health Records (EuroRec) has compiled a generic and comprehensive repository of statements that describe the use of functions, structuring, and data elements of Electronic Health Record (EHR) systems. The EHR-Q-TN project makes the repository and its tools accessible to partners from European countries in order to establish seamless cross-border, multilingual description and validation or certification of EHR systems and use functions. The repository statements are grouped in two interlinked sets: Fine Grained Statements (FGS) and Good Practice Requirements (GPR). In the process of translating repository statements into national languages, two threats to the consistency and coherence of translations were identified. First, some English words or phrases do not exist in other languages, and the substitutes chosen in different national languages might alter the meaning of the original statement. Second, translations within the same language may be inconsistent. The aim of this study was to support the creation of a coherent multilingual dictionary by identifying the most frequent words and segments within GPR repository statements using statistical textual analysis. Methods. The text corpus comprised 178 GPR statements. We performed lexicometric analysis, analysis of repeated segments, and text concordance analysis. For the purpose of the analysis we excluded articles (a, an, the) from the text corpus and joined the auxiliary verb and “not” in negative verb forms. No other changes were made to the text. The French software Dtm-Vic (Data and Text Mining – Visualization, Inference, Classification) was used for the analyses. Results. The analyzed text corpus of 178 GPR statements contained 4990 words in total, of which 1053 (21.1%) were distinct.
Among the 20 most frequent words (frequency over 40) there were 13 (65%) meaningful words such as “system”, “enables”, “user”, “medicinal”, “medication”, “product”, “data”, “health”, “prescription”, “well”, “be”, “item”, “patient”. The word “system” was the most frequent, with 209 occurrences in the text corpus. In the analysis of repeated segments we limited segments to a maximum length of 3 words because we expected that segments of 2 and 3 words would give meaningful units suitable for translation. “System enables” and “medicinal product” were the two most frequent meaningful two-word segments, followed by “health item” and “enables user”. The ten most frequent two-word segments (5.4% of the total number of extracted segments) accounted for 550 (25.4%) of the 2166 words extracted in 186 segments. Text concordance analysis extracted multipart segments grouped around particular words, forming long segments suitable for direct translation, such as “system enables user to”. Conclusions. Statistical textual analysis might be a useful tool for bridging a gap in the multilingual environment of unified EHR system quality assessment. By combining lexicometric analysis, analysis of repeated segments, and text concordance analysis, it is possible to easily identify the words or segments that carry the greatest weight in the text corpus of repository statements. Using these words and segments as the basis of translation, and including them in a multilingual dictionary, would enable consistent and coherent translation of the majority of repository statements.
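The three analyses named in the Methods can be illustrated with a minimal Python sketch. It is not the Dtm-Vic implementation used in the study; it only mimics the described preprocessing (dropping articles, joining “not” to the preceding word) and the three kinds of counts. The function names, the `does_not` joining convention, and the three sample sentences are assumptions made for illustration; the sample sentences are invented and are not actual EuroRec statements. As a simplification, “not” is joined to whatever word precedes it, without checking that the word is an auxiliary verb.

```python
import re
from collections import Counter

ARTICLES = {"a", "an", "the"}

# Hypothetical sample sentences, NOT actual EuroRec GPR statements.
SAMPLE = [
    "The system enables the user to record a medicinal product.",
    "The system enables the user to link a health item to a patient.",
    "The system does not allow deletion of a health item.",
]


def tokenize(statement):
    """Lowercase, drop articles, and join 'not' to the preceding word."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", statement.lower())
    tokens = []
    for word in words:
        if word in ARTICLES:
            continue
        if word == "not" and tokens:
            # Simplification: assumes the preceding word is the auxiliary verb.
            tokens[-1] += "_not"
        else:
            tokens.append(word)
    return tokens


def word_frequencies(corpus):
    """Lexicometric step: count occurrences of each word form."""
    counts = Counter()
    for statement in corpus:
        counts.update(tokenize(statement))
    return counts


def repeated_segments(corpus, min_len=2, max_len=3, min_count=2):
    """Count word sequences of min_len..max_len that occur at least min_count times.

    Segments never span two statements.
    """
    counts = Counter()
    for statement in corpus:
        tokens = tokenize(statement)
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return Counter({seg: c for seg, c in counts.items() if c >= min_count})


def concordance(corpus, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for statement in corpus:
        tokens = tokenize(statement)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    freqs = word_frequencies(SAMPLE)
    total, distinct = sum(freqs.values()), len(freqs)
    print(f"{total} words, {distinct} distinct ({100 * distinct / total:.1f}%)")
    print(freqs.most_common(5))
    print(repeated_segments(SAMPLE).most_common(5))
    for left, key, right in concordance(SAMPLE, "enables"):
        print(f"{left:>20} [{key}] {right}")
```

Applied to a real corpus, the most frequent words and repeated segments from such counts would be the candidate entries for a multilingual dictionary, and the concordance lines show the longer contexts in which each candidate occurs.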