LanguageTool proofreading rules evolution and update
This paper crawls the revision history of Baidu Encyclopedia entries in order to evolve the Chinese processing rules of the LanguageTool semantic proofreading tool. The entries are crawled with the HTML parser Jsoup, and the crawled content is then compared and analyzed with the Java page analysis tool HtmlUnit and the JSON data processor Fastjson. The DOM API is used to develop a program that generates XML rules automatically and attains 81% accuracy in testing. The work automates the evolution of Chinese semantic proofreading rules and adds a knowledge-correction module on top of the existing LanguageTool rules. It also builds a richer corpus, which improves the accuracy of LanguageTool when proofreading Chinese text. The rate at which processed text is converted into rules automatically is not high, because the way Baidu Encyclopedia entries are revised has shortcomings that still need to be addressed, such as a lack of transparency and delayed updates. Extracting Chinese semantic proofreading rules automatically from Baidu Encyclopedia entries is a novel approach, and the test results indicate a high degree of practicality and reliability. However, the stability of the technique needs to be improved when the volume of crawled data is very large.
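The rule-generation step can be illustrated with a minimal sketch, assuming each pair of entry revisions has already been reduced to a (wrong, corrected) text pair. The abstract does not show the authors' generator, so the class `RuleSketch`, its method names, the rule id, and the sample strings are all hypothetical. The element names (`rule`, `pattern`, `token`, `message`, `suggestion`) follow the general shape of LanguageTool's XML rule files, and only the DOM API that ships with the JDK is used.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch, not the paper's implementation.
class RuleSketch {

    // Builds a DOM document holding one <rule> for a (wrong, corrected) pair.
    static Document buildRule(String id, String name, String wrong, String corrected)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element rule = doc.createElement("rule");
        rule.setAttribute("id", id);
        rule.setAttribute("name", name);
        doc.appendChild(rule);

        // The pattern matches the text that the later revision replaced.
        // Assumption: the wrong text is a single token.
        Element pattern = doc.createElement("pattern");
        Element token = doc.createElement("token");
        token.setTextContent(wrong);
        pattern.appendChild(token);
        rule.appendChild(pattern);

        // The message carries the text from the later revision as a suggestion.
        Element message = doc.createElement("message");
        message.appendChild(doc.createTextNode("Did you mean "));
        Element suggestion = doc.createElement("suggestion");
        suggestion.setTextContent(corrected);
        message.appendChild(suggestion);
        message.appendChild(doc.createTextNode("?"));
        rule.appendChild(message);

        return doc;
    }

    // Serializes the document to a string without the XML declaration.
    static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder strings stand in for text taken from two entry revisions.
        Document doc = buildRule("BAIKE_0001", "Baike revision", "wrong", "right");
        System.out.println(toXml(doc));
    }
}
```

A real pipeline would also need word segmentation to split the wrong text into tokens, and some filtering to separate genuine corrections from content changes; neither step is attempted here.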