Domain Information Acquiring with a Text Analysis System on Distributed Computing

Large volumes of unstructured documents are collected for data tasks and information studies in many fields, and analyzing these domain texts is a fundamental problem in domain information acquisition. In this paper, we construct a web-based platform for distributed text computing on Hadoop. The platform integrates the Spring Boot framework, Node.js and other web front-end techniques, distributed computing, and text analysis for domain text processing. Documents are pre-processed for text analysis, and Chinese word segmentation is implemented with the open-source tool IKAnalyzer. The system supports the full process of text gathering, filtering, distributed computing, analysis, and visualization of results. As a case study, documents from an educational investigation were uploaded to the system: thousands of school office documents from rural villages and towns in Guizhou can be processed directly to give a quick overview of the issue under study. Other document collections, such as EMC conference proceedings, can follow the same pipeline for knowledge discovery on this platform.
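
The segmentation and distributed-computing steps described above can be illustrated with a minimal sketch of term counting in a map/reduce style. This is not the system's implementation: it is plain Python rather than a Hadoop job, it assumes each document has already been segmented into tokens (standing in for IKAnalyzer output), and the names `map_terms`, `reduce_counts`, and `term_frequencies`, as well as the sample documents and stopword, are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_terms(tokens, stopwords):
    """Map step: emit (term, 1) for every token that survives filtering."""
    for tok in tokens:
        tok = tok.strip()
        if tok and tok not in stopwords:
            yield tok, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each term."""
    totals = defaultdict(int)
    for term, n in pairs:
        totals[term] += n
    return dict(totals)


def term_frequencies(segmented_docs, stopwords=frozenset()):
    """Run the map step over every document, then reduce all emitted pairs.

    segmented_docs maps a document id to its list of tokens; the tokens
    are assumed to come from an upstream word segmenter.
    """
    pairs = chain.from_iterable(
        map_terms(tokens, stopwords) for tokens in segmented_docs.values()
    )
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Hypothetical pre-segmented documents, for illustration only.
    docs = {
        "doc1": ["学校", "办公", "文件"],
        "doc2": ["学校", "教育", "的"],
    }
    print(term_frequencies(docs, stopwords={"的"}))
```

In a real Hadoop deployment the map and reduce functions would run on separate nodes and the framework would group pairs by term between the two steps; here the grouping is done in memory to keep the sketch self-contained.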