Personal Information Extraction of the Teaching Staff Based on CRFs

Because the attribute information in a profile web page is usually expressed in natural language, it is very difficult to extract the target information from the HTML structure alone. In this paper, Conditional Random Fields (CRFs) are adopted to extract personal attribute information from personal detail pages on the web. A word segmentation system first divides the HTML document into a sequence of words. An appropriate feature template is then established and the model is trained on the sample sequences. Finally, the feature function model generated by the CRFs is used to label the test sequences and identify the information that needs to be extracted. The experimental results show that the labeling and inference capabilities of CRFs on text sequences can be used to extract specific attribute information from personal home pages effectively.
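
The pipeline described above can be illustrated with a minimal sketch, which is not the implementation used in this paper. It assumes a context-window feature template and a BIO labeling scheme, neither of which is specified here, and the function names, the labels NAME and TITLE, and the toy tokens are all hypothetical. It shows only how a segmented word sequence might be turned into per-token features for a CRF trainer, and how a labeled sequence might be turned back into extracted attributes. The CRF training and inference step itself is left out.

```python
from typing import Dict, List, Tuple


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Hypothetical feature template: the token plus its neighbours in a +/- window."""
    feats: Dict[str, object] = {
        "bias": 1.0,
        "w[0]": tokens[i],
        "is_digit": tokens[i].isdigit(),
    }
    for off in range(1, window + 1):
        # Positions outside the sequence are padded with boundary markers.
        feats[f"w[-{off}]"] = tokens[i - off] if i - off >= 0 else "<BOS>"
        feats[f"w[+{off}]"] = tokens[i + off] if i + off < len(tokens) else "<EOS>"
    return feats


def sequence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Apply the template to every position of one segmented sequence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def extract_attributes(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Collect (attribute, text) pairs from BIO labels (assumed labeling scheme)."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            current_tokens.append(tok)
        else:
            # "O" or an inconsistent "I-" label ends the open span.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Toy data (invented for illustration): one segmented line of a profile page
# and the labels a trained CRF might assign to it.
tokens = ["Name", ":", "Li", "Ming", "Title", ":", "Professor"]
labels = ["O", "O", "B-NAME", "I-NAME", "O", "O", "B-TITLE"]

features = sequence_features(tokens)
print(extract_attributes(tokens, labels))
```

In a complete system, the feature dictionaries would be passed to a CRF toolkit for training, and the labels would come from the trained model rather than being written by hand. For text segmented without spaces between words, the span tokens would be joined without a separator.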