Paper information - Data extraction and cleansing of semi-structured Chinese texts

The rapid growth of data mining generates an ever-increasing demand for automatic information extraction from Chinese texts. However, existing approaches in this domain focus on well-structured Chinese texts and therefore have difficulties in dealing with semi-structured Chinese texts which do not conform to strict syntactic structures. We propose in this paper an approach to semi-automatic data extraction and cleansing for these texts. Preliminary experimental results show that, with modest manual intervention, it can effectively extract information from raw semi-structured Chinese texts collected from e-business applications.

Shun Long | Wei-Heng Zhu

[1] W. H. Inmon, et al. Building the data warehouse, 1992.

[2] Craig A. Knoblock, et al. Wrapper generation for semi-structured Internet sources, 1997, SGMD.

[3] Robert L. Grossman, et al. Mining data records in Web pages, 2003, KDD '03.

[4] 杨建武, et al. A Semi-Structured Document Model for Text Mining, 2002.

[5] Chen Xiaoou, et al. A semi-structured document model for text mining, 2002.

[6] Khaled Shaalan, et al. A Survey of Web Information Extraction Systems, 2006, IEEE Transactions on Knowledge and Data Engineering.

[7] Nicholas Kushmerick, et al. Wrapper induction: Efficiency and expressiveness, 2000, Artif. Intell.