Automatic Data Extraction from Web Discussion Forums

This paper presents an approach for automatically extracting information from web discussion forums. HTML tag paths built from an HTML DOM tree are used to generate the post extraction template. Visual text features and HTML structure information from the same page are then combined to extract the author profile, posting date, and post content. Experimental results show that our approach is effective.
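
To make the notion of an HTML tag path concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a tag path is the sequence of tag names from the document root down to a text node, and that paths repeated many times on one page are plausible candidates for a post template. The names `TagPathCollector`, `tag_path_counts`, and `SAMPLE_PAGE` are hypothetical and chosen only for illustration; the sketch uses Python's standard `html.parser` module and ignores visual features entirely.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def tag_path_counts(html):
    """Count how many text nodes share each tag path (assumed heuristic)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return Counter(path for path, _ in collector.paths)


# Hypothetical forum page: one title and three posts in table rows.
SAMPLE_PAGE = """
<html><body>
<div><h1>Forum title</h1></div>
<table>
<tr><td><span>alice</span></td><td><p>First post</p></td></tr>
<tr><td><span>bob</span></td><td><p>Second post</p></td></tr>
<tr><td><span>carol</span></td><td><p>Third post</p></td></tr>
</table>
</body></html>
"""

counts = tag_path_counts(SAMPLE_PAGE)
for path, n in counts.items():
    print(n, path)
```

On the sample page, the paths ending in `span` and `p` each occur three times, while the title path occurs once. Under the stated assumption, such repeated paths would be the ones considered when generating a template; separating author, date, and content would still require the visual text features and structural cues described above.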