Web page analysis based on HTML DOM and its usage for forum statistics and alerts

Message boards belong to the part of the Internet known as the 'Invisible Web' and pose many problems for traditional search engine spiders. Their dynamic content is usually deeply nested and difficult to search. In addition, many of these sites change their locations, servers, or URLs almost daily, which creates problems for the indexing process. Nevertheless, with the growth of the World Wide Web and the help of search engines, message boards have become an important source of information for solving many kinds of problems. Another interesting feature of this type of web page is that large communities have formed around them, expressing different opinions and discussing various topics. Using special retrieval and indexing algorithms, mostly based on the HTML DOM tree, we have developed an algorithm that obtains detailed and accurate trend statistics, which can be used in marketing solutions and analysis tools.
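As a rough illustration of DOM-based extraction feeding a trend statistic, the following minimal sketch walks the element structure of a forum page, collects the text of each post, and counts how many posts mention each tracked keyword. It is not the algorithm developed in this work. The markup convention (posts wrapped in `<div class="post">`), the names `PostExtractor` and `keyword_counts`, and the sample page are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> element.

    The class name "post" is a hypothetical convention; real forums
    use their own markup and would need their own selector.
    """

    def __init__(self):
        super().__init__()
        self.posts = []    # text of each completed post
        self._depth = 0    # <div> nesting depth inside the current post
        self._chunks = []  # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested <div> (for example a quote) inside a post.
            self._depth += 1
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def keyword_counts(html, keywords):
    """Return {keyword: number of posts that mention it}."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    counts = {}
    for post in parser.posts:
        words = set(re.findall(r"\w+", post.lower()))
        for keyword in keywords:
            if keyword.lower() in words:
                counts[keyword] = counts.get(keyword, 0) + 1
    return counts


# Hypothetical sample page: two posts and one unrelated sidebar block.
SAMPLE = """
<html><body>
<div class="post"><p>My new phone has a great camera.</p></div>
<div class="post"><div class="quote">Phone battery?</div><p>It lasts two days.</p></div>
<div class="sidebar">phone ads</div>
</body></html>
"""

counts = keyword_counts(SAMPLE, ["phone", "camera", "battery"])
print(counts)  # {'phone': 2, 'camera': 1, 'battery': 1}
```

Restricting the count to text inside post elements keeps sidebar and advertising text out of the statistic, which is the main reason to work on the DOM structure rather than on the raw page text. Repeating the count over pages crawled at different times would give a simple trend series per keyword.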
