Content Extraction from News Pages Using Particle Swarm Optimization

Today’s Web pages commonly contain more than one cohesive block of information. For instance, on news pages from popular media outlets such as the Financial Times or the Washington Post, the textual news itself makes up no more than 30%-50% of the page; the rest consists of advertisements, link lists to related articles, disclaimer information, and so forth.
