Structure and Content Analysis for HTML Medical Articles: A Hidden Markov Model Approach

We describe ongoing research on segmenting and labeling HTML medical journal articles. In contrast to existing approaches in which HTML tags usually serve as strong indicators, we seek to minimize dependence on HTML tags. Designing logical component models for general Web pages is a challenging task. However, in the narrow domain of online journal articles, we show that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones.
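To make the segmentation idea concrete, the following is a minimal sketch of Viterbi decoding over a sequence of text blocks. It is not the authors' model: the zone names in `STATES`, the coarse block features (`short`, `names`, `long`), and every probability below are hypothetical values chosen by hand for illustration. In practice these parameters would be estimated from labeled articles.

```python
import math

# Hypothetical logical zones (hidden states); not taken from the paper.
STATES = ["title", "author", "abstract", "body"]

# Assumed, hand-picked probabilities for illustration only.
START = {"title": 0.97, "author": 0.01, "abstract": 0.01, "body": 0.01}
TRANS = {
    "title":    {"title": 0.10, "author": 0.80, "abstract": 0.05, "body": 0.05},
    "author":   {"title": 0.01, "author": 0.39, "abstract": 0.55, "body": 0.05},
    "abstract": {"title": 0.01, "author": 0.01, "abstract": 0.38, "body": 0.60},
    "body":     {"title": 0.01, "author": 0.01, "abstract": 0.01, "body": 0.97},
}
# Hypothetical text-based features of a block; no HTML tags are used.
EMIT = {
    "title":    {"short": 0.70, "names": 0.10, "long": 0.20},
    "author":   {"short": 0.20, "names": 0.75, "long": 0.05},
    "abstract": {"short": 0.05, "names": 0.05, "long": 0.90},
    "body":     {"short": 0.25, "names": 0.05, "long": 0.70},
}


def viterbi(observations):
    """Return the most likely zone label for each observed block feature."""
    if not observations:
        return []
    # Log-probabilities avoid numeric underflow on long documents.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
               for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi(["short", "names", "long", "long"]))
# -> ['title', 'author', 'abstract', 'body']
```

In this toy setting the decoder labels the blocks from text features and zone ordering alone, which mirrors the stated goal of reducing reliance on HTML tags, although the real feature set and zone inventory are not specified here.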
