Title Extraction from Loosely Structured Data Records

In this paper, we present a novel method for extracting titles from loosely structured data records (LSDRs). We first automatically identify the format of the titles and then extract them accordingly. For a Web page whose title occurs in all the data records, we take the candidate title with the greatest length of "same content" as the accurate title. For a Web page whose title occurs before the first data record, we take the candidate title with the greatest length of "different content" as the accurate title. Our experiments on two databases collected from the Internet demonstrate that the automatic algorithm is robust and effective.
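The selection rule can be illustrated with a minimal Python sketch. The abstract does not define how "same content" and "different content" are measured, so the sketch assumes a token-level reading: a candidate's "same content" is the total character length of its tokens that appear in every data record, and its "different content" is the total length of the remaining tokens. The function names `content_lengths` and `pick_title`, the whitespace tokenization, and the sample strings are all hypothetical and are not taken from the paper.

```python
def content_lengths(candidate, records):
    """Return (same_len, diff_len) for one candidate title.

    Assumption: "same content" = characters in tokens that occur in
    every record; "different content" = characters in all other tokens.
    """
    record_tokens = [set(r.lower().split()) for r in records]
    same_len = diff_len = 0
    for token in candidate.lower().split():
        if all(token in tokens for tokens in record_tokens):
            same_len += len(token)
        else:
            diff_len += len(token)
    return same_len, diff_len


def pick_title(candidates, records, title_in_records):
    """Pick the candidate with the largest same- or different-content length.

    title_in_records=True  -> title format "occurs in all data records"
    title_in_records=False -> title format "occurs before the first record"
    """
    index = 0 if title_in_records else 1
    return max(candidates, key=lambda c: content_lengths(c, records)[index])


# Hypothetical data records sharing the text "Acme Forum".
records = [
    "Acme Forum : post one by alice",
    "Acme Forum : post two by bob",
]

in_record_title = pick_title(["Acme Forum", "post one"], records, True)
before_record_title = pick_title(
    ["Latest camera reviews", "post by"], records, False
)
```

Under these assumptions, the first call favors the candidate whose text is repeated across records, and the second favors the candidate whose text is not repeated in them. A real implementation would also need the format-identification step, which this sketch takes as a given flag.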
