Analyzing the Perceptions of Change in a Distributed Collection of Web Documents

It is not unusual for documents on the Web to degrade and suffer from problems associated with unexpected change. In an analysis of the Association for Computing Machinery (ACM) conference list, we found that categorizing the degree of change affecting digital documents over time is a difficult task. More specifically, categorizing this degree of change is not a binary problem in which documents are either unchanged or have changed so dramatically that they no longer fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this paper, we present a case study that compares change detection methods based on machine learning algorithms against the assessments made by human subjects in a user study. The paper focuses on two research questions. First, how can we categorize the various degrees of change that documents endure? Second, how did our automatic detection methods fare against the human assessment of change in the ACM conference list?
