Facilitating Knowledge Discovery by Mining the Content and Link Structure of the Web

Given the vast amount of online information covering almost all aspects of human endeavor, the Internet, and especially the Web, is a fertile ground for data mining research aimed at extracting valuable knowledge. Web mining is the application of data mining techniques to extract knowledge from Web data, including Web documents, Web hyperlink structure, and Web usage logs. Traditional Web mining research has focused mainly on the information overload problem: many information retrieval (IR) and artificial intelligence (AI) techniques have been adopted or developed to identify relevant information on the Web that meets users’ specific information needs. However, most existing studies do not fully explore the social and behavioral aspects of the Web. The primary goal of this dissertation is therefore to develop an integrated research framework that extends traditional Web mining methodologies to cover the technical, social, and behavioral aspects of Web knowledge discovery. My dissertation framework is composed of a technical component and a social/behavioral component. In the technical component, a set of domain-specific Web collection building, Web content and link structure mining, and Web knowledge presentation techniques was developed. These techniques were tested in a series of case
