Uncertainty Reduction for Knowledge Discovery and Information Extraction on the World Wide Web

In this paper, we give an overview of knowledge discovery (KD) and information extraction (IE) techniques for the World Wide Web (WWW). We aim to answer the following questions: What additional uncertainty challenges does the WWW setting introduce to basic KD and IE techniques? What fundamental techniques can reduce this uncertainty and achieve reasonable KD and IE performance on the WWW? What is the impact of each novel method? How can these techniques and information networks interact so that each benefits from the other? How can the results be used in more interesting applications? What challenges remain, and how might they be addressed? We hope this overview provides a road map for advancing KD and IE on the WWW to a higher level of performance, portability, and utilization.
