Uncertainty Reduction for Knowledge Discovery and Information Extraction on WWW

This paper gives an overview of Knowledge Discovery (KD) and Information Extraction (IE) techniques for the World Wide Web (WWW). We aim to answer the following questions: What additional uncertainty challenges does the WWW setting introduce for basic KD and IE techniques? What fundamental techniques can reduce this uncertainty and achieve reasonable KD and IE performance on the WWW? What is the impact of each novel method? How can these techniques and information networks interact so that each benefits from the other? How can the results be used in more interesting applications? What challenges remain, and how might they be addressed? We hope this overview provides a road map for advancing KD and IE on the WWW to a higher level of performance, portability and utilization.
