IJCAI 2016 Proceedings of the Workshop on Scholarly Big Data: AI Perspectives, Challenges, and Ideas

In this talk, we will introduce the new release of a Web-scale entity graph, which serves as the backbone of Microsoft Academic Service. We will present the architecture of the data pipeline that produces the Microsoft Academic Graph, and explore the challenges and opportunities it raises across various research topics and engineering efforts. In addition, Microsoft Research has opened up this graph dataset to the research community, with new APIs to support further research, experimentation, and development. The talk will highlight how the research community can use these data and APIs to fuel new research opportunities.
