Automated Information Extraction from Web APIs Documentation

A fundamental characteristic of Web APIs is that, in practice, providers rarely follow standard practices when implementing, publishing, and documenting their APIs. As a consequence, the discovery and use of these services by third parties is significantly hampered. To achieve further automation in exploiting Web APIs, we present an approach for automatically extracting relevant technical information from the Web pages that document them. In particular, we have devised two algorithms that automatically extract technical details such as operation names, operation descriptions, or URI templates from the documentation of Web APIs adopting either RPC or RESTful interfaces. The algorithms, which exploit advanced DOM processing as well as state-of-the-art Information Extraction and Natural Language Processing techniques, have been evaluated against a detailed dataset, exhibiting high precision and recall (around 90% for both REST and RPC APIs) and outperforming state-of-the-art information extraction algorithms.
