Strategies for Extracting Data from HTML and XML Content

In this chapter, we compare different approaches to parsing XML and HTML documents and extracting data from them into R. We illustrate these approaches with comprehensive, real-world examples that demonstrate XPath and R functions for processing XML documents. We also introduce event-driven parsing, in which a collection of R functions responds to events in the XML parser. These functions work for both tree-based (DOM) parsing and SAX parsing, where we avoid building the tree. By the end of the chapter, the reader should have a good understanding of the strategies available in R for parsing XML documents and extracting content.
