Information Geometry, the Embedding Principle, and Document Classification

High-dimensional structured data such as text and images is often poorly understood and misrepresented in statistical modeling. Typical approaches to modeling such data rely, explicitly or implicitly, on arbitrary geometric assumptions. In this paper, we review a framework introduced by Lebanon and Lafferty that is based on Čencov's theorem for obtaining a coherent geometry for data. The framework enables the adaptation of popular models to the new geometry and, in the context of text classification, yields superior performance as measured by classification error rate on held-out data. The framework demonstrates how information geometry may be applied to modeling high-dimensional structured data and points to new directions for future research.