We describe an experiment in transforming large collections of LaTeX documents into more machine-understandable representations. Concretely, we are translating the scientific publications of the Cornell e-Print Archive (arXiv) using LaTeXML, a LaTeX-to-XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing, and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the thousands of LaTeX classes and packages used in the arXiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over all or part of the collection while collecting statistics about missing bindings and other errors. These statistics guide debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and a software test harness. We have now processed the complete arXiv collection through 2006, consisting of more than 400,000 documents (a complete run takes on the order of a processor-year), and our success rate has improved continuously. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no warnings or only minor ones. The remaining 30% can also be converted, but their quality is doubtful because of unsupported macros or conversion errors.
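As a rough illustration of how per-document outcomes from such a build run could be aggregated to prioritize binding work, here is a minimal sketch. The `ConversionResult` record, its status labels, and the `summarize` function are assumptions made for this example; they are not the arXMLiv build system's actual data model or LaTeXML's log format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConversionResult:
    """Hypothetical per-document record produced by one conversion job."""
    doc_id: str
    status: str  # assumed labels: "success", "doubtful", "failed"
    missing_bindings: list = field(default_factory=list)


def summarize(results):
    """Return (share of documents per status, missing bindings ranked by document count)."""
    total = len(results)
    status_counts = Counter(r.status for r in results)
    rates = {s: c / total for s, c in status_counts.items()} if total else {}

    # Count each missing package or class once per document, so the ranking
    # reflects how many documents a new binding would unblock.
    missing = Counter()
    for r in results:
        missing.update(set(r.missing_bindings))
    return rates, missing.most_common()


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ConversionResult("doc-1", "success"),
        ConversionResult("doc-2", "doubtful", ["foo.sty"]),
        ConversionResult("doc-3", "failed", ["foo.sty", "bar.cls"]),
        ConversionResult("doc-4", "success"),
    ]
    rates, ranking = summarize(sample)
    print(rates)
    print(ranking)
```

Ranking missing bindings by the number of affected documents is one plausible way to decide which class or package to support next; the weighting the project actually uses is not stated here.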