Text extraction on Windows®-based documents
Syntel LLC is the developer of a mail presorting application called AutoMail®, which needs to alter bank statements as they are being printed. For this and other applications, it is sometimes impossible to exert any control over the document creation software, yet changes to the printed documents must nevertheless be made. The purpose of this project is to retrieve data that has been sent to the Microsoft Windows® printing subsystem, parse the data, modify sections of text contained within each document, and continue the print process, leaving the document untouched except for the altered sections of text. This is done by processing enhanced metafile (EMF) documents and generating XML documents formatted to be easily read by the software modules responsible for actually altering the text data. At some stage of the print process on Microsoft Windows operating systems, each page exists as an EMF document. Each EMF document consists of a sequence of records describing drawing operations. Those drawing operations that pertain to text output in the relevant spatial regions of the document are converted to plain text. This text, along with certain formatting and positioning information, is written to the XML file. All other drawing operations are included in the XML file as "black box" entities, so that the document can be repackaged after processing. Repackaging is accomplished by creating new text drawing operations, reinserting the other drawing operations, and using the Windows® API to print the resulting EMF document.
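The split between text records and "black box" records can be illustrated with a minimal Python sketch. It is not the project's implementation: the XML element names (`page`, `text`, `opaque`) and the helper `make_text_record` are hypothetical, region filtering is omitted, and the byte offsets assume the `EMR_EXTTEXTOUTW` record layout as described in the published EMF format specification, which should be checked against that specification before any real use.

```python
import base64
import struct
import xml.etree.ElementTree as ET

EMR_EXTTEXTOUTW = 84  # EMF record type for Unicode text output


def iter_records(data):
    """Yield (record_type, record_bytes) for each record in an EMF byte stream.

    Every EMF record starts with two little-endian 32-bit fields:
    the record type and the total record size in bytes.
    """
    pos = 0
    while pos + 8 <= len(data):
        rec_type, size = struct.unpack_from("<II", data, pos)
        if size < 8 or pos + size > len(data):
            raise ValueError(f"corrupt record at offset {pos}")
        yield rec_type, data[pos:pos + size]
        pos += size


def decode_text_record(rec):
    """Return (x, y, text) from an EMR_EXTTEXTOUTW record.

    Assumed layout: reference point at byte 36, character count at 44,
    string offset (relative to the record start) at 48.
    """
    x, y, n_chars, off_string = struct.unpack_from("<iiII", rec, 36)
    text = rec[off_string:off_string + 2 * n_chars].decode("utf-16-le")
    return x, y, text


def emf_to_xml(data):
    """Convert EMF records to a hypothetical XML form.

    Text records become <text> elements with position attributes;
    all other records are kept verbatim as base64 in <opaque> elements
    so the stream can be reassembled later.
    """
    root = ET.Element("page")
    for rec_type, rec in iter_records(data):
        if rec_type == EMR_EXTTEXTOUTW:
            x, y, text = decode_text_record(rec)
            node = ET.SubElement(root, "text", x=str(x), y=str(y))
            node.text = text
        else:
            node = ET.SubElement(root, "opaque", type=str(rec_type))
            node.text = base64.b64encode(rec).decode("ascii")
    return root


def make_text_record(x, y, text):
    """Build a synthetic EMR_EXTTEXTOUTW record (demo helper, no dx array)."""
    raw = text.encode("utf-16-le")
    n_chars = len(raw) // 2
    payload = raw + b"\x00" * (-len(raw) % 4)  # records are 4-byte aligned
    off_string = 76  # fixed-size fields occupy the first 76 bytes
    header = struct.pack("<II", EMR_EXTTEXTOUTW, off_string + len(payload))
    bounds = struct.pack("<4i", 0, 0, 0, 0)
    mode_scale = struct.pack("<Iff", 1, 1.0, 1.0)
    emrtext = struct.pack("<iiIII4iI", x, y, n_chars, off_string, 0, 0, 0, 0, 0, 0)
    return header + bounds + mode_scale + emrtext + payload


if __name__ == "__main__":
    # One text record followed by a bare 8-byte record of another type.
    stream = make_text_record(100, 200, "Balance") + struct.pack("<II", 14, 8)
    print(ET.tostring(emf_to_xml(stream), encoding="unicode"))
```

Because the opaque records are stored byte for byte, repackaging in this sketch would amount to base64-decoding them and interleaving them with newly built text records in the original order. Printing the result through the Windows API is outside the scope of the example.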