Google Book Search is working with libraries and publishers around the world to digitally scan books. Some of those works are now in the public domain and, in keeping with Google's mission to make all the world's information useful and universally accessible, we wish to allow users to download them. For users, it is important that the files be as small as possible and of printable quality. Because scanned pages mix bitonal text with continuous-tone illustrations, a single codec for both is impractical. We therefore use PDF as a container for a mixture of JBIG2 and JPEG2000 images, which are composed into a final set of pages. We discuss both the implementation of an open-source JBIG2 encoder, which we use to compress text data, and the design of the infrastructure needed to meet the technical, legal and user requirements of serving many scanned works. We also cover the lessons learnt about dealing with different PDF readers and how to write files that work on most of the readers, most of the time.
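To make the layering concrete, the sketch below shows the shape of the PDF objects involved; the object numbers, dimensions and lengths are illustrative placeholders, not values from our files. The filter names are the standard PDF ones: /JPXDecode (PDF 1.5) wraps a JPEG2000 codestream, and /JBIG2Decode (PDF 1.4) wraps embedded JBIG2 data, with symbol dictionaries shared between pages carried in a /JBIG2Globals stream.

    % Background layer: JPEG2000 image carrying illustrations and page colour.
    % A /JPXDecode stream supplies its own colour space and bit depth.
    4 0 obj
    << /Type /XObject /Subtype /Image
       /Width 1275 /Height 1650                 % placeholder dimensions
       /Filter /JPXDecode
       /Length 23456 >>                         % placeholder
    stream
    ...JPEG2000 codestream...
    endstream
    endobj

    % Foreground layer: JBIG2-coded text, drawn as a 1-bit stencil mask so
    % the background shows through everywhere except the text itself.
    5 0 obj
    << /Type /XObject /Subtype /Image
       /Width 2550 /Height 3300                 % placeholder; text kept at higher resolution
       /ImageMask true /BitsPerComponent 1
       /Decode [1 0]                            % paint the 1 (black) bits, mask the 0 bits
       /Filter /JBIG2Decode
       /DecodeParms << /JBIG2Globals 3 0 R >>   % shared symbol dictionary stream
       /Length 3456 >>                          % placeholder
    stream
    ...embedded JBIG2 data...
    endstream
    endobj

    % Page content stream: paint the background, then stencil the text in black.
    % /Bg and /Tx name the two XObjects above in the page's /Resources dictionary.
    6 0 obj
    << /Length 66 >>                            % placeholder
    stream
    q 612 0 0 792 0 0 cm /Bg Do Q
    q 0 g 612 0 0 792 0 0 cm /Tx Do Q
    endstream
    endobj

This division of labour is why the files stay both small and printable: JBIG2 symbol coding exploits the repetition of character shapes in bitonal text at full scan resolution, while JPEG2000 compresses the continuous-tone material at a lower resolution than the text layer.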