DiCoMo: the digitization cost model

Estimating digitization costs is a difficult task: the large number of unknown factors makes accurate values hard to obtain. Nevertheless, digitization projects need a precise idea of the economic cost and the time involved in producing their contents. The common practice when starting to digitize a new collection is to set a schedule and make a firm commitment to fulfil it, in terms of both cost and deadlines, even before the actual digitization work starts. As happens with software development projects, incorrect estimates produce delays and cause cost overruns. Based on methods used in Software Engineering for software development cost prediction, such as COCOMO and Function Points, and using historical data gathered over 5 years at the MCDL project during the digitization of more than 12000 books, we have developed a method for time and cost estimation, named DiCoMo (Digitization Cost Model), for digital content production in general. This method can be adapted to different production processes, such as the production of digital XML or HTML texts using scanning and OCR followed by human proofreading and error correction, or the production of digital facsimiles (scanning without OCR). The accuracy of the estimates improves with time, since the algorithms can be optimized by making adjustments based on historical data gathered from previous tasks. Finally, we consider the problem of parallelizing tasks, i.e. dividing the work among a number of encoders who will work in parallel.
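The abstract does not give the DiCoMo equations, so the following is only a minimal sketch of the general idea it describes: a COCOMO-style power law calibrated on historical data, followed by a simple split of the work among parallel encoders. The functional form `hours = a * size**b`, the optional `multiplier`, the Amdahl-style `serial_fraction`, and all function names are assumptions made for illustration, not the authors' published model.

```python
import math


def fit_power_law(sizes, hours):
    """Fit hours = a * size**b by least squares in log-log space.

    `sizes` and `hours` are historical observations (for example pages
    per book and hours spent); both must be positive. The power-law
    form is borrowed from COCOMO and is an assumption here.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def estimate_hours(size, a, b, multiplier=1.0):
    """Estimate effort for a new task; `multiplier` is a hypothetical
    adjustment factor (e.g. for text complexity), 1.0 meaning nominal."""
    return a * size ** b * multiplier


def parallel_hours(total_hours, encoders, serial_fraction=0.0):
    """Elapsed time when work is divided among `encoders` people.

    Amdahl-style assumption: `serial_fraction` of the work cannot be
    split, the rest divides evenly among the encoders.
    """
    if encoders < 1:
        raise ValueError("encoders must be at least 1")
    return total_hours * (serial_fraction + (1.0 - serial_fraction) / encoders)


if __name__ == "__main__":
    # Synthetic history generated from known parameters, NOT real MCDL data.
    sizes = [100, 200, 400, 800]
    hours = [0.05 * s ** 1.1 for s in sizes]
    a, b = fit_power_law(sizes, hours)
    total = estimate_hours(300, a, b)
    print(f"a={a:.4f} b={b:.4f} estimate={total:.2f} h")
    print(f"with 4 encoders: {parallel_hours(total, 4, 0.1):.2f} h")
```

Refitting `a` and `b` as new completed tasks are added to the history is one way the estimates could improve over time, in the spirit of the calibration the abstract mentions.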
