Integrated generation of graphics and text: a corpus study

We describe the results of a corpus study of more than 400 text excerpts that accompany graphics. We show that text and graphics play complementary roles in transmitting information from the writer to the reader, and we derive some observations for the automatic generation of texts associated with graphics.

For the past few years, we have studied the automatic generation of graphics from statistical data in the context of the PostGraphe system (Fasciano, 1996; Fasciano and Lapalme, 1998), which is based on graphic principles drawn from such diverse sources as Bertin (1983), Cleveland (1980) and Zelazny (1989). PostGraphe is given the data in tabular form, as might be found in a spreadsheet, together with a declaration of the types of the values in the columns of the table. The user then indicates the intentions to be conveyed by the graphics (e.g. compare two variables or show the evolution of a set of variables), and the system generates a report in LaTeX with the appropriate PostScript graphic files.

PostGraphe also generates an accompanying text following a few simple text schemas. Before adding new schemas, however, we decided to carry out a corpus study of texts associated with graphics; this paper presents the results of that study. We studied more than 400 texts, and we will show that the saying "a picture is worth a thousand words" needs to be qualified: graphics and text are far from interchangeable, and their interactions are quite subtle. With hindsight this may seem obvious, but without a corpus study we could not have documented this result. Although multimedia systems have been studied for many years, we are not aware of any previous corpus study of the same scale.

1 Overview of PostGraphe

Many sophisticated tools can be used to build a presentation using statistical graphs. However, most of them focus on producing professional-looking graphics without trying to help the user organize the presentation.
To help in this respect, we have built PostGraphe, which generates a report integrating graphics and text from a set of writer's intentions. The writer's intentions can be classified according to two basic criteria: structural differences and content differences. We refer to intentions derived from structural differences as objective intentions and to intentions derived from content differences as subjective intentions. This definition stems from the fact that when the differences between two intentions relate more to content than to structure, the writer is choosing what to say and not how to say it. The writer is thus making a subjective choice as to what is more important.

In our research, we have built a classification of messages, given in figure 1, based on Zelazny's (1989) work. At the first level, our classification contains 5 categories, two of which have sub-categories obtained by using a fractional modifier. For comparison, the fractional modifier indicates that the comparison should be done on fractions of the whole instead of the actual values. For distribution, we obtain a specialized intention where the classes are presented according to their fraction of the total. At the second level, the intentions become specialized according to subjective criteria.

    Objective (structure: how to say?)
        Reading(V)
        Comparison(S1,S2)
            Comparison Fractional(V,S)
        Evolution(V1,V2)
        Correlation(V1,V2)
        Distribution(V,S)
            Distribution Fractional(S)

    Subjective (content: what to say?)
        Increase
        Decrease
        Stability
        Recapitulative

Figure 1: Two-level decomposition of simple intentions; V is a variable and S is a set of variables.

These simple intentions can then be combined either by composition or by superposition. In composition, the order of the variables is important and there is a dominant intention; for example, the comparison of evolutions is quite different from the evolution of a comparison.
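The classification and the two ways of combining intentions can be illustrated with a small data model. The following Python sketch is only an illustration and not PostGraphe's own (Prolog) representation: all names in it (Objective, Subjective, Simple, Composition, Superposition, describe) are hypothetical. It also assumes that the outer intention of a composition is the dominant one, which the text does not state, and it leaves the subjective level unconstrained because the text does not say which objective intentions each subjective one specializes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple, Union


class Objective(Enum):
    """First-level (structural) intentions: the 5 categories of figure 1."""
    READING = "reading"
    COMPARISON = "comparison"
    EVOLUTION = "evolution"
    CORRELATION = "correlation"
    DISTRIBUTION = "distribution"


class Subjective(Enum):
    """Second-level (content) specializations listed in figure 1."""
    INCREASE = "increase"
    DECREASE = "decrease"
    STABILITY = "stability"
    RECAPITULATIVE = "recapitulative"


# Only two of the five categories have a fractional sub-category.
FRACTIONAL_ALLOWED = frozenset({Objective.COMPARISON, Objective.DISTRIBUTION})


@dataclass(frozen=True)
class Simple:
    """A simple intention over named variables (hypothetical model)."""
    objective: Objective
    variables: Tuple[str, ...]
    fractional: bool = False
    subjective: Optional[Subjective] = None

    def __post_init__(self) -> None:
        if self.fractional and self.objective not in FRACTIONAL_ALLOWED:
            raise ValueError(f"{self.objective.value} has no fractional variant")


@dataclass(frozen=True)
class Composition:
    """Order matters. Assumption: the first (outer) intention is the dominant one."""
    dominant: Simple
    subordinate: Simple


@dataclass(frozen=True)
class Superposition:
    """Intentions sharing one graphic without interfering: an unordered set."""
    parts: FrozenSet[Simple]


Intention = Union[Simple, Composition, Superposition]


def describe(intention: Intention) -> str:
    """Return a short label for an intention."""
    if isinstance(intention, Simple):
        name = intention.objective.value
        return f"fractional {name}" if intention.fractional else name
    if isinstance(intention, Composition):
        return f"{describe(intention.dominant)} of {describe(intention.subordinate)}"
    return " + ".join(sorted(describe(p) for p in intention.parts))


evolution = Simple(Objective.EVOLUTION, ("profits", "year"))
comparison = Simple(Objective.COMPARISON, ("profits", "company"))

print(describe(Composition(comparison, evolution)))   # comparison of evolution
print(describe(Composition(evolution, comparison)))   # evolution of comparison
print(describe(Superposition(frozenset({comparison, evolution}))))  # comparison + evolution
```

In this sketch, swapping the two arguments of a composition yields a different value, whereas a superposition compares equal regardless of the order in which its parts are listed.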
The sentence "Sales figures of Xyz increased less quickly than the ones of Pqr between 1992 and 1994" compares evolutions, whereas "Pqr always stayed at the top except between 1992 and 1994" shows the evolution of a comparison. In superposition, the intentions are merely expressed using the same graphic but do not interfere with each other.

Figure 2 shows the part of the Prolog input specifying the intentions and the output from PostGraphe. The intentions are divided into 2 sections: the first presents the 3 variables (year, company and profits); the second presents the comparison of the profits between companies and the evolution of the profits over the years.

We have also "ported" this idea of taking the writer's intentions into account to the spreadsheet world by creating an alternative Chart Wizard for Microsoft Excel, which asks for the intentions of the user (comparison, evolution, distribution, ...) instead of prompting for the sort of graphic (bar chart, pie chart, ...); see (Fasciano and Lapalme, 1998) for more information.

2 Text and graphics integration

Graphics and text are very different media. Fortunately, when their integration is successful, they complement each other very well: a picture shows whereas a text describes. To create an

    data(...
         % the intentions
         [[lecture(année), lecture(compagnie), lecture(profits)],
          [comparaison([profits], [compagnie]), evolution(profits, année)]],
         % the raw data
         [[1987,'A',30], ...]).

    Nouvelle section (3 intentions à traiter).