This paper examines the two basic philosophies of spatial audio reproduction, with reference to their application to teleconferencing services. Sound Field Simulation, as exemplified by multiple-loudspeaker techniques such as Ambisonics, encodes information about a remote or virtual sound field, and allows the reproduction of that field across a listening space. Alternatively, an application might employ Perceptual Synthesis, in which measured or simulated sound localisation cues (e.g. Head-Related Transfer Function (HRTF) data) are imposed on the signals reproduced over headphones or a suitably set-up pair of loudspeakers. The relative merits and drawbacks of each approach are discussed in terms of cost, implementation logistics, flexibility, specification and, critically, perceived performance.

INTRODUCTION

In the development of modern telecommunications services, displays and user interfaces there exists the general paradigm of Telepresence: the idea that interaction between parties at remote locations can be made more effective and efficient if it makes use of as much perceptual information as possible. Under such circumstances, telecommunications will more closely emulate real interaction between people in the physical world. This philosophy supports the development of large visual displays with increased colour and resolution, 3-D displays, interactive support for head movement and improved audio. In auditory display, Spatial Audio is also considered to be a desirable enhancement to telecommunications interfaces, facilitating the reproduction of the voices of other parties in such a way that they are perceived as emanating from particular locations in space, generally the positions of associated visual stimuli.
When communicating with one or more people in person, our perception of the scene, both auditory and visual, is inherently spatial. Spatial auditory displays of this type should therefore have a higher level of telepresence than non-spatial reproduction, because they present information in a more natural manner: a style of presentation we each have a lifetime’s experience in understanding. Such exploitation of our sound localisation ability also has the potential to yield benefits in discriminating a voice from others, or from vocal or other noise present on the channel.

However, the spatial audio community seems entrenched in two camps, each favouring a different approach to the spatialisation of sound, with different implications for implementation and performance in teleconference applications. We can consider the first approach to be Sound Field Simulation, in which an array of loudspeakers (in practice four or more) is used to reproduce a sound field across a listening space. The reproduced sound field can be sourced by special microphone configurations at a real, but remote, location or might be a virtual sound field, artificially encoded from individual sources. Whatever the origin of the sound field being reconstructed, ideally it should be presented across the listening space in such a way that listeners within the space are exposed to auditory information identical to that which would be present in the real environment. The listeners should, therefore, perceive the spatial qualities of the sound field accordingly. Ambisonics is a popular and widespread implementation of the sound field simulation technique [1].

The second general method of creating spatial audio is Perceptual Synthesis, in which we attempt to ensure that the perceptual cues used by the human auditory system to localise sound as emanating from particular positions are present in the signals reaching the listener’s ears.
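To make the contrast between the two approaches concrete, the following minimal Python sketch shows one elementary operation from each. It is an illustration under stated assumptions and not a description of any system discussed in this paper. The first function encodes a mono sample into first-order B-format using the conventional Ambisonic encoding equations. The other two impose a single localisation cue, the interaural time difference, using the Woodworth rigid-sphere approximation; a full perceptual synthesis system would impose complete HRTFs, of which this cue is only one component. The function names, the head radius, the speed of sound and the sample rate are all assumptions chosen for the example.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed room-temperature value
HEAD_RADIUS = 0.0875     # m, assumed average head radius


def encode_b_format(sample, azimuth_deg, elevation_deg=0.0):
    """Sound field simulation: encode one mono sample into first-order
    B-format (W, X, Y, Z). Azimuth is anticlockwise from straight ahead."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front-back component
    y = sample * math.sin(az) * math.cos(el)  # left-right component
    z = sample * math.sin(el)                 # up-down component
    return (w, x, y, z)


def woodworth_itd(azimuth_deg):
    """Perceptual synthesis: interaural time difference in seconds for a
    rigid spherical head (Woodworth approximation), |azimuth| <= 90."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))


def apply_itd(mono, azimuth_deg, sample_rate=48000):
    """Return (left, right) sample lists with the far ear delayed by the
    ITD rounded to whole samples. Positive azimuth = source to the left."""
    delay = round(abs(woodworth_itd(azimuth_deg)) * sample_rate)
    near = list(mono) + [0.0] * delay   # ear facing the source
    far = [0.0] * delay + list(mono)    # ear shadowed by the head
    return (near, far) if azimuth_deg >= 0 else (far, near)


if __name__ == "__main__":
    # A talker 45 degrees to the left, on the horizontal plane.
    print(encode_b_format(1.0, 45.0))
    print(woodworth_itd(45.0))
```

Note that the B-format signals are independent of the loudspeaker layout and must still be decoded for a particular array, whereas the two-channel output of `apply_itd` is already an ear signal pair intended for headphones.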
Physical recordings containing such localisation information can be made by means of microphones placed in the ears of a binaural dummy head or a real listener [2], effectively capturing the sound present, including its localisation cues. Alternatively, artificial binaural recordings can be made by modelling the directionally-dependent features present in the sound reaching a listener’s ears, generally in the form of Head-Related Transfer Functions (HRTFs), and imposing them on the sound signals that we wish to spatialise [3]. Binaural recordings are inherently two-channel and are naturally suited to reproduction over headphones, although the technique of cancelling left-right crosstalk can be used to enable a pair of loudspeakers to reproduce binaural stimuli [4].

Sound field simulation and perceptual synthesis are two quite different approaches to tackling the same task: reproducing spatial audio. This paper examines the relative merits and drawbacks of implementing spatial audio in a teleconference environment by means of the two techniques (as exemplified by Ambisonics and HRTF-based spatialisation). In teleconferencing we must weigh the improved telepresence that may be achieved by the use of a spatial audio display against financial and logistical concerns and the flexibility of the systems. Also, and critically, we must ensure that any increase in telepresence is not achieved at the expense of a reduction in the effectiveness with which the words themselves are understood by the listener since, in speech telecommunications, the intelligible, unambiguous and comfortable transmission of the spoken word is of primary importance.

TELECONFERENCE APPLICATIONS

Figure 1 shows an artist’s impression of a future teleconference system with an extremely high level of telepresence [5]. This application consists of completely unobtrusive cameras and microphones and a 3-D visual display which does not require glasses.
The application is also multipoint and multi-user, with each participant experiencing the sense of a shared workspace. Reproducing spatial audio in such a way that it supports the natural interaction of such a service, rather than compromising it, will be essential.

Teleconference facilities as advanced as these are still some years beyond the scope of current technology. However, there is a rapidly increasing market for the current generation of commercially-available teleconference systems, produced by companies such as PictureTel in the USA, and BT in Britain, which have exploited the bandwidth provided by the ISDN. The digital network has also allowed the inclusion of ancillary tools to increase telepresence and effectiveness, such as collaborative drawing utilities and data sharing.

Figure 1: A future teleconference system proposed by NTT, reproduced from [5]

Most ISDN teleconferencing systems can be classified into the following general categories:

• Permanent Videoconference Rooms
• “Rollabout” Teleconference Systems
• PC/Desktop Videotelephones

Although forming the common view of teleconferencing systems, rooms permanently set aside for videoconference use are comparatively rare in the marketplace. Such facilities will generally involve large video-projection screens and allow the inclusion of a large number of participants. Clearly the logistics and costs involved in such large scale applications restrict the attractiveness of this option, but benefits are yielded in the form of high quality, high telepresence presentation of visual images. Permanent videoconference facilities are almost exclusively a niche, custom-built market.

“Rollabout” or group videoconferencing systems provide more practical and flexible access to teleconference facilities whilst maintaining many of the benefits of the larger-scale systems, in terms of multiple users and moderately large screens.
Products in this category generally house the bulk of the hardware, including monitor, video camera, microphones and loudspeakers, on a movable, and thus to a certain extent portable, trolley. BT’s “Rollabout” product is the VS2. PictureTel produce three very similar systems of which the highest specification model, the Concorde 4500, features a 27 inch monitor and a camera which automatically tracks the position of the person speaking, by means of a small beamforming microphone array. Systems of this kind are rapidly becoming the standard product for situations in which groups of parties wish to participate. They have the flexibility and sufficient portability to be moved between different sized rooms, and support is often included for peripherals such as document cameras or digital overhead projection scanners in order to share documents.

Videotelephones, small units with visual and audio reproduction generally intended for single users, have not found as active a market as might have been expected from what seems to be a logical progression from conventional telephony. ISDN videotelephones are available, such as BT’s VC9000, also known as Presence. However, in the workplace at least, a large proportion of potential videotelephone users already have video and audio playback hardware on their desk, in the form of a PC. Thus, a natural transition has been to incorporate the functionality of the videotelephone into the computer. Products such as PictureTel’s Live100 or the DVS100 from BT consist of a PC expansion card which connects to the ISDN, and associated software, generally running under a windows-type environment. A microphone and small monitor videocamera complete the system. The visual image of the other party in the teleconference is shown in a window on the PC desktop and the sound is reproduced either over headphones or loudspeakers. Shared applications and additional telepresence tools can run in other windows.
Such desktop systems clearly have a lower level of telepresence, primarily because of the small visual image. However, applications of this kind, as well as forming a highly cost-effective teleconferencing solution, are also ideal for day-to-day use, generally by individuals at their desks.

SPATIAL AUDIO COST, LOGISTICS AND FLEXIBILITY

Channels and Loudspeakers

A central influence upon the cost and logistics of physically realising a spatial audio system is the number of cha
[1] F. L. Wightman et al. Headphone simulation of free-field listening. I: Stimulus synthesis. The Journal of the Acoustical Society of America, 1989.
[2] Jerry Bauck et al. Generalized transaural stereo and applications. 1996.
[3] F. L. Wightman et al. Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 1993.
[4] Masato Miyoshi et al. NTT's research on acoustics for future telecommunication services. 1992.
[5] James A. S. Angus et al. The Perceived Performance of Speech Spatialized Using a Spherical Harmonic Model of Head-Related-Transfer Functions. 1997.
[6] Jens Blauert et al. Teleconferencing system using head-related signals. 1992.
[7] F. L. Wightman et al. Headphone simulation of free-field listening. II: Psychophysical validation. The Journal of the Acoustical Society of America, 1989.
[8] Durand R. Begault et al. 3-D Sound for Virtual Reality and Multimedia. 1994.
[9] Malcolm J. Hawksford et al. Limitations of Dynamically Controlling the Listening Position in a 3-D Ambisonic Environment. 1997.
[10] E. Knudsen et al. Creating a unified representation of visual and auditory space in the brain. Annual Review of Neuroscience, 1995.
[11] Duane H. Cooper et al. Prospects for Transaural Recording. 1989.
[12] David G. Malham et al. 3-D Sound Spatialization using Ambisonic Techniques. 1995.