Keynote speakers

Monday: Akihiko Sugiyama, NEC Information and Media Processing Laboratories, Japan

A Versatile Speech Front-End for Telecommunication and Speech Recognition - The Last One Mile: Implementation Issues for a Better Product

This talk presents a versatile speech front-end for telecommunication and speech recognition. Requirements for versatile speech front-end systems are first established, and algorithms that satisfy these requirements are then reviewed in the context of human and human-robot communications. Implementation techniques for these algorithms are discussed from the viewpoint of efficiency at three levels: development, approximation, and coding of the algorithm. These techniques are useful in both LSI and PC software implementations. Evaluation results and sound/video demonstrations conclude the talk.

Tuesday: Mark Gales, Cambridge University, UK

Model-Based Approaches to Handling Additive Noise in Reverberant Environments

This talk will discuss the application of model-based noise robustness schemes to reverberant environments. Model-based approaches for handling additive noise and convolutional distortions have been extensively investigated and applied to a wide range of noise conditions. However, their application to reverberant noise has received less attention. This talk will review the current state of the art in model-based approaches for handling additive noise and convolutional distortions, and describe the issues that arise when extending these approaches to reverberant environments. To illustrate possible solutions, two extensions of standard adaptation/compensation approaches to reverberant noise will then be discussed. The first is an extension of vector Taylor series (VTS) compensation, reverberant VTS, in which a mismatch function representing reverberant noise is used. The parameters of the mismatch function are estimated using maximum likelihood, and the acoustic models are then compensated. The second approach modifies a standard speaker adaptation scheme, constrained MLLR, to allow a wide span of frames to be taken into account and "projected" into the required dimensionality. To handle additive noise, both schemes are combined with standard VTS. Results comparing the two approaches on two tasks, MC-WSJ-AV and a reverberant simulated version of AURORA-4, will be described.
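As background for the VTS-based compensation mentioned in the abstract, the sketch below illustrates the standard static-cepstrum mismatch function for additive noise n and convolutional distortion h, y = x + h + C log(1 + exp(C^-1(n - x - h))), which reverberant VTS generalizes by replacing this mismatch function with one that models reverberation. This is a minimal NumPy illustration under the simplifying assumption of a square orthonormal DCT for C, not the compensation scheme presented in the talk.

```python
import numpy as np
from scipy.fftpack import dct, idct

def vts_mismatch(x_cep, n_cep, h_cep):
    """Standard VTS mismatch function for static cepstra:
    y = x + h + C log(1 + exp(C^{-1}(n - x - h))),
    where C maps log-mel energies to cepstra (here a full
    orthonormal DCT; real MFCC front-ends truncate it).
    x_cep, n_cep, h_cep: cepstra of clean speech, additive noise
    and convolutional distortion, all of the same length.
    """
    # Map the cepstral difference back to the log-mel domain (C^{-1}).
    diff_logmel = idct(n_cep - x_cep - h_cep, norm='ortho')
    # Non-linear speech/noise interaction in the log-mel domain.
    g_logmel = np.log1p(np.exp(diff_logmel))
    # Return to the cepstral domain (C) and add to the distorted speech.
    return x_cep + h_cep + dct(g_logmel, norm='ortho')

# Toy example with 13-dimensional cepstral vectors.
rng = np.random.default_rng(0)
x, n, h = rng.normal(size=(3, 13))
y = vts_mismatch(x, n, h)
print(y.shape)  # (13,)
```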

Wednesday: Dan Ellis, Laboratory for Recognition and Organization of Speech and Audio (LabROSA), Columbia University, USA

Environmental Sound Recognition and Classification

The decreasing cost and increasing capabilities of mobile devices are feeding accelerating growth in the amount of recorded audio-video content available. Casually recorded, largely unedited videos can be an amazingly rich source of information, but at present they are often very difficult to find and navigate. While computer vision is a large and active field, the use of the soundtrack as a basis for analyzing and searching unconstrained video is an underexplored avenue. In this talk, I will introduce some application areas for the analysis of web-style video and describe some recent projects in my lab aimed at extracting information from this kind of soundtrack material, using both speech and nonspeech content. I will also discuss opportunities for interaction between this area of research and the HSCMA community.
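To make the task concrete, the sketch below shows a common "bag of frames" baseline for environmental sound classification: summarize each clip's soundtrack with log-mel statistics and feed them to a generic classifier. It is only an illustrative baseline, not the methods developed at LabROSA; the file paths and labels are hypothetical placeholders.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(path, sr=22050, n_mels=40):
    """Summarize a clip's soundtrack as the mean and standard deviation
    of its log-mel bands, a simple bag-of-frames representation."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# Hypothetical training data: lists of soundtrack files and class labels.
# train_paths, train_labels = ...
# X = np.stack([clip_features(p) for p in train_paths])
# clf = SVC(kernel='rbf').fit(X, train_labels)
# print(clf.predict([clip_features('some_video_soundtrack.wav')]))
```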

Closing talk: Ivan Tashev, Microsoft Research, USA

HSCMA: Hands-free Sound Capture and Microphone Array in Kinect

This talk will discuss aspects of the acoustical design and audio processing pipeline of Kinect, the fastest-selling consumer electronics device in history according to Guinness World Records. The device is the first industrial product with surround sound echo cancellation, one of the first to offer hands-free speech recognition at distances of up to four meters, and the first open-microphone speech recognition device. Ivan Tashev is one of the architects behind Kinect and created most of the algorithms in the audio pipeline.
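For readers unfamiliar with acoustic echo cancellation, the sketch below shows the basic single-channel principle behind it: an adaptive FIR filter (here normalized LMS) estimates the echo path from the loudspeaker signal and subtracts the predicted echo from the microphone signal. Kinect's surround sound echo canceller is far more elaborate, so this is only an illustrative toy under assumed signal names, not the pipeline described in the talk.

```python
import numpy as np

def nlms_aec(far_end, mic, num_taps=256, step=0.5, eps=1e-6):
    """Toy single-channel acoustic echo canceller using NLMS.
    far_end: loudspeaker (reference) signal
    mic:     microphone signal = echo + near-end speech
    Returns the echo-reduced (error) signal."""
    w = np.zeros(num_taps)       # adaptive estimate of the echo path
    buf = np.zeros(num_taps)     # most recent far-end samples
    out = np.zeros(len(mic))
    for i in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[i]
        echo_hat = w @ buf       # predicted echo at the microphone
        e = mic[i] - echo_hat    # residual = near-end + residual echo
        out[i] = e
        # Normalized LMS update of the filter coefficients.
        w += step * e * buf / (buf @ buf + eps)
    return out

# Toy usage: synthetic echo path plus low-level near-end speech.
rng = np.random.default_rng(1)
far = rng.normal(size=16000)
echo_path = rng.normal(size=64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far, echo_path)[:16000] + 0.1 * rng.normal(size=16000)
cleaned = nlms_aec(far, mic)
```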