'Language and Speech Technology' is about developing computational methods for processing spoken language. In spoken language, a physical signal (minute changes in air pressure) carries information: speech sounds, words, emotions, and speaker identity.
Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Dialog Systems (DS) are the three main research areas, and they are complemented by Speaker Diarisation, Voice Activity Detection, Emotion Recognition, and Segmentation.
Automatic Speech Recognition
In ASR, the words of an utterance are extracted from the audio signal. Statistical methods and, more recently, neural networks have proven very successful. These systems are trained on large speech corpora consisting of raw audio recordings and their verbatim transcripts, and they achieve the best results for speech similar to the material they were trained on. Depending on the application area, the raw output of a speech recognition system needs further processing, e.g. text normalisation and punctuation restoration.
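As a concrete illustration of such post-processing, the following minimal Python sketch performs a naive inverse text normalisation and punctuation step on raw ASR output. The number table and the capitalise-and-add-a-full-stop rule are invented for illustration and do not come from any particular system.

```python
# A minimal sketch of post-processing raw ASR output: inverse text
# normalisation plus naive punctuation restoration. Illustrative only.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalise(raw: str) -> str:
    """Convert spelled-out digits to numerals and capitalise the sentence."""
    tokens = [NUMBER_WORDS.get(t, t) for t in raw.split()]
    text = " ".join(tokens)
    # Naive punctuation restoration: capitalise and append a full stop.
    return text[:1].upper() + text[1:] + "."

print(normalise("the meeting starts at nine"))
# -> "The meeting starts at 9."
```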
Text-to-Speech
TTS (or Speech Synthesis) aims to generate an audio signal from a given text or concept. There are two main approaches: parametric synthesis, which generates an audio signal from a mathematical description of the articulation process, and unit selection, which generates an utterance by concatenating small building blocks of prerecorded speech. Recent research on TTS also explores the use of neural networks to improve the perceived quality of the speech output.
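The concatenation step of unit selection can be sketched in a few lines: units from a prerecorded inventory are joined with a short crossfade to mask the seams. The inventory below is synthetic (sine tones standing in for recorded diphones), so this illustrates only the concatenation, not the selection search a real system performs.

```python
import numpy as np

SAMPLE_RATE = 16_000

# Hypothetical unit inventory: in a real system these would be recorded
# speech segments; here they are synthetic tones for illustration.
def tone(freq: float, dur: float) -> np.ndarray:
    t = np.arange(int(SAMPLE_RATE * dur)) / SAMPLE_RATE
    return 0.5 * np.sin(2 * np.pi * freq * t)

UNITS = {"a": tone(220, 0.2), "b": tone(330, 0.2), "c": tone(440, 0.2)}

def concatenate(sequence, xfade=0.01):
    """Join units with a short linear crossfade to mask the seams."""
    n = int(SAMPLE_RATE * xfade)
    ramp = np.linspace(0.0, 1.0, n)
    out = UNITS[sequence[0]].copy()
    for name in sequence[1:]:
        unit = UNITS[name]
        out[-n:] = out[-n:] * (1 - ramp) + unit[:n] * ramp
        out = np.concatenate([out, unit[n:]])
    return out

signal = concatenate(["a", "b", "c"])
```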
Dialog Systems
In DS research, the focus is on high-quality speech communication between humans and machines. Typical applications include information, assistance, and education systems, sales and entertainment, as well as support systems for hands-free operation. The major challenge in this field is the smooth integration of multiple communication modalities, namely speech, text, graphics, and haptics.
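One minimal way to picture the speech side of such a system is a frame-based dialog manager that keeps asking for missing information until a query can be executed. The slots and prompts below are invented for a hypothetical travel-information system; they are a sketch of the idea, not any particular implementation.

```python
# A minimal frame-based dialog manager for a hypothetical travel
# information system. Slots and prompts are invented for illustration.
SLOTS = {"origin": "Where are you travelling from?",
         "destination": "Where do you want to go?",
         "time": "When do you want to leave?"}

def next_prompt(state: dict) -> str:
    """Ask for the first slot the user has not yet filled."""
    for slot, prompt in SLOTS.items():
        if slot not in state:
            return prompt
    return (f"Searching connections from {state['origin']} "
            f"to {state['destination']} at {state['time']}.")

state = {}
print(next_prompt(state))        # -> Where are you travelling from?
state["origin"] = "Utrecht"
print(next_prompt(state))        # -> Where do you want to go?
```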
Speaker Diarisation is the process of assigning speakers to fragments of a given speech signal. Humans are very good at this, but more research is needed before machines reach useful performance. The same is true for Voice Activity Detection: here, the challenge is first to determine the non-speech parts of the signal and then to decide on their communicative function.
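A classic entry point to voice activity detection is short-time energy thresholding, sketched below. Real systems use far more robust features and models; the frame length and threshold here are assumed values for illustration.

```python
import numpy as np

def voice_activity(signal: np.ndarray, sample_rate: int,
                   frame_ms: float = 25.0, threshold: float = 0.01):
    """Label each frame as speech (True) or non-speech (False) by its
    short-time energy; a deliberately simple baseline."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

# Usage: a boolean mask over 25 ms frames of a 16 kHz recording.
# mask = voice_activity(recording, 16_000)
```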
In Emotion Recognition, biophysical features are traced in the speech signal and mapped to emotional states. Such features include, among others, intonation contours, speech rate, and their variation. Again, the main challenge is to identify which speech features can be mapped to which emotional states in which contexts.
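The intonation contour, for instance, is the course of the fundamental frequency (F0) over time. The sketch below estimates F0 for a single frame by autocorrelation, a standard textbook method; the frame length and frequency range are assumptions, and a contour results from applying it to successive frames.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sample_rate: int,
                fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one frame by finding the
    autocorrelation peak within an admissible pitch range. The frame
    should span at least sample_rate / fmin samples (e.g. 40 ms at 16 kHz)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)   # shortest admissible period in samples
    hi = int(sample_rate / fmin)   # longest admissible period in samples
    lag = lo + np.argmax(ac[lo:hi])
    return sample_rate / lag

# Applied frame by frame, this yields an F0 contour that, together with
# speech-rate measures, can feed an emotion classifier.
```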
Finally, in Segmentation, fragments of the speech signal are associated with symbolic data. In phoneme-based segmentation systems, a raw verbatim transcript is converted to a sequence of phonemes, which are then mapped to the speech signal. The challenges here are languages with segmentation features not covered by the training data, e.g. tone languages, and languages for which very little training material exists.
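The mapping step can be caricatured with a deliberately naive sketch: look up each word in a pronunciation dictionary and spread the phonemes uniformly over the utterance duration. The dictionary entries below are invented, and real systems replace the uniform spread with HMM- or neural-network-based forced alignment.

```python
# Naive phoneme segmentation: dictionary lookup plus uniform spread.
# PRON_DICT entries are invented for illustration.
PRON_DICT = {"speech": ["s", "p", "iy", "ch"], "lab": ["l", "ae", "b"]}

def uniform_alignment(words, duration_s):
    """Return (phoneme, start, end) triples spread evenly over the signal."""
    phonemes = [p for w in words for p in PRON_DICT[w]]
    step = duration_s / len(phonemes)
    return [(p, round(i * step, 3), round((i + 1) * step, 3))
            for i, p in enumerate(phonemes)]

print(uniform_alignment(["speech", "lab"], 0.7))
# [('s', 0.0, 0.1), ('p', 0.1, 0.2), ..., ('b', 0.6, 0.7)]
```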
Interdisciplinarity
Language and Speech Technology is interdisciplinary research. Advances in LST will benefit diverse scientific areas, and, conversely, contributions from other research fields are needed to advance LST. The relationship between signal and information is complex and multi-dimensional, and disambiguation may require access to knowledge about the context in which the signal was produced and perceived.
For example, based on the acoustic signal alone, an automatic speech recognition system may return the same probability for the word sequences "kind of research that I can duct" and "kind of research that I conduct". A syntactic analysis supports the second word sequence, because a verb is highly probable at this position, and this is further supported by a lexical context analysis showing that 'research' and 'conduct' co-occur very often.
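A toy bigram language model is enough to reproduce this disambiguation in code. The probabilities below are invented for illustration; they merely encode that 'can duct' is a very unlikely word pair while 'I conduct' is plausible.

```python
import math

# Invented bigram probabilities; a real model would estimate these
# from a large text corpus.
BIGRAM_PROB = {("i", "conduct"): 1e-3, ("conduct", "</s>"): 1e-2,
               ("i", "can"): 1e-2,     ("can", "duct"): 1e-7,
               ("duct", "</s>"): 1e-2}

def log_score(words):
    """Sum of log bigram probabilities, with a small floor for unseen pairs."""
    pairs = zip(words, words[1:] + ["</s>"])
    return sum(math.log(BIGRAM_PROB.get(p, 1e-9)) for p in pairs)

h1 = ["i", "can", "duct"]
h2 = ["i", "conduct"]
best = max([h1, h2], key=log_score)   # -> ['i', 'conduct']
```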