Technologies and tools for speech data

What kind of technologies and tools can be applied when working with speech data?

Humanities scholars are generally well versed in hermeneutic analysis, but sometimes lack the knowledge required to understand the technologies and tools that have been developed to process speech data digitally. This understanding is further hampered by the frequent use of abbreviations and jargon in computer science.

To bridge this gap, this section describes various technologies and the digital tools that apply them. The pages cover the kinds of techniques that a humanities scholar might want to use when working with spoken data. For each technique, we provide two sections. The first describes the generic technology. The second lists one or more specific tools in which that technology is made available for research. Additionally, a glossary explains the most frequently used terms and abbreviations in technologies related to speech and language.

Any machine processing of human written or spoken language requires specific techniques and tools. Broadly, it consists of a sequence of three tasks: recording, recognition and analysis.

During recording, speech and any kind of written analog data must be digitised with sufficient fidelity. Thus, speech is recorded in one of the many audio formats, each with its own technical parameters, and written language is recorded as images obtained via scanning.
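To make "technical parameters" concrete, the following minimal sketch writes a short synthetic tone to an in-memory WAV file with Python's standard `wave` module and reads its parameters back. The chosen values (16 kHz, 16-bit, mono) and the helper names `synth_tone`, `write_wav` and `describe_wav` are assumptions for illustration only; they are not recommendations from this section.

```python
import io
import math
import struct
import wave

# Assumed example parameters; real projects choose these per recording setup.
SAMPLE_RATE = 16000  # samples per second (Hz)
SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16-bit audio
CHANNELS = 1         # mono


def synth_tone(freq_hz, seconds):
    """Return raw 16-bit little-endian PCM bytes for a sine tone (stand-in for speech)."""
    n_samples = int(SAMPLE_RATE * seconds)
    amplitude = 0.5 * 32767
    return b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )


def write_wav(pcm_bytes):
    """Wrap raw PCM bytes in a WAV container and return the file content as bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm_bytes)
    return buf.getvalue()


def describe_wav(wav_bytes):
    """Read the recording parameters back from WAV file content."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": 8 * w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": w.getnframes() / rate,
        }


info = describe_wav(write_wav(synth_tone(440.0, 0.5)))
print(info)
```

The same `describe_wav` check can be pointed at the bytes of an existing WAV recording to see which sample rate and bit depth it was digitised with.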

Recognition is the task of converting raw digital data into symbols. It is thus a categorisation process: in optical character recognition (OCR), regions of an image are associated with character symbols, and likewise, in automatic speech recognition (ASR), fragments of speech are associated with character symbols. Note that there are many other types of recognition: speaker diarization aims to determine who is speaking when, while emotion recognition aims to extract information about a speaker’s emotional state.

The performance of automatic recognition systems is usually reported as an error rate: a measure of the difference between the system's output and an ideal (reference) recognition. In general, human performance is treated as the gold standard, meaning that machine output is compared against what a human would produce.
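One widely used error rate for speech recognition is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system output into the human reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration under that definition; the function names and the example sentences are invented for this page, and real evaluations usually apply more careful text normalisation than lower-casing.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Invented example: one substitution (sat -> sit) and one deletion (the).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2f}")  # prints "WER: 0.33"
```

Because insertions count as errors, the WER can exceed 1.0 when the system output is much longer than the reference.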

Analysis is any higher-level processing of speech and language: one tries to associate meaning with the symbolic representation. In general, it integrates the results of the recognition process with knowledge-based representations. As an example, an automatic speech recognition system commonly returns a sequence of candidate words; an analysis of the syntactic structure in which the candidates occur allows the optimal word to be selected.
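The candidate-selection example can be sketched in a few lines. Note the substitution: instead of a real syntactic analysis, the sketch uses a toy table of word-pair (bigram) probabilities as the knowledge source. All candidate words, scores and probabilities below are invented for illustration and do not come from any actual recogniser.

```python
import itertools
import math

# Hypothetical recogniser output: for each word position, candidate words
# with the recogniser's own (acoustic) confidence.
CANDIDATES = [
    [("the", 1.0)],
    [("whether", 0.6), ("weather", 0.4)],
    [("is", 1.0)],
    [("nice", 1.0)],
]

# Toy knowledge source (invented values): how plausible one word is after another.
BIGRAM_PROB = {
    ("the", "weather"): 0.3,
    ("the", "whether"): 0.001,
    ("weather", "is"): 0.2,
    ("whether", "is"): 0.01,
    ("is", "nice"): 0.1,
}
UNSEEN_PROB = 1e-4  # assumed floor for word pairs missing from the table


def sequence_score(path):
    """Log-probability of a candidate path: recogniser confidence plus context plausibility."""
    words = [word for word, _ in path]
    score = sum(math.log(conf) for _, conf in path)
    for left, right in zip(words, words[1:]):
        score += math.log(BIGRAM_PROB.get((left, right), UNSEEN_PROB))
    return score


def best_sequence(candidates):
    """Exhaustively score every combination of candidates and return the best word sequence."""
    best = max(itertools.product(*candidates), key=sequence_score)
    return [word for word, _ in best]


def acoustic_only(candidates):
    """Baseline: pick the top candidate at each position, ignoring context."""
    return [max(slot, key=lambda c: c[1])[0] for slot in candidates]


print(acoustic_only(CANDIDATES))  # ['the', 'whether', 'is', 'nice']
print(best_sequence(CANDIDATES))  # ['the', 'weather', 'is', 'nice']
```

The exhaustive search is only workable for a toy example like this one; the point is simply that context can overrule the recogniser's locally preferred candidate.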

For research issues, see here.