Automatic Speech Recognition

Automatic Speech Recognition, or ASR, is software that automatically converts recordings of spoken language into text. The technology is similar to OCR, which turns images of text into characters a computer can read, but instead of converting images, ASR converts audio and video, or AV, files.

Nowadays, statistical models have largely been replaced by Deep Neural Networks (DNNs), which do the same job better, so the overall quality is sometimes on par with human transcription. DNNs for end-to-end ASR are capable of finding their own way to transform speech into text: they learn to cope with acoustics, phonetics, grammar, vocabulary, real-world knowledge and orthography without direct human supervision. All they need is training data, an appropriate design and powerful computers.

Machine Learning
Speech recognition software is “trained”, which means it has been provided with many AV files along with accurate human-made transcriptions, so that the machine learns which sounds correspond to which letter combinations. This process is called Machine Learning (ML). It also means that humans have to select what material the software is trained on. Because of this, ASR can be accent- and dialect-sensitive: ASR trained for Australian English might not work well for American English.

Considerations when using ASR
When a researcher is contemplating using ASR, it is important to realize that certain AV sources are more suitable than others. Here are some things to keep in mind. First, the audio quality should be high. This means that voices are clear, free of echo, and recorded close to the mouth with adequate microphones. Secondly, ASR works best on monologues. If an audio file is full of people interjecting and talking over each other, the results might be confusing to read, as not all software is capable of recognizing different people by voice. Ideally, the file should have a separate channel for every speaker. Lastly, ASR is generally not very good at dealing with accents and dialects. Even accents of migrants or rural speakers that are easy for you to understand may still cause ASR to struggle badly, let alone accents that are hard to understand for human outsiders.
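As a small illustration of the channel advice above, you can check how many channels a recording has before feeding it to ASR. This sketch uses only Python's standard `wave` module; the file name `interview.wav` is a hypothetical example, and here we create a stereo file ourselves just to have something to inspect.

```python
import wave

def channel_count(path):
    """Return the number of audio channels in a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()

# Create a short two-channel (stereo) WAV purely for demonstration:
# in practice you would inspect your own recording instead.
with wave.open("interview.wav", "wb") as wav:
    wav.setnchannels(2)        # one channel per speaker
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(16000)    # 16 kHz, a common rate for speech
    wav.writeframes(b"\x00\x00" * 2 * 16000)  # one second of silence

print(channel_count("interview.wav"))  # prints 2
```

A two-person interview recorded with one microphone per speaker would ideally report one channel per participant, which makes it much easier for ASR software to keep the speakers apart.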


Local software or webservice
Speech recognition can be done via a computer program on your own computer (Windows, Linux, macOS) or via a webservice. Both options have their advantages and disadvantages.

While a webservice is easy to use and often does not require any computing power of note, it comes with the downside of privacy issues. Almost all webservices for ASR rely on cloud computing, which means that your recorded interview is at least temporarily stored on a server somewhere. In some cases, by using a webservice, you agree to the provider using your interview recording for their own research purposes, which is very problematic for privacy reasons. We recommend always reading the terms and conditions of an ASR service before deciding whether it suits your needs when it comes to privacy.

When working with very sensitive data, where you do not want to trust any online service, running an ASR service on your own computer has the advantage of giving you full control over the whole process. Data will not be uploaded to a different computer for processing, and you could even run the software without an internet connection at all, if you'd like. The big disadvantage is that local ASR processing is often more complicated and requires a computer with at least some processing power. While many online ASR services work easily on a simple web-browsing laptop like a Chromebook, this is less so the case with installable software.

Below, we list a couple of options worth mentioning when it comes to ASR services. The first is the Transcription Portal, as part of the team responsible for this website played a large role in the creation of this online ASR tool. Other unaffiliated options will be listed at a later time.