Q&A with Kaldi’s Dan Povey

This article continues our series on Automatic Speech Recognition, published on Medium.com.

Few experts in the field of automatic speech recognition have the kind of vantage point that Daniel Povey does. Povey is Associate Research Professor at the Center for Language and Speech Processing at Johns Hopkins University—and the lead developer and steward of the Kaldi project.

Kaldi is an open-source ASR toolkit; since its debut in 2011, it has helped supercharge the field, giving researchers a robust and flexible foundation to build from while taking advantage of the latest in ASR techniques. (In fact, Descript uses Kaldi for some of its features!)

Because of Kaldi’s prevalence in the field, Povey is attuned to many of its recent developments. We asked him a few questions about the state of the industry, and are thrilled he responded with a series of thoughtful answers, which we’ve included in full below.

What is the state of ASR today? What are its biggest shortcomings, and where are researchers/products making the most progress?

It’s nice that ASR is actually starting to be useful now. When I started out, the most visible ASR product was Dragon Dictate, which few people actually used — I believe it was marketed as the ideal Christmas present, which was deceptive.

These days we have Amazon Alexa and Google Home, which people actually use — not to mention call center dialog systems. They are annoying, but that’s often a limitation of the dialog management rather than the ASR.

A drawback, in my mind, is that most uses of ASR that make economic sense are still very large-scale, because it requires highly paid, hard-to-find experts to build a good-performing system. Kaldi reduces that barrier because it means you don’t have to build your software from the ground up. It even has recipes that you can follow, but it was still basically designed for experts to use. I think of Kaldi as like the Millennium Falcon. Sure, it can do the Kessel run in less than 12 parsecs, but as Han says to Luke: “who’s gonna fly it, kid?”

It’s actually a problem for academics that ASR is doing so well. It’s viewed by some funding agencies as a “solved problem”. That means we can’t graduate many PhD students, and there are too few PhDs graduating to satisfy the demand from industry. Plus, many of the best academics defect to industry.

Where do you see the strengths of transcription services offered by Google and other major companies? How do these compare to Kaldi, and how do they differ from one another?

It’s hard to build a general-purpose model that will work as well as a model built for your specific task. Google’s ASR models are very good, but they won’t customize their models for your specific scenario. Also their service isn’t free, and sometimes there are privacy issues that preclude the use of a cloud service. It’s actually hard to know how Kaldi compares with Google’s ASR because they tend to be cagey about releasing performance figures on commonly available datasets, but we can compare with systems built by other major companies that do release such figures (like Microsoft or IBM).

Generally speaking, Kaldi performs about the same. In fact, the current best number on the Switchboard subset of Eval2000, 5.0% Word Error Rate, comes from a Kaldi-based system — although it wasn’t built by us, but by a company called cap.io. It’s a huge system combination, which is what you do when you want to get the best-ever number.
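(Ed. note: Word Error Rate is the standard ASR metric: the number of word substitutions, deletions, and insertions needed to turn the system’s output into the reference transcript, divided by the number of reference words. The sketch below only illustrates the arithmetic; it is not the official Eval2000 scoring pipeline, which also handles text normalization and other details, and the function name is our own.)

```python
# Illustrative WER computation: edit distance between reference and
# hypothesis word sequences, divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over 6 reference words: roughly 0.167, i.e. 16.7% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```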

Much of the recent progress in ASR involves having access to large corpora of data — which gives big companies an advantage. Meanwhile, initiatives like Mozilla Common Voice are looking to level the playing field, at least in this respect.

How do you see this going: will ‘big data’ become less of a competitive advantage?

I dispute that much of the recent progress in ASR involves access to large amounts of data. Yes, there are papers that say: “We built a huge model on tens of thousands of hours of proprietary data and got this amazing performance on Switchboard Eval2000”. Those kinds of papers may be a huge PR win for the company that published them, but they don’t move the field forward. We’ve known since forever that the more data you have, the better you can do, so there’s nothing really new there; and in my opinion it’s not the case that people who have had access to big data have used it to develop particularly interesting new methods. Anyway, something that only works on ten thousand hours of data isn’t that interesting, in my opinion, because most of the time you won’t have that much data of the right type.

People tend to get excited about things that work on big data — it’s kind of a fashion nowadays — but I maintain that small data is equally interesting. If you’re building an application for which you don’t have training data that’s well matched — and most applications are like that — you’re probably going to want it to work well when trained with 10 hours of data, so that you can build a prototype with reasonable performance. That will allow you to scale your data up (or get the next round of funding). It’s like a fish: in order to get big, it needs to be able to survive when it’s small, because fish aren’t born big.

There’s definitely an advantage to scale, but I don’t think it’s just about the scale of the data. It’s also about the cost of building your application. Those costs are mostly fixed (they don’t scale with the size of your market), so to make a profit you need to have a certain scale. Of course the scale at which you can break even will tend to decrease over time, as the algorithms get better and the software becomes easier to use. Generally speaking the cost of the training data will still be less than what you’re paying your ASR engineers.

Regarding Mozilla Common Voice: it’s nice that they are collecting data, and free data is always a good thing, but you’ve got to remember that there are different types of data. If you want to build, say, a recognizer to process call center conversations in mixed Hindi and Indian-accented English, or one that can deal with commands in Mandarin inside a car, the Mozilla Common Voice data isn’t going to help. And in terms of research, there are already enough free large-scale databases for people to work with (for example, Librispeech is 1000 hours). So Mozilla Common Voice isn’t really a game changer in terms of research. It’s still useful for the purpose they collected it for, though, which is to build ASR systems for a browser that accepts voice commands.

Do current approaches to ASR have shortcomings that will lead to diminishing returns? Do you expect we’ll hit a ‘wall’ in terms of accuracy?

There’s always going to be some kind of wall, because human speech is inherently ambiguous, even taking the context into account. I don’t know that I can say much about this because “current approaches” encompasses a lot of things. What I will say is that I am skeptical about the current passion for “end-to-end” speech recognition.

(Ed. note: Most ASR systems use several distinct models — acoustic, pronunciation, and language — in tandem. End-to-end systems attempt to process speech all in one go.)

In my opinion the defining feature of these “end-to-end” approaches is the attempt to take structure out of the system: whether that structure is the language model, the knowledge of pronunciations of words, the concept of speech feature extraction, or other things too.

So it’s a simplification. Of course, simpler is good, but people forget that the structure was in there for a reason. For example, words really do have pronunciations that are distinct from their spellings; and it does make sense to train a language model separately from the ASR system because you can use separate text data for that. People seem to think that by taking the structure out of the system, the fairy dust of neural networks will improve the performance, but I think that’s a mirage.
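(Ed. note: The separation Povey describes can be made concrete. In the classic factored setup, a decoder scores a candidate word sequence by combining a pronunciation lexicon, an acoustic model, and a language model that were trained separately; an end-to-end system replaces all of that with a single network. The sketch below is purely illustrative — the object names are hypothetical stand-ins, not Kaldi’s API.)

```python
# Illustrative only: acoustic_model, lexicon, language_model and e2e_model
# are hypothetical objects, not Kaldi's API.

def conventional_score(audio_features, words, acoustic_model, lexicon, language_model):
    """Classic factored scoring: log P(W | X) is approximated (up to a
    constant) by log p(X | phones(W)) + log P(W)."""
    # Pronunciation lexicon: maps each word to a phone sequence.
    phones = [p for w in words for p in lexicon[w]]
    # Acoustic model: trained on transcribed speech.
    acoustic = acoustic_model.log_likelihood(audio_features, phones)
    # Language model: can be trained separately on much larger text-only data.
    lm = language_model.log_prob(words)
    return acoustic + lm

def end_to_end_score(audio_features, words, e2e_model):
    """End-to-end scoring: a single network maps audio directly to text;
    pronunciations and the language model are implicit in its weights."""
    return e2e_model.log_prob(words, audio_features)
```

The practical point in the answer above is that the language-model term in the factored version can be trained on text alone, and the lexicon encodes the fact that spellings and pronunciations differ; an end-to-end model has to learn both from paired audio and text.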

Given how much attention and money are being invested in ASR, could progress actually accelerate?

I think the best we can hope for is that we’ll continue to make progress at the same rate that we have been making it recently. Much of the attention that’s being given to ASR is not the kind of attention that would contribute to progress, anyway. And some of the recent improvements in ASR have come from ideas that were developed largely by people not working in ASR: for instance, batch-norm, or RNNLMs. I would be very disappointed if progress stopped here, though.