Every day APM|MPR generates several hours of audio content for its radio and digital broadcasts. Over time that adds up to many terabytes of audio, most of which has no written transcripts available, because transcripts are expensive and slow to create. Imagine listening to the news and writing down everything you hear, word for word, then going back to error-check and verify the spelling of names and places. Now imagine doing that for hours every day. It’s tedious work.
Yet unless a transcript exists, there is really no way to search the audio later. Basic metadata, like key words, speaker names, title, date and time, might be available, but it won’t begin to represent the detail of a conversation in a radio interview.
In the last decade, speech-to-text technology has evolved to the point where we can start to imagine computers transcribing our news broadcasts and radio shows well enough to make them searchable. APM|MPR wanted to find out if the open source speech-to-text toolkits (also known as automatic speech recognition, or ASR) were mature enough to use for searching our audio archives.
We had modest goals. We knew that even commercial products, like Google Voice or Siri or even your company’s voicemail system, could vary widely in the quality and accuracy of their transcriptions. Previous, failed attempts to automate transcriptions at APM|MPR had a much loftier goal: we wanted to publish the results for our audiences. We decided for this prototype to scale back our ambitions and instead focus on keyword detection. Most search engines focus on nouns as being the most useful words. So we wanted to answer one question: could we develop an ASR system that could identify the most frequently used nouns or noun phrases in a piece of audio? The Knight Foundation agreed to help us fund a project to answer that question.
We partnered with some industry experts at Cantab Research, Ltd. who agreed to build the basic ASR tools for us, based on their extensive work with the open source software we wanted to evaluate. Cantab is led by Dr. Tony Robinson, a leader in the ASR field. Cantab would build the various acoustic and language training models required, as well as write the scripts for manipulating the ASR libraries. APM|MPR would build the testing scripts and processing infrastructure, including a web application for viewing and comparing transcripts.
Based on consultation with Cantab, we chose to focus on a comparison between two open source ASR libraries: Julius and Kaldi. We identified about a hundred hours of audio for which we had manually generated transcripts, and supplied the audio and text files to Cantab. Unfortunately many of the transcripts were not accurate enough, because they had been “cleaned up” for audience presentation, but Cantab was able to identify additional public domain audio and transcripts to flesh out the training collection and push on with the work.
Over the course of three months Cantab delivered five different iterations of the models and code. Each version got progressively faster and more accurate. Three of the iterations used Julius and two of them used Kaldi. In that way we were able to compare the two ASR libraries against one another using the same collection of testing material. In the end we were able to get comparable results with both libraries.
While Cantab was training the ASR models, we built a web application where users could register audio by URL and trigger a transcription for later delivery via email. The application was designed to process the queue of incoming audio using a variable number of machines so that it could scale linearly. The more machines we point at the queue, the faster the application can process the audio.
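The queue-and-workers idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's actual code: threads stand in for the separate machines, and `transcribe`, `worker`, `process` and the sample file names are hypothetical names invented for the example.

```python
import queue
import threading


def transcribe(url):
    """Hypothetical stand-in for the real ASR job; returns a placeholder string."""
    return f"transcript of {url}"


def worker(jobs, results):
    """Pull audio URLs off the shared queue until a None stop signal arrives."""
    while True:
        url = jobs.get()
        if url is None:
            break
        results.put((url, transcribe(url)))


def process(urls, num_workers):
    """Transcribe every URL using num_workers concurrent workers."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    # Every job has finished, so exactly len(urls) results are waiting.
    return dict(results.get() for _ in urls)


transcripts = process(["a.mp3", "b.mp3", "c.mp3"], num_workers=2)
```

Because every worker reads from the same queue, adding capacity only means raising `num_workers` (or, in a real deployment, pointing more machines at the queue); the code that submits jobs does not change.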
Each time Cantab delivered a new version of the ASR components, we re-ran our evaluation against our testing collection, using the web application we had developed. The testing collection was composed of the same 100 hours of audio and transcripts we had sent to Cantab originally. Our testing procedure looked like this:
- Generate a machine transcript automatically
- Apply a part-of-speech tagger and extract the nouns and noun phrases, sorted by frequency, for both machine and human transcripts
- Compare the machine and human word lists
The testing scripts consistently found 85-100% of the same key words in both the machine and human transcripts, as long as frequency was ignored. When frequency was taken into account, the overlap dropped to 50-70%. That told us the machine transcripts were accurate enough, most of the time, to surface the key words, even if they couldn’t be relied upon to identify those words every time they appeared. That felt “good enough” to us to pursue this route, since frequency in full-text search is typically used only to affect rankings, not inclusion, within a result set.
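The difference between the two measurements can be illustrated with a short Python sketch. It is not the project's testing script: it assumes the nouns have already been extracted by a part-of-speech tagger, and the function names, the sample word lists and the particular frequency-weighted formula (matched occurrences divided by occurrences in the human transcript) are assumptions made for the example.

```python
from collections import Counter


def keyword_overlap(human_nouns, machine_nouns):
    """Share of distinct human key words also found by the machine (frequency ignored)."""
    human = set(human_nouns)
    if not human:
        return 0.0
    return len(human & set(machine_nouns)) / len(human)


def weighted_overlap(human_nouns, machine_nouns):
    """Share of human key word occurrences matched by the machine (frequency counted)."""
    human = Counter(human_nouns)
    machine = Counter(machine_nouns)
    total = sum(human.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared word.
    matched = sum((human & machine).values())
    return matched / total


# Made-up word lists: the machine hears every key word, but not every occurrence.
human = ["senate", "senate", "senate", "budget", "budget", "governor"]
machine = ["senate", "budget", "governor"]

print(keyword_overlap(human, machine))   # 1.0
print(weighted_overlap(human, machine))  # 0.5
```

In this toy case the machine transcript contains every key word at least once, so a search would still find the audio, even though half of the individual occurrences were missed.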
Processing audio for ASR, even to test a single configuration setting, can be very time-consuming and resource-intensive, so we knew our schedule (six months) and budget for this project were aggressive. Still, our experience prototyping this project taught us several things, among them:
- Garbage in, garbage out. The accuracy of the ASR application is completely dependent on both the quality and quantity of training material we can provide. We would like to identify a much larger corpus of APM audio to use for improving our training models.
- Who said that? Identifying specific speakers (called “diarization”), such as the reporter versus the interviewee, could help us improve our search results, by allowing audiences to limit their searches to specific speakers.
- Cross your Ts. For search purposes we ignored capitalization, punctuation and sentence structure. If we spent some time maturing our language models and related scripts, we might be able to improve key word identification, particularly around proper nouns like people’s names.
- A little song, a little dance. Identifying sounds that are not human beings speaking, such as music or other sound effects, could use a lot more work.
We really enjoyed working on this project. APM|MPR would like to thank Cantab Research, particularly Dr. Robinson and Niranjani Prasad, who helped elucidate the mysteries of ASR systems, and the Knight Foundation and Knight Prototype Fund, whose financial support and encouragement made the project possible.
All our code is available under an MIT license at: