Every day APM|MPR generates several hours of audio content for its radio and digital broadcasts. Over time that adds up to many terabytes of audio, most of which has no written transcripts available, because transcripts are expensive and slow to create. Imagine listening to the news and writing down everything you hear, word for word, then going back to error-check and verify the spelling of names and places. Now imagine doing that for hours every day. It’s tedious work.

Yet unless a transcript exists, there is really no way to search the audio later. Basic metadata, like key words, speaker names, title, date and time, might be available, but it won’t begin to represent the detail of a conversation in a radio interview.

In the last decade, speech-to-text technology has evolved to the point where we can start to imagine computers transcribing our news broadcasts and radio shows well enough to make them searchable. APM|MPR wanted to find out whether open source speech-to-text toolkits (also known as automatic speech recognition, or ASR) were mature enough to use for searching our audio archives.

We had modest goals. We knew that even commercial products, like Google Voice or Siri or even your company’s voicemail system, could vary widely in the quality and accuracy of their transcriptions. Previous, failed attempts to automate transcriptions at APM|MPR had a much loftier goal: we wanted to publish the results for our audiences. For this prototype we decided to scale back our ambitions and focus instead on keyword detection. Most search engines treat nouns as the most useful words, so we wanted to answer one question: could we develop an ASR system that could identify the most frequently used nouns or noun phrases in a piece of audio? The Knight Foundation agreed to help us fund a project to answer that question.

We partnered with industry experts at Cantab Research Ltd., who agreed to build the basic ASR tools for us, based on their extensive work with the open source software we wanted to evaluate. Cantab is led by Dr. Tony Robinson, a leader in the ASR field. Cantab would build the various acoustic and language training models required, as well as write the scripts for manipulating the ASR libraries. APM|MPR would build the testing scripts and processing infrastructure, including a web application for viewing and comparing transcripts.

Based on consultation with Cantab, we chose to focus on a comparison between two open source ASR libraries: Julius and Kaldi. We identified about a hundred hours of audio for which we had manually generated transcripts, and supplied the audio and text files to Cantab. Unfortunately many of the transcripts were not accurate enough, because they had been “cleaned up” for audience presentation, but Cantab was able to identify additional public domain audio and transcripts to flesh out the training collection and push on with the work.

Over the course of three months Cantab delivered five different iterations of the models and code. Each version got progressively faster and more accurate. Three of the iterations used Julius and two of them used Kaldi. In that way we were able to compare the two ASR libraries against one another using the same collection of testing material. In the end we were able to get comparable results with both libraries.

While Cantab was training the ASR models, we built a web application where users could register audio by URL and trigger a transcription for later delivery via email. The application was designed to process the queue of incoming audio using a variable number of machines so that it could scale linearly. The more machines we point at the queue, the faster the application can process the audio.
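
The real code is in the repository linked at the end of this post; purely as an illustration of the shape, here is a minimal worker-loop sketch in Python, where the queue, transcriber, and mailer objects are hypothetical stand-ins for the actual components:

    import time

    def worker_loop(queue, transcriber, mailer):
        # Each machine runs one or more of these loops. Workers share nothing
        # but the queue itself, which is why throughput scales with the number
        # of machines pointed at it.
        while True:
            job = queue.claim_next()   # atomically claim the next audio URL
            if job is None:
                time.sleep(5)          # queue is empty; poll again shortly
                continue
            transcript = transcriber.run(job.audio_url)
            mailer.send(job.email, transcript)
            queue.mark_done(job)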

Each time Cantab delivered a new version of the ASR components, we re-ran our evaluation against our testing collection, using the web application we had developed. The testing collection was composed of the same 100 hours of audio and transcripts we had sent to Cantab originally. Our testing procedure looked like this:

  • Generate a machine transcript automatically
  • Apply a part-of-speech tagger and extract the nouns and noun phrases, sorted by frequency, for both machine and human transcripts
  • Compare the machine and human word lists (steps two and three are sketched below)
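
The actual testing scripts are in the repository linked below; as a rough sketch of step two, here is how the noun extraction might look in Python, using NLTK as a stand-in for whichever part-of-speech tagger you prefer (the file names are placeholders):

    from collections import Counter

    import nltk  # assumes NLTK's tokenizer and tagger models are installed

    def noun_frequencies(text):
        # Tokenize, part-of-speech tag, and keep only the nouns (NN* tags),
        # counting how often each one appears.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
        return Counter(nouns)

    with open("machine_transcript.txt") as f:  # placeholder file names
        machine_nouns = noun_frequencies(f.read())
    with open("human_transcript.txt") as f:
        human_nouns = noun_frequencies(f.read())

    print(machine_nouns.most_common(20))  # top key words in the machine transcript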

What we found was that the testing scripts consistently found 85-100% of the same key words in both the machine and human transcripts, as long as frequency was ignored. If frequency was weighted, the overlap dropped to 50-70%. What that told us was that the machine transcripts were accurate enough, most of the time, to surface the key words, even if they couldn’t be relied upon to identify those words every time they appeared. That feels “good enough” to us to pursue this route, since frequency in full-text search is typically used only to affect rankings, not inclusion, within a result set.
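
To make that distinction concrete, here is one way to score the overlap both ways, again a sketch of the idea (step three above) rather than our exact testing script; the inputs are the noun Counters from the earlier sketch:

    def keyword_overlap(machine, human, weighted=False):
        # Unweighted: what fraction of the human transcript's key words
        # appear at all in the machine transcript.
        if not weighted:
            return len(set(machine) & set(human)) / float(len(set(human)))
        # Weighted: credit each occurrence, so a word the human transcript
        # repeats five times but the machine finds once scores only 1/5.
        matched = sum(min(machine[w], count) for w, count in human.items())
        return matched / float(sum(human.values()))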

Processing audio for ASR, even to test a single configuration setting, can be very time-consuming and resource-intensive, so we knew our six-month schedule and budget for this project were aggressive. Still, our experience prototyping this project taught us several things, among them:

  • Garbage in, garbage out. The accuracy of the ASR application is completely dependent on both the quality and quantity of training material we can provide. We would like to identify a much larger corpus of APM audio to use for improving our training models.
  • Who said that? Identifying specific speakers (called “diarization”), such as the reporter versus the interviewee, could help us improve our search results, by allowing audiences to limit their searches to specific speakers.
  • Cross your Ts. For search purposes we ignored capitalization, punctuation and sentence structure. If we spent some time maturing our language models and related scripts, we might be able to improve key word identification, particularly around proper nouns like people’s names.
  • A little song, a little dance. Identifying sounds that are not human speech, such as music or other sound effects, could use a lot more work.

We really enjoyed working on this project. APM|MPR would like to thank Cantab Research, particularly Dr. Robinson and Niranjani Prasad, who helped elucidate the mysteries of ASR systems, and the Knight Foundation and Knight Prototype Fund, whose financial support and encouragement made the project possible.

All our code is available under an MIT license at:

https://github.com/APMG/audio-search


If you’re a die-hard fan of MPR News, you may have noticed some new page layouts on our site recently. We have been working hard on our search and grouping tools that allow us to generate these pages. We call these groups collections: pages that list and link to other pieces of content, almost always news stories and/or audio segments.

Collections aren’t generally pages that get much traffic or attention from audiences or search engines. But they can occasionally serve a few useful purposes. First, for a small (and I mean small) subset of our visitors, they are highly utilitarian pages that allow browsing and refining by topic, where search doesn’t work well. Second, collection pages are useful for grouping highly focused stories together when there is a lot of coverage happening over a relatively short period of time, for example the Franken/Coleman recount or the 35W bridge collapse.

The other important consideration for us is that our homepage is essentially a collection page, but the collection it “searches” is every story we create. What we’re building for our low traffic collection pages will be hugely important for building our new homepage.

Old collection pages: Bad

On our old site, we have had very little standardization of our visual design, and this was most evident on our collection, project, and episode pages. Each collection and project page was a special little flower, crafted and cared for during its special little moment, then left to wither and die. Here are a few examples:

A non-responsive mess.

Collection pages on our old site were not conceived in a time when smartphones, tablets, and retina screens were a thing. But today they are, and our pages obviously need to deal with that. Most of the old pages feature too many images or images that are too small, type that can only be read with a magnifying glass, and little standardization between collections. Not only is this a maintenance nightmare for web developers and unresponsive to device capabilities, it is confusing and scattershot for our audiences.

New collection pages: Less bad

Our new collections have a range of visual and utility options that can be turned on and off by editors, depending on the needs of the story. Some collections can just be a basic list of stories; others need an introduction paragraph or two, a sidebar with links to evergreen interactive tools, and/or a listing of people who worked on a project. During the design process, I called them basic, fancy, fanciest, and visual. Here are a few of the various flavors now in use on our live site:

We severely constrained the design options that can be changed on these pages by non-developers. The only visual flourish that can be manipulated by an editor is the title for a collection, such as on The Daily Circuit collection page. We’re still working on the actual implementation for the fanciest and major topic/section pages, but the visual design is solidified. You can preview our high fidelity mockups in InVision.

A few nerdy details

The configuration for these pages is read from a JSON config file that lets editors change text and turn on/off options. Right now, these JSON files are hard-coded, but we’re working on a general-purpose configuration management tool that can be used by all our websites for just this type of situation.
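
As a purely hypothetical example of the shape (these field names are made up, not our real schema), a collection config might look something like:

    {
      "title": "The Daily Circuit",
      "flavor": "fancy",
      "intro": "An introductory paragraph or two...",
      "show_sidebar": true,
      "show_credits": false
    }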

The listings of stories on these pages are a little more complex. We have an internal tool we call “The Barn”, built for the PMP and based on Dezi, that indexes and normalizes content from our various CMSes in real time (every animal/CMS is welcome in the barn, even the stinky & bad tasting ones). We have another tool, called Meeker, that allows us to save queries, and some metadata about them, to The Barn. Both The Barn and Meeker spit out JSON, which makes them effectively language/platform agnostic for our web apps. The mprnews.org web app is essentially the VC in MVC, relying on RESTy JSON models on other servers for its data.
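
In practice, that means rendering a page boils down to a couple of HTTP GETs against those JSON services. A Python illustration, with a made-up endpoint and made-up response fields:

    import json
    from urllib.request import urlopen

    # Hypothetical Meeker-style saved-query endpoint; the real URLs and
    # response fields are internal and will differ.
    resp = urlopen("https://api.example.org/meeker/queries/top-news.json")
    stories = json.loads(resp.read().decode("utf-8"))["results"]

    for story in stories:
        print(story["title"], story["url"])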

My geekier colleagues will hopefully be sharing more details about how Meeker, The Barn, and our other REST API systems work and allow us to evolve our CMSes.

Today we have launched our new weather pages for MPR News, sporting a new design, improved weather data, and geolocation. If your browser supports it, we will attempt to give you the most accurate forecast for the location where you are.

Our old weather pages were very text heavy. We’ve revamped that with more relevant visualizations of the upcoming weather. For the next 48 hours, we show the sky conditions and a quick text description of each day’s forecast. We also show a handy line graph of the temperature swing, highlighting the highs and lows and when they’ll happen.


Our icons are also new, and support retina devices. Looking at other weather pages, flat icons like Meteocons appear to be all the rage. I don’t think those icons communicate the range of weather conditions, or the differences between night and day, terribly well, especially at smaller sizes. Our new icons are an evolution of icons I previously created and these icons by Tobias Wiedenmann. For something as vibrant as weather, color, depth, and texture were tools we didn’t want to abandon.

7 day forecast shows a heat wave.

For the longer term 7-day forecast, we show the temperature range for the day and, if your device’s screen can fit it, when the high/low temps for that day are going to happen. We also show the average high/low temps for the day, if available for your location (more on this below).

We also include a link and blurb on the latest forecast from Updraft. Great data aside, it’s important to have a skilled meteorologist interpret it and help us peek around the corner for signs of hope and/or gloom.

Oh noes! A blizzard in Grand Forks!

When severe weather is happening, we also display prominent alerts from the NWS. If we’re running a live blog, we’ll also have prominent links to that for up-to-the-minute storm coverage.

With this new page, we have replaced Weather Underground with Forecast.io. Our meteorologists generally prefer National Weather Service data for forecasts, but Weather Underground hasn’t tracked NWS data quite as closely as we’d like over the years. Forecast.io tracks the NWS LAMP data very closely in the US, which makes our meteorologists happy. Forecast.io also has an excellent API, which makes developers happy, and reasonable pay-as-you-go pricing, which makes the bean counters happy.

Forecast.io data source tracking.

There are two things with Forecast.io that we have to work around or augment. First, the API response doesn’t include the trend for atmospheric pressure: rising, falling, or steady. To work around this, we make a second API call asking for the conditions 3 hours ago, then compare the barometer readings. The change (or lack of one) tells us the pressure trend. Weather nerds know that atmospheric pressure is important for understanding coming weather patterns.
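
Here is roughly what that dance looks like in Python; the API key and coordinates are placeholders, and the threshold for calling the pressure “steady” is illustrative, not our production value:

    import json
    import time
    from urllib.request import urlopen

    API_KEY = "your-api-key"        # placeholder
    LAT, LNG = 44.9778, -93.2650    # Minneapolis, for illustration

    def pressure_at(unix_time=None):
        # Current conditions by default; pass a past unix timestamp to use
        # the API's "time machine" form (lat,lng,time).
        url = "https://api.forecast.io/forecast/%s/%s,%s" % (API_KEY, LAT, LNG)
        if unix_time is not None:
            url += ",%d" % unix_time
        data = json.loads(urlopen(url).read().decode("utf-8"))
        return data["currently"]["pressure"]  # millibars

    delta = pressure_at() - pressure_at(int(time.time()) - 3 * 3600)
    trend = "steady" if abs(delta) < 1.0 else ("rising" if delta > 0 else "falling")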

Second, Forecast.io doesn’t provide the average high and low temps for a given location. We have retrieved the 30-year ‘normals’ from the National Climatic Data Center and built a little system to retrieve them from a handful of CSV files. NOAA makes this data available for the entire country as a series of 30 MB CSV files or via a very slow REST API, but we opted to just grab the handful of MN observation stations. We’ve hard-coded the coordinates of these stations (they don’t move) and then do some quick calculations to see if your weather location is near enough to one of our known locations. If it is, we show you the normals and how they compare to your location.
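
The “near enough” check is just a great-circle distance against the hard-coded station list. A sketch, with an illustrative subset of stations and an arbitrary cutoff:

    from math import asin, cos, radians, sin, sqrt

    # Illustrative subset; the real list covers more MN observation stations.
    STATIONS = {
        "MSP": (44.8831, -93.2289),
        "DLH": (46.8369, -92.1833),
        "RST": (43.9042, -92.4917),
    }

    def haversine_miles(lat1, lng1, lat2, lng2):
        # Great-circle distance between two points, in miles.
        lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
        return 3959 * 2 * asin(sqrt(a))  # 3959 = Earth's radius in miles

    def nearest_station(lat, lng, max_miles=30):  # cutoff is illustrative
        # Return the closest known station, or None if nothing is near enough.
        sid, dist = min(
            ((sid, haversine_miles(lat, lng, s_lat, s_lng))
             for sid, (s_lat, s_lng) in STATIONS.items()),
            key=lambda pair: pair[1])
        return sid if dist <= max_miles else None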

I have personally been using these weather pages for the past month and find them both useful and complicit in my discontent with our polar-vortex-fueled misery. Depending on the forecast (and how you’re feeling), the average highs and lows are inspiring or damning. The trend lines really help you know when might be the best time to take the dog for a walk.

Any feedback or issues are always welcome.