As requested by my colleague, here are some nerdy details about the flow of data driving mprnews.org and other MPR|APM websites.

Origins

The Barn is the central internal search engine and content aggregator within MPR|APM. Here’s how it came to be.

A few years ago I went through a period of reading and re-reading Charlotte’s Web to my kids. I loved the metaphor of the barn as a place for everything and everything in its place.

Around the same time I had grown dissatisfied with the state of search within the company. There was no single place where I could go and find everything that the company had ever produced. Google knew more about our content than we did. That seemed wrong to me.

I also knew that we would soon be faced with a big project called the Public Media Platform, which would involve standardizing the metadata and structure of our content for publishing to the central PMP repository. That meant I needed to learn about all the different CMSes at work within the company, a non-trivial task since we have at least these:

  • Itasca (homegrown, powers mprnews.org)
  • Drupal (powers marketplace.org, splendidtable.org)
  • WordPress (powers MPR blogs (including this one), witsradio.org, dinnerpartydownload.org and others)
  • Teamsite (static files with SSI (!))
  • Eddy (our digital archive)
  • SCPRv4 (powers scpr.org)

In Barn parlance, each of those CMS data silos is an “origin”.

I had spent the three years prior working on the Public Insight Network, particularly in building a robust search feature for the PIN’s main tool. Out of that work came Dezi, a search platform similar to Elasticsearch but based on Apache Lucy rather than Apache Lucene (which underlies both Elasticsearch and Solr). So I knew what tool I wanted to use to search the content.

Aggregation

Before I could use Dezi, though, I needed to tackle a much harder problem: aggregating all those origins into a single location. I cobbled together a system based on the following criteria:

  • If I had direct access to the backend storage of the origin (usually MySQL) I would talk directly to the backend.
  • If I had access only to the public-facing website, I would crawl the site periodically based on a feed or sitemap (see the sketch after this list).
  • If I had no feed or sitemap, I would make a best effort based on a traditional web crawler approach.
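
For the feed/sitemap case, the crawl loop is conceptually simple. Here is a minimal sketch using LWP::UserAgent and WWW::Sitemap::XML (both listed below); the sitemap URL and the save_document() helper are made-up stand-ins for the real Barn plumbing, not the actual code:

  use strict;
  use warnings;
  use LWP::UserAgent;
  use WWW::Sitemap::XML;

  # Hypothetical origin sitemap; the real Barn reads these from per-origin config.
  my $sitemap_url = 'http://www.example.org/sitemap.xml';

  my $ua  = LWP::UserAgent->new( agent => 'barn-crawler' );
  my $map = WWW::Sitemap::XML->new;
  $map->load( location => $sitemap_url );

  for my $url ( $map->urls ) {
      my $resp = $ua->get( $url->loc );
      next unless $resp->is_success;
      save_document( $url->loc, $resp->decoded_content );
  }

  # Stand-in for the Barn code that normalizes a fetched page and writes it
  # out as XML for the indexer; here it just dumps the raw HTML to disk.
  sub save_document {
      my ( $loc, $html ) = @_;
      ( my $name = $loc ) =~ s{[^\w.-]+}{_}g;
      open my $fh, '>:encoding(UTF-8)', "$name.html" or die "$name.html: $!";
      print {$fh} $html;
      close $fh;
  }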

Since my go-to language is Perl, I ended up using the following CPAN modules:

  • Rose::DB::Object: a high-performance ORM that can interrogate a database and derive its schema automatically (a quick sketch of this follows the list)
  • Rose::DBx::Garden::Catalyst, CatalystX::CRUD::Controller::REST and Catalyst: MVC web framework for managing the Barn database, scheduler and logs
  • LWP::UserAgent and WWW::Sitemap::XML: web crawling tools
  • Search::Tools: my own module, essential for massaging things like character encodings, XML/HTML/JSON transformations, and the like
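
For the ORM piece, Rose::DB::Object::Loader can introspect an origin database and generate classes on the fly. A minimal sketch; the DSN, credentials, table and column names here are all hypothetical:

  use strict;
  use warnings;
  use Rose::DB::Object::Loader;

  # Hypothetical connection details for one origin's MySQL backend.
  my $loader = Rose::DB::Object::Loader->new(
      db_dsn       => 'dbi:mysql:database=itasca;host=db.example.org',
      db_username  => 'barn',
      db_password  => 'secret',
      class_prefix => 'Barn::Origin::Itasca',
  );

  # Interrogate the database and build Rose::DB::Object classes for every table.
  $loader->make_classes;

  # Fetch a row without having written any class definitions by hand.
  # (The "features" table and its "id" and "title" columns are assumptions.)
  my $feature = Barn::Origin::Itasca::Feature->new( id => 123 )->load;
  print $feature->title, "\n";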

Throughout the day, cron jobs keep the aggregated content in sync with the various origins and marshal it all into XML documents on local disk. An indexer sweeps through, finds any new XML documents and incrementally updates the Dezi indexes.
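
The indexer sweep itself is little more than a directory walk. A rough sketch, assuming the Dezi::Client module and a local Dezi server on port 5000; the paths and the notion of a “last run” timestamp are illustrative only:

  use strict;
  use warnings;
  use File::Find;
  use Dezi::Client;

  # Hypothetical locations; the real Barn derives these from its config.
  my $xml_root = '/opt/barn/xml/itasca-features';
  my $last_run = time() - 3600;    # pretend the last sweep ran an hour ago

  my $client = Dezi::Client->new( server => 'http://localhost:5000' );

  # Find XML documents newer than the last sweep and push each one to Dezi.
  find(
      sub {
          return unless -f $_ && /\.xml$/;
          return unless ( stat $_ )[9] > $last_run;    # compare mtime
          $client->index($File::Find::name);
      },
      $xml_root
  );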

I create a single index for each origin+content-type, so there’s a separate index for Itasca-Features and Itasca-Audio and Itasca-Images. Maintaining separate indexes makes it much easier to create distinct endpoints within Dezi for limiting search to particular content subsets, whether by origin or type. It also helps with scaling, by sharding the search across multiple filesystems and machines.
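
Because each origin+content-type pair gets its own index, limiting a search is just a matter of which endpoint you hit. A hedged sketch; the endpoint URL and the response keys are hypothetical, standing in for however the Dezi server happens to be configured:

  use strict;
  use warnings;
  use LWP::UserAgent;
  use JSON;
  use URI;

  # Hypothetical per-index endpoint: one Dezi search URL per origin+type.
  my $uri = URI->new('http://localhost:5000/itasca-features/search');
  $uri->query_form( q => 'bridge collapse' );

  my $resp = LWP::UserAgent->new->get($uri);
  die $resp->status_line unless $resp->is_success;

  my $payload = decode_json( $resp->decoded_content );

  # The "results" and "title" keys are assumptions for this example.
  print $_->{title}, "\n" for @{ $payload->{results} || [] };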

Creative re-purposing

Once the Barn system was up and humming along its automated path, we started to see other uses for it besides just search. Since all our content is now normalized into a standard format and metadata schema, we can create ad hoc collections of content across multiple origins. Since the Barn knows how to talk to the origin backend databases, in real-time, we can de-normalize and cache assets (like MPR News stories) that do not change very often but which can be expensive (in terms of SQL queries) to generate. And since we now have all our content in one place, we can re-distribute it wherever we want, automatically.

Here, for example, is a command for exporting Barn content to the PMP:

  % perl bin/barn-export -e PMP 'origin:marketplace date:today'  

That pushes all of today’s Marketplace content to the PMP. It runs on a cron job via the Barn’s scheduler system. An export is just a search query, so we could also do something like:

   % perl bin/barn-export -e PMP 'sausage politics origin:feature'  

That pushes every story in the feature origin mentioning the keywords ‘sausage’ and ‘politics’ to the PMP. Pretty handy.

Future

The Barn has proven very helpful to our internal infrastructure and content delivery. That improves our audience experience in some direct ways (faster page load times, automating some tasks our content creators used to do manually). We’d also like to open up a subset of the Barn’s search functionality to our audiences, so that they can search across all our content at once, and preview audio and images inline within results, just like our reporters and editors can do today.

Every day APM|MPR generates several hours of audio content for its radio and digital broadcasts. Over time that adds up to many terabytes of audio, most of which has no written transcripts available, because transcripts are expensive and slow to create. Imagine listening to the news and writing down everything you hear, word for word, then going back to error-check and verify the spelling of names and places. Now imagine doing that for hours every day. It’s tedious work.

Yet unless a transcript exists, there is really no way to search the audio later. Basic metadata, like key words, speaker names, title, date and time, might be available, but it won’t begin to represent the detail of a conversation in a radio interview.

In the last decade, speech-to-text technology has evolved to the point where we can start to imagine computers transcribing our news broadcasts and radio shows well enough to make them searchable. APM|MPR wanted to find out if the open source speech-to-text (also known as automatic speech recognition (ASR)) toolkits were mature enough to use for searching our audio archives.

We had modest goals. We knew that even commercial products, like Google Voice, Siri or your company’s voicemail system, could vary widely in the quality and accuracy of their transcriptions. Previous, failed attempts to automate transcriptions at APM|MPR had a much loftier goal: we wanted to publish the results for our audiences. For this prototype we decided to scale back our ambitions and instead focus on keyword detection. Most search engines treat nouns as the most useful words, so we wanted to answer one question: could we develop an ASR system that could identify the most frequently used nouns or noun phrases in a piece of audio? The Knight Foundation agreed to help us fund a project to answer that question.

We partnered with industry experts at Cantab Research Ltd., who agreed to build the basic ASR tools for us, based on their extensive work with the open source software we wanted to evaluate. Cantab is led by Dr. Tony Robinson, a leader in the ASR field. Cantab would build the various acoustic and language training models required, as well as write the scripts for manipulating the ASR libraries. APM|MPR would build the testing scripts and processing infrastructure, including a web application for viewing and comparing transcripts.

Based on consultation with Cantab, we chose to focus on a comparison between two open source ASR libraries: Julius and Kaldi. We identified about a hundred hours of audio for which we had manually generated transcripts, and supplied the audio and text files to Cantab. Unfortunately, many of the transcripts were not accurate enough, because they had been “cleaned up” for audience presentation, but Cantab was able to identify additional public domain audio and transcripts to flesh out the training collection and push on with the work.

Over the course of three months Cantab delivered five different iterations of the models and code. Each version got progressively faster and more accurate. Three of the iterations used Julius and two of them used Kaldi. In that way we were able to compare the two ASR libraries against one another using the same collection of testing material. In the end we were able to get comparable results with both libraries.

While Cantab was training the ASR models, we built a web application where users could register audio by URL and trigger a transcription for later delivery via email. The application was designed to process the queue of incoming audio using a variable number of machines so that it could scale linearly. The more machines we point at the queue, the faster the application can process the audio.
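
We won’t go into the internals of that queue here, but the worker model is conceptually simple: each machine runs a loop that claims the next unprocessed job, runs the ASR pipeline on it, and mails the result. A very rough sketch; every name in it (the shared queue and work directories, the transcribe-and-mail.sh wrapper) is made up for illustration:

  use strict;
  use warnings;
  use File::Basename;

  # Hypothetical layout: audio jobs land in queue/, and a worker claims one by
  # renaming it into work/, so any number of machines can poll the same queue.
  my $queue_dir = '/shared/asr/queue';
  my $work_dir  = '/shared/asr/work';

  while (1) {
      my ($job) = glob "$queue_dir/*.job";
      if ( !defined $job ) {
          sleep 30;    # nothing waiting; poll again shortly
          next;
      }

      my $claimed = "$work_dir/" . basename($job);

      # rename() is atomic on a single filesystem, so only one worker wins the claim.
      next unless rename $job, $claimed;

      # transcribe-and-mail.sh stands in for the ASR pipeline plus the email
      # delivery step; both names are placeholders for this sketch.
      system( './transcribe-and-mail.sh', $claimed ) == 0
          or warn "processing failed for $claimed\n";
  }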

Each time Cantab delivered a new version of the ASR components, we re-ran our evaluation against our testing collection, using the web application we had developed. The testing collection was composed of the same 100 hours of audio and transcripts we had sent to Cantab originally. Our testing procedure looked like:

  • Generate a machine transcript automatically
  • Apply a part-of-speech tagger and extract the nouns and noun phrases, sorted by frequency, for both machine and human transcripts
  • Compare the machine and human word lists (a rough sketch of these last two steps follows this list)
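
For reference, here is roughly what the last two steps can look like in Perl, using the Lingua::EN::Tagger CPAN module (not necessarily the tagger we used) and looking only at single nouns; the transcript file names and the simple unweighted overlap metric are illustrative:

  use strict;
  use warnings;
  use Lingua::EN::Tagger;

  # The transcript file names are hypothetical; read_text() is a simple slurp.
  my $human   = read_text('human_transcript.txt');
  my $machine = read_text('machine_transcript.txt');

  my $tagger = Lingua::EN::Tagger->new;

  # get_nouns() expects POS-tagged text and returns noun => frequency.
  my %human_nouns   = $tagger->get_nouns( $tagger->add_tags($human) );
  my %machine_nouns = $tagger->get_nouns( $tagger->add_tags($machine) );

  # Unweighted overlap: how many of the human key words did the machine find?
  my @human_keys = keys %human_nouns;
  die "no nouns found in the human transcript\n" unless @human_keys;

  my $overlap = grep { exists $machine_nouns{$_} } @human_keys;

  printf "matched %d of %d key words (%.0f%%)\n",
      $overlap, scalar @human_keys, 100 * $overlap / @human_keys;

  sub read_text {
      my ($file) = @_;
      open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";
      local $/;
      return scalar <$fh>;
  }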

We found that the testing scripts consistently identified 85-100% of the same key words in both the machine and human transcripts, as long as frequency was ignored. If frequency was weighted, the overlap dropped to 50-70%. What that told us was that the machine transcripts were accurate enough, most of the time, to surface the key words, even if they couldn’t be relied upon to identify those words every time they appeared. That felt “good enough” to us to pursue this route, since frequency in full-text search is typically used only to affect rankings, not inclusion, within a result set.

Processing audio for ASR, even to test a single configuration setting, can be very time-consuming and resource-intensive, so we knew we had an aggressive schedule (six months) and budget for this project. Still, our experience prototyping this project taught us several things, among them:

  • Garbage in, garbage out. The accuracy of the ASR application is completely dependent on both the quality and quantity of training material we can provide. We would like to identify a much larger corpus of APM audio to use for improving our training models.
  • Who said that? Identifying specific speakers (called “diarization”), such as the reporter versus the interviewee, could help us improve our search results, by allowing audiences to limit their searches to specific speakers.
  • Cross your Ts. For search purposes we ignored capitalization, punctuation and sentence structure. If we spent some time maturing our language models and related scripts, we might be able to further improve key word identification, particularly around proper nouns like people’s names.
  • A little song, a little dance. Identifying sounds that are not human beings speaking, such as music or other sound effects, could use a lot more work.

We really enjoyed working on this project. APM|MPR would like to thank Cantab Research, particularly Dr. Robinson and Niranjani Prasad, who helped elucidate the mysteries of ASR systems, and the Knight Foundation and Knight Prototype Fund, whose financial support and encouragement made the project possible.

All our code is available under an MIT license at:

https://github.com/APMG/audio-search


If you’re a die-hard fan of MPR News, you may have noticed some new page layouts on our site recently. We have been working hard on our search and grouping tools that allow us to generate these pages. We call these groups collections: pages that list and link to other pieces of content, almost always news stories and/or audio segments.

Collections aren’t generally pages that get much traffic or attention from audiences or search engines. But they can occasionally serve a few useful purposes. First, for a small (and I mean small) subset of our visitors, they are highly utilitarian pages that allow browsing and refining by topic, where search doesn’t work well. Second, collection pages are useful for grouping highly focused stories together at times when there is a lot of coverage happening in a relatively short period of time: for example, coverage of the Franken/Coleman recount or the 35W bridge collapse.

The other important consideration for us is that our homepage is essentially a collection page, but the collection it “searches” is every story we create. What we’re building for our low-traffic collection pages will be hugely important for building our new homepage.

Old collection pages: Bad

On our old site, we have had very little standardization of our visual design, and this was most evident on our collection, project, and episode pages. Each collection and project page was a special little flower, crafted and cared for during its special little moment, then left to wither and die. Here are a few examples:

A non-responsive mess.

Collection pages on our old site were not conceived in a time when smartphones, tablets, and retina screens were a thing. But today they are, and our pages obviously need to deal with that. Most of the old pages feature too many images or images that are too small, type that can only be read with a magnifying glass, and little standardization between collections. Not only is this a maintenance nightmare for web developers and unresponsive to device capabilities, it is confusing and scattershot for our audiences.

New collection pages: Less bad

Our new collections have a range of visual and utility options that can be turned on and off by editors, depending on the needs of the story. Some collections can just be a basic list of stories; others need an introduction paragraph or two, a sidebar with links to evergreen interactive tools, and/or a listing of people who worked on a project. During the design process, I called them basic, fancy, fanciest, and visual. Here are a few of the various flavors now in use on our live site:

We severely constrained the design options that can be changed on these pages by non-developers. The only visual flourish that can be manipulated by an editor is the title for a collection, such as on The Daily Circuit collection page. We’re still working on the actual implementation for the fanciest and major topic/section pages, but the visual design is solidified. You can preview our high-fidelity mockups in InVision.

A few nerdy details

The configuration for these pages is read from a JSON config file that lets editors change text and turn on/off options. Right now, these JSON files are hard-coded, but we’re working on a general-purpose configuration management tool that can be used by all our websites for just this type of situation.
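
The config itself is nothing exotic. A hypothetical example, with made-up field names and values, for a “fancy” collection that turns on an intro paragraph and a contributor list:

  {
    "title": "The Daily Circuit",
    "style": "fancy",
    "show_intro": true,
    "intro_html": "<p>Introductory paragraph goes here.</p>",
    "show_contributors": true,
    "contributors": ["Jane Q. Producer", "John Q. Reporter"],
    "sidebar_links": [
      { "label": "Interactive: candidate selector", "url": "http://www.example.org/candidate-selector" }
    ]
  }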

The listings of stories on these pages are a little more complex. We have an internal tool we call “The Barn”, built for the PMP and based on Dezi, which indexes and normalizes content from our various CMSes in real time (every animal/CMS is welcome in the barn, even the stinky & bad-tasting ones). We have another tool, called Meeker, that allows us to save queries, and some metadata about those queries, to The Barn. Both The Barn and Meeker spit out JSON, which makes them effectively language/platform agnostic for our web apps. The mprnews.org web app is essentially the VC in MVC, relying on RESTy JSON models on other servers for the data.
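
To make that concrete, the view layer just does something like the following; the Meeker hostname, query path and response keys here are all hypothetical stand-ins for whatever Meeker and The Barn actually expose:

  use strict;
  use warnings;
  use LWP::UserAgent;
  use JSON;

  # Hypothetical Meeker endpoint for a saved collection query.
  my $resp = LWP::UserAgent->new->get(
      'http://meeker.example.org/api/queries/daily-circuit.json');
  die $resp->status_line unless $resp->is_success;

  my $collection = decode_json( $resp->decoded_content );

  # The "results", "title" and "url" keys are assumptions for this sketch.
  for my $story ( @{ $collection->{results} || [] } ) {
      print "$story->{title} => $story->{url}\n";
  }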

My geekier colleagues will hopefully be sharing more details about how Meeker, The Barn, and our other REST API systems work, and how they allow us to evolve our CMSes.