The new mprnews.org on a tablet.

Earlier this week, we made public the new homepage for MPR News. This is the final big piece of our ongoing responsive redesign of the site. Technology-wise, every system and component on the homepage has already been put to use on the topic pages or story pages. But the homepage is a very visible and important design change.

Old and tired.

The biggest problem we were trying to solve is that our old page didn’t work well on a mobile device. Today, about 40% of our total traffic comes from mobile devices. That’s a lot, and to remain relevant to that growing percentage, we can’t afford to offer a bad experience, and we should aim for a good one.

The last redesign of MPR News was done in 2008, before responsive websites were really a thing and mobile websites were only just starting to pop up. In addition to not being mobile-friendly, the old homepage had numerous other substantial problems:

  • The type was too small and without hierarchy.
  • There were too many topical sections that all looked alike.
  • Some testing showed that few visitors (under 25%) scrolled past the “blog box”.
  • There were so many different links and elements on the page that it was too much to practically take in and decipher.

To design the new homepage, we formed a small group of invested parties, the core group of which was Digital News Director Jon Gordon, Product Director Peter Rasmussen, and myself. We started by making a list of the things that we wanted to be on the new homepage. Designing a page to work well on a mobile device means you need to focus on the things that are relevant to someone with a limited screen size. We settled on the following things, which neatly explain our final design:

  • News stories that editors can adjust in order and prominence
  • NewsCut, Updraft, and the weather forecast are important and well loved by our audiences
  • Today’s Question needed to make an appearance when relevant, as decided by editors
  • We put on news-related events, and those needed to show up, but not as ads
  • Links to the major sections of the site for more focused news
  • Most viewed is very popular, and we wanted that to stand out more
  • We do excellent photos and video, and wanted that to stay omnipresent, but not huge
  • The radio schedule should be present, since we are, after all, a radio service
  • More links to find us in other places: social media, our apps, podcasts, and email
  • Audio everywhere, because we create great audio

Much like our section fronts, we settled on a three-column layout. Unlike the section fronts, the persistent column moves depending on screen size: on desktops and larger screens it sits on the left; on tablets and medium screens it moves to the right. We debated this, but ultimately liked it on tablets because it puts the latest news furthest to the left, which felt most appropriate. On phones, everything shifts to a single column, with the news stories first.

When we display the news stories, we default to reverse-chronological order of our latest stories, but editors can and do override that to put the more important and noteworthy stories at the top of the heap. This listing of stories integrates content from our internal CMS (Itasca), our blogs, and the PMP, through our internal search normalizer, The Barn. In addition to ordering, there are five different levels of prominence a story can be given. They are:

  • Level 0: Just the headline
  • Level 1: Headline slightly larger, thumbnail image, and a short description. This is a “described story”
  • Level 2: Headline larger yet, larger image, and the short description. This is a “promoted story”
  • Level 3: Much bigger headline, short description, no image, goes across both columns on tablet and desktop screens. We probably won’t use this very often. This is a “blowout story”
  • Level 4: Just like the blowout story, but with an even bigger headline. Think “Dewey Defeats Truman”. This is a “super blowout”.

In addition to these levels, editors can turn on or off the date stamps and add labels, e.g. “BREAKING NEWS”, above headlines.
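The five levels plus the per-story editor toggles can be pictured as a small lookup table. This is an illustrative sketch in Python, not the actual CMS schema; all field and function names here are assumptions.

```python
# Hypothetical model of the five prominence levels. Field names are
# illustrative, not the real homepage code.
STORY_LEVELS = {
    0: {"name": "headline-only", "image": None,    "description": False, "span_columns": False},
    1: {"name": "described",     "image": "thumb", "description": True,  "span_columns": False},
    2: {"name": "promoted",      "image": "large", "description": True,  "span_columns": False},
    3: {"name": "blowout",       "image": None,    "description": True,  "span_columns": True},
    4: {"name": "super-blowout", "image": None,    "description": True,  "span_columns": True},
}

def render_flags(level, breaking=False, show_date=True):
    """Combine a story's prominence level with per-story editor toggles."""
    flags = dict(STORY_LEVELS[level])
    flags["label"] = "BREAKING NEWS" if breaking else None
    flags["date_stamp"] = show_date
    return flags
```

Keeping the levels as data rather than templates means the editor UI and the renderer can share one definition.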

Examples of the five story prominence levels.

We’ve also fully switched to using Franklin Gothic Demi Condensed as our headline typeface, and use Franklin Gothic Medium in some places as well. As any newspaper designer knows, using a condensed font allows more characters to fit into a line, a consideration that is particularly important on smaller phone-sized screens. The MPR News logotype is Akzidenz Grotesk, but Akzidenz is not easy to license as a webfont. Franklin is easier to license and is a close relative of Akzidenz, so it suits our needs. This change to Franklin Gothic now propagates to all the pages on the site, including the stories, topics, and section fronts.

One element I particularly like is the new schedule. We are a radio station, and the schedule serves a very utilitarian and necessary function of informing the audience when shows are going to be on. It was surprisingly difficult to find on our old site. With the new homepage, the schedule will move to the top of the page on the weekends, when the news slows down somewhat and the programs differ from the weekday lineup. It is a carousel, which is somewhat taboo for mobile, but Slick works fairly well for our limited, text-based implementation.

We still have some work left to do on the homepage and mprnews.org: our show pages aren’t fully migrated to the new layout; our media player and playlist system needs to be re-worked to use websockets; there is an election coming up… The list goes on, and a website is never truly finished (well, maybe). But we are in a better place for more of our visitors than we were a year ago when we started this project.

We know everyone won’t agree with all the choices that we’ve made, and we know we’re not perfect. Please feel free to share your thoughts on our design here in the comments, or use the feedback forum we’ve set up.

As requested by my colleague, here are some nerdy details about the flow of data driving mprnews.org and other MPR|APM websites.

Origins

The Barn is the central internal search engine and content aggregator within MPR|APM. Here’s how it came to be.

A few years ago I went through a period of reading and re-reading Charlotte’s Web to my kids. I loved the metaphor of the barn as a place for everything and everything in its place.

Around the same time I had grown dissatisfied with the state of search within the company. There was no single place where I could go and find everything that the company had ever produced. Google knew more about our content than we did. That seemed wrong to me.

I also knew that we would soon be faced with a big project called the Public Media Platform, which would involve standardizing the metadata and structure of our content for publishing to the central PMP repository. That meant I needed to learn about all the different CMS systems at work within the company, a non-trivial task since we have at least these:

  • Itasca (homegrown, powers mprnews.org)
  • Drupal (powers marketplace.org, splendidtable.org)
  • WordPress (powers MPR blogs (including this one), witsradio.org, dinnerpartydownload.org and others)
  • Teamsite (static files with SSI (!))
  • Eddy (our digital archive)
  • SCPRv4 (powers scpr.org)

In Barn parlance, each of those CMS data silos is an “origin”.

I had spent the three years prior working on the Public Insight Network, particularly in building a robust search feature for the PIN’s main tool. Out of that work came Dezi, a search platform similar to Elasticsearch but based on Apache Lucy rather than (like Elasticsearch and Solr) Apache Lucene. So I knew what tool I wanted to use to search the content.

Aggregation

Before I could use Dezi, though, I needed to tackle a much harder problem: aggregating all those origins into a single location. I cobbled together a system based on the following criteria:

  • If I had direct access to the backend storage of the origin (usually MySQL) I would talk directly to the backend.
  • If I had access only to the public-facing website, I would crawl the site periodically based on a feed or sitemap.
  • If I had no feed or sitemap, I would make a best effort based on a traditional web crawler approach.
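The three criteria above amount to a dispatch over each origin’s capabilities. Here is a minimal sketch in Python (the Barn itself is Perl); the origin fields and function name are assumptions for illustration.

```python
# Illustrative dispatcher for the three aggregation strategies.
# Field names ("db_dsn", "sitemap_url", "feed_url") are hypothetical.
def choose_strategy(origin):
    if origin.get("db_dsn"):
        # Direct access to the backend storage (usually MySQL)
        return "database"
    if origin.get("sitemap_url") or origin.get("feed_url"):
        # Public site only: crawl periodically, driven by feed/sitemap
        return "sitemap_crawl"
    # No feed or sitemap: best-effort traditional web crawler
    return "web_crawl"

origins = [
    {"name": "itasca",   "db_dsn": "dbi:mysql:itasca"},
    {"name": "scprv4",   "sitemap_url": "https://scpr.org/sitemap.xml"},
    {"name": "teamsite"},
]
```

Preferring the backend when available avoids re-parsing HTML you already control.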

Since my go-to language is Perl, I ended up using the following CPAN modules:

  • Rose::DB::Object: high-performance ORM that can interrogate a database and derive its schema automatically
  • Rose::DBx::Garden::Catalyst, CatalystX::CRUD::Controller::REST and Catalyst: MVC web framework for managing the Barn database, scheduler and logs
  • LWP::UserAgent and WWW::Sitemap::XML: web crawling tools
  • Search::Tools: my own module, essential for massaging things like character encodings, XML/HTML/JSON transformations, and the like

Throughout the day, cron jobs keep the aggregated content in sync with the various origins and marshal it all into XML documents on local disk. An indexer sweeps through, finds any new XML documents and incrementally updates the Dezi indexes.

I create a single index for each origin+content-type, so there’s a separate index for Itasca-Features and Itasca-Audio and Itasca-Images. Maintaining separate indexes makes it much easier to create distinct endpoints within Dezi for limiting search to particular content subsets, whether by origin or type. It also helps with scaling, by sharding the search across multiple filesystems and machines.
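The one-index-per-origin+content-type layout and the incremental sweep can be sketched as follows. This is a toy illustration in Python; the directory naming and function names are assumptions, not the Barn’s actual layout.

```python
# Sketch of per-origin+type index routing and an incremental sweep.
import os

def index_path(root, origin, content_type):
    """Each origin+type pair gets its own index directory,
    e.g. Itasca features land under <root>/itasca-features."""
    return os.path.join(root, f"{origin.lower()}-{content_type.lower()}")

def incremental_sweep(xml_docs, already_indexed):
    """Return only the documents the indexer hasn't seen yet."""
    return [d for d in xml_docs if d not in already_indexed]
```

Separate indexes also mean a query scoped to, say, Itasca-Audio never has to filter out other origins at search time.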

Creative re-purposing

Once the Barn system was up and humming along its automated path, we started to see other uses for it besides just search. Since all our content is now normalized into a standard format and metadata schema, we can create ad hoc collections of content across multiple origins. Since the Barn knows how to talk to the origin backend databases, in real-time, we can de-normalize and cache assets (like MPR News stories) that do not change very often but which can be expensive (in terms of SQL queries) to generate. And since we now have all our content in one place, we can re-distribute it wherever we want, automatically.
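The caching of rarely-changing but SQL-expensive assets might look roughly like this. A minimal time-based cache sketch in Python; the TTL, names, and shape of a “story” are all illustrative assumptions.

```python
# Minimal sketch of caching a de-normalized asset (e.g. an MPR News story).
# Stories change rarely, so a long TTL is acceptable; 1 hour is assumed here.
import time

_cache = {}
TTL = 3600

def cached_story(story_id, fetch):
    """Return a cached story, re-running the expensive de-normalization
    (many SQL queries against the origin backend) only after TTL expires."""
    entry = _cache.get(story_id)
    if entry and time.time() - entry[0] < TTL:
        return entry[1]
    story = fetch(story_id)
    _cache[story_id] = (time.time(), story)
    return story
```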

Here, for example, is a command for exporting Barn content to the PMP:

  % perl bin/barn-export -e PMP 'origin:marketplace date:today'  

That pushes all of today’s Marketplace content to the PMP. It runs on a cron job via the Barn’s scheduler system. An export is just a search query, so we could also do something like:

   % perl bin/barn-export -e PMP 'sausage politics origin:feature'  

That pushes every story mentioning the keywords ‘sausage’ and ‘politics’ to the PMP. Pretty handy.

Future

The Barn has proven very helpful to our internal infrastructure and content delivery. That improves our audience experience in some direct ways (faster page load times, automating some tasks our content creators used to do manually). We’d also like to open up a subset of the Barn’s search functionality to our audiences, so that they can search across all our content at once, and preview audio and images inline within results, just like our reporters and editors can do today.

Every day APM|MPR generates several hours of audio content for its radio and digital broadcasts. Over time that adds up to many terabytes of audio, most of which has no written transcripts available, because transcripts are expensive and slow to create. Imagine listening to the news and writing down everything you hear, word for word, then going back to error-check and verify the spelling of names and places. Now imagine doing that for hours every day. It’s tedious work.

Yet unless a transcript exists, there is really no way to search the audio later. Basic metadata, like key words, speaker names, title, date and time, might be available, but it won’t begin to represent the detail of a conversation in a radio interview.

In the last decade, speech-to-text technology has evolved to the point where we can start to imagine computers transcribing our news broadcasts and radio shows well enough to make them searchable. APM|MPR wanted to find out if the open source speech-to-text (also known as automatic speech recognition (ASR)) toolkits were mature enough to use for searching our audio archives.

We had modest goals. We knew that even commercial products, like Google Voice or Siri or even your company’s voicemail system, could vary widely in the quality and accuracy of their transcriptions. Previous, failed attempts to automate transcriptions at APM|MPR had a much loftier goal: we wanted to publish the results for our audiences. We decided for this prototype to scale back our ambitions and instead focus on keyword detection. Most search engines focus on nouns as being the most useful words. So we wanted to answer one question: could we develop an ASR system that could identify the most frequently used nouns or noun phrases in a piece of audio? The Knight Foundation agreed to help us fund a project to answer that question.

We partnered with some industry experts at Cantab Research, Ltd. who agreed to build the basic ASR tools for us, based on their extensive work with the open source software we wanted to evaluate. Cantab is led by Dr. Tony Robinson, a leader in the ASR field. Cantab would build the various acoustic and language training models required, as well as write the scripts for manipulating the ASR libraries. APM|MPR would build the testing scripts and processing infrastructure, including a web application for viewing and comparing transcripts.

Based on consultation with Cantab, we chose to focus on a comparison between two open source ASR libraries: Julius and Kaldi. We identified about a hundred hours of audio for which we had manually-generated transcripts, and supplied the audio and text files to Cantab. Unfortunately many of the transcripts were not accurate enough, because they had been “cleaned up” for audience presentation, but Cantab was able to identify additional public domain audio and transcripts to flesh out the training collection and push on with the work.

Over the course of three months Cantab delivered five different iterations of the models and code. Each version got progressively faster and more accurate. Three of the iterations used Julius and two of them used Kaldi. In that way we were able to compare the two ASR libraries against one another using the same collection of testing material. In the end we were able to get comparable results with both libraries.

While Cantab was training the ASR models, we built a web application where users could register audio by URL and trigger a transcription for later delivery via email. The application was designed to process the queue of incoming audio using a variable number of machines so that it could scale linearly. The more machines we point at the queue, the faster the application can process the audio.
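The scale-out pattern described above, a variable number of machines draining one shared queue, can be sketched in miniature. This single-process Python toy stands in for the real multi-machine system; the job format and worker count are illustrative.

```python
# Toy sketch of N workers draining one shared job queue. The real
# application distributed audio URLs across machines; this is one process.
import queue
import threading

def process(job):
    # Stand-in for the actual ASR transcription step.
    return f"transcript for {job}"

def run_workers(jobs, n_workers=4):
    q = queue.Queue()
    for j in jobs:
        q.put(j)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                job = q.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            out = process(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because workers only pull from the queue, adding more of them speeds up the drain without any re-partitioning, which is what makes the scaling roughly linear.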

Each time Cantab delivered a new version of the ASR components, we re-ran our evaluation against our testing collection, using the web application we had developed. The testing collection was composed of the same 100 hours of audio and transcripts we had sent to Cantab originally. Our testing procedure looked like:

  • Generate a machine transcript automatically
  • Apply a part-of-speech tagger and extract the nouns and noun phrases, sorted by frequency, for both machine and human transcripts
  • Compare the machine and human word lists
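The comparison step in the list above reduces to set math over the extracted noun lists, with and without frequency. This is an illustrative Python sketch, not the project’s actual testing scripts.

```python
# Overlap of machine vs. human key-word lists, ignoring or weighting
# frequency. Counter's & operator gives the multiset intersection.
from collections import Counter

def overlap_ignoring_frequency(machine_nouns, human_nouns):
    """Fraction of human key words that also appear in the machine list."""
    m, h = set(machine_nouns), set(human_nouns)
    return len(m & h) / len(h)

def overlap_with_frequency(machine_nouns, human_nouns):
    """Same, but each occurrence must be matched, not just each word."""
    m, h = Counter(machine_nouns), Counter(human_nouns)
    shared = sum((m & h).values())
    return shared / sum(h.values())
```

If the machine transcript catches “tax” once where the human transcript has it twice, the word still counts fully in the first measure but only half in the second, which is how the two overlap numbers can diverge.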

What we found was that the testing scripts consistently found 85-100% of the same key words in both the machine and human transcripts, as long as frequency was ignored. If frequency was weighted, the overlap dropped to 50-70%. What that told us was that the machine transcripts were accurate enough, most of the time, to surface the key words, even if they couldn’t be relied upon to identify those words every time they appeared. That feels “good enough” to us to pursue this route, since frequency in full-text search is typically used only to affect rankings, not inclusion, within a result set.

Processing audio for ASR, even to test a single configuration setting, can be very time-consuming and resource-intensive, so we knew we had an aggressive schedule (six months) and budget for this project. Still, our experience prototyping this project taught us several things, among them:

  • Garbage in, garbage out. The accuracy of the ASR application is completely dependent on both the quality and quantity of training material we can provide. We would like to identify a much larger corpus of APM audio to use for improving our training models.
  • Who said that? Identifying specific speakers (called “diarization”), such as the reporter versus the interviewee, could help us improve our search results, by allowing audiences to limit their searches to specific speakers.
  • Cross your Ts. For search purposes we ignored capitalization, punctuation and sentence structure. If we spent some time maturing our language models and related scripts, we might be able to better improve key word identification, particularly around proper nouns like people’s names.
  • A little song, a little dance. Identifying sounds that are not human beings speaking, such as music or other sound effects, could use a lot more work.

We really enjoyed working on this project. APM|MPR would like to thank Cantab Research, particularly Dr. Robinson and Niranjani Prasad, who helped elucidate the mysteries of ASR systems, and the Knight Foundation and Knight Prototype Fund, whose financial support and encouragement made the project possible.

All our code is available under an MIT license at:

https://github.com/APMG/audio-search