As requested by my colleague, here are some nerdy details about the flow of data driving mprnews.org and other MPR|APM websites.
The Barn is the central internal search engine and content aggregator within MPR|APM. Here’s how it came to be.
Around the same time I had grown dissatisfied with the state of search within the company. There was no single place where I could go and find everything that the company had ever produced. Google knew more about our content than we did. That seemed wrong to me.
I also knew that we would soon be faced with a big project called the Public Media Platform (PMP), which would involve standardizing the metadata and structure of our content for publishing to the central PMP repository. That meant I needed to learn about all the different content management systems (CMSs) at work within the company, a non-trivial task since we have at least these:
- Itasca (homegrown, powers mprnews.org)
- Drupal (powers marketplace.org, splendidtable.org)
- WordPress (powers MPR blogs (including this one), witsradio.org, dinnerpartydownload.org and others)
- Teamsite (static files with SSI (!))
- Eddy (our digital archive)
- SCPRv4 (powers scpr.org)
In Barn parlance, each of those CMS data silos is an “origin”.
I had spent the three years prior working on the Public Insight Network, particularly in building a robust search feature for the PIN’s main tool. Out of that work came Dezi, a search platform similar to Elasticsearch but based on Apache Lucy rather than (like Elasticsearch and Solr) Apache Lucene. So I knew what tool I wanted to use to search the content.
Before I could use Dezi, though, I needed to tackle a much harder problem: aggregating all those origins into a single location. I cobbled together a system based on the following criteria:
- If I had direct access to the backend storage of the origin (usually MySQL) I would talk directly to the backend.
- If I had access only to the public-facing website, I would crawl the site periodically based on a feed or sitemap.
- If I had no feed or sitemap, I would make a best effort based on a traditional web crawler approach.
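The three criteria above amount to a simple fallback chain. Here is a minimal sketch of that decision in Python (for illustration only; the Barn itself is written in Perl). The `Origin` class, its field names, and `pick_strategy` are hypothetical names invented for this example, not part of the Barn.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Origin:
    """Hypothetical description of one CMS silo (an "origin")."""
    name: str
    db_dsn: Optional[str] = None    # set only if the backend storage is reachable
    feed_url: Optional[str] = None  # feed or sitemap URL, if the site publishes one


def pick_strategy(origin: Origin) -> str:
    """Choose how to aggregate an origin, preferring the most direct access."""
    if origin.db_dsn:
        return "database"  # talk directly to the backend (usually MySQL)
    if origin.feed_url:
        return "feed"      # periodic crawl driven by a feed or sitemap
    return "crawl"         # best-effort traditional web crawl


if __name__ == "__main__":
    # Placeholder origins, not real Barn configuration.
    print(pick_strategy(Origin("example-a", db_dsn="mysql://example/db")))
    print(pick_strategy(Origin("example-b", feed_url="https://example.org/sitemap.xml")))
    print(pick_strategy(Origin("example-c")))
```

The ordering matters: a direct database connection is preferred even when a feed also exists, because it avoids scraping rendered pages.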
Since my go-to language is Perl, I ended up using the following CPAN modules:
- High-performance ORM that can interrogate a database and derive its schema automatically
- Rose::DBx::Garden::Catalyst, CatalystX::CRUD::Controller::REST and Catalyst: MVC web framework for managing the Barn database, scheduler and logs
- Web crawling tools
- My own module, essential for massaging things like character encodings, XML/HTML/JSON transformations, and the like.
Throughout the day, cron jobs keep the aggregated content in sync with the various origins and marshal it all into XML documents on local disk. An indexer sweeps through, finds any new XML documents and incrementally updates the Dezi indexes.
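One way to picture the indexer's "find any new XML documents" step is a sweep that compares file modification times against the time of the previous run. The Python sketch below assumes that approach; the post does not say how the Barn actually detects new documents, and `find_new_documents` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def find_new_documents(root: Path, last_run: float) -> List[Path]:
    """Return XML files under `root` modified after `last_run`.

    `last_run` is a Unix timestamp in seconds. Results are ordered oldest
    first so that documents are indexed in the order they were written.
    Assumption: modification time is a good enough signal for "new".
    """
    fresh = [p for p in root.rglob("*.xml") if p.stat().st_mtime > last_run]
    return sorted(fresh, key=lambda p: p.stat().st_mtime)
```

A real indexer would then hand each returned path to the search engine and record the new sweep time once the batch succeeds.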
I create a single index for each origin+content-type, so there’s a separate index for Itasca-Features and Itasca-Audio and Itasca-Images. Maintaining separate indexes makes it much easier to create distinct endpoints within Dezi for limiting search to particular content subsets, whether by origin or type. It also helps with scaling, by sharding the search across multiple filesystems and machines.
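To make the one-index-per-origin-and-type idea concrete, here is a small Python sketch of naming indexes and choosing which ones a search endpoint should cover. The `Origin-Type` naming follows the examples above; the helper functions and the `Eddy-Audio` index are hypothetical, and the sketch assumes origin names contain no hyphen.

```python
from typing import Iterable, List, Optional


def index_name(origin: str, content_type: str) -> str:
    """Build an index name such as "Itasca-Audio"."""
    return f"{origin}-{content_type}"


def select_indexes(names: Iterable[str],
                   origin: Optional[str] = None,
                   content_type: Optional[str] = None) -> List[str]:
    """Return the index names matching an origin and/or content type."""
    selected = []
    for name in names:
        # Assumption: the first hyphen separates origin from content type.
        idx_origin, _, idx_type = name.partition("-")
        if origin is not None and idx_origin != origin:
            continue
        if content_type is not None and idx_type != content_type:
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    indexes = [index_name("Itasca", t) for t in ("Features", "Audio", "Images")]
    indexes.append(index_name("Eddy", "Audio"))  # hypothetical extra index
    print(select_indexes(indexes, origin="Itasca"))
    print(select_indexes(indexes, content_type="Audio"))
```

Because each index is self-contained, an endpoint limited to one origin or one content type only needs to open the indexes it selects.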
Once the Barn system was up and humming along its automated path, we started to see other uses for it besides just search. Since all our content is now normalized into a standard format and metadata schema, we can create ad hoc collections of content across multiple origins. Since the Barn knows how to talk to the origin backend databases in real time, we can de-normalize and cache assets (like MPR News stories) that do not change very often but can be expensive (in terms of SQL queries) to generate. And since we now have all our content in one place, we can re-distribute it wherever we want, automatically.
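The caching idea can be sketched as a read-through cache: build the de-normalized asset once, then reuse it until it is considered stale. The Python below is only an illustration. The `AssetCache` class, the time-based expiry, and the `build` callback are assumptions made for this example; the post does not describe how the Barn invalidates its cache.

```python
import time
from typing import Any, Callable, Dict, Tuple


class AssetCache:
    """Hypothetical read-through cache for expensive-to-build assets."""

    def __init__(self, build: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._build = build      # e.g. a function that runs the costly SQL queries
        self._ttl = ttl_seconds  # assumption: entries expire after a fixed time
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, asset_id: str) -> Any:
        """Return the cached asset, rebuilding it if missing or expired."""
        now = self._clock()
        hit = self._entries.get(asset_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = self._build(asset_id)
        self._entries[asset_id] = (now, value)
        return value
```

For assets that rarely change, a long expiry keeps nearly all requests away from the origin database.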
Here, for example, is a command for exporting Barn content to the PMP:
perl bin/barn-export -e PMP 'origin:marketplace date:today'
That pushes all of today’s Marketplace content to the PMP. It runs on a cron job via the Barn’s scheduler system. An export is just a search query, so we could also do something like:
perl bin/barn-export -e PMP 'sausage politics origin:feature'
That pushes every story mentioning the keywords ‘sausage’ and ‘politics’ to the PMP. Pretty handy.
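Both export queries mix `field:value` filters with free-text keywords. As a rough illustration of that shape, the Python sketch below splits a query string into the two parts. It is not Dezi's query parser, whose actual syntax is richer and is not described here; `split_query` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def split_query(query: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a query into field filters and free-text keywords.

    Assumption: a token of the form name:value is a field filter and
    every other token is a keyword. Quoting and boolean operators are
    deliberately ignored in this sketch.
    """
    fields: Dict[str, str] = {}
    keywords: List[str] = []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            keywords.append(token)
    return fields, keywords


if __name__ == "__main__":
    print(split_query("origin:marketplace date:today"))
    print(split_query("sausage politics origin:feature"))
```

Treating an export as a saved search means any query that works in the search interface can also drive a scheduled export.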
The Barn has proven very helpful to our internal infrastructure and content delivery, and that improves our audience experience in some direct ways: faster page load times, and automation of some tasks our content creators used to do manually. We’d also like to open up a subset of the Barn’s search functionality to our audiences, so that they can search across all our content at once and preview audio and images inline within results, just like our reporters and editors can do today.