Getting clever with data feeds
2009-01-26 13:07:49
There is a lot of data available out there on the Internet these days, and more and more of it is being made available in various forms of feeds or APIs (Application Programming Interfaces). For kicks, we wrote a feed from Twitter to Conquent so you can see live results scroll by at http://conquent.com/scroller/, and I also display my recent blog and Twitter postings through RSS on various portal pages.
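Pulling a feed like that takes only a few lines of code. Here's a rough sketch in Python using the feedparser library -- the feed URL is just a placeholder, not the actual feed behind the scroller:

    import feedparser

    def latest_entries(feed_url, limit=10):
        # Fetch an RSS/Atom feed and return (title, link) pairs for display.
        feed = feedparser.parse(feed_url)
        return [(entry.title, entry.link) for entry in feed.entries[:limit]]

    # Placeholder URL -- swap in whatever feed you want to scroll.
    for title, link in latest_entries("http://example.com/feed.rss"):
        print(title, "->", link)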
But let's talk about the value for business.
Our accessible shopping comparison engine at http://empower.conquent.com/ pulls literally millions of records from Shopzilla and other data sources and formats the results in a way that's easy for screen readers like JAWS or Window-Eyes to read, as well as making it easier for mobility-impaired or vision-impaired visitors without readers to view the results.
The problem is with querying millions of items 'on the fly.' You can't do it efficiently -- even with solid bandwidth and good servers, pulling down millions of records takes time to process. And hitting the source data every time someone wants to see vacuum cleaners makes the user experience crawl and eventually lie down and die.
We worked around this problem with 'on-demand caching' of results. Shopzilla updates their system every 24 hours, so there's no reason to grab a set of data if we've seen it in the last 24 hours. The first time someone hits a category that hasn't been viewed in 24 hours, we fire up the process to grab the data, process it, and cache it -- it's a little slower for that one person, but for the rest of the day it's lightning fast.
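In sketch form, the idea looks something like the Python below -- fetch_from_source and process_for_accessibility are stand-ins for our actual Shopzilla calls and formatting code, not real function names:

    import time

    CACHE_TTL = 24 * 60 * 60  # source refreshes daily, so cache for 24 hours
    _cache = {}  # category -> (fetched_at, processed_results)

    def get_category(category):
        # Serve the cached copy if it's less than 24 hours old.
        entry = _cache.get(category)
        if entry is not None and time.time() - entry[0] < CACHE_TTL:
            return entry[1]
        # Otherwise do the slow work once: grab, process, cache.
        raw = fetch_from_source(category)           # stand-in for the feed/API call
        processed = process_for_accessibility(raw)  # stand-in for our formatting step
        _cache[category] = (time.time(), processed)
        return processed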
At the same time, we don't have to store unused categories -- if no one ever browses 'tires' on empower.conquent.com, we never waste bandwidth and processor time grabbing something no one cares about. Granted, every time Google or MSN comes by they'll trip the category, but that works to our advantage, creating a slow, background cache copy in the event someone DOES come by -- and if we don't already have a cached copy, the visitor's request simply triggers a fresh grab, so they always see the most recent data either way.
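With the sketch above, that crawler case comes for free -- the same code path handles it:

    # First request for 'tires' -- maybe from Googlebot -- misses the cache,
    # so it pays the fetch-and-process cost once and warms the cache.
    get_category("tires")
    # Every request for the rest of the day is served straight from the cache.
    get_category("tires")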
Building in these simple efficiencies makes the site faster and a lot more scalable, and it keeps our content provider happier since we're not crowding them with unnecessary requests.