Caching, Grouping, and Additive Filtering

I’ve wiled away the last few months of my work life doing statistics. For example, I try as best as I can to figure out how many people have clicked to our download page, how many initiated a download, how many downloads successfully finished, and how many people are actually using the browser. It’s a lot trickier to figure this stuff out than you’d think, though I won’t bore you with the details.

I plan to bore you with different details. As you begin to collect more and more data, providing live statistics becomes challenging. Say you’ve got a database table with a few million rows of data in it, and somebody wants to pull a query from that database for a given date range and get a report that requires a lot of calculations. Suddenly, you not only have a query that has to comb through a lot of data, but once you’ve finally got all the data, you have to iterate over it to perform calculations. In our case, a couple of months ago, we found that we had enough data being churned that an overview report was taking 20 seconds or so to run. I spent a lot of time tweaking indices and optimizing queries, and I managed to save a lot of time, but I knew that as our data grew, the reports would bog down again. The solution, I figured, lay in caching data for speedy retrieval.

We typically care (at least for now) about daily data. This means that I can occasionally run the big heavy queries, group the data by date, and store data in such a way that I can simply add up counts for a date range to get very responsive reports. So for example, a query like this from a huge table of log entries:

SELECT date, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date

might yield a (partial) result like this:

2006-09-01       23
2006-09-02       28
2006-09-03       27

I stuff those values into a caching table. Then, when somebody asks for a total count of some piece of data for the date range 09/01 – 09/03, I can do a lightning-fast query from the caching table and add up the values in the count column to return the result (quite possibly using mysql’s SUM function to eliminate the need for any post-processing of the data). This gets a little dicey when you start having to calculate averages (for just one example) or when you start selecting based on more criteria than just date. This last selection type is something else I’ve handled in what I think/hope is a pretty nifty way that I call, for lack of any more fitting term I’m familiar with, additive filtering.

So, the scenario. Imagine that you want to allow users of your reports to filter on various criteria. In our case, we provide a standard version of Flock and a Photobucket version. We also provide versions for different platforms for both Flock and Photobucket. And for Flock releases, we provide builds for multiple platforms for multiple languages. The simple count-per-date caching I used as an example above doesn’t allow us to filter the data on these criteria efficiently, so we need another mechanism. Enter more sophisticated grouping in our caching tables. The query of the huge table we used before becomes something more like this:

SELECT date, vendor, os, locale, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date, vendor, os, locale

And our results more like this:

2006-09-01	flock	win	en_US	12
2006-09-01	flock	mac	en_US	 4
2006-09-01	flock	linux	en_US	 1
2006-09-01	flock	win	zn_CH	12
2006-09-01	flock	mac	zn_CH	 4
2006-09-01	flock	linux	zn_CH	 1
2006-09-01	pb	win	en_US	12
2006-09-01	pb	mac	en_US	 4

Note that I’m showing just one sample day here (and by the way, these aren’t real numbers, so don’t try to extrapolate and announce a download count for Flock). A few times a day, I put this data into a table with a unique key comprising all columns but the count, and each time I run the big query, I replace the values. For nearly-live reporting, I can then run queries summing the counts and adding constraints (ahem, “additive filtering”) on any of the columns desired. So if I want a total count, I just do “SELECT SUM(count_col) FROM cache_table” for the requested date range, and if I want a Photobucket count, I do “SELECT SUM(count_col) FROM cache_table WHERE vendor = ‘pb'” for the date range. For any further narrowing I wish to do, I just add another constraint to my WHERE clause. Again, it gets a little tricky if you’re doing any sort of averaging without constraints (you can’t just do a sum in that case but have to calculate a weighted average per would-be constraint set), but by and large, this has so far been a pretty good quick and dirty way of doing caching to generate speedy reports.

There are no doubt better ways to handle caching of large data sets, but for our purposes, this little trick has proven effective. I imagine about 1% of my already limited readership probably found this even remotely interesting. Still, I occasionally wonder if it might be appealing to some to have a glimpse into some of the less glamorous things that go on behind the scenes in a project like Flock. So, what do you think is better? Turkeys and neat little discoveries or things like this post?


4 thoughts on “Caching, Grouping, and Additive Filtering

  1. What you are (starting) is a data warehouse. And it sounds like what you need. I recommend tracking down a book on the concepts, sorry I can’t recommend one. In short, you make a second database that is not normalized (called a Star Schema). At the heart is one or more Fact Tables – these are tables that the unique key is made up of several fields. What fields normally depends on the Time Resolution you want, in you case this sounds like a day. So for each day you would have the total downloads per vendor, per os, per local, per anything else you may need.

    From this Fact Table, you build Dimension Tables. These tables take the Fact Table and look only at one field, or Dimension – so a Dimension Table for Vendor, a Dimension Table for OS, etc. Dimension Tables can also change the Time Resolution to per month, per year, etc. There are some more things you can do (like Data Marts), but that’s the overall gist. All this is updated by the Extract, Transform, and Load (ETL) phase which is normally a set of stored procedures but can be some scripts run by cron too.

    Oh, before going the complex – check indexes, analyze the query to make sure it’s using the index, consider adding some specialized multi-field index, making a pseudo-index field, etc.

  2. Imo, this is pretty interesting stuff, Daryl. It doesn’t hurt to have posts like this once in a while. I haven’t had to do data caching yet, because I haven’t worked with data sets this big, so it was nice to learn how people do it.

  3. I no longer feel that my cerebellum is so feeble, considering the peregrinations you go to for a statistical extract, I am still able to retrieve one or two memories daily. While semantically you are doing numerical analysis, you are attempting to organize matrices (I think) and ultimately this becomes theoretically a topological problem. Fun to think about. http://www.gnuzworks.com

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s