I’ve wiled away the last few months of my work life doing statistics. For example, I try as best as I can to figure out how many people have clicked to our download page, how many initiated a download, how many downloads successfully finished, and how many people are actually using the browser. It’s a lot trickier to figure this stuff out than you’d think, though I won’t bore you with the details.
I plan to bore you with different details. As you begin to collect more and more data, providing live statistics becomes challenging. Say you’ve got a database table with a few million rows of data in it, and somebody wants to pull a query from that database for a given date range and get a report that requires a lot of calculations. Suddenly, you not only have a query that has to comb through a lot of data, but once you’ve finally got all the data, you have to iterate over it to perform calculations. In our case, a couple of months ago, we found that we had enough data being churned that an overview report was taking 20 seconds or so to run. I spent a lot of time tweaking indices and optimizing queries, and I managed to save a lot of time, but I knew that as our data grew, the reports would bog down again. The solution, I figured, lay in caching data for speedy retrieval.
We typically care (at least for now) about daily data. This means that I can occasionally run the big heavy queries, group the data by date, and store data in such a way that I can simply add up counts for a date range to get very responsive reports. So for example, a query like this from a huge table of log entries:
SELECT date, COUNT(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date
might yield a (partial) result like this:
2006-09-01   23
2006-09-02   28
2006-09-03   27
I stuff those values into a caching table. Then, when somebody asks for a total count of some piece of data for the date range 09/01 – 09/03, I can do a lightning-fast query against the caching table and add up the values in the count column to return the result (quite possibly using MySQL’s SUM function to eliminate the need for any post-processing of the data). This gets a little dicey when you start having to calculate averages (for just one example) or when you start selecting based on more criteria than just date. This last selection type is something else I’ve handled in what I think/hope is a pretty nifty way that I call, for lack of any more fitting term I’m familiar with, additive filtering.
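Here’s a minimal sketch of that date-only caching scheme, using SQLite in Python just for illustration — the table and column names (data, cache, cnt) are my own, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy stand-in for the huge log table.
cur.execute("CREATE TABLE data (date TEXT)")
cur.executemany(
    "INSERT INTO data VALUES (?)",
    [("2006-09-01",)] * 23 + [("2006-09-02",)] * 28 + [("2006-09-03",)] * 27,
)

# The cache table: one precomputed count per day.
cur.execute("CREATE TABLE cache (date TEXT PRIMARY KEY, cnt INTEGER)")

# The occasional heavy query that populates (or refreshes) the cache.
cur.execute("""
    INSERT OR REPLACE INTO cache
    SELECT date, COUNT(*) FROM data
    WHERE date BETWEEN '2006-09-01' AND '2006-09-03'
    GROUP BY date
""")

# The lightning-fast report query: just sum the cached counts.
total = cur.execute(
    "SELECT SUM(cnt) FROM cache WHERE date BETWEEN '2006-09-01' AND '2006-09-03'"
).fetchone()[0]
print(total)  # 78
```

The report query now touches three cached rows instead of scanning seventy-eight log entries — and the gap only grows as the log table does.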
So, the scenario. Imagine that you want to allow users of your reports to filter on various criteria. In our case, we provide a standard version of Flock and a Photobucket version. We also provide versions for different platforms for both Flock and Photobucket. And for Flock releases, we provide builds for multiple platforms for multiple languages. The simple count-per-date caching I used as an example above doesn’t allow us to filter the data on these criteria efficiently, so we need another mechanism. Enter more sophisticated grouping in our caching tables. The query of the huge table we used before becomes something more like this:
SELECT date, vendor, os, locale, COUNT(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date, vendor, os, locale
And our results more like this:
2006-09-01   flock   win     en_US   12
2006-09-01   flock   mac     en_US    4
2006-09-01   flock   linux   en_US    1
2006-09-01   flock   win     zh_CN   12
2006-09-01   flock   mac     zh_CN    4
2006-09-01   flock   linux   zh_CN    1
2006-09-01   pb      win     en_US   12
2006-09-01   pb      mac     en_US    4
Note that I’m showing just one sample day here (and by the way, these aren’t real numbers, so don’t try to extrapolate and announce a download count for Flock). A few times a day, I put this data into a table with a unique key comprising all columns but the count, and each time I run the big query, I replace the values. For nearly-live reporting, I can then run queries summing the counts and adding constraints (ahem, “additive filtering”) on any of the columns desired. So if I want a total count, I just do "SELECT SUM(count_col) FROM cache_table" for the requested date range, and if I want a Photobucket count, I do "SELECT SUM(count_col) FROM cache_table WHERE vendor = 'pb'" for the date range. For any further narrowing I wish to do, I just add another constraint to my WHERE clause. Again, it gets a little tricky if you’re doing any sort of averaging without constraints (you can’t just do a sum in that case but have to calculate a weighted average per would-be constraint set), but by and large, this has so far been a pretty good quick-and-dirty way of doing caching to generate speedy reports.
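The additive-filtering part can be sketched the same way: a cache keyed on every grouping column, refreshed with replace-style inserts, and queried with SUM plus whatever constraints you want to stack on. Again this is SQLite in Python purely for illustration, and the names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unique key over every column except the count, so re-running the
# big query simply replaces any stale rows for the same key.
cur.execute("""
    CREATE TABLE cache (
        date TEXT, vendor TEXT, os TEXT, locale TEXT, cnt INTEGER,
        UNIQUE (date, vendor, os, locale)
    )
""")

rows = [
    ("2006-09-01", "flock", "win",   "en_US", 12),
    ("2006-09-01", "flock", "mac",   "en_US",  4),
    ("2006-09-01", "flock", "linux", "en_US",  1),
    ("2006-09-01", "pb",    "win",   "en_US", 12),
    ("2006-09-01", "pb",    "mac",   "en_US",  4),
]
cur.executemany("INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, ?)", rows)

# Total count for the day: sum everything.
total = cur.execute(
    "SELECT SUM(cnt) FROM cache WHERE date = '2006-09-01'"
).fetchone()[0]

# Additive filtering: same sum, narrowed by one more WHERE constraint.
pb = cur.execute(
    "SELECT SUM(cnt) FROM cache WHERE date = '2006-09-01' AND vendor = 'pb'"
).fetchone()[0]
print(total, pb)  # 33 16
```

Each extra filter (os, locale, and so on) is just one more AND clause against the same cached rows — that’s all “additive filtering” amounts to.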
There are no doubt better ways to handle caching of large data sets, but for our purposes, this little trick has proven effective. I imagine about 1% of my already limited readership probably found this even remotely interesting. Still, I occasionally wonder if it might be appealing to some to have a glimpse into some of the less glamorous things that go on behind the scenes in a project like Flock. So, what do you think is better? Turkeys and neat little discoveries or things like this post?