There’s been much ado of late about Google‘s acquisition of YouTube. Many are referring to YouTube now as GooTube, and it’s clear from some videos on YouTube that some folk are pretty nervous about the acquisition, fearing that the video service will become a pay-to-play service or that Google will find some other way to ruin YouTube. My own attention lately has been directed largely to the preparation of my home for sale, and as part of those preparations, I’ve purchased a product, pictured here, that may be of interest to those worried by this new business partnership. I present to you “Goo Gone,” which in my experience is better at killing ants than it is at anything else. Apply to GooTube and rub vigorously. Results may vary.
I’ve wiled away the last few months of my work life doing statistics. For example, I try as best as I can to figure out how many people have clicked to our download page, how many initiated a download, how many downloads successfully finished, and how many people are actually using the browser. It’s a lot trickier to figure this stuff out than you’d think, though I won’t bore you with the details.
I plan to bore you with different details. As you begin to collect more and more data, providing live statistics becomes challenging. Say you’ve got a database table with a few million rows of data in it, and somebody wants to pull a query from that database for a given date range and get a report that requires a lot of calculations. Suddenly, you not only have a query that has to comb through a lot of data, but once you’ve finally got all the data, you have to iterate over it to perform calculations. In our case, a couple of months ago, we found that we had enough data being churned that an overview report was taking 20 seconds or so to run. I spent a lot of time tweaking indices and optimizing queries, and I managed to save a lot of time, but I knew that as our data grew, the reports would bog down again. The solution, I figured, lay in caching data for speedy retrieval.
We typically care (at least for now) about daily data. This means that I can occasionally run the big heavy queries, group the data by date, and store data in such a way that I can simply add up counts for a date range to get very responsive reports. So for example, a query like this from a huge table of log entries:
SELECT date, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date
might yield a (partial) result like this:
2006-09-01 23 2006-09-02 28 2006-09-03 27
I stuff those values into a caching table. Then, when somebody asks for a total count of some piece of data for the date range 09/01 – 09/03, I can do a lightning-fast query from the caching table and add up the values in the count column to return the result (quite possibly using mysql’s SUM function to eliminate the need for any post-processing of the data). This gets a little dicey when you start having to calculate averages (for just one example) or when you start selecting based on more criteria than just date. This last selection type is something else I’ve handled in what I think/hope is a pretty nifty way that I call, for lack of any more fitting term I’m familiar with, additive filtering.
So, the scenario. Imagine that you want to allow users of your reports to filter on various criteria. In our case, we provide a standard version of Flock and a Photobucket version. We also provide versions for different platforms for both Flock and Photobucket. And for Flock releases, we provide builds for multiple platforms for multiple languages. The simple count-per-date caching I used as an example above doesn’t allow us to filter the data on these criteria efficiently, so we need another mechanism. Enter more sophisticated grouping in our caching tables. The query of the huge table we used before becomes something more like this:
SELECT date, vendor, os, locale, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date, vendor, os, locale
And our results more like this:
2006-09-01 flock win en_US 12 2006-09-01 flock mac en_US 4 2006-09-01 flock linux en_US 1 2006-09-01 flock win zn_CH 12 2006-09-01 flock mac zn_CH 4 2006-09-01 flock linux zn_CH 1 2006-09-01 pb win en_US 12 2006-09-01 pb mac en_US 4
Note that I’m showing just one sample day here (and by the way, these aren’t real numbers, so don’t try to extrapolate and announce a download count for Flock). A few times a day, I put this data into a table with a unique key comprising all columns but the count, and each time I run the big query, I replace the values. For nearly-live reporting, I can then run queries summing the counts and adding constraints (ahem, “additive filtering”) on any of the columns desired. So if I want a total count, I just do “SELECT SUM(count_col) FROM cache_table” for the requested date range, and if I want a Photobucket count, I do “SELECT SUM(count_col) FROM cache_table WHERE vendor = ‘pb'” for the date range. For any further narrowing I wish to do, I just add another constraint to my WHERE clause. Again, it gets a little tricky if you’re doing any sort of averaging without constraints (you can’t just do a sum in that case but have to calculate a weighted average per would-be constraint set), but by and large, this has so far been a pretty good quick and dirty way of doing caching to generate speedy reports.
There are no doubt better ways to handle caching of large data sets, but for our purposes, this little trick has proven effective. I imagine about 1% of my already limited readership probably found this even remotely interesting. Still, I occasionally wonder if it might be appealing to some to have a glimpse into some of the less glamorous things that go on behind the scenes in a project like Flock. So, what do you think is better? Turkeys and neat little discoveries or things like this post?
While waiting in line at a craft store today, I saw the item pictured here in a display by the cash register. Next to it were some other “Flocked Minis” of Halloween figures (ghosts and pumpkins), but for my purchasing power, the birds were it. (There were also some little bears that I almost always think about getting for my daughter when I see them, but they didn’t come in the tangentially-Flock-relevant packaging, so I opted out of those once again.)
Lest you think this is an ill-advised product placement, let me explain that flocking is (among other things) a technique of texturing or patterning something with pulverized wool or felt (source). These turkey figurines are flocked. I wonder if turkeys actually flock in the wild.
I’m no good with version numbers, but I hereby propose that whatever version number we’ve reached when we’re down to the letter “T” in our count (our already having used Buzzard and Cardinal and heading into Danphe) that we name the browser Turkey.
My daughter enjoyed playing with the turkeys.
I picked up Cal Henderson’s Building Scalable Web Sites the other night hoping to learn something about “The Flickr Way” of building scalable web sites. I’m about halfway through and have mixed feelings so far about its usefulness for my purposes, but I couldn’t help feeling a little thrill when I encountered a reference to Flock on page 142. This portion of the book gives a brief explanation of the http request/response life cycle, and Cal happened to be using Flock while generating his example request (which returns a 404, if you’re interested, though the example suggests otherwise 😉 ).
I suppose I shouldn’t have been terribly surprised to see a Flock reference in a book by a guy from Web 2.0 darling Flickr. That Lloyd and Cal are buddies (there are photos to prove it) and that Cal has been spotted at SuperHappyDevHouses (the creation and proliferation of which past Flockers Termie and Factoryjoe have been instrumental in) should have rendered it even less of a surprise. Still, when I saw the example, I flopped the book down and exclaimed to my wife, “Hey, Flock’s in this book! Woo hoo!”
I wrote the other night about my frustration with comments in WordPress. The anti-spam tools that come with the software are great for squashing spam, but I still occasionally go through periods during which I get a bunch of emails asking me to log in and moderate spam, and that irritates me. Most of these moderation requests are for spam on old posts. So I decided to write a WordPress plugin that would auto-disable comments and pings on old posts. This allows me to keep comments open for those who read my stuff and would like to comment within a reasonable timeframe but keeps me from having to go back and manually run mysql queries by hand (or, worse yet, manually edit old posts) to turn off comments on individual posts. (The WordPress options for commenting affect only future posts.)
The plugin is a simple one, providing an option under the “Manage” and “Manage Comments” administrative screens allowing you to set the number of days old a post must be to have its comments disabled. If you post infrequently, this plugin won’t do you much good, as its action is triggered by the saving of a post. If you post every few days as I tend to, though, it’ll do an ok job of keeping old posts from picking up comment spam. The default threshold is 14 days. So any time I add or save a post, WordPress runs a query that disables comments and pings on all posts created more than 14 days ago. That’s all there is to it.
Download the plugin here if you think you’d find it useful.
From a late-night IRC conversation with a Flock staffer and a Flock community member. It’s sort of a running joke that, a Southerner, I partake of all the bad habits and am characterized by all the provincialisms generally associated with the South.
daryl: (sorry, it’s late here and I’m working on a blog post entitled “My complex relationship with meat,” so it’s fitting that I should be in a weird mindframe 😉
daryl: yosh, I’m occasionally eating meat again
daryl: we ran out of vegetables in Tennessee
daryl: except for tobacco, that is, and it tastes really bad
yosh: well, TN really didn’t have that many
yosh: daryl: tobacco can be good with the right sauce
daryl: like a durian sauce?
yosh: durian-natto sauce
daryl: actually, I’m mainly eating meat b/c my newly pregnant wife craves it and I’m tired of cooking two meals a night
[redacted]: daryl: you actually eat tabacco?
daryl: [redacted], it’s a staple in Tennessee
daryl: that and buggering cows 😉
daryl: (no, I don’t eat tobacco)
daryl: (though I am married to my sister, who is also my grandmother and my third cousin six times removed; and my father)
yosh: daryl is his own grandpa
daryl: and grandma
daryl: I’m also my own sandwich
daryl: (my other grandpa having mated with a tobacco plant, that is)
[redacted]: daryl: good (you don’t eat tobacco) 😉
daryl: opium, now that’s a different story 😉