Caching, Grouping, and Additive Filtering

I’ve wiled away the last few months of my work life doing statistics. For example, I try as best as I can to figure out how many people have clicked to our download page, how many initiated a download, how many downloads successfully finished, and how many people are actually using the browser. It’s a lot trickier to figure this stuff out than you’d think, though I won’t bore you with the details.

I plan to bore you with different details. As you begin to collect more and more data, providing live statistics becomes challenging. Say you’ve got a database table with a few million rows of data in it, and somebody wants to pull a query from that database for a given date range and get a report that requires a lot of calculations. Suddenly, you not only have a query that has to comb through a lot of data, but once you’ve finally got all the data, you have to iterate over it to perform calculations. In our case, a couple of months ago, we found that we had enough data being churned that an overview report was taking 20 seconds or so to run. I spent a lot of time tweaking indices and optimizing queries, and I managed to save a lot of time, but I knew that as our data grew, the reports would bog down again. The solution, I figured, lay in caching data for speedy retrieval.

We typically care (at least for now) about daily data. This means that I can occasionally run the big heavy queries, group the data by date, and store data in such a way that I can simply add up counts for a date range to get very responsive reports. So for example, a query like this from a huge table of log entries:

SELECT date, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date

might yield a (partial) result like this:

2006-09-01       23
2006-09-02       28
2006-09-03       27

I stuff those values into a caching table. Then, when somebody asks for a total count of some piece of data for the date range 09/01 – 09/03, I can do a lightning-fast query from the caching table and add up the values in the count column to return the result (quite possibly using mysql’s SUM function to eliminate the need for any post-processing of the data). This gets a little dicey when you start having to calculate averages (for just one example) or when you start selecting based on more criteria than just date. This last selection type is something else I’ve handled in what I think/hope is a pretty nifty way that I call, for lack of any more fitting term I’m familiar with, additive filtering.

So, the scenario. Imagine that you want to allow users of your reports to filter on various criteria. In our case, we provide a standard version of Flock and a Photobucket version. We also provide versions for different platforms for both Flock and Photobucket. And for Flock releases, we provide builds for multiple platforms for multiple languages. The simple count-per-date caching I used as an example above doesn’t allow us to filter the data on these criteria efficiently, so we need another mechanism. Enter more sophisticated grouping in our caching tables. The query of the huge table we used before becomes something more like this:

SELECT date, vendor, os, locale, count(*) FROM data WHERE date BETWEEN '2006-09-01 00:00:00' AND '2006-09-03 23:59:59' GROUP BY date, vendor, os, locale

And our results more like this:

2006-09-01	flock	win	en_US	12
2006-09-01	flock	mac	en_US	 4
2006-09-01	flock	linux	en_US	 1
2006-09-01	flock	win	zn_CH	12
2006-09-01	flock	mac	zn_CH	 4
2006-09-01	flock	linux	zn_CH	 1
2006-09-01	pb	win	en_US	12
2006-09-01	pb	mac	en_US	 4

Note that I’m showing just one sample day here (and by the way, these aren’t real numbers, so don’t try to extrapolate and announce a download count for Flock). A few times a day, I put this data into a table with a unique key comprising all columns but the count, and each time I run the big query, I replace the values. For nearly-live reporting, I can then run queries summing the counts and adding constraints (ahem, “additive filtering”) on any of the columns desired. So if I want a total count, I just do “SELECT SUM(count_col) FROM cache_table” for the requested date range, and if I want a Photobucket count, I do “SELECT SUM(count_col) FROM cache_table WHERE vendor = ‘pb'” for the date range. For any further narrowing I wish to do, I just add another constraint to my WHERE clause. Again, it gets a little tricky if you’re doing any sort of averaging without constraints (you can’t just do a sum in that case but have to calculate a weighted average per would-be constraint set), but by and large, this has so far been a pretty good quick and dirty way of doing caching to generate speedy reports.

There are no doubt better ways to handle caching of large data sets, but for our purposes, this little trick has proven effective. I imagine about 1% of my already limited readership probably found this even remotely interesting. Still, I occasionally wonder if it might be appealing to some to have a glimpse into some of the less glamorous things that go on behind the scenes in a project like Flock. So, what do you think is better? Turkeys and neat little discoveries or things like this post?



Six months or so ago, I wrote the current version of Flock’s extensions site. It’s pretty simple and was intended to be simple. It wound up being even simpler than originally planned and has plenty of functionality hidden because we haven’t had a need for it or the ability to support it. We’ve thought a lot about various approaches to making extensions available, and I blogged some of those thoughts a few months ago. The quick summary is that we were waffling over whether to enforce quality for Flock extensions by officially hosting only a few or whether to have a free-for-all, or whether to blend the approaches. We ultimately decided on a blended approach, and we planned to launch a new extensions site in early February with the launch of the 0.5 release of Flock. For various reasons, that didn’t pan out.

I visited Flock HQ last week, and we revisited the subject. The result? I’ve burned some midnight oil over the last few days and have built a draft of Flock’s new extensions site. It’s a Drupal module that I anticipate ultimately making public in the event that any community sites or similar projects wish to host extensions. Here’s what you can expect to see when it launches in (unless history’s a good indicator) a couple of weeks.

  • Search by (at least) keyword, locale, platform, and category.
  • See what Flock staffers recommend
  • Find extensions that get high ratings from the community, which presupposes…
  • One-click extensions ratings
  • Comment on extensions
  • Convert extensions that don’t already work in Flock

This last one was a fun little piece of code to write, and I want to talk about it for a minute because I’m frankly a little proud of the way it solves a few problems both on the community side and the administration side.

Converting extensions is nothing new. I wrote a command-line utility six months ago that was handy for batch-converting extensions, though not especially suitable for broad public use. I also started work on a web converter, but I never got a chance to finish it. Luckily, the community stood up and filled the need by writing various server and desktop programs to help individuals make extensions work in Flock and discover what extensions had already been converted. Further, for the would-have-been February launch of a new Flock web site (complete with extensions site), John Vandenberg did a lot of work and provided lots of valuable input for an extensions module in particular and on extensions conversion and management generally. So my building a module to convert extensions isn’t exactly trail-blazing. The module nevertheless solves several problems in a way that blends some of these approaches and that I think will add lots of value both for Flock users and for our staff.

So, how does it work? Say you’ve done a quick search for an extension and can’t find it on our site. You’ll be prompted to go to the extension conversion page, where you can either upload or paste the url for the extension you’re looking for. The site then fetches the extension and reads in its meta-data. If it’s already in our database, the site updates a few fields. For example, it looks for locales and platforms the extension reports itself to be compatible with. So let’s say that someone else has previously uploaded the extension, but the version you upload supports more locales. Your action will update our record for the extension, making our meta-information that much more accurate. We also try to automatically figure out from the extension configuration who the author of the extension is, and, where possible, we provide “also by” information to help people find other extensions we’ve catalogued by a given author. It’s not perfect yet, but we’re heading in the right direction. And finally (among the goodies that give me a little thrill), we do some very basic security checks and provide a warning for extensions that do things that could potentially be dangerous. This merits some elaboration.

Over a year ago, I blogged about a major but necessary flaw in the Firefox extensions framework that makes is possible for extensions authors to write extensions that appear to be nice but that can morph over time into malicious extensions. All it takes is writing an extension that makes an XMLHttpRequest call to a remote server and evaluates the javascript string returned. Say I write an extension with broad appeal, and it makes such a call, and during the first week of deployment, the javascript returned by the server does useful and expected and non-malicious things. But after a week of gathering users, say I change the javascript returned by the server so that it reads your cookies and sends them to me or performs some other potentially nasty tasks. This is entirely possible, and it’s long fueled wariness on my part about installing extensions. The browser tells you only to install trusted extensions, but how many of us have even noticed that warning, much less paid it very much attention?

I don’t know of a way to address this issue fully, short of disabling extensions capability, which obviously won’t do. In order to help raise the security bar a little bit (not much, but every little bit helps), the new extensions catalog does a few rudimentary security checks. It looks for some potentially exploitable strings in extensions and produces a warning if they exist. In most cases, these will probably be false positives. For example, it’s perfectly valid and necessary for a del.icio.us extension to make XMLHttpRequest calls so that it can send and retrieve data to and from your del.icio.us account. Pretty much any extension that contacts a web service (read: every social networking extension) will cause the security flags to be raised. And it’s distinctly possible that users of our catalog will begin to ignore these warnings as readily as they ignore the browser’s warning about installing extensions. Every little flag or barrier to installing potentially malicious extensions helps, though, I think.

So, there’s a preview of what should be coming soon at a revamped Flock site. We’ll recommend a few extensions, and others will be pretty easily browsable. As we begin to accumulate ratings and comments, we’ll try to expose better and better ways of finding the best extensions. The catalog we’ll be releasing will in some ways be a very rough draft with its various hiccups and sputters, but it should be an improvement on what’s out there now, and it will empower the community to port most extensions to work with Flock even if we don’t publish all ported extensions. The process will also be streamlined for Flock staffers, and I think we’re heading down a good path for all of us. Stay tuned.


technorati tags: , ,


Inviting JSON to the Table

I’m doing some preliminary work on a project for which it’s been suggested that I consider using JSON rather than XML as a data transport. "JSON?" you ask. "What’s that?" It stands for JavaScript Object Notation, and for those of us who’ve spent a lot of time writing javascript, it’ll look very familiar. It’s a subset of the javascript language and can be described as the convention whereby one represents object members as name/value pairs. In short, it’s a form of serialization native to javascript and is therefore understood by all modern browsers out of the box and by many other programming languages either natively or by simple extension. A javascript function can eval a JSON text string with no additional parsing needed and can then use the decoded values directly. This can be beneficial to web applications in at least two potentially notable ways:

  • It eliminates the need to parse a verbose XML document into an object and then perform operations on the object.
  • The format can be (though isn’t necessarily) less verbose than XML.

Transfer time and processing overhead can therefore be optimized when using JSON in some circumstances. Furthermore, for some uses, JSON might actually save programming effort required on the server side to generate XML from objects or on the client side going in the other direction.

Those advantages notwithstanding, I was originally hesitant to give JSON more than a passing glance. For most web applications that feature the sort of functionality I had in mind (including, as far as I was aware at the time, the one I’m doing R&D for), existing AJAX toolkits fit the bill, and I was inclined to use an existing AJAX toolkit rather than to implement JSON for the sheer novelty of doing so. Consider an editable grid table, for example. You have a text field with an onchange event. On change, you send a small piece of data to the server and you get a small piece of data back that tells the client how to provide UI feedback. None of the three benefits of JSON I mentioned above really apply here, as the data in both directions is small, requires almost no processing, and need not be an otherwise usable object. It’s text out, text in, and minor DOM manipulation. In such a case, JSON provides no real advantage, and you might as well go with a standard AJAX toolkit.

A colleague working on the project with me pointed out some other possible use cases, however, that might render JSON worth further investigation. For example, if the data comes down as an object, it can be sorted and have calculations performed on it more readily on the client side without a round-trip to the server and back per operation. There’s something very appealing to me about this. So I’ll be doing more diligence on JSON.

As my first foray into coding with JSON, I wanted to test the example my colleague brought up. Doing so required me to grab a few libraries, and I haven’t packaged it all up nicely, but it’d be reasonably easy to assemble these things and test this out for yourself if you’re interested. Here’s what you need to grab: 

So, in a web sandbox, save the PEAR class as JSON.php and create a file named "process.php" with the following contents:



$rows = array();
$cols = array();


for($i=0; $i<$colcount; $i++){
        array_push($cols, "col $i");

for($i=0; $i<100; $i++){
        $row=array("count"      => $i);
        for($j=0; $j<$colcount; $j++){
                $row[$cols[$j] ] = substr(md5(microtime() . $cols[$colcount]),0,8);
        array_push($rows, $row);

        "error"         => "0",
        "message"       => "success",
        "payload"       => $rows

$json = new Services_JSON();
$output = $json->encode($response);


This script generates 100 rows of 20 columns of junk data and returns it as a JSON object. In a real-world application, this would presumably be a data set returned from a database. The XMLHttpRequest issued from your client calls this script and handles the data. Now on to that part of the code. Create a file named index.html and populate it as follows:

        <title>JSON Demo</title>
        <script type="text/javascript" xsrc="sortable.js" mce_src="sortable.js"></script>
        <script type="text/javascript" xsrc="json.js" mce_src="json.js"></script>
        <script type="text/javascript" xsrc="xmlrpc.js" mce_src="xmlrpc.js"></script>
        <style type="text/css">
                /* Sortable tables */
                table.sortable a.sortheader {
                        font-weight: bold;
                        text-decoration: none;
                        display: block;
                table.sortable span.sortarrow {
                        color: black;
                        text-decoration: none;
        <div id="container">
                <input type="text" id="thefield" name="thefield" value="" />
                <input type="button" id="thebutton" name="thebutton" value="The Button" onclick="json_request(document.getElementById('thefield').value)" />

Note the javascript includes at the top, and be sure to name the downloaded libraries appropriately or to change the file. Now create json.js and populate it as follows:

function json_request(txt){

        var myOnComplete = function(responseText, responseXML){
                var obj = eval('(' + responseText + ')');
                var container=document.getElementById('container');

        var myOnLog = function(msg){

        var provider = new oyXMLRPCProvider();
        provider.onComplete = myOnComplete;
        provider.onError = myOnLog;

        provider.submit("process.php?txt=" + escape(txt));

function make_table(data){
        var table = document.createElement('table');
        var tbody = document.createElement('tbody');
        for(var i=0; i<data.length; i++ ){
                var tr = document.createElement('tr');
                //Create headers on first iteration.
                        for(var field in data[i]){
                                var th = document.createElement('th');
                        //Be sure to start a new row.
                        var tr = document.createElement('tr');
                for(var field in data[i]){
                        var td = document.createElement('td');
        table.id="table_" + Math.floor ( Math.random ( ) * 100 );
        return table;

If you get everything linked up correctly, the result should be that when you press the button on the main page, a JSON object is pulled down asynchronously from the server and appended to the page as a table of data. Each such table is independently sortable without round trips to the server and back (and without your having to write sorting validation code to prevent SQL injection attacks, etc.). Of course, this does degrade poorly for browsers in which javascript is disabled. In any case, I’ve modeled one bit of functionality that’s pretty painless to implement using JSON, and I suspect that further work in this direction will turn up even more interesting results.

For more information on JSON, be sure to hit the JSON site

technorati tags: , ,


Ten Ways to Be a Bona Fide Flocker

  1. Use Flock! Specifically, keep on top of hourly builds to monitor how we’re progressing and to test ongoing changes.
  2. Report bugs. It helps us most to have bugs reported against hourly builds. See the wiki for some bug-reporting guidelines.
  3. Filter bugs. It’s natural for there to be duplicate bugs or bugs that could use more detail, and we could use help resolving these sorts of issues.
  4. Help other users. If you see somebody in the forums or on a list asking a question you know the answer to, chime in with friendly instructions or an informational link.
  5. Respond to press. Take a minute to respond politely to blog posts about Flock, providing tips and additional information where possible. We’re working on tools to help us track our response rate to Flock news on the web.
  6. Engage with the community. Here are some ideas: Collaborate to write wiki entries listing the top 50 most annoying bugs at any given time; do the same for a top 20 features list, using bugzilla as a guideline in both cases. Organize contests among yourselves; create Flock art (don’t forget to tag it “flockart“); create your own Top X lists; emerge, emerge!
  7. Localize Flock. Start with the tutorial, and then engage with other localizers in the forums or on the mailing list.
  8. Volunteer. We anticipate having all sorts of opportunities available, from massaging data to web development to more formal bug filtering. Let us know how you’re best qualified to help, and we’ll match you up with fitting opportunities as they arise.
  9. Talk to us. We’re very eager to provide tools that’ll help you build a vibrant community around Flock. Email me at daryl at flock.com to let me know what you think would be useful, or better yet, join the flock-discuss mailing list and share your thoughts with others as well.
  10. Spread the meme. Write your own blog posts about Flock. Add banners to your site. Wear swag (it’ll be available soon).

Technorati Tags: flock, community


Drupal/Bugzilla Integration

As noted the other day, I’m trying to encourage the Flock community to log bugs themselves. I haven’t had a chance to gauge just yet how many are actually doing this. I do keep seeing bug reports in the forums, and I contine to suggest that people log them as real bugs, but I worry that if we’re not careful, some of these valuable reports will fall through the cracks.

For now, as I comment on these reports, I’m adding a little pseudo-tag to my comment that I can go back and search for later to make sure these entries do in fact have bugs logged. But it promises to be sort of a nasty process. First, I leave little garbage pseudo-tags in my comment; then, I’ll have to endure the tedium of ensuring that a bug is created for each of these. Here’s what’d be neat:

Each forum post has a button I can push to convert the post and comments into a bug report at my bugzilla installation. At minimum, it should send the link as the url associated with the bug, and ideally, it dumps the content of the forum thread into the bug. In an ideal world, it also gets a response from Bugzilla that tells it the bug number, and it adds a comment to the forum thread that includes a link to the bug that’s just been logged.

It’d also be neat if each forum post had a small text box that would allow me to type in a bug number and that would add a comment including a link to the bug when I did so.

I think this’d be just a great bit of integration between Drupal and Bugzilla, and it seems like it’d be of some use to the broader community of open source developers. Does such a thing exist in hiding somewhere? Is there anybody who’d be willing to whip something like this up?

Technorati Tags: ,


Flocker: A New Definition

          1. One who works as an employee of Flock.
          2. One who uses Flock, especially if active in the Flock community.
          3. A program that converts Firefox extensions into Flock extensions.

Geoff Gauchet has written a little app that lets you drag a Firefox extension onto it and that spits out a Flock-compatible version. This is a great tool for those of you who’re waiting on your favorite extensions to be ported before making the switch to Flock. I hear that the tool doesn’t port all extensions yet but that it’ll soon support more. Great work, Geoff!

As for Flock-hosted extensions, we’re working right now to get the extensions site up and running with updates and the ability for extensions developers to submit their own extensions. In the mean time, Geoff’s efforts will help get current Flockers through. This is certainly worth some swag. Geoff, look for an email from me today.

Right now, Flocker works only on Mac and PC. If somebody does a Linux port (especially if it supports both a cli interface and a gui interface), you can bet I’ll send a shirt and some stickers your way.


Bugging People to Use Bugzilla

I spend a surprising amount of time each day lately asking people in the forums or on the mailing lists to submit to our bug database the issues they’re reporting. Today, I wanted to take a minute to explain why I’m asking people to do this themselves rather than trotting off to submit the bugs myself. There are two reasons, and neither of them is that I’m just plain lazy.

My first justification is that that a bug report is almost always going to be more accurate if submitted by the person who experienced the bug. You’re probably familiar with the old party game in which one person starts out by whispering a phrase into the ear of the person next to him, and it goes on down the line until the last person says what was whispered in his ear. “I dig blogging with the new browser Flock” turns into “Let’s go frog gigging in a new blouse or frock.” This is great at parties, but when it comes to bugs, we want nothing lost in translation, so I’m doing my best to stay out of the bug-logging-by-proxy business.

My second reason for directing people to log their own bugs is that I want to encourage our users — and not just the tech wonks (whom we love) — to become engaged in assuring the quality of the browser. If you encounter a problem, I want you to feel comfortable doing a quick search through the bug list and either adding your comments to existing bugs that have affected you or creating new bugs for issues for which you’re the first reporter. I think that owning bugs contributes to a sense of ownership of or investment in the browser, and that’s a good thing for our users to have. By reporting or confirming bugs, you’re directly helping to improve the quality of Flock, and I want our users to have the satisfying experience of seeing one of their bugs resolved.

Now, everybody who’s taken even a sidelong glance at bugzilla knows that it’s not the prettiest thing out of the box. It’s very intimidating to a lot of people, though it’s got some very nice features that make it a good choice for enterprise bug management. Part of our community effort is going to include trying to make Bugzilla more palatable to front-end users. I plan to provide quick views (maybe feeds) of popular bugs, for example, and Bart has already enumerated a few other things we’d like to contribute to the Bugzilla project. So the Flock Bugzilla experience will get better soon.

In the mean time, I hope you’ll start getting familiar with the bug database. If you’re having problems, hit me with questions and I’ll see if I can help out. I’d love to see an industrious community member post some simple documentation at the wiki on best practices for searching and creating bugs. (Do a good job of it and let me know and it’ll be worth some swag.)

No frogs, blouses, frocks, or flocks were harmed in the production of this entry.

Technorati Tags: ,