Bill Katz

My Brain

An occasionally updated repository of thoughts, past work, and links.

Fault-tolerant counters for App Engine

The datastore in Google App Engine can occasionally throw an error.  It might be a timeout, a quota violation, the result of maintenance (CapabilityDisabledError), or another exception raised by the db and apiproxy_errors modules.

So how do you gracefully handle datastore failures?  You could just inform the user to try again later.  Another approach is to use the memcache API and build a buffer for failed datastore puts.  That's the approach I took when reworking my sharded, memcached counter system.

To decrease write contention, Google suggests spreading a count across shards.  To get some fault tolerance, we can catch any exception raised when writing a shard, add the amount that failed to be put() to a delayed_incr value in memcache, and add that delayed_incr amount back in the next time we write to a shard.  If put() keeps failing, the delayed_incr amount grows and the counter values in the datastore stall, but we still get accurate counter values from memcache.  As soon as one put() succeeds, though, the counter shards in the datastore will reflect the correct count.
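The flow above can be sketched in a few lines.  This is only an illustration, not the code from the repository: FakeDatastore, DatastoreError, and FaultTolerantCounter are hypothetical names, a plain dict stands in for memcache, and an in-memory class stands in for the datastore so the sketch runs anywhere.  A real version would need atomic memcache operations and datastore transactions.

```python
import random


class DatastoreError(Exception):
    """Hypothetical stand-in for the errors the datastore can raise."""


class FakeDatastore:
    """In-memory stand-in for the datastore; put() can be made to fail."""

    def __init__(self):
        self.shards = {}
        self.failing = False

    def get(self, key):
        return self.shards.get(key, 0)

    def put(self, key, value):
        if self.failing:
            raise DatastoreError("simulated put() failure")
        self.shards[key] = value


class FaultTolerantCounter:
    """Sharded counter that buffers failed writes in a cache."""

    def __init__(self, name, datastore, cache, num_shards=4):
        self.name = name
        self.datastore = datastore
        self.cache = cache  # a plain dict standing in for memcache
        self.num_shards = num_shards

    def _shard_key(self, index):
        return "%s:%d" % (self.name, index)

    def incr(self, amount=1):
        # The cache always tracks the running total.
        count_key = "count:" + self.name
        self.cache[count_key] = self.cache.get(count_key, 0) + amount

        # Fold in whatever earlier failed writes left behind.
        # (Not atomic here; this is an assumption-laden sketch.)
        delayed_key = "delayed_incr:" + self.name
        pending = amount + self.cache.pop(delayed_key, 0)

        shard_key = self._shard_key(random.randrange(self.num_shards))
        try:
            current = self.datastore.get(shard_key)
            self.datastore.put(shard_key, current + pending)
        except DatastoreError:
            # The write failed: remember the amount for the next attempt.
            self.cache[delayed_key] = pending

    def cached_count(self):
        return self.cache.get("count:" + self.name, 0)

    def stored_count(self):
        return sum(self.datastore.get(self._shard_key(i))
                   for i in range(self.num_shards))


store = FakeDatastore()
counter = FaultTolerantCounter("hits", store, cache={})

counter.incr()            # succeeds: datastore and cache both hold 1
store.failing = True
counter.incr()            # fails: delayed_incr is now 1
counter.incr()            # fails: delayed_incr is now 2
print(counter.stored_count(), counter.cached_count())  # 1 3

store.failing = False
counter.incr()            # succeeds and writes 1 + 2 buffered
print(counter.stored_count(), counter.cached_count())  # 4 4
```

While puts fail, the datastore total stalls at 1 but the cached total keeps climbing.  The first successful put carries the buffered amount with it, so the shards catch up in one write.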

I assume that datastore errors are short-lived relative to how long memcache holds on to its values, so memcache smooths over timeouts and other errors.

You can review the code at the counter's git repository.  Let me know if you spot any issues with the approach.  Like the App Engine SDK, the counter code is released under an Apache 2.0 license.

It might be possible to generalize this method to provide a more fault-tolerant datastore layer for your models.

Category: App Engine Software


4 Comments

  1. Re: Article by Brett (2008-10-08)

    Cool.
  2. Re: Article by jeremy Awon (2008-10-09)

    i was just about to waste my afternoon coding this myself! thanks.
  3. Cache? by Paul Bunkham (2008-11-04)

    Great work, I've found it very useful in a project I'm working on.

    I have discovered something that might be a problem. The memcached part of the counter will always return 0 when a value isn't found in the cache.

    This means that when the cache is invalidated, counter values are not retrieved from the data store unless it's forced in the code using the get_counter(nocache=true) call. The fix is quite simple, including a couple of 'if' statements, however is this behaviour intentional?

  4. Re: Article by cristiang (2008-11-11)

    :)