
Simon Willison’s Weblog

On python, javascript, django, search, apis, ...

 

Recent entries

Pragmatism, purity and JSON content types 22 days ago

I started a conversation about this on Twitter the other day, but Twitter is a horrible place to have an archived discussion so I’m going to try again here.

If you’re producing a JSON API for other people to use (as opposed to an API that’s only really meant for your own local Ajax responses), you need to decide which Content-Type to use. The best option is not entirely obvious.

RFC 4627 defines JSON and registers application/json as its media type. The problem is that most browsers will prompt you to download the file rather than displaying it inline (as they would for text/plain or application/javascript). One of my favourite qualities of REST-style APIs is that they enable exploration and debugging using just a browser—using application/json throws a big, frustrating road block in the way. There are ways of telling your browser to treat application/json in the same way as text/plain, but that doesn’t really help you if your aim is to create an API that’s easy for other developers to use.

It’s also worth mentioning that if you are returning JSONP (with an extra callback function wrapped around the JSON response to enable the dynamic script tag hack) you HAVE to serve it as application/javascript—otherwise the script you are providing won’t be executed by the browser. Don’t forget to include charset=UTF-8 as well (for both types of response).
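
Just to illustrate the mechanics, here is a minimal Django sketch (my own illustration, not code from any of the APIs mentioned here) that serves plain JSON as application/json and switches to application/javascript when a JSONP callback parameter is supplied. The view and parameter names are hypothetical:

import json
from django.http import HttpResponse

def api_view(request):
    payload = json.dumps({'status': 'ok'})
    callback = request.GET.get('callback')
    if callback:
        # JSONP must be served as JavaScript, or the browser will not
        # execute the dynamically inserted <script> tag.
        return HttpResponse('%s(%s)' % (callback, payload),
                            content_type='application/javascript; charset=UTF-8')
    # Plain JSON gets the "correct" media type from the RFC.
    return HttpResponse(payload,
                        content_type='application/json; charset=UTF-8')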

So, it’s pragmatism vs. purity. The correct thing to do is to return application/json, but doing so makes your API harder for developers to use.

In a brief, non-comprehensive review of some existing JSON APIs (FriendFeed, Flickr, Google Social Graph, etc.) I couldn’t find any that were using application/json, presumably for this exact reason.

Using the Accept: header

The Accept: header is one of my least favourite parts of HTTP. I like to be confident that if I send a URL to someone, they’ll get back exactly the same bytes as I did when I retrieved it myself (I distrust language negotiation for the same reason). However, a number of people suggested it on Twitter and it looks like it could be a useful solution to this problem.

I’m currently considering the following: ONLY use the application/json Content-Type in reply to requests that include application/json in their Accept header—essentially allowing clients that care about the correct content type to opt-in to receiving it. Everyone else (browsers included) gets application/javascript, which is less correct (though not an all-out lie, since JSON is a subset of JavaScript) but solves the usability problem.

A couple of things worry me about this. Firstly, is this a reasonable thing to use Accept for? Secondly, is there a chance that browsers might add application/json to their Accept header at some point in the future? Safari currently sends text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 while Firefox sends text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8. Would it be smarter to look for */* and serve the incorrect Content-Type to those requests and the correct one to everything else?
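
As a sketch of the opt-in idea (my own illustration, not a finished design—the helper name is invented), the negotiation could be as simple as:

def negotiated_json_content_type(request):
    accept = request.META.get('HTTP_ACCEPT', '')
    if 'application/json' in accept:
        # The client explicitly asked for the correct media type.
        return 'application/json; charset=UTF-8'
    # Browsers (which send */* rather than application/json) get the more
    # forgiving type, so the response displays inline.
    return 'application/javascript; charset=UTF-8'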

An alternative is to simply allow people to specify “JSON with a browsable Content-Type” as an alternative format option, or to enable a “pretty=1” query string argument which returns the response as text/plain and potentially pretty prints it as well. I haven’t yet decided if this is better than messing around with the Accept header.

Rate limiting with memcached one month ago

On Monday, several high profile “celebrity” Twitter accounts started spouting nonsense, the victims of stolen passwords. Wired has the full story—someone ran a dictionary attack against a Twitter staff member, discovered their password and used Twitter’s admin tools to reset the passwords on the accounts they wanted to steal.

The Twitter incident got me thinking about rate limiting again. I’ve been wanting a good general solution to this problem for quite a while, for API projects as well as security. Django Snippets has an answer, but it works by storing access information in the database and requires you to run a periodic purge command to clean up the old records.

I’m strongly averse to writing to the database for every hit. For most web applications reads scale easily, but writes don’t. I also want to avoid filling my database with administrative gunk (I dislike database backed sessions for the same reason). But rate limiting relies on storing state, so there has to be some kind of persistence.

Using memcached counters

I think I’ve found a solution, thanks to memcached and in particular the incr command. incr lets you atomically increment an already existing counter, simply by specifying its key. add can be used to create that counter—it will fail silently if the provided key already exists.

Let’s say we want to limit a user to 10 hits every minute. A naive implementation would be to create a memcached counter for hits from that user’s IP address in a specific minute. The counter key might look like this:

ratelimit_72.26.203.98_2009-01-07-21:45

Increment that counter for every hit, and if it exceeds 10, block the request.
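
In Python, using the python-memcached client, the naive version might look something like this (a sketch with illustrative names, using the key format from the example above; not the final implementation):

import memcache
from datetime import datetime

cache = memcache.Client(['127.0.0.1:11211'])

def over_limit(ip, limit=10):
    key = 'ratelimit_%s_%s' % (ip, datetime.utcnow().strftime('%Y-%m-%d-%H:%M'))
    # add() creates the counter only if it does not already exist, and lets
    # it expire after a couple of minutes so stale counters clean themselves up.
    cache.add(key, 0, time=120)
    # incr() atomically bumps the existing counter and returns the new value.
    count = cache.incr(key)
    return count is not None and count > limit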

What if the user makes ten requests all in the last second of the minute, then another ten a second later? The rate limiter will let them off. For many cases this is probably acceptable, but we can improve things with a slightly more complex strategy. Let’s say we want to allow up to 30 requests every five minutes. Instead of maintaining one counter, we can maintain five—one for each of the past five minutes (older counters than that are allowed to expire). After a few minutes we might end up with counters that look like this:

ratelimit_72.26.203.98_2009-01-07-21:45 = 13
ratelimit_72.26.203.98_2009-01-07-21:46 = 7
ratelimit_72.26.203.98_2009-01-07-21:47 = 11

Now, on every request we work out the keys for the past five minutes and use get_multi to retrieve them. If the sum of those counters exceeds the maximum allowed for that time period, we block the request.
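
Here’s a rough sketch of that check (again my own illustration using the python-memcached client, not the code from the GitHub project mentioned below):

import memcache
from datetime import datetime, timedelta

cache = memcache.Client(['127.0.0.1:11211'])

def rate_limit_exceeded(ip, minutes=5, max_requests=30):
    now = datetime.utcnow()
    keys = ['ratelimit_%s_%s' % (ip, (now - timedelta(minutes=m)).strftime('%Y-%m-%d-%H:%M'))
            for m in range(minutes)]
    # Create (if needed) and bump the counter for the current minute; keep it
    # slightly longer than the window so it can still be summed, then expire.
    cache.add(keys[0], 0, time=(minutes + 1) * 60)
    cache.incr(keys[0])
    # Fetch all the per-minute counters in a single round trip.
    counts = cache.get_multi(keys)
    return sum(counts.values()) > max_requests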

Are there any obvious flaws in this approach? I’m pretty happy with it—it cleans up after itself (old counters quietly expire from the cache), it shouldn’t use many resources (just five active cache keys per unique IP address at any one time) and if the cache is lost the only snag is that a few clients might go slightly over their rate limit. I don’t think it’s possible for an attacker to force the counters to expire early.

An implementation for Django

I’ve put together an example implementation of this algorithm using Django, hosted on GitHub. The readme.txt file shows how it works—basic usage is via a simple decorator:

from ratelimitcache import ratelimit

@ratelimit(minutes = 3, requests = 20)
def myview(request):
    # ...
    return HttpResponse('...')

Python decorators are typically functions, but ratelimit is actually a class. This means it can be customised by subclassing it, and the class provides a number of methods designed to be overridden. I’ve provided an example of this in the module itself—ratelimit_post, a decorator which only limits POST requests and can optionally couple the rate limiting to an individual POST field. Here’s the complete implementation:

import sha  # needed for sha.new() below (Python 2; hashlib.sha1 is the modern equivalent)

class ratelimit_post(ratelimit):
    "Rate limit POSTs - can be used to protect a login form"
    key_field = None # If provided, this POST var will affect the rate limit

    def should_ratelimit(self, request):
        return request.method == 'POST'

    def key_extra(self, request):
        # IP address and key_field (if it is set)
        extra = super(ratelimit_post, self).key_extra(request)
        if self.key_field:
            value = sha.new(request.POST.get(self.key_field, '')).hexdigest()
            extra += '-' + value
        return extra

And here’s how you would use it to limit the number of times a specific IP address can attempt to log in as a particular user:

@ratelimit_post(minutes = 3, requests = 10, key_field = 'username')
def login(request):
    # ...
    return HttpResponse('...')

The should_ratelimit() method is called before any other rate limiting logic. The default implementation returns True, but here we only want to apply rate limits to POST requests. The key_extra() method is used to compose the keys used for the counter—by default this just includes the request’s IP address, but in ratelimit_post we can optionally include the value of a POST field (for example the username). We could include things like the request path here to apply different rate limit counters to different URLs.
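
For example, a hypothetical subclass (my sketch, not part of the module) that keys counters on the request path might look like this:

from ratelimitcache import ratelimit

class ratelimit_per_path(ratelimit):
    "Apply separate rate limit counters to each URL"

    def key_extra(self, request):
        # Append the path so e.g. /login/ and /search/ are limited independently.
        extra = super(ratelimit_per_path, self).key_extra(request)
        return extra + '-' + request.path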

Finally, the readme.txt includes ratelimit_with_logging, an example that overrides the disallowed() view returned when a rate limiting condition fails and writes an audit note to a database (far less overhead than writing to the database on every request).
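
A rough sketch of the shape of such a subclass (the real version lives in the project’s readme.txt; the AuditNote model here is hypothetical):

from ratelimitcache import ratelimit

class ratelimit_with_logging(ratelimit):
    "Record rate limit violations without writing to the database on every hit"

    def disallowed(self, request):
        # Only touch the database when a limit is actually exceeded.
        AuditNote.objects.create(ip=request.META.get('REMOTE_ADDR', ''),
                                 path=request.path)
        return super(ratelimit_with_logging, self).disallowed(request)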

I’ve been a fan of customisation via subclassing ever since I got to know the new Django admin system, and I’ve been using it in a bunch of projects. It’s a great way to create reusable pieces of code.

DjangoCon and PyCon UK five months ago

September is a big month for conferences. DjangoCon was a weekend ago in Mountain View (forcing me to miss both d.Construct and BarCamp Brighton), PyCon UK was this weekend in Birmingham, I’m writing this from @media Ajax and BarCamp London 5 is coming up over another weekend at the end of this month. As always, I’ve been posting details of upcoming talks and notes and materials from previous ones on my talks page.

DjangoCon went really, really well. Huge thanks to conference chair Robert Lofthouse for pulling it all together in just two months and Leslie Hawthorne for making it all happen from Google’s end. Google’s facilities were superb: the AV team were the best I’ve ever worked with and an army of Google volunteers made sure everything went smoothly. It’s hard to see how it could have gone better; the principal complaint we got was that at only two days it was hard to justify the travel, something which future DjangoCons will definitely address.

Every session was recorded and the videos are now up on YouTube. For the impatient, you can subscribe to an Atom feed of a YouTube search for “DjangoCon”. I recommend starting with Cal Henderson’s keynote “Why I hate Django”, which was funny and insightful in equal measure. Malcolm’s talk on ORM internals was another personal favourite.

PyCon UK was the second I’ve attended, but last year I only stayed for the first day. This time I stuck around and was enormously impressed by the grassroots feel of the conference and the enthusiastic atmosphere. I presented a tutorial on extending the Django admin and a lightning talk on Zeppelins, prepared two hours in advance after Jacob mentioned that the lightning talks were tending too much towards the technical side. It went down very well; I’m tempted to extend it to a half hour session for BarCamp London.

Unlike most conferences I attend, PyCon tickets included a sit-down dinner for all attendees complete with a “dramatic lecture” on the Lunar Society presented by Andrew Lound. This was a great fit for the conference, both for the Birmingham connection and the many analogies to the modern open source community—loose collaboration, patent concerns and what you might call an 18th century equivalent of the modern hacker ethic.

Next year the PyCon UK team will be hosting EuroPython, and I’m certain they’ll do an excellent job of it. Meanwhile, Rob has already started making plans for a Euro DjangoCon in around six months’ time, probably taking place in Prague.

Elsewhere

Yesterday

  • jQuery Sparklines. Delightful Sparklines implementation, using canvas or VML in IE. A neat nod towards unobtrusiveness as well: you can specify your data as comma-separated values inside a span, then use a single jQuery method call to convert the span into a sparkline image.
  • Magic properties make Firefox synchronously load the Java plugin. Even defining a function called sun() (or several other symbols) will trigger the Java VM to be loaded, dramatically hurting the performance of your page.
  • How FriendFeed uses MySQL to store schema-less data. The pain of altering/adding indexes to tables with 250 million rows was killing their ability to try out new features, so they’ve moved to storing pickled Python objects and manually creating the indexes they need as denormalised two-column tables. These can be created and dropped much more easily, and are continually populated by an off-line index building process.

26th February 2009


  • Accessibility and Degradation in Cappuccino. Ross Boucher from 280 North responds to Drew McLellan.
  • Kestrel. Twitter’s Robey Pointer rewrote their Starling message queue in 1500 lines of Scala, adding reliable fetch (where consumers can confirm their receipt of an item) and blocking fetches, which reduce the need for consumers to poll for updates (and hence solve my only beef with the original Starling). I haven’t tried running this on a low spec VPS yet but it looks very promising.
  • It is Ryanair policy not to waste time and energy corresponding with idiot bloggers and Ryanair can confirm that it won’t be happening again. Lunatic bloggers can have the blog sphere all to themselves as our people are far too busy driving down the cost of air travel.

    Ryanair

25th February 2009

  • What I've Learned from Hacker News. I’m always fascinated by online community war stories.
  • The Cost of Accessibility. Drew McLellan comments on the seemingly inevitable march towards JavaScript-dependent applications, and argues that JavaScript frameworks such as Cappuccino have a duty to integrate accessibility into their core.
  • django-springsteen and Distributed Search. Will Larson’s Django search library currently just talks to Yahoo! BOSS, but is designed to be extensible for other external search services. Interestingly, it uses threads to fire off several HTTP requests in parallel from within the Django view.
  • FAPWS3-0.2 (WSGI server based on libev). Another strong contender for Python’s answer to Mongrel—3500 requests/s for static files, 43 for a simple dynamic (Django-powered) page and 4.8 for a heavy SQL query—all benchmarked with 300 concurrent requests.
  • Some people, when confronted with a problem, think “I know, I’ll quote Jamie Zawinski.” Now they have two problems.

    Mark Pilgrim

23rd February 2009

  • Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way. Lessons learned from Posterous. Some good advice in here, in particular “Memcache later: If you memcache first, you will never feel the pain and never learn how bad your database indexes and Rails queries are”. Also recommends using job queues for offline processing of anything that takes more than 200ms.
  • Oscars 2009: the interactive results | guardian.co.uk. My latest project for the Guardian, put together on very short notice. Updates live as the results are announced, and allows Twitter users to vote on their favourite for each category by sending a specially formatted message to @guardianfilm—jQuery and Ajax polling against S3 under the hood.

22nd February 2009


  • I think you overstate the usefulness of the [jQuery Rules] plugin. Using this plugin, users are now limited by what selectors that can use (they can only use what the browsers provide—and are at the mercy of the cross-browser bugs that are there) which is a huge problem. Not to mention that it encourages the un-separation of markup/css/js.

    John Resig

  • jQuery.Rule (via) jQuery plugin for manipulating stylesheet rules. For me, this is the single most important piece of functionality currently missing from the core jQuery API. The ability to add new CSS rules makes an excellent complement to the .live() method added in jQuery 1.3.