Feb 28

Concurrent Python (AWPUG talk resources)

I'm doing a concurrency/gevent talk tonight at the Austin Web Python User Group. Slides and other resources are available here.

Sep 04

Django on gevent Djangocon slides

Here are my slides from "Django on gevent", my talk from Djangocon 2012.

http://media.codysoyland.com/pdf/django-on-gevent.pdf

Feb 06

Evented Django part one: Socket.IO and gevent

The asynchronous, real-time web has been getting more and more attention lately, and for good reason. The old paradigm of thick servers and thin clients is getting outdated as the new web demands rich, fast, asynchronous, full-duplex messaging. The technologies that enable server-to-browser asynchronous messaging have been given the umbrella term "Comet," and the number of ways to provide Comet services is growing constantly. The transport options include XHR-multipart, WebSockets, and Adobe Flash Sockets, among others. Socket.IO was invented to provide a unified interface for server-browser messaging so that the developer does not have to worry about inconsistent browser support. In this post, I'm going to explain how to use Django with Socket.IO.

Socket.IO was developed with a Node.JS server implementation, but work is being done to add server implementations in a variety of languages. Two such servers exist for Python: tornadio and gevent-socketio. I'm a big fan of gevent, so I will use gevent-socketio, but tornadio looks well-written and very promising.

Why you should be thinking about gevent

Socket.IO runs great under Node.JS, but I think it's important to highlight why I think Python and gevent need more attention (feel free to skip ahead if you have already drunk the gevent koolaid). Node.JS (and its underlying V8 Javascript engine) is a pinnacle achievement for the world of Javascript. It has done two especially important things: it helped show the world that evented application servers enable extremely fast high-concurrency connections, and it helped promote Javascript as a serious language, opening the doors for powerful tools such as testing frameworks, a package manager, and better community code standards. Its popularity is not surprising: it's built on top of one of the world's most well-known programming languages.

The Python community is a bit more fragmented, with several concurrent networking libraries -- among them: twisted, tornado, gevent, eventlet, and concurrence. It's certainly harder to know where to start without a "clear winner" like we see in the Javascript community. Personally, gevent has quickly become my favorite way to write asynchronous applications. I think Python with gevent wins over Node.JS in two important ways:

  1. It's Python, a sane and concise language with an awesome standard library and community.
  2. It uses greenlets instead of callbacks to provide concurrency.

Gevent, like Node.JS, is built on libevent (Update: Node actually uses libev. Thanks to Travis Cline for correcting me there), an underlying C library that provides a high-speed event loop. Node's concurrency model relies on callbacks to handle values from asynchronous I/O calls. This, combined with Javascript's highly nestable syntax, begs programmers to nest functions within other function calls, making callback passing a piece of cake, but I've seen this produce ugly, unreadable nested code, and I've seen programmers pull their hair out while trying to get things synchronized and avoid race conditions. In my experience, debugging an app with heavy use of callbacks is nearly impossible. Greenlet changes the game, because you can write simple "blocking" code that only blocks the current greenlet instead of the entire interpreter, allowing you to maintain stacks, along with beautiful Python stack traces.

Running Django on gevent-socketio

Gevent-socketio comes with one important caveat: you must use the gevent pywsgi server. This means you can't serve your WSGI app out of Apache like you might be used to doing (it should be possible to proxy requests from a front-end load balancer, but I haven't experimented with proxying web sockets). There's a pretty good reason for this: WSGI doesn't inherently allow web sockets. The only way to support them is to hook into the raw socket using the hooks provided by the pywsgi server.

Gevent-socketio works by creating a SocketIO handler and adding it to the WSGI "environ" dictionary before executing your WSGI app. When Django handles a request, it creates a WSGIRequest object and assigns it the environ dictionary created by pywsgi. So, if we are running Django under gevent-socketio, our SocketIO handler is available by accessing "request.environ['socketio']". I've demonstrated this by porting the gevent-socketio example chatroom application to Django. My ported code is available on Github.
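The environ mechanism described above can be sketched with nothing but plain Python. This is a minimal illustration of the idea, not gevent-socketio's actual code: FakeSocketIO and inject_handler are hypothetical names standing in for the real handler and the server-side hook.

```python
class FakeSocketIO(object):
    """Hypothetical stand-in for the handler object gevent-socketio creates."""
    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


def inject_handler(app):
    """Wrap a WSGI app so that every request's environ carries a handler."""
    def wrapper(environ, start_response):
        environ['socketio'] = FakeSocketIO()
        return app(environ, start_response)
    return wrapper


def app(environ, start_response):
    # The application (Django, in this post) finds the handler in environ.
    handler = environ['socketio']
    handler.send({'hello': 'world'})
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok']


wrapped = inject_handler(app)
```

Because Django keeps the same dictionary on the request object, a view reaches the handler through request.environ['socketio'] without any extra plumbing.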

Installation

I always choose to work in virtualenv, and I've created a pip requirements file that should cover what you need to get started. To run my example, clone the code on Github and install the requirements from pip:

git clone git://github.com/codysoyland/django-socketio-example.git
cd django-socketio-example
easy_install pip
pip install virtualenv
virtualenv .
source ./bin/activate
pip install -r pip_requirements.txt

Note the contents of pip_requirements.txt: I'm using the "tip" versions of both gevent-websocket and gevent-socketio. This is still beta-quality software, so we are using development versions. Note: Expect bugs!

A chat server request handler

The Socket.IO calls come in like normal requests and can be handled by a view, but your view code can actually contain a long-running event loop, sending and receiving messages from your web client. Here is the view that handles Socket.IO requests:

from django.http import HttpResponse

buffer = []

def socketio(request):
    socketio = request.environ['socketio']
    if socketio.on_connect():
        socketio.send({'buffer': buffer})
        socketio.broadcast({'announcement': socketio.session.session_id + ' connected'})

    while True:
        message = socketio.recv()

        if len(message) == 1:
            message = message[0]
            message = {'message': [socketio.session.session_id, message]}
            buffer.append(message)
            if len(buffer) > 15:
                del buffer[0]
            socketio.broadcast(message)
        else:
            if not socketio.connected():
                socketio.broadcast({'announcement': socketio.session.session_id + ' disconnected'})
                break

    return HttpResponse()
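The manual trimming of buffer in the view above (append, then delete the oldest entry once there are more than 15) can also be expressed with collections.deque. This is an optional alternative and not what the example app does; the helper name remember is hypothetical.

```python
from collections import deque

# Keeps only the 15 most recent messages; older ones fall off the left end.
buffer = deque(maxlen=15)

def remember(message):
    buffer.append(message)

for i in range(20):
    remember({'message': ['session', 'line %d' % i]})

# Convert to a plain list before passing it to code that expects one
# (for example, a JSON encoder).
recent = list(buffer)
```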

The view is plugged into your site like any other view:

urlpatterns += patterns('views',
    (r'^socket\.io', 'socketio'),
)

Running the example

Run the example by starting the server:

./run_example.py

Then point your browser to http://localhost:9000/.

If you run the example, you should see the same result as running the gevent-socketio example: a multi-client chatroom. The beauty of greenlet is at play in the line containing "socketio.recv()". This line blocks the greenlet and allows the server to keep processing other requests until a new Socket.IO message is ready to be processed. As soon as a new message is ready, the greenlet is re-awakened and the message is processed.
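The cooperative switching described above can be imitated with plain Python generators. The sketch below is a toy round-robin scheduler, not gevent or greenlet: each yield marks the point where a real greenlet would block inside recv(), and the names client, feeder, and run are hypothetical.

```python
from collections import deque

log = []

def client(name, inbox):
    """Handle one client's messages; yield wherever recv() would block."""
    while True:
        while not inbox:
            yield  # nothing to read yet: let other tasks run
        message = inbox.popleft()
        if message is None:  # sentinel meaning "client disconnected"
            return
        log.append((name, message))

def feeder(inboxes, script):
    """Deliver one scripted message per turn, like traffic arriving over time."""
    for name, message in script:
        inboxes[name].append(message)
        yield

def run(tasks):
    """Round-robin scheduler: resume each task until all have finished."""
    pending = deque(tasks)
    while pending:
        task = pending.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished; do not reschedule it
        pending.append(task)

inboxes = {'a': deque(), 'b': deque()}
script = [('a', 'hi'), ('b', 'hello'), ('a', None), ('b', None)]
run([feeder(inboxes, script), client('a', inboxes['a']), client('b', inboxes['b'])])
```

Each client is written as a simple loop, yet neither one stalls the other while it waits. Gevent gives the same effect without the explicit yield, because the switch happens inside the blocking call.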

Note that we can't use our good old friend "./manage.py runserver" for this example. This is because we need to run the SocketIO server, which we import from gevent-socketio. Here is the example runner:

PORT = 9000

import os

os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()

from socketio import SocketIOServer

if __name__ == '__main__':
    print 'Listening on port %s and on port 843 (flash policy server)' % PORT
    SocketIOServer(('', PORT), application, resource="socket.io").serve_forever()

This is all it takes to hook up gevent-socketio to the Django WSGIHandler. It would be easy to turn this into a custom management command if desired.

Further reading

In my next post, I will explain how to scale our chatroom example to multiple web servers using ZeroMQ. Until then, I recommend checking out the following resources:

I would like to extend a special thanks to Jeffrey Gelens and other contributors for writing gevent-websocket and gevent-socketio.

Aug 09

Django Template Tag Namespaces Now Possible

I've been interested in the Django template language for some time now, and I've admired much of its simplicity and extensibility. I even wrote a shell for it (screencast) and a two-phase template renderer. Having spent the time to understand how it works, I've also had my share of ideas on how to improve it (addition of "elif", mathematical operations in variable tags, namespaces). The pony that I've been wanting the most is probably namespaces. There has been talk of adding namespaces to Django templates for quite a while (including a ticket with patches and various discussions on the mailing list (1, 2 and 3)). For years, this concept has sat dormant due to lack of discussion and interest. No pluggable solution had been offered (as far as I know), so I wrote a couple of templatetags that offer namespacing and other features while retaining backwards compatibility and not requiring a fork of Django. This code is available on Github as django-smart-load-tag.

Backwards compatibility

Django's policy is to remain backwards compatible, and the template language is certainly no exception. In order to give the "{% load %}" tag namespacing features, it needed to be extended in a way that allows current assumptions about its behavior to remain the same. In particular, the assumption that all tags will be loaded into the global namespace by default had to stay. This means that, given a template library named "lib1" containing "tag1" and "tag2", the following code must work:

{% load lib1 %}
{% tag1 %}
{% tag2 %}

Current proposals have suggested a backwards-incompatible syntax that assumes namespaces are on by default:

{% load lib1 %}
{% lib1.tag1 %}
{% lib1.tag2 %}

In my implementation, "load" works the same (as in the top example), but has a few keywords that control its behavior. For example, to load a library into a namespace, use "into":

{% load lib1 into lib1 %}
{% lib1.tag1 %}
{% lib1.tag2 %}

Other features

To load a specific tag (optionally renaming it with the "as" keyword):

{% load lib1.tag1 as my_tag %}
{% my_tag %}

Loading from a specific app can be done using the "from" keyword:

{% load lib1 from app1 into app1 %}
{% load lib1 from app2 into app2 %}
{% app1.tag1 %}
{% app2.tag1 %}

To make everybody happy

It has been suggested to write a separate "{% import %}" tag in order to enable namespaces by default while retaining backwards-compatibility with existing Django applications. I've also experimented with the following import syntax, and it's also included in django-smart-load-tag:

{% import lib1 %}
{% lib1.tag1 %}

Its namespace-on-by-default design can be subverted using "* from":

{% import * from lib1 %}
{% tag1 %}

The "as" and "from" keywords are also implemented:

{% import lib1 as my_lib %}
{% my_lib.tag1 %}

{% import lib1 from app1 %}
{% lib1.tag1 %}

Where to go from here

If template tag namespaces are to be accepted as a core part of Django, some discussion will need to take place on what is the most correct solution for moving forward. Your comments here or on the mailing list can make a difference, and with enough contribution from the community, perhaps all my ponies will one day run free.

(Source and documentation available here.)

Jun 06

SSH tip: Automatic Reverse Tunnels for Workflow Simplification

I came up with a trick today to simplify the workflow of certain tasks with SSH. The problem this solves is largely driven by a particular annoyance I had with remote SSH file management. Consider the following workflow for locating a remote file and copying it back to the local machine:

# Find the remote file:
local:~$ ssh remote
remote:~$ find . -name 'file.tar.gz'
/home/cody/somewhere/deep/file.tar.gz

# Ah, there it is! Now I want to copy it back to my local machine.
# Pop open a new shell to my local machine in a separate tab and copy the file with scp:
local:~$ scp remote:/home/cody/somewhere/deep/file.tar.gz .

The reason I have to run scp on my local machine is that it is behind a firewall, so the remote machine can't see it directly. Ideally, I could just run "scp file.tar.gz local:" from the remote machine, but SSH wouldn't be able to resolve my home computer's IP address or be able to get to the port, as I am almost always behind a firewall. This workflow also breaks down when you need to run "sudo" on the remote machine to access the files. In this case, you might feel forced to create a temporary tarball remotely using sudo before copying it to the local machine.

There is a way to get around this limitation with DNS/firewalls, and the answer is by using SSH reverse tunnels to bind your local SSH daemon port to a remote SSH alias. This can be accomplished in a simple manner by modifying your ~/.ssh/config on both the local and remote computers.

On the local computer, the SSH configuration entry for the remote host should be given a RemoteForward option, specifying that localhost:22 (the local ssh daemon) should be tunnelled to remote port 8022:

Host remote
HostName remote.example.com
RemoteForward 8022 localhost:22

On the remote computer, an alias to the local computer is created, binding itself to the local SSH daemon:

Host local
HostName 127.0.0.1
Port 8022

Now, the local machine is transparently accessible from the remote machine, replacing the above workflow with something a little smoother:

# Find the remote file:
local:~$ ssh remote
remote:~$ find . -name 'file.tar.gz'
/home/cody/somewhere/deep/file.tar.gz

# Don't go anywhere. Just copy it:
remote:~$ scp /home/cody/somewhere/deep/file.tar.gz local:

Update (6-8-2010): Some of the commenters have expressed security concerns regarding using SSH tunnels in this way. Please see the comments for more info (particularly this one). Thanks to those who point out when I'm wrong!

Feb 17

Screencast - django-template-repl

Django-template-repl is a unique project aimed at providing debugging tools for the Django template language. I did a screencast to highlight its features.

In this video, I describe how to use Django-template-repl's management shell, template tag, and context-capturing features.

Please grab the source or simply run:

pip install django-template-repl

Feb 03

Thread-Safe Object-Oriented Views in Django

An emerging design pattern in Django is the usage of class-based views. Writing views as classes is made possible by Python features that allow classes and objects to behave as callables, just like functions. They can help organize view code and promote reusability by offering a greater level of customization. However, callable objects sometimes have thread safety issues that developers are often not aware of.

Update 2/2/11: This post is now mostly irrelevant, because Django will soon solve this problem with the newly-refactored generic view module in Django 1.3, which is due for release this month. Since I wrote this post a year ago, massive arguments over the design of class-based views erupted on the mailing list. It definitely got out of hand, but a lot was learned, and I'm happy with the solution that was committed. I still encourage people to understand how to avoid threading side-effects. I would also suggest reading the new documentation on class-based generic views.


The most common approach to class-based views is to create a callable object, a class that has been written with a __call__ method, therefore making instances of the class callable. The view is instantiated either as a module-level variable in the views file or in the urlconf. Jacob Kaplan-Moss has written a series of class-based generic views that follow this model.

The Problem

When writing persistent class-based views, you must be careful not to introduce stateful information in your object. From what I can tell, Jacob's implementation seems thread-safe, because the view's state appears to only be altered on initialization. The problem arises when you store request-specific variables on the view object. Because the object is only instantiated once per Python process, it persists while running multiple HTTP requests, for the life of the process. Stateful information can cause side effects including security problems.

Here is a simple example illustrating this effect:

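from django.http import HttpResponse

class MyView(object):
    thing = 0  # class attribute; the first "+=" below creates a per-instance copy
    def __call__(self, request):
        self.thing += 1  # survives between requests because my_view is created once
        return HttpResponse('%s' % (self.thing,))

my_view = MyView()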

URL patterns:

urlpatterns = patterns('',
    url(r'^my_view/$', 'my_app.views.my_view', name='my_view'),
)

Every time you refresh your browser, you will see a number incrementing. Even in single-threaded (prefork) environments, this bug is present. Multi-threaded environments are susceptible to even scarier problems with object state. For example, you might be tempted to set the request object as an attribute of the view object:

class MyView(object):
    def __call__(self, request):
        self.request = request
        return self.create_response()

    def create_response(self):
        return HttpResponse('Welcome, %s' % (self.request.user,))

my_view = MyView()

If a view is called twice at roughly the same time in two threads, the request object may be incorrect in one of the threads, because the view object is shared between the threads. This bug might manifest rarely, but it would be difficult to track down and could conceivably be a security problem.
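The interleaving described above can be reproduced deterministically without Django. The sketch below is an illustration under assumptions: SharedView is a hypothetical plain-Python stand-in for the view, strings stand in for request objects, and two threading.Event objects force the unlucky ordering that would normally happen only occasionally.

```python
import threading

class SharedView(object):
    """Plain-Python stand-in for a view object shared between threads."""
    def __call__(self, request, pause=None):
        self.request = request  # request-specific state on a shared object
        if pause is not None:
            pause()             # simulate slow work before the response is built
        return 'Welcome, %s' % (self.request,)

view = SharedView()
alice_stored = threading.Event()
bob_done = threading.Event()
results = {}

def slow_work():
    alice_stored.set()  # alice's request is now stored on the shared view
    bob_done.wait()     # bob's request is handled before alice's finishes

def handle_alice():
    results['alice'] = view('alice', pause=slow_work)

def handle_bob():
    alice_stored.wait()
    results['bob'] = view('bob')
    bob_done.set()

threads = [threading.Thread(target=handle_alice),
           threading.Thread(target=handle_bob)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After both threads finish, results['alice'] holds the greeting meant for bob, which is exactly the kind of leak that makes this bug a security concern.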

Reducing shared state

If you are having thread safety issues because of class-based views, the first thing you can do is enable a prefork worker module in your web server. This causes greater memory usage, but it creates new processes instead of threads to handle concurrency. This is only a workaround, however, and there are ways to maintain thread-safe code using class-based views. A quick fix for a code base that already suffers from this bug would be to drop in a __new__ method that handles the creation of new view instances per request. This should be as simple as adding the following to your view class:

def __new__(cls, *args_static, **kwargs_static):
    def view_wrapper(request, *args, **kwargs):
        view = object.__new__(cls)
        view.__init__(*args_static, **kwargs_static)
        return view(request, *args, **kwargs)
    return view_wrapper

If this is added to MyView above, it magically becomes thread-safe because the view is wrapped in such a way that every time it gets called, a new MyView instance is created for the request.
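The effect is easy to check outside Django. In this sketch, CountingView is a hypothetical stand-in for MyView that returns a string instead of an HttpResponse, and the request arguments are placeholder strings.

```python
class CountingView(object):
    thing = 0

    def __new__(cls, *args_static, **kwargs_static):
        def view_wrapper(request, *args, **kwargs):
            # Build a fresh instance for every call.
            view = object.__new__(cls)
            view.__init__(*args_static, **kwargs_static)
            return view(request, *args, **kwargs)
        return view_wrapper

    def __call__(self, request):
        self.thing += 1
        return str(self.thing)

# "Instantiating" the class now returns the wrapper function, not an instance.
my_view = CountingView()
responses = [my_view('request-1'), my_view('request-2'), my_view('request-3')]
```

Every call sees a counter that starts from zero, so no state leaks from one request to the next.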

This method may be useful for existing apps, but a better designed stateful class-based view might drop the usage of __call__ altogether in favor of using a class (not an instance) as the view itself.

Michael Malone suggested in this Django ticket using __init__ as an alternative to __call__. By subclassing HttpResponse, you can treat the class as a view because calling the class creates an HttpResponse object. This creates a new view object for every request coming in. His suggestion did not appeal to me because you lose control over the resulting HttpResponse object and it's not possible to instantiate the view without creating an HttpResponse object. An alternative I've come up with is to override __new__ to return an HttpResponse.

Creating the response in __new__

Remember that any callable can be used as a view. This can mean functions, callable objects, and even classes. When you call a class, its __new__ method is called and returns something, usually an instance of that class. However, it can be anything you want, including an HttpResponse object. A new approach to thread-safe class-based views might be something like this:

class BaseView(object):
    def __new__(cls, request, *args, **kwargs):
        view = cls.new(request, *args, **kwargs)
        return view.create_response()

    @classmethod
    def new(cls, *args, **kwargs):
        obj = object.__new__(cls)
        obj.__init__(*args, **kwargs)
        return obj

    def __init__(self, request, *args, **kwargs):
        raise NotImplementedError()

    def create_response(self):
        raise NotImplementedError()

class MyView(BaseView):
    template = 'path/to/template.html'

    def __init__(self, request):
        self.request = request

    def create_response(self):
        return render_response(self.request, self.template, {'title': 'Home'})

This would be added to your URLPatterns directly (no need to instantiate it):

urlpatterns = patterns('',
    url(r'^my_view/$', 'my_app.views.MyView', name='my_view'),
)

A key difference is that the view is not a subclass of HttpResponse, but when you attempt to instantiate it, it will create an HttpResponse. In case you want to create a view object for testing or otherwise, the BaseView class has a factory method called "new" that makes new view instances.
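The difference between calling the class and calling the factory can be seen in a Django-free sketch. FakeResponse and HelloView are hypothetical stand-ins for HttpResponse and a real view, and the string 'alice' stands in for a request object.

```python
class FakeResponse(object):
    """Hypothetical stand-in for django.http.HttpResponse."""
    def __init__(self, content):
        self.content = content


class BaseView(object):
    def __new__(cls, request, *args, **kwargs):
        view = cls.new(request, *args, **kwargs)
        return view.create_response()

    @classmethod
    def new(cls, *args, **kwargs):
        obj = object.__new__(cls)
        obj.__init__(*args, **kwargs)
        return obj


class HelloView(BaseView):
    def __init__(self, request):
        self.request = request

    def create_response(self):
        return FakeResponse('Hello, %s' % (self.request,))


response = HelloView('alice')  # calling the class yields a response
view = HelloView.new('alice')  # the factory yields a view instance
```

Because __new__ returns something that is not an instance of the class, Python does not call __init__ a second time, so the view is initialized exactly once per request.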

The main thing to get out of all of this is that you need to be careful and aware of shared state whenever it exists. I prefer to minimize the possibility of side effects by avoiding module-level variables, singletons, and globals as they are almost always the road to hell. But if you do have a persistent view object, make sure that you don't store request-specific data on it.

Jan 17

Evaluating Django Caching Options

Caching is one of the first things you can do when you need to start thinking about scaling. Among efforts such as query minimization, denormalization, code optimizations, compression, database tuning, indexing, and load balancing, caching remains one of the lowest hanging fruits in methods to lighten your server load and handle huge amounts of traffic. There are many options, and I chose to evaluate a few of the most interesting setups.

This is not intended to be a rigorously scientific test, but more of a first impression of the different caching systems. For all the tests I'm describing, I'm using a single VPS on Rackspace Cloud with 320MB of RAM, a quad-core AMD Opteron 2350HE, and a bleeding edge server stack using Ubuntu Server 9.10, NGINX with UWSGI, Python 2.6, Django 1.1, and PostgreSQL 8.4. I'm serving the home page view of Django-Mingus, which provides a realistic amount of complexity to the Python side of things and gives us a 9387 byte response. I'm using 4 UWSGI processes and a single NGINX worker. All my tests are using ApacheBench, which I'm running on the same machine. Note that for all my cache tests I'm prepopulating the cache before running the benchmark. Here are the different setups I'm going to evaluate:

  1. No caching whatsoever.
  2. Django's template caching templatetag.
  3. Django's two-part caching middleware.
  4. NGINX Memcached module.
  5. On-disk caching with Django-staticgenerator.
  6. Varnish as front-end load-balancing cache.

No Caching

For any content-driven website, this is probably the worst idea of them all, and as you'll find out, it is trivial to implement most of the above caching strategies. Clearly, my single server arrangement is not going to be representative of your large app server cluster, so I urge you to evaluate all the options if you are anticipating scaling. Finding the right recipe for your server setup is going to be the fun part.

For the purpose of establishing a baseline, I ran ApacheBench on my setup with no caching turned on. I'm running 10 concurrent requests for a total of 1000 requests using the following ApacheBench command:

ab -n 1000 -c 10 <server-name>

Here's a snipped version of the results:

Concurrency Level:      10
Time taken for tests:   68.619 seconds
Sent requests:          1000
Completed requests:     1000
Failed requests:        0
Total transferred:      9660000 bytes
HTML transferred:       9387000 bytes
Requests per second:    14.5732231597662
Transfer rate:          141.610400362873 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.12    10
Response:     309  681.65  1330

It's probably possible to tune this for shorter latency, but we got the main number we were looking for; we can push 14.57 requests/second without a cache. Not bad, until you get Slashdotted!

Django's template caching templatetag

Django provides an easy way to cache parts of your template using the "cache" template tag. Here is an example of usage:

{% load cache %}
{% cache 500 sidebar %}
    This goes into cache.
{% endcache %}

Django-Mingus makes good use of the cache template tag in the default templates. In this test, I enabled Memcache in Django and removed view caching so I could get an idea how segment caching affects performance. This page benefits from 10 template cache hits and 4 other Memcache hits used in some of Mingus's apps.

Concurrency Level:      10
Time taken for tests:   26.19 seconds
Sent requests:          1000
Completed requests:     1000
Failed requests:        0
Total transferred:      9479000 bytes
HTML transferred:       9387000 bytes
Requests per second:    38.1825124093165
Transfer rate:          353.449253054601 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.29    10
Response:      90  260.61   490

Enabling templatetag caching has given a significant speed boost to 38.18 requests/second. This is 2.62 times the uncached throughput, a 162% improvement over no cache. Response time is also improved, down from 682ms to an acceptable 261ms. Good, but there's still a lot of room for improvement.

The increase is modest compared with full-page caching, but that shouldn't deter you from using the tag, as template caching has the benefit that one segment can be cached and used across multiple pages (for example, a sidebar that is the same on different parts of the site).

Django's two-part caching middleware

Django comes equipped with middleware that provides frontend proxy-style full page caching with almost no configuration. Full page caching is clearly where you're going to find the greatest benefits. Something like Squid, Varnish, or NGINX is better suited for this job, but the ease of setup makes this middleware useful for environments where a minimal amount of complexity is desired. Because of the greater performance, I'm running 10,000 requests instead of 1,000 to get a better sample.

Concurrency Level:      10
Time taken for tests:   9.07 seconds
Sent requests:          10000
Completed requests:     10000
Failed requests:        0
Total transferred:      130040000 bytes
HTML transferred:       127560000 bytes
Requests per second:    1102.53583241455
Transfer rate:          14001.3437155458 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.15    10
Response:       0    9.02   470

This is about as fast as Django's going to run on this hardware without a more sophisticated caching proxy. We've revved Django's internal caching to give us 1103 requests/second, over 75 times as many as we had with no caching. However, we're still passing every request into Python, which gives us limits we cannot avoid without moving the caching layer into the frontend server. For this we'll need to explore NGINX or Varnish.

NGINX's Memcached module

NGINX has a very nice caching feature that most servers lack: it can serve an HTML document directly from Memcached without ever touching your Python code. Since we are already using NGINX, enabling the Memcached HTTP caching module was a trivial task. For this test, I will disable Django's caching middleware and add a custom cache update middleware that sets a cache key that NGINX can be configured to read. I used a modified version of the middleware from Oliver Weichold's blog post on using Django with NGINX+Memcached. Enabling the module in NGINX config was just adding a new location directive for Memcached and assigning the web app as a 404 handler for that location:

Before:

location / {
    uwsgi_pass  unix:///tmp/mingus.sock;
    include     uwsgi_params;
}

After:

location / {
    default_type  text/html;
    set $memcached_key nginx.$request_uri;
    memcached_pass 127.0.0.1:11211;
    error_page 404 = @cache_miss;
}
location @cache_miss {
    uwsgi_pass  unix:///tmp/mingus.sock;
    include     uwsgi_params;
}
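The cache-update middleware itself is not reproduced in this post. As a rough sketch of the idea only, and not the middleware from the blog post referenced above, the class below stores each successful GET response under the same "nginx." plus request URI key that the NGINX configuration reads. NginxCacheUpdater, the injected cache argument, the 300 second timeout, and the Fake/Dict helpers are all hypothetical, and the sketch assumes the cache client writes the key to Memcached unmodified.

```python
class NginxCacheUpdater(object):
    """Store rendered pages under the key NGINX looks up ("nginx." + URI)."""

    def __init__(self, cache, timeout=300):
        self.cache = cache      # any object with set(key, value, timeout)
        self.timeout = timeout  # seconds; 300 is an arbitrary example value

    def process_response(self, request, response):
        # Only cache successful GET responses.
        if request.method == 'GET' and response.status_code == 200:
            key = 'nginx.' + request.get_full_path()
            self.cache.set(key, response.content, self.timeout)
        return response


# Minimal stand-ins so the sketch runs without Django or Memcached.
class FakeRequest(object):
    def __init__(self, method, path):
        self.method = method
        self.path = path

    def get_full_path(self):
        return self.path


class FakeResponse(object):
    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content


class DictCache(dict):
    def set(self, key, value, timeout):
        self[key] = value  # the timeout is ignored in this fake


cache = DictCache()
updater = NginxCacheUpdater(cache)
updater.process_response(FakeRequest('GET', '/'), FakeResponse(200, b'<html>home</html>'))
updater.process_response(FakeRequest('POST', '/comment/'), FakeResponse(200, b'ok'))
```

On a cache miss NGINX falls through to the application, the middleware fills the key, and the next request for the same URI is served straight from Memcached.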

Running the same benchmark as above, here are my results:

Concurrency Level:      10
Time taken for tests:   3.699 seconds
Sent requests:          10000
Completed requests:     10000
Failed requests:        0
Total transferred:      130640000 bytes
HTML transferred:       129190000 bytes
Requests per second:    2703.43336036767
Transfer rate:          34489.8959178156 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.36    30
Response:       0    3.66   109

Now we're getting serious! I was serving 2703 requests/second through memcache on my VPS. Now we're in Slashdotting territory. This is over 185 times as fast as vanilla Django. The important thing to note here is that we're accomplishing the same thing as Django's built-in two-part caching middleware, but now we are doing it 2.5 times faster.

On-disk caching with django-staticgenerator

Another approach is to use on-disk caching techniques to serve static files. This is made possible with django-staticgenerator, which has middleware that generates flat files that NGINX can serve directly. It was simple to set up, and here are my results:

Concurrency Level:      10
Time taken for tests:   2.78 seconds
Sent requests:          10000
Completed requests:     10000
Failed requests:        0
Total transferred:      131320000 bytes
HTML transferred:       129190000 bytes
Requests per second:    3597.12230215827
Transfer rate:          46130.2832733813 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.67    90
Response:       0    2.66   190

Now we're rocking 3597 requests/second. NGINX can serve static files like nobody's business.

Varnish

Varnish is a very powerful load balancing caching proxy that is made for heavy traffic. I'm configuring it as an HTTP proxy to my NGINX server to see how it stacks up.

Concurrency Level:      10
Time taken for tests:   2.76 seconds
Sent requests:          10000
Completed requests:     10000
Failed requests:        0
Total transferred:      131230000 bytes
HTML transferred:       129190000 bytes
Requests per second:    3623.1884057971
Transfer rate:          46432.716259058 kb/s received

Connnection Times (ms)
              min     avg   max
Connect:        0    0.60    20
Response:       0    2.74    90

Varnish is very competitive in raw speed, serving 3623 requests/second, an impressive number, nearly 250 times higher than with no cache. Varnish is also very configurable and built for extremely high traffic.
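The throughput figures reported in the sections above can be put side by side with a few lines of Python. The requests-per-second values are copied from the benchmark output (rounded to two decimals); the variable names and labels are only for illustration.

```python
# Requests per second as reported in each section of this post.
throughput = [
    ('no cache', 14.57),
    ('template tag', 38.18),
    ('caching middleware', 1102.54),
    ('NGINX + Memcached', 2703.43),
    ('staticgenerator', 3597.12),
    ('Varnish', 3623.19),
]

baseline = throughput[0][1]
# Speedup is each setup's throughput divided by the uncached throughput.
speedup = dict((name, rps / baseline) for name, rps in throughput)

for name, rps in throughput:
    print('%-20s %8.2f req/s  %6.1fx' % (name, rps, speedup[name]))
```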

Conclusion

Every scaling problem has its own variables that can greatly affect the types of decisions that need to be made to implement a good caching strategy. For example, a multi-server setup is likely to behave much differently given the same benchmarks. There are also more complicated factors such as how to treat logged-in users and cookies. There are workarounds for cookie hashing problems (such as removing the "Vary: Cookie" response header) that can add complexity to certain environments, so there is more to consider than raw performance.

Also make note that not all of these are mutually exclusive. A good combination of caching might be internal template caching plus either Varnish or NGINX acting as a frontend cache. My best suggestion is to experiment and see what works best for your environment, and I hope this post was helpful for summarizing your options.

Dec 13

Django Template Debugging Made Easier with django-template-repl

When working with the Django template language, specifically writing template tags or trying out new template tags that are not well documented, it's easy to fall into a slow testing loop: modify your code, save the file, wait for runserver to restart (which can take some time for large projects), switch to your web browser, hit reload, and view the results. This workflow can be repetitive and unproductive. I decided to improve template interpreter interactivity by writing a REPL for it, and I released the project as django-template-repl, which is freely available on Github and PyPI. I was surprised and happy to see it was well accepted judging from the Twitter chatter and Github statistics, so I wrote this to better explain how to use it.

One of the greatest advantages of Python, Lisp, and other programming languages is the ease of debugging and understanding code behavior provided by REPLs. A REPL, which stands for Read-Eval-Print Loop, is a shell that gives you an interactive command line session with your language interpreter. The two REPLs most Django users should be familiar with are ipython and (i)pdb. These tools are incredibly helpful and really boost the usability of Python. I spend more time inside ipdb than I do reading output in a web browser. django-template-repl provides this type of tool for the Django template language.

Invoking the REPL

There are two ways to invoke the REPL. One uses a management command and the other uses a template tag. Run "python manage.py templateshell" to open the REPL:

$ python manage.py templateshell
>>> test
test
>>> {% if 1 %}
... It is true!
... {% else %}
... It is false!
... {% endif %}
It is true!

This behaves almost exactly like a python shell. It uses the readline library, so it handles command history. It also detects when you are inside a block tag, giving you decent multi-line support.
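As a rough illustration of how a shell can decide whether to show the "..." continuation prompt, here is a minimal sketch that counts unclosed block tags. It is not taken from django-template-repl; the BLOCK_TAGS set and the needs_more_input name are assumptions made for this example, and a real implementation would have to know about every registered block tag.

```python
import re

# Assumed subset of tags that require a matching {% end... %} tag.
BLOCK_TAGS = {'if', 'for', 'with', 'block', 'comment'}
TAG_RE = re.compile(r'{%\s*(\w+)')

def needs_more_input(source):
    """Return True while some block tag in `source` is still unclosed."""
    depth = 0
    for name in TAG_RE.findall(source):
        if name in BLOCK_TAGS:
            depth += 1
        elif name.startswith('end') and name[3:] in BLOCK_TAGS:
            depth -= 1
    return depth > 0

print(needs_more_input('{% if 1 %}'))                            # True
print(needs_more_input('{% if 1 %}yes{% else %}no{% endif %}'))  # False
```

Tags such as {% else %} neither open nor close a block, so they leave the depth unchanged.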

Providing context

Context can be provided in a number of ways. You can pass a context dictionary on the command line, for example:

$ python manage.py templateshell --context "{'testvar': 'this is a string'}"
>>> {{ testvar }}
this is a string

Note that this actually runs eval() on the context string. That's the first time I've ever used eval(), but it seems not so hacky for this purpose.
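If evaluating arbitrary expressions is a concern, the standard library's ast.literal_eval is one possible alternative for this kind of input. The sketch below is not what django-template-repl does (it uses eval(), as noted above); parse_context is a hypothetical helper name, and it assumes the context is written as a plain Python literal.

```python
import ast

def parse_context(context_string):
    """Parse a --context style string into a dict without running code.

    Hypothetical helper: only Python literals (strings, numbers, lists,
    dicts, ...) are accepted; names and function calls raise ValueError.
    """
    context = ast.literal_eval(context_string)
    if not isinstance(context, dict):
        raise ValueError('context must be a dictionary literal')
    return context

context = parse_context("{'testvar': 'this is a string'}")
print(context['testvar'])  # this is a string
```

The trade-off is that a context built from model instances or other non-literal objects cannot be expressed this way, which is where eval() remains more flexible.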

Context can also be extracted from a specific URL in your project. This is accomplished by using Django's test client to capture context used in a page and bootstrap a REPL with the captured context. For example:

$ python manage.py templateshell --url /admin/
>>> {{ title }}
Log in
>>> {{ user }}
AnonymousUser

If you need to capture context at a specific place in your page, like in a for loop, you can use the {% repl %} template tag. This will halt the template rendering process at a specific point and replace your runserver shell with a template REPL loaded with the context used in the calling template.

Use the same tools with PDB

The management command and template tag give you nice ways to capture context, so I've made it possible to use pdb or ipdb with them. The management command has an option, --pdb. If used in conjunction with --url or --context, you can load context into a PDB shell. For example:

$ python manage.py templateshell --url /admin/ --pdb
...
ipdb> vars
Out[0]: 
['app_path',
 'error_message',
 'root_path',
 'title',
 'MEDIA_URL',
 'LANGUAGES',
 'LANGUAGE_BIDI',
 'LANGUAGE_CODE',
 'perms',
 'messages',
 'user']
ipdb> title
Out[0]: u'Log in'
ipdb> user
Out[0]: <django.contrib.auth.models.AnonymousUser object at 0x1019d7490>

This functionality is also possible with the template tag, using {% repl pdb %}.

I do hope these tools help somebody out. I find that they make it much easier to debug issues with template context and to try new template tags. If you want to try it out, just run pip install django-template-repl and add 'template_repl' to your installed apps.

Dec 13

New Blog

I dropped my old styles and went with Mingus.

I set up a new blog today with django-mingus and so far I'm pretty pleased. Basically, I had some missing features on my blog and I decided to take the lazy route and use this instead of building everything from scratch. It's still Django, so I can hack it up all I want, so I think I made the right choice.

Props to Kevin Fricovsky for releasing this project and to the authors of the apps it uses! It makes setting up a Django blog so much easier.

It's still using the stock basic CSS but I will probably make some changes to it eventually, when I have time.

One thing to note is that I went with Disqus for comments, so my old comments are now gone. Sorry to the few commenters I've had in the past, but I think this will be better for the future.

Oct 31

Building Search into Pip

I've begun using pip frequently for working with python packages in my virtualenvs. pip does a great job with installing (and now uninstalling) python packages, but it currently lacks a command that I miss from other types of package managers, such as aptitude and port -- search.

I've forked a Git mirror of pip on Github for the purpose of tooling it with a search command, and I've made some progress in the last few days getting it working. It provides a search command which can search PyPI directly or a local search index. It displays the installation status of the package (i=installed, n=not installed) along with the name and summary.

$ pip search django
n BabelDjango               - Utilities for using Babel in Django
n blogango                  - A django based blog
n Blogmaker                 - Blog application for Django
n boletin                   - Newsletter generation and sending application for Django
n Camelot                   - A python GUI framework on top of  Sqlalchemy  and PyQt, inspired by the Django admin interface. Start building desktop applications at warp speed, simply by adding some additional information to you model definition.
n chishop                   - Simple PyPI server written in Django.
...

To try out this feature, check out my branch on Github.

Update (6/6/2010): I sprinted on this feature at Pycon 2010, and it has been merged into pip and has been available since the 0.7 release. Thanks to Ian Bicking and Carl Meyer for helping make this happen.

Sep 09

Introduction to Surlex Slides from Djangocon 2009

I have uploaded my slides from my lightning talk on Surlex at Djangocon 2009.

Here is the link: http://media.codysoyland.com/pdf/surlex-presentation.pdf.

Sep 06

Introduction to Surlex

Surlex (Simple URL Expression) is a language for matching and extracting text, similar to regex (regular expression). It works by generating a regular expression from a more concise syntax that is designed to produce easy-to-read url patterns.

Installation

Surlex is indexed in PyPI, so it can be installed via pip or easy_install:

pip install surlex

Alternatively, the source code repository is available via git:

git clone git://github.com/codysoyland/surlex.git

Basic Capturing

A capture is performed by enclosing a variable name inside angle brackets:

<name>

This is equivalent to the following regex:

(?P<name>.+)

A usage example for matching a blog index view is:

/blog/<year>/

This is equivalent to the following regex:

/blog/(?P<year>.+)/

This surlex would match the following URLs:

/blog/2008/
/blog/2009/

To invoke the capture in our source code we can use the Surlex class or the match function:

from surlex import Surlex, match

surlex = Surlex('/blog/<year>/') # returns surlex object
surlex.match('/blog/2009/') # returns {u'year': '2009'}

match('/blog/<year>/', '/blog/2009/') # returns {u'year': '2009'}
match('/blog/<year>/<month>/', '/blog/2009/09/') # returns {u'year': '2009', u'month': '09'}
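To see what these captures do without installing surlex, the regex equivalents quoted above can be tried directly with Python's re module. This is only a sketch using plain re.match; it makes no claim about how surlex compiles or anchors its patterns internally.

```python
import re

# Regex equivalents of '/blog/<year>/' and '/blog/<year>/<month>/'.
year_re = re.compile(r'/blog/(?P<year>.+)/')
year_month_re = re.compile(r'/blog/(?P<year>.+)/(?P<month>.+)/')

single = year_re.match('/blog/2009/').groupdict()
double = year_month_re.match('/blog/2009/09/').groupdict()
# ".+" is greedy and also matches slashes, so a longer URL is swallowed whole.
greedy = year_re.match('/blog/2009/09/').groupdict()

print(single)  # {'year': '2009'}
print(double)  # {'year': '2009', 'month': '09'}
print(greedy)  # {'year': '2009/09'}
```

The last result shows why a bare capture can be too permissive for URLs, which is one reason to constrain captures with a regex or a macro.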

Advanced Capturing with Embedded Regex

Surlex allows you to insert custom regex strings inside your surlex by postfixing the capture name with the equals sign (=) followed by a regular expression:

from surlex import surlex_to_regex
surlex_to_regex('<number=[0-9]+>') # returns (?P<number>[0-9]+)

The preceding surlex will match "2009", but not "these-letters."
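The generated regex can be checked on its own with the re module. This is a small sketch that uses the regex string quoted above rather than calling surlex.

```python
import re

# Regex produced for '<number=[0-9]+>' according to the example above.
number_re = re.compile(r'(?P<number>[0-9]+)')

digits = number_re.match('2009')
letters = number_re.match('these-letters')

print(digits.group('number'))  # 2009
print(letters)                 # None
```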

If a right angle bracket is needed inside the regex, it may be escaped with a backslash.

Surlex Macros for Common Tasks

Surlex includes a number of built-in macros which ease common tasks such as matching a slug or date string. They can be invoked by postfixing the capture name with a colon (:) followed by the name of the macro. One of the built-in macros matches a url slug and is identified by the letter "s." A usage example:

/blog/<year>/<month>/<slug:s>/

This will produce the following regular expression:

/blog/(?P<year>.+)/(?P<month>.+)/(?P<slug>[\w-]+)/

Here are all the built-in macros:

Macro  Regex Equivalent                                   Description
Y      \d{4}                                              year, including century
y      \d{2}                                              year, not including century
M      (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)  month, abbreviated
m      \d{1,2}                                            month, 1 or 2 digit
d      \d{1,2}                                            day, 1 or 2 digit
#      \d+                                                number, any length
s      [\w-]+                                             slug

A more complete example of a blog entry url might be defined as:

/blog/<year:Y>/<month:M>/<day:d>/<slug:s>/

This will produce the following regex:

/blog/(?P<year>\d{4})/(?P<month>(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))/(?P<day>\d{1,2})/(?P<slug>[\w-]+)/

If matched against "/blog/2009/sep/06/surlex-released/", it will produce the following dictionary:

{
        u'year': '2009',
        u'month': 'sep',
        u'day': '06',
        u'slug': 'surlex-released',
}
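The macro expansion can be pictured as a simple substitution. The following is a toy sketch, not surlex's actual parser: toy_surlex_to_regex is a made-up name, it only understands <name> and <name:macro>, and the macro table is copied from the list above.

```python
import re

# Macro table copied from the built-in macro list.
MACROS = {
    'Y': r'\d{4}',
    'y': r'\d{2}',
    'M': r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)',
    'm': r'\d{1,2}',
    'd': r'\d{1,2}',
    '#': r'\d+',
    's': r'[\w-]+',
}

# Matches <name> or <name:X>, where X is a one-character macro name.
CAPTURE_RE = re.compile(r'<(\w+)(?::(.))?>')

def toy_surlex_to_regex(pattern):
    """Expand <name> and <name:macro>; all other text is copied as-is."""
    def expand(match):
        name, macro = match.groups()
        body = MACROS[macro] if macro else '.+'
        return '(?P<%s>%s)' % (name, body)
    return CAPTURE_RE.sub(expand, pattern)

regex = toy_surlex_to_regex('/blog/<year:Y>/<month:M>/<day:d>/<slug:s>/')
result = re.match(regex, '/blog/2009/sep/06/surlex-released/').groupdict()
print(regex)
print(result)
```

Under these assumptions the printed regex is the same string as the one shown above, and the dictionary holds the same four captured values.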

Custom Macros

Custom surlex macros can be registered with the register_macro function. Here, we will register a macro to match an email address:

from surlex import register_macro, surlex_to_regex
register_macro('e', r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b')
surlex_to_regex('/email-to/<:e>/') # returns /email-to/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/
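One detail worth checking with this particular expression: its character classes only list uppercase letters, so whether a lowercase address matches depends on the flags used when the regex is compiled. The sketch below uses plain re with a placeholder address; it does not say which flags surlex itself applies.

```python
import re

EMAIL = r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
URL_RE = '/email-to/' + EMAIL + '/'

upper = re.match(URL_RE, '/email-to/USER@EXAMPLE.COM/')
lower = re.match(URL_RE, '/email-to/user@example.com/')
lower_ignorecase = re.match(URL_RE, '/email-to/user@example.com/',
                            re.IGNORECASE)

print(upper is not None)             # True
print(lower is not None)             # False
print(lower_ignorecase is not None)  # True
```

If case-insensitive matching is not available, adding a-z to each character class gives the same effect.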

Regex and Macro Matching without Capturing

If the name of the capture block is omitted, surlex will produce a regular expression that will match the given regex or macro without capturing, for example:

surlex_to_regex('<=[0-9]+>') # returns [0-9]+
surlex_to_regex('<:s>') # returns [\w-]+

Optional Strings

If a section of a surlex is enclosed in parentheses, it is considered optional and will allow the pattern to match even if it is not in the subject string. It is equivalent to enclosing an item in ()? in regex. For example:

surlex_to_regex('/blog/(<year>/)') # returns /blog/((?P<year>.+)/)?

The preceding surlex will match "/blog/" and "/blog/2009/".
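Trying the generated regex with the re module shows what happens to the capture when the optional part is missing. This is a sketch with plain re.match, independent of surlex.

```python
import re

# Regex produced for '/blog/(<year>/)' according to the example above.
optional_re = re.compile(r'/blog/((?P<year>.+)/)?')

without_year = optional_re.match('/blog/')
with_year = optional_re.match('/blog/2009/')

print(without_year.group('year'))  # None
print(with_year.group('year'))     # 2009
```

Both URLs match, and code consuming the result has to be prepared for the capture to be None.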

Wildcards

Standard wildcards can be used with the asterisk (*):

surlex_to_regex('/*.*') # returns "/.*\..*", matches "/robots.txt", "/anything.html", etc

Matching Beginning and End of Strings

The caret (^) and dollar ($) characters are preserved in the outputted regex, so they can still be used to match the start and end of strings:

surlex_to_regex(r'^only-this$') # returns '^only-this$'

Django Integration

Surlex can be used inside Django URLPatterns by importing the helper function and writing a standard urlpattern, substituting the regex for a surlex and the url function with surl:

from surlex.dj import surl
urlpatterns = patterns('',
    surl(r'^blog/<year:Y>/<month:M>/<day:d>/<slug:s>/$', blog.views.post_detail, name='blog_post_detail'),
)

Update (4/12/2010): I noticed a large surge of traffic from Simon Willison's weblog this morning and realized I left out links to the documentation and repository pages. Here is the documentation and you can follow the project on Github.

Sep 05

Serve your FLAC and OGG files to iTunes with Firefly

The Firefly Media Server is a killer app for serving your music collection to DAAP-compatible music clients (including iTunes for Mac/Windows and Rhythmbox for Linux). One of its features that might go unnoticed is its ability to transcode OGG and FLAC compressed files to raw uncompressed sound, bypassing iTunes's incompatibility with those formats.

Getting Started

The current stable Firefly release, version 0.2.4.2, does not scan FLAC metadata, so the resulting database contains "Unknown" for all the artist and song columns. For this reason, we will compile an unstable build, SVN revision 1586. I initially had problems setting up the nightly builds on Ubuntu 9.04, so until a newer version is released that fixes these problems, we will use Ubuntu 8.04. I am using an Ubuntu 8.04 virtual machine that runs under an Ubuntu 9.04 host. My music collection is mounted in the virtual machine using SSHFS.

Mounting your Music Collection

If you are using a virtual machine as I am, you will need to mount your music collection from the host operating system. There are a few options (SSHFS, NFS, Samba...) but we will use SSHFS. To install, run:

sudo aptitude install sshfs

and you will be able to mount your music using a command such as:

sshfs hostos:/media/storage/music /mnt/music

Installing Build Dependencies

Firefly needs a number of packages to compile from source. This should cover all of them:

sudo aptitude install g++ libsqlite3-dev libid3tag0-dev libvorbis-dev libflac-dev flac

Downloading and Compiling Firefly

Nightly Subversion builds can be obtained here.

We will fetch the svn-1586 release and untar it:

wget http://nightlies.fireflymediaserver.org/nightlies/svn-1586/mt-daapd-svn-1586.tar.gz
tar -xvzf mt-daapd-svn-1586.tar.gz

Enter the mt-daapd directory and compile:

cd mt-daapd-svn-1586/
./configure --enable-flac --enable-ogg --enable-sqlite3
make
sudo make install

If you get hung up in the configure step, you are most likely missing a dependency. Please let me know so I can update this article if this happens.

Configuration

After everything is compiled and installed, you will need to configure Firefly. There is an example configuration file in the contrib directory of the mt-daapd tarball. Copy this to /usr/local/etc/:

sudo cp ./contrib/mt-daapd.conf /usr/local/etc/

Edit the file to prepare mt-daapd for first run. For mine, I changed db_type to sqlite3, mp3_dir to my root music directory (such as /mnt/music), and servername to a name of my choosing, and I added the .flac and .ogg extensions to the extensions list. You might also need to change the "runas" configuration. The most important thing to watch here is that the user has read access to the music directory.

After you've modified the configuration file, you can try running Firefly. We will run in foreground mode (with -f) to watch for error messages:

sudo mt-daapd -f

If everything went well, you can now load the web interface, located at http://localhost:3689/. The username is "admin" and the default password is "mt-daapd" (unless changed in the configuration file). Click "Start Scan" and it will begin indexing your music files. This can take a while depending on the size of your music collection. Once it's finished, start up iTunes or Rhythmbox on any computer on the network and you should see your full music collection, ready to play.

Jun 15

Quick Directory Changing in Linux

When working from a terminal, I often find myself typing the same commands over and over. There are a handful of long "cd" commands that I find myself typing regularly. I've found a few ways of speeding up my workflow by using some bash tools and tricks, and now I can usually get where I need to be with just a few keystrokes.

Aliases

Here are a few commands I find myself typing regularly:

cd /usr/lib/python2.6/site-packages
cd /usr/local/src/django
cd ~/projects

One way to speed up your workflow is to alias these commands by adding the following lines to your ~/.bashrc:

alias .sp="cd /usr/lib/python2.6/site-packages"
alias .dj="cd /usr/local/src/django"
alias .p="cd ~/projects"

After restarting bash, you can enter your site-packages folder just by typing ".sp"

Bash Tricks

This can also be done with a function in bash, by adding the following to your ~/.bashrc:

function goto {
    case $1 in
        sp) cd /usr/lib/python2.6/site-packages ;;
        dj) cd /usr/local/src/django ;;
        p) cd ~/projects ;;
    esac
}

This gives you a command called goto, so you can enter your site-packages folder by typing "goto sp"

For more flexibility in your "goto" script, it may be worth it to write it in a more powerful language, such as Python. In this example, we are embedding python code into a bash function:

function goto { $(python <<< "
alias = '$1'
targets = {
    'p': '/home/cody/projects/',
    'dj': '/usr/local/src/django/',
}
try:
    # anything we print gets executed in the current bash environment
    print('cd %s' % targets[alias])
except KeyError:
    print('echo Target not found')
"); }

Because we are working in Python now, we can easily modify our script to be database-backed or use a simple configuration file to store "targets" (aliased directories). Of course, with greater complexity, it would be cleaner to move the python code to a separate file and strip down the bash function to just:

function goto { $(python /home/cody/bin/goto.py $*); }
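As one possible shape for the configuration-file idea, here is a minimal sketch that keeps the targets in a JSON file. The file location ~/.goto.json and the helper names load_targets and goto_command are assumptions made for this example, not part of the original script.

```python
import json
import os

def load_targets(path):
    """Read an alias -> directory mapping from a JSON file.

    Returns an empty dict when the file does not exist yet.
    """
    if not os.path.exists(path):
        return {}
    with open(path) as config_file:
        return json.load(config_file)

def goto_command(alias, targets):
    """Build the shell command that the bash wrapper would execute."""
    if alias in targets:
        return 'cd %s' % targets[alias]
    return 'echo Target not found'

# Hypothetical location; any readable path works.
targets = load_targets(os.path.expanduser('~/.goto.json'))
print(goto_command('dj', targets))
```

A file containing {"dj": "/usr/local/src/django/"} would make "goto dj" print "cd /usr/local/src/django/", which the bash function then executes.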

A Different Solution with Python

This approach works, but is obviously hacky due to the fact that anything sent to stdout in python gets executed in bash. This makes it difficult to use something like optparse to provide arguments and help. Ideally, we could write something in pure python without having to source a bash script to change the working directory, but I have not found a way to do so. You can, however, use exit codes from python to instruct bash on whether to execute or print the output of the python script. This bash script demonstrates the use of exit codes (Note: this also must be sourced with ".", "source" or by placing the source into ~/.bashrc):

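function goto {
    # Capture goto.py's output; its exit code says what to do with it.
    TEMPFILE=$(mktemp)
    python /home/cody/bin/goto.py "$@" > "$TEMPFILE"
    EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 10 ]; then
        # Exit code 10: the output is a command to run in this shell.
        . "$TEMPFILE"
    else
        # Any other exit code: the output is just text to display.
        cat "$TEMPFILE"
    fi
    rm -f "$TEMPFILE"
}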

Now we need to create our python file (goto.py). Note the use of "exit(10)." This is the exit code that gets picked up in our bash script as $?. Here we will use optparse to provide command line argument parsing. Now we will have a help menu (accessible with --help or -h) and a list-display of the available aliases (accessible with -l or --list):

targets = {
    'p': '/home/cody/projects/',
    'dj': '/usr/local/src/django/',
}

import sys
from optparse import OptionParser

def execute_and_exit(command):
    print(command)
    # bash script uses return code as flag to execute command
    sys.exit(10)

parser = OptionParser()
parser.add_option('-l', '--list',
    action='store_true', dest='display_list', default=False,
    help='Display list of targets.')
(options, args) = parser.parse_args()

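if options.display_list:
    for name, path in targets.items():
        # pad each alias to a fixed width so the paths line up
        print('%s: %s%s' % (
            name,
            ''.join(' ' for i in range(5 - len(name))),
            path
        ))
else:
    try:
        alias = args[0]
        execute_and_exit('cd %s' % targets[alias])
    except IndexError:
        # no alias given: show usage instead of a traceback
        parser.print_help()
    except KeyError:
        print('Target not found')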

After moving the bash to a file called ~/bin/goto.sh and the python to a file called ~/bin/goto.py, we have a pretty useful alias program. Now we can do the following:

cody@laptop:~$ source ~/bin/goto.sh #alternatively add this line to your ~/.bashrc
cody@laptop:~$ goto -h
Usage: goto.py [options]

Options:
  -h, --help  show this help message and exit
  -l, --list  Display list of targets.
cody@laptop:~$ goto -l
p:     /home/cody/projects/
dj:    /usr/local/src/django/
cody@laptop:~$ goto dj
cody@laptop:/usr/local/src/django$ goto p
cody@laptop:~/projects$
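The exit-code handshake can also be observed from Python. The sketch below is not part of the goto tool: it starts a stand-in child process that prints a command and exits with code 10, then reads both the output and the exit code the way the bash wrapper does with $?. CHILD and run_child are names invented for this example.

```python
import subprocess
import sys

# Stand-in for goto.py: print a command, then exit with the agreed code.
CHILD = "import sys; print('cd /tmp'); sys.exit(10)"

def run_child(source):
    """Run Python source in a child process; return (stdout, exit code)."""
    process = subprocess.Popen([sys.executable, '-c', source],
                               stdout=subprocess.PIPE,
                               universal_newlines=True)
    output, _ = process.communicate()
    return output.strip(), process.returncode

output, code = run_child(CHILD)
if code == 10:
    print('would execute: %s' % output)
else:
    print(output)
```

Only the parent shell can change its own working directory, which is why the child can do no more than hand back a command and a flag.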

Other Things from Bash

There are also a few shortcuts in bash to re-enter directories that you have been in since you started your bash session. For example, if you are in your projects directory and you need to change to /tmp to do a quick task, you may re-enter your projects directory using:

cd -

This command always takes you back to the last directory you were in.

There are also the little-known shell builtins pushd and popd. These commands build a directory stack, as illustrated by this example:

cody@laptop:/$ pushd .
/ /
cody@laptop:/$ cd ~
cody@laptop:~$ pushd .
~ ~ /
cody@laptop:~$ cd /tmp
cody@laptop:/tmp$ pushd .
/tmp /tmp ~ /
cody@laptop:/tmp$ popd
/tmp ~ /
cody@laptop:/tmp$ popd
~ /
cody@laptop:~$ popd
/
cody@laptop:/$ popd
bash: popd: directory stack empty
cody@laptop:/$

In this example, I move from /, to ~, to /tmp, and use popd to go back to each directory in reverse order.
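The behavior in the transcript can be modeled with a plain list whose first element is the current directory. This is a toy model for illustration only, not how bash implements its directory stack, and it handles only the forms of cd, pushd and popd used above.

```python
class DirStack(object):
    """Toy model of the shell's directory stack; index 0 is the cwd."""

    def __init__(self, cwd):
        self.dirs = [cwd]

    def cd(self, path):
        # A plain cd only replaces the top entry.
        self.dirs[0] = path

    def pushd(self, path):
        # "pushd ." pushes another copy of the current directory.
        if path == '.':
            path = self.dirs[0]
        self.dirs.insert(0, path)
        return ' '.join(self.dirs)

    def popd(self):
        # With only the cwd left there is nothing to go back to.
        if len(self.dirs) == 1:
            raise IndexError('directory stack empty')
        self.dirs.pop(0)
        return ' '.join(self.dirs)

stack = DirStack('/')
print(stack.pushd('.'))  # / /
stack.cd('~')
print(stack.pushd('.'))  # ~ ~ /
stack.cd('/tmp')
print(stack.pushd('.'))  # /tmp /tmp ~ /
print(stack.popd())      # /tmp ~ /
print(stack.popd())      # ~ /
print(stack.popd())      # /
```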

History

In addition, Bash history provides a way to quickly run any command you've run earlier in your session by typing CTRL-R followed by a few unique letters from the command you've typed before. If I had run the command "cd /usr/lib/python2.6/site-packages" several minutes ago, and I need to get back to my site-packages folder, I can type "site" and it will likely find the "cd" command in the Bash history (unless I had more recently typed a command with the word "site" in it, in which case I can hit CTRL-R again to match commands further back in the history).
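The search order can be pictured as a scan from the newest command to the oldest. The following is a simplified sketch of that idea, not readline's implementation; reverse_search and its skip parameter are invented for this example, with skip standing in for extra CTRL-R presses.

```python
def reverse_search(history, fragment, skip=0):
    """Return the most recent command containing `fragment`.

    `skip` mimics pressing CTRL-R again: it skips that many newer
    matches. Returns None when nothing (more) matches.
    """
    matches = [cmd for cmd in reversed(history) if fragment in cmd]
    return matches[skip] if skip < len(matches) else None

history = [
    'cd /usr/lib/python2.6/site-packages',
    'ls',
    'vim ~/projects/site/settings.py',
]

print(reverse_search(history, 'site'))          # vim ~/projects/site/settings.py
print(reverse_search(history, 'site', skip=1))  # cd /usr/lib/python2.6/site-packages
```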

Another bash trick that is very time-saving is to use ALT+. to repeat the last word of the last command typed. For example, say I'm looking for a file, so I run:

ls ~/projects/python/codysoyland.com

After I see the file I'm looking for, now I want to cd to the directory and start working, so I just type "cd " and hit ALT+. to insert the path I typed in the last command.