Sep 06

Introduction to Surlex

Surlex (Simple URL Expression) is a language for matching and extracting text, similar to regex (regular expressions). It works by generating a regular expression from a more concise syntax designed to produce easy-to-read URL patterns.

Installation

Surlex is indexed in PyPI, so it can be installed via pip or easy_install:

pip install surlex

Alternatively, the source code repository is available via git:

git clone git://github.com/codysoyland/surlex.git

Basic Capturing

A capture is performed by enclosing a variable name inside angle brackets:

<name>

This is equivalent to the following regex:

(?P<name>.+)

A usage example for matching a blog index view is:

/blog/<year>/

This is equivalent to the following regex:

/blog/(?P<year>.+)/

This surlex would match the following URLs:

/blog/2008/
/blog/2009/

To invoke the capture in our source code we can use the Surlex class or the match function:

from surlex import Surlex, match

surlex = Surlex('/blog/<year>/') # returns surlex object
surlex.match('/blog/2009/') # returns {u'year': '2009'}

match('/blog/<year>/', '/blog/2009/') # returns {u'year': '2009'}
match('/blog/<year>/<month>/', '/blog/2009/09/') # returns {u'year': '2009', u'month': '09'}
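Since the equivalent regex is given above, its behavior can also be checked with nothing but Python's standard re module. This is a minimal sketch that exercises the regex string only, not Surlex itself, and the variable name is just for illustration. Note that .+ is greedy and also matches slashes, so this particular regex captures a longer path as a single "year" value:

```python
import re

# Regex quoted above as the equivalent of '/blog/<year>/'.
blog_index = re.compile(r'/blog/(?P<year>.+)/')

for url in ('/blog/2008/', '/blog/2009/', '/blog/2009/09/'):
    found = blog_index.match(url)
    # groupdict() returns the named captures as a dictionary.
    print(url, found.groupdict() if found else None)
```

The last URL yields {'year': '2009/09'}, which is one reason to prefer the stricter macros and embedded regexes described below.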

Advanced Capturing with Embedded Regex

Surlex allows you to insert custom regex strings inside your surlex by postfixing the capture name with the equals sign (=) followed by a regular expression:

from surlex import surlex_to_regex
surlex_to_regex('<number=[0-9]+>') # returns (?P<number>[0-9]+)

The preceding surlex will match "2009", but not "these-letters".

If a right angle bracket is needed inside the regex, it may be escaped with a backslash.

Surlex Macros for Common Tasks

Surlex includes a number of built-in macros that ease common tasks such as matching a slug or date string. They can be invoked by postfixing the capture name with a colon (:) followed by the name of the macro. One of the built-in macros matches a URL slug and is identified by the letter "s". A usage example:

/blog/<year>/<month>/<slug:s>/

This will produce the following regular expression:

/blog/(?P<year>.+)/(?P<month>.+)/(?P<slug>[\w-]+)/

Here are all the built-in macros:

Macro  Regex Equivalent                                   Description
Y      \d{4}                                              year, including century
y      \d{2}                                              year, not including century
M      (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)  month, abbreviated
m      \d{1,2}                                            month, 1 or 2 digit
d      \d{1,2}                                            day, 1 or 2 digit
#      \d+                                                number, any length
s      [\w-]+                                             slug
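To make the idea of macro expansion concrete, here is a toy translator built on re.sub. It is not Surlex's implementation: the MACROS dict and the expand function are hypothetical names invented for this sketch, and it only understands the <name>, <name:macro> and <:macro> forms.

```python
import re

# Macro table copied from the list above (illustrative only,
# not Surlex's internal data structure).
MACROS = {
    'Y': r'\d{4}',
    'y': r'\d{2}',
    'M': r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)',
    'm': r'\d{1,2}',
    'd': r'\d{1,2}',
    '#': r'\d+',
    's': r'[\w-]+',
}

def expand(pattern):
    """Toy translator: handles only <name>, <name:macro> and <:macro>."""
    def replace(tag):
        name, macro = tag.group('name'), tag.group('macro')
        # No macro given: fall back to the default '.+' capture.
        body = MACROS[macro] if macro else '.+'
        # No name given: match without capturing.
        return '(?P<%s>%s)' % (name, body) if name else body
    return re.sub(r'<(?P<name>\w*)(?::(?P<macro>[^>]+))?>', replace, pattern)

print(expand('/blog/<year>/<month>/<slug:s>/'))
```

The real library parses a richer grammar (embedded regexes, optional groups, wildcards), so treat this only as a picture of what a macro lookup does.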

A more complete example of a blog entry url might be defined as:

/blog/<year:Y>/<month:M>/<day:d>/<slug:s>/

This will produce the following regex:

/blog/(?P<year>\d{4})/(?P<month>(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))/(?P<day>\d{1,2})/(?P<slug>[\w-]+)/

If matched against "/blog/2009/sep/06/surlex-released/", it will produce the following dictionary:

{
        u'year': '2009',
        u'month': 'sep',
        u'day': '06',
        u'slug': 'surlex-released',
}
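The same result can be reproduced by feeding the generated regex to the standard re module. This is a small check of the regex string only, with illustrative variable names; on Python 3 the keys print without the u prefix.

```python
import re

# Regex quoted above for the blog entry surlex.
entry = re.compile(
    r'/blog/(?P<year>\d{4})/'
    r'(?P<month>(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))/'
    r'(?P<day>\d{1,2})/(?P<slug>[\w-]+)/'
)

captured = entry.match('/blog/2009/sep/06/surlex-released/').groupdict()
print(captured)

# A two-digit year does not satisfy \d{4}, so this URL is rejected.
print(entry.match('/blog/09/sep/06/surlex-released/'))
```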

Custom Macros

Custom surlex macros can be registered with the register_macro function. Here, we will register a macro to match an email address:

from surlex import register_macro, surlex_to_regex
register_macro('e', r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b') # 'e' now expands to the email regex
surlex_to_regex('/email-to/<:e>/') # returns /email-to/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/
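One detail worth noticing: the character classes in this email regex list only uppercase letters, so under Python's re module it is case-sensitive unless a flag such as re.IGNORECASE is applied. How flags would be passed through Surlex is not covered here, so the sketch below works on the generated regex string directly (variable names are illustrative):

```python
import re

# Regex string produced for '/email-to/<:e>/' in the example above.
email_url = re.compile(r'/email-to/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/')
# Same pattern, compiled case-insensitively.
email_url_ci = re.compile(email_url.pattern, re.IGNORECASE)

print(bool(email_url.match('/email-to/USER@EXAMPLE.COM/')))     # True
print(bool(email_url.match('/email-to/user@example.com/')))     # False
print(bool(email_url_ci.match('/email-to/user@example.com/')))  # True
```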

Regex and Macro Matching without Capturing

If the name of the capture block is omitted, surlex will produce a regular expression that will match the given regex or macro without capturing, for example:

surlex_to_regex('<=[0-9]+>') # returns [0-9]+
surlex_to_regex('<:s>') # returns [\w-]+

Optional Strings

If a section of a surlex is enclosed in parentheses, it is considered optional and will allow the pattern to match even if it is not in the subject string. It is equivalent to enclosing an item in ()? in regex. For example:

surlex_to_regex('/blog/(<year>/)') # returns /blog/((?P<year>.+)/)?

The preceding surlex will match "/blog/" and "/blog/2009/".
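Checking the generated regex with the standard re module shows what happens to the capture when the optional part is absent. This sketch uses only the regex string from the example; the variable name is illustrative.

```python
import re

# Regex quoted above for '/blog/(<year>/)'.
optional_year = re.compile(r'/blog/((?P<year>.+)/)?')

print(optional_year.match('/blog/').groupdict())       # {'year': None}
print(optional_year.match('/blog/2009/').groupdict())  # {'year': '2009'}
```

With plain re, a skipped optional group leaves its named capture as None, so view code should be ready for a missing value.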

Wildcards

Standard wildcards can be used with the asterisk (*):

surlex_to_regex('/*.*') # returns "/.*\..*", matches "/robots.txt", "/anything.html", etc.
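A quick check of that generated regex with the standard re module shows that the leading slash and the dot are both required. Again, this only tests the regex string, and the variable name is illustrative.

```python
import re

# Regex quoted above for '/*.*'.
wildcard = re.compile(r'/.*\..*')

print(bool(wildcard.match('/robots.txt')))     # True
print(bool(wildcard.match('/anything.html')))  # True
print(bool(wildcard.match('/no-extension')))   # False: no dot
print(bool(wildcard.match('robots.txt')))      # False: no leading slash
```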

Matching Beginning and End of Strings

The caret (^) and dollar ($) characters are preserved in the generated regex, so they can still be used to match the start and end of strings:

surlex_to_regex(r'^only-this$') # returns '^only-this$'
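To see why the anchors matter, compare an anchored and an unanchored pattern using the standard re module. This sketch uses plain regex strings and illustrative names. re.match already starts at the beginning of the string, so it is the trailing $ that rejects extra text:

```python
import re

loose = re.compile(r'only-this')      # no anchors
strict = re.compile(r'^only-this$')   # anchored at both ends

print(bool(loose.match('only-this-and-more')))   # True: prefix match
print(bool(strict.match('only-this-and-more')))  # False: $ rejects the tail
print(bool(strict.match('only-this')))           # True
```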

Django Integration

Surlex can be used inside Django URLPatterns by importing the surl helper function and writing a standard urlpattern, replacing the regex with a surlex and the url function with surl:

from surlex.dj import surl
urlpatterns = patterns('',
    surl(r'^blog/<year:Y>/<month:M>/<day:d>/<slug:s>/$', blog.views.post_detail, name='blog_post_detail'),
)

Update (4/12/2010): I noticed a large surge of traffic from Simon Willison's weblog this morning and realized I left out links to the documentation and repository pages. Here is the documentation and you can follow the project on GitHub.