Archive for the 'Quick Tips' Category

Using Python to detect the most frequent words in a file

Antonio Cangiano March 18th, 2008

Working with Python is nice. Just like Ruby, it usually doesn’t get in the way of my thought process and it comes “with batteries included”. Let’s consider the small task of printing a list of the N most frequent words within a given file:

from string import punctuation
from operator import itemgetter

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

I won’t provide a step-by-step explanation of what I believe is already rather understandable. There are, however, a few tricky points worth making for those who are not too familiar with the language. First and foremost, I love using generator expressions because they are lazily evaluated and have a math-like readability. They are just a very convenient way of creating generator objects. Notice how in the snippet I favor them over the option of reading the whole file into a string by chaining the read() method onto the open() call. Because the file is consumed one line at a time, memory usage stays low, which can make a significant difference for large files. Generator expressions and list comprehensions are extremely useful language features inherited from the world of functional programming, and I’m glad that Python fully embraces them.

In the for loop that follows the generator expression, we count words and add them and their respective frequencies to the words dictionary (similar to a Ruby Hash). Notice how the method get() lets us specify a default value before incrementing the counter, in case the given key doesn’t exist yet (which means that the word we are adding hasn’t been encountered before). We pass the result of operator.itemgetter(1) as the key keyword argument (keyword arguments being another nice Python feature) to the sorted() function. itemgetter() returns a callable object that fetches the given item(s) from its operand which, in our case, essentially means that we can tell sorted() to sort based on the value of the dictionary’s items (the frequency of the words) rather than based on the keys (the words themselves).
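
For readers on Python 3, the counting and sorting steps described above can be sketched as follows. This is a minimal sketch rather than the original script: the function name count_words and the sample lines are assumptions made for illustration, and items() stands in for iteritems(), which Python 3 no longer provides.

```python
from operator import itemgetter
from string import punctuation


def count_words(lines):
    """Return a dict mapping each lowercased word to its frequency."""
    words = {}
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation).lower()
            # get() supplies 0 the first time a word is seen.
            words[word] = words.get(word, 0) + 1
    return words


# Hypothetical sample input standing in for the lines of test.txt.
sample = ["The cat and the dog.", "The cat sat."]
counts = count_words(sample)

# itemgetter(1) picks the frequency out of each (word, frequency) pair.
top_words = sorted(counts.items(), key=itemgetter(1), reverse=True)[:2]

for word, frequency in top_words:
    print("%s: %d" % (word, frequency))
```

Passing any iterable of lines, including an open file object, keeps the lazy, line-by-line behavior of the generator expression.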

Unfortunately, there is a problem with this code. It will correctly sort the words by frequency, but words with the same frequency won’t be alphabetically ordered. Given that we specified a reverse order for the sorted() function, we could simply pass it key=itemgetter(1, 0) to order by value first and by key second, both in descending order. But let’s be realistic: in most cases, you want keys whose values are equal to be ordered alphabetically (in ascending order). With a few changes to the code, this can be easily achieved:

from string import punctuation

def sort_items(x, y):
    """Sort by value first, and by key (reversed) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), cmp=sort_items, reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

Previously we specified which “key” to use for sorting, while in this case we have a much greater deal of control. By defining the function sort_items() and passing a reference to it as the cmp argument of sorted(), we get to define how the comparison amongst the items of the dictionary should be carried out. The function that we defined at the beginning of the script will return -1, 0 or 1, depending on how the two key-value pairs compare. The returned value is cmp(x[1], y[1]) or cmp(y[0], x[0]). This may seem complicated but the trick is rather easy. The first part compares the frequencies of the two words and returns 1 or -1 if one is greater than the other. If they are equal, the expression to the left of the or will be 0, which Python treats as false, therefore the expression on the right of the or will be returned. On the right we compare the keys (the words), but swap the order of the arguments x and y to cancel out the effect of the reverse=True passed to sorted().
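
Python 3 removed both the cmp() built-in and the cmp argument of sorted(), so the comparison function above needs a small adapter there. The following is a sketch under that assumption: the local cmp() helper and the sample data are illustrative additions, while functools.cmp_to_key() is the standard library's way of turning a comparison function into a key.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Stand-in for the Python 2 built-in: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def sort_items(x, y):
    """Sort by value first, and by key (reversed) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])


# Hypothetical word counts with a tie between "apple" and "pear".
words = {"pear": 2, "apple": 2, "fig": 5}

top_words = sorted(words.items(), key=cmp_to_key(sort_items), reverse=True)
print(top_words)
```

With the tie between the two words that occur twice, the reversed key comparison is what puts "apple" ahead of "pear" in the final list.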

Finally, for those who prefer to use a lambda expression, rather than to define a function, we can write the following:

from string import punctuation

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(),
                   cmp=lambda x, y: cmp(x[1], y[1]) or cmp(y[0], x[0]),
                   reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

Or simplified further by getting rid of reverse=True and using key rather than cmp:

from string import punctuation

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(),
                   key=lambda(word, count): (-count, word))[:N] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
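
The lambda(word, count) form relies on tuple parameter unpacking, which Python 3 no longer accepts. A rough Python 3 equivalent of this last version might look like the following; the function name most_frequent, the with block and the temporary sample file are my own assumptions, not part of the original script.

```python
import os
import tempfile
from string import punctuation


def most_frequent(path, n=10):
    """Return the n most frequent words in the file at path.

    Ties are broken alphabetically, in ascending order.
    """
    words = {}
    with open(path) as handle:
        words_gen = (word.strip(punctuation).lower()
                     for line in handle
                     for word in line.split())
        for word in words_gen:
            words[word] = words.get(word, 0) + 1

    # Negating the count sorts by descending frequency while the word
    # itself still sorts in ascending order.
    return sorted(words.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample file standing in for test.txt.
sample_path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(sample_path, "w") as handle:
    handle.write("pear apple fig\nfig apple pear fig\n")

top_words = most_frequent(sample_path, n=3)
for word, frequency in top_words:
    print("%s: %d" % (word, frequency))
```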

Please bear in mind that the code makes a few assumptions so as to keep things simple. As it stands, the script would consider “l’amore” as a single word, and an accidental lack of spaces wouldn’t be accounted for (e.g. “word.Another” would be a single word too). The replace() method can be used to address these sorts of special cases.
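
One way to apply replace() to the special cases just mentioned is to turn every punctuation character into a space before splitting. This is only a sketch, written for Python 3: the helper name split_words and the SEPARATORS constant are hypothetical, and the approach has its own trade-off, since it also splits contractions and hyphenated words.

```python
from string import punctuation

# The typographic apostrophe is not part of string.punctuation, so it is
# added here explicitly (an assumption about the input text).
SEPARATORS = punctuation + "\u2019"


def split_words(line):
    """Split a line into lowercase words, treating punctuation as spaces."""
    for char in SEPARATORS:
        line = line.replace(char, " ")
    return line.lower().split()


print(split_words("word.Another l\u2019amore"))
```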

Sure, this was a rather trivial example, born from an IPython session, but I think it shows off Python’s expressiveness and flexibility when dealing with problems that, approached in some other languages, would be much more error-prone and verbose. Batteries included indeed.

Installing Django with PostgreSQL on Ubuntu

Antonio Cangiano December 26th, 2007

This how-to is essentially the same as my previous one, only this time I’ve provided step-by-step instructions for installing Django with PostgreSQL on Ubuntu 7.10.

First and foremost, we are going to install Django from its svn repository, as opposed to obtaining the 0.96 release archive. The reason for this is that the trunk version implements a few new features. The development code is also rather stable and is used by most people in production, even for sites like the Washington Post.

Install Subversion

sudo apt-get install subversion

Checkout Django

svn co http://code.djangoproject.com/svn/django/trunk django_trunk

Tell Python where Django is

Ubuntu already ships with Python 2.5.1, thus you won’t have to install it. You can verify this by running python in your shell (use exit() to get out of the python shell). What you need to do is inform Python about the location of your django_trunk directory. To do this create the following file:

/usr/lib/python2.5/site-packages/django.pth

Within this file, place only one line containing the path to your django_trunk folder. In my case, this is:

/home/antonio/django_trunk

Of course, change it to the full path location of the directory on your filesystem.
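
To see why a one-line .pth file is enough, the mechanism can be reproduced in a throwaway location. The sketch below assumes Python 3 and uses two temporary directories as hypothetical stand-ins for site-packages and django_trunk; the helper add_pth_entry is made up for illustration, while site.addsitedir() is the standard library function that processes .pth files in a directory.

```python
import os
import site
import sys
import tempfile


def add_pth_entry(site_dir, pth_name, target_dir):
    """Write a one-line .pth file in site_dir that points at target_dir."""
    pth_path = os.path.join(site_dir, pth_name)
    with open(pth_path, "w") as handle:
        handle.write(target_dir + "\n")
    return pth_path


# Hypothetical stand-ins for site-packages and the django_trunk checkout.
demo_site_dir = tempfile.mkdtemp()
demo_checkout = os.path.abspath(tempfile.mkdtemp())

add_pth_entry(demo_site_dir, "django.pth", demo_checkout)

# addsitedir() reads the .pth files in the directory and appends each
# existing path they list to sys.path.
site.addsitedir(demo_site_dir)
print(demo_checkout in sys.path)
```

Paths listed in a .pth file are only added if they exist, so a typo in the path fails silently; checking sys.path as above is a quick way to catch that.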

Add django-admin.py to your PATH

The bin directory within the django folder (which is inside django_trunk itself) contains several management utilities. We need therefore to add the following to the PATH (again, change it to your own location):

/home/antonio/django_trunk/django/bin

How you go about doing this depends on the shell you are using, and I’m assuming you are able to export a shell variable on your own. If you are using the bash shell (as I do), you could export it in .bashrc. Alternatively, you could just create a symlink to the utility django-admin.py in /usr/bin, but I recommend the former approach.

Install PostgreSQL and Psycopg2

sudo apt-get install postgresql pgadmin3 python-psycopg2

This will install PostgreSQL 8.2.5, PgAdmin III and the driver Psycopg2 for you. Most people at this point will ask, what’s the default password for PostgreSQL on Ubuntu? You can use the following instructions to set the password for the user postgres both in Ubuntu and within PostgreSQL:

sudo su -
passwd postgres
su postgres
psql template1

The last instruction should open the psql shell, where you can run the following:

ALTER USER postgres WITH ENCRYPTED PASSWORD 'mypassword';

Verify the installation

You should be all set now, but let’s verify this right away. Open the shell and run the following instructions inside the python shell (start off with the python command).

>>> import django
>>> print django.VERSION
(0, 97, 'pre')
>>> import psycopg2
>>> psycopg2.apilevel
'2.0'

Exit the Python shell by running exit(), then verify that django-admin.py is in your path:

django-admin.py
Type 'django-admin.py help' for usage.

If you obtain a similar output for all three of them, you are really set.
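
The same three checks can also be scripted. The following is a sketch that assumes a current Python 3 interpreter rather than the Python 2.5 used in this how-to; the helper names missing_modules and on_path are hypothetical, and only importlib.util.find_spec() and shutil.which() come from the standard library.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


def on_path(command):
    """Return True if command can be found in a PATH directory."""
    return shutil.which(command) is not None


print("Missing modules:", missing_modules(["django", "psycopg2"]))
print("django-admin.py on PATH:", on_path("django-admin.py"))
```

An empty list and True would correspond to the successful output shown above.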

Where to go from here

Now that Django is installed, you can go read the Django Book 1.0 that’s available for free online. Something equally well done and useful is really missing from the Rails community. Above all, experiment: Django (and programming in general) is learnt by doing. The Definitive Guide to Django: Web Development Done Right is also available for purchase in its dead-tree version, which just came out. It’s cheap and it’s already a best seller on Amazon. Despite the availability of a free version online, I like having paper versions of tech books so that I can read without staring at the monitor. Furthermore, I feel like rewarding the authors (who are also the framework creators), while encouraging publishing companies that are willing to allow authors to make their books available for free on the web. Well done guys!

How to install Django with MySQL on Mac OS X

Antonio Cangiano December 22nd, 2007

Installing Django on Mac OS X Leopard is supposed to be very straightforward, but if you are new to it, you may encounter a few puzzling questions and, in the case of MySQL, even a couple of headaches. I’m writing about this for the benefit of those of you who may attempt and struggle with this feat. MacPorts is not required for this how-to.

First and foremost, we are going to install Django from its svn repository, as opposed to obtaining the 0.96 release archive. The reason for this is that the trunk version implements a few new features. The development code is also rather stable and is used by most people in production, even for sites like the Washington Post.

Checkout Django

svn co http://code.djangoproject.com/svn/django/trunk django_trunk

Tell Python where Django is

Mac OS X 10.5 already ships with Python 2.5.1, thus you won’t have to install it. You can verify this by running python in the Terminal (use exit() to get out of the python shell). What you need to do is inform Python about the location of your django_trunk directory. To do this create the following file:

/Library/Python/2.5/site-packages/django.pth

Within this file, place only one line containing the path to your django_trunk folder. In my case, this is:

/Users/Antonio/Code/django_trunk

Of course, change it to the full path location of the directory on your filesystem.

Add django-admin.py to your PATH

The bin directory within the django folder (which is inside django_trunk itself) contains several management utilities. We need therefore to add the following to the PATH (again, change it to your own location):

/Users/Antonio/Code/django_trunk/django/bin

How you go about doing this depends on the shell you are using, and I’m assuming you are able to export a shell variable on your own. If you are using the bash shell (as I do), you should have a .profile file in your home directory where you can export it. Alternatively, you could just create a symlink to the utility django-admin.py in /usr/bin, but I recommend the former approach.

Grab and install MySQL

I would normally recommend PostgreSQL, at least until we have DB2 on Mac, but I realize that many of you use and prefer MySQL, which also seems to be the only one that requires special instructions due to a few installation issues when trying to get MySQL and Python to work together. You can install MySQL by grabbing and running one of the packages that are available on the official site. Choose the one for x86 and Mac OS X 10.4.

Install the MySQLdb driver

Get MySQL-python-1.2.2.tar.gz from SourceForge. Please follow these exact instructions because the source code won’t compile out of the box and will give you the following error when trying to build it:

/usr/include/sys/types.h:92: error: duplicate 'unsigned'
/usr/include/sys/types.h:92: error: two or more data types in declaration specifiers
error: Setup script exited with error: command 'gcc' failed

Run the following:

tar xvfz MySQL-python-1.2.2.tar.gz
cd MySQL-python-1.2.2

At this point, edit the _mysql.c file and comment out lines 37, 38 and 39 as follows:

//#ifndef uint
//#define uint unsigned int
//#endif
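
If you would rather script this edit than do it by hand, a small helper along these lines could be used. It is only a sketch for Python 3: the function comment_out_lines is hypothetical, and the three-line sample stands in for the real file. The line numbers 37 to 39 are specific to MySQL-python 1.2.2, so check them before applying anything like this to _mysql.c.

```python
def comment_out_lines(text, first, last, prefix="//"):
    """Prefix lines first..last (1-based, inclusive) of text with prefix."""
    lines = text.splitlines(True)  # keep the line endings
    for index in range(first - 1, min(last, len(lines))):
        lines[index] = prefix + lines[index]
    return "".join(lines)


# Hypothetical three-line stand-in for lines 37 to 39 of _mysql.c.
sample = "#ifndef uint\n#define uint unsigned int\n#endif\n"
print(comment_out_lines(sample, 1, 3), end="")
```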

Now, from the MySQL-python-1.2.2 folder run:

python setup.py build
sudo python setup.py install

If you still get an error (and only in that case) you’ll need to edit the site.cfg file within the same folder and set threadsafe = False, before running the two commands above once again.

If, instead, you don’t receive an error but you see warnings about files not required on this architecture, don’t be concerned about them. The last step required is to create a symbolic link with the following command:

sudo ln -s /usr/local/mysql/lib/ /usr/local/mysql/lib/mysql

All these adjustments are required because we are building and installing the driver on Mac and not on Linux.

Verify the installation

You should be all set now, but let’s verify this right away. Open Terminal and run the following commands in the python shell (start this with the python command).

Verify that MySQLdb is correctly installed:

>>> import MySQLdb
>>> MySQLdb.apilevel
'2.0'

Now, verify that Django is working:

>>> import django
>>> print django.VERSION
(0, 97, 'pre')

Exit the Python shell by running exit(), then verify that django-admin.py is in your path:

django-admin.py
Type 'django-admin.py help' for usage.

If you obtain a similar output for all three of them, you are really set to write the next YouTube.

Where to go from here

Now that Django is installed, you can go read the Django Book 1.0 that’s available for free online. Something equally well done and useful is really missing from the Rails community. Above all, experiment: Django (and programming in general) is learnt by doing. The Definitive Guide to Django: Web Development Done Right is also available for purchase in its dead-tree version, which just came out. It’s cheap and it’s already a best seller on Amazon. Despite the availability of a free version online, I like having paper versions of tech books so that I can read without staring at the monitor. Furthermore, I feel like rewarding the authors (who are also the framework creators), while encouraging publishing companies that are willing to allow authors to make their books available for free on the web. Well done guys!
