Principle of Biggest Surprise

Preface

First things first, let me introduce myself. I'm just your usual coder, writing programs no one else uses, or probably even needs. Over the past years of using Python to develop my programs, I have made a few observations about things that just felt counter-intuitive. This is a collection of the ones I still remember.

This document was written in June 2009, so if you are reading it later don't take it for granted that all things mentioned here are still the same.

Now, let me clarify some things. I neither hate Python, nor do I hate CPython. In fact, I use both of them every day. Still, there are issues which I think are not excusable. There are surely more of them than the ones I mention here, but it's a start. I thank you in advance for taking the time to read my thoughts and hope you enjoy reading this text.

Special Method Lookup

We all know Python is a dynamic language where attribute lookup is done on the instance. At least, that is what many think. In fact, it isn't quite true: not all lookup is done on the instance.

So you might have become curious which lookup is not done on the instance. For instances of new-style classes, all special method lookup that happens implicitly is done on the type (in CPython, via a slot in the type struct). Thus changing an instance's __str__ attribute does not affect the result of str(). Still, explicitly getting the attribute from the instance gives you the function stored on the instance. Of course, compared to a normal lookup, which goes through a hash table, simply accessing a field of a struct might provide a nice speedup. But still, it's not worth the confusion it creates.

Note the qualification above: this only applies to new-style classes. For old-style classes all lookup is done on the instance. A subtle but important difference. You should also keep this in mind when porting your code to Python 3.0, where old-style classes will be gone.

To quote the Zen of Python: Special cases aren't special enough to break the rules. Personally, I do not even consider this to be a special case. Either all lookup should be done on the class, or all lookup should be done on the instance. Everything else is confusing. I would be alright with either.

Source Code
class Foo(object):
    def __str__(self):
        return "old str"

foo = Foo()
foo.__str__ = lambda: "new str"

print str(foo)
print foo.__str__()
Output
old str
new str

new-style class lookup

Source Code
class Foo:
    def __str__(self):
        return "old str"

foo = Foo()
foo.__str__ = lambda: "new str"

print str(foo)
print foo.__str__()
Output
new str
new str

old-style class lookup
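
If you really need per-instance behaviour for a special method on a new-style class, one possible workaround is to dispatch to an instance attribute explicitly from within the class-level method. This is just a sketch; the hook name _str_hook is my own invention, not any kind of convention.

class Foo(object):
    def __str__(self):
        # str() only ever finds the class-level __str__, so that method
        # has to forward to an optional per-instance hook by hand.
        hook = self.__dict__.get("_str_hook")
        if hook is not None:
            return hook()
        return "old str"

foo = Foo()
foo._str_hook = lambda: "new str"
print str(foo)     # prints "new str" despite the new-style class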

copy.copy does not work as expected

Everything in Python is an object, even classes. All objects are basically the same, right? I'm sorry to destroy your idealistic view of the world, but no. To quote the copy module's documentation: This module does not copy types like module, method, stack trace, stack frame, file, socket, window, array, or any similar types. The documentation also notes that it "copies" functions and classes (shallowly and deeply) simply by returning the original object unchanged. At least it's documented. I can understand some of those, but most don't make sense, and I would much prefer an exception over doing nothing at all.

Source Code
import copy

class A(object):
    pass

print A is copy.copy(A)

Output
True

copy.copy returns its argument
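
For contrast, here is a small sketch showing that ordinary instances are copied just fine; it is only the listed types, plus functions and classes, that are silently returned unchanged.

import copy

class A(object):
    pass

a = A()
print a is copy.copy(a)    # False -- the instance really is copied
print A is copy.copy(A)    # True  -- the class is returned unchanged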

Circular References and __del__

Python is a garbage-collected language: you don't have to care about memory management. Well, not quite.

The one thing you have to watch out for is not to create circular references between objects that implement __del__. This makes the cycle uncollectable and your objects will remain in memory. No warning, nothing. In bigger applications it would be wise to look at gc.garbage (unreachable objects that the collector was nevertheless unable to free) to ensure you aren't affected by this problem.

Python does this because it does not know in which order to call the __del__ methods: it cannot tell whether the destructor of one object needs the other one alive, or the other way round, or even both (which would be impossible to satisfy anyway).

The other problem with __del__ is that upon interpreter shutdown all objects get collected and module globals are torn down, so naive __del__ methods that access names from the enclosing module are likely to raise exceptions; the interpreter merely reports and ignores these. A workaround is to assign the modules your __del__ method needs to the object in __init__ and access them as attributes, as in the sketch below. In general, Python developers agree that the use of __del__ is discouraged.
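
Here is a minimal sketch of that workaround. TempFile is a made-up example class, not anything from the standard library.

import os

class TempFile(object):
    def __init__(self, path):
        self.path = path
        # Keep the module as an instance attribute so it is still
        # reachable when __del__ runs during interpreter shutdown,
        # even if the module-level name "os" has already been cleared.
        self._os = os

    def __del__(self):
        self._os.remove(self.path)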

Source Code
import gc

class A(object):
    def __init__(self):
        self.b = B(self)

    def __del__(self):
        pass


class B(object):
    def __init__(self, a):
        self.a = a

    def __del__(self):
        pass


a = A()

del a
gc.collect()
print gc.garbage
Output
[<__main__.A object at 0x4035a70c>, <__main__.B object at 0x4035a76c>]

unbreakable reference cycle
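
One way out, sketched below (it is not the only possible solution), is to make one side of the cycle a weak reference, so that there is no cycle left for the collector to give up on.

import gc
import weakref

class A(object):
    def __init__(self):
        self.b = B(self)

    def __del__(self):
        pass


class B(object):
    def __init__(self, a):
        # Hold only a weak reference back to the owner; the cycle
        # disappears and both __del__ methods can run normally.
        self.a = weakref.ref(a)

    def __del__(self):
        pass


a = A()

del a
gc.collect()
print gc.garbage    # prints [] -- nothing is left uncollected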

Standard Library

The included batteries tend not to be fully charged. But let me put that into perspective: Python's standard library isn't particularly bad. It's old. Old, dusty and cobwebbed. And it's huge. Apparently too huge. How can a standard library be too large? Since the amount of developer time available is limited, the larger it grows the harder it becomes to maintain, which inevitably leads to a decline in quality.

A good example of this is the mmap module, especially the mmap.flush method: on Unix it returns 0 to indicate success and raises an exception otherwise, while on Windows it returns zero on failure and a nonzero value on success.
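
A small sketch of how one might paper over this inconsistency; checked_flush is a hypothetical helper of my own, not part of the mmap module.

import sys

def checked_flush(mapped):
    # "mapped" is an mmap.mmap object.  On Unix a failed flush already
    # raises an exception; on Windows failure is signalled by a zero
    # return value, so translate that into an exception here.
    result = mapped.flush()
    if sys.platform == "win32" and result == 0:
        raise IOError("flushing the memory map failed")
    return result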

The standard library also contains modules that are clearly inferior to third-party libraries providing the same functionality, asyncore and asynchat being prime examples. Any developer would be better off using a proper networking library rather than asyncore. Such modules are merely ballast.

So clean up the standard library, reduce it to a maintainable size. Everything that is missing then will be taken care of by third-party projects.

Threading

Here we go, the old topic. You have probably heard of CPython's Global Interpreter Lock (GIL), which serialises the execution of multiple threads by locking the whole interpreter so that only one piece of bytecode is interpreted at a time. CPython is the primary implementation of Python; others do exist, but they are not as widespread.

Not only does this make threads unsuitable for scaling your application across multiple processors; performance can even degrade the more threads you use. There are also cases in which a single thread can block the whole interpreter for quite a long time (for example computationally intensive calls into C extension modules). Another big problem is signals: they must be handled in the main thread, and under these conditions they sometimes do not get handled at all. I won't go into detail here; there is a nice presentation that explains the GIL in depth.
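
A quick sketch of the scaling problem (the exact timings will of course vary from machine to machine): splitting a purely CPU-bound task across two threads is typically no faster than running it sequentially, and often slower.

import threading
import time

def count(n):
    # Purely CPU-bound work that never releases the GIL.
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N)
count(N)
print "sequential:  %.2fs" % (time.time() - start)

start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "two threads: %.2fs" % (time.time() - start)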

But threading in Python also has more subtle issues. On POSIX platforms (on others I cannot tell for sure), a blocking lock acquisition also swallows all signals. While this may not sound too bad to you, it generally means that your application cannot be killed using Ctrl-C, because the acquire swallows the keyboard interrupt.
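
The following sketch reproduces the problem on POSIX systems: once the second acquire blocks, pressing Ctrl-C appears to have no effect, because the KeyboardInterrupt is only raised when the blocking call returns, which it never does.

import threading

lock = threading.Lock()
lock.acquire()

# This second acquire blocks forever; on POSIX platforms the blocked
# call is not interrupted by signals, so Ctrl-C seems to be ignored.
lock.acquire()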

To prevail in the age of parallelism, it would be wise to offer a good thread implementation. While there are alternatives like multiprocessing, they aren't easily applicable to all problems that could be solved using threads.
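
For problems that do split into independent tasks, multiprocessing can sidestep the GIL, since every worker process runs its own interpreter. A minimal sketch:

from multiprocessing import Pool

def count(n):
    # CPU-bound work, just as in the threading example above.
    while n > 0:
        n -= 1
    return n

if __name__ == "__main__":
    pool = Pool(processes=2)
    # Each task runs in its own process and therefore really does
    # execute in parallel on a multi-core machine.
    print pool.map(count, [10000000, 10000000])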

Epilogue

Thank you for reading this text. If you have any further questions or recommendations, do not hesitate to contact me. If you are interested in reading more of my texts, please visit my blog, where new, longer articles like this one will be published in the future.


© 2009 Florian Mayer.

Thanks to Lukas Prokop for helping me format this text. Thanks to all of those who have helped me proof-read this text.