Wednesday, March 23, 2016

To Python 3 and Back Again: Is It Worth the Switch?

Python 3 has been in existence for 7 years now, yet some still prefer to use Python 2 instead of the newer version. This is a problem especially for neophytes who are approaching Python for the first time. I realized this at my previous workplace, where colleagues were in exactly that situation. Not only were they unaware of the differences between the two versions, they were not even aware of which version they had installed.
Inevitably, different colleagues had installed different versions of the interpreter. That would have been a recipe for disaster had they then tried to blindly share scripts among themselves.
This wasn’t really their fault; on the contrary, a greater effort in documenting and raising awareness is needed to dispel the veil of FUD (fear, uncertainty and doubt) that sometimes affects our choices. This post is thus meant for them, and for those who already use Python 2 but aren’t sure about moving to the next version, maybe because they tried version 3 only at the beginning, when it was less refined and library support was worse.
Two Dialects, One Language
A Concrete Example
# encoding: utf-8
from os import listdir, stat
# to keep this example simple, we won't use the `pwd` module
names = {1000: 'dario',
         1001: u'олга'}
for node in listdir(b'.'):
    owner = names[stat(node).st_uid]
    print(owner + ': ' + node)
su олга -c "touch é"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
`TypeError: Can't convert 'bytes' object to str implicitly`
print(owner + ': ' + node)
Tools for Automated Conversion
Python 3 Is Not Just about Unicode
Optional Keyword Arguments
merge_dicts({'a':1, 'c':3}, {'a':4, 'b':2}, {'b': -1})
# {'b': -1, 'a': 4, 'c': 3}
from operator import add
merge_dicts({'a':1, 'c':3}, {'a':4, 'b':2}, {'b': -1}, withf=add)
# {'b': 1, 'a': 5, 'c': 3}
def second(a, b):
    return b
def merge_dicts(*dicts, withf=second):
    newdict = {}
    for d in dicts:
        shared_keys = newdict.keys() & d.keys()
        newdict.update({k: d[k] for k in d.keys() - newdict.keys()})
        newdict.update({k: withf(newdict[k], d[k]) for k in shared_keys})
    return newdict
Unpacking Operator
a, b, *rest = [1, 2, 3, 4, 5]
rest
# [3, 4, 5]
Simpler APIs for Iterables
zip(itertools.count(1), 'abc')
Function Annotations
@get('/balance')
def balance(user_id: int):
    pass
from decimal import Decimal
@post('/pay')
def pay(user_id: int, amount: Decimal):
    pass
Wrapping Up
  • lzma, for an improved compression compared to gzip
  • asyncio, for asynchronous programming in Python
  • pathlib, a more pythonic and expressive way to manipulate paths
  • lru_cache, to automatically cache the results of expensive functions
  • mock (the same module mentioned above, previously available only from PyPI)
  • autocompletion inside pdb
  • the __pycache__ directory, that helps to avoid littering every other project folder with .pyc files
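Two of the additions listed above, lru_cache and pathlib, are easy to try out. The following is only a minimal sketch: the function slow_square and the path components are made up for illustration.

```python
from functools import lru_cache
from pathlib import Path

calls = []  # records how many times the function body really runs


@lru_cache(maxsize=None)
def slow_square(n):
    # Hypothetical stand-in for an expensive computation.
    calls.append(n)
    return n * n


slow_square(12)
slow_square(12)  # served from the cache, the body is not executed again

# pathlib composes paths with the / operator instead of os.path.join.
config = Path('project') / 'settings' / 'app.toml'
print(len(calls), config.name, config.suffix, config.parent.name)
```

The second call to slow_square never reaches the function body, and the path object exposes its parts as attributes instead of requiring string manipulation.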

First of all, is it true that Python 2 and Python 3 are different languages? This is not a trivial question. Some people would settle it with: “No, it’s not a new language”, and as a matter of fact several proposals that would have broken compatibility without yielding important advantages were rejected.
Python 3 is a new version of Python, but it’s not necessarily backwards compatible with code written for Python 2. At the same time, it’s possible to write code that is compatible with both versions, and this is not by chance but a clear commitment of the developers who drafted the various PEPs (Python Enhancement Proposals). In the few cases in which the syntax is incompatible, the fact that Python lets us modify code dynamically at runtime means we can solve the problem without relying on a preprocessor whose syntax is completely alien to the rest of the language.
The syntax is thus not a problem (especially if we ignore versions of Python 3 before 3.3). The other big difference lies in the behavior of code, its semantics, and the presence or absence of big libraries for only one of the two versions. This is indeed a significant problem, but it’s not unique or new for those who already have experience with other programming languages. You have probably already come across an old codebase or library that fails to build with recent versions of the same compiler that was originally used. In those cases it’s the compiler itself that helps you; in Python, help will instead come from your own test suite.
Why make the new version different then? What advantages will these changes bring to us?
Let’s assume we want to write a program to read the owner of files/directories (on a Unix system) in our current directory and print them on screen.
Does everything work correctly? Apparently it does. We specified the encoding of the file containing the source code. If our directory contains a file created by олга (uid 1001), its owner’s name will be printed correctly, and files with non-ASCII names will be printed correctly too.
There’s still a case that we haven’t covered yet though: a file created by олга AND with non-ASCII characters in the name…
Let’s launch our small script again, and we’ll obtain a UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128).
If you think about it, a situation like this could be nasty: you have written your program (thousands of lines long instead of the handful in this example), you start to gather some users, some of them from non-English speaking countries and with more exotic names. Everything is okay until one of these users creates a file that users with more prosaic names could create without any problem. Now your code throws an error, the server might answer every request from this user with an error 500, and you’ll need to dig through the codebase to understand why these errors are suddenly appearing.
How does Python 3 help us with this? If you try to execute the same script, you’ll discover that Python is able to detect right away when you’re about to execute a dangerous operation. Even without files with peculiar names and/or created by peculiar users, you’ll immediately receive an exception like TypeError: Can't convert 'bytes' object to str implicitly, raised by the line print(owner + ': ' + node).
The error message is also easier to understand, in my opinion. The str object is owner, and node is a bytes object. Knowing this, it’s obvious that the problem is that listdir is returning a list of bytes objects.
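To make the failure concrete, here is a minimal sketch for Python 3. The file name notes.txt is made up, and decoding as UTF-8 is an assumption about how the name was encoded. The exact wording of the TypeError message varies between Python 3 releases, so the sketch only checks the exception type.

```python
owner = 'dario'       # text (str)
node = b'notes.txt'   # bytes, like an entry from listdir(b'.'); hypothetical name

error = None
try:
    owner + ': ' + node  # mixing str and bytes is refused up front
except TypeError as exc:
    error = exc          # keep a reference, `exc` is cleared after the block

# Decoding explicitly states which encoding we expect (UTF-8 is an assumption).
line = owner + ': ' + node.decode('utf-8')
print(type(error).__name__, '|', line)
```

The point is that the mistake surfaces on the very first run, not only when a particular combination of user and file name shows up.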
A detail that not everybody knows is that listdir returns a list of bytes objects or of unicode strings depending on the type of the object that was used as input. I avoided using listdir('.') exactly to obtain the same behavior on Python 2 and Python 3; otherwise, on Python 3 this would’ve been a unicode string, which would’ve made the bug disappear.
If we try to change a single character, from listdir(b'.') to listdir(u'.') we’ll be able to see how the code now works on both Python 3 and Python 2. For completeness, we should also change 'dario' to u'dario'.
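The input-dependent return type is easy to verify on Python 3. The sketch below works in a temporary directory so that the listing is predictable; the file name example.txt is an arbitrary choice.

```python
import os
import tempfile

# Create a scratch directory holding a single file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'example.txt'), 'w').close()
    as_text = os.listdir(tmp)                # str path in, list of str out
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes path in, list of bytes out

print(as_text)
print(as_bytes)
```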
This difference in the behavior between Python 2 and Python 3 is however supported by a radical difference in how the two versions handle string types, a difference that is mainly perceived when porting from one version to the other.
In my opinion, this situation is emblematic of the maxim: “splitters can be lumped more easily than lumpers can be split”. What was lumped together in Python 2 (unicode strings and default byte strings, which could be freely coerced into one another) has been split in Python 3.
For this reason tools like 2to3, even if well written and extremely useful to automate the conversion of every other difference, have some limitations. With the bytes/unicode split the difference in behavior surfaces at runtime, and a tool that can only do parsing and static analysis won’t be able to save you if you have a huge Python 2 codebase that mixes these two types. You’ll have to roll up your sleeves and properly design your API, deciding whether functions that until now indiscriminately accepted any type of string should now work with only some of them (and which ones). Conversely, though they get a lot less use, tools that convert from Python 3 to Python 2 have a much easier life. Let’s see an example:
Some time ago, I wrote a toy HTTP server (only dependency: python-magic), and this is the version for Python 2 (automatically converted from the Python 3 one without any need for manual changes): https://gist.github.com/berdario/8abfd9020894e72b310a
Now, if you want you can have a look directly at the code converted to Python 3 with 2to3, or you can convert it directly on your system. When trying to execute it you’ll realize how every error that you can try to fix by hand is related to the bytes/unicode split.
You can manually apply changes like these: https://gist.github.com/berdario/34370a8bc39895cae139/revisions
And thus, you get your program working again on Python 3. These are not complex changes, but they nonetheless require you to reason about which data types your functions work on, and about the control flow. It’s 13 lines of changes out of 120, a ratio that doesn’t scale comfortably: with thousands of lines of code to port, you could easily end up with hundreds to modify.
If you’re curious, you could then try to convert the code that you just brought to Python 3 back to Python 2. Using 3to2 you’d obtain this: https://gist.github.com/berdario/cbccaf7f36d61840e0ed, in which the only change that had to be applied manually is the .encode('utf-8') at line 55.
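The kind of decision behind manual changes such as that .encode('utf-8') can be sketched as follows. This is not code from the gists: the helper names to_text and to_wire are hypothetical, and UTF-8 is assumed as the encoding. The idea is to decode bytes once at the boundary, work with text inside the program, and encode again only when data leaves it.

```python
def to_text(value, encoding='utf-8'):
    """Return text, decoding bytes explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def describe(owner, node):
    # Inside the program only text is handled, whatever came in.
    return to_text(owner) + ': ' + to_text(node)


def to_wire(text, encoding='utf-8'):
    """Encode text right before it is written to a socket or a file."""
    return text.encode(encoding)


message = describe('олга', 'é'.encode('utf-8'))
print(message, to_wire('é'))
```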
Starting from Python 3 (if you ever need to convert your code back to Python 2) is much easier. But if you need to have your code working on another version, a complete conversion like this is not the best choice. It’s much better to maintain compatibility with both versions of Python. To do that you can rely on tools like futurize.
Even if you don’t have the chance to use Python 3 in production (maybe one of the libraries that you’re using is bulky and compatible with Python 2 only), I’d suggest keeping your code compatible with Python 3. You could even stub/mock out the incompatible libraries, just so that you can continuously run your tests on both versions. This will make things easier when you’re finally ready to migrate to Python 3, not to mention how it can help you design your API better, or identify errors like the one in the example at the beginning of this post.
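One possible way to stub out such a library in a test run is sketched below, using unittest.mock from the Python 3 standard library (on Python 2 the same package is installed from PyPI as mock). The module name legacy_py2_only_lib and its render function are invented for the example.

```python
import sys
from unittest import mock

# A stand-in for a library that cannot be imported on Python 3 (hypothetical).
fake = mock.MagicMock()
fake.render.return_value = 'stubbed'

with mock.patch.dict(sys.modules, {'legacy_py2_only_lib': fake}):
    import legacy_py2_only_lib  # resolved from sys.modules, nothing is installed
    result = legacy_py2_only_lib.render('report')

# On leaving the block, patch.dict restores sys.modules to its previous state.
print(result, 'legacy_py2_only_lib' in sys.modules)
```

Tests that only need the library to be importable can then run on both interpreters, while tests of the real integration stay on Python 2.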
All this talking about porting and byte/unicode difference, even if you were initially skeptical about using/starting with Python 3, probably led you to think of it as the lesser evil rather than tackling the porting in the future. But if porting is the stick, where’s the carrot? Is it the new features added to the language and to its standard library?
Well, 5 years after the release of the last minor version of Python 2, plenty of interesting tidbits have piled up. For example, I found myself relying quite often on things like the new keyword-only arguments.
When I wanted to write a function to merge an arbitrary number of dictionaries together (similar to what dict.update does, but without modifying the inputs), I found it natural to add a function argument to let the caller customize the logic. This way the function can be invoked as merge_dicts({'a':1, 'c':3}, {'a':4, 'b':2}, {'b': -1}) to simply merge multiple dictionaries, retaining the values from the rightmost dicts and yielding {'b': -1, 'a': 4, 'c': 3}.
Likewise, to merge by adding the values, pass withf=add (from the operator module) as the last argument, which yields {'b': 1, 'a': 5, 'c': 3}.
Implementing such an API in Python 2 would have required defining a **kwargs input and looking for the withf argument. If the caller mistyped the argument as (e.g.) withfun, though, the error would be silently ignored. In Python 3, instead, it’s perfectly fine to add an optional argument after the variable arguments (and it will be usable only with its keyword), as in def merge_dicts(*dicts, withf=second).
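The benefit for typos can be checked directly. The sketch below repeats the merge_dicts definition from this post so that it is self-contained, then calls it with the misspelled keyword withfun.

```python
from operator import add


def second(a, b):
    return b


def merge_dicts(*dicts, withf=second):
    newdict = {}
    for d in dicts:
        shared_keys = newdict.keys() & d.keys()
        newdict.update({k: d[k] for k in d.keys() - newdict.keys()})
        newdict.update({k: withf(newdict[k], d[k]) for k in shared_keys})
    return newdict


summed = merge_dicts({'a': 1, 'c': 3}, {'a': 4, 'b': 2}, {'b': -1}, withf=add)

typo_error = None
try:
    # Misspelled keyword: rejected by the interpreter, not silently dropped.
    merge_dicts({'a': 1}, {'a': 2}, withfun=add)
except TypeError as exc:
    typo_error = str(exc)

print(summed)
print(typo_error)
```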
Since Python 3.5, the naive merging can actually be done with the new unpacking operator. But even before 3.5 Python got an improved form of unpacking: after a, b, *rest = [1, 2, 3, 4, 5], rest is [3, 4, 5].
This has been available to us since 3.0. Akin to destructuring, this kind of unpacking is a limited/ad-hoc form of the pattern matching commonly used in functional languages (where it is also used for flow control), and it’s a common feature in dynamic languages like Ruby and JavaScript (where support for ECMAScript 2015 is available).
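Both forms of unpacking mentioned here fit in a few lines. In this sketch the dictionary contents and the names in the loop are arbitrary; the dict-display form requires Python 3.5 or later, the starred assignment targets work from 3.0.

```python
defaults = {'a': 1, 'c': 3}
overrides = {'a': 4, 'b': 2}

# Python 3.5+: unpack several dicts into a new one, rightmost value wins.
merged = {**defaults, **overrides}

# Python 3.0+: a starred name collects whatever the other targets leave over.
first, *middle, last = 'abcde'
head, *tail = [1]  # nothing left over, so tail is an empty list

# Starred targets also work in loops and comprehensions.
totals = {name: sum(scores) for name, *scores in [('ann', 90, 85), ('bob', 70)]}

print(merged, middle, tail, totals)
```

Neither defaults nor overrides is modified, which matches the goal of merging without touching the inputs.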
In Python 2, a lot of APIs that dealt with iterables were duplicated, and the default ones had strict semantics. Now, instead, everything generates values as needed: zip(), dict.items(), map(), range(). Do you want to write your own version of enumerate? In Python 3 it’s as simple as composing functions from the standard library: zip(itertools.count(1), 'abc') is equivalent to enumerate('abc', 1).
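Wrapped in a function, the same composition gives a drop-in replacement for enumerate. The name my_enumerate is made up for this sketch. It works only because zip is lazy in Python 3: itertools.count never ends, so a zip that built a full list first would not terminate.

```python
import itertools


def my_enumerate(iterable, start=0):
    # zip stops as soon as the shorter input (the iterable) is exhausted.
    return zip(itertools.count(start), iterable)


pairs = list(my_enumerate('abc', 1))

# map is lazy too, so it can be applied to an infinite source and sliced later.
first_squares = list(itertools.islice(map(lambda n: n * n, itertools.count(1)), 3))

print(pairs, first_squares)
```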
Wouldn’t you like to define HTTP APIs as simply as decorating def balance(user_id: int) with @get('/balance'), or def pay(user_id: int, amount: Decimal) with @post('/pay')?
No more '<int:user_id>' ad-hoc syntax, and the ability to use any type/constructor (like Decimal) inside your routes without having to define your own converter.
Something like this has already been implemented, and what you see is valid Python syntax, exploiting the new annotations to make it more convenient to write APIs that are also self-documenting.
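How a framework could exploit annotations this way can be sketched with the standard inspect module. This is not the implementation alluded to above: the decorator name typed_params is hypothetical, and the sketch assumes that every argument arrives as a keyword holding a raw string, as it would after parsing a query string.

```python
import inspect
from decimal import Decimal
from functools import wraps


def typed_params(func):
    """Convert raw keyword arguments using the function's annotations."""
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(**raw):
        converted = {}
        for name, value in raw.items():
            annotation = signature.parameters[name].annotation
            # Parameters without an annotation are passed through untouched.
            if annotation is not inspect.Parameter.empty:
                value = annotation(value)
            converted[name] = value
        return func(**converted)

    return wrapper


@typed_params
def pay(user_id: int, amount: Decimal):
    return user_id, amount


result = pay(user_id='42', amount='9.99')
print(result)
```

Any callable works as an annotation here, which is why Decimal needs no dedicated converter.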
These are just a couple of simple examples, but the improvements are far reaching and ultimately help you in writing more robust code. An example is exception chain tracebacks enabled by default, showcased in the aptly named post “The most underrated feature in Python 3” by Ionel Cristian Mărieș, which is also covered in this other post by Aaron Maxwell, together with the stricter comparison semantics of Python 3, and the new super behavior.
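The three features mentioned here can each be shown in a few lines. The class and function names below are invented for the sketch. It uses the explicit raise ... from form of chaining; an exception raised while another one is being handled is chained implicitly as well.

```python
class ConfigError(Exception):
    pass


def load_port(settings):
    try:
        return int(settings['port'])
    except KeyError as exc:
        # The original KeyError stays attached and shows up in the traceback.
        raise ConfigError('port is not configured') from exc


cause = None
try:
    load_port({})
except ConfigError as exc:
    cause = exc.__cause__

ordering_error = None
try:
    1 < 'one'  # ordering unrelated types is an error in Python 3
except TypeError as exc:
    ordering_error = exc


class Base:
    def greet(self):
        return 'hello'


class Child(Base):
    def greet(self):
        # No need to repeat the class name and self as in Python 2.
        return super().greet() + ' from Child'


print(repr(cause), type(ordering_error).__name__, Child().greet())
```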
This is not all. There are plenty of other improvements; the ones I feel have the most impact day-to-day are lzma, asyncio, pathlib, lru_cache, mock, autocompletion inside pdb, and the __pycache__ directory.
A more thorough panorama can be obtained from the “What’s New” pages of the documentation; for another overview of the changes I also suggest this other post by Aaron Maxwell and these slides from Brett Cannon.
Python 2.7 will be supported until 2020, but don’t wait until 2020 to move to a new (and better) version!
This post originally appeared on the Toptal Engineering blog
