A few people (both here on the blog and through other discussion) raised legitimate points:
- My Python code was recompiling the regex every loop iteration because I was confused by how regex compilation and regex match objects work. Fixing this problem alone increased speed by 10%-25%.
- The timings I posted were sub-second and someone suggested that startup overhead may have been hurting Python. To address this, I used a more "real-life" input file that was 3,750 MB rather than the 8.588 MB input file I used earlier.
- The style of Perl I was using was archaic, and the style of Python I was using wasn't terribly Pythonic. I live in a programming bubble; I learned both of these languages from their respective O'Reilly books and that's it. I don't know anyone who knows either Perl or Python in real life, and I have never seen anyone else's code in either language. But as it turns out, poorly written Perl and poorly written Python follow the same trends as well-written Perl and Python (see below).
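To illustrate the first point, here is a minimal sketch of the difference between calling `re.compile` inside the loop and hoisting it out. The pattern, the function names, and the sample lines are all made up for illustration, since the actual regex isn't reproduced in this post, and the sketch is written for a current Python 3 rather than the 2.6.5 used in the test.

```python
import re

# Hypothetical pattern (not the one from the original script): "key: number".
KEY_NUMBER = re.compile(r"(\w+):\s*(\d+)")


def extract_compile_in_loop(lines):
    """Calls re.compile on every iteration, as in the mistake described above."""
    results = []
    for line in lines:
        pattern = re.compile(r"(\w+):\s*(\d+)")  # repeated work on each pass
        match = pattern.search(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


def extract_compile_once(lines):
    """Reuses the module-level compiled pattern for every line."""
    results = []
    for line in lines:
        match = KEY_NUMBER.search(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results
```

Both functions return the same results; only the amount of per-line work differs. How much time the change saves will depend on the pattern and the input, so the 10%-25% figure above should be read as specific to my script.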
Test system:
- Ubuntu Server 10.04 LTS
- Python 2.6.5 provided by the distribution
- Perl 5.10.1 provided by the distribution
- data resides on an ext4 filesystem on an LVM volume
- HP DL360 G7
- 2x Xeon X5672, 3200 MHz
- 24GB DDR3 RAM
- data resides on 6Gbit SAS RAID5
Methodology: I ran each script on the same 3,750 MB input file five times in succession. Each run was timed with the `time` builtin of the bash 4.1.5(1) included with Ubuntu 10.04, and stdout was redirected straight to /dev/null.
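For anyone who wants to repeat this kind of measurement without bash's `time`, the same procedure can be roughly sketched in Python. This is not what I ran; `time_command` is a hypothetical helper, and it records wall-clock time only, including process startup.

```python
import subprocess
import time


def time_command(cmd, trials=5):
    """Run cmd `trials` times back to back; return wall-clock seconds per run."""
    walltimes = []
    for _ in range(trials):
        start = time.perf_counter()
        # Discard stdout, mirroring the redirect to /dev/null.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        walltimes.append(time.perf_counter() - start)
    return walltimes
```

A call such as `time_command(["perl", "parse.pl", "input.txt"])` would return five wall times, where `parse.pl` and `input.txt` are placeholder names rather than the actual files.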
| Walltime | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
So even the cleaner code runs over 40% faster in Perl than in Python, which is not far off from the 50% slowdown I noted with my two crummier versions of the code. Furthermore, it seems easier for a relative novice like myself to write inefficient Python than inefficient Perl. Of course, it's also easier to write Perl code that doesn't do what you expect, and trying to understand someone else's code is a crapshoot.
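A note on reading percentages like these: the figure depends on which runtime is used as the baseline. The sketch below uses made-up walltimes (100 s and 60 s, not my measurements) and two hypothetical helper functions to show how the same pair of timings gives two different numbers.

```python
def percent_less_time(slow_s, fast_s):
    """Time saved by the faster run, as a percentage of the slower run."""
    return 100.0 * (slow_s - fast_s) / slow_s


def percent_more_time(slow_s, fast_s):
    """Extra time taken by the slower run, as a percentage of the faster run."""
    return 100.0 * (slow_s - fast_s) / fast_s


# Hypothetical walltimes in seconds, chosen only to make the arithmetic easy.
python_s = 100.0
perl_s = 60.0

print(round(percent_less_time(python_s, perl_s), 1))  # Perl needs 40.0% less time
print(round(percent_more_time(python_s, perl_s), 1))  # Python needs 66.7% more time
```

Comparing percentages across runs only makes sense when both are computed against the same baseline.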
Judging by what others have told me and what some commenters have pointed out, though, Python just isn't optimized for "practical extraction and reporting." Maybe someday I'll find a use for Python in my work.
In case the links to the scripts I used ever go bad, here they are on pastebin: