Monday, April 2, 2012

Revisiting Perl and Python's Speed

I was really surprised by the discussion generated by my previous post comparing the speed of Python and Perl.  Many people much wiser than I am posted valuable comments and suggestions, and two people were kind enough to post total rewrites of my routines which (to nobody's surprise) were much faster than the codes I wrote.

A few people (both here on the blog and through other discussion) raised legitimate points:
  1. My Python code was recompiling the regex every loop iteration because I was confused by how regex compilation and regex match objects work.  Fixing this problem alone increased speed by 10%-25%.
  2. The timings I posted were sub-second and someone suggested that startup overhead may have been hurting Python.  To address this, I used a more "real-life" input file that was 3,750 MB rather than the 8.588 MB input file I used earlier.
  3. The style of Perl I was using was archaic, and the style of Python I was using wasn't terribly Pythonic.  I live in a programming bubble; I learned both of these languages from their respective O'Reilly books and that's it.  I don't know anyone who knows either Perl or Python in real life, and I have never seen anyone else's code in either language.  But as it turns out, poorly written Perl and poorly written Python follow the same trends as well-written Perl and Python (see below).
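
The first point is the easiest one to illustrate. Here is a minimal sketch, assuming a made-up pattern and made-up input lines (neither comes from my actual scripts): the regex is compiled once before the loop, and the compiled object is reused on every iteration. It is written for Python 3, but the same structure should apply to the Python 2.6 used in the timings.

```python
import re

# Hypothetical pattern for illustration only; not the regex from the post.
FIELD_RE = re.compile(r"^(\w+)\s*=\s*(-?\d+(?:\.\d+)?)")


def extract_values(lines):
    """Return (key, value) pairs from lines that look like 'key = number'."""
    results = []
    for line in lines:
        # Reuse the precompiled pattern instead of calling re.compile() here.
        match = FIELD_RE.search(line)
        if match:
            results.append((match.group(1), float(match.group(2))))
    return results


# Made-up sample input.
sample = ["temp = 300.5", "# comment", "steps=1000"]
pairs = extract_values(sample)
print(pairs)
```

Python's `re` module keeps a cache of recently compiled patterns, so compiling inside the loop probably costs a lookup per iteration rather than a full recompilation; hoisting the compile out of the loop avoids even that.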
So as to be a little more scientific about this (since I am a scientist and all), here are my starting parameters:
  • Software
    • Ubuntu Server 10.04 LTS
    • Python 2.6.5 provided by the distribution
    • Perl 5.10.1 provided by the distribution
    • data resides on an ext4 filesystem on LVM
  • Hardware
    • HP DL360 G7
    • 2x Xeon X5672, 3200 MHz
    • 24GB DDR3 RAM
    • data resides on 6Gbit SAS RAID5
  • Codes

Methodology: I ran each code on the same 3750 MB input file five times in succession.  Each execution was timed using the `time` builtin provided by the bash 4.1.5(1) included with Ubuntu 10.04.  stdout was redirected straight to /dev/null.
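
For anyone who would rather script the repetition than retype the command, the same procedure can be approximated in Python. This is only a sketch and not what I actually did (I used bash's `time`): it assumes `subprocess` and `time.perf_counter` are an acceptable stand-in, and the command shown is a placeholder rather than one of the real scripts.

```python
import subprocess
import sys
import time


def time_runs(command, repeats=5):
    """Run command `repeats` times in sequence; return wall times in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard stdout, mirroring the redirect to /dev/null.
        subprocess.run(command, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


# Placeholder command so the sketch runs anywhere. Substitute the real
# interpreter, script, and input file (those names are not given here).
command = [sys.executable, "-c", "print('hello')"]
timings = time_runs(command)
print(["%.3f" % t for t in timings])
```

Note that this measures wall time from the parent process, so it includes the cost of spawning the child, much as the shell's `time` does.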

Code          Mean walltime (s)   Trial 1   Trial 2   Trial 3   Trial 4   Trial 5
Old Python    309.032             310.971   308.228   311.331   307.170   307.461
New Python    176.880             178.099   174.742   175.463   178.235   177.863
Old Perl      167.051             166.916   165.911   167.361   168.735   166.333
New Perl      126.860             125.913   124.709   130.125   127.809   125.746

So even the cleaner code runs about 40% faster in Perl than in Python (126.9 s versus 176.9 s on average, a ratio of 1.39), which is not far off from the 50% slowdown I noted with my two crummier versions of the code.  Furthermore, it seems easier for a relative novice like myself to write inefficient Python code than inefficient Perl code.  Of course, it's also easier to write Perl code that doesn't do what you expect, and trying to understand someone else's code is a crapshoot.
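
The arithmetic behind that comparison is easy to check. The sketch below uses the per-trial numbers from the table above, treats them as seconds, and recomputes the means and the Python-to-Perl ratios; the variable names are just for illustration.

```python
# Per-trial wall times (assumed to be seconds), copied from the table above.
trials = {
    "Old Python": [310.971, 308.228, 311.331, 307.170, 307.461],
    "New Python": [178.099, 174.742, 175.463, 178.235, 177.863],
    "Old Perl": [166.916, 165.911, 167.361, 168.735, 166.333],
    "New Perl": [125.913, 124.709, 130.125, 127.809, 125.746],
}

# Mean wall time per code version.
means = {name: sum(times) / len(times) for name, times in trials.items()}

for name, mean in means.items():
    print("%-10s mean = %.3f s" % (name, mean))

# How many times longer Python takes than Perl, for each pair of versions.
new_ratio = means["New Python"] / means["New Perl"]
old_ratio = means["Old Python"] / means["Old Perl"]
print("New Python takes %.2fx as long as new Perl" % new_ratio)
print("Old Python takes %.2fx as long as old Perl" % old_ratio)
```

The recomputed means match the walltime column, and the ratios come out to roughly 1.39 for the new codes and 1.85 for the old ones.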

Judging by what others have told me and what some comments have pointed out, though, Python just isn't optimized for "practical extraction and reporting."  Maybe someday I'll find a use for Python in my work.

In case the links to the codes I used ever go bad, here they are on pastebin:
I'd post the input files I used, but I don't have anywhere I can host 3.7 GB files.  If you're interested in the input data, let me know and I can send a private link.