Monday, March 26, 2012

Switching from Perl to Python: Speed

The job listings in scientific computing these days seem to show a mild preference for applicants with backgrounds in Python over Perl. Python has high-profile (or just highly visible?) packages like NumPy and MPI bindings for scientific computing, and some molecular dynamics packages (e.g., LAMMPS) include analysis routines written in Python. Although I've invested a few years into Perl, I've decided not to pigeonhole myself and to start picking up Python. After all, Perl is unintelligible after it's been written, and it's sometimes frustrating to deal with its odd quirks.

To this end, I reimplemented one of my most-used Perl analysis routines in Python. Here is my Perl version, written back in 2009:


And here is the Python version I cooked up today:


In the Python version, there are several ways to tear through a file, and I tried three of them. Method #1 is closest to the Perl functionality, where I can specify multiple input files on the command line and have all of them parsed sequentially. Method #2 is the method that the Python documentation seems to advocate the most. Method #3 loads the whole file contents into memory and works from there.
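The listing itself is not reproduced above, so as a rough illustration only, here is a minimal Python 3 sketch of what three such approaches commonly look like. The use of `fileinput` for Method #1, plain file iteration for Method #2, and `readlines()` for Method #3 are assumptions, as are the function names; each function simply counts lines so the three are easy to compare.

```python
import fileinput


def count_lines_fileinput(paths):
    """Method #1 (assumed): walk several files in sequence, much like Perl's <>.

    Note: if `paths` is empty, fileinput falls back to sys.argv[1:] and then stdin.
    """
    total = 0
    with fileinput.input(files=paths) as stream:
        for _line in stream:
            total += 1
    return total


def count_lines_iter(path):
    """Method #2 (assumed): iterate over the file object one line at a time."""
    total = 0
    with open(path) as handle:
        for _line in handle:
            total += 1
    return total


def count_lines_readlines(path):
    """Method #3 (assumed): load every line into memory first, then work on the list."""
    with open(path) as handle:
        lines = handle.readlines()
    return len(lines)
```

In a script, Method #1 would typically be called as `count_lines_fileinput(sys.argv[1:])` so that several input files can be given on the command line.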

Unfortunately, in all three cases, Python seems to be slower than Perl. Average execution times for a typical input file are:



Maybe there's something I'm missing in the Python version, but the Perl version isn't exactly a shining example of simplicity in itself. What gives here? For a language that's being venerated in the scientific computing world, in the case of basic text parsing of large files, it isn't shining. At best, it's almost 50% slower than Perl.

25 comments:

  1. Reblogged this on ppant's blog and commented:
    A real time comparison. Long live Perl.

  2. I guess the Perl version could be even faster. For example, using the 'o' regexp modifier on a long regular expression lets Perl avoid recompiling that regexp every time.

  3. As there are no interpolated variables in the regexp, Perl is smart enough to only compile it just once anyway. Splitting the line is fairly expensive though, so moving the "next if" one line up would make a difference. Then again, as only the first two fields are ever used, the split could be left out completely and replaced with something like
    unless(($first, $second) = $line =~ m/^(\d+)\s+([\d\w]+) .../)

  4. The syntax `&printargs()` has been obsolete in Perl for at least a decade.
    Did you try to read the whole file in Perl too? For instance, with `@lines = <>;`.
    Your `next unless ...` line should be the first one of the loop (it's more readable and marginally faster).
    Your Python code is a bit different in the function, so you should do the same in Perl (remove the second loop of the function and put `$type{$specie} = 0;` into the first loop).
    BTW, the execution times of your programs depend on the versions of perl and python you are using.

    If you want to see many small benchmarks of this kind, there's the "Computer Language Benchmarks Game". See http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=python3&lang2=perl but don't compare with C++, you'd feel it would be necessary to code with the fastest language, and then enter an endless world of pain ;-)

  5. In this particular case, the /o modifier isn't needed since the regexp is a fixed string. It is going to be compiled only once even without the modifier.

  6. I'm willing to try another take at Perl. Since you didn't provide a sample input, I can't compare the results. But I think you can make the Perl version a little bit shorter, clearer, and more modern as follows:


    #!/usr/bin/env perl

    use utf8;
    use strict;
    use warnings;

    my @show = qw/Siloxane SiO4 Si3O SiO3 SiO2 SiO1 NBO FreeOH H2O H3O SiOH SiOH2 Si2OH/;

    my %type;
    @type{@show} = (0) x @show;

    printf('ird ' . ' %8.8s' x @show . "\n", @show);

    my $isave;
    my $current;
    my $linefmt = '%-8.8s' . ' %8d' x @show . "\n";

    while (<>) {
        if (my ($id, $specie) = (/^\s*(\d+)\s+([\d\w]+)\s+\d+(?:\s+[\w\.]+){3}\s*$/)) {
            $current = $isave = $id unless defined $current;
            if ($id ne $current) {
                $current = $id;
                printf($linefmt, $isave++, @type{@show}) ;
                @type{@show} = (0) x @show;
            }
            ++$type{$specie};
        }
    }
    printf($linefmt, $isave, @type{@show}) if defined $current;

  7. I'm afraid I don't know too much about Python, but it looks like your Python regex isn't anchoring to the beginning of the string like the Perl one - this means the line can't be discarded very quickly. I suspect there is an impact from that.

  8. Peter,
    match() in Python is anchored to the start of the string, so a leading ^ isn't needed in the regexp.
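To make the anchoring point concrete, here is a small Python 3 sketch. The pattern and the sample line are made up for illustration and are not taken from the original script; the only claim is the standard behavior of `match()` versus `search()` in the `re` module.

```python
import re

# Hypothetical pattern: an integer id followed by a species name.
PATTERN = re.compile(r"(\d+)\s+(\w+)")

# Hypothetical input line with leading whitespace.
line = "  12 SiO4 3"

# match() only attempts the pattern at position 0, so the leading
# spaces cause an immediate failure without scanning the line.
anchored = PATTERN.match(line)

# search() tries successive start positions until the pattern fits.
scanned = PATTERN.search(line)

# Stripping leading whitespace first lets match() succeed.
stripped = PATTERN.match(line.lstrip())

print(anchored)
print(scanned.group(1), scanned.group(2))
print(stripped.group(1), stripped.group(2))
```

So a line that does not start with the expected field is rejected by `match()` right away, which is the same effect a leading `^` has in the Perl pattern.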

  9. Not sure how well this will be formatted, but here's a version that's slightly more Pythonic. Couldn't say on speed. Though with such small timings I'm guessing that you're using small files and a good portion of the wall clock time is spent initializing the VM. To get a better sense of the differences I'd use an input that took at least 30-60s before worrying too much about the timings.

    Anyway, here it is:


    #!/usr/bin/env python2

    import re
    import sys


    SHOW = """
    Siloxane SiO4 Si3O SiO3
    SiO2 SiO1 NBO FreeOH
    H2O H3O SiOH SiOH2 Si2OH
    """.split()


    def printargs(counts, isave):
        sys.stdout.write("%-8s" % isave)
        for s in SHOW:
            print "%8d" % counts[s],
            counts[s] = 0
        sys.stdout.write("\n")


    def main():
        sys.stdout.write("%-8s" % "ird")
        counts = {}
        for s in SHOW:
            counts[s] = 0
            sys.stdout.write("%8s" % s)
        sys.stdout.write("\n")

        isave = 0
        current = 0

        RE_LINE = re.compile(r"""
            ^
            (\d+)
            \s+
            ([\d\w]+)
            \s+
            \d+
            \s+
            [\w\.]+
            \s+
            [\w\.]+
            \s+
            [\w\.]+
            \s*
            $
            """, re.VERBOSE)

        with open("coord.out") as handle:
            for line in handle:
                line = line.lstrip(" \t\r")
                match = RE_LINE.match(line)
                if not match:
                    continue

                specie = match.group(2)
                icur = int(match.group(1))

                if current == 0:
                    current = icur
                    isave = current
                elif current != icur:
                    printargs(counts, isave)
                    current = icur
                    isave += 1

                if specie in SHOW:
                    counts[specie] += 1

        printargs(counts, isave)


    if __name__ == '__main__':
        main()

  10. Definitely borked the formatting so here's a Gist:

    https://gist.github.com/2260142

  11. Python's print is slow; you can try concatenating the strings before printing them out:

    def printargs( counts, isave ):
        out = ["%-8s" % isave]
        for s in show:
            out.append("%8d" % counts[s])
            counts[s] = 0
        print "".join(out)

  12. The truth is that Python is much slower than Perl, especially if you concatenate strings or use regular expressions.

    One of my first Python tests was to write a "buzz bang" program. Buzz Bang is a child's counting game. You simply count from one upwards. The trick is that any number that has a three in it, or is divisible by three, you say "bang", and any number that has a seven in it, or is divisible by seven, you say "buzz". If a number has both aspects, you say "buzz bang". The programs simply outputted "1, 2, bang, 4, 5, bang, buzz, 8, bang, 10, 11, bang, bang, buzz, bang..."

    I wrote a Perl version and a Python version, and the Perl version simply blew the Python version away. My Perl version was about three times faster. I sat there and rewrote and tweaked my Python version over and over. I found out that Python is very, very bad at string concatenation, so I had to get rid of all the places where I was concatenating strings. Python is very bad at regular expressions, so I switched from regular expressions to using the find and index string methods. That sped my program up quite a bit, but the Perl version was still faster.

    In the end, I spent a lot of time optimizing my Python code, and while optimizing it, my Python code became harder to read. I asked some PyHeads about it, and they told me that 1). Perl isn't faster, and 2). It doesn't matter anyway that Perl is faster because machines are faster, and 3). did they mention that Python rulz and Perl sux!

    Python does have some nice things, but it's not the lack of curly braces which makes cutting and pasting code almost impossible, or the lack of semicolons that makes it difficult to split overly long lines. It has features like sets and the ability to use arrays of arrays or hashes of arrays without resorting to references. If Python is more readable, it's because Python handles references for you and saves you a lot of ugly referencing and dereferencing stuff.

    However, to my surprise, despite its reputation as a true object oriented language, Python doesn't have private members in classes. In fact, its class/method structure is very similar to how you'd do it in Perl. Don't let anyone tell you that Python is a true object oriented language while Perl isn't.

    I also miss the strict and warnings pragmas that almost everyone in Perl uses. There is simply no Python equivalent. Basically, if you never set a variable, it can't be used. However, you can't declare a variable, and variables defined inside loops still exist outside the loop. Perl can pick up these errors, and Python can't.

    I'm learning Python, but mainly for the same reason I would learn any language: It is part of my job. It's nice, but it's not the panacea that PyHeads claim it is.
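For readers unfamiliar with the game, the Buzz Bang rules described in this comment can be sketched in Python 3 as follows. This is not the commenter's program (which was not posted); the function names are hypothetical, and the digit checks use plain substring tests rather than regular expressions.

```python
def buzz_bang(n):
    """Return the word(s) spoken for n under the rules described in the comment."""
    digits = str(n)
    words = []
    # "buzz": divisible by seven, or contains the digit 7
    if n % 7 == 0 or "7" in digits:
        words.append("buzz")
    # "bang": divisible by three, or contains the digit 3
    if n % 3 == 0 or "3" in digits:
        words.append("bang")
    # A number with both properties yields "buzz bang"; otherwise say the number.
    return " ".join(words) if words else digits


def sequence(limit):
    """Build the comma-separated sequence for 1..limit with a single join."""
    return ", ".join(buzz_bang(n) for n in range(1, limit + 1))


print(sequence(15))
# 1, 2, bang, 4, 5, bang, buzz, 8, bang, 10, 11, bang, bang, buzz, bang
```

The printed line matches the sequence quoted in the comment for the first fifteen numbers.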

  13. Although Perl came first, and despite their differences, Perl and Python have both influenced each other in many ways.

    "I don't really know much about Python. I only stole its object system for Perl 5. I have since repented." —Larry Wall

  14. [...] Silica in Silico Computations, glass, and research Skip to content HomeAbout ← Switching from Perl to Python: Speed [...]

  15. "After all, Perl is unintelligible after it’s been written"

    I'm not trying to be rude, but perhaps your Perl is unintelligible after it's written because you're not writing clear and clean code? The first thing I would suggest is to take a few days and read up on Modern Perl concepts. Outside of single-run throw-away code, there hasn't been a valid reason to write perl without "use strict;" and "use warnings;" in over a decade. Your use of the old &function() syntax also suggests that you've either been writing Perl a little bit for a long time, and haven't kept up with advancements, or you learned perl from outdated documentation or teachings.

    I would strongly urge you to pick up a copy of "Effective Perl", 2nd Edition. It's a great book for someone who has some decent Perl experience, but wants to write better Perl. It won't teach you Perl, but it'll make you a better Perl programmer.

    As for the speed specifically, Perl is usually among the fastest of the commonly used Scripting languages (Perl, Python, Ruby). In just about every benchmark test I've seen, Perl generally beats Python, especially with regex processing, and both tend to spank Ruby pretty hard. If you aren't writing truly time critical code, or processing monstrous amounts of data, it's usually not a big deal (if performance were *that* big of a deal, you'd likely either be writing C/C++, or you'd write some inline C/C++ in the performance critical parts to speed it up).

  16. I was more just joking; I've often heard "Perl is a write-only language," and it's very easy for novices (not unlike myself) to make horrible code that works.

    You're right about my lack of clear and clean code; the routine I showed in this post was actually written when I was still just starting Perl. My code nowadays is cleaner, but the &subroutine() calls are how subroutines are introduced in Learning Perl by Schwartz and therefore how I learned to do it. Working in a programming bubble as I do, I just never knew any different, so I appreciate the comments you (and others) have posted about good practices.

  17. You're definitely correct about Perl's functionality for new users. I know I was amazed at how much I could get done with it when I first started, even though I barely knew what I was doing.

    I've never actually read Learning Perl, although I've heard lots of good things about it. I know they just released a new edition about 6 months ago, so depending on when you read it, there's a decent chance it was an edition (or two) behind the current one. No surprise if some of the style choices were a little dated. Perl has been going through something of a renaissance in the past few years, with the "Modern Perl" movement, and that's led to a lot of review of best practices and cleanups. Unfortunately, with Perl having been around as long as it has, and used by as many people as have used it, there's a *lot* of outdated and obsolete documentation and code running around.

    If you do most of your programming by yourself, or with a small close group of people (in a bubble, as you say), there's an excellent chance you won't get exposed to a lot of the newer ideas and changes happening with Perl. If you are interested in updating your Perl knowledge, the book I mentioned, "Effective Perl" (http://www.effectiveperlprogramming.com/) is probably the best place to go. Another good resource though, is chromatic's "Modern Perl" (http://onyxneon.com/books/modern_perl/) book. The great thing about it is that he's made it available online for free viewing.

    Good luck!

  18. You said you cooked it up in one sitting? That's the speed of python. ;)

  19. There are a number of issues with the python implementation, the worst being the "if show.count(specie) > 0:"

    This is a slow O(N) linear search. Changing it to "if specie in show" as someone else commented makes it more readable but does not remove the slow search.

    The if statement should be removed entirely to match the Perl code, or at least be changed to "if specie in counts", which will be an O(1) hash lookup.

    Using the re module properly will speed things up more too, but removing the linear search will have the biggest impact.
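The difference between the two membership tests can be sketched in Python 3 as below. The species names are the ones listed in the Perl rewrite earlier in the thread; the function names and the sample input are hypothetical. With only 13 names the gain per line may be modest, but the test runs once for every matching line of the input file.

```python
# Species names as listed in the Perl rewrite earlier in the thread.
SHOW = ["Siloxane", "SiO4", "Si3O", "SiO3", "SiO2", "SiO1", "NBO",
        "FreeOH", "H2O", "H3O", "SiOH", "SiOH2", "Si2OH"]


def tally_linear(species_seen):
    """Membership test by scanning the list: O(N) work per species."""
    counts = dict.fromkeys(SHOW, 0)
    for specie in species_seen:
        if SHOW.count(specie) > 0:   # walks the whole list every time
            counts[specie] += 1
    return counts


def tally_hashed(species_seen):
    """Membership test against the dict keys: average O(1) per species."""
    counts = dict.fromkeys(SHOW, 0)
    for specie in species_seen:
        if specie in counts:         # hash lookup, no scan
            counts[specie] += 1
    return counts


# Hypothetical sample: "Unknown" is not a tracked species and is skipped.
sample = ["H2O", "SiO4", "H2O", "Unknown"]
print(tally_hashed(sample)["H2O"])
```

Both functions return the same counts; only the cost of the `if` test differs.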

  20. Regular expression parsing is fast in Perl. Storing and accessing data in variables is very slow. When the main part of your test is a regexp, Perl will win. Write code to do mathematical computation, or to move values in and out of arrays (eg re-implement a sorting algorithm); I think the results will come out differently.

  21. [...] http://silicainsilico.wordpress.com/2012/03/26/switching-from-perl-to-python-speed/ [...]

  22. Python tends to be faster when the same code is run after the first time, because compiling slows down the first execution. And Perl compiles into a Parrot bytecode format, which is register-based, or translates into machine language on execution. Python uses a slower stack-based system (great for functional languages), which may change in the future; Python is implementable on the Parrot virtual machine to attain the same speed. The speed is due to the virtual machine used and not the language.
    Use Python's functional capabilities to increase its speed.

  23. Another problem with your code is how it is written. A more efficient way would be to use a list, add strings to it inside the loop, and then, when execution of the loop ends, convert the list into a string.

    (Python)
    word = "go"
    for num in range(count):
        word += " go"
    # Each pass through this loop creates a new string and disposes of the other two objects.
    word_list = []
    for num in range(count):
        word_list.append("go")
    "".join(word_list)
    # Each pass creates a new object and appends it to the list, but disposes of only one object.
    # The creation and destruction of objects slows down the code.
    # An even faster method is this list comprehension; only one object is destroyed or disposed of.
    "".join(["go" for num in range(count)])
    # All the objects created here are added to the list directly, and when the loop ends, the list is
    # converted into a string; only the initial empty string "" is destroyed.

  24. Is the data mostly matching your regex? If it isn't, you can speed things up by using a string search method before applying the re. For instance, if your file was 50% commented-out lines starting with '#', then doing:

    for line in contents:
        if line.startswith('#'): continue
        match = re.match(RE_LINE, line)
        if not match: continue

    would speed it up, but again only if a good percentage of the lines in the file are not data records.

  25. With regexps Perl is still way faster, especially compared to Python 3. And it's not only faster: for me its 'inline' syntax is much neater, and the Perl regexp engine does a great deal of optimisation for you. The same is true of every string function, print, etc. In Perl these just do what you need, and do it really well.
    On the other hand, PyPy gives Python code a great speedup, but the difference still remains with regexps. If your code is not heavy on regexps and string manipulation, PyPy may be your Python friend.
    However, it is important to note that it all depends on your context. With old-school OO Perl and some optimisation modules you can write very fast code (in the scripting-language niche), though it would be worth benchmarking this against PyPy. Anyway, as you go into modern OO Perl (Moose and MooseX::Declare), you seriously pay the toll. (No, it is not true that you pay it only at startup.)
    If you really need OO and complex object creation, I'd consider using Python over Perl with Moose. My experience is that it is more straightforward and easier to refactor, even when you are a less-versed Python programmer.
    But I think the most important part of the story is that the interface to other languages, formats and libraries matters more: Python has interfaces to most of the tools and frameworks I am interested in, and has a lot of algorithmic libraries coded in C/C++, while Perl may be missing a number of them.
