Switching from Perl to Python: Speed

Job listings in scientific computing these days seem to show a mild preference for applicants with backgrounds in Python over Perl. Python has high-profile (or just highly visible?) scientific packages like NumPy and MPI bindings, and some molecular dynamics packages (e.g., LAMMPS) ship analysis routines written in Python. Although I've invested a few years in Perl, I've decided not to pigeonhole myself and to start picking up Python. After all, Perl tends to be unintelligible once it's been written, and its odd quirks can be frustrating to deal with.

To this end, I reimplemented one of my most-used Perl analysis routines in Python. Here is my Perl version, written back in 2009:

#!/usr/bin/perl
# Species to report, in column order.
@show = qw/ Siloxane SiO4 Si3O SiO3 SiO2 SiO1 NBO FreeOH H2O H3O SiOH SiOH2 Si2OH /;

# Print the header row.
printf("\n%-8.8s ", "ird");
foreach $specie ( @show )
{
    printf("%8.8s ", $specie);
}
print "\n";

$current = 0;
$isave = 0;
while ( $line = <> )
{
    chomp($line);
    $line =~ s/^\s+//g;
    @arg = split(/\s+/, $line);
    # Skip anything that isn't a six-column data record.
    next unless $line =~ m/^\d+\s+[\d\w]+\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$/o;
    if ( $current == 0 )
    {
        $current = $arg[0];
        $isave = $current;
    }
    # A new value in the first column means the previous block is done.
    if ( $arg[0] != $current )
    {
        &printargs();
        $current = $arg[0];
        $isave++;
    }
    $type{$arg[1]}++;
}
&printargs();

# Print the counts for one block, then reset them.
sub printargs( )
{
    printf("%-8s ", $isave);
    foreach $specie ( @show )
    {
        printf("%8d ", $type{$specie});
    }
    print "\n";
    foreach $i ( keys(%type) )
    {
        $type{$i} = 0;
    }
}

And here is the Python version I cooked up today:

#!/usr/bin/env python2
import fileinput
import re

# Species to report, in column order.
show = [ "Siloxane", "SiO4", "Si3O", "SiO3",
         "SiO2", "SiO1", "NBO", "FreeOH",
         "H2O", "H3O", "SiOH", "SiOH2", "Si2OH" ]

# Print the counts for one block, then reset them.
def printargs( counts, isave ):
    print "%-8s" % isave,
    for s in show:
        print "%8d" % counts[s],
        counts[s] = 0
    print "\n",

# Print the header row and zero the counters.
print "%-8s" % "ird",
counts = {}
for s in show:
    counts[s] = 0
    print "%8s" % s,
print "\n",

isave = 0
current = 0
RE_LINE = \
    re.compile(r'\s*(\d+)\s+([\d\w]+)\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$')

for line in fileinput.input():                    # Method #1
# for line in file('coord.out'):                  # Method #2
#contents = file('coord.out').readlines()         # Method #3
#for line in contents:
    match = re.match(RE_LINE, line)
    if not match: continue
    specie = match.group(2)
    icur = int(match.group(1))
    if current == 0:
        current = icur
        isave = current
    elif current != icur:
        # A new value in the first column means the previous block is done.
        printargs(counts, isave)
        current = icur
        isave += 1
    if show.count(specie) > 0:
        counts[specie] += 1
printargs(counts, isave)

In Python there are several ways to tear through a file, and I tried three of them (see the commented-out lines in the script above). Method #1 is closest to the Perl behavior: I can name multiple input files on the command line and have them all parsed sequentially. Method #2 is the one the Python documentation seems to advocate the most. Method #3 loads the whole file's contents into memory and works from there.
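
Stripped of the analysis logic, the three variants look like this (just a sketch; 'coord.out' is the sample filename from the commented-out lines, not anything special):

#!/usr/bin/env python2
import fileinput

# Method #1: fileinput iterates over every file named on the command
# line (falling back to stdin), much like Perl's while (<>) loop.
for line in fileinput.input():
    pass

# Method #2: iterate over the file object itself, one line at a time.
for line in open("coord.out"):
    pass

# Method #3: slurp the whole file into a list of lines up front.
contents = open("coord.out").readlines()
for line in contents:
    pass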

Unfortunately, in all three cases, Python seems to be slower than Perl. Average execution times for a typical input file are:

(Table: average execution times for the Perl script and the three Python methods.)

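If you want to reproduce this kind of comparison, a minimal harness along these lines does the job; the script and input filenames here are assumptions standing in for your own copies, not the exact setup behind my numbers:

#!/usr/bin/env python2
# Hypothetical timing harness: run each script several times over the
# same input file and report the average wall-clock time.
import subprocess
import time

CMDS = [
    ("perl",   ["perl",    "analyzecoord.pl", "coord.out"]),
    ("python", ["python2", "analyzecoord.py", "coord.out"]),
]

NRUNS = 5
devnull = open("/dev/null", "w")
for name, cmd in CMDS:
    elapsed = []
    for _ in range(NRUNS):
        t0 = time.time()
        subprocess.call(cmd, stdout=devnull)   # discard the output
        elapsed.append(time.time() - t0)
    print "%-8s %.3f s (average of %d runs)" % \
        (name, sum(elapsed) / len(elapsed), NRUNS)
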
Maybe there's something I'm missing in the Python version, but the Perl version isn't exactly a shining example of simplicity itself. What gives here? For a language so venerated in the scientific computing world, Python doesn't exactly shine at basic text parsing of large files: at best, it's almost 50% slower than Perl.
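
Two micro-optimizations I may try next (pure speculation on my part, not benchmarked): call the compiled pattern's own match() method instead of routing through the module-level re.match(), which does a cache lookup on every call, and test membership against the counts dict instead of scanning the show list with show.count(). A sketch of just the hot loop, with the per-block printing omitted for brevity:

#!/usr/bin/env python2
import fileinput
import re

RE_LINE = re.compile(
    r'\s*(\d+)\s+([\d\w]+)\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$')

show = [ "Siloxane", "SiO4", "Si3O", "SiO3", "SiO2", "SiO1", "NBO",
         "FreeOH", "H2O", "H3O", "SiOH", "SiOH2", "Si2OH" ]
counts = dict((s, 0) for s in show)

for line in fileinput.input():
    # The compiled pattern's match() skips re.match()'s per-call
    # cache lookup.
    match = RE_LINE.match(line)
    if not match:
        continue
    specie = match.group(2)
    # O(1) dict membership instead of an O(n) list scan.
    if specie in counts:
        counts[specie] += 1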