http://www.dmst.aueb.gr/dds/pubs/trade/1997-login-Perl/html/shperl.html
This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:

The document's metadata is available in BibTeX format.

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Diomidis Spinellis Publications

A Conversation about Perl and the Shell: Choosing the Implementation Vehicle

Diomidis D. Spinellis and Rob Kolstad
<dds@senanet.com> <kolstad@bsdi.com>

People often need to decide whether to code the solution to a given task in Perl or as a shell script. Some factors that we consider are influenced by the answers to the questions below.

Can the task be easily mapped to a pipeline or sequence of Unix commands?

DDS: If this is the case then a shell script containing that pipeline will often yield an efficient solution in minimal time.

RBK: Of course, that requires fluency with each of the tools you'll be using. Even in the area of regular expressions, syntaxes and capabilities vary slightly among the various tools. I've pretty much revered to perl for just about everything that isn't a straightline set of running commands or simple pipes.

DDS: On the other hand, filter-type pipeline sequences can be succinctly expressed as a shell script; perl may require non-intuitive nested invocations of the grep function or the use of complicated control structures. Many problems can be easily solved with a sequence like:

  	grep .... |		# Find the data we are interested in
  	sed ... |		# Convert it to the right format
  	awk ... |		# Calculate something
  	sort ... |		# Sort output
  	pr			# And paginate
In my mind the expression of this sequence in perl as sequential operations on records accumulated into arrays (for sorting) is not as natural as the shell counterpart.

If the resulting solution will affect system performance does Perl's relatively large footprint justify its deployment?

DDS: For small jobs that can be programmed as a couple of shell commands I consider the launch of the Perl interpreter an overkill. This may or may not be a rational view depending on your system. On my system where the default shell is bash the executables of /bin/sh and /usr/bin/perl are of the same size order (299K and 410K respectively) and the resident set size of Perl is actually smaller than that of bash (552K versus 624K). Considering however, that on many systems the default shell could be a more frugal implementation than that of bash and that the shell's text (code) portion will almost certainly exist in memory at the time the script is executed, a small shell script will load faster and consume less resources than a similar Perl script.

RBK: My system's architecture pushes me in the other direction. Even with 32MB of RAM, perl should be sitting around in memory (along with the shells and the simpler tools like sed, grep, and sort). Amusingly enough, on my system perl requires 250KB of RAM while bash requires 1.35MB! Even if one swapped them both, that's less than half a second or so of real time if you're running on a modern UNIX-style system. Once the good parts are paged in, presumably the launch time would approach zero.

DDS: I aggree with RBK's point and plea for frugal tool implementations. Just because a feature can be added does not mean that is should be.

How will the program spend its execution time?

DDS: The decisive factor in this case is the ability to move most processing from the interpreted language to compiled C code. Most shell scripts execute a number of other Unix commands such as sort, awk, sed, and test. If the majority of the work is done by another specialized Unix program and the same work must be expressed in Perl code (i.e., it can not be performed by calling a single built-in Perl function) then the shell script could well execute faster than the Perl script. On the other hand, if the shell script will perform most of the work by executing one Unix command after the other and interpreting shell commands, then the cost of the repeated launching of the Unix commands coupled with the comparably higher cost of the shell interpretation will result in a shell script that will execute slower than its Perl cousin.

RBK: I no longer worry about the speed limitations of interpreted code vs. fast compiled code. It is a rare day that I translate programs into C for speed. Pentium 90s approach 100 MIPS; P-200s approach 200 MIPS. That's plenty of speed for 99% of what I do. I must admit that our bookkeeping software is finally starting to run a bit slow for my tastes (20 seconds to process 10,000 orders), but I'll upgrade the processor before re-writing the software. I can get 2x there and that's plenty.

I used to hate the overhead of lots of forks but even that has improved with the latest round of fast CPUs. I don't sweat that as much any more.

DDS: Having a fork (command execution) at the inner loop of a large data set processing program will yield to pathologicaly slow implementation. If find it difficult to persuade myself that this overhead is irrelevant. A quick experiment (grep implemented as a naive perl loop or shell script) on my system shows that the perl script is an order of magnitude slower than grep (1.37s vs 0.09s) and that the naive shell loop implementation taking 92.77s is more than three orders of magnitude slower than grep. According to Moore's law in six years (less than the age of the perl language) the processor technology provides the speedup needed for the perl script to match grep's current performance. On the other hand the shell script will match grep's current performance in 15 years (interestingly this is still less than the age of Unix).

Do the data volume and its processing requirements preclude the use of Perl?

DDS: If the operation to be performed on the data is more complicated than what can be handled by a number of passes over its elements and its volume substantially larger than that of the system's swap space then Perl may not be suitable for the job. Perl is structured to perform complicated data operations by processing in-core elements and therefore often requires the data to be loaded in memory. Unix-tool-based processing on the other hand can easily deal with large data sets by using tools such as sort, fgrep, uniq, and paste to sequentially process huge data files.

RBK: The point of reading-all-into-RAM is well taken. This approach ultimately can not scale well if your data is growing (e.g., order management systems). On the other hand, I've been able to structure all my programs so that they don't keep everything in memory unless it's extremely advantageous for me to do so. I agree that the best way in perl to implement pipelines that are easy structured with standard Unix toolset commands is by using the `system' directive to call them.

DDS: It is certainly possible to keep perl's data off memory and still keeping the implementation simple by using database-mapped associative arrays (dbm, gdbm etc.). Furthermore, binding perl to a relational database management system completely redefines perl's applicability and performance over very large data sets. At that point however the implementation domain has been transformed into an RDBMS which just happens to be programmed in perl. <2>How complex is the data and its relationships? DDS: Shell programs together with standard Unix tools can at its best deal with relational (join) or ordering (tsort) operations on sequential record lists. Dealing with more complicated data structures such as graphs can become a nightmare for even the most dedicated shell programmer. Version 5 of Perl, on the other hand, provides references (pointers) and therefore the possibility to create data structures of arbitrary complexity. Looking up data and dealing with unstructured text or binary data is also more easily performed in Perl than using Unix commands in a shell script. If the data lookups are to be performed on orderly structured line-based records and can be collected and executed in a batch then using a shell script together with one of fgrep, comm, uniq or join could fit the bill. If this is not the case (consider for example converting references to hypertext links in a LaTeX document) then the associative arrays and advanced regular expressions of Perl are the tool for the job.

RBK: I'm particularly enamored of the control structures available in perl. I think they enable programmers to keep fewer `contexts' in their mind and concentrate on procedural programming just as they would in C or C++ (or whatever).

Will the resulting script be setuid to a privileged user?

DDS: Setuid shell scripts are notoriously insecure. Some kernel implementations execute setuid shell scripts in a way that makes it possible for any user to gain the privileges of the script's owner. Even on systems that implement setuid shell scripts in a secure manner the plethora of the shell's lexical substitutions and dependencies on environment variables make it difficult to write a secure shell script. Perl excels in this area by providing a mechanism for securely executing setuid Perl scripts even on systems that do not support secure setuid script execution. In addition, Perl's variable tainting facility provides a security harness which identifies many possible security problems before they occur.

What type of portability is required?

DDS: A conservatively-written shell script will run unmodified on all Unix systems. Perl scripts will run on a wider variety of systems (e.g., MS-DOS, Windows, and VMS), but as Perl is not usually shipped as part of the standard operating system software the user is required to install (and possibly even compile) Perl in order to run the Perl script. For complicated system programs the tradeoffs may be different. Apart from commands that are nonsensical in a given environment (e.g., getpwent under Windows NT) Perl offers a consistent language platform across all its implementations. Unfortunately the same is not the case with shell scripts across different versions of Unix where subtle differences between shell implementations and wildly different command arguments can cause unexpected problems when the script is run on an obscure Unix flavor. Keep also in mind that references, object orientation, and readable regular expressions are only available in Perl version 5.

RBK: I don't think I agree that you can (in a reasonable amount of time) write a shell script conservatively enough to run unmodified on all Unix systems (though you can do it for `a large number'). Too many file locations, command locations, and other system dependencies (also known as `vendor enhancements') make this difficult. Perl on the other hand, has just one version and just one support organization. I like that.

How large and complicated is the resulting program likely to be?

DDS: Perl offers a number of language features which can be used to organize a large and complicated program. Blocks and local variables can be used to structure the code, packages provide encapsulation and information hiding, and the object-oriented features allow for intuitive solution modeling. On the other hand, although the shell offers only very modest structuring tools (subroutines and two types of variable visibility), a large system can be decomposed into a number of self-contained shell scripts that can be individually tested and debugged from the shell command line. This form of incremental and explorative bottom-up development and testing is the norm in shell programming, but not easily practiced when programming in Perl where one usually writes a monolithic program and depends on traditional techniques (stubs, test harnesses, and the debugger) for testing and debugging.

RBK: I rarely use Perl 5. As before, the control structures and simple data structures that I use in perl provide me with more than 90% of what I need. I do have some scripts that grew over time to be overly complex and ultimately needed a rewrite. I think that's generic, though.

DDS: When dealing with code that other people have written I usually find perl program's easier to understand than corresponding shell script.

A Real Example

DDS: In order to illustrate the different programming styles between the two languages DDS is including as an example a program written to solve a real problem: finding duplicate files in a directory hierarchy. Initially written as a shell script, the correct number of escape backslashes in a deeply nested sed command finally caused a rewrite in Perl. After the Perl program was finished, the shell script was completed in order to test the performance of both implementations.

The programs are run with a single argument which gives a directory. Both programs provide as output a list of file pairs that contain exactly the same data. I needed such a tool in order to clean up a set of 19000 files occupying 500MB as a first step for further processing. A naive solution to this problem involves comparing every file against all others. This is an O(n!) algorithm; obviously unusable for most practical purposes. A more efficient approach would be to sort the files by their content (an O(n log n) process) and then print duplicate files from the list. Unfortunately sorting files by their content can still be expensive since every file comparison involves opening, reading, and closing the file; in my case about 850000 system calls. In addition, sorting file names by file content can not be easily performed with standard Unix tools. I decided to minimize file accesses by operating on the checksums of the file data instead of the files themselves. Checksums are integer numbers and can be easily sorted and manipulated. Identical files will have identical checksums, the opposite however is not true. For this reason after finding all files with identical checksums a final step compares those files byte by byte and prints only files that actually match.

The shell program (illustrated bellow) creates a list of files in the directory (find), converts that list into a list of checksums (xargs sum) and sorts them to bring identical checksums together (sort). This is then saved (tee) into a temporary file for further lookups. The invocation of uniq prints the first entry of each identical checksum file sequence which feeds data into the loop for determining the matched files. The loop contains two nested inner loops: the first one loops over all possibly identical files (e.g., a1, a2, a3, a4), the second one loops over the file coming after the first loop file up the the last one (e.g., first a2, a3, a4, then a3, a4, then a4). The body of the inner loop simply compares the files and prints their names if they are equal.

     #!/bin/sh
     #
     # List duplicate files
     #
     TMP=/tmp/dup.$$
     trap "rm -f $TMP ; exit 1" 1 2 3 15
     # Look for candidates (having the same checksum)
     find $1 -type f -print |
     xargs sum |
     awk '{print $3, $1}' |
     sort +1 |
     tee $TMP |
     uniq -d -1 |
     while read NAME SUM
     do
          # List each file from the candidate set
          fgrep $SUM $TMP |
          awk '{print $1}' |
          while read FIRST
          do
               # List and process the remaining files of the set
               fgrep $SUM $TMP |
               sed -n '\*^'$FIRST'*,$p' |
               sed 1d |
               while read SECOND SUM2
               do
                    if cmp -s $FIRST $SECOND
                    then
                         echo $FIRST $SECOND
                    fi
               done
          done
     done
     rm -f $TMP
The Perl script (illustrated below) operates on exactly the same principles. It first calls the find library function which in turn calls process for every entry in the specified directory. Process determines whether the entry is in fact a file and calculates its checksum using the unpack function. It then pushes the filename into the appropriate entry of the files associative array. The files associative array is indexed by the checksum number. Every entry contains a list of files that have the given checksum. After all files have been processed a loop goes through all elements of the associative arrays and two inner loops compare (as in the shell script) the files in pairs. Identical files are printed.
     #!/usr/bin/perl
     #
     # List duplicate files
     #
     use File::Find;
     undef $/;
     # Call process for every found entry
     find(\&process, $ARGV[0]);
     # For every plain file found add its filename under its checksum hash
     sub process
     {
          return unless -f;
          open(IN, $_) || die;
          $sum = unpack("%32C*", <IN>) % 32767;
          close(IN);
          push @{$files{$sum}}, $File::Find::name;
     }
     # Go through all checksums comparing files that have the same sum
     foreach $files (values %files) {
          next if ($#$files == 0);
          for ($i = 0; $i <= $#$files; $i++) {
               for ($j = $i + 1; $j <= $#$files; $j++) {
                    print "$$files[$i] $$files[$j]\n"
                         if (same($$files[$i], $$files[$j]));
               }
          }
     }
     # Return true if two files are identical
     sub same
     {
          local($a, $b) = @_;
          open(A, $a) || die;
          open(B, $b) || die;
          return <A> eq <B>;
     }
With the shell script coded in 36 lines and the Perl script in 42 both scripts are approximately the same size. With three nested pipelines feeding a loop and a temporary file for storing intermediate results the shell script appears to be stretching the limits of shell programming. The Perl script executed in about half the time of the shell script (810 user and 816 system seconds for Perl versus 1672 user and 1842 system seconds for the shell script); a notable, but not dramatic performance difference. On the other hand at the peak of its execution the shell script was running 10 processes with a combined resident set size (RSS) of 4892K (probably a lot less since four and two of them shared the same text image) whereas the Perl script had an RSS of 8064K. It is also important to note that the RSS of Perl was constantly rising as entries accumulated into the associative array whereas the RSS of the shell script remained almost constant over its execution. This means that the shell script can handle much larger set sizes without running out of swap space as the Perl script might (in the October 1996 issue of ;login: Andrew Hume presented the listing of a directory containing 80GB of data). In this example both implementations are reasonable approaches for solving the problem; I hope that the example illustrated the different coding styles and tradeoffs between the two scripting languages and that the initial paragraphs provided you with some insights on the issues that are important when deciding the implementation vehicle for your problem at hand.

RBK: I coincidentally encountered this same problem when I was trying to reduce the number of typefaces I have stored on disk. Many are identical. I wrote a program to store checksums in a `SUM' file in each directory (new directories appear from time to time) and another program to traverse those directories and SUM files to see which were identical. The Perl programs differed slightly from DDS's, not enough for it to be worthwhile to show them here.

References

The following two papers provide comparisons between Perl and other languages:

[1] David G. Korn. Ksh - An extensible high level language. USENIX Very High Level Languages Symposium Proceedings, pp. 129-146. Santa Fe, October 1994.

[2] Theodore H. Romer et al., The structure and performance of interpreters. ACM Sigplan Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 50-159. Cambridge MA, October 1996.