Working with Unix Tools

A successful [software] tool is one that was used to do something undreamed of by its author.

— Stephen C. Johnson

Diomidis Spinellis

Line-oriented textual data streams are the lowest useful common denominator for a lot of the data that passes through our hands. Such streams can represent program source code, web server log data, version control history, file lists, symbol tables, archive contents, error messages, profiling data, and so on. For many routine, everyday tasks, we might be tempted to process the data using a Swiss Army knife scripting language, like Perl, Python, or Ruby. However, to do that we often need to write a small, self-contained program and save it into a file. By that point we’ve lost interest in the task, and end up doing the work manually, if at all. Often, a more effective approach is to combine programs of the Unix toolchest into a short and sweet pipeline that we can run from our shell’s command prompt. With the command-line editing facilities of a modern shell we can build our command bit by bit, until it molds into exactly the form that suits us. Nowadays, the original Unix tools are available on many different systems, like GNU/Linux, Mac OS X, and Microsoft Windows, so there’s no reason why you shouldn’t add this approach to your arsenal.

Many one-liners that you’ll build around the Unix tools follow a pattern that goes roughly like this: fetching, selection, processing, and summarization. You’ll also need to apply some plumbing to join these parts into a whole. Jump in to get a quick tour of the facilities.

Getting the data

Most of the time your data will be text that you can directly feed to the standard input of a tool. If this is not the case, you need to adapt your data. If you are dealing with object files, you’ll have to use a command like nm (Unix), dumpbin (Windows), or javap (Java) to dig into them. If you’re working with files grouped into an archive, then a command like tar, jar, or ar will list the archive’s contents for you. If your data comes from a (potentially large) collection of files, find can locate those that interest you. On the other hand, to get your data over the web, use wget. You can also use dd (and the special file /dev/zero), yes, or jot to generate artificial data, perhaps for running a quick benchmark. Finally, if you want to process a compiler’s list of error messages, you’ll want to redirect its standard error to its standard output; the incantation 2>&1 will do this trick.
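As a minimal sketch of two of these ideas, the commands below generate artificial data with dd and capture a tool's error stream with 2>&1. The inner sh -c command and its message are hypothetical stand-ins for a compiler; they are not part of the original text.

```shell
# Generate 4 KiB of zero bytes with dd. dd reports its statistics on
# standard error, so those are discarded to keep the data stream clean.
byte_count=$(dd if=/dev/zero bs=1024 count=4 2>/dev/null | wc -c | tr -d ' ')
echo "bytes generated: $byte_count"

# Stand-in for a compiler: a command that writes one message to standard
# error (hypothetical file name and message). 2>&1 merges standard error
# into standard output so the pipeline can process it.
error_lines=$(sh -c 'echo "demo.c:3: error: hypothetical message" >&2' 2>&1 |
  grep -c 'error')
echo "error lines captured: $error_lines"
```

Without the 2>&1, the message would bypass the pipe and appear on the terminal, and grep would count nothing.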

There are many other cases I’ve not covered here: relational databases, version control systems, mail clients, office applications, and so on. Always keep in mind that you’re unlikely to be the first one who needs the application’s data converted into a textual format; therefore someone has probably already written a tool for that job. For example, my Outwit tool suite (http://www.spinellis.gr/sw/outwit) can convert data coming from the Windows clipboard, an ODBC source, the event log, or the registry into a text stream.

Selection

Given the generality of the textual data format, in most cases you’ll have on your hands more data than you require. You might want to process only some parts of each row, or only a subset of the rows. To select a specific column from a line consisting of elements separated by spaces or another field delimiter, use awk with a single print $n command. If your fields are of fixed width, then you can separate them using cut. And, if your lines are not neatly separated into fields, you can often write a regular expression for a sed substitute command to isolate the element you want.
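The three column-selection techniques can be sketched on a single line of input. The log record below is hypothetical sample data, chosen only so that each tool has something to extract.

```shell
# Hypothetical sample record.
record='2024-01-15 ERROR disk usage at 91% on /dev/sda1'

# awk: print the second whitespace-separated field.
level=$(printf '%s\n' "$record" | awk '{ print $2 }')

# cut: extract a fixed-width column (characters 1 through 10).
day=$(printf '%s\n' "$record" | cut -c1-10)

# sed: when the element is not a clean field, isolate it with a
# substitution that keeps only the digits in front of the percent sign.
percent=$(printf '%s\n' "$record" | sed 's/.* \([0-9]*\)%.*/\1/')

echo "$level $day $percent"
```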

The workhorse for obtaining a subset of the rows is grep. Specify a regular expression to get only the rows that match it, and add the -v flag to filter out rows you don’t want to process. Use fgrep (nowadays also spelled grep -F) with the -f flag if the elements you’re looking for are fixed and stored in a file (perhaps generated in a previous processing step). If your selection criteria are more complex, you can often express them in an awk pattern expression. Many times you’ll find yourself combining a number of these approaches to obtain the result that you want. For example, you might use grep to get the lines that interest you, grep -v to filter out some noise from your sample, and finally awk to select a specific field from each line.
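The grep, grep -v, and awk combination might look like the sketch below. The log contents and the pattern file are hypothetical, and grep -F is used as the current spelling of fgrep.

```shell
# Hypothetical log lines written to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
10:01 ERROR db timeout
10:02 INFO  db connected
10:03 ERROR cache miss (debug)
10:04 ERROR disk full
EOF

# Keep the ERROR rows, drop the noisy debug ones, then print field 3.
components=$(grep ERROR "$log" | grep -v debug | awk '{ print $3 }')
echo "$components"

# Fixed strings stored in a file: -f reads one pattern per line and
# -F treats each of them as a literal string, not a regular expression.
wanted=$(mktemp)
printf '%s\n' 'disk full' 'db connected' > "$wanted"
matches=$(grep -F -f "$wanted" "$log" | wc -l | tr -d ' ')
echo "rows matching a fixed string: $matches"

rm -f "$log" "$wanted"
```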

Processing

You’ll find that data processing frequently involves sorting your lines on a specific field. The sort command supports dozens of options for specifying the sort keys, their type, and the output order. Once your results are sorted, you’ll often want to count how many instances of each element you have. The uniq command with the -c option will do the job here; often you’ll post-process the result with another sort, this time with the -n flag specifying a numerical order, to find out which elements appear most frequently. In other cases you might want to compare results between different runs. You can use diff if the two runs generate results that should be the same (perhaps the output of a regression test), or comm if you want to compare two sorted lists. You’ll handle more complex tasks using, again, awk.
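A small sketch of the sort, uniq -c, sort -n idiom and of comm follows. The fruit names are hypothetical sample data.

```shell
# Count how often each element appears and report the most frequent one.
# uniq -c only merges adjacent duplicates, hence the first sort.
top=$(printf '%s\n' pear apple pear fig apple pear |
  sort | uniq -c | sort -rn | awk 'NR == 1 { print $2, $1 }')
echo "most frequent: $top"

# Compare two sorted lists with comm.
first=$(mktemp)
second=$(mktemp)
printf '%s\n' apple fig pear > "$first"
printf '%s\n' fig kiwi pear > "$second"

common=$(comm -12 "$first" "$second")       # lines present in both lists
only_first=$(comm -23 "$first" "$second")   # lines unique to the first list
echo "common: $common"
echo "only in first: $only_first"

rm -f "$first" "$second"
```

The -12 and -23 arguments suppress the comm output columns that are not wanted, leaving the intersection and the first-list-only lines respectively.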

Summarizing

In many cases the processed data is too voluminous to be of use. For example, you might not care which symbols are defined with the wrong visibility in your program, but you might want to know how many there are. Surprisingly, many problems involve simply counting the output of the processing step using the humble wc (word count) command and its -l flag. If you want to know the top or bottom 10 elements of your result list, then you can pass your list through head or tail. To format a long list of words into a more manageable block that you can then paste into a program, use fmt (perhaps run after a sed substitution command tacks a comma after each element). Also, for debugging purposes you might initially pipe the result of intermediate stages through more or less, to examine it in detail. As usual, use awk when these approaches don’t suit you; a typical task involves summing up a specific field with a command like sum += $3.
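The summarizing steps can be sketched as follows. The three-column table (file name, line count, defect count) is hypothetical sample data.

```shell
# Hypothetical results of an earlier processing step.
data=$(mktemp)
cat > "$data" <<'EOF'
main.c 120 3
util.c 80 5
io.c 45 2
EOF

# How many results are there?
count=$(wc -l < "$data" | tr -d ' ')

# Sum the third field with awk.
total=$(awk '{ sum += $3 } END { print sum }' "$data")

# Top element by the second field: numeric, reversed sort, then head.
largest=$(sort -k2,2nr "$data" | head -n 1 | awk '{ print $1 }')

# Tack a comma after each name and let fmt fill the words into lines.
word_block=$(awk '{ print $1 }' "$data" | sed 's/$/,/' | fmt)

echo "$count results, $total defects, largest file: $largest"
echo "$word_block"

rm -f "$data"
```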

Plumbing

All the wonderful building blocks we’ve described are useless without some way to glue them together. For this you’ll use the Bourne shell’s facilities. First and foremost comes the pipeline (|), which allows you to send the output of one processing step as input to the next one. In other cases you might want to execute the same command with many different arguments. For this you’ll pass the arguments as input to xargs. A typical pattern involves obtaining a list of files using find, and processing them using xargs. So common is this pattern that, in order to handle files with embedded spaces in their names, both commands support an argument (-print0 and -0) to have their data terminated with a null character, instead of whitespace. If your processing is more complex, you can always pipe the arguments into a while read loop (amazingly, the Bourne shell allows you to pipe data into and from all its control structures). When everything else fails, don’t shy away from using a couple of intermediate files to juggle your data.
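Both plumbing patterns, find with xargs and a pipe into a while read loop, are sketched below on a hypothetical temporary directory, one of whose files has a space in its name.

```shell
# Hypothetical files, one with an embedded space in its name.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/plain.txt"
printf 'three\n' > "$dir/with space.txt"

# Null-terminated names (-print0 and -0) survive the embedded space.
total_lines=$(find "$dir" -type f -name '*.txt' -print0 |
  xargs -0 cat | wc -l | tr -d ' ')
echo "total lines: $total_lines"

# Pipe the file list into a while read loop for per-file processing.
# IFS= and -r keep each line intact; names containing newlines would
# still need the null-terminated form shown above.
report=$(find "$dir" -type f -name '*.txt' | sort |
  while IFS= read -r file; do
    lines=$(wc -l < "$file" | tr -d ' ')
    printf '%s %s\n' "$lines" "${file##*/}"
  done)
echo "$report"

rm -rf "$dir"
```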

Putting it all together

The following command will examine all Java files located in the directory src, and print the ten files with the highest number of occurrences of a method call to substring.

find src -name '*.java' -print |
xargs fgrep -c .substring |
sort -t: -rn -k2 |
head -10

The pipeline sequence will first use find to locate all the Java files, and apply fgrep to them, counting (-c) the occurrences of .substring (strictly speaking, the number of lines in each file that contain it). Then, sort will order the results in reverse numerical order (-rn) according to the second field (-k2) using : as the separator (-t:), and head will print the top ten.
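To try the pipeline without a real source tree, the sketch below runs a variant of it on a hypothetical temporary directory. The variant is an assumption of mine rather than the command above: it uses grep -F in place of fgrep, the null-terminated -print0 and -0 pair, and a final sed that merely strips the temporary directory prefix from the output.

```shell
# Hypothetical source tree with three small Java files.
src=$(mktemp -d)
mkdir -p "$src/a"
printf 'x = s.substring(1);\ny = t.substring(2);\n' > "$src/a/Two.java"
printf 'z = s.substring(3);\n' > "$src/One.java"
printf 'no calls here\n' > "$src/None.java"

# grep -c prints path:count for each file; sort on the count field.
ranking=$(find "$src" -name '*.java' -print0 |
  xargs -0 grep -F -c .substring |
  sort -t: -rn -k2 |
  head -n 10 |
  sed "s|^$src/||")
echo "$ranking"

rm -rf "$src"
```

Files without any match still appear in the list, with a count of zero, because grep -c reports every file it examines.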

Appalled? Confused? Disheartened? Don’t worry. It took me four iterations and two manual lookups to get the above command exactly right, but it was still a lot faster than counting by hand, or writing a program to do the counting. Every time you concoct a pipeline you become a little better at it, and, before you know it, you’ll become the hero of your group: the one who knows the commands that can do magic.

Diomidis Spinellis is an associate professor in the Department of Management Science and Technology at the Athens University of Economics and Business and the author of Code Reading: The Open Source Perspective (Addison-Wesley, 2003). Contact him at dds@aueb.gr.