This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
10 Tips for Spotting Low-Quality Open Source Code
By Diomidis Spinellis
From the vast fields of Open Source Software (OSS) projects, we can gather both wheat and chaff. Distinguishing between the two can yield a significant edge in using OSS: We can avoid false starts and back champions with long-term prospects of success. For large, highly visible projects, we can use as our guide public reviews, comments in forums and blogs, and the opinion of colleagues who have used them. However, when we need to choose a project from the remarkably long trail of specialized OSS code that fits every conceivable need, these resources simply may not be available. In such cases, OSS lets us browse the underlying source code and reach our own opinion on what we’re dealing with.
Here’s a list of 10 clues for spotting bad code:
Writing good comments is more difficult than writing good code, so identifying code that has poor commenting is child’s play. If you see non-trivial functions or methods lacking a comment at their beginning, explaining what they do, you know you’re in trouble. The same goes for global variables, class and structure fields, and code blocks implementing a complex algorithm: All should appear with a comment. Not everything requires a comment; straightforward code and many local variables are better left to explain themselves.
Some comments are useless because they simply repeat what’s obvious from the name of the entity they explain. A particular thorn here are elaborate javadoc comments that some sophisticated editors automatically create and some unsophisticated programmers never fill-in. Such comments consume valuable screen real estate and distract the code’s reader without contributing something to the program’s understanding.
Excessive commenting is also a problem. As programmers maintain the code, they often forget to update the comments (especially when there are many detailed comments and they’re unsure about what they’re doing), and this causes the comments to diverge from reality. So, unless a program is written in assembly language, a comment on every line of code can be just as troublesome as too few comments.
Problems with the naming of the program’s entities are also easy to spot. One area is inconsistency. When you see in the same program CamelCaseIdentifiers mixed with underscore_separated_names, you know you’re treading on the remains of a battle between different personalities. Similarly, if you find a wanderGet and a setWander method, you know the programmer who wrote the code probably had his mind somewhere else at the time.
When choosing names, moderation should be the norm. A long, overly descriptive name used for a local index variable that should be named “i” is just as bad as a short, cryptic global variable name.
Another sign of bad code is inappropriate use of spaces for indenting the code and separating the program’s elements. There are many different ways for indenting the code to indicate its structure and no excuses for not using them, for inventing a new one, or for combining two in the same file. Any of those three sins should alert you that the code is sick and the problems could go deeper than the surface.
The same applies to use of spaces for forming statements and separating operands from operators in expressions. Novice programmers often fail to notice specific spacing rules for programs, so watch out for those spaces.
You also can spot bad code in calls to library functions that fail to an error return. In C, code errors are often passed from the Application Program Interface (API) to the program in-band (together with the function’s result). A special value, such as NULL or a negative integer, is used to indicate the result isn’t an entity the caller would expect to be returned, but an indication of an error. If the program fails to check the return value of, say, open or malloc, for NULL, then you can be almost certain there will be situations where the program will crash. In the Java and Microsoft .NET platforms, errors are typically communicated out of band through exceptions. In these cases, dodging an error situation requires explicit effort: an empty exception handler.
Handling errors is difficult because the actions you can take after the error occurs rarely rectify the error. So programmers must check one possible error after another merely to report it and exit gracefully. Programs that don’t check for API errors are often untrustworthy.
Copying code from one place to another is easy to do, but finding that code after many years and many programmers have left their signs on the copies is rarely that straightforward. We typically encounter the situation of repeated code sequences when we fix an error or introduce a change, then look in the program for further places we should adjust (this is always a good practice). Sometimes, we locate another subtly different instance of the code we modified, then another, and another. These code segments violate the “Don’t Repeat Yourself” (DRY) and “single point control” principles.
Although repetition is rarely glaringly apparent, the trouble it causes goes deeper than the surface. Repetition indicates a failure to use appropriate abstraction mechanisms such as methods, functions, subclasses, and generic types. So the code is longer than what is desirable, and will contain more bugs. Moreover, repeated instances of code hinder maintenance because they force you to manually track and modify each separate repeating fragment.
Despite what competitions such as the Obfuscated C Code Contest might let one believe, writing complex, undecipherable code isn’t a sign of programmer prowess. We can’t clearly express all algorithms. However, deliberately obtuse code is an insult to developers who need to maintain it in the future. When you see code that seems too complex, try to simplify it, starting with expression or statement-level refactoring operations. If you succeed, you’ll know the code’s quality isn’t all that great, and you’ll leave the code better than you found it.
The mother of unclear and complex code is code that expresses a complicated design. Here, look for mechanisms such as inheritance, interfaces, and generic types that are used with no apparent gains. Design patterns also are sometimes misused in a similar manner.
There’s a difference between a complex design that serves a purpose and one that simply shows off designer expertise or guards against vague, unspecified requirements that might appear in the future. The agile programming community has taught us that paying upfront the cost of future-proof gold plating is a disastrous waste of programmer productivity. The design becomes difficult to navigate today while the rewards may or may not be reaped in the future. So, when you encounter complex designs that fit this description, don’t feel intimidated; simply lament the fact that somebody’s paranoiac design ideas will cause you some everyday pain.
At the opposite end of complex designs, you may find programs that use too few abstractions, hiding several assumptions deep inside the code. Two chief and easily recognizable culprits in this area are fixed buffer sizes and magic numbers. Although nowadays, C++, Java, and .NET come with a feature-rich container framework, one can still find code that declares a fixed-size array instead of using the platform’s vector container, which implements an expandable array of objects. The number of elements the array will hold is hidden within the code, and the reasons for choosing that particular number are often similarly hidden in the programmer’s head. This code will crash when it faces more elements than those originally envisioned, and, sometimes, it may even expose the program to a buffer overflow attack.
Numbers appearing out of the blue don’t only occur in array declarations. They come in many shapes and sizes, and they’re aptly termed magic numbers. We can’t avoid them, because we do need, for example, to specify that Transmission Control Protocol (TCP) packets are identified by the number 6, and User Datagram Protocol (UDP) packets by the number 17. However, directly embedding these numbers in the code makes the code difficult to understand and maintain.
Using an array instead of a vector may be bad, but implementing a resizable collection from scratch is worse. Be wary when you find code implementing functionality that duplicates facilities available in the standard libraries. Although standard libraries are typically mature, stress-tested, and well-thought-out, homegrown code duplicating their functionality is anything but. When you find a complete XML parser or a specially crafted version of a sort function in a program, you’ll have to spend effort to understand its interface, verify its functionality, and ensure its performance is adequate. You’ll endure all that because somebody wanted to re-invent the wheel or didn’t bother to read the platform’s documentation.
Unit tests are becoming standard, but don’t assume that any code lacking unit tests is deficient. Other methods of testing also are acceptable, especially for legacy code. For example, the liberal use of assertions in the code, or the provision of a framework for regression testing, can often provide a solid testing infrastructure. Also, programs implemented as filters or programs providing a scriptable interface also can be easy to test through an external harness. On the other hand, when it’s obvious that no one ever considered how the code will be tested, and the code defies reasonable attempts to test it, then you have a problem.
Many of these bad code traits may seem trivial, and many are. Some can even be superficially corrected with a code formatting tool such as indent. However, these traits are simply danger signs. If those who wrote the code weren’t motivated enough or didn’t pay enough attention to keep the simple things in order, you can be sure that many worse issues may be lurking.
Having identified substandard code in a project we’d like to use, we need to take a step back and evaluate the situation. Is there perhaps an alternative project, written to a higher standard, that we can adopt? If not, we have to establish some risk mitigation measures: plan for more development time, expect bugs to emerge, and anticipate that small changes may require significant rework. Ideally, we also should arrange for an opportunity to give the code a facelift to fix the worst-designed elements and correct style problems. If possible, we should coordinate our code improvement work with the project’s managers, contributing our share to the community from which we benefited.
Diomidis Spinellis is an associate professor in the Department of Management Science and Technology at the Athens University of Economics and Business. He’s the author of two open source perspective books: Code Reading (Software Development Productivity Award 2004), and the recent Code Quality (Addison-Wesley, 2006). He’s currently heading the research project “Source Quality Observatory for Open Source Software (SQO-OSS)”, which is partially funded by the European Community’s Sixth Framework Programme.
Editor’s Note: Some of this material appeared on informit.com in “The Bad Code Spotter's Guide.”