Software Development: Defensive Programming

I talked a bit about defensive programming in my February Post. I also intended to add it as one of my fallacies in my article at Code Project, so I did a bit of research on what others have written about it. I was very surprised at some inaccuracies I found, especially in the Wikipedia article.

Note that this post is not about how to use defensive programming. That has been covered in depth (for example in Code Complete by Steve McConnell). I am just attempting to more accurately describe what it is, why it was invented, and problems with it.

What Is It?

Defensive programming, or defensive coding, is a style of writing computer software that attempts to be more resilient in the event of unexpected behavior. This unexpected behavior is generally considered to be a result of existing bugs in the software, but could be due to other problems such as corrupted data, hardware failures, or even bugs introduced by later software changes. Generally, the code tries to do the most sensible thing with little or no performance penalty and without adding new error-conditions.

History

The first time I ever encountered the term "defensive programming" was in K&R (The C Programming Language, 1st Edition, by Kernighan and Ritchie). After extensive searching I can find no earlier references to the term. The term probably derives from the term "defensive driving" which came into common use in the early 1970's a few years before K&R was written.

It is mentioned twice in the index of K&R. On page 53 it clearly refers to making code resilient to bugs, but on page 56 it talks about writing code in a way that reduces the likelihood of future code changes introducing bugs. In any case many books since have used the term "defensive programming" to mean making the code resilient in the presence of bugs, for example The Pragmatic Programmer by Andrew Hunt and Dave Thomas (which talks about "coding defensively" in the chapter entitled "Pragmatic Paranoia"), and others. Even before that many software professionals, myself included, have used the term in this way since at least the mid-1980's.

Disagreement About Definition

Despite the term having been fairly clearly understood for more than 20 years, its exact definition has recently become blurred as several (generally non-peer-reviewed) articles and blogs have appeared on various web sites. For example, the current Wikipedia article, and several sites that quote it, make "defensive programming" sound like an approach to error-handling. Error-handling can be related to defensive programming, but they are definitely not the same thing, and one is not a subset of the other (see below).

Another well-regarded and often-referenced article, entitled simply Defensive Programming, has a very high ranking on Code Project. This is an excellent and worthy article in its own right, but it is not just about defensive programming. By its own admission it is about "... techniques useful in catching programming errors ...". As we will see below, defensive programming has the opposite effect: it tends to hide errors, not catch them. The article discusses many things and would be more accurately titled something like "Good Coding Practices".

Error Handling vs Defensive Programming

The distinction between error-handling and defensive programming is not very clear in the minds of many programmers. I will explain the difference.

Error-handling detects and handles situations where something goes wrong that you know is possible, however unlikely. In contrast, defensive programming attempts to cater for problems that are assumed to be "impossible". There are two problems with this distinction that can cause confusion.

The first problem is that it can depend on circumstances whether something is impossible or not. For example, if a function is private to a module or program you may be able to ensure that it is always passed valid arguments; but if it is part of a public library you cannot be certain that it will never be passed bad data. In the first case you can program defensively to ensure that the function does something sensible even though you know it is "impossible" that this will happen. In the latter case you might add error-handling in case bad data is passed to the function.

So whether you choose to program defensively or add explicit error-handling depends on the scope of the software that you control. I discuss this further below under Scope.
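As a rough illustration of that split, here is a minimal sketch. The function names (day_name, day_name_internal) and the "???" placeholder are hypothetical, invented only for this example. The private helper quietly copes with an "impossible" argument, while the public entry point treats the same bad argument as an error and reports it.

```c
#include <stddef.h>

static const char *const DAY_NAMES[7] = {
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};

/* Hypothetical private helper: callers inside this module always pass
   0..6, so an out-of-range value is "impossible". Defensive programming:
   fall back to a harmless placeholder instead of indexing past the array. */
static const char *day_name_internal(int day)
{
    if (day < 0 || day > 6)
        return "???";
    return DAY_NAMES[day];
}

/* Hypothetical public entry point: callers are outside our control, so
   bad input is an expected error. Error-handling: report it through the
   return value (0 = success, -1 = invalid argument). */
int day_name(int day, const char **name_out)
{
    if (name_out == NULL || day < 0 || day > 6)
        return -1;
    *name_out = day_name_internal(day);
    return 0;
}
```

Note that only the public function gives the caller anything to check. The private one produces no error value at all, which is exactly what makes it defensive rather than error-handling.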

The second problem is that there can be borderline cases where it is debatable whether something should be considered impossible. Consider this spectrum of scenarios for a hypothetical program that can be given invalid data:

1. The program accepts data directly from the user and the user may enter invalid data.
2. The program accepts data from a text file that has been typed in by a person.
3. The program accepts data from an XML file (machine or manually generated).
4. The program reads a binary data file which was created by another program.
5. The program reads a binary file that was written by itself.
6. The program reads a binary file that includes a CRC to check it has not been corrupted.
7. The program reads a temporary binary file that it only created moments before.
8. The program reads a memory mapped file that it created.
9. The program reads from a local variable (i.e., in memory) that it just wrote to.

At what point does invalid data become "impossible"? Personally, I would say that it is "impossible" that a data file has become corrupted and still generates the same CRC (see scenario 6). However, if security of the data is important you have to consider that the file was deliberately tampered with (in which case a cryptographic checksum such as SHA1 should be used).
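To make the CRC scenario concrete, here is a small sketch of checking a payload against a CRC stored with it. It assumes the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by zip and PNG). The function names crc32_of and payload_is_intact are made up for this example, and the file layout is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320).
   Slow but short; a table-driven version is the usual optimization. */
uint32_t crc32_of(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: returns 1 if the payload matches the CRC that was
   stored alongside it, 0 if the data (or the stored CRC) is corrupted. */
int payload_is_intact(const unsigned char *data, size_t len,
                      uint32_t stored_crc)
{
    return crc32_of(data, len) == stored_crc;
}
```

A mismatch here is something the program knows can happen, so it belongs to error-handling. Data that is corrupted yet still passes the check is the case that gets treated as "impossible".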

However, I know that a lot of software assumes that binary data files are always valid (scenario 4 or 5). Much software will behave erratically if binary data files have become corrupted.

Of course, I think anyone would agree that you have to assume that the value of a local variable you just wrote to (see scenario 9) cannot change. However, even in that case a hardware error, deliberate tampering, or some other problem could change memory unexpectedly.

So it is not always clear when you need to have explicit error-handling code and when you should simply program defensively.

Example

The archetypal example of defensive programming occurs in just about every C program ever written, where the terminating condition of a loop is written as a "less than" test ( < ) rather than a "not equal" test ( != ). For example, a typical loop is written like this:

size_t len = strlen(str);
for (i = 0; i < len; ++i)
    result += evaluate(str[i]);

rather than this:

size_t len = strlen(str);
for (i = 0; i != len; ++i)
    result += evaluate(str[i]);

Clearly both of these should do exactly the same thing since the variable 'i' is only ever incremented and can never skip having the same value as 'len'. So then why are loop termination conditions always written in the first manner?

First, the consequences of the "impossible" condition are bad: in production software it would probably cause an infinite loop or a memory access violation. The "impossible" condition may occur for any number of reasons such as:

bad hardware or a stray gamma ray photon means that one of the bits of 'i' is flipped randomly
another errant process (in a system without hardware memory protection) or thread changes memory that does not belong to it
bad supervisor level code (ie, the operating system or a device driver) changes memory
the 'evaluate' function has a rogue pointer that changes the value of 'i'
the 'evaluate' function corrupts the stack frame pointer and the location of 'i' is now at some random place on the stack
later code changes introduce bugs, for example:

for (i = 0; i != len; ++i)
{
    while (!isprint(str[i])) // bad code change means that 'i' may never be equal to 'len'
        ++i;
    result += evaluate(str[i]);
}

Of course, the last few, caused by bugs in the software, are the most common, which is why defensive programming is usually associated with protecting against bugs.
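The difference between the two conditions can be simulated safely. In the sketch below (the function names and the 'cap' parameter are inventions for this example), the index advances by 'step' on each pass to mimic a bug that makes 'i' jump over 'len'. The cap is there only so that the runaway version stops instead of hanging or reading out of bounds.

```c
#include <stddef.h>

/* Number of passes made by a loop that ends on "i < len".
   'step' must be greater than zero. */
size_t passes_less_than(size_t len, size_t step, size_t cap)
{
    size_t passes = 0;
    for (size_t i = 0; i < len && passes < cap; i += step)
        ++passes;
    return passes;
}

/* Number of passes made by a loop that ends on "i != len".
   If 'i' skips over 'len', only the safety cap stops it. */
size_t passes_not_equal(size_t len, size_t step, size_t cap)
{
    size_t passes = 0;
    for (size_t i = 0; i != len && passes < cap; i += step)
        ++passes;
    return passes;
}
```

With len = 5 and step = 1 both functions make five passes. With step = 2 the index takes the values 0, 2, 4, 6, ... so the "less than" loop stops after three passes, while the "not equal" loop never sees 5 and runs until the cap.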

Culture of C

There are also two other aspects of the C language that affect how and when defensive programming is used - namely the emphasis on efficiency and the approach to error handling.

Looking at efficiency first -- it is one of the fundamental premises of C that it assumes the programmer knows what they are doing. The language does not protect from possible mistakes, as other languages try to do. For example, it is easy to write past the end of an array in C - but if all array access had bounds checking applied (by the compiler) then it would run more slowly even for perfectly safe code.

Due to this emphasis on efficiency, defensive programming is only used when it has little or no performance penalty. This is typified in the above example since a "less than" operation is normally just as fast as a "not equal" one.

The other aspect is the approach to error-handling in C. Errors in C are generally handled by using error return values. It is not unusual for C code to be dominated by error-handling, so error-conditions are ignored if they are considered unlikely to occur - eg, nobody ever checks the error return value from printf(). (In fact, error return values are often ignored when they should not be, but that is for another discussion.)

So, if "unlikely" errors are not generally handled, it makes no sense for "impossible" conditions to be handled as errors, since this would add to the existing error-handling burden. (This is covered in more detail in item 10 of my Code Project article: http://www.codeproject.com/Articles/357065/Ten-Fallacies-of-Good-C-Code) Of course, in languages with exception handling, many such "impossible" conditions can be easily handled by throwing a "software exception".

Scope

A lot of the confusion about defensive programming comes about because the scope of control is not always clearly defined. For example, if you have a function that takes a string (const char *) parameter you may want to assume that you are never passed a NULL pointer if it never makes sense to do so. If it is a private function you may be able to always ensure it; but if its use is outside the scope of your control then you can't assume that unless you clearly document that a NULL pointer may not be used.

In any case even if you consider the condition to be impossible it is wise to allow for the possibility using defensive programming. Many functions do this by simply returning if unexpectedly passed a NULL pointer. (Again, note that this is different to error-handling since no error value is generated.)
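A minimal sketch of that pattern follows. The function count_spaces is hypothetical and stands in for any function whose documentation forbids a NULL argument.

```c
#include <stddef.h>

/* Counts the spaces in a string. Passing NULL is documented as invalid,
   but the function still returns quietly with a harmless result instead
   of dereferencing the pointer. No error value is produced. */
size_t count_spaces(const char *str)
{
    size_t count = 0;

    if (str == NULL)
        return 0;   /* "impossible" case: do the least harmful thing */

    for (; *str != '\0'; ++str)
        if (*str == ' ')
            ++count;
    return count;
}
```

The caller cannot tell a NULL argument from a string with no spaces. That is the point of the technique, and also its weakness.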

So any discussion of defensive programming must clearly define the scope of the code being considered. This is one problem with the Wikipedia article on defensive programming.

Symptoms

When using buggy software the symptoms of defensive programming are seen often (but may be dismissed as operator error). I think everyone has at some time seen software that did something a little strange, like flash a window, ignore a command, or even display a message about an "unknown error". Usually this is the result of a bug that put the software into a state from which it then attempted to recover.

This recovery can sometimes be successful but usually results in the program limping along. In the worst case it can silently cause massive problems like data loss or corruption. (After seeing something like this, I generally save my data and restart the software to ensure it is not in some weird state.)

Problems with Defensive Programming

By now it must be pretty clear that defensive programming has a major problem. It hides the presence of bugs.

Some people may think it is good to hide bugs. Certainly, for released software in use, you don't want to force the user to deal with a problem that they do not understand. On the other hand, blindly continuing when something may be broken can be dangerous. Also, some attempt should be made to notify someone of the problem - at least write an error message to a log file.

What is worse, though, is that defensive coding has been known to hide bugs during development and testing. Nobody can argue that this is a good thing. The alternative is to use what has been called "offensive programming" and sometimes "fail fast". This means to make sure someone knows about problems rather than hiding them.

I do use defensive programming so that unexpected or impossible situations are handled in release builds, but I add assertions that check for the impossible situations so that bugs do not sneak through. I also do most testing using the debug build (so that assertions are active), except for final acceptance testing. For some critical things I also explicitly add error-handling code, since assertions are removed in release builds.
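Combining the two might look something like the sketch below. It relies on the standard assert() macro, which is compiled out when NDEBUG is defined. The function digit_to_char and its '?' fallback are hypothetical, chosen only to keep the example short.

```c
#include <assert.h>

/* Converts a single decimal digit (0..9) to its character. */
char digit_to_char(int digit)
{
    /* Debug builds: stop immediately so the bug is noticed and fixed. */
    assert(digit >= 0 && digit <= 9);

    /* Release builds (NDEBUG defined, so the assert above is removed):
       carry on with a placeholder rather than a garbage character. */
    if (digit < 0 || digit > 9)
        return '?';

    return (char)('0' + digit);
}
```

In a debug build a call such as digit_to_char(12) aborts at the assertion. In a release build the same call returns the placeholder and the program limps on.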

Standard C Library

Here are two more examples of how defensive programming is used, taken from the standard C library.

A nasty problem that occurs in far too many C programs is caused by buffer overruns. This mostly happens when copying or building a string and the size of the output buffer is exceeded. In the name of defensive programming it is recommended to use string functions that take a buffer length (strncpy(), strncat(), snprintf(), etc). This avoids the buffer overrun, but hides the (possible) problem that the string was truncated.
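The truncation does not have to stay hidden. As a sketch (copy_string_checked is a hypothetical wrapper, not a library function), the return value of snprintf() can be used to notice that the output did not fit, because snprintf() returns the length the full string would have needed.

```c
#include <stddef.h>
#include <stdio.h>

/* Copies src into dst, which has room for dst_size bytes. Never overruns
   dst and always leaves it terminated when dst_size > 0.
   Returns 0 if the whole string fitted, -1 if it was truncated
   (or if nothing could be written at all). */
int copy_string_checked(char *dst, size_t dst_size, const char *src)
{
    int needed;

    if (dst_size == 0)
        return -1;

    needed = snprintf(dst, dst_size, "%s", src);
    if (needed < 0 || (size_t)needed >= dst_size)
        return -1;      /* output was cut short (or formatting failed) */
    return 0;
}
```

Copying "abcdef" into a four-byte buffer leaves "abc" in the buffer, as the defensive version would, but the -1 return value gives the caller a chance to treat the truncation as an error.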

Reports often require data nicely formatted into columns. This is usually achieved in C using the minimum field width of printf() format-specifiers. For example, to print numbers in a column five characters wide you would use the "%5d" specifier. If the integer is too big for the field then C just prints the extra characters anyway even though this will ruin your columns. (Contrast this with other languages like Fortran where field overflow results in silent truncation of numbers, which has caused some very nasty problems.) This is an example of defensive programming since when presented with an unexpected situation the code tries to do something sensible.
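The field-width behavior is easy to see with snprintf(), which formats exactly as printf() does. The helper format_column below is a hypothetical name used only for this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats value into buf using a minimum field width of five.
   Returns the number of characters the formatted field occupies. */
int format_column(char *buf, size_t buf_size, int value)
{
    return snprintf(buf, buf_size, "%5d", value);
}
```

Formatting 42 yields the five characters "   42" (padded on the left with spaces). Formatting 1234567 yields all seven digits, so the column is wider than intended but no digits are lost.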

Exercise

Finally, here is something for you to think about. The standard C library includes a function called atoi() that takes a string of digits and returns an integer.

If you are not familiar with atoi(), it does not return any error code but stops when it encounters the first unexpected character. As an example atoi("two") just returns zero.
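To see what that means for the caller, here is a small sketch. The helper atoi_cannot_distinguish is made up for this illustration and simply compares the results of two atoi() calls.

```c
#include <stdlib.h>

/* Returns 1 if atoi() gives the same answer for both strings,
   i.e. the caller cannot tell the two inputs apart. */
int atoi_cannot_distinguish(const char *a, const char *b)
{
    return atoi(a) == atoi(b);
}
```

For instance, "two" and "0" both produce zero, and "12abc" and "12" both produce twelve.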

Is the behavior of atoi() an example of defensive programming? Why?

How could it be improved?


Wednesday 16 May 2012
